Spotify Exploratory Dataset Analysis

Summary

Spotify is a supplier of audio streaming and other media services that was established in October 2008 and is headquartered in Sweden. It is now one of the largest digital music, podcast, and video streaming services in the world, and it provides users with access to millions of songs created by artists from all over the world. In this report we performed data munging before starting our exploratory data analysis. We also visualized some general features to show the relationship between the variables. Finally, we summarized our main characteristics using data visualizations. We went through the steps of data munging, exploration and finally analysis.

Data Munging

To start we check and identified NA values, and remove them to keep only the observations with no missing values.

# Check NA values
which(is.na(spotify_songs), arr.ind=TRUE)

##         row col
##  [1,]  8152   2
##  [2,]  9283   2
##  [3,]  9284   2
##  [4,] 19569   2
##  [5,] 19812   2
##  [6,]  8152   3
##  [7,]  9283   3
##  [8,]  9284   3
##  [9,] 19569   3
## [10,] 19812   3
## [11,]  8152   6
## [12,]  9283   6
## [13,]  9284   6
## [14,] 19569   6
## [15,] 19812   6

# Getting  the names of the R data frame columns that contain missing values.
spotify_songs1 <- as.data.frame(cbind(lapply(lapply(spotify_songs, is.na), sum)))
rownames(subset(spotify_songs1, spotify_songs1$V1 != 0))

## [1] "track_name"       "track_artist"     "track_album_name"

# Removing and keeping the clean data without NA values
clean_data <- na.omit(spotify_songs)

Initial Data Exploration

In this section, we try to explore some general questions about the data, like what are the top albums, artists and genres? We can see in the following figure that Martin Garrix is the most popular artist.

# Top 5 Artists
top_artists <- clean_data %>%
  group_by(track_artist) %>%
  summarise(n_apperance = n()) %>%
  filter(n_apperance > 100) %>%
  arrange(desc(n_apperance))

top_artists$track_artist <- factor(top_artists$track_artist, levels = top_artists$track_artist [order(top_artists$n_apperance)])

ggplot(top_artists, aes(x = track_artist, y = n_apperance)) +
  geom_bar(stat = "identity", width = 0.6,fill = "pink", color="pink") +
  labs(title = "Popular Artists", x = "Artists", y = "Number of Appearances in the list") +
  theme(plot.title = element_text(size=15, hjust=0, face = "bold"), axis.title = element_text(size=12)) +
  geom_text(aes(label=n_apperance), hjust=2, size = 3, color = 'white') +
  coord_flip()

The next figure shows how Greatest Hits followed by Ultimate Freestyle Mega Mix Gold are the most popular Albums.

# Top 5 Albums
top_albums <- clean_data %>%
  group_by(track_album_name) %>%
  summarise(n_apperance = n()) %>%
  filter(n_apperance >= 29) %>%
  arrange(desc(n_apperance))

top_albums$track_album_name <- factor(top_albums$track_album_name, levels = top_albums$track_album_name [order(top_albums$n_apperance)])

ggplot(top_albums, aes(x = track_album_name, y = n_apperance)) +
  geom_bar(stat = "identity" , width = 0.6,fill = "pink", color="pink") +
  labs(title = "Popular Albums", x = "Albums", y = "Number of Apperances in the list") +
  theme(plot.title = element_text(size=15, hjust=0, face = "bold"), axis.title = element_text(size=12)) +
  geom_text(aes(label=n_apperance), hjust=2, size = 3, color = 'white') +
  coord_flip()

In this figure we observe that Pop immediatly followed by Latin are the most popular genres.

# The Most Popular Genre
clean_data %>%
  group_by(playlist_genre)%>%
  summarise(avg = mean(track_popularity))%>%
  ggplot(aes(playlist_genre, avg))+
  geom_col(color="pink", fill="pink")+
  theme_light()

Exploratory Data Analysis

After initial exploration of the data, we came up with additional questions:

Popularity

Are there any features that could have influence on the popularity of Latin?

To answer the question we compared features among all genres, we can see that Latin stands out in danceability and valence.

# Isolate the 12 features
spotify_songs %>%
  select(playlist_genre, !track_id:playlist_subgenre) -> featuresOnly

# How do features compare between genres
featuresOnly %>%
  pivot_longer(!c(playlist_genre), names_to = "features")%>%
  ggplot(aes(value))+
  geom_density(aes(color = playlist_genre), alpha = 0.5)+
  facet_wrap(vars(features), scales="free")

And if we go into further detail by drawing a boxplot we can certainly see that Latin has a higher valence than others.

ggplot(clean_data, aes(x=valence, y=playlist_genre)) + 
  geom_boxplot() +
  geom_boxplot(fill='pink', color="grey")+
  theme_light()

What is the most common song features?

This plot demonstrates that Spotify users favor songs with higher energy levels.

energy_only <- cut(clean_data$energy, breaks = 5)
clean_data %>%
  ggplot( aes(x = energy_only )) +
  geom_bar(fill = "pink", colour = "pink") +
  scale_x_discrete(name = "Energy")+
  theme_light()

While this plot shows how popularity in terms of valence is normally distributed

valence_only <- cut(clean_data$valence, breaks = 5)
clean_data %>%
  ggplot( aes(x = valence_only )) +
  geom_bar( fill = "pink", colour = "pink") +
  scale_x_discrete(name = "Valence")+
  theme_light()

Correlation

Next, we’re going to study to see if there exists any correlation between energy and loudness and acousticness. We found that there is a positive correlation between energy and loudness, while energy is negatively correlated with acousticness.

# Select only numerical variables
clean_data %>%
  dplyr::select(where(is.numeric)) -> clean_data_numeric
corr <- round(cor(clean_data_numeric),3)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE)

Does this correlation exist in specifc genres?

We can see that there is a correlation between loudness and Energy in most genres and the same for acousticness and Energy.

# loudness & Energy.
clean_data%>%
  ggplot(aes(x=loudness,y=energy))+
  geom_point()+
  ggtitle("Loudness/Energy")+
  theme_light()+
  geom_smooth(se=FALSE)+
  facet_wrap(~playlist_genre) -> val_nrgy
val_nrgy # More energetic song means the more loudness nearly in all genre

# acousticness & Energy.
clean_data%>%
  ggplot(aes(x=acousticness,y=energy))+
  geom_point()+
  ggtitle("Acousticness/Energy")+
  theme_light()+
  geom_smooth(se=FALSE)+
  facet_wrap(~playlist_genre) -> val_nrgy1
val_nrgy1 ## Energy & Acoustic. The lower the energy the more the acoustic.

Release Date

Is there a relationship between album release date and popularity?
Are features affected by album release date?

Y <- clean_data$acousticness
X <- as.numeric(substr(clean_data$track_album_release_date, 1,4))
by_years <- data.frame(X,Y)
by_years <- aggregate(Y~X, data = by_years, mean)
ggplot(by_years, aes(x = X, y = Y)) +
  geom_line(size=1.4) +
  labs(title = "Acousticness",
       x = "Years",
       y = "Avg. Acousticness") -> g

Y1 <- clean_data$loudness
X1 <- as.numeric(substr(clean_data$track_album_release_date, 1,4))
by_years1 <- data.frame(X1,Y1)
by_years1 <- aggregate(Y1~X1, data = by_years1, mean)
ggplot(by_years1, aes(x = X1, y = Y1)) +
  geom_line(size=1.4) +
  labs(title = "Loudness",
       x = "Years",
       y = "Avg. Loudness")-> g1

Y2 <- clean_data$danceability
X2 <- as.numeric(substr(clean_data$track_album_release_date, 1,4))
by_years2 <- data.frame(X2,Y2)
by_years2 <- aggregate(Y2~X2, data = by_years2, mean)
ggplot(by_years2, aes(x = X2, y = Y2)) +
  geom_line(size=1.4) +
  labs(title = "Danceability",
       x = "Years",
       y = "Avg. Danceability") -> g2

Y3 <- clean_data$speechiness
X3 <- as.numeric(substr(clean_data$track_album_release_date, 1,4))
by_years3 <- data.frame(X3,Y3)
by_years3 <- aggregate(Y3~X3, data = by_years3, mean)
ggplot(by_years3, aes(x = X3, y = Y3)) +
  geom_line(size=1.4) +
  labs(title = "Speechiness",
       x = "Years",
       y = "Avg. Speechiness")-> g3

Y4 <- clean_data$track_popularity
X4 <- as.numeric(substr(clean_data$track_album_release_date, 1,4))
by_years4 <- data.frame(X4,Y4)
by_years4 <- aggregate(Y4~X4, data = by_years4, mean)
ggplot(by_years4, aes(x = X4, y = Y4)) +
  geom_line(size=1.4) +
  labs(title = "Popularity",
       x = "Years",
       y = "Avg. popularity")-> g4

Y5 <- clean_data$energy
X5 <- as.numeric(substr(clean_data$track_album_release_date, 1,4))
by_years5 <- data.frame(X5,Y5)
by_years5 <- aggregate(Y5~X5, data = by_years5, mean)
ggplot(by_years5, aes(x = X5, y = Y5)) +
  geom_line(size=1.4) +
  labs(title = "Energy",
       x = "Years",
       y = "Avg. Energy")-> g5

figure <- ggarrange(g, g1, g2,g3, g4,g5, ncol = 3, nrow = 2)
figure

Unlike the acousticness, we can see that there is a major drop in 1960 for popularity, energy, danceability, loudness, and speechiness. However, we can notice that new songs are less acoustic, with lower speechiness and popularity than the older ones while danceability and loudness have an increase trend with time. on the other hand, energy became slightly stable after 1980.

To Further invistigate the major drops, we zoomed in on years between 1960 and 1980

Which introduces another question: Is the huge dip in the plot due to the small number of observations around the year 1960? In the table below we can see number of track albums released in each year suggesting that indeed the massive dips in the plot are caused by small number of observations around the year 1960.

The massive difference in number of observations per year is also clear in the bar plot below

Track duration

Does the track duration affect the popularity of the song?

To answer this question we compared the popularity of the songs with the duration after cutting it into intervals to better visualize the data, surprisingly, the song duration does not have a clear effect on the song popularity.

clean_data %>%
  group_by(duration_ms)%>%
  summarise(popularity = mean(track_popularity))%>%
  ggplot(aes(cut_number(duration_ms,5), popularity))+
  geom_col(color="pink", fill="pink")+
  theme_light()

Conclusion

The analysis of the Spotify dataset yielded the following results:

Pop and Latin are the top most popular genres.
The higher the danceability/ valence, the more positively it correlates to the popularity.
Energy and loudness are positively correlated.
Energy and acoustics are negativelly correlated.
Release date does not have a clear effect on the track popularity.
Track duration does not have a clear effect on the track popularity.