Spotify Exploratory Dataset Analysis

Introduction

Code and Documentation

Version control:

  • Git

  • GitHub

  • Consistent file structure and naming (e.g., 0-dataMunging)

GitHub page that includes all the files

Clear and properly commented code

Background

  • The data was acquired through the Spotify API in 2020 by TidyTuesday

  • The class of the data frame

[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
  • The number of rows
[1] 32833

The data set

  • The variables (columns)
 [1] "track_id"                 "track_name"              
 [3] "track_artist"             "track_popularity"        
 [5] "track_album_id"           "track_album_name"        
 [7] "track_album_release_date" "playlist_name"           
 [9] "playlist_id"              "playlist_genre"          
[11] "playlist_subgenre"        "danceability"            
[13] "energy"                   "key"                     
[15] "loudness"                 "mode"                    
[17] "speechiness"              "acousticness"            
[19] "instrumentalness"         "liveness"                
[21] "valence"                  "tempo"                   
[23] "duration_ms"             

The different Genres

  • The main genres in the data

  edm latin   pop   r&b   rap  rock 
 6043  5155  5507  5431  5746  4951 
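
The checks above (class, row count, column names, genre counts) can be sketched in base R as follows. The `songs` data frame here is a made-up toy stand-in with only three of the columns, not the real data, so the printed values will differ from the slides.

```r
# Toy stand-in for the Spotify data (values are invented for illustration)
songs <- data.frame(
  track_id = c("a1", "b2", "c3", "d4"),
  track_name = c("Song A", "Song B", "Song C", "Song D"),
  playlist_genre = c("pop", "rap", "pop", "rock"),
  stringsAsFactors = FALSE
)

class(songs)   # class of the object
nrow(songs)    # number of rows
names(songs)   # column names

# Count tracks per genre
genre_counts <- table(songs$playlist_genre)
genre_counts
```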

Cleaning: NA values

  • Check for NA observations
# A tibble: 5 × 4
  track_name track_artist track_album_name track_id              
  <chr>      <chr>        <chr>            <chr>                 
1 <NA>       <NA>         <NA>             69gRFGOWY9OMpFJgFol1u0
2 <NA>       <NA>         <NA>             5cjecvX0CmC9gK0Laf5EMQ
3 <NA>       <NA>         <NA>             5TTzhRSWQS4Yu8xTgAuq6D
4 <NA>       <NA>         <NA>             3VKFip3OdAvv4OfNTgFWeQ
5 <NA>       <NA>         <NA>             69gRFGOWY9OMpFJgFol1u0
  • After investigating, these are unique songs
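
One way to locate rows with missing values is sketched below, using `is.na()` and `complete.cases()` from base R. The toy `songs` data frame and its values are assumptions for illustration only.

```r
# Toy data with one row missing its name, artist, and album (invented values)
songs <- data.frame(
  track_id = c("a1", "b2", "c3"),
  track_name = c("Song A", NA, "Song C"),
  track_artist = c("Artist X", NA, "Artist Y"),
  track_album_name = c("Album 1", NA, "Album 2"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(songs))
na_per_column

# Rows with at least one missing value
na_rows <- songs[!complete.cases(songs), ]
na_rows
```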

Cleaning: Duplicates

  • Songs that have the same ID
[1] 4477
  • Songs are duplicated because they are on multiple playlists

  • Since we are not concerned with how many playlists a song is on, removing the duplicates is justified.
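
A minimal sketch of counting and dropping repeated track IDs with base R's `duplicated()` is shown below. The toy `songs` data frame is hypothetical, and this may not be the exact approach used in the project code.

```r
# Toy data: tracks "a1" and "b2" each appear on more than one playlist
songs <- data.frame(
  track_id = c("a1", "b2", "a1", "c3", "b2"),
  playlist_name = c("P1", "P1", "P2", "P2", "P3"),
  stringsAsFactors = FALSE
)

# Count rows whose track_id has already appeared earlier in the data
n_duplicates <- sum(duplicated(songs$track_id))
n_duplicates

# Keep only the first occurrence of each track_id
songs_unique <- songs[!duplicated(songs$track_id), ]
songs_unique
```

Note that `duplicated()` keeps the first occurrence, so the playlist columns of a retained row describe only one of the playlists the song appeared on.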

Categorical variables

  • The class of “key” and “mode” is numeric

    [1] "numeric"
    [1] "numeric"
  • However, they should be categorical variables with different levels

    [1] "factor"
    [1] "factor"
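
The conversion can be sketched with `factor()`. The toy values below are invented. The "minor"/"major" labels follow Spotify's encoding of `mode` (0 = minor, 1 = major) and are an optional addition, not something the slides show.

```r
# Toy data: key and mode stored as numbers (invented values)
songs <- data.frame(
  key = c(0, 7, 11, 7),
  mode = c(1, 0, 1, 1)
)

class(songs$key)   # "numeric" before conversion

# Convert to categorical variables
songs$key <- factor(songs$key)
songs$mode <- factor(songs$mode, levels = c(0, 1), labels = c("minor", "major"))

class(songs$key)   # "factor" after conversion
levels(songs$mode)
```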

Clean data

  • Genre counts after removing duplicates

    
      edm latin   pop   r&b   rap  rock 
     4877  4137  5132  4504  5401  4305 

Exploratory Data Analysis

Initial Data Exploration

  • Most popular artists

  • Most popular albums

  • Most popular genres
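
A ranking such as "most popular artists" could be computed as sketched below. This assumes popularity is summarised as the mean `track_popularity` per artist, which the slides do not state. The toy data is invented, and the same pattern would apply to albums or genres by swapping the grouping column.

```r
# Toy data (invented values)
songs <- data.frame(
  track_artist = c("X", "X", "Y", "Z", "Z"),
  track_popularity = c(80, 60, 90, 40, 50),
  stringsAsFactors = FALSE
)

# Mean popularity per artist (assumed summary; other choices are possible)
artist_popularity <- aggregate(track_popularity ~ track_artist,
                               data = songs, FUN = mean)

# Sort from most to least popular
artist_popularity <- artist_popularity[order(-artist_popularity$track_popularity), ]
artist_popularity
```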

How features change within genres

Latin

Latin stands out in danceability and valence.

  • danceability

  • valence
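
A per-genre comparison of these two features could look like the sketch below, which averages `danceability` and `valence` within each genre. The toy values are invented so that Latin comes out on top, and using the mean is an assumption; the original plots may show full distributions instead.

```r
# Toy data (invented values)
songs <- data.frame(
  playlist_genre = c("latin", "latin", "rock", "rock", "pop", "pop"),
  danceability = c(0.80, 0.70, 0.50, 0.55, 0.65, 0.60),
  valence = c(0.70, 0.60, 0.50, 0.55, 0.50, 0.45),
  stringsAsFactors = FALSE
)

# Mean danceability and valence for each genre
genre_means <- aggregate(cbind(danceability, valence) ~ playlist_genre,
                         data = songs, FUN = mean)
genre_means

# Genre with the highest mean danceability
top_genre <- as.character(genre_means$playlist_genre[which.max(genre_means$danceability)])
top_genre
```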

Release Date

  • Is there a relationship between album release date and popularity?
  • Are features affected by album release date?

The large dips in the plot are caused by the small number of observations around 1960, which is evident in the bar plot of observations per year.
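
The yearly counts and yearly mean popularity behind plots like these can be sketched as follows. The toy data is invented, and the sketch assumes every release date string starts with a four-digit year (some entries may contain only the year).

```r
# Toy data (invented values); the third date is year-only
songs <- data.frame(
  track_album_release_date = c("1960-05-01", "2019-06-14", "2019",
                               "2018-11-02", "2019-01-20"),
  track_popularity = c(10, 70, 60, 50, 80),
  stringsAsFactors = FALSE
)

# Extract the year from the first four characters of the date string
songs$release_year <- as.integer(substr(songs$track_album_release_date, 1, 4))

# Number of tracks per year: sparse years make yearly averages unstable
tracks_per_year <- table(songs$release_year)
tracks_per_year

# Mean popularity per release year
mean_pop_per_year <- tapply(songs$track_popularity, songs$release_year, mean)
mean_pop_per_year
```

With a single track in 1960, that year's mean is just that one track's popularity, which is the kind of effect that produces sharp dips.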

Correlation

What’s the correlation between energy, loudness and acousticness?

  • Positive correlation between energy and loudness

  • Negative correlation between energy and acousticness
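
The pairwise correlations can be computed with `cor()`, as in this sketch. The toy values are invented to mimic the pattern described above (Pearson correlation is the `cor()` default).

```r
# Toy data (invented values)
songs <- data.frame(
  energy = c(0.2, 0.4, 0.6, 0.8, 0.9),
  loudness = c(-14, -10, -8, -5, -4),
  acousticness = c(0.9, 0.6, 0.4, 0.2, 0.1)
)

# Pairwise Pearson correlations between the three features
cor_matrix <- cor(songs)
round(cor_matrix, 2)
```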

  • Does this correlation exist in a specific genre?
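
To check the correlation within each genre, the data can be split by `playlist_genre` and `cor()` applied to each piece. This is a sketch on invented toy data with only two genres.

```r
# Toy data (invented values)
songs <- data.frame(
  playlist_genre = rep(c("edm", "rock"), each = 4),
  energy = c(0.6, 0.7, 0.8, 0.9, 0.3, 0.5, 0.6, 0.8),
  loudness = c(-8, -6, -5, -3, -12, -10, -9, -6),
  stringsAsFactors = FALSE
)

# Energy-loudness correlation computed separately for each genre
cor_by_genre <- sapply(
  split(songs, songs$playlist_genre),
  function(g) cor(g$energy, g$loudness)
)
cor_by_genre
```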

Track duration

Does the track duration affect the popularity of the song?
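
One simple way to examine this question is to convert `duration_ms` to minutes and compute its correlation with `track_popularity`, as sketched below on invented toy data. A correlation near zero would be consistent with no clear linear effect, although it would not rule out a non-linear one.

```r
# Toy data (invented values)
songs <- data.frame(
  duration_ms = c(150000, 180000, 210000, 240000, 300000),
  track_popularity = c(55, 60, 40, 65, 50)
)

# Convert milliseconds to minutes for readability
songs$duration_min <- songs$duration_ms / 60000

# Pearson correlation between duration and popularity
duration_cor <- cor(songs$duration_min, songs$track_popularity)
duration_cor
```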

Conclusion

  • Pop and Latin are the most popular genres.

  • Higher danceability and valence are associated with higher popularity.

  • Energy and loudness are positively correlated.

  • Energy and acousticness are negatively correlated.

  • Release date does not have a clear effect on track popularity.

  • Track duration does not have a clear effect on track popularity.