Finding new wedding bops with {tidyclust} and {spotifyr}

Last November, I (finally) popped the big question and proposed! Since then, my fiance and I have been diligently planning our wedding. While we have most of the big-ticket items checked off (venue, catering, photography, etc.), the wedding playlist still needs work. We’ve started putting one together on Spotify, but it feels like it’s come to a bit of a standstill. Currently, there’s a mix of zesty bops and tame songs on the playlist (we need to accommodate both our college friends and our grandparents!), but Spotify’s track recommender only wants to suggest tamer songs right now. Our goal is to have a full dance floor the entire night. To achieve this, we can use spotifyr and the new tidyclust package to pull in the current playlist, cluster the songs based on their features, and find new songs based on the bop cluster.


If you’d like to follow along, I’d recommend installing the development versions of parsnip and workflows, as some of the functionality that interacts with tidyclust isn’t yet on CRAN.

Pulling in the playlist

spotifyr is an R interface to Spotify’s web API and gives access to a host of track features (you can follow this tutorial to get it set up). I’ll use the functions get_user_playlists() and get_playlist_tracks() to pull in songs that are currently on our wedding playlist (appropriately named “Ding dong”).

# load the packages used throughout
library(tidyverse)
library(spotifyr)
library(tidyclust)

# get the songs that are currently on the wedding playlist
ding_dong <- 
  get_user_playlists("12130039175") %>%
  filter(name == "Ding dong") %>%
  pull(id) %>%
  get_playlist_tracks() %>% 
  as_tibble() %>%
  select(track.id, track.name, track.popularity) %>%
  rename_with(~stringr::str_replace(.x, "\\.", "_"))
track_id track_name track_popularity
5jkFvD4UJrmdoezzT1FRoP Rasputin 65
1D066zixBwqFYqBhKgdPzp Fergalicious 71
12jjuxN1gxlm29cqL5M6MW I Got You 65
2grjqo0Frpf2okIBiifQKs September 81
2RlgNHKcydI9sayD2Df2xp Mr. Blue Sky 80
6x4tKaOzfNJpEJHySoiJcs Mambo No. 5 (a Little Bit of…) 77
3n3Ppam7vgaVa1iaRUc9Lp Mr. Brightside 66
7Cp69rNBwU0gaFT8zxExlE Ymca 50
3Gf5nttwcX9aaSQXRWidEZ Ride Wit Me 76
3wMUvT6eIw2L5cZFG1yH9j Country Grammar (Hot Shit) 70

Spotify estimates quite a few features for each song in its catalog: speechiness (the presence of words on a track), acousticness (whether or not a song includes acoustic instruments), liveness (whether the track is live or studio-recorded), etc. We can use get_track_audio_features() to get the features for each song based on its track_id.

# pull in track features of songs on the playlist
track_features <- 
  ding_dong %>%
  pull(track_id) %>%
  get_track_audio_features()

# join together
ding_dong <- 
  ding_dong %>%
  left_join(track_features, by = c("track_id" = "id"))

In my case, I’m interested in the energy and valence (positivity) of each song, so I’ll select these variables to use in the cluster analysis.

track_name valence energy
Rasputin 0.966 0.893
Fergalicious 0.829 0.583
I Got You 0.544 0.399
September 0.979 0.832
Mr. Blue Sky 0.478 0.338
Mambo No. 5 (a Little Bit of…) 0.892 0.807
Mr. Brightside 0.240 0.918
Ymca 0.671 0.951
Ride Wit Me 0.722 0.700
Country Grammar (Hot Shit) 0.565 0.664

Clustering with tidyclust

Currently, the playlist covers a wide spectrum of songs. For new songs on the playlist, I’m really just interested in songs similar to the ones in the top right corner of the chart below, with high energy and valence.

Broadly, there are three generic categories that the songs on the current playlist fall into: high energy and valence, low energy, or low valence (songs with low energy and valence will fall into one of the “low” categories). Rather than manually assign categories, we can use tidyclust to cluster the songs into three groups using the k-means algorithm.

There’s some great documentation on the tidyclust site, but to get started, we’ll categorize the songs on the current playlist by “fitting” a k-means model (using the stats engine under the hood).

# create a clustering obj
ding_dong_clusters <- 
  k_means(num_clusters = 3) %>%
  fit(~ valence + energy,
      data = ding_dong) 
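
To get a feel for what that fit is doing without Spotify credentials, here is a minimal sketch in base R. It assumes the stats engine boils down to a call to stats::kmeans() (my understanding, not something verified here), and it only uses the ten songs from the table above, so the clusters will not match the ones fitted on the full playlist. The seed and nstart values are arbitrary choices.

```r
# valence and energy for the ten songs shown in the table above
songs <- data.frame(
  track_name = c("Rasputin", "Fergalicious", "I Got You", "September",
                 "Mr. Blue Sky", "Mambo No. 5", "Mr. Brightside", "Ymca",
                 "Ride Wit Me", "Country Grammar"),
  valence = c(0.966, 0.829, 0.544, 0.979, 0.478,
              0.892, 0.240, 0.671, 0.722, 0.565),
  energy  = c(0.893, 0.583, 0.399, 0.832, 0.338,
              0.807, 0.918, 0.951, 0.700, 0.664)
)

# k-means starts from random centers, so fix a seed (arbitrary value)
# and try several random starts to get a more stable answer
set.seed(123)
fit <- kmeans(songs[, c("valence", "energy")], centers = 3, nstart = 25)

# one cluster label per song, plus the fitted cluster centers
songs$cluster <- fit$cluster
fit$centers
table(songs$cluster)
```

Because the starting centers are random, the cluster numbers themselves are arbitrary: which group ends up as cluster 1 can change from run to run unless a seed is set.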

As expected, the majority of songs in the current playlist fall into the bop cluster. Let’s explore this cluster in more detail with a custom metric, vibe.

# assign to clusters
ding_dong_vibes <- 
  ding_dong_clusters %>%
  augment(ding_dong) %>%
  select(track_name, valence, energy,
         .pred_cluster) %>%
  mutate(vibe = valence + energy)

# what are songs with the biggest vibe?
ding_dong_vibes %>%
  arrange(desc(vibe)) %>%
  slice_head(n = 10)
track_name valence energy .pred_cluster vibe
Hey Ya! 0.965 0.974 Cluster_1 1.939
Rasputin 0.966 0.893 Cluster_1 1.859
September 0.979 0.832 Cluster_1 1.811
She Bangs - English Version 0.858 0.950 Cluster_1 1.808
Take on Me 0.876 0.902 Cluster_1 1.778
The Legend of Chavo Guerrero 0.913 0.858 Cluster_1 1.771
Can’t Hold Us (feat. Ray Dalton) 0.847 0.922 Cluster_1 1.769
Toxic 0.924 0.838 Cluster_1 1.762
Timber (feat. Ke$ha) 0.788 0.963 Cluster_1 1.751
Shake It Off 0.942 0.800 Cluster_1 1.742
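
Since vibe is just valence plus energy, and Spotify reports both on a 0 to 1 scale, it runs from 0 to 2. A quick base R check on four songs from the tables in this post (the data frame name is only for illustration):

```r
# four songs from the tables in this post, deliberately out of order
vibe_check <- data.frame(
  track_name = c("Mr. Blue Sky", "September", "Hey Ya!", "Rasputin"),
  valence    = c(0.478, 0.979, 0.965, 0.966),
  energy     = c(0.338, 0.832, 0.974, 0.893),
  stringsAsFactors = FALSE
)

# vibe = valence + energy
vibe_check$vibe <- vibe_check$valence + vibe_check$energy

# biggest vibe first
vibe_check <- vibe_check[order(-vibe_check$vibe), ]
vibe_check
```

This puts Hey Ya! (1.939) first, followed by Rasputin (1.859), September (1.811), and Mr. Blue Sky (0.816), matching the values in the tables.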

As expected, when arranging by vibe, the top songs are all a part of the first cluster. And they are, indeed, a vibe.

Compare that with the second cluster, whose songs are generally lower energy (I’d personally disagree with Spotify ranking Mr. Blue Sky and Single Ladies as “low energy,” but most others make sense).

ding_dong_vibes %>%
  filter(.pred_cluster == "Cluster_2") %>%
  arrange(vibe) %>%
  slice_head(n = 10)
track_name valence energy .pred_cluster vibe
Mr. Blue Sky 0.478 0.338 Cluster_2 0.816
Single Ladies (Put a Ring on It) 0.272 0.584 Cluster_2 0.856
Low (feat. T-Pain) 0.304 0.609 Cluster_2 0.913
I Got You 0.544 0.399 Cluster_2 0.943
Wake Up in the Sky 0.367 0.578 Cluster_2 0.945
Summer, Highland Falls - Live at the Bayou, Washington, D.C. - July 1980 0.452 0.544 Cluster_2 0.996
Wagon Wheel 0.634 0.403 Cluster_2 1.037
Hung Up 0.405 0.647 Cluster_2 1.052
Take Me Out 0.527 0.663 Cluster_2 1.190
Country Grammar (Hot Shit) 0.565 0.664 Cluster_2 1.229

Finally, the third cluster mostly contains songs with low valence but relatively high energy.

ding_dong_vibes %>%
  filter(.pred_cluster == "Cluster_3") %>%
  arrange(vibe) %>%
  slice_head(n = 10)
track_name valence energy .pred_cluster vibe
Clarity 0.176 0.781 Cluster_3 0.957
Titanium (feat. Sia) 0.301 0.787 Cluster_3 1.088
Mr. Brightside 0.240 0.918 Cluster_3 1.158
All Night (feat. Knox Fortune) 0.392 0.777 Cluster_3 1.169
Forever 0.445 0.819 Cluster_3 1.264
Shout, Pts. 1 & 2 0.416 0.866 Cluster_3 1.282
The Spins 0.550 0.766 Cluster_3 1.316
Club Can’t Handle Me (feat. David Guetta) 0.473 0.869 Cluster_3 1.342
Body (feat. Brando) 0.582 0.764 Cluster_3 1.346
Levels - Radio Edit 0.464 0.889 Cluster_3 1.353

Now that I have the songs in the current playlist sorted by cluster, let’s pull in some new songs and assign them to the appropriate cluster!

Adding new songs

To go searching for new songs, we’ll start by casting a wide net, then narrow the search with some of the get_*() functions from spotifyr. I’ll start by using get_categories() to explore the categories available in Spotify.

get_categories() %>%
  as_tibble() %>%
  select(id, name) %>%
  slice_head(n = 10)
id name
toplists Top Lists
hiphop Hip-Hop
pop Pop
country Country
0JQ5DAqbMKFxXaXKP7zcDp Latin
rock Rock
summer Summer
0JQ5DAqbMKFAXlCG6QvYQ4 Workout
edm_dance Dance/Electronic

I don’t really want to play country music or R&B during the wedding, so I’ll filter to a few categories before using get_category_playlists() to pull in the featured playlists available in each category.

# pull in playlist ids
playlists <- 
  get_categories() %>%
  as_tibble() %>%
  filter(id %in% c("toplists", "hiphop", "pop", "rock", "summer")) %>%
  pull(id) %>%
  map_dfr(get_category_playlists) %>%
  as_tibble() %>%
  select(id, name, description) %>%
  distinct(id, .keep_all = TRUE)

playlists %>%
  slice_head(n = 10)
id name description
37i9dQZF1DXcBWIGoYBM5M Today’s Top Hits Steve Lacy is on top of the Hottest 50!
37i9dQZF1DX0XUsuxWHRQd RapCaviar Music from Drake, Offset and 42 Dugg.
37i9dQZF1DXcF6B6QPhFDv Rock This The latest from Panic! At The Disco along with the Rock songs you need to hear today.
37i9dQZF1DX4dyzvuaRJ0n mint The world’s biggest dance hits. Cover: Zedd & Maren Morris
37i9dQZF1DX1lVhptIYRda Hot Country Today’s top country hits of the week, worldwide! Cover: Tyler Hubbard
37i9dQZF1DX10zKzsJ2jva Viva Latino Today’s top Latin hits, elevando nuestra música. Cover: Anitta, Maluma.
37i9dQZF1DX4SBhb3fqCJd Are & Be The pulse of R&B music today. Cover: Tink
37i9dQZEVXbLRQDuF5jeBp Top 50 - USA Your daily update of the most played tracks right now - USA.
37i9dQZEVXbMDoHDwVN2tF Top 50 - Global Your daily update of the most played tracks right now - Global.
37i9dQZEVXbLiRSasKsNU9 Viral 50 - Global Your daily update of the most viral tracks right now - Global.

There are a lot of playlists in the playlists data frame, so I’ve gone through and selected a few that I’m interested in exploring further.

selected_playlists <-
  c("Today's Top Hits",
    "Top 50 - USA",
    "Top 50 - Global",
    "Viral 50 - USA",
    "Viral 50 - Global",
    "New Music Friday",
    "Most Necessary",
    "Internet People",
    "Gold School",
    "Hot Hits USA",
    "Pop Rising",
    "teen beats",
    "big on the internet",
    "Party Hits",
    "Mega Hit Mix",
    "Pumped Pop",
    "Hit Rewind",
    "The Ultimate Hit Mix",
    "00s Rock Anthems",
    "Summer Hits",
    "Barack Obama's Summer 2022 Playlist",
    "Summer Hits of the 10s",
    "Family Road Trip")

With this shorter list of playlists, I can pull in all the songs that appear on each with get_playlist_tracks(). Some songs may appear on multiple playlists, so we’ll only look at unique songs by track_id. I’ve already pulled in features for songs currently on the playlist, so we can filter those out as well. Finally, get_track_audio_features() limits queries to a maximum of 100 songs, so we’ll select the top 100 most popular songs within the sample.

new_songs <- 
  playlists %>%
  filter(name %in% selected_playlists) %>%
  pull(id) %>%
  map_dfr(get_playlist_tracks) %>%
  as_tibble()

new_songs <- 
  new_songs %>%
  select(track.id, track.name,
         track.popularity) %>%
  rename_with(~stringr::str_replace(.x, "\\.", "_")) %>%
  distinct(track_id, .keep_all = TRUE) %>%
  arrange(desc(track_popularity)) %>%
  filter(!track_id %in% ding_dong$track_id) %>%
  slice_head(n = 100)
track_id track_name track_popularity
2tTmW7RDtMQtBk7m2rYeSw Quevedo: Bzrp Music Sessions, Vol. 52 100
6Sq7ltF9Qa7SNFBsV5Cogx Me Porto Bonito 99
1IHWl5LamUGEuP4ozKQSXZ Tití Me Preguntó 97
4LRPiXqCikLlN15c3yImP7 As It Was 96
6xGruZOHLs39ZbVccQTuPZ Glimpse of Us 96
5Eax0qFko2dh7Rl2lYs3bx Efecto 96
3k3NWokhRRkEPhCzPmV8TW Ojitos Lindos 96
6Xom58OOXk2SoU711L2IXO Moscow Mule 95
0mBP9X2gPCuapvpZ7TGDk3 Left and Right (Feat. Jung Kook of BTS) 94
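
The dedupe, sort, filter, and slice steps are easy to lose track of in a long pipe, so here is the same logic sketched in base R on a toy data frame. Everything in it is made up for illustration: the ids, the popularity scores, already_on_playlist (standing in for ding_dong$track_id), and n_keep (standing in for the 100-song cap).

```r
# hypothetical tracks pulled from several playlists (ids and scores are made up)
pulled <- data.frame(
  track_id         = c("a", "b", "a", "c", "d", "b"),
  track_popularity = c(90, 75, 90, 60, 85, 75),
  stringsAsFactors = FALSE
)
already_on_playlist <- c("d")  # stands in for ding_dong$track_id
n_keep <- 2                    # stands in for the 100-song cap

# 1. keep the first row per track_id, like distinct(track_id, .keep_all = TRUE)
unique_tracks <- pulled[!duplicated(pulled$track_id), ]

# 2. sort by popularity, highest first, like arrange(desc(track_popularity))
unique_tracks <- unique_tracks[order(-unique_tracks$track_popularity), ]

# 3. drop songs that are already on the playlist
candidates <- unique_tracks[!unique_tracks$track_id %in% already_on_playlist, ]

# 4. keep the n most popular, like slice_head(n = 100)
top_tracks <- head(candidates, n_keep)
top_tracks$track_id
```

Here track "d" is popular enough to make the cut but gets dropped because it is already on the playlist, leaving "a" and "b".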

Now let’s assign these 100 new songs to the clusters we found earlier based on their valence and energy!

new_song_features <- 
  new_songs %>%
  pull(track_id) %>%
  get_track_audio_features()

new_songs <- 
  new_songs %>%
  left_join(new_song_features, by = c("track_id" = "id"))
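
Capping the sample at 100 songs keeps things simple, and it is what I do here. If you wanted features for more than 100 songs, one alternative (not used in this post) would be to split the ids into batches of at most 100 and query each batch separately. A minimal base R sketch of the batching step, with hypothetical ids:

```r
# hypothetical ids; in practice this would be a vector of Spotify track ids
ids <- sprintf("track_%03d", 1:250)

# split into consecutive batches of at most 100 ids each
batch_size <- 100
batches <- split(ids, ceiling(seq_along(ids) / batch_size))

# number of ids in each batch
lengths(batches)
```

Each element of batches could then be passed to get_track_audio_features() in turn (for example with map_dfr()) and the results stacked into one data frame.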

new_songs_clustered <- 
  ding_dong_clusters %>%
  augment(new_songs) %>%
  select(track_name, valence, energy,
         .pred_cluster) %>%
  mutate(vibe = valence + energy)
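
Assigning a new song to an existing k-means cluster amounts to finding the closest cluster center, which is my understanding of what augment() does for a k-means fit. Here is a rough base R sketch of that idea, followed by pulling out the bop candidates. The centers, song names, and feature values below are all hypothetical placeholders, not output from the fitted model.

```r
# hypothetical cluster centers (made-up numbers, not the fitted model's)
centers <- rbind(
  bop         = c(valence = 0.85, energy = 0.85),
  low_energy  = c(valence = 0.45, energy = 0.50),
  low_valence = c(valence = 0.35, energy = 0.85)
)

# hypothetical new songs
toy_songs <- data.frame(
  track_name = c("song_a", "song_b", "song_c"),
  valence    = c(0.90, 0.30, 0.50),
  energy     = c(0.80, 0.90, 0.40),
  stringsAsFactors = FALSE
)

# squared Euclidean distance from every song (rows) to every center (columns)
pts <- as.matrix(toy_songs[, c("valence", "energy")])
dists <- sapply(rownames(centers), function(k) {
  rowSums(sweep(pts, 2, centers[k, ])^2)
})

# each song is assigned to its closest center
toy_songs$cluster <- colnames(dists)[max.col(-dists, ties.method = "first")]
toy_songs$vibe <- toy_songs$valence + toy_songs$energy

# candidate bops, biggest vibe first
bops <- toy_songs[toy_songs$cluster == "bop", ]
bops[order(-bops$vibe), ]
```

With the real data, the equivalent step would be filtering new_songs_clustered to the bop cluster and arranging by vibe.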