the data diary - Finding new wedding bops with {tidyclust} and {spotifyr}

Code

sysfonts::font_add_google("Roboto Slab")
showtext::showtext_auto()

ggplot2::theme_set(
  ggplot2::theme_minimal(base_family = "Roboto Slab", 
                         base_size = 14) +
    ggplot2::theme(plot.title.position = "plot",
                   plot.background = ggplot2::element_rect(fill = "white", color = "white"),
                   plot.title = ggtext::element_markdown(),
                   plot.subtitle = ggtext::element_markdown())
)

Last November, I (finally) popped the big question and proposed! Since then, my fiance and I have been diligently planning our wedding. While we have most of the big-ticket items checked off (venue, catering, photography, etc.), one area we still have more work to do is on the wedding playlist. We’ve started putting together a playlist on spotify, but it feels like it’s come to a bit of a stand-still. Currently, there’s a mix of zesty bops and tame songs on the playlist (we need to accommodate both our college friends and our grandparents!), but spotify’s track recommender only wants to suggest tamer songs right now. Our goal is to have a full dance floor the entire night — to achieve this, we can use spotifyr and the new tidyclust package to pull in the current playlist, cluster the songs based on their features, and find new songs based on the bop cluster.

Code

library(tidymodels)
library(tidyclust)
library(spotifyr)

If you’d like to follow along, I’d recommend installing the development versions of parsnip and workflows, as some of the functionality that interacts with tidyclust isn’t yet on CRAN.

Pulling in the playlist

spotifyr is an R interface to spotify’s web API and gives access to a host of track features (you can follow this tutorial to get it setup). I’ll use the functions get_user_playlists() and get_playlist_tracks() to pull in songs that are currently on our wedding playlist (appropriately named “Ding dong”).

Code

# get the songs that are currently on the wedding playlist
ding_dong <- 
  get_user_playlists("12130039175") %>%
  filter(name == "Ding dong") %>%
  pull(id) %>%
  get_playlist_tracks() %>% 
  as_tibble() %>%
  select(track.id, track.name, track.popularity) %>%
  rename_with(~stringr::str_replace(.x, "\\.", "_"))

ding_dong %>%
  slice_head(n = 10) %>%
  knitr::kable()

track_id	track_name	track_popularity
5jkFvD4UJrmdoezzT1FRoP	Rasputin	61
1D066zixBwqFYqBhKgdPzp	Fergalicious	66
12jjuxN1gxlm29cqL5M6MW	I Got You	62
2grjqo0Frpf2okIBiifQKs	September	78
2RlgNHKcydI9sayD2Df2xp	Mr. Blue Sky	76
6x4tKaOzfNJpEJHySoiJcs	Mambo No. 5 (a Little Bit of…)	72
3n3Ppam7vgaVa1iaRUc9Lp	Mr. Brightside	62
7Cp69rNBwU0gaFT8zxExlE	Ymca	45
3Gf5nttwcX9aaSQXRWidEZ	Ride Wit Me	72
3wMUvT6eIw2L5cZFG1yH9j	Country Grammar (Hot Shit)	65

Spotify estimates quite a few features for each song in their catalog: speechiness (the presence of words on a track), acousticness (whether or not a song includes acoustic instruments), liveness (estimates whether or not the track is live or studio-recorded), etc. We can use get_track_audio_features() to get the features for each song based on its track_id.

Code

# pull in track features of songs on the playlist
track_features <- 
  ding_dong %>%
  pull(track_id) %>%
  get_track_audio_features()

# join together
ding_dong <- 
  ding_dong %>%
  left_join(track_features, by = c("track_id" = "id"))

In my case, I’m interested in the energy and valence (positivity) of each song, so I’ll select these variables to use in the cluster analysis.

Code

ding_dong %>%
  select(track_name, valence, energy) %>%
  slice_head(n = 10) %>%
  knitr::kable()

track_name	valence	energy
Rasputin	0.966	0.893
Fergalicious	0.829	0.583
I Got You	0.544	0.399
September	0.979	0.832
Mr. Blue Sky	0.478	0.338
Mambo No. 5 (a Little Bit of…)	0.892	0.807
Mr. Brightside	0.240	0.918
Ymca	0.671	0.951
Ride Wit Me	0.722	0.700
Country Grammar (Hot Shit)	0.565	0.664

Clustering with `tidyclust`

Currently, the playlist covers a wide spectrum of songs. For new songs on the playlist, I’m really just interested in songs similar to others in the top right corner of the below chart with high energy and valence.

Code

# how are valence/energy related?
obj <- 
  ding_dong %>%
  ggplot(aes(x = valence,
             y = energy,
             tooltip = track_name)) + 
  ggiraph::geom_point_interactive(size = 3.5, alpha = 0.5) +
  scale_x_continuous(labels = scales::label_percent(accuracy = 1)) +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +
  labs(title = "The current wedding playlist",
       subtitle = "Hover over each point to see the song's name!")

ggiraph::girafe(
  ggobj = obj,
  options = list(
    ggiraph::opts_tooltip(opacity = 0.8,
                          css = "background-color:gray;color:white;padding:2px;border-radius:2px;font-family:Roboto Slab;"),
    ggiraph::opts_hover(css = "fill:#1279BF;stroke:#1279BF;cursor:pointer;")
  )
)

Broadly, there are three generic categories that the songs on the current playlist fall into: high energy and valence, low energy, or low valence (songs with low energy and valence will fall into one of the “low” categories). Rather than manually assign categories, we can use tidyclust to cluster the songs into three groups using the kmeans algorithm.

There’s some great documentation on the tidyclust site, but to get started, we’ll categorize the songs on the current playlist by “fitting” a kmeans model (using the stats engine under the hood).

Code

# create a clustering obj
set.seed(918)
ding_dong_clusters <- 
  k_means(num_clusters = 3) %>%
  fit(~ valence + energy,
      data = ding_dong)

Code

pal <- MetBrewer::MetPalettes$Egypt[[1]]

obj <- 
  ding_dong_clusters %>%
  augment(ding_dong) %>%
  ggplot(aes(x = valence,
             y = energy,
             color = .pred_cluster,
             tooltip = track_name)) +
  ggiraph::geom_point_interactive(size = 3.5, alpha = 0.75) +
  scale_x_continuous(labels = scales::label_percent(accuracy = 1)) +
  scale_y_continuous(labels = scales::label_percent(accuarcy = 1)) +
  theme(legend.position = "none") +
  labs(title = "Clusters in the current playlist",
       subtitle = glue::glue("Clustered into",
                             "**{riekelib::color_text('zesty bops',pal[1])}**,",
                             "**{riekelib::color_text('angsty bangers', pal[3])}**,",
                             "and",
                             "**{riekelib::color_text('mellow jams', pal[2])}**",
                             .sep = " ")) +
  MetBrewer::scale_color_met_d("Egypt")

ggiraph::girafe(
  ggobj = obj,
  options = list(
    ggiraph::opts_tooltip(opacity = 0.8,
                          use_fill = TRUE,
                          css = "color:white;padding:2px;border-radius:2px;font-family:Roboto Slab;"),
    ggiraph::opts_hover(css = "fill:#1279BF;stroke:#1279BF;cursor:pointer;")
  )
)

As expected, the majority of songs in the current playlist fall into the bop cluster. Let’s explore this cluster using in more detail with the custom metric vibe.

Code

# assign to clusters
ding_dong_vibes <- 
  ding_dong_clusters %>%
  augment(ding_dong) %>%
  select(track_name,
         valence, 
         energy, 
         .pred_cluster) %>%
  mutate(vibe = valence + energy)

# what are songs with the biggest vibe?
ding_dong_vibes %>%
  arrange(desc(vibe)) %>%
  slice_head(n = 10) %>%
  knitr::kable()

track_name	valence	energy	.pred_cluster	vibe
Hey Ya!	0.965	0.974	Cluster_1	1.939
Rasputin	0.966	0.893	Cluster_1	1.859
September	0.979	0.832	Cluster_1	1.811
She Bangs - English Version	0.858	0.950	Cluster_1	1.808
Take on Me	0.876	0.902	Cluster_1	1.778
The Legend of Chavo Guerrero	0.913	0.858	Cluster_1	1.771
Can’t Hold Us (feat. Ray Dalton)	0.847	0.922	Cluster_1	1.769
Toxic	0.924	0.838	Cluster_1	1.762
Timber (feat. Ke$ha)	0.788	0.963	Cluster_1	1.751
Shake It Off	0.942	0.800	Cluster_1	1.742

As expected, when arranging by vibe, the top songs are all a part of the first cluster. And they are, indeed, a vibe:

Compare that with the second cluster, which are generally lower energy (I’d personally disagree with spotify ranking Mr. Blue Sky and Single Ladies as “low energy,” but most others make sense).

Code

ding_dong_vibes %>%
  filter(.pred_cluster == "Cluster_2") %>%
  arrange(vibe) %>%
  slice_head(n = 10) %>%
  knitr::kable()

track_name	valence	energy	.pred_cluster	vibe
Mr. Blue Sky	0.478	0.338	Cluster_2	0.816
Single Ladies (Put a Ring on It)	0.272	0.584	Cluster_2	0.856
Low (feat. T-Pain)	0.304	0.609	Cluster_2	0.913
I Got You	0.544	0.399	Cluster_2	0.943
Wake Up in the Sky	0.367	0.578	Cluster_2	0.945
Summer, Highland Falls - Live at the Bayou, Washington, D.C. - July 1980	0.452	0.544	Cluster_2	0.996
Gimme! Gimme! Gimme! (A Man After Midnight)	0.536	0.491	Cluster_2	1.027
Wagon Wheel	0.634	0.403	Cluster_2	1.037
Hung Up	0.405	0.647	Cluster_2	1.052
Take Me Out	0.527	0.663	Cluster_2	1.190

Finally, the third cluster mostly contains songs with low valence but relatively high energy.

Code

ding_dong_vibes %>%
  filter(.pred_cluster == "Cluster_3") %>%
  arrange(vibe) %>%
  slice_head(n = 10) %>%
  knitr::kable()

track_name	valence	energy	.pred_cluster	vibe
Clarity	0.176	0.781	Cluster_3	0.957
Love Story	0.296	0.741	Cluster_3	1.037
Titanium (feat. Sia)	0.301	0.787	Cluster_3	1.088
Mr. Brightside	0.240	0.918	Cluster_3	1.158
All Night (feat. Knox Fortune)	0.392	0.777	Cluster_3	1.169
Turn to Stone	0.458	0.720	Cluster_3	1.178
Body	0.469	0.732	Cluster_3	1.201
Forever	0.445	0.819	Cluster_3	1.264
Shout, Pts. 1 & 2	0.416	0.866	Cluster_3	1.282
The Spins	0.550	0.766	Cluster_3	1.316

Now that I have the songs in the current playlist sorted by cluster, let’s pull in some new songs and assign them to the appropriate cluster!

Adding new songs

To go searching for new songs, we’ll start by casting a wide net then narrow the search with some of the get_*() functions from spotifyr. I’ll start by using get_categories() to explore the categories available in spotify.

Code

get_categories() %>%
  as_tibble() %>%
  select(id, name) %>%
  slice_head(n = 10) %>%
  knitr::kable()

id	name
toplists	Top Lists
0JQ5DAqbMKFQ00XGBls6ym	Hip-Hop
0JQ5DAqbMKFEC4WFtoNRpw	Pop
0JQ5DAqbMKFKLfwjuJMoNC	Country
0JQ5DAqbMKFxXaXKP7zcDp	Latin
0JQ5DAqbMKFDXXwE9BDJAr	Rock
0JQ5DAqbMKFLVaM30PMBm4	Summer
0JQ5DAqbMKFAXlCG6QvYQ4	Workout
0JQ5DAqbMKFEZPnFQSFB1T	R&B
0JQ5DAqbMKFHOzuVTgTizF	Dance/Electronic

I don’t really want to play country music or R&B during the wedding, so I’ll filter to a few categories before using get_category_playlists() to pull in the featured playlists available in each category.

Code

# pull in playlist ids
playlists <- 
  get_categories() %>%
  as_tibble() %>%
  filter(id %in% c("toplists", "hiphop", "pop", "rock", "summer")) %>%
  pull(id) %>%
  map_dfr(get_category_playlists) %>%
  as_tibble() %>%
  select(id, name, description) %>%
  distinct(id, .keep_all = TRUE)

playlists %>%
  slice_head(n = 10) %>%
  knitr::kable()

id	name	description
37i9dQZF1DXcBWIGoYBM5M	Today’s Top Hits	The Kid LAROI is on top of the Hottest 50!
37i9dQZF1DX0XUsuxWHRQd	RapCaviar	New music from GloRilla and NLE Choppa.
37i9dQZF1DXcF6B6QPhFDv	Rock This	YONAKA along with the biggest Rock songs you need to hear today!
37i9dQZF1DX4dyzvuaRJ0n	mint	The world’s biggest dance hits. Cover: Young Marco
37i9dQZF1DX1lVhptIYRda	Hot Country	Today’s top country hits of the week, worldwide! Cover: Elle King
37i9dQZF1DX10zKzsJ2jva	Viva Latino	Today’s top Latin hits, elevando nuestra música. Cover: KAROL G & Romeo Santos
37i9dQZF1DX4SBhb3fqCJd	Are & Be	The pulse of R&B music today. Cover: Ella Mai
37i9dQZEVXbLRQDuF5jeBp	Top 50 - USA	Your daily update of the most played tracks right now - USA.
37i9dQZEVXbMDoHDwVN2tF	Top 50 - Global	Your daily update of the most played tracks right now - Global.
37i9dQZEVXbLiRSasKsNU9	Viral 50 - Global	Your daily update of the most viral tracks right now - Global.

There’s a lot of playlists in playlists, so I’ve gone through and selected a few that I’m interested in exploring further.

Code

selected_playlists <-
  c("Today's Top Hits",
    "mint",
    "Top 50 - US",
    "Top 50 - Global",
    "Viral 50 - US",
    "Viral 50 - Global",
    "New Music Friday",
    "Most Necessary",
    "Internet People",
    "Gold School",
    "Hot Hits USA",
    "Pop Rising",
    "teen beats",
    "big on the internet",
    "Party Hits",
    "Mega Hit Mix",
    "Pumped Pop",
    "Hit Rewind",
    "The Ultimate Hit Mix",
    "00s Rock Anthems",
    "Summer Hits",
    "Barack Obama's Summer 2022 Playlist",
    "Summer Hits of the 10s",
    "Family Road Trip")

With this shorter list of playlists, I can pull in the all the songs that appear on each with get_playlist_tracks(). Some songs may appear on multiple playlists, so we’ll only look at unique songs by track_id. I’ve already pulled in features for songs currently on the playlist, so we can filter those out as well. Finally, get_track_audio_features() limits queries to a maximum of 100 songs, so we’ll select the top 100 most popular songs within the sample.

Code

new_songs <- 
  playlists %>%
  filter(name %in% selected_playlists) %>%
  pull(id) %>%
  map_dfr(get_playlist_tracks) %>%
  as_tibble()

new_songs <- 
  new_songs %>%
  select(track.id,
         track.name,
         track.popularity) %>%
  rename_with(~stringr::str_replace(.x, "\\.", "_")) %>%
  distinct(track_id, .keep_all = TRUE) %>%
  arrange(desc(track_popularity)) %>%
  filter(!track_id %in% ding_dong$track_id) %>%
  slice_head(n = 100)

new_songs %>%
  slice_head(n = 10) %>%
  knitr::kable()

track_id	track_name	track_popularity
0yLdNVWF3Srea0uzk55zFn	Flowers	100
4nrPB8O7Y7wsOCJdgXkthe	Shakira: Bzrp Music Sessions, Vol. 53	96
2dHHgzDwk4BJdRwy9uXhTO	Creepin’ (with The Weeknd & 21 Savage)	94
3nqQXoyQOWXiESFLlDF1hG	Unholy (feat. Kim Petras)	93
4uUG5RXrOk84mYEfFvj3cK	I’m Good (Blue)	93
5ww2BF9slyYgNOk37BlC4u	La Bachata	93
0V3wPSX9ygBnCm8psDIegu	Anti-Hero	92
1Qrg8KqiBpW07V7PNxwwwL	Kill Bill	91
0WtM2NBVQNNJLh6scP13H8	Calm Down (with Selena Gomez)	91
78Sw5GDo6AlGwTwanjXbGh	Here With Me	91

Now let’s assign these 100 news songs to the clusters we found earlier based on their valence and energy!

Code

new_song_features <- 
  new_songs %>%
  pull(track_id) %>%
  get_track_audio_features()

new_songs <- 
  new_songs %>%
  left_join(new_song_features, by = c("track_id" = "id"))

new_songs_clustered <- 
  ding_dong_clusters %>%
  augment(new_songs) %>%
  select(track_name,
         valence,
         energy,
         .pred_cluster) %>%
  mutate(vibe = valence + energy)

obj <- 
  new_songs_clustered %>%
  ggplot(aes(x = valence,
             y = energy,
             color = .pred_cluster,
             tooltip = track_name)) +
  ggiraph::geom_point_interactive(size = 3.5, alpha = 0.75) +
  scale_x_continuous(labels = scales::label_percent(accuracy = 1)) +
  scale_y_continuous(labels = scales::label_percent(accuarcy = 1)) + 
  theme(legend.position = "none") +
  labs(title = "New songs, same old clusters",
       subtitle = glue::glue("Clustered into",
                             "**{riekelib::color_text('zesty bops',pal[1])}**,",
                             "**{riekelib::color_text('angsty bangers', pal[3])}**,",
                             "and",
                             "**{riekelib::color_text('mellow jams', pal[2])}**",
                             .sep = " ")) +
  MetBrewer::scale_color_met_d("Egypt")

ggiraph::girafe(
  ggobj = obj,
  options = list(
    ggiraph::opts_tooltip(opacity = 0.8,
                          use_fill = TRUE,
                          css = "color:white;padding:2px;border-radius:2px;font-family:Roboto Slab;"),
    ggiraph::opts_hover(css = "fill:#1279BF;stroke:#1279BF;cursor:pointer;")
  )
)

Nice! It looks like the new songs are far more broad than the original playlist, but we can look at just the songs in the first cluster with the biggest vibe.

Code

new_songs_clustered %>%
  filter(.pred_cluster == "Cluster_1") %>%
  arrange(desc(vibe)) %>%
  slice_head(n = 10) %>%
  knitr::kable()

track_name	valence	energy	.pred_cluster	vibe
In The Yuma (feat. Aatig)	0.940	0.891	Cluster_1	1.831
MERCHO	0.962	0.790	Cluster_1	1.752
Zona De Perigo	0.970	0.767	Cluster_1	1.737
I’m Not Here To Make Friends	0.838	0.895	Cluster_1	1.733
PRC	0.893	0.826	Cluster_1	1.719
Boy’s a liar Pt. 2	0.857	0.809	Cluster_1	1.666
Late Night Talking	0.901	0.728	Cluster_1	1.629
I Ain’t Worried	0.825	0.797	Cluster_1	1.622
All By Myself	0.773	0.848	Cluster_1	1.621
Calm Down (with Selena Gomez)	0.802	0.806	Cluster_1	1.608

Now for the true vibe check — do these songs belong on the playlist?

Oh hell yeah!

This analysis was originally done on Aug. 20th, 2022 — Spotify’s featured playlists and tracks change on on a regular basis and also are time dependent on unique user data. When remapping from blogdown to Quarto in February 2023, it’s likely that the songs mentioned in the text differ from the songs pulled in from Spotify’s API.

Citation

BibTeX citation:

@online{rieke2022,
  author = {Rieke, Mark},
  title = {Finding New Wedding Bops with \{Tidyclust\} and \{Spotifyr\}},
  date = {2022-08-20},
  url = {https://www.thedatadiary.net/posts/2022-08-20-finding-new-wedding-bops-with-tidyclust-and-spotifyr},
  langid = {en}
}

For attribution, please cite this work as:

Rieke, Mark. 2022. “Finding New Wedding Bops with {Tidyclust} and {Spotifyr}.” August 20, 2022. https://www.thedatadiary.net/posts/2022-08-20-finding-new-wedding-bops-with-tidyclust-and-spotifyr.

Finding new wedding bops with {tidyclust} and {spotifyr}

Pulling in the playlist

Clustering with `tidyclust`

Adding new songs

Citation

Stay in touch

Support my work with a coffee

Share

Pulling in the playlist

Clustering with tidyclust

Adding new songs

Citation

Stay in touch

Support my work with a coffee

Share

Clustering with `tidyclust`