This is the last week of my data science bootcamp at General Assembly, and I started this blog three months ago. What a ride! Yet I feel like the journey is just starting. Overall, this was a challenging but positive experience. I wanted to combine two passions of mine when I started writing here, and it is only right that music is at the center of my final project.

I was lucky to find a Spotify dataset on the data.world website containing about 25,000 song titles, performers, and their characteristics, collected through the Spotify API. Fascinating! When I wrote my first blog post, I was thinking about, almost wishing for, a dataset such as this one.

As a data scientist, it is essential to frame the project and come up with a problem statement. Let’s say that I am an executive at a record company and that we recently launched a streaming service. I need to build a song recommender in order to improve the customer experience, build customer loyalty, and therefore increase our profits.

So not only do I have the opportunity to do a thorough exploratory data analysis on this dataset, but I can also build a cool recommender model out of it. Awesome! Being a musician myself, I have always been interested in song keys. I already knew that certain keys and chord changes come back all the time, and I found that C, G, and D were the most popular keys in my dataset.

I have access to much more information on each unique song in my data frame, such as its duration. Energy, danceability, acousticness, instrumentalness, liveness, and popularity each have a score assigned per song. The data also indicates each song's tempo, time signature, and genre, and whether or not it contains explicit lyrics. Only 2,574 of the 20,977 songs (about 12%) contained explicit lyrics, and that was a shocker to me! I would have totally believed that 20,000 songs contained explicit lyrics with all the rap going on today, but who knew?

Another interesting fact: the most popular song in my data frame is “Dance Monkey” by Tones and I. It was the first time I had heard of that artist, and I even thought I had made a mistake in my code somewhere. I went on YouTube to check and yes, it is a really popular song with over a billion views and old people dancing in the video; I didn’t get it. I thought it was a meme at first, but it’s a legit song that I had no clue about.

Aside from other interesting findings during my EDA process, the main goal is to build a recommender. Since I don’t have any user input, just the songs' characteristics, it will be a content-based recommender with the songs as the index. The metric used to find the most similar songs is cosine similarity. Once you enter a song title, the recommender returns the 10 songs with the highest cosine similarity to it.
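
To make the idea concrete, here is a minimal sketch of that kind of lookup in plain Python. It is not my actual project code: the tiny catalog, the three features, and the function names `cosine_similarity` and `recommend` are all made up for illustration. With real data you would probably also want to rescale features like tempo and duration so that they don't dominate the similarity.

```python
import math

# Hypothetical toy catalog: title -> [energy, danceability, acousticness].
# These titles and values are invented for illustration only.
SONGS = {
    "Song A": [0.9, 0.8, 0.1],
    "Song B": [0.8, 0.9, 0.2],
    "Song C": [0.1, 0.2, 0.9],
    "Song D": [0.2, 0.1, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Similarity is undefined for an all-zero vector; treat it as 0 here.
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(title, songs, n=10):
    """Return up to n (title, score) pairs most similar to the given song."""
    query = songs[title]
    scores = [
        (other, cosine_similarity(query, features))
        for other, features in songs.items()
        if other != title  # never recommend the song that was entered
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


if __name__ == "__main__":
    for other, score in recommend("Song A", SONGS, n=2):
        print(f"{other}: {score:.3f}")
```

Because the songs themselves are the index, the recommender needs no listening history at all: a brand-new user can type in one title and get results right away.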

I think one of the most interesting aspects of this capstone project will be the presentation! I can’t wait to showcase it interactively, but I am also eager to see my classmates' projects and admire the progress each one of us has made during these 12 weeks.

Tones and I ??

immersive data science bootcamp @ General Assembly