#!/usr/bin/env python
# coding: utf-8

# # Music Through The Ages: Billboard Top 100 Charts Analysis
# 
# Frank Chen
# 
# Data 512 Fall 2019
# 
# ## Introduction
# 
# I analyzed the Billboard Hot 100 weekly charts from 1958 to 2019 to examine three relationships: popularity vs. relevance, velocity of the climb to #1 vs. #1 streak, and Billboard peak position vs. views of the corresponding YouTube music video. I show that songs from the 2010s are not outperforming songs from the past, and I confirm that YouTube views correlate with higher Billboard peak positions, with the caveat that YouTube streaming already feeds into the chart and is therefore a confounding factor.
# 
# As someone who is constantly looking for new music to listen to, I definitely discovered some new songs from this research. I believe this type of analysis would be interesting to historians, ethnographers, anthropologists, music enthusiasts, and even music managers, since it provides a data-driven approach to understanding how music popularity evolves as society evolves, and vice versa.
# 
# ## Background & Related Work
# 
# Throughout history, music has consistently been used as one of the cultural indicators of society, in addition to literature, art, and film. The Billboard Charts have been ranking top-performing songs since the 1930s [[1](https://www.wikiwand.com/en/Billboard_charts)], and serve as one of the indicators of a song's success. In recent years, with the introduction of social media platforms such as Facebook, Instagram, YouTube, and TikTok, there now exist strategies for making a song go viral [[2](https://www.grammy.com/grammys/news/what-music-goes-viral-tiktok)]. However, streaming platforms are only one part of the complex formula used in calculating the Billboard Top 100. My main motivation for this project is to answer the question: Is today's (2010s) music exceeding the Billboard performance of music from the past?
# 
# There has been much related work analyzing the Billboard charts and the music that topped them. The Billboard website itself covers this topic, including its digital song sales chart [[3](https://www.billboard.com/charts/digital-song-sales)] and Chart Beat, its column about top-performing songs [[4](https://www.billboard.com/chart-beat)]. In addition, others have combined the Billboard charts dataset with another dataset, such as Spotify's, to analyze changes in society's song preferences, as in this blog post from Towards Data Science [[5](https://towardsdatascience.com/billboard-hot-100-analytics-using-data-to-understand-the-shift-in-popular-music-in-the-last-60-ac3919d39b49)].
# 
# ## Research Questions
# 
# We will divide our analysis into 3 sections, each aiming to answer one of 3 questions:
# 
# **I. How are song popularity and relevance related in the Billboard charts?**
# 
# **II. How are song velocity and no. 1 streak related in the Billboard charts?**
# 
# **III. How do other streaming platforms such as YouTube influence Billboard charts?**
# 
# ### Definitions
# 
# - **Popularity**: how much of its chart run the song spent at no. 1. We represent this as the ratio of the number of weeks at no. 1 to the total number of weeks the song spent on the chart.
# - **Relevance**: how long the song stayed on the Billboard charts. We represent this as the total number of weeks the song spent on the chart.
# - **Velocity**: how many weeks it took the song to reach no. 1 on Billboard.
# - **#1 Streak**: how many weeks the song stayed at no. 1 on Billboard, counted as the total number of weeks at no. 1.
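# 
# To make these definitions concrete, here is a minimal sketch that computes all four metrics for a single song. The weekly positions are made up for illustration (they are not taken from the dataset), and `chart_metrics` is a hypothetical helper rather than part of the analysis code below.

# In[ ]:


def chart_metrics(weekly_positions):
    """Compute the four metrics for one song from its weekly chart positions.

    `weekly_positions` lists the song's position for each week on the chart,
    in chronological order (an illustrative input, not the dataset's schema).
    """
    relevance = len(weekly_positions)                    # total weeks on chart
    streak = sum(1 for p in weekly_positions if p == 1)  # total weeks at no. 1
    popularity = round(streak / relevance, 2)            # share of weeks at no. 1
    # velocity: week of the chart run in which the song first hit no. 1
    velocity = weekly_positions.index(1) + 1 if streak else None
    return {'Popularity': popularity, 'Relevance': relevance,
            'Velocity': velocity, '#1 Streak': streak}


# made-up song: climbs for 3 weeks, holds no. 1 for 2 weeks, then falls
demo_metrics = chart_metrics([40, 12, 3, 1, 1, 5, 9, 20])
print(demo_metrics)
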
# 
# ## The Data
# 
# In this analysis, I will be using two datasets, both in CSV format and available in the `raw_data/` folder. Both datasets are labeled as public domain data. More details about the CC0 license can be found [here](https://creativecommons.org/publicdomain/zero/1.0/).
# 
# [**Billboard Weekly Charts Data**](https://data.world/kcmillersean/billboard-hot-100-1958-2017): Weekly Hot 100 singles chart from August 1958 to June 2019 (317795 rows, 10 columns)
# 
# > url, WeekID, Week Position, Song, Performer, SongID, Instance, Previous Week Position, Peak Position, Weeks on Chart
# 
# [**YouTube Trending Video Data**](https://www.kaggle.com/datasnaek/youtube-new): Daily trending YouTube videos from Nov 2017 - May 2018 (40949 rows, 16 columns)
# 
# Since I plan on joining this data with the Billboards data to connect music to views, the only columns I will be using in the YouTube dataset are:
# 
# > title, channel_title, views
# 
# ## Methodology
# 
# Methods used in this analysis: pivot tables, joins, duplicate removal, lambdas, and merges on substring matches. Specific methods are annotated in the code comments with links to the Stack Overflow answers that guided them. I used `pandas` for the data transformations and `plotly` for the visualizations.
# 
# I wanted to ensure that all results were generated from the same input dataset without overlapping transformations, so I first perform some initial data cleaning, then use the cleaned data for the rest of the analysis. The analysis is split into 3 sections, one per research question, and each section has 3 parts: data preparation, data visualization, and data analysis.

# ### Initial Data Cleaning
# 
# We first load the raw data, then do the initial data cleaning: parse the `WeekID` field and split it into `month`, `day`, and `year` columns for easier analysis later.

# In[1]:


# import all libraries
import pandas as pd
import numpy as np

# standard plotly imports
import plotly.graph_objects as go


# In[2]:


# load raw data
bboard_data = pd.read_csv('../raw_data/hot_100.csv')


# In[3]:


bboard_data.head()


# In[4]:


# parse WeekID into datetimes so that year, month, and day can be extracted
bboard_data['WeekID'] = pd.to_datetime(bboard_data['WeekID'], format='%m/%d/%Y')


# In[5]:


bboard_data['year'] = bboard_data['WeekID'].dt.year
bboard_data['month'] = bboard_data['WeekID'].dt.month
bboard_data['day'] = bboard_data['WeekID'].dt.day
bboard_data.to_csv('tmp_data/cleaned_bboard_data.csv')
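
# The date handling above can be checked on a couple of hand-written values. This is a minimal sketch: the two `WeekID` strings are illustrative, and they are assumed to follow the same `%m/%d/%Y` layout that the format string above expects.

# In[ ]:


import pandas as pd

demo_weeks = pd.DataFrame({'WeekID': ['8/2/1958', '12/30/2017']})
# parse the strings into timestamps using an explicit format
demo_weeks['WeekID'] = pd.to_datetime(demo_weeks['WeekID'], format='%m/%d/%Y')
# the .dt accessor exposes the individual date components
demo_weeks['year'] = demo_weeks['WeekID'].dt.year
demo_weeks['month'] = demo_weeks['WeekID'].dt.month
demo_weeks['day'] = demo_weeks['WeekID'].dt.day
print(demo_weeks)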


# ### I. How are song popularity and relevance related in the Billboard charts?
# 
# This dataset provides potential for rich analysis into both the popularity **and** relevance of no. 1 songs on the Billboard charts.
# 
# I will be preparing a marker chart that represents 3 dimensions of data: the x-axis will show the songs, the y-axis will show the popularity of each song, and each data point will be a marker whose size and color both encode the relevance of the song.

# #### Step 1: Data Preparation

# In[6]:


# read cleaned bboard_data
bboard_data = pd.read_csv('tmp_data/cleaned_bboard_data.csv')


# In[7]:


# use pivot table to extract counts of week positions for each song
# stackoverflow link: https://stackoverflow.com/questions/54527134/counting-column-values-based-on-values-in-other-columns-for-pandas-dataframes
bboard_data['count'] = 1
result = bboard_data.pivot_table(
    index=['Song'], columns='Week Position', values='count',
    fill_value=0, aggfunc='sum'
)
# save result to csv for future use
result.to_csv('tmp_data/bboard_song_position_count.csv')
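
# To show what the pivot table above produces, here is a small sketch on made-up rows (songs `A` and `B` are placeholders). Each cell counts the weeks a song spent at a given position. Note that once such a table goes through a CSV round trip, the position labels are read back as strings, which is why the later cell selects the column `'1'`.

# In[ ]:


import pandas as pd

demo_chart = pd.DataFrame({
    'Song': ['A', 'A', 'A', 'B', 'B'],
    'Week Position': [2, 1, 1, 5, 1],
})
demo_chart['count'] = 1
# rows: songs; columns: chart positions; cells: number of weeks at that position
demo_counts = demo_chart.pivot_table(
    index='Song', columns='Week Position', values='count',
    fill_value=0, aggfunc='sum'
)
print(demo_counts)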


# In[8]:


# read song_position_count.csv
song_position_count = pd.read_csv("tmp_data/bboard_song_position_count.csv")


# In[9]:


# keep only the song and no. 1 column
song_position_count = song_position_count[['Song','1']]
song_position_count.to_csv('tmp_data/bboard_song_no1_count.csv')


# In[10]:


# join song_no1_count with bboard_data
song_no1_count = pd.read_csv('tmp_data/bboard_song_no1_count.csv')
bboard_data = pd.merge(bboard_data, song_no1_count, how='left', left_on='Song', right_on='Song')


# In[11]:


# remove irrelevant columns
# stackoverflow link: https://stackoverflow.com/questions/14940743/selecting-excluding-sets-of-columns-in-pandas
bboard_data = bboard_data.drop(['Unnamed: 0_x', 'Unnamed: 0_y'], axis=1)
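
# The `Unnamed: 0_x` and `Unnamed: 0_y` columns dropped above come from the CSV round trips: `to_csv` writes the index by default, `read_csv` reads it back as an `Unnamed: 0` column, and the merge then suffixes the clashing names. The sketch below reproduces this with made-up data, using an in-memory buffer in place of the `tmp_data/` files.

# In[ ]:


import io
import pandas as pd

demo_frame = pd.DataFrame({'Song': ['A', 'B'], 'Weeks on Chart': [3, 7]})

# default to_csv writes the index, which read_csv returns as 'Unnamed: 0'
demo_with_index = pd.read_csv(io.StringIO(demo_frame.to_csv()))
# index=False omits the index, so no extra column appears on re-read
demo_without_index = pd.read_csv(io.StringIO(demo_frame.to_csv(index=False)))
# merging two frames that both carry 'Unnamed: 0' yields the _x/_y suffixes
demo_merged = pd.merge(demo_with_index,
                       demo_with_index[['Unnamed: 0', 'Song']], on='Song')

print(list(demo_with_index.columns))
print(list(demo_without_index.columns))
print(list(demo_merged.columns))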


# In[12]:


# clean rows to keep only entry for total weeks on chart
# stackoverflow link: https://stackoverflow.com/questions/50283775/python-pandas-keep-row-with-highest-column-value
bboard_data_tmp = bboard_data.sort_values('Weeks on Chart').drop_duplicates(["Song"],keep='last')
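
# The sort-then-deduplicate step above keeps, for each song, the row with the largest `Weeks on Chart` value. A minimal sketch on made-up rows:

# In[ ]:


import pandas as pd

demo_rows = pd.DataFrame({
    'Song': ['A', 'A', 'B', 'A', 'B'],
    'Weeks on Chart': [1, 2, 1, 3, 2],
})
# after an ascending sort, the last row per song has the largest week count
demo_last = (demo_rows.sort_values('Weeks on Chart')
                      .drop_duplicates(['Song'], keep='last'))
print(demo_last)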


# In[13]:


# clean columns to keep only data needed for visualization
bboard_data_tmp = bboard_data_tmp[['Song', 'Performer', 'Weeks on Chart', '1', 'year', 'month']]


# In[14]:


# rename column values
# stackoverflow link: https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
bboard_data_tmp.columns = ['Song', 'Performer', 'Relevance (Total Weeks on Chart)', 'Count of no. 1', 'Year', 'Month']
# calculate popularity
# stackoverflow link: https://stackoverflow.com/questions/26133538/round-a-single-column-in-pandas
bboard_data_tmp['Popularity'] = bboard_data_tmp['Count of no. 1']/bboard_data_tmp['Relevance (Total Weeks on Chart)']
bboard_data_tmp['Popularity'] = bboard_data_tmp['Popularity'].round(2)


# In[15]:


# add month and year to performer column for visualizing later
bboard_data_tmp['Performer'] = bboard_data_tmp.Performer.map(str) + " (" + bboard_data_tmp.Month.map(str) +"-" + bboard_data_tmp.Year.map(str) + ")"


# In[16]:


# save to csv for future analysis
bboard_data_tmp.to_csv('tmp_data/bboard_song_pop_relevance.csv')


# #### Step 2: Data Visualization

# In[17]:


# read in the data and drop unneeded column
bboard_song_pop_relevance = pd.read_csv('tmp_data/bboard_song_pop_relevance.csv')
bboard_song_pop_relevance = bboard_song_pop_relevance.drop(['Unnamed: 0'], axis=1)


# In[18]:


# only use the 100 songs that spent the most weeks at no. 1
bboard_song_pop_relevance_top100 = bboard_song_pop_relevance.nlargest(100, 'Count of no. 1')
bboard_song_pop_relevance_top100.to_csv('tmp_data/bboard_song_pop_relevance_top100.csv')
bboard_song_pop_relevance_top100 = bboard_song_pop_relevance_top100.sort_values(by=['Year'])


# In[19]:


# create bubble chart
# plotly docs: https://plot.ly/python/bubble-charts/
# plotly colorscale: https://plot.ly/python/v3/matplotlib-colorscales/
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=bboard_song_pop_relevance_top100['Song'],
    y=bboard_song_pop_relevance_top100['Popularity'],
    mode='markers',
    marker=dict(
        line=dict(width=2, color='LightGrey'),
        # marker size and color both encode relevance (total weeks on chart)
        size=bboard_song_pop_relevance_top100['Relevance (Total Weeks on Chart)'],
        cmax=70,
        cmin=10,
        color=bboard_song_pop_relevance_top100['Relevance (Total Weeks on Chart)'],
        colorbar=dict(
            title="Relevance"
        ),
        colorscale="magma",
        sizeref=2.*15/(5.**2)
    ),
    text=bboard_song_pop_relevance_top100['Performer'],
    hovertemplate = "<b>%{x}</b><br><i>%{text}</i><br><br>Popularity: %{y}<br>Relevance: %{marker.size}",
))

fig.update_layout(
    title={
        'text': "Top 100 Billboard #1 Songs in terms of Relative Popularity & Relevance",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
        width=1000,
        height=800,
        margin=go.layout.Margin(
            l=50,
            r=50,
            b=300,
            t=100,
            pad=4
        ))
fig.update_xaxes(title_text='Song')
fig.update_xaxes(tickangle=45)
fig.update_yaxes(title_text='Popularity')

fig.show()
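
# The chart above sets `sizeref` by hand and leaves `sizemode` at its default (`'diameter'`). As a hedged alternative, the sketch below applies the scaling rule suggested in plotly's bubble chart documentation so that marker *area* tracks the data. `bubble_marker` and `max_px` are hypothetical names introduced here, and the sizes are made up; the returned dict could be passed as `marker=` to `go.Scatter`.

# In[ ]:


def bubble_marker(sizes, max_px=40):
    """Build a plotly-style marker dict whose bubble area scales with `sizes`.

    `max_px` is the desired diameter (in pixels) of the largest bubble; it is
    an illustrative parameter, not a value used in the chart above.
    """
    sizes = list(sizes)
    return dict(
        size=sizes,
        sizemode='area',  # scale marker area rather than diameter
        # sizeref = 2 * max(size) / (desired max marker size ** 2)
        sizeref=2.0 * max(sizes) / (max_px ** 2),
        sizemin=4,        # keep the smallest bubbles visible
    )


demo_marker = bubble_marker([20, 45, 68], max_px=40)
print(demo_marker['sizeref'])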


# #### Step 3: Findings - Data Analysis
# 
# There are a few interesting observations to be made from the interactive chart above. 
# 
# First, we see that the relatively most popular songs through the ages generally do not have as much relevance (they have a higher no. 1-vs-rest ratio on the Billboard charts, but spend fewer weeks on the charts). 
# 
# Next, we see that no. 1 songs from 2000 onward tend to have more relevance (they tend to spend more weeks on the charts than pre-2000 no. 1 songs). This may be due to external factors such as the rise of YouTube music videos (more on this in the third section).
# 
# Lastly, we see that the data points that stand out (the ones near the top) are the ones that have a [high number of weeks spent in the no. 1 position](https://www.billboard.com/articles/columns/chart-beat/6077132/hot-100-songs-longest-leading-no-1s). A clear example is 'Old Town Road', with a popularity score of `0.73`. There are some exceptions to this, including 'Despacito' and 'Uptown Funk!'.
# 
# Next, we'll extract the top 10 songs. Songs from the 2010s only account for 4 out of the 10 songs in this list.

# In[20]:


# extract the top 10 songs by popularity first, then relevance
bboard_song_pop_relevance_top100.sort_values(['Popularity', 'Relevance (Total Weeks on Chart)'], ascending=[False, False]).head(10)


# ### II. How are song velocity and no. 1 streak related in the Billboard charts?

# In addition to analyzing song popularity and relevance, it is also important to understand trends in a song's velocity and #1 streak.
# 
# I will be preparing a marker chart that represents 3 dimensions of data: the x-axis will show the songs, the y-axis will show the velocity of each song, and each data point will be a marker whose size and color both encode the #1 streak of the song.
# 
# #### Step 1: Data Preparation

# In[21]:


# read cleaned bboard_data
bboard_data = pd.read_csv('tmp_data/cleaned_bboard_data.csv')
# clean rows to keep only entry for first time the song reaches peak position
# stackoverflow link: https://stackoverflow.com/questions/50283775/python-pandas-keep-row-with-highest-column-value
bboard_data = bboard_data.sort_values(['Week Position', 'Weeks on Chart'], ascending=[True, True]).drop_duplicates(["Song"],keep='first')
# join cleaned dataframe with top 100 songs
bboard_data = pd.merge(bboard_song_pop_relevance_top100, bboard_data, how='left', left_on='Song', right_on='Song')
bboard_data = bboard_data[['Song','Performer_x', 'Count of no. 1', 'Year', 'Popularity', 'Weeks on Chart']]
bboard_data.columns = ['Song', 'Performer', 'no. 1 Streak', 'Year', 'Popularity', 'Weeks before no. 1']
bboard_data.to_csv('tmp_data/bboard_song_velocity_top100.csv')
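
# The velocity extraction above relies on a two-key sort: best position first, then the earliest week at that position. A minimal sketch on made-up weekly rows (song `A` reaches no. 1 in week 2; song `B` only peaks at no. 2, in week 3):

# In[ ]:


import pandas as pd

demo_weekly = pd.DataFrame({
    'Song': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Week Position': [10, 1, 1, 4, 7, 5, 2, 2],
    'Weeks on Chart': [1, 2, 3, 4, 1, 2, 3, 4],
})
# best position first, and within it the earliest week; keep that row per song
demo_first_peak = (demo_weekly
                   .sort_values(['Week Position', 'Weeks on Chart'],
                                ascending=[True, True])
                   .drop_duplicates(['Song'], keep='first'))
print(demo_first_peak)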


# #### Step 2: Data Visualization

# In[22]:


# create bubble chart
# plotly docs: https://plot.ly/python/bubble-charts/
# plotly colorscale: https://plot.ly/python/v3/matplotlib-colorscales/
bboard_song_velocity_top100 = pd.read_csv('tmp_data/bboard_song_velocity_top100.csv')
bboard_song_velocity_top100 = bboard_song_velocity_top100.sort_values(by=['Year'])
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=bboard_song_velocity_top100['Song'],
    y=bboard_song_velocity_top100['Weeks before no. 1'],
    mode='markers',
    marker=dict(
        line=dict(width=2, color='DarkSlateGrey'),
        # marker size and color both encode the #1 streak
        size=bboard_song_velocity_top100['no. 1 Streak'],
        cmax=15,
        cmin=5,
        color=bboard_song_velocity_top100['no. 1 Streak'],
        colorbar=dict(
            title="#1 Streak"
        ),
        colorscale="Viridis",
        sizeref=2.*15/(7.**2),
    ),
    text=bboard_song_velocity_top100['Performer'],
    hovertemplate = "<b>%{x}</b><br><i>%{text}</i><br><br>Velocity: %{y}<br>#1 Streak: %{marker.size}",
))

fig.update_layout(
    title={
        'text': "Top 100 Billboard #1 Songs in terms of Velocity and #1 Streak",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
        width=1000,
        height=800,
        margin=go.layout.Margin(
            l=50,
            r=50,
            b=300,
            t=100,
            pad=4
        ))
fig.update_xaxes(title_text='Song')
fig.update_xaxes(tickangle=45)
fig.update_yaxes(title_text='Velocity')

fig.show()


# #### Step 3: Findings - Data Analysis
# 
# There are (again) some interesting observations to be made from this visualization. Immediately, we see an outlier with a high count of weeks before reaching #1 (indicating low velocity). This song, 'Macarena', nevertheless went on to spend 14 weeks at no. 1, an impressive feat. Another observation is that many songs in the 1990s reached #1 in fewer weeks than many of the songs in the 2000s. The 1990s pattern repeats in the 2010s, with songs consistently taking fewer than 7-8 weeks to reach no. 1. This could be due to the additional factors added to the Billboard charting calculations, including YouTube streaming, as we will explore in the third section.
# 
# Next, we'll extract the top 10 songs. Songs from the 2010s again only account for 4 out of the 10 songs in this list.

# In[23]:


# extract the top 10 songs by no. 1 streak, then weeks before no. 1
bboard_song_velocity_top100.sort_values(['no. 1 Streak', 'Weeks before no. 1'], ascending=[False, True]).head(10)


# ### III. How do other streaming platforms such as YouTube influence Billboard charts?
# 
# In 2013, Billboard added YouTube streaming to its Hot 100 calculations ([link](https://www.billboard.com/articles/news/1549399/hot-100-news-billboard-and-nielsen-add-youtube-video-streaming-to-platforms)). It would be interesting to see the contribution of trending YouTube music videos (that correlate to the songs) to the Billboard charting songs.
# 
# **Note**: the YouTube dataset only contains data from December 1, 2017 to May 31, 2018. I will be using a subset of the Billboard data to try and find correlations with this dataset.
# 
# #### Step 1: Data Preparation

# In[24]:


# load the youtube data
yt_data = pd.read_csv('../raw_data/yt_us_videos.csv')


# In[25]:


yt_data.head()


# In[26]:


# keep only the videos from official YouTube music accounts (VEVO)
# stackoverflow link: https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe
yt_data_vevo = yt_data[yt_data['channel_title'].str.contains("vevo",case=False)]
yt_data_vevo.to_csv('tmp_data/yt_data_vevo.csv')
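
# The channel filter above is a case-insensitive substring match. A minimal sketch on made-up channel names (none are taken from the dataset); the `na=False` argument makes rows with a missing channel title count as non-matches.

# In[ ]:


import pandas as pd

demo_channels = pd.DataFrame({
    'channel_title': ['ArtistOneVEVO', 'Some Vlogger', 'artisttwovevo', None],
})
# case=False matches any capitalization; na=False maps missing titles to False
demo_is_vevo = demo_channels['channel_title'].str.contains('vevo', case=False, na=False)
print(demo_channels[demo_is_vevo])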


# In[27]:


# read billboard cleaned data
bboard_data = pd.read_csv('tmp_data/cleaned_bboard_data.csv')
# isolate dates between Dec. 1, 2017 and May 31, 2018 (inclusive)
bboard_data['WeekID'] = pd.to_datetime(bboard_data['WeekID'])
mask = (bboard_data['WeekID'] >= '2017-12-01') & (bboard_data['WeekID'] <= '2018-05-31')
bboard_data = bboard_data.loc[mask]
bboard_data.to_csv('tmp_data/bboard_data_yt.csv')
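
# As a hedged aside, the same kind of date window can also be expressed with `Series.between`, which is inclusive on both ends by default. The dates below are made up to fall on either side of the window.

# In[ ]:


import pandas as pd

demo_dates = pd.DataFrame({
    'WeekID': pd.to_datetime(['2017-11-25', '2017-12-02', '2018-05-26', '2018-06-02']),
})
# keep only the rows whose date falls inside the window
demo_in_window = demo_dates['WeekID'].between('2017-12-01', '2018-05-31')
print(demo_dates[demo_in_window])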


# In[28]:


# join yt with billboard data
yt_data_vevo = pd.read_csv('tmp_data/yt_data_vevo.csv')
# merge on str.contains
# stackoverflow link: https://stackoverflow.com/questions/54756025/how-to-merge-pandas-on-string-contains
rhs = (bboard_data.Song
          .apply(lambda song: yt_data_vevo[yt_data_vevo.title.str.find(song).ge(0)]['title'])
          .bfill(axis=1)
          .iloc[:, 0])
tmp = pd.concat([bboard_data.Song, rhs], axis=1, ignore_index=True).rename(columns={0: 'Song', 1: 'Video Title'})
# repeat the lookup to attach the matching video's view count
rhs = (tmp.Song
          .apply(lambda song: yt_data_vevo[yt_data_vevo.title.str.find(song).ge(0)]['views'])
          .bfill(axis=1)
          .iloc[:, 0])
tmp = pd.concat([tmp, rhs], axis=1, ignore_index=True).rename(columns={0: 'Song', 1: 'Video Title', 2: 'Views'})
tmp.to_csv('tmp_data/song_views.csv')
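
# The substring join above is dense, so here is a sketch of the same idea in a more explicit form: for each song, take the first video whose title contains the song name. The song names, video titles, and view counts are made up, and `first_matching_video` is a hypothetical helper, not the code used to produce `song_views.csv`.

# In[ ]:


import pandas as pd

demo_songs = pd.Series(['Blue Sky', 'Missing Tune'], name='Song')
demo_videos = pd.DataFrame({
    'title': ['Artist One - Blue Sky (Official Video)', 'Unrelated clip'],
    'views': [1500000, 1000],
})


def first_matching_video(song, videos):
    """Return (title, views) of the first video whose title contains `song`."""
    # regex=False treats the song name as literal text, so characters such as
    # '(' or '?' in a song name are not interpreted as regex syntax
    hits = videos[videos['title'].str.contains(song, regex=False)]
    if hits.empty:
        return (None, None)
    return (hits.iloc[0]['title'], hits.iloc[0]['views'])


demo_matches = pd.DataFrame(
    [first_matching_video(song, demo_videos) for song in demo_songs],
    columns=['Video Title', 'Views'],
)
demo_matches.insert(0, 'Song', demo_songs)
print(demo_matches)

# Matching on substrings is case-sensitive and can pair a short song name with
# an unrelated video title, so matches are worth spot-checking by hand.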


# In[29]:


# Extract peak position for each song
bboard_data = pd.read_csv('tmp_data/cleaned_bboard_data.csv')
bboard_data = bboard_data.sort_values(['Peak Position', 'Weeks on Chart'], ascending=[True, False]).drop_duplicates(["Song"],keep='first')
bboard_data['Performer'] = bboard_data.Performer.map(str) + " (" + bboard_data.month.map(str) +"-" + bboard_data.year.map(str) + ")"
bboard_data.to_csv('tmp_data/bboard_song_peak.csv')


# In[30]:


# read bboard_song_peak data
bboard_song_peak = pd.read_csv('tmp_data/bboard_song_peak.csv')
song_view = pd.read_csv('tmp_data/song_views.csv')
# merge bboard_song_peak with song_view to combine views, youtube title with the rest of bboard data
bboard_data = pd.merge(bboard_song_peak, song_view, how='left', left_on='Song', right_on='Song')
bboard_data = bboard_data.dropna()
# stackoverflow link: https://stackoverflow.com/questions/29370057/select-dataframe-rows-between-two-dates
bboard_data = bboard_data.loc[(bboard_data['year'] == 2018) | (bboard_data['year'] == 2017)]
bboard_data = bboard_data.drop_duplicates(["Song"],keep='last')
# divide by a million for ease of visualization later
bboard_data['Views'] = bboard_data['Views']/1000000
bboard_data['Views'] = bboard_data['Views'].round(2)


# In[31]:


# sort on both year and month, select relevant columns, save to csv
bboard_data = bboard_data.sort_values(['year', 'month'], ascending=[True, True])
bboard_data = bboard_data[['Song','Performer', 'Peak Position', 'Weeks on Chart', 'Views', 'Video Title', 'year', 'month']]
bboard_data.columns = ['Song', 'Performer', 'Peak Position', 'Weeks on Chart', 'Views', 'Video Title', 'Year', 'Month']
bboard_data.to_csv('tmp_data/bboard_song_views.csv')


# #### Step 2: Data Visualization
# 
# I will be preparing a marker chart that represents 3 dimensions of data: the x-axis will show the songs, the y-axis will show the peak position of each song, and each data point will be a marker whose size and color both encode the views of the song's corresponding YouTube music video.

# In[32]:


# read bboard_song_view dataframe
bboard_song_view = pd.read_csv('tmp_data/bboard_song_views.csv')


# In[33]:


# create bubble chart
# plotly docs: https://plot.ly/python/bubble-charts/
# plotly colorscale: https://plot.ly/python/v3/matplotlib-colorscales/
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=bboard_song_view['Song'],
    y=bboard_song_view['Peak Position'],
    mode='markers',
    marker=dict(
        line=dict(width=2, color='DarkSlateGrey'),
        # marker size and color both encode YouTube views (in millions)
        size=bboard_song_view['Views'],
        cmax=35,
        cmin=0,
        color=bboard_song_view['Views'],
        colorbar=dict(
            title="YouTube Views (in millions)"
        ),
        colorscale="RdBu",
        sizeref=2.*15/(8.**2)
    ),
    text=bboard_song_view['Performer'],
    hovertemplate = "<b>%{x}</b><br><i>%{text}</i><br><br>Peak Position: %{y}<br>YouTube Views: %{marker.size}",
))

fig.update_layout(
    title={
        'text': "Billboard Songs in terms of Peak Position and YouTube Views 2017-2018",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
        width=1000,
        height=700,
        margin=go.layout.Margin(
            l=50,
            r=50,
            b=100,
            t=100,
            pad=4
        ))
fig.update_xaxes(title_text='Song')
fig.update_xaxes(tickangle=45)
fig.update_yaxes(title_text='Peak Position')

fig.show()


# #### Step 3: Findings - Data Analysis
# 
# The complicated join between the YouTube dataset and the Billboard dataset resulted in only a few matching songs (57 in total) between 2017 and 2018, but it is clear that there is some relationship between a trending YouTube video with millions (or even billions) of views and the song's performance on the Billboard charts.
# 
# Take 'This is America', by Childish Gambino, for example. The song has the most views on YouTube in this dataset, 32.65 million to be exact at the time of dataset publication (as of Nov. 2019: 621 million views), and it peaked at no. 1 on Billboard. 'Havana', by Camila Cabello, on the other hand, only has 5.48 million views at time of dataset publication, but also peaked at no. 1 (the video as of Nov. 2019, has 1.6 billion views).
# 
# Next, we'll extract the top 10 songs. It's interesting to note that all 10 songs are from 2018.

# In[34]:


# extract the top 10 songs by peak position first, then YouTube views
bboard_song_view.sort_values(['Peak Position', 'Views'], ascending=[True, False]).head(10)
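
# The relationship between peak position and views is described qualitatively above. One hedged way to put a number on it is a Spearman rank correlation, computed here as the Pearson correlation of the ranks. The values below are made up for illustration and are not results from this analysis; a negative value means that more views go with a better (lower) peak position.

# In[ ]:


import pandas as pd

demo_peaks = pd.DataFrame({
    'Peak Position': [1, 1, 5, 12, 30, 48],
    'Views': [30.0, 6.0, 20.0, 3.0, 1.0, 0.5],
})
# Spearman correlation is the Pearson correlation of the ranked values
demo_rho = demo_peaks['Peak Position'].rank().corr(demo_peaks['Views'].rank())
print(round(demo_rho, 2))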


# ### Data Limitations
# 
# This study has generated some valuable insights about the many factors that shape a song's performance on the Billboard Top 100 Charts. However, there are still some limitations of this study, as well as confounding factors, that should be addressed. Please see below for the main assumptions of the study:
# 
# #### Song Representation
# 
# This study only looked at the Billboard Top 100 Charts for US songs. It may very well be the case that a non-US song also performed well on Billboard's non-US charts. In addition, due to the limitations of visualization, I kept only the 100 songs with the most weeks at no. 1 on the Billboard charts. Including the rest of the songs might uncover more patterns.
# 
# #### Assumption about Popularity
# 
# I created a rudimentary definition of popularity, based on the data columns I had. However, the count of weeks a song is at #1 divided by the total number of weeks the song spent on Billboard is not a comprehensive definition of popularity. There are many other factors that contribute to a song's popularity, and from a human-centered perspective, not everyone has the same definition of what makes a song popular. It is important to keep this in context as we try to understand the visualizations I built.
# 
# #### Billboard and Streaming Platforms
# 
# In 2013, Billboard began accounting for song performance on streaming platforms as an additional factor in their charting algorithm. While we do not know what their formula is for calculating the ranking, we do know that a song's corresponding YouTube music video is already contributing to that song's performance after 2013. This confounding factor, when taken into account, makes the results from the third visualization intuitive to understand.
# 
# ### Implication & Future Work
# 
# The results of this study highlight the tight relationships among a song's popularity, relevance, #1 streak, velocity, and corresponding YouTube music video virality. From the analysis, it is clear that a song tends to reach the Billboard top charts when it has a viral hit on YouTube with millions of views, but as we've seen in the limitations, this could be due to the confounding factor that YouTube is already accounted for in the Billboard chart formula.
# 
# Similar to my method of combining the Billboard dataset with another dataset, there is great potential in adding more contextual datasets to the 4 dimensions of data I extracted (popularity, relevance, #1 streak, and velocity) for further insights. One example would be to analyze the song lyrics in addition to their performance on the charts; another is to extract song genres and artist category. 

# ### Conclusion & Reflection
# 
# In both popularity & relevance as well as velocity & #1 streak, 2010s music only took 4 out of the top 10. This indicates that there are still many songs from the past (mostly 1980s-2000s) that have maintained their popularity, relevance, velocity, and #1 streak. While YouTube music video views indeed correlate with songs peaking near the top 10 on Billboard, this is already a confounding factor due to the fact that YouTube streaming data has been incorporated into Billboard Top 100 calculations since 2013.
# 
# From a human-centered data science perspective, it was important for me to construct visualizations that not only help me answer my initial research questions, but also provide the interactivity that allows viewers to explore & understand the data on their own.
# 
# To a music enthusiast like myself, the results are comforting; in every decade, there are songs that defy their generations and continue to be appreciated by us; whether we heard the song in person during a concert in the 1980s, or listened on Spotify during the commute home in 2019, we continue to appreciate the music. Furthermore, the results show the importance of appreciating music from the past, because the music we love now can _become_ the past in a few decades.
# 
# Lastly, please enjoy the Spotify playlists I compiled as part of my analysis!
# 
# * Q1 Spotify Playlist: https://tinyurl.com/kfrankc-512-pl1 
# * Q2 Spotify Playlist: https://tinyurl.com/kfrankc-512-pl2 
# * Q3 Spotify Playlist: https://tinyurl.com/kfrankc-512-pl3 

# ### References
# 
# [1] https://www.wikiwand.com/en/Billboard_charts
# 
# [2] https://www.grammy.com/grammys/news/what-music-goes-viral-tiktok
# 
# [3] https://www.billboard.com/charts/digital-song-sales
# 
# [4] https://www.billboard.com/chart-beat
# 
# [5] https://towardsdatascience.com/billboard-hot-100-analytics-using-data-to-understand-the-shift-in-popular-music-in-the-last-60-ac3919d39b49
