Music Through The Ages: Billboard Top 100 Charts Analysis

Frank Chen

Data 512 Fall 2019

Introduction

I analyzed Billboard Hot 100 weekly charts from 1958 to 2019 to show the relationships between popularity and relevance, velocity of the climb to no. 1 and length of the no. 1 streak, and Billboard peak position and the corresponding YouTube music video views. I show that songs from the 2010s are not outperforming songs from the past, and confirm that higher YouTube view counts do correlate with higher Billboard peak positions, while noting that this correlation is confounded, since streaming counts already feed into the chart calculation.

As someone who is constantly looking for new music to listen to, I definitely discovered some new songs from this research. I believe this type of analysis would be interesting to historians, ethnographers, anthropologists, music enthusiasts, and even music managers, since it provides a data-driven approach to understanding how music popularity evolves as society evolves, and vice versa.

Throughout history, music has consistently served as one of the cultural indicators of society, alongside literature, art, and film. The Billboard charts have been ranking top-performing songs since the 1930s [1] and serve as one indicator of a song's success. In recent years, with the introduction of social media platforms such as Facebook, Instagram, YouTube, and TikTok, there now exist strategies for making a song go viral [2]. However, streaming platforms are only one part of the complex formula used to calculate the Billboard Hot 100. My main motivation for this project is to answer the question: Is today's (2010s) music exceeding the Billboard performance of music from the past?

There has been much related work analyzing the Billboard charts and the songs that topped them. The Billboard website itself hosts several blogs on this topic, including analyses of digital song sales and its Chart Beat coverage of top-performing songs [3][4]. In addition, others have combined the Billboard charts dataset with other datasets, such as Spotify's, to analyze how society's song preferences change over time, such as this blog post from Towards Data Science [5].

Research Questions

We will divide our analysis into 3 sections, aiming to answer these 3 questions:

I. How are song popularity and relevance related in the Billboard charts?

II. How are song velocity and no. 1 streak related in the Billboard charts?

III. How do other streaming platforms such as YouTube influence the Billboard charts?

Definitions

  • Popularity: how long the song stayed at no. 1 given all the weeks the song stayed on the chart. We represent this as the percentage of no. 1 counts over the total number of weeks the song spent on chart.
  • Relevance: how long the song stayed in the Billboard charts. We represent this as the total number of weeks the song spent on the chart.
  • Velocity: how many weeks it took the song to reach no. 1 on Billboard.
  • #1 Streak: how many weeks the song stayed at no. 1 on Billboard.
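The four definitions above can be sketched with a toy weekly chart for a single song (the data below is hypothetical and purely illustrative; the real analysis computes these metrics with pivot tables and joins in the sections that follow):

```python
import pandas as pd

# Toy weekly chart history for one hypothetical song
chart = pd.DataFrame({
    'Song': ['A'] * 6,
    'WeekID': pd.date_range('2019-01-05', periods=6, freq='W'),
    'Week Position': [40, 12, 1, 1, 3, 8],
})

weeks_on_chart = len(chart)                            # Relevance: total weeks on chart
no1_weeks = (chart['Week Position'] == 1).sum()        # #1 Streak: weeks spent at no. 1
popularity = round(no1_weeks / weeks_on_chart, 2)      # Popularity: share of weeks at no. 1
velocity = (chart['Week Position'] == 1).idxmax() + 1  # Velocity: weeks to first reach no. 1

print(weeks_on_chart, no1_weeks, popularity, velocity)  # 6 2 0.33 3
```

Here the song charted for 6 weeks, spent 2 of them at no. 1 (popularity 0.33), and took 3 weeks to climb to the top.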

The Data

In this analysis, I will be using two datasets (both in CSV format, found in the raw_data/ folder). Both datasets are labeled as public domain data, released under the CC0 license.

Billboard Weekly Charts Data: Weekly Hot 100 singles chart from August 1958 to June 2019 (317795 rows, 10 columns)

url, WeekID, Week Position, Song, Performer, SongID, Instance, Previous Week Position, Peak Position, Weeks on Chart

YouTube Trending Video Data: Daily trending YouTube videos from Nov 2017 - May 2018 (40949 rows, 16 columns)

Since I plan on joining this data with the Billboards data to connect music to views, the only columns I will be using in the YouTube dataset are:

title, channel_title, views

Methodology

Methods used in this analysis: pivot tables, joins, duplicate removal, lambdas, and merges on substring matches. Specific methods are highlighted with links to Stack Overflow guidance in the code comments. I used pandas for the data transformations and plotly for the visualizations.
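As a rough illustration of the substring-match merge, the sketch below attaches YouTube view counts to songs by checking whether a video title contains the song name. The mini datasets and the `match_views` helper are hypothetical, not the exact code used later in the analysis:

```python
import pandas as pd

# Hypothetical stand-ins for the Billboard and YouTube tables
songs = pd.DataFrame({'Song': ['Havana', 'Perfect'], 'Peak Position': [1, 1]})
videos = pd.DataFrame({
    'title': ['Camila Cabello - Havana (Official Video)', 'Ed Sheeran - Perfect [Official]'],
    'views': [1000, 2000],
})

def match_views(song):
    # Keep YouTube rows whose title contains the song name (case-insensitive,
    # literal match); returns NaN when no trending video matches
    hits = videos[videos['title'].str.contains(song, case=False, regex=False)]
    return hits['views'].max()

songs['views'] = songs['Song'].apply(match_views)
print(songs)
```

Substring matching like this is lossy (short or generic song titles can match unrelated videos), which is worth keeping in mind when interpreting the YouTube results.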

To ensure my results were generated from the same input dataset without overlapping transformations, I first perform some initial data cleaning, then use the cleaned data for the rest of the analysis. The analysis is split into 3 sections answering the 3 research questions, and each section is split into 3 parts: data preparation, data visualization, and data analysis.

Initial Data Cleaning

We first load the raw data, then do the initial cleaning: parse the WeekID field into month, day, and year columns for easier analysis later.

In [1]:
# import all libraries
import pandas as pd
import numpy as np

# standard plotly imports
import plotly.graph_objects as go
In [2]:
# load raw data
bboard_data = pd.read_csv('../raw_data/hot_100.csv')
In [3]:
bboard_data.head()
Out[3]:
url WeekID Week Position Song Performer SongID Instance Previous Week Position Peak Position Weeks on Chart
0 http://www.billboard.com/charts/hot-100/1990-0... 2/10/1990 75 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 NaN 75 1
1 http://www.billboard.com/charts/hot-100/1990-0... 2/17/1990 53 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 75.0 53 2
2 http://www.billboard.com/charts/hot-100/1990-0... 2/24/1990 43 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 53.0 43 3
3 http://www.billboard.com/charts/hot-100/1990-0... 3/3/1990 37 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 43.0 37 4
4 http://www.billboard.com/charts/hot-100/1990-0... 3/10/1990 27 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 37.0 27 5
In [4]:
# parse WeekID as a datetime so year, month, and day can be extracted below
bboard_data['WeekID'] = pd.to_datetime(bboard_data['WeekID'], format='%m/%d/%Y')
In [5]:
bboard_data['year'] = bboard_data['WeekID'].dt.year
bboard_data['month'] = bboard_data['WeekID'].dt.month
bboard_data['day'] = bboard_data['WeekID'].dt.day
bboard_data.to_csv('tmp_data/cleaned_bboard_data.csv')

This dataset provides potential for rich analysis into both the popularity and relevance of no. 1 songs on the Billboard charts.

I will prepare a marker (bubble) chart representing 3 dimensions of data: the x-axis will show the songs, the y-axis will show each song's popularity percentage, and each data point will be a marker whose area measures the song's relevance.

Step 1: Data Preparation

In [6]:
# read cleaned bboard_data
bboard_data = pd.read_csv('tmp_data/cleaned_bboard_data.csv')
In [7]:
# use pivot table to extract counts of week positions for each song
# stackoverflow link: https://stackoverflow.com/questions/54527134/counting-column-values-based-on-values-in-other-columns-for-pandas-dataframes
bboard_data['count'] = 1
result = bboard_data.pivot_table(
    index=['Song'], columns='Week Position', values='count',
    fill_value=0, aggfunc=np.sum
)
# save result to csv for future use
result.to_csv('tmp_data/bboard_song_position_count.csv')
In [8]:
# read song_position_count.csv
song_position_count = pd.read_csv("tmp_data/bboard_song_position_count.csv")
In [9]:
# keep only the song and no. 1 column
song_position_count = song_position_count[['Song','1']]
song_position_count.to_csv('tmp_data/bboard_song_no1_count.csv')
In [10]:
# join song_no1_count with bboard_data
song_no1_count = pd.read_csv('tmp_data/bboard_song_no1_count.csv')
bboard_data = pd.merge(bboard_data, song_no1_count, how='left', left_on='Song', right_on='Song')
In [11]:
# remove irrelevant columns
# stackoverflow link: https://stackoverflow.com/questions/14940743/selecting-excluding-sets-of-columns-in-pandas
bboard_data = bboard_data.drop(['Unnamed: 0_x', 'Unnamed: 0_y'], axis=1)
In [12]:
# clean rows to keep only entry for total weeks on chart
# stackoverflow link: https://stackoverflow.com/questions/50283775/python-pandas-keep-row-with-highest-column-value
bboard_data_tmp = bboard_data.sort_values('Weeks on Chart').drop_duplicates(["Song"],keep='last')
In [13]:
# clean columns to keep only data needed for visualization
bboard_data_tmp = bboard_data_tmp[['Song', 'Performer', 'Weeks on Chart', '1', 'year', 'month']]
In [14]:
# rename column values
# stackoverflow link: https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
bboard_data_tmp.columns = ['Song', 'Performer', 'Relevance (Total Weeks on Chart)', 'Count of no. 1', 'Year', 'Month']
# calculate popularity
# stackoverflow link: https://stackoverflow.com/questions/26133538/round-a-single-column-in-pandas
bboard_data_tmp['Popularity'] = bboard_data_tmp['Count of no. 1']/bboard_data_tmp['Relevance (Total Weeks on Chart)']
bboard_data_tmp['Popularity'] = bboard_data_tmp['Popularity'].round(2)
In [15]:
# add month and year to performer column for visualizing later
bboard_data_tmp['Performer'] = bboard_data_tmp.Performer.map(str) + " (" + bboard_data_tmp.Month.map(str) +"-" + bboard_data_tmp.Year.map(str) + ")"
In [16]:
# save to csv for future analysis
bboard_data_tmp.to_csv('tmp_data/bboard_song_pop_relevance.csv')

Step 2: Data Visualization

In [17]:
# read in the data and drop unneeded column
bboard_song_pop_relevance = pd.read_csv('tmp_data/bboard_song_pop_relevance.csv')
bboard_song_pop_relevance = bboard_song_pop_relevance.drop(['Unnamed: 0'], axis=1)
In [18]:
# only use the top 100 songs that spent the most weeks at no. 1
bboard_song_pop_relevance_top100 = bboard_song_pop_relevance.nlargest(100, 'Count of no. 1')
bboard_song_pop_relevance_top100.to_csv('tmp_data/bboard_song_pop_relevance_top100.csv')
bboard_song_pop_relevance_top100 = bboard_song_pop_relevance_top100.sort_values(by=['Year'])
In [19]:
# create bubble chart
# stackoverflow link: https://plot.ly/python/bubble-charts/
# plotly colorscale: https://plot.ly/python/v3/matplotlib-colorscales/
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=bboard_song_pop_relevance_top100['Song'],
    y=bboard_song_pop_relevance_top100['Popularity'],
    mode='markers',
    marker=dict(
        line=dict(width=2, color='LightGrey'),
        size=16,
        cmax=70,
        cmin=10,
        color=bboard_song_pop_relevance_top100['Relevance (Total Weeks on Chart)'],
        colorbar=dict(
            title="Relevance"
        ),
        colorscale="magma",
        sizeref=2.*15/(5.**2)
    ),
    marker_size=bboard_song_pop_relevance_top100['Relevance (Total Weeks on Chart)'],
    text=bboard_song_pop_relevance_top100['Performer'],
    hovertemplate = "<b>%{x}</b><br><i>%{text}</i><br><br>Popularity: %{y}<br>Relevance: %{marker.size}",
))

fig.update_layout(
    title={
        'text': "Top 100 Billboard #1 Songs in terms of Relative Popularity & Relevance",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
        width=1000,
        height=800,
        margin=go.layout.Margin(
            l=50,
            r=50,
            b=300,
            t=100,
            pad=4
        ))
fig.update_xaxes(title_text='Song')
fig.update_xaxes(tickangle=45)
fig.update_yaxes(title_text='Popularity')

fig.show()