Music Through The Ages: Billboard Top 100 Charts Analysis

Frank Chen

Data 512 Fall 2019

Introduction

I analyzed Billboard Hot 100 weekly charts from 1958 to 2019 to show the relationships between popularity and relevance, velocity of the climb to no. 1 and length of the no. 1 streak, and Billboard peak position and the corresponding YouTube music video views. I show that songs from the 2010s are not outperforming songs from the past, and confirm that higher YouTube view counts do correlate with higher Billboard peak positions, while noting that this correlation is confounded, since streaming counts already feed into the chart calculation.

As someone who is constantly looking for new music to listen to, I definitely discovered some new songs from this research. I believe this type of analysis would be interesting to historians, ethnographers, anthropologists, music enthusiasts, and even music managers, since it provides a data-driven approach to understanding how music popularity evolves as society evolves, and vice versa.

Throughout history, music has consistently served as one of the cultural indicators of society, alongside literature, art, and film. The Billboard charts have been ranking top-performing songs since the 1930s [1] and serve as one indicator of a song's success. In recent years, with the introduction of social media platforms such as Facebook, Instagram, YouTube, and TikTok, there now exist strategies for making a song go viral [2]. However, streaming platforms are only one part of the complex formula used to calculate the Billboard Hot 100. My main motivation for this project is to answer the question: Is today's (2010s) music exceeding the Billboard performance of music from the past?

There has been much related work analyzing the Billboard charts and the songs that topped them. The Billboard website itself hosts several blogs on this topic, including analyses of digital song sales and its Chart Beat coverage of top-performing songs [3][4]. In addition, others have combined the Billboard charts dataset with other datasets, such as Spotify's, to analyze how society's song preferences change over time, such as this blog post from Towards Data Science [5].

Research Questions

We will divide our analysis into 3 sections, aiming to answer these 3 questions:

I. How are song popularity and relevance related in the Billboard charts?

II. How are song velocity and no. 1 streak related in the Billboard charts?

III. How do other streaming platforms such as YouTube influence the Billboard charts?

Definitions

  • Popularity: how long the song stayed at no. 1 given all the weeks the song stayed on the chart. We represent this as the percentage of no. 1 counts over the total number of weeks the song spent on chart.
  • Relevance: how long the song stayed in the Billboard charts. We represent this as the total number of weeks the song spent on the chart.
  • Velocity: how many weeks it took the song to reach no. 1 on Billboard.
  • #1 Streak: how many weeks the song stayed at no. 1 on Billboard.
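The four definitions above can be sketched with a toy weekly chart for a single song (the data below is hypothetical and purely illustrative; the real analysis computes these metrics with pivot tables and joins in the sections that follow):

```python
import pandas as pd

# Toy weekly chart history for one hypothetical song
chart = pd.DataFrame({
    'Song': ['A'] * 6,
    'WeekID': pd.date_range('2019-01-05', periods=6, freq='W'),
    'Week Position': [40, 12, 1, 1, 3, 8],
})

weeks_on_chart = len(chart)                            # Relevance: total weeks on chart
no1_weeks = (chart['Week Position'] == 1).sum()        # #1 Streak: weeks spent at no. 1
popularity = round(no1_weeks / weeks_on_chart, 2)      # Popularity: share of weeks at no. 1
velocity = (chart['Week Position'] == 1).idxmax() + 1  # Velocity: weeks to first reach no. 1

print(weeks_on_chart, no1_weeks, popularity, velocity)  # 6 2 0.33 3
```

Here the song charted for 6 weeks, spent 2 of them at no. 1 (popularity 0.33), and took 3 weeks to climb to the top.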

The Data

In this analysis, I will be using two datasets (both in CSV format, found in the raw_data/ folder). Both datasets are labeled as public domain data, released under the CC0 license.

Billboard Weekly Charts Data: Weekly Hot 100 singles chart from August 1958 to June 2019 (317795 rows, 10 columns)

url, WeekID, Week Position, Song, Performer, SongID, Instance, Previous Week Position, Peak Position, Weeks on Chart

YouTube Trending Video Data: Daily trending YouTube videos from Nov 2017 - May 2018 (40949 rows, 16 columns)

Since I plan on joining this data with the Billboards data to connect music to views, the only columns I will be using in the YouTube dataset are:

title, channel_title, views

Methodology

Methods used in this analysis: pivot tables, joins, duplicate removal, lambdas, and merges on substring matches. Specific methods are highlighted with links to Stack Overflow guidance in the code comments. I used pandas for the data transformations and plotly for the visualizations.
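As a rough illustration of the substring-match merge, the sketch below attaches YouTube view counts to songs by checking whether a video title contains the song name. The mini datasets and the `match_views` helper are hypothetical, not the exact code used later in the analysis:

```python
import pandas as pd

# Hypothetical stand-ins for the Billboard and YouTube tables
songs = pd.DataFrame({'Song': ['Havana', 'Perfect'], 'Peak Position': [1, 1]})
videos = pd.DataFrame({
    'title': ['Camila Cabello - Havana (Official Video)', 'Ed Sheeran - Perfect [Official]'],
    'views': [1000, 2000],
})

def match_views(song):
    # Keep YouTube rows whose title contains the song name (case-insensitive,
    # literal match); returns NaN when no trending video matches
    hits = videos[videos['title'].str.contains(song, case=False, regex=False)]
    return hits['views'].max()

songs['views'] = songs['Song'].apply(match_views)
print(songs)
```

Substring matching like this is lossy (short or generic song titles can match unrelated videos), which is worth keeping in mind when interpreting the YouTube results.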

To ensure my results were generated from the same input dataset without overlapping transformations, I first perform some initial data cleaning, then use the cleaned data for the rest of the analysis. The analysis is split into 3 sections answering the 3 research questions, and each section is split into 3 parts: data preparation, data visualization, and data analysis.

Initial Data Cleaning

We first load the raw data, then do the initial cleaning: parse the WeekID field into month, day, and year columns for easier analysis later.

In [1]:
# import all libraries
import pandas as pd
import numpy as np

# standard plotly imports
import plotly.graph_objects as go
In [2]:
# load raw data
bboard_data = pd.read_csv('../raw_data/hot_100.csv')
In [3]:
bboard_data.head()
Out[3]:
url WeekID Week Position Song Performer SongID Instance Previous Week Position Peak Position Weeks on Chart
0 http://www.billboard.com/charts/hot-100/1990-0... 2/10/1990 75 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 NaN 75 1
1 http://www.billboard.com/charts/hot-100/1990-0... 2/17/1990 53 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 75.0 53 2
2 http://www.billboard.com/charts/hot-100/1990-0... 2/24/1990 43 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 53.0 43 3
3 http://www.billboard.com/charts/hot-100/1990-0... 3/3/1990 37 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 43.0 37 4
4 http://www.billboard.com/charts/hot-100/1990-0... 3/10/1990 27 Don't Wanna Fall In Love Jane Child Don't Wanna Fall In LoveJane Child 1 37.0 27 5
In [4]:
# parse WeekID as a datetime so year, month, and day can be extracted below
bboard_data['WeekID'] = pd.to_datetime(bboard_data['WeekID'], format='%m/%d/%Y')
In [5]:
bboard_data['year'] = bboard_data['WeekID'].dt.year
bboard_data['month'] = bboard_data['WeekID'].dt.month
bboard_data['day'] = bboard_data['WeekID'].dt.day
bboard_data.to_csv('tmp_data/cleaned_bboard_data.csv')

This dataset provides potential for rich analysis into both the popularity and relevance of no. 1 songs on the Billboard charts.

I will prepare a marker (bubble) chart representing 3 dimensions of data: the x-axis will show the songs, the y-axis will show each song's popularity percentage, and each data point will be a marker whose area measures the song's relevance.

Step 1: Data Preparation

In [6]:
# read cleaned bboard_data
bboard_data = pd.read_csv('tmp_data/cleaned_bboard_data.csv')
In [7]:
# use pivot table to extract counts of week positions for each song
# stackoverflow link: https://stackoverflow.com/questions/54527134/counting-column-values-based-on-values-in-other-columns-for-pandas-dataframes
bboard_data['count'] = 1
result = bboard_data.pivot_table(
    index=['Song'], columns='Week Position', values='count',
    fill_value=0, aggfunc=np.sum
)
# save result to csv for future use
result.to_csv('tmp_data/bboard_song_position_count.csv')
In [8]:
# read song_position_count.csv
song_position_count = pd.read_csv("tmp_data/bboard_song_position_count.csv")
In [9]:
# keep only the song and no. 1 column
song_position_count = song_position_count[['Song','1']]
song_position_count.to_csv('tmp_data/bboard_song_no1_count.csv')
In [10]:
# join song_no1_count with bboard_data
song_no1_count = pd.read_csv('tmp_data/bboard_song_no1_count.csv')
bboard_data = pd.merge(bboard_data, song_no1_count, how='left', left_on='Song', right_on='Song')
In [11]:
# remove irrelevant columns
# stackoverflow link: https://stackoverflow.com/questions/14940743/selecting-excluding-sets-of-columns-in-pandas
bboard_data = bboard_data.drop(['Unnamed: 0_x', 'Unnamed: 0_y'], axis=1)
In [12]:
# clean rows to keep only entry for total weeks on chart
# stackoverflow link: https://stackoverflow.com/questions/50283775/python-pandas-keep-row-with-highest-column-value
bboard_data_tmp = bboard_data.sort_values('Weeks on Chart').drop_duplicates(["Song"],keep='last')
In [13]:
# clean columns to keep only data needed for visualization
bboard_data_tmp = bboard_data_tmp[['Song', 'Performer', 'Weeks on Chart', '1', 'year', 'month']]
In [14]:
# rename column values
# stackoverflow link: https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
bboard_data_tmp.columns = ['Song', 'Performer', 'Relevance (Total Weeks on Chart)', 'Count of no. 1', 'Year', 'Month']
# calculate popularity
# stackoverflow link: https://stackoverflow.com/questions/26133538/round-a-single-column-in-pandas
bboard_data_tmp['Popularity'] = bboard_data_tmp['Count of no. 1']/bboard_data_tmp['Relevance (Total Weeks on Chart)']
bboard_data_tmp['Popularity'] = bboard_data_tmp['Popularity'].round(2)
In [15]:
# add month and year to performer column for visualizing later
bboard_data_tmp['Performer'] = bboard_data_tmp.Performer.map(str) + " (" + bboard_data_tmp.Month.map(str) +"-" + bboard_data_tmp.Year.map(str) + ")"
In [16]:
# save to csv for future analysis
bboard_data_tmp.to_csv('tmp_data/bboard_song_pop_relevance.csv')

Step 2: Data Visualization

In [17]:
# read in the data and drop unneeded column
bboard_song_pop_relevance = pd.read_csv('tmp_data/bboard_song_pop_relevance.csv')
bboard_song_pop_relevance = bboard_song_pop_relevance.drop(['Unnamed: 0'], axis=1)
In [18]:
# only use the top 100 songs that spent the most weeks at no. 1
bboard_song_pop_relevance_top100 = bboard_song_pop_relevance.nlargest(100, 'Count of no. 1')
bboard_song_pop_relevance_top100.to_csv('tmp_data/bboard_song_pop_relevance_top100.csv')
bboard_song_pop_relevance_top100 = bboard_song_pop_relevance_top100.sort_values(by=['Year'])
In [19]:
# create bubble chart
# stackoverflow link: https://plot.ly/python/bubble-charts/
# plotly colorscale: https://plot.ly/python/v3/matplotlib-colorscales/
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=bboard_song_pop_relevance_top100['Song'],
    y=bboard_song_pop_relevance_top100['Popularity'],
    mode='markers',
    marker=dict(
        line=dict(width=2, color='LightGrey'),
        size=16,
        cmax=70,
        cmin=10,
        color=bboard_song_pop_relevance_top100['Relevance (Total Weeks on Chart)'],
        colorbar=dict(
            title="Relevance"
        ),
        colorscale="magma",
        sizeref=2.*15/(5.**2)
    ),
    marker_size=bboard_song_pop_relevance_top100['Relevance (Total Weeks on Chart)'],
    text=bboard_song_pop_relevance_top100['Performer'],
    hovertemplate = "<b>%{x}</b><br><i>%{text}</i><br><br>Popularity: %{y}<br>Relevance: %{marker.size}",
))

fig.update_layout(
    title={
        'text': "Top 100 Billboard #1 Songs in terms of Relative Popularity & Relevance",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
        width=1000,
        height=800,
        margin=go.layout.Margin(
            l=50,
            r=50,
            b=300,
            t=100,
            pad=4
        ))
fig.update_xaxes(title_text='Song')
fig.update_xaxes(tickangle=45)
fig.update_yaxes(title_text='Popularity')

fig.show()