%load_ext autoreload
%autoreload 2
# Science and Data
import pandas as pd
import numpy as np
# Infrastructure
from pathlib import Path
import sys
import os
# Plotting Tools
import seaborn as sns
import matplotlib.pyplot as plt
Set up options
# Matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (20, 10)
plt.rcParams['axes.spines.left'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.bottom'] = False
plt.rcParams['xtick.bottom'] = False
plt.rcParams['xtick.labelbottom'] = True
plt.rcParams['ytick.labelleft'] = True
plt.rcParams.update({'font.size': 18})
Set up paths
PROJECT_ROOT = !git rev-parse --show-toplevel
PROJECT_ROOT = Path(PROJECT_ROOT[0])
In the examples shown in this article, I will be using a data set taken from the Kaggle website. It is designed for a machine learning classification task and contains information about medical appointments and a target variable which denotes whether or not the patient showed up to their appointment.
It can be downloaded here.
In the code below I have imported the data and the libraries that I will be using throughout the article.
data = pd.read_csv(str(PROJECT_ROOT / "notebooks" / "gist.pandas.value_counts" / "data" / "raw.csv"))
data.head()
The value_counts() function can be used in the following way to get a count of unique values for one column in the data set. The code below gives a count of each value in the Gender column.
data['Gender'].value_counts()
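One detail worth knowing: by default value_counts() silently drops missing values, and the dropna=False argument keeps them as their own category. A minimal sketch on a toy Series (the values here are hypothetical, not from the appointments data):

```python
import pandas as pd

# Toy stand-in for a column with missing entries (hypothetical values)
s = pd.Series(["F", "M", "F", None, "F"], name="Gender")

# By default value_counts() excludes NaN; dropna=False counts it too
counts = s.value_counts(dropna=False)
print(counts)
```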
The sort and ascending arguments control the ordering of the output. Note that sort=True (which is also the default) orders the counts by frequency, not by the values themselves. In the code below I have added sort=True so that the counts for the Age column are displayed in descending order of frequency.
data['Age'].value_counts(sort=True)
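To get the ascending order instead, value_counts() takes a separate ascending argument. A small sketch on made-up ages:

```python
import pandas as pd

# Hypothetical ages standing in for the Age column
ages = pd.Series([30, 30, 30, 45, 45, 60], name="Age")

# sort=True orders by count (descending by default);
# ascending=True flips the order so the rarest values come first
asc = ages.value_counts(sort=True, ascending=True)
print(asc)
```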
groupby()
The value_counts() function can be combined with other Pandas functions for richer analysis. One example is combining it with the groupby() function. In the example below I am counting values in the No-show column, grouped by Gender, to understand the number of no-shows in each group.
data['No-show'].groupby(data['Gender']).value_counts(sort=True)
normalize
In the above example, displaying the absolute values does not easily enable us to understand the differences between the two groups. A better solution is to show the relative frequencies of the unique values in each group. We can add the normalize argument to value_counts() to display the values in this way.
data['No-show'].groupby(data['Gender']).value_counts(normalize=True)
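Since the grouped result comes back as a MultiIndex Series, unstack() can pivot it into a small table where the two groups sit side by side, which makes the comparison easier to read. A sketch on a toy frame (the values are invented):

```python
import pandas as pd

# Toy frame mimicking the Gender and No-show columns (hypothetical data)
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "No-show": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

rates = df["No-show"].groupby(df["Gender"]).value_counts(normalize=True)
# unstack() pivots the inner index level into columns:
# one row per gender, one column per No-show outcome
table = rates.unstack()
print(table)
```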
For columns with a large number of unique values, the output of the value_counts() function is not always particularly useful. A good example of this is the Age column, which we displayed value counts for earlier in this post.
Fortunately value_counts() has a bins argument. This parameter allows us to specify the number of bins (the groups we want to split the data into) as an integer. In the example below I have added bins=9 to split the Age counts into 9 groups. We now have a count of values in each of these bins.
data['Age'].value_counts(bins=9)
Once again, showing absolute numbers is not particularly useful, so let's add the normalize=True argument as well. Now we have a useful piece of analysis.
data['Age'].value_counts(bins=9, normalize=True)
We can also pass a list to be used as the bin intervals. For this case, we define:
bins=[-np.inf, 10, 20, 30, 40, 50, 60, 70, 80, np.inf]
data["Age"].value_counts(bins=bins, sort=False)
Note that this produces the same output as using pd.cut:
data.groupby(pd.cut(data["Age"].values, bins=bins))["Age"].count()
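To sanity-check that equivalence, the two approaches can be run side by side on a toy Series (the ages and bin edges here are hypothetical; observed=False is passed to groupby so that empty bins are kept, matching value_counts across pandas versions):

```python
import numpy as np
import pandas as pd

# Hypothetical ages and the same style of bin edges as above
ages = pd.Series([5, 12, 25, 25, 37, 62, 81], name="Age")
bins = [-np.inf, 10, 20, 30, 40, 50, 60, 70, 80, np.inf]

# Counting via value_counts with explicit bin edges...
via_value_counts = ages.value_counts(bins=bins, sort=False)

# ...matches counting via pd.cut + groupby
# (observed=False keeps empty bins, as value_counts does)
via_cut = ages.groupby(pd.cut(ages, bins=bins), observed=False).count()

print(via_value_counts)
```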
nlargest()
There are other columns in our data set with a large number of unique values, where binning is still not going to provide a useful piece of analysis. A good example of this is the Neighbourhood column. If we simply run value_counts() against it, we get an output that is not particularly insightful.
data['Neighbourhood'].value_counts(sort=True)
A better way to display this might be to view the top 10 neighbourhoods. We can do this by combining value_counts() with another Pandas function called nlargest(), as shown below.
data['Neighbourhood'].value_counts(sort=True).nlargest(10)
We can also use nsmallest() to display the bottom 10 neighbourhoods, which might also prove useful.
data['Neighbourhood'].value_counts(sort=True).nsmallest(10)
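If a rank cut-off like 10 feels arbitrary, boolean indexing on the counts offers a threshold-based alternative to nlargest() and nsmallest(). A sketch on toy labels:

```python
import pandas as pd

# Hypothetical neighbourhood labels
s = pd.Series(["A", "A", "A", "B", "B", "C"], name="Neighbourhood")

counts = s.value_counts()
# Keep only values that occur at least twice: a threshold-based
# alternative to picking a fixed number of rows with nlargest()
frequent = counts[counts >= 2]
print(frequent)
```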
Another handy combination is the Pandas plotting functionality together with value_counts(). Being able to display the analyses we get from value_counts() as visualisations makes it far easier to spot trends and patterns.
We can display all of the above examples, and more, with most plot types available in the Pandas library. A full list of available options can be found here.
Let's look at a few examples.
We can use a bar plot to view the top 10 neighbourhoods.
data['Neighbourhood'].value_counts(sort=True).nlargest(10).plot.bar()
We can make a pie chart to better visualise the Gender column.
data['Gender'].value_counts().plot.pie()
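The pie chart can also annotate each wedge with its share via matplotlib's autopct option, which pandas forwards to the underlying pie call. A sketch on toy data (the values are invented, and the headless Agg backend is selected so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; lets the example run without a display
import pandas as pd

# Toy stand-in for the Gender column (hypothetical values)
s = pd.Series(["F", "F", "F", "M", "M"], name="Gender")

# autopct labels each wedge with its percentage share;
# ylabel="" suppresses the default axis label
ax = s.value_counts().plot.pie(autopct="%1.1f%%", ylabel="")
```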
The value_counts() function is often one of my first starting points for data analysis, as it enables me to very quickly plot trends and derive insights from individual columns in a data set. This article has given a quick overview of the types of analysis it can be used for, but the function has more uses beyond the scope of this post.