This notebook demonstrates how to load and use Anaconda package data. For more details, see the GitHub repository. Due to limitations on Binder, some of the analysis examples below may run slowly or require more memory than the Binder instance provides. Feel free to download this notebook and run it locally.
To start, we need to install the required packages by running:
conda install dask intake numpy pandas
conda install -c conda-forge hvplot
Then we can import the packages:
import dask.dataframe as dd
from datetime import datetime
import hvplot.pandas
import intake
import numpy as np
import pandas as pd
This enables the Dask progress bar on all operations:
from dask.diagnostics import ProgressBar
pbar = ProgressBar()
pbar.register()
There are multiple ways to load Anaconda package data. Below we show examples of loading one month of data for December 2018.
First, we can read Parquet files directly from an S3 URL. We recommend using dask.dataframe to read data files into a Dask DataFrame. Please visit the Dask website for more information.
df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',
storage_options={'anon': True})
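The call above reads a single daily file, while the goal stated earlier is to load a whole month. The sketch below builds one URL per day of a month. It assumes every daily file follows the same path pattern as the single URL shown above, which this notebook does not confirm; the helper names daily_parquet_url and month_parquet_urls are hypothetical.

```python
from datetime import date, timedelta

# Assumption: all daily files follow the pattern of the one URL shown above.
BASE_URL = "s3://anaconda-package-data/conda/hourly"


def daily_parquet_url(day: date) -> str:
    """Build the URL of the Parquet file for one day (hypothetical helper)."""
    return f"{BASE_URL}/{day:%Y}/{day:%m}/{day:%Y-%m-%d}.parquet"


def month_parquet_urls(year: int, month: int) -> list:
    """Return one URL per calendar day of the given month (hypothetical helper)."""
    day = date(year, month, 1)
    urls = []
    while day.month == month:
        urls.append(daily_parquet_url(day))
        day += timedelta(days=1)
    return urls


urls = month_parquet_urls(2018, 12)

# dd.read_parquet accepts a list of paths, so the month could then be read with:
# df = dd.read_parquet(urls, storage_options={'anon': True})
```

If a day's file is missing from the bucket, the read would fail, so it may be worth checking the list against what is actually available first.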
Second, we can load data from an Intake catalog file. One advantage of using an Intake catalog is that we can define the cache specifications in the catalog so that Intake caches remote data source files locally. This saves bandwidth and improves the performance of future analyses. If you would like to remove the Intake cache, simply run intake cache clear. For more information on Intake catalogs, see the Intake documentation.
Before loading the data file, we need to load the Intake catalog file. We can use a URL to the catalog file directly:
cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
Then we can load the data for a user-specified year and month.
df = cat.anaconda_package_data_by_month(year=2018, month=12).to_dask()
In addition, if you would like to load one year of data, you can simply define the dataframe as
df = cat.anaconda_package_data_by_year(year=2018).to_dask()
Similarly, if you would like to load one day of data, you can define the dataframe as
df = cat.anaconda_package_data_by_day(year=2018, month=12, day=1).to_dask()
Note that .to_dask() reads data into a Dask DataFrame. If you would like to read data directly into a pandas DataFrame, please use:
cat.anaconda_package_data_by_month(year=2018, month=12).read()
Third, we can install the data package with conda (this is already done in the Binder environment):
conda install -c intake anaconda-package-data
This data package installs the Intake catalog (but not the data) directly into the user's conda environment. The global Intake catalog intake.cat will then have entries from this data package. If we run list(intake.cat), we can see that 'anaconda_package_data_by_month', 'anaconda_package_data_by_year', and 'anaconda_package_data_by_day' show up in the list. Then, similar to the second method, we just need to specify the year and month and load the data.
df = intake.cat.anaconda_package_data_by_month(year=2018, month=12).to_dask()
Again, if you would like to read data directly into a pandas DataFrame, please use intake.cat.anaconda_package_data_by_month(year=2018, month=12).read().
After loading the data, we can do a lot of data wrangling and visualization to answer interesting questions. Below we show a few examples of how people can use the data.
In this first example, we look at the download statistics of pandas. First, let's see how many times pandas was downloaded this month from the Anaconda distribution:
df.loc[(df.data_source=='anaconda') & (df.pkg_name=='pandas')]['counts'].sum().compute()
Note that .compute() is needed when df is a Dask DataFrame. Remove .compute() if you loaded the data into a pandas DataFrame. Please visit the Dask website for more information.
Next, let's take a look at the daily trends of pandas usage.
df['day'] = df.time.dt.day
# Sum only the counts column so that non-numeric columns are not aggregated
pkg_day_agg = (
    df.loc[(df.data_source == 'anaconda') & (df.pkg_name == 'pandas')]
    .groupby('day')['counts']
    .sum()
    .reset_index()
    .compute()
)
pkg_day_agg.head()
pkg_day_agg.hvplot('day','counts')
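To make the shape of the daily aggregation concrete, here is a minimal sketch that runs the same filter-and-group steps on a tiny pandas DataFrame. The column names match those used above, but the rows are made up for illustration and are not real download counts; because this is plain pandas, no .compute() is needed.

```python
import pandas as pd

# Hypothetical sample rows: column names follow the notebook, values are invented.
sample = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-12-01 00:00", "2018-12-01 01:00",
        "2018-12-02 00:00", "2018-12-02 00:00",
    ]),
    "data_source": ["anaconda", "anaconda", "anaconda", "conda-forge"],
    "pkg_name": ["pandas", "pandas", "pandas", "pandas"],
    "counts": [10, 5, 7, 100],
})

# Extract the day of the month from the hourly timestamp
sample["day"] = sample["time"].dt.day

# Keep only pandas downloads from the 'anaconda' source, then sum counts per day
mask = (sample["data_source"] == "anaconda") & (sample["pkg_name"] == "pandas")
daily = sample.loc[mask].groupby("day")["counts"].sum().reset_index()
```

The result has one row per day with the columns day and counts, which is the layout that the hvplot call above expects.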
In 2020, Python 2 will no longer be maintained, and many key projects such as pandas will drop Python 2 support. Many developers and stakeholders are interested in seeing how Python 2 and Python 3 usage changes over time. We can plot this with our data.
First, we need to recode the package's required Python version. Here we create a variable python2vs3 based on the variable pkg_python:
df.groupby(['pkg_python'])['counts'].sum().compute()
df['python2vs3'] = df['pkg_python'].map(
    lambda x: 'Python 2' if x.startswith('2') else 'Python 3' if x.startswith('3') else np.nan
)
df.groupby(['python2vs3'])['counts'].sum().compute()
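The lambda above calls x.startswith on every value, which assumes each pkg_python entry is a string. As a hedged alternative, the same recoding can be written as a named function that also tolerates missing values. The function name classify_python is hypothetical, and returning None for unclassified values (rather than np.nan) is a choice made for this sketch.

```python
def classify_python(version):
    """Map a pkg_python value to 'Python 2', 'Python 3', or None (hypothetical helper)."""
    # Non-string values (None, NaN) cannot be classified
    if not isinstance(version, str):
        return None
    if version.startswith("2"):
        return "Python 2"
    if version.startswith("3"):
        return "Python 3"
    # Empty strings and anything else fall through as unclassified
    return None


labels = [classify_python(v) for v in ["2.7", "3.6", "", None]]
```

A function like this could be passed to map in place of the lambda, for example df['pkg_python'].map(classify_python).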
Second, let's get the daily counts for Python 2 and Python 3.
# Sum only the counts column so that non-numeric columns are not aggregated
python_day_agg = (
    df.groupby(['day', 'python2vs3'])['counts']
    .sum()
    .compute()
    .reset_index()
)
python_day_agg.head()
Finally, we can plot the Python 2 and Python 3 usage trend.
python_day_agg.hvplot('day','counts',by='python2vs3')
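Raw counts can make it hard to see the relative shift between the two versions. One possible follow-up, sketched here on made-up numbers, is to pivot the daily table so that each Python version becomes a column and then compute the Python 3 share per day. The column layout mirrors python_day_agg, but the values below are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for python_day_agg: same columns, invented values.
python_day_sample = pd.DataFrame({
    "day": [1, 1, 2, 2],
    "python2vs3": ["Python 2", "Python 3", "Python 2", "Python 3"],
    "counts": [25, 75, 40, 60],
})

# One row per day, one column per Python major version
wide = python_day_sample.pivot(index="day", columns="python2vs3", values="counts")

# Fraction of each day's classified downloads that target Python 3
python3_share = wide["Python 3"] / wide.sum(axis=1)
```

With the real data, the same two lines applied to python_day_agg would give a daily share series that can be plotted like the other examples.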
We can also compare package platforms. Here we calculate the total number of downloads for each platform and visualize the results in a bar chart. (Note that "noarch" packages have no platform value because they work on all platforms.)
platform_month = df.groupby(['pkg_platform'])['counts'].sum().reset_index().compute()
platform_month
platform_month.hvplot.bar('pkg_platform', 'counts', rot=90)