Material for a University of Illinois course offered by the Physics Department. This content is maintained on GitHub and is distributed under a BSD3 license.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
During the initial setup of your environment, you installed the data for this course with the mls package, which also provides a function to locate it:
from mls import locate_data
locate_data('pong_data.hf5')
'/root/miniconda/envs/DAMLA/lib/python3.6/site-packages/mls/data/pong_data.hf5'
Data files are stored in the industry-standard binary HDF5 and plain-text CSV formats, with extensions .hf5 and .csv, respectively. HDF5 is more efficient for larger files but requires specialized software to read. CSV files are just plain text:
with open(locate_data('line_data.csv')) as f:
    # Print the first 5 lines of the file.
    for lineno in range(5):
        print(f.readline(), end='')
x,y,dy
0.3929383711957233,0.08540861657452603,0.3831920560881885
-0.42772133009924107,-0.5198803411067978,0.38522044793317467
-0.5462970928715938,-0.8124804852644906,
0.10262953816578246,0.10527828529558633,0.38556680974439583
The first line specifies the names of each column ("feature") in the data file. Subsequent lines are the rows ("samples") of the data file, with values for each column separated by commas. Note that values might be missing (for example, at the end of the third row).
We will use the Pandas package to read data files into DataFrame objects in memory. This is only a quick introduction; for a deeper dive, start with Data Manipulation with Pandas in the Python Data Science Handbook.
pong_data = pd.read_hdf(locate_data('pong_data.hf5'))
line_data = pd.read_csv(locate_data('line_data.csv'))
You can think of a DataFrame as an enhanced 2D numpy array, with most of the same capabilities:
line_data.shape
(2000, 3)
Individual columns also behave like enhanced 1D numpy arrays:
line_data['y'].shape
(2000,)
line_data['x'].shape
(2000,)
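As a quick illustration of this numpy-like behavior, here is a minimal sketch using a small toy frame with made-up values (standing in for line_data, which is not reloaded here): columns accept numpy ufuncs elementwise, and the whole frame exposes its underlying 2D array.

```python
import numpy as np
import pandas as pd

# A small toy frame standing in for line_data (hypothetical values).
df = pd.DataFrame({'x': [0.0, 3.0, -4.0], 'y': [1.0, 4.0, 3.0]})

# Columns support numpy ufuncs elementwise, just like 1D arrays.
r = np.hypot(df['x'], df['y'])   # sqrt(x**2 + y**2) per row
print(r.tolist())                # [1.0, 5.0, 5.0]

# The whole frame exposes its underlying 2D array via .values.
print(df.values.shape)           # (3, 2)
```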
For a first look at some unknown data, start with some basic summary statistics:
line_data.describe()
statistic | x | y | dy |
---|---|---|---|
count | 2000.000000 | 2000.000000 | 1850.000000 |
mean | -0.000509 | -0.086233 | 0.479347 |
std | 0.585281 | 0.782878 | 0.228198 |
min | -0.999836 | -2.390646 | 0.151793 |
25% | -0.513685 | -0.648045 | 0.302540 |
50% | -0.006021 | -0.068052 | 0.431361 |
75% | 0.501449 | 0.473741 | 0.610809 |
max | 0.999289 | 2.365710 | 1.506188 |
Jot down a few things you notice about this data from this summary.
Summarize pong_data the same way. Does anything stick out?
pong_data.describe()
statistic | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | y0 | y1 | y2 | y3 | y4 | y5 | y6 | y7 | y8 | y9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.0 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.049004 | 0.132093 | 0.212905 | 0.291504 | 0.367950 | 0.442301 | 0.514615 | 0.584949 | 0.653355 | 0.719888 | 0.0 | 0.125206 | 0.217122 | 0.276658 | 0.304702 | 0.302116 | 0.269740 | 0.208390 | 0.118860 | 0.001921 |
std | 0.062998 | 0.067380 | 0.075805 | 0.086806 | 0.099285 | 0.112547 | 0.126175 | 0.139919 | 0.153624 | 0.167196 | 0.0 | 0.010876 | 0.021454 | 0.031742 | 0.041748 | 0.051481 | 0.060946 | 0.070153 | 0.079107 | 0.087815 |
min | -0.161553 | -0.089041 | -0.018516 | 0.050077 | 0.116790 | 0.181677 | 0.244785 | 0.306165 | 0.365863 | 0.415850 | 0.0 | 0.093722 | 0.155016 | 0.184769 | 0.183846 | 0.153088 | 0.093310 | 0.005310 | -0.110141 | -0.252291 |
25% | -0.001755 | 0.079435 | 0.157023 | 0.229517 | 0.293469 | 0.353604 | 0.414068 | 0.473338 | 0.532280 | 0.590583 | 0.0 | 0.115816 | 0.198597 | 0.249250 | 0.268654 | 0.257665 | 0.217116 | 0.147817 | 0.050555 | -0.073903 |
50% | 0.076534 | 0.148675 | 0.205846 | 0.270214 | 0.338380 | 0.406922 | 0.476322 | 0.542847 | 0.608249 | 0.673589 | 0.0 | 0.127098 | 0.220852 | 0.282177 | 0.311961 | 0.311068 | 0.280338 | 0.220589 | 0.132616 | 0.017191 |
75% | 0.100177 | 0.187800 | 0.286463 | 0.383127 | 0.475724 | 0.565217 | 0.651398 | 0.734418 | 0.816378 | 0.896600 | 0.0 | 0.132847 | 0.232193 | 0.298956 | 0.334029 | 0.338281 | 0.312554 | 0.257672 | 0.174431 | 0.063610 |
max | 0.151118 | 0.261095 | 0.370325 | 0.476563 | 0.579891 | 0.684321 | 0.787124 | 0.887111 | 0.984358 | 1.078941 | 0.0 | 0.144799 | 0.255769 | 0.333838 | 0.379908 | 0.394854 | 0.379530 | 0.334764 | 0.261364 | 0.160113 |
Some things that stick out from this summary are:
# Add your solution here...
A subset is specified by limiting the rows and/or columns. We have already seen how to pick out a single column, e.g. with line_data['x']. We can also pick out specific rows (for details on why we use iloc, see here):
line_data.iloc[:4]
row | x | y | dy |
---|---|---|---|
0 | 0.392938 | 0.085409 | 0.383192 |
1 | -0.427721 | -0.519880 | 0.385220 |
2 | -0.546297 | -0.812480 | NaN |
3 | 0.102630 | 0.105278 | 0.385567 |
Note how the missing value in the CSV file is represented as NaN ("not a number"). This is generally how Pandas represents any data that is missing, invalid, or otherwise not available (NA).
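A brief sketch of how NaN behaves, using a toy column with made-up values rather than the course data: reductions skip NaN by default, arithmetic propagates it, and isna / fillna detect or replace it.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data).
y = pd.Series([1.0, np.nan, 3.0])

# Reductions like mean() skip NaN by default ...
print(y.mean())                      # 2.0
# ... but arithmetic propagates it.
print((y + 1).tolist())              # [2.0, nan, 4.0]

# Detect and replace missing entries explicitly.
print(y.isna().tolist())             # [False, True, False]
print(y.fillna(0.0).tolist())        # [1.0, 0.0, 3.0]
```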
We may not want to use any rows with missing data. Select the subset of useful data with:
line_data_valid = line_data.dropna()
line_data_valid[:4]
row | x | y | dy |
---|---|---|---|
0 | 0.392938 | 0.085409 | 0.383192 |
1 | -0.427721 | -0.519880 | 0.385220 |
3 | 0.102630 | 0.105278 | 0.385567 |
4 | 0.438938 | 0.582137 | 0.509960 |
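Notice in the output above that row 2 is gone but the remaining rows keep their original labels. A minimal sketch of this behavior with a toy frame (hypothetical values, not the course data), including reset_index in case you prefer consecutive row numbers:

```python
import numpy as np
import pandas as pd

# Toy frame with a missing value in row 1 (hypothetical data).
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'dy': [0.1, np.nan, 0.3]})
valid = df.dropna()

# dropna() keeps the original row labels, so the index has a gap.
print(valid.index.tolist())                          # [0, 2]

# reset_index(drop=True) renumbers the rows 0..N-1 if preferred.
print(valid.reset_index(drop=True).index.tolist())   # [0, 1]
```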
You can also select rows using any logical test on column values. For example, to select all rows with dy > 0.5 and y < 0:
selected = line_data[(line_data['dy'] > 0.5) & (line_data['y'] < 0)]
selected[:4]
row | x | y | dy |
---|---|---|---|
13 | -0.880644 | -1.482074 | 0.698284 |
16 | -0.635017 | -1.192232 | 0.619905 |
30 | -0.815790 | -0.172324 | 0.643215 |
35 | -0.375478 | -1.320013 | 0.574198 |
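The same selection idiom in miniature, using a toy frame with made-up values: each comparison yields a boolean Series, and the combined mask indexes the frame. DataFrame.query offers an equivalent string-based spelling.

```python
import pandas as pd

# Toy frame (hypothetical values) to illustrate boolean selection.
df = pd.DataFrame({'y': [-1.0, 0.5, -2.0], 'dy': [0.6, 0.7, 0.4]})

# Combine comparisons with & or |, with parentheses around each test.
mask = (df['dy'] > 0.5) & (df['y'] < 0)
print(df[mask].index.tolist())                         # [0]

# DataFrame.query expresses the same selection as a string.
print(df.query('dy > 0.5 and y < 0').index.tolist())   # [0]
```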
Use describe to compare the summary statistics for rows with x < 0 and x >= 0. Do they make sense?
line_data[line_data['x'] < 0].describe()
statistic | x | y | dy |
---|---|---|---|
count | 1006.000000 | 1006.000000 | 938.000000 |
mean | -0.507065 | -0.689012 | 0.472889 |
std | 0.288074 | 0.498581 | 0.227474 |
min | -0.999836 | -2.390646 | 0.159862 |
25% | -0.758180 | -1.005357 | 0.294420 |
50% | -0.511167 | -0.643512 | 0.419482 |
75% | -0.264287 | -0.338449 | 0.611192 |
max | -0.000128 | 0.757903 | 1.506188 |
line_data[line_data['x'] >= 0].describe()
statistic | x | y | dy |
---|---|---|---|
count | 994.000000 | 994.000000 | 912.000000 |
mean | 0.512162 | 0.523822 | 0.485989 |
std | 0.287312 | 0.491520 | 0.228875 |
min | 0.001123 | -1.154558 | 0.151793 |
25% | 0.266587 | 0.163363 | 0.312799 |
50% | 0.502736 | 0.471419 | 0.436676 |
75% | 0.761346 | 0.821626 | 0.607731 |
max | 0.999289 | 2.365710 | 1.378183 |
# Add your solution here...
You can easily add new columns derived from existing columns, for example:
line_data['yprediction'] = 1.2 * line_data['x'] - 0.1
The new column is only in memory, and not automatically written back to the original file.
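If you do want to keep a derived column, write the frame out explicitly. A short sketch with a toy frame (hypothetical values; the output filename is arbitrary, written to a temporary directory):

```python
import os
import tempfile
import pandas as pd

# Toy frame (hypothetical values), with a derived column added.
df = pd.DataFrame({'x': [0.0, 1.0]})
df['yprediction'] = 1.2 * df['x'] - 0.1

# Write the augmented frame out explicitly to keep it.
path = os.path.join(tempfile.gettempdir(), 'line_data_augmented.csv')
df.to_csv(path, index=False)

# Round-trip check: the derived column survives on disk.
back = pd.read_csv(path)
print(back.columns.tolist())   # ['x', 'yprediction']
```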
EXERCISE: Add a new column for the "pull", defined as: $$ y_{pull} \equiv \frac{y - y_{prediction}}{\delta y} \; . $$ What would you expect the mean and standard deviation (std) of this new column to be if the prediction is accurate? What do the actual mean and std values indicate?
line_data['ypull'] = (line_data['y'] - line_data['yprediction']) / line_data['dy']
The mean should be close to zero if the prediction is unbiased. The std should be close to one if the prediction is unbiased and the errors are accurate. The actual values indicate that the prediction is unbiased, but the errors are overestimated.
line_data.describe()
statistic | x | y | dy | yprediction | ypull |
---|---|---|---|---|---|
count | 2000.000000 | 2000.000000 | 1850.000000 | 2000.000000 | 1850.000000 |
mean | -0.000509 | -0.086233 | 0.479347 | -0.100611 | 0.036367 |
std | 0.585281 | 0.782878 | 0.228198 | 0.702338 | 0.661659 |
min | -0.999836 | -2.390646 | 0.151793 | -1.299803 | -2.162585 |
25% | -0.513685 | -0.648045 | 0.302540 | -0.716422 | -0.429185 |
50% | -0.006021 | -0.068052 | 0.431361 | -0.107225 | 0.033875 |
75% | 0.501449 | 0.473741 | 0.610809 | 0.501739 | 0.484257 |
max | 0.999289 | 2.365710 | 1.506188 | 1.099146 | 2.033837 |
# Add your solution here...
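This expectation can be checked numerically. The sketch below uses simulated toy data (not the course file): y is generated with Gaussian errors whose size matches dy exactly, so the pull should come out with mean near zero and std near one.

```python
import numpy as np
import pandas as pd

# Simulated toy data: errors on y are drawn with width exactly dy.
rng = np.random.RandomState(42)
n = 100000
df = pd.DataFrame({'x': rng.uniform(-1, 1, n)})
df['yprediction'] = 1.2 * df['x'] - 0.1
df['dy'] = rng.uniform(0.2, 0.6, n)
df['y'] = df['yprediction'] + df['dy'] * rng.normal(size=n)
df['ypull'] = (df['y'] - df['yprediction']) / df['dy']

# With an unbiased prediction and accurate errors: mean ~0, std ~1.
print(round(df['ypull'].mean(), 2), round(df['ypull'].std(), 2))
```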
Most of the data files for this course come in data/targets pairs (for reasons that will become clear soon). Verify that the files pong_data.hf5 and pong_targets.hf5 have the same number of rows but different column names.
pong_data = pd.read_hdf(locate_data('pong_data.hf5'))
pong_targets = pd.read_hdf(locate_data('pong_targets.hf5'))
print('#rows: {}, {}.'.format(len(pong_data), len(pong_targets)))
assert len(pong_data) == len(pong_targets)
print('data columns: {}.'.format(pong_data.columns.values))
print('targets columns: {}.'.format(pong_targets.columns.values))
#rows: 1000, 1000.
data columns: ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'y0' 'y1' 'y2' 'y3' 'y4' 'y5' 'y6' 'y7' 'y8' 'y9'].
targets columns: ['th0' 'hit' 'grp'].
# Add your solution here...
Use pd.concat to combine the (different) columns, matching row by row. Verify that the combined data has the expected number of rows and column names.
pong_both = pd.concat([pong_data, pong_targets], axis='columns')
print('#rows: {}'.format(len(pong_both)))
print('columns: {}.'.format(pong_both.columns.values))
#rows: 1000
columns: ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'y0' 'y1' 'y2' 'y3' 'y4' 'y5' 'y6' 'y7' 'y8' 'y9' 'th0' 'hit' 'grp'].
# Add your solution here...
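The row-by-row matching can be seen in miniature with a toy data/targets pair (hypothetical values): concat along axis='columns' pastes the frames side by side, aligned on the row index.

```python
import pandas as pd

# Toy data/targets pair (hypothetical values) matched row by row.
data = pd.DataFrame({'x0': [1.0, 2.0], 'x1': [3.0, 4.0]})
targets = pd.DataFrame({'hit': [True, False]})

# axis='columns' pastes the frames side by side, aligned on the index.
both = pd.concat([data, targets], axis='columns')
print(both.shape)               # (2, 3)
print(both.columns.tolist())    # ['x0', 'x1', 'hit']
```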
Finally, here is an example of taking data from an external source and adapting it to the standard format we are using. The data is from the 2014 ATLAS Higgs Challenge which is now documented and archived here. More details about the challenge are in this writeup.
EXERCISE:

- Download atlas-higgs-challenge-2014-v2.csv.gz using the link at the bottom of this page.
- Create higgs_data.hf5: input data with 30 columns, ~100 MB in size.
- Create higgs_targets.hf5: output targets with 1 column, ~8.8 MB in size.

def prepare_higgs(filename='atlas-higgs-challenge-2014-v2.csv.gz'):
    # Read the input file, uncompressing on the fly.
    df = pd.read_csv(filename, index_col='EventId', na_values='-999.0')
    # Prepare and save the data output file.
    higgs_data = df.drop(columns=['Label', 'KaggleSet', 'KaggleWeight']).astype('float32')
    higgs_data.to_hdf(locate_data('higgs_data.hf5', check_exists=False), 'data', mode='w')
    # Prepare and save the targets output file.
    higgs_targets = df[['Label']]
    higgs_targets.to_hdf(locate_data('higgs_targets.hf5', check_exists=False), 'targets', mode='w')

prepare_higgs()
Check that locate_data can find the new files:
locate_data('higgs_data.hf5')
'/root/miniconda/envs/DAMLA/lib/python3.6/site-packages/mls/data/higgs_data.hf5'
locate_data('higgs_targets.hf5')
'/root/miniconda/envs/DAMLA/lib/python3.6/site-packages/mls/data/higgs_targets.hf5'
You can now safely delete the downloaded CSV file. Uncomment and run the line below if you would like to do this directly from your notebook (this is an example of a shell command).
#!rm atlas-higgs-challenge-2014-v2.csv.gz