Forgot where you put some sensitive information in your cloud database?
Tired of downloading files from s3 only to preview them in Excel or TextPad?
Want to touch and reshape data, but feel it's caged off from you?
Look no further: your s3 problems are solved!
The only requirements are setting AWS environment variables (or setting up the AWS CLI) and installing the modules in requirements.txt.
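If you haven't set up credentials before, the configuration might look something like this (the key values below are placeholders, not real credentials):

```shell
# Option 1: environment variables (placeholder values)
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
export AWS_DEFAULT_REGION="us-east-1"

# Option 2: let the AWS CLI write ~/.aws/credentials for you
# aws configure

# Install the project's dependencies
pip install -r requirements.txt
```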
For this tutorial, we'll use the red wine quality dataset from the UCI Center for Machine Learning and Intelligent Systems.
import os
import s3
s3.ls lists all the files and directories under a bucket/key, akin to os.listdir().
s3_path = 's3://prod-datalytics/playground/'
It takes either a bucket, or a bucket/key pair.
s3.ls(s3_path)
['s3://prod-datalytics/playground/json_bourne.json', 's3://prod-datalytics/playground/wine_is_fine.csv', 's3://prod-datalytics/playground/wine_is_not_fine.tsv']
s3.ls also supports wildcard patterns, just like glob.glob():
s3.ls(s3_path + '*.csv')
['s3://prod-datalytics/playground/wine_is_fine.csv']
With a programmatic method of getting s3 file paths, we can start doing some cool stuff.
f = s3.ls(s3_path + '*.csv')[0]
f
's3://prod-datalytics/playground/wine_is_fine.csv'
We can open the file as a streaming body of bytes:
s3.open(f)
<botocore.response.StreamingBody at 0x10a52ada0>
This is sometimes helpful, but typically we want to read a file the way Python's built-in open() does:

with open(filename, 'r') as f:
    f.read()
s3.read(f, encoding='utf-8')[:200] # displays the first 200 characters.
'fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality\n7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5'
For more structured data, we can leverage Pandas' parsing engines...
s3.read_csv and s3.read_json are identical to their Pandas ancestors, which serve as their backbone.
Using these handy functions, you get data back in a nice tabular format:
df = s3.read_csv(f, sep=',')
df.head(3)
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
A CSV is the simplest use case; we can handle alternative delimiters and JSON files too.
files = s3.ls(s3_path)
files
['s3://prod-datalytics/playground/json_bourne.json', 's3://prod-datalytics/playground/wine_is_fine.csv', 's3://prod-datalytics/playground/wine_is_not_fine.tsv']
Here's a file of tab-separated values (TSV):
print("We can read the {} tsv easily.".format(files[-1]))
df = s3.read_csv(files[-1], sep='\t')
df.tail(3)
We can read the s3://prod-datalytics/playground/wine_is_not_fine.tsv tsv easily.
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
And here's a JSON file:
print("We can also read the {} file easily.".format(files[0]))
df = s3.read_json(files[0])
df.sample(3)
We can also read the s3://prod-datalytics/playground/json_bourne.json file easily.
| | alcohol | chlorides | citric acid | density | fixed acidity | free sulfur dioxide | pH | quality | residual sugar | sulphates | total sulfur dioxide | volatile acidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1289 | 10.2 | 0.068 | 0.30 | 0.99914 | 7.0 | 20.0 | 3.30 | 5 | 4.5 | 1.17 | 110.0 | 0.60 |
| 607 | 10.5 | 0.092 | 0.41 | 0.99820 | 8.8 | 26.0 | 3.31 | 6 | 3.3 | 0.53 | 52.0 | 0.48 |
| 675 | 10.2 | 0.064 | 0.39 | 0.99840 | 9.3 | 12.0 | 3.26 | 5 | 2.2 | 0.65 | 31.0 | 0.41 |
They're actually all the same data, just in different formats!
If you're new to Pandas, you'll be happy to learn that it's the de facto tool for data manipulation in Python.
df.dtypes
alcohol                 float64
chlorides               float64
citric acid             float64
density                 float64
fixed acidity           float64
free sulfur dioxide     float64
pH                      float64
quality                   int64
residual sugar          float64
sulphates               float64
total sulfur dioxide    float64
volatile acidity        float64
dtype: object
Basic stats and distributions are just a function call away:
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| alcohol | 1599.0 | 10.422983 | 1.065668 | 8.40000 | 9.5000 | 10.20000 | 11.100000 | 14.90000 |
| chlorides | 1599.0 | 0.087467 | 0.047065 | 0.01200 | 0.0700 | 0.07900 | 0.090000 | 0.61100 |
| citric acid | 1599.0 | 0.270976 | 0.194801 | 0.00000 | 0.0900 | 0.26000 | 0.420000 | 1.00000 |
| density | 1599.0 | 0.996747 | 0.001887 | 0.99007 | 0.9956 | 0.99675 | 0.997835 | 1.00369 |
| fixed acidity | 1599.0 | 8.319637 | 1.741096 | 4.60000 | 7.1000 | 7.90000 | 9.200000 | 15.90000 |
| free sulfur dioxide | 1599.0 | 15.874922 | 10.460157 | 1.00000 | 7.0000 | 14.00000 | 21.000000 | 72.00000 |
| pH | 1599.0 | 3.311113 | 0.154386 | 2.74000 | 3.2100 | 3.31000 | 3.400000 | 4.01000 |
| quality | 1599.0 | 5.636023 | 0.807569 | 3.00000 | 5.0000 | 6.00000 | 6.000000 | 8.00000 |
| residual sugar | 1599.0 | 2.538806 | 1.409928 | 0.90000 | 1.9000 | 2.20000 | 2.600000 | 15.50000 |
| sulphates | 1599.0 | 0.658149 | 0.169507 | 0.33000 | 0.5500 | 0.62000 | 0.730000 | 2.00000 |
| total sulfur dioxide | 1599.0 | 46.467792 | 32.895324 | 6.00000 | 22.0000 | 38.00000 | 62.000000 | 289.00000 |
| volatile acidity | 1599.0 | 0.527821 | 0.179060 | 0.12000 | 0.3900 | 0.52000 | 0.640000 | 1.58000 |
Everything is indexed!
Here we get a quick calculation for the 75th percentile of alcohol content.
df.describe()['alcohol']['75%']
11.1
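As an aside, pandas can also compute a percentile directly via Series.quantile, without building the full describe() table. A quick sketch on a toy Series (the values here are made up):

```python
import pandas as pd

# toy alcohol values; quantile(0.75) interpolates linearly by default
alcohol = pd.Series([8.4, 9.5, 10.2, 11.1, 14.9])
p75 = alcohol.quantile(0.75)
print(p75)  # 11.1
```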
It's easy to filter a dataframe:
Here we're going to get all the heavily alcoholic wines...
df_alcoholic = df[df['alcohol'] > df.describe()['alcohol']['75%']]
df_alcoholic.head()
| | alcohol | chlorides | citric acid | density | fixed acidity | free sulfur dioxide | pH | quality | residual sugar | sulphates | total sulfur dioxide | volatile acidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 13.1 | 0.054 | 0.15 | 0.9934 | 4.6 | 8.0 | 3.90 | 4 | 2.1 | 0.56 | 65.0 | 0.52 |
| 95 | 12.9 | 0.058 | 0.17 | 0.9932 | 4.7 | 17.0 | 3.85 | 6 | 2.3 | 0.60 | 106.0 | 0.60 |
| 131 | 13.0 | 0.049 | 0.09 | 0.9937 | 5.6 | 17.0 | 3.63 | 5 | 2.3 | 0.63 | 99.0 | 0.50 |
| 132 | 13.0 | 0.049 | 0.09 | 0.9937 | 5.6 | 17.0 | 3.63 | 5 | 2.3 | 0.63 | 99.0 | 0.50 |
| 142 | 14.0 | 0.050 | 0.00 | 0.9916 | 5.2 | 27.0 | 3.68 | 6 | 1.8 | 0.79 | 63.0 | 0.34 |
It's also stupidly easy to plot, as Pandas builds on the Matplotlib package.
# this line is run once, typically at the beginning of the notebook, to enable inline plotting
%matplotlib inline
df_alcoholic.plot(kind='scatter', x='residual sugar', y='density')
<matplotlib.axes._subplots.AxesSubplot at 0x10de76e10>
What is that outlier?
df_alcoholic[df_alcoholic['residual sugar'] > 12]
After processing and normalizing the data, we may want to upload this new file to s3.
s3.to_csv is almost identical to its Pandas ancestor and backbone.
The difference is that s3.to_csv takes the dataframe as an argument, rather than being a method of the dataframe.
# where will the file get stored?
s3_target = 's3://prod-datalytics/playground/wine_list.tsv.gz'
We can now use our filtered dataset to write a new file to s3.
Using Pandas' to_csv args, we have a lot of control over the output format.
s3.to_csv(df_alcoholic, s3_target, sep='\t',
index=False, compression='gzip')
"File uploaded to 's3://prod-datalytics/playground/wine_list.tsv.gz'"
We can send local files to s3 too. First, let's write a file to local disk using the built-in Pandas to_csv().
local_file = 'wine_list.tsv.gz'
df_alcoholic.to_csv(local_file, sep='\t', index=False, compression='gzip')
s3.disk_2_s3(file=local_file,
s3_path=s3_target)
"'wine_list.tsv.gz' loaded to 's3://prod-datalytics/playground/wine_list.tsv.gz'"
# purge it!
os.remove(local_file)
If you're into machine learning, you're in luck!
from sklearn.ensemble import RandomForestClassifier
For the example, let's just use a vanilla Random Forest model.
clf = RandomForestClassifier()
clf
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
Here is where we'd train and evaluate the model...
# fit the model!
# clf.fit(X, y)
ACCURACY ON TRAINING SET: 0.99
ACCURACY ON TEST SET: 0.61
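The elided training step might look something like the sketch below. The DataFrame here is a random stand-in for the wine data (column names and values are made up), so the scores won't match the ones quoted:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# random stand-in for the wine DataFrame loaded from s3 above
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(200, 4),
                  columns=['alcohol', 'pH', 'sulphates', 'density'])
df['quality'] = rng.randint(3, 9, size=200)

# hold out a test set so we can estimate generalization
X = df.drop(columns='quality')
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
```

A large gap between the two scores, like the one above, is the classic sign of overfitting.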
On my first run (not shown), I got a test set accuracy of only 61%, which is pretty bad.
You should try to beat that score!
'''
Write some code here!
Look into train_test_split, GridSearchCV, and KFold from Scikit-Learn.
This is also a great dataset to practice:
- scaling values (see StandardScaler)
- dimensionality reduction (see PCA)
- a linear model (see Lasso or LogisticRegression)
'''
Once you're happy with the performance, you can persist the model as a pickle file.
s3.dump_clf(clf, 's3://prod-datalytics/playground/models/clf.pkl')
"'clf.pkl' loaded to 's3://prod-datalytics/playground/models/clf.pkl'"
And re-use it when the time is right!
s3.load_clf('s3://prod-datalytics/playground/models/clf.pkl')
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
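Since the saved artifact is just a pickle file, the round trip is plain pickling under the hood. A local equivalent (no s3 involved; the file name is arbitrary) might look like:

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# local stand-in for s3.dump_clf / s3.load_clf, minus the upload
clf = RandomForestClassifier(n_estimators=10, random_state=0)
with open('clf.pkl', 'wb') as fh:
    pickle.dump(clf, fh)

with open('clf.pkl', 'rb') as fh:
    restored = pickle.load(fh)
print(type(restored).__name__)  # RandomForestClassifier
```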
In the interest of good file-keeping, let's move our saved classifier to its own special folder (key).
s3.cp(old_path='s3://prod-datalytics/playground/models/clf.pkl',
new_path='s3://prod-datalytics/production_space/models/clf.pkl',)
{'CopyObjectResult': {'ETag': '"fd28ec0656661ce2a86373b097a95b89"', 'LastModified': datetime.datetime(2017, 3, 2, 0, 13, 43, tzinfo=tzutc())}, 'CopySourceVersionId': 'ov3ei3i4mEGFOcBMEmi2g8atvAbVHKJx', 'ResponseMetadata': {'HTTPStatusCode': 200, 'HostId': 'ax6Q2HTAn+86P6wz6v2MWX3ZLsYoksdpqgcJtyKaXcEur80A4awZMiEEDuMLzzcydYNoyX3wBGQ=', 'RequestId': 'AE98FD0B9CE85D9F'}, 'VersionId': 'OJX.ffdLzhrSD5kYc3wyMAWhtSWBIawN'}
To move the file (and delete the old instance), we use mv instead of cp.
s3.mv(old_path='s3://prod-datalytics/playground/models/clf.pkl',
new_path='s3://prod-datalytics/production_space/models/clf.pkl',)
{'DeleteMarker': True, 'ResponseMetadata': {'HTTPStatusCode': 204, 'HostId': 'wevPToOM9kIl8QId3r9NzheWcq5c1Rw43kUnH9Js7Ja8N3Ah/8G5DxzfKO9JVaL4uZ8RMkSkN5o=', 'RequestId': '9E3FC788589304B4'}, 'VersionId': 'tnPVq8F1usk.sBxewcQ2SUWaYOrG3KXN'}