We can get data, we can twist data, we can visualise data, but how do we effectively store and share data?
Rudimentary knowledge of data storage and data formats is a major part of the Data Science ecosystem.
I'm very biased, YMMV.
We've been playing with `pandas` for a while, reading data with `read_csv`, and the eagle-eyed may have noticed a `to_csv` as well, but CSV is a woefully inadequate (if 'simple') format, especially for numerical data. `pandas` supports a huge range of IO capabilities straight out of the box, but now that we're going a little lower level, let's just make up some data and see how different formats perform:
```python
import pandas as pd
import numpy as np
import string
import random
from pathlib import Path

def get_random_string(length):
    letters = string.ascii_lowercase
    result_str = ''.join(random.sample(letters, k=length))
    return result_str

def get_random_unicode(length):
    """shamelessly stolen https://stackoverflow.com/a/21666621/252556"""
    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        (0x0021, 0x0021),
        (0x0023, 0x0026),
        (0x0028, 0x007E),
        (0x00A1, 0x00AC),
        (0x00AE, 0x00FF),
        (0x0100, 0x017F),
        (0x0180, 0x024F),
        (0x2C60, 0x2C7F),
        (0x16A0, 0x16F0),
        (0x0370, 0x0377),
        (0x037A, 0x037E),
        (0x0384, 0x038A),
        (0x038C, 0x038C),
    ]
    alphabet = [
        get_char(code_point)
        for current_range in include_ranges
        for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for i in range(length))
```
```python
size = int(1e6)
cats = [get_random_string(12) for _ in range(4)]
df = pd.DataFrame({'randn': np.random.randint(0, 100, size=size),  # ints
                   'randnorm': np.random.normal(size=size),        # floats
                   'randstr': [get_random_string(8) for _ in range(size)],   # strs
                   'randutf': [get_random_unicode(8) for _ in range(size)],  # unicode
                   'randcat': random.choices(cats, k=size)  # potential categories
                   })

csv_path = Path('data/stress.csv')
csv_path.parent.mkdir(exist_ok=True)  # make sure data/ exists before writing
df.to_csv(csv_path, index=False)
df.head()
```
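To make the "woefully inadequate" claim concrete before you start comparing formats: CSV stores text only, so dtypes don't survive a round trip. A small self-contained sketch (using a tiny toy frame rather than the big `df` above) showing a categorical column silently degrading to plain `object`:

```python
import pandas as pd
from pathlib import Path

small = pd.DataFrame({'cat': pd.Categorical(['a', 'b', 'a'])})
path = Path('roundtrip.csv')
small.to_csv(path, index=False)

back = pd.read_csv(path)
# CSV keeps the text but throws away the dtype on the way back in
print(small['cat'].dtype, back['cat'].dtype)  # category object
```

The same thing happens to datetimes, which come back as plain strings unless you remember `parse_dates`.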
Check out the `pandas` IO Tools documentation.
Pick 4 data formats, and evaluate them on these characteristics:

- Data Stability: is the result of reading it the same as what you put in?
- Compression Size: how much smaller is the resulting file compared to `data/stress.csv`?
- Decompression Speed: how quickly can you read the data back and start operating on it?

This should take no more than 10 minutes (less if you read ahead a bit...)

(Bonus: try different numbers for `size`)
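If you're unsure where to start, here's a rough shape for the evaluation loop, using pickle purely as a placeholder (substitute whichever formats you picked; the variable names are made up, and a tiny frame stands in for the stress data):

```python
import time
from pathlib import Path

import numpy as np
import pandas as pd

sample = pd.DataFrame({'randn': np.random.randint(0, 100, size=1000),
                       'randnorm': np.random.normal(size=1000)})

path = Path('stress.pkl')
sample.to_pickle(path)

# Data Stability: raises if anything (values, dtypes, index) changed
roundtrip = pd.read_pickle(path)
pd.testing.assert_frame_equal(sample, roundtrip)

# Compression Size: on-disk footprint, to compare against data/stress.csv
size_mb = path.stat().st_size / 1024**2

# Decompression Speed: time a fresh read
start = time.perf_counter()
pd.read_pickle(path)
read_seconds = time.perf_counter() - start

print(f'{size_mb:.3f} MB, read in {read_seconds:.4f}s')
```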
Apache Arrow is a cross-language, in-memory data sharing format and interface protocol (i.e. "you don't have to convert everything to JSON for inter-process communication").
`pyarrow`, Arrow's Python implementation, integrates tightly with `pandas`, so they play well together: `pyarrow` is directly supported as an engine for `to_parquet`.
```python
pq_path = Path('data/stress.pa.pq')
df.to_parquet(pq_path, engine='pyarrow')

pq_path.stat().st_size / 1024**2   # MB
csv_path.stat().st_size / 1024**2  # MB
```
(No notebook this time, answers in the Miro Board)
So far we've only dealt with non-timeseries data.
Can you find an example dataset that has a timeseries component and convert it to a `pyarrow` parquet format?
In this section we got a whistle-stop tour of `pandas.io` and all the formats you can play with, but I strongly recommend that, unless you have a good reason not to, you treat Parquet with `pyarrow` as your best bet.