First Contact with Data

Every time I encounter new data file. There are few initial "looks" that I take on it. This help me understand if I can load the whole set to memory and what are the fields there. Since I'm command line oriented, I use linux command line utilities to do that (which are easily accesible from Jupython with !), but it's easily done with Python as well.

As an example, we'll use a subset of the NYC taxi dataset. The file is called taxi.csv.

File Size

In [1]:
# Command line
!ls -lh taxi.csv
-rw-r--r-- 1 miki miki 16M Oct  2 12:40 taxi.csv
In [2]:
# Python
from os import path
print('%.2f KB' % (path.getsize('taxi.csv')/(1<<10)))
print('%.2f MB' % (path.getsize('taxi.csv')/(1<<20)))
15789.13 KB
15.42 MB

Number of Lines

In [3]:
# Command line
!wc -l taxi.csv
100001 taxi.csv
In [4]:
# Python
with open('taxi.csv') as fp:
    print(sum(1 for _ in fp))
100001
In [5]:
# Command line
!head -1 taxi.csv | tr , \\n
!printf "%d fields" $(head -1 taxi.csv | tr , \\n | wc -l)
VendorID
lpep_pickup_datetime
Lpep_dropoff_datetime
Store_and_fwd_flag
RateCodeID
Pickup_longitude
Pickup_latitude
Dropoff_longitude
Dropoff_latitude
Passenger_count
Trip_distance
Fare_amount
Extra
MTA_tax
Tip_amount
Tolls_amount
Ehail_fee
improvement_surcharge
Total_amount
Payment_type
Trip_type 
21 fields
In [6]:
# Python
import csv
with open('taxi.csv') as fp:
    fields = next(csv.reader(fp))
print('\n'.join(fields))
print('%d fields' % len(fields))
VendorID
lpep_pickup_datetime
Lpep_dropoff_datetime
Store_and_fwd_flag
RateCodeID
Pickup_longitude
Pickup_latitude
Dropoff_longitude
Dropoff_latitude
Passenger_count
Trip_distance
Fare_amount
Extra
MTA_tax
Tip_amount
Tolls_amount
Ehail_fee
improvement_surcharge
Total_amount
Payment_type
Trip_type 
21 fields

Sample Data

In [7]:
# Command line
!head -2 taxi.csv | tail -1 | tr , \\n
!printf "%d values" $(head -2 taxi.csv | tail -1 | tr , \\n | wc -l)
2
2015-03-04 15:39:16
2015-03-04 15:42:30
N
1
-73.992240905761719
40.690120697021484
-73.999664306640625
40.684993743896484
2
.71
4.5
0
0.5
0
0

0.3
5.3
2
1


23 values
In [8]:
# Python
with open('taxi.csv') as fp:
    fp.readline()  # Skip header
    values = next(csv.reader(fp))
print('\n'.join(values))
print('%d values' % len(values))
2
2015-03-04 15:39:16
2015-03-04 15:42:30
N
1
-73.992240905761719
40.690120697021484
-73.999664306640625
40.684993743896484
2
.71
4.5
0
0.5
0
0

0.3
5.3
2
1


23 values
In [9]:
# Python (with field names)
from itertools import zip_longest
with open('taxi.csv') as fp:
    reader = csv.reader(fp)
    header = next(reader)
    values = next(reader)
for col, val in zip_longest(header, values, fillvalue='???'):
    print('%-20s: %s' % (col, val))
VendorID            : 2
lpep_pickup_datetime: 2015-03-04 15:39:16
Lpep_dropoff_datetime: 2015-03-04 15:42:30
Store_and_fwd_flag  : N
RateCodeID          : 1
Pickup_longitude    : -73.992240905761719
Pickup_latitude     : 40.690120697021484
Dropoff_longitude   : -73.999664306640625
Dropoff_latitude    : 40.684993743896484
Passenger_count     : 2
Trip_distance       : .71
Fare_amount         : 4.5
Extra               : 0
MTA_tax             : 0.5
Tip_amount          : 0
Tolls_amount        : 0
Ehail_fee           : 
improvement_surcharge: 0.3
Total_amount        : 5.3
Payment_type        : 2
Trip_type           : 1
???                 : 
???                 : 

In both methods (with fields or without) we see that we have some extra empty fields at the end of each data row.

Loading as DataFrame

After the initial look, we know we can load the whole data to memory and have a good idea what to tell pandas for parsing it.

In [10]:
import pandas as pd
import numpy as np
date_cols = ['lpep_pickup_datetime', 'Lpep_dropoff_datetime']
with open('taxi.csv') as fp:
    header = next(csv.reader(fp))
    df = pd.read_csv(fp, names=header, usecols=np.arange(len(header)), parse_dates=date_cols)
df.head()
Out[10]:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count ... Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-03-04 15:39:16 2015-03-04 15:42:30 N 1 -73.992241 40.690121 -73.999664 40.684994 2 ... 4.5 0.0 0.5 0.00 0.0 NaN 0.3 5.30 2 1
1 2 2015-03-22 17:36:49 2015-03-22 17:45:39 N 5 -73.930038 40.819576 -73.907173 40.811306 2 ... 12.0 0.0 0.0 0.00 0.0 NaN 0.0 12.00 2 2
2 2 2015-03-25 22:08:45 2015-03-25 22:53:29 N 1 -73.961082 40.807022 -73.984642 40.663147 1 ... 45.0 0.5 0.5 9.26 0.0 NaN 0.3 55.56 1 1
3 2 2015-03-16 13:45:20 2015-03-16 13:52:04 N 1 -73.913200 40.777962 -73.926994 40.772743 2 ... 6.5 0.0 0.5 0.00 0.0 NaN 0.3 7.30 2 1
4 2 2015-03-19 18:53:50 2015-03-19 18:59:04 N 1 -73.925888 40.827602 -73.916351 40.824966 1 ... 5.5 1.0 0.5 0.00 0.0 NaN 0.3 7.30 2 1

5 rows × 21 columns