First Contact with Data¶

Every time I encounter new data file. There are few initial "looks" that I take on it. This help me understand if I can load the whole set to memory and what are the fields there. Since I'm command line oriented, I use linux command line utilities to do that (which are easily accesible from Jupython with !), but it's easily done with Python as well.

As an example, we'll use a subset of the NYC taxi dataset. The file is called taxi.csv.

File Size¶

In [1]:

# Command line
!ls -lh taxi.csv

-rw-r--r-- 1 miki miki 16M Oct  2 12:40 taxi.csv

In [2]:

# Python
from os import path
print('%.2f KB' % (path.getsize('taxi.csv')/(1<<10)))
print('%.2f MB' % (path.getsize('taxi.csv')/(1<<20)))

15789.13 KB
15.42 MB

Number of Lines¶

In [3]:

# Command line
!wc -l taxi.csv

100001 taxi.csv

In [4]:

# Python
with open('taxi.csv') as fp:
    print(sum(1 for _ in fp))

In [5]:

# Command line
!head -1 taxi.csv | tr , \\n
!printf "%d fields" $(head -1 taxi.csv | tr , \\n | wc -l)

VendorID
lpep_pickup_datetime
Lpep_dropoff_datetime
Store_and_fwd_flag
RateCodeID
Pickup_longitude
Pickup_latitude
Dropoff_longitude
Dropoff_latitude
Passenger_count
Trip_distance
Fare_amount
Extra
MTA_tax
Tip_amount
Tolls_amount
Ehail_fee
improvement_surcharge
Total_amount
Payment_type
Trip_type 
21 fields

In [6]:

# Python
import csv
with open('taxi.csv') as fp:
    fields = next(csv.reader(fp))
print('\n'.join(fields))
print('%d fields' % len(fields))

VendorID
lpep_pickup_datetime
Lpep_dropoff_datetime
Store_and_fwd_flag
RateCodeID
Pickup_longitude
Pickup_latitude
Dropoff_longitude
Dropoff_latitude
Passenger_count
Trip_distance
Fare_amount
Extra
MTA_tax
Tip_amount
Tolls_amount
Ehail_fee
improvement_surcharge
Total_amount
Payment_type
Trip_type 
21 fields

Sample Data¶

In [7]:

# Command line
!head -2 taxi.csv | tail -1 | tr , \\n
!printf "%d values" $(head -2 taxi.csv | tail -1 | tr , \\n | wc -l)

2
2015-03-04 15:39:16
2015-03-04 15:42:30
N
1
-73.992240905761719
40.690120697021484
-73.999664306640625
40.684993743896484
2
.71
4.5
0
0.5
0
0

0.3
5.3
2
1


23 values

In [8]:

# Python
with open('taxi.csv') as fp:
    fp.readline()  # Skip header
    values = next(csv.reader(fp))
print('\n'.join(values))
print('%d values' % len(values))

2
2015-03-04 15:39:16
2015-03-04 15:42:30
N
1
-73.992240905761719
40.690120697021484
-73.999664306640625
40.684993743896484
2
.71
4.5
0
0.5
0
0

0.3
5.3
2
1


23 values

In [9]:

# Python (with field names)
from itertools import zip_longest
with open('taxi.csv') as fp:
    reader = csv.reader(fp)
    header = next(reader)
    values = next(reader)
for col, val in zip_longest(header, values, fillvalue='???'):
    print('%-20s: %s' % (col, val))

VendorID            : 2
lpep_pickup_datetime: 2015-03-04 15:39:16
Lpep_dropoff_datetime: 2015-03-04 15:42:30
Store_and_fwd_flag  : N
RateCodeID          : 1
Pickup_longitude    : -73.992240905761719
Pickup_latitude     : 40.690120697021484
Dropoff_longitude   : -73.999664306640625
Dropoff_latitude    : 40.684993743896484
Passenger_count     : 2
Trip_distance       : .71
Fare_amount         : 4.5
Extra               : 0
MTA_tax             : 0.5
Tip_amount          : 0
Tolls_amount        : 0
Ehail_fee           : 
improvement_surcharge: 0.3
Total_amount        : 5.3
Payment_type        : 2
Trip_type           : 1
???                 : 
???                 :

In both methods (with fields or without) we see that we have some extra empty fields at the end of each data row.

Loading as DataFrame¶

After the initial look, we know we can load the whole data to memory and have a good idea what to tell pandas for parsing it.

In [10]:

import pandas as pd
import numpy as np
date_cols = ['lpep_pickup_datetime', 'Lpep_dropoff_datetime']
with open('taxi.csv') as fp:
    header = next(csv.reader(fp))
    df = pd.read_csv(fp, names=header, usecols=np.arange(len(header)), parse_dates=date_cols)
df.head()

Out[10]:

	VendorID	lpep_pickup_datetime	Lpep_dropoff_datetime	Store_and_fwd_flag	RateCodeID	Pickup_longitude	Pickup_latitude	Dropoff_longitude	Dropoff_latitude	Passenger_count	...	Fare_amount	Extra	MTA_tax	Tip_amount	Ehail_fee	improvement_surcharge	Total_amount	Payment_type	Trip_type
0	2	2015-03-04 15:39:16	2015-03-04 15:42:30	N	1	-73.992241	40.690121	-73.999664	40.684994	2	...	4.5	0.0	0.5	0.00	NaN	0.3	5.30	2	1
1	2	2015-03-22 17:36:49	2015-03-22 17:45:39	N	5	-73.930038	40.819576	-73.907173	40.811306	2	...	12.0	0.0	0.0	0.00	NaN	0.0	12.00	2	2
2	2	2015-03-25 22:08:45	2015-03-25 22:53:29	N	1	-73.961082	40.807022	-73.984642	40.663147	1	...	45.0	0.5	0.5	9.26	NaN	0.3	55.56	1	1
3	2	2015-03-16 13:45:20	2015-03-16 13:52:04	N	1	-73.913200	40.777962	-73.926994	40.772743	2	...	6.5	0.0	0.5	0.00	NaN	0.3	7.30	2	1
4	2	2015-03-19 18:53:50	2015-03-19 18:59:04	N	1	-73.925888	40.827602	-73.916351	40.824966	1	...	5.5	1.0	0.5	0.00	NaN	0.3	7.30	2	1

5 rows × 21 columns

In [ ]:

First Contact with Data¶

File Size¶

Number of Lines¶

Header¶

Sample Data¶

Loading as DataFrame¶