Author: Karen Yin-Yee Ng karenyng@ucdavis.edu
Stat 250 Winter 2014 with Prof. Duncan Temple Lang
Assignment 1
Written with an IPython notebook (v 1.1.0). GitHub repository: https://github.com/karenyng/HW1_Stat250_Winter14/tree/master
We benchmark different approaches for computing the statistics of airline delays between 1987 and 2012.
Design considerations: due to the large size of the data, we have to:
avoid holding all the data in memory at the same time, and extract only the relevant columns
avoid numerical instabilities
make it fast, avoiding expensive operations such as copying data or I/O to the hard disk
Other possible considerations if this were made into production code:
A major consideration if this is only to be used once .........
Pros:
(other than probably Python, for me)
Cons:
Other thoughts about implementation:
I am not going to write the code in the most general / extensible / parallelizable way; instead I focus on getting an estimate of the answer from this first method.
I usually write code as functions and use OOP concepts, but I don't have time for that here.
Observation: there are two types of files with two different headers.
Extracting columns from the csv files is quite straightforward, except that in the month-by-month csv some values contain "," in them, so the naive use of cut with "," as the delimiter fails to fetch the correct column. My quick and dirty solution is to increment the column count by 2 as a correction.
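To see concretely why the naive split fails, here is a minimal Python sketch (the sample line is made up, not taken from the real data); a quote-aware parser such as the csv module gets it right:

import csv
from StringIO import StringIO

# a made-up line in the style of the month-by-month csv files:
# the quoted city field contains a comma
line = '2008,"January","Washington, DC",-5\n'

# a naive split miscounts the fields: the quoted city is broken in two
print line.strip().split(',')
# -> ['2008', '"January"', '"Washington', ' DC"', '-5']

# the csv module respects the double quotes and returns the intended 4 fields
print csv.reader(StringIO(line)).next()
# -> ['2008', 'January', 'Washington, DC', '-5']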
I made sure that the functions work on individual files before looping over more files. The test code is in 'exam_col.sh' and 'test.ipynb'.
On second thought, I should have written test code in Python or R that writes out the mean, median and standard deviation for each of the csv files, then used my own code to write output in the same format, and simply run diff in the shell to check for differences, covering as many exceptional cases as possible.
Right now I am just manually comparing statistics of individual files and writing out fake data for testing.
I really should have done a better job planning for unit tests...
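A sketch of what that regression test could look like, using pandas for the reference statistics (the file layout and output format here are my own invention):

import glob
import pandas as pd

# write one line of reference stats per year-by-year csv; my own code would
# write a file in the same format, and `diff` then flags any mismatch
with open('reference_stats.txt', 'w') as out:
    for fname in sorted(glob.glob('./data/[0-9][0-9][0-9][0-9].csv')):
        col = pd.read_csv(fname, usecols=['ArrDelay'])['ArrDelay']
        out.write('%s %.6f %.6f %.6f\n' %
                  (fname, col.mean(), col.median(), col.std()))
# then in the shell: diff reference_stats.txt my_stats.txt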
When I write in a scripting language, I usually try to play to the language's strengths. Since R is not particularly strong in string manipulation, or "developers of R did not have string manipulation in mind when they designed R" (Temple Lang 2014), I try to keep all string operations within the shell script.
To get a sense of the order of magnitude of the numbers that we are computing:
Of course these stats were obtained after I had run the data through my scripts. On second thought, I should have estimated the properties of the data before deciding on what languages / tools to use for this homework.
For R: since R stores the ArrDelay field as doubles by default and the data take integer form, we really shouldn't have to worry too much about numerical inaccuracies such as overflow (when summing numbers), underflow (dividing by a large number) or cancellation error (subtraction). I argue that we do not need more rigorous algorithms for computing the statistics of ArrDelay with the frequency table approach. If we compute the statistics with other methods, we may have to worry more, especially when computing the standard deviation.
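As a quick aside, here is a small numerical illustration (with made-up numbers, not the airline data) of the cancellation error that a naive one-pass variance formula can suffer, which the frequency table approach sidesteps:

import numpy as np

# data with a large mean relative to its spread; the true sd is ~5.48
x = 1e9 + np.array([4., 7., 13., 16.])
n = float(len(x))

# textbook one-pass formula: subtracts two huge, nearly equal numbers
naive_var = (np.sum(x ** 2) - n * np.mean(x) ** 2) / (n - 1)

# two-pass formula: subtract the mean first, then square
two_pass_var = np.sum((x - np.mean(x)) ** 2) / (n - 1)

# the naive result is garbage (it can even come out negative)
print naive_var, two_pass_var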
Is there any danger of loss of accuracy with this naive formula?
$$ \bar{t} = \frac{\sum_i^n t_i \, \omega_i}{\omega_{total}} $$
where $t_i$ denotes the delay time and $\omega_i$ denotes the corresponding count for that $t_i$. The numerator could overflow if we were summing many large positive numbers, but some of the numbers in the summation are negative, which keeps the partial sums in check.
As a safety precaution I will compute the mean using
$$ \bar{t} = \sum_i^n \frac{t_i \, \omega_i}{\omega_{total}} $$
Note that $t_i, \omega_i \ll \omega_{total}$; by multiplying before dividing, there shouldn't be any underflow problems. Also, I am using R's default sum() method, which uses partial sums, so I think it should be accurate enough.
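A minimal sketch of this mean computation (the toy frequency table below is made up; the real one comes from freq_count.sh):

import numpy as np

freq = np.array([120., 300., 80.])   # made-up counts, the omega_i
delay = np.array([-5., 0., 30.])     # the corresponding delay times t_i
w_total = freq.sum()

# divide inside the sum, mirroring the second formula above
t_mean = np.sum(freq * (delay / w_total))
print t_mean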
After reading John Cook's entry on computing the standard deviation, I decided to use an adapted version of the two-pass formula, because it can be written entirely in vectorized form.
I also tested whether the total frequency count is odd or even to avoid grabbing the wrong median.
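Continuing the toy frequency table from the sketch above, the median parity check and the two-pass standard deviation look roughly like this:

# running frequency count across the sorted delay values
csum = np.cumsum(freq)
half = int(w_total // 2)
i = int(np.argmax(csum >= half))   # first bin where the count reaches half

# if the half-count lands exactly on a bin boundary and the total is even,
# the median straddles two adjacent delay values
if csum[i] == half and w_total % 2 == 0:
    t_median = (delay[i] + delay[i + 1]) / 2.
else:
    t_median = delay[i]

# two-pass formula: reuse the already-computed mean, fully vectorized
t_sd = np.sqrt(np.sum(freq * (delay - t_mean) ** 2) / (w_total - 1))
print t_median, t_sd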
mean = 6.(566504) min
median = 0.(00000) min
standard deviation = 31.(556326) min
These should probably be reported to only 1 or 2 significant figures; the digits in parentheses are not really significant.
Eyeballing whether the answers make sense by plotting the frequency counts: maybe I should have made bar plots, but I am just trying to visualize this quickly.
Being able to visualize the results is a great strength of the frequency table approach.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# somehow pandas has trouble reading the sorted frequency table directly,
# so parse each line by hand
a = pd.read_table("sorted_freq.txt", names = ['col'])
c = []
for i in range(a.shape[0]):
    b = a['col'][i].split()
    # keep only well-formed (freq, delay) pairs, dropping NA and header rows
    if (len(b) == 2):
        if (b[1] != 'NA') and (b[1] != 'ArrDelay'):
            c.append([float(b[0]), float(b[1])])
c = np.array(c).transpose()
delay = c[1]
freq = c[0]

plt.plot(delay, freq, '.b')
plt.xlabel('Delay time (min)', size=14)
plt.ylabel('Frequency', size=14)
xlim_low = -100
xlim_high = 100
plt.title('Truncated view of frequency table from ' +
          '{0} to {1}'.format(xlim_low, xlim_high), size=15)
plt.xlim(xlim_low, xlim_high)
plt.axvline(6.566504, label='mean', color='r', ls=':', lw=3)
plt.axvline(0.00, label='median', color='g', ls='--', lw=3)
plt.legend(loc='best')
plt.plot(delay, freq, '.b')
plt.xlabel('Delay time (min)', size=14)
plt.ylabel('Frequency', size=14)
plt.title('Total view of frequency table', size=15)
plt.axvline(6.566504, label='mean', color='r', ls=':', lw=3)
plt.axvline(0.00, label='median', color='g', ls='--', lw=3)
plt.legend(loc='best')
It makes sense that the mean is slightly positive; this is explained by the longer tail on the positive end.
It also makes sense that the median is at zero: looking at the truncated plot, the positive tail is offset by the taller, fatter peak on the negative end.
The standard deviation is also plausible given the long tails.
This is probably not the fastest approach in terms of CPU time.
Possible improvement in the bash script: the script grabs the column from each csv and writes them all out to one big file before sorting and counting frequencies. I expect the I/O to the hard disk to be one of the bigger bottlenecks. Dynamic arrays in the shell could remedy this, but I am not familiar enough with how bash manages memory for growing arrays to say whether that would avoid heavy memory use.
The computation of the frequency table is another place for possible improvement. Currently I sort all 500 MB of ArrDelay column data before using "uniq" to compute the frequency table. If we implemented a fast hashing method, we could avoid holding all 500 MB of data at once. Although 500 MB should not be a limiting factor even for a laptop as old as 5 years, this is not an extensible solution if the data grow bigger.
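A sketch of the hashing idea, reading the same freq_count.txt that the bash script produces: a hash table of counts needs memory proportional to the number of distinct delay values, not to the 500 MB column:

from collections import Counter

counts = Counter()
with open('freq_count.txt') as f:
    for line in f:
        counts[line.strip()] += 1   # one streaming pass, no sort needed

# header and NA lines end up as keys too; filter them as in method1.R.
# sorting only happens at the end, over the few thousand distinct keys
for value, freq in sorted(counts.iteritems()):
    print freq, value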
A negligible amount of swap was used, as far as I could tell.
This implementation is probably the most straightforward, at only around 47 lines.
Python pandas is a data analysis package that imitates R syntax and has a C backend highly optimized for speed. I took a quick look at the relevant source code on GitHub. The relevant C / Cython code is scattered mainly in https://github.com/pydata/pandas/tree/master/pandas and https://github.com/pydata/pandas/tree/master/pandas/src
The actual "Python" implementation of the read_csv function looks like it is mostly in Python: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py BUT it is naive to think that read_csv is really written in Python, since it imports a bunch of wrapped C libraries to replace the original Python parts. For example, the imported pandas.lib https://github.com/pydata/pandas/blob/master/pandas/lib.pyx seems to replace Python data values such as integers, nans and infs with C variables and makes calls to the C standard libraries.
Here's an excerpt of parsers.py where the calls to the wrapped libraries are made:
import pandas.lib as lib
import pandas.tslib as tslib
import pandas.parser as _parser
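A quick way to feel the C backend at work (assuming a pandas version that exposes the engine keyword of read_csv, which lets you force the pure-Python parser for comparison):

import time
import pandas as pd

# time the same read with the C parser and with the pure-Python parser;
# the 'c' engine is typically several times faster
for engine in ('c', 'python'):
    t0 = time.time()
    pd.read_csv('./data/1987.csv', engine=engine)
    print engine, time.time() - t0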
Pros of this method:
Cons of this method:
   user  system elapsed
 89.872  16.604 185.723
The idea is to reduce the amount of data to be analyzed by sampling uniformly across all the csv files using https://github.com/duncantl/FastCSVSample.
Therefore, in the coding process it is important to ensure that (almost) the same percentage of lines is sampled from each csv file. Since, as a physics student, I do not usually do sampling, I would like to see what percentage of lines I need to sample in order to get results close to those of the exact method.
The total number of valid lines in all the files is $\sim 1.3 \times 10^8$, so even if we sample just $1\%$ of the lines, we still have to go over $\sim 10^6$ lines. My patience runs out for a computation that takes more than half an hour, so let's see how fast we need to be per line for me not to explode from impatience:
print 'each line should at most take '+\
'{0}s'.format( 30 * 60 / (1e6))
each line should at most take 0.0018s
And that is a long time for the computation of each line, which is a good sign, reassuring me that I am going to finish this on time: the R functions for grabbing the column and removing NaNs take only $\sim 10~\mu s$.
This is similar to method 1, but the number of lines should be even fewer, so I expect the total runtime to be less than 5 minutes unless I write some awfully slow loops.
It is fine to hold all the sampled lines in memory ($500~MB \times 1\% \approx 5~MB$) and just use R's default mean, median and standard deviation functions.
So I will piece all the sampled lines together into one big vector, then use the default functions.
Alternatively, do the statistics completely on the run, so we only have to go through the files once:
on the first pass through the data, use the recurrence formula provided by Duncan for the mean and standard deviation (a sketch of this kind of recurrence follows this list)
for the median, use a selection algorithm
but I really want to avoid writing loops in R...
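For reference, a minimal sketch of such a one-pass recurrence; I am using Welford's recurrence here, which may or may not be the same as the one Duncan provided:

def running_stats(values):
    # Welford's recurrence: update the mean and the sum of squared
    # deviations (M2) one value at a time, in a single pass
    n, mean, M2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        d = x - mean
        mean += d / n
        M2 += d * (x - mean)
    sd = (M2 / (n - 1)) ** 0.5 if n > 1 else float('nan')
    return mean, sd

print running_stats([-5., 0., 0., 30.])   # -> (6.25, ~16.01)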
mean = 6.(591345) min
median = 0.(000000) min
sd = 31.(465116) min
Sampling only 1% of the lines gives results that agree with the exact method to 2 significant figures.
Rscript "PATH TO GIT DIR"/methodN.R
replace N with the method number.
Thanks to an anonymous physicist in Portland who kindly allowed me to use his gaming desktop for the computations in this homework. He also helped me repair the partition table of my Linux partition and installed an HDD for my use.
y = [1, 2, 3]
result1 = [6.566504, 0.00000, 31.556326]
result2 = [6.56650421703, 0.0, 31.5563262623]
result3 = [6.591345, 0.000000, 31.465116]
plt.plot(result1, y, 'rs', markeredgewidth=1,
alpha = .25, markersize = 10, label="Method 1")
plt.plot(result2, y, 'gx', markeredgewidth=3,
alpha = .25, markersize = 10, label="Method 2")
plt.plot(result3, y, 'b*', markeredgewidth=1,
alpha = .25, markersize = 10, label="Method 3")
plt.legend(loc='lower right')
plt.title('Comparison of statistics from different approaches', size = 15)
label = ['mean', 'median', 'std. dev',]
plt.yticks( y, label, size = 15)
plt.margins(0.2)
plt.xlabel('Arrival delay (min)', size = 15)
<matplotlib.text.Text at 0x98958d0>
I have no idea how to make the plot above better; the differences between the data points are too small to be noticeable.
plt.title('Comparison of runtime of different methods', size = 16)
y = [1, 2, 3]
time_taken = [282.900/60., 181.094/60., 269.420/60.]
plt.plot(time_taken, y, 'o')
label = ['Freq table - Shell + R ', 'Python Pandas', 'FastCSVSample + R',]
plt.yticks( y, label, size = 12)
plt.margins(0.2)
plt.xlim(0,6)
plt.xlabel('Wall clock time (min)', size = 15)
After googling around, it seems there are also packages in R that can read csv files with specified columns, such as:
So it would be possible to do the computation purely in R without having to use the shell. It would be interesting to compare the speed of a "pure R" implementation with these packages against the "pure Python" implementation with pandas and numpy. It concerns me that R seems to pass by value if I try the same approach as in my Python code for appending the data frames; the memory usage would definitely be higher than in my Python implementation. Or maybe I would just have to write the R version in a different style from my Python code.
How did Duncan download the data for use in this homework? http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
Short answer: download and start reading Duncan's new book from p. 333 of http://link.springer.com/book/10.1007%2F978-1-4614-7900-0
Long answer: I could guesstimate how to fill in the dynamic form on that website with some scripts and pull the links from those forms for downloading the data. But I will just get back to working on the other parts of the homework...
(I do not like how the cat command messes with the indentation of my bash script, but here it is)
!cat ../freq_count.sh
#!/bin/bash
#---------------------------------------------------------------------------
# Author: Karen Ng
# This script
# * reads in airline csv data files
# * finds the column that corresponds to delay time
# * cut the column then count the frequencies of a certain delay time
# * returns frequency count in files named 1.txt 2.txt etc.
# * the frequency count files still has header and NA values mixed inside
# * have to use R or other language to discard those
# version 1: no growing arrays but have slow I/O
#---------------------------------------------------------------------------
shopt -s nullglob
dir="./data"

# store all file names in a bash array
files=( "$dir"/*.csv )
fileno=${#files[*]}

# check for csv input error
if [ ${fileno} -eq 0 ]; then
  echo "--------SHELL SCRIPT ERROR-----------------------------"
  echo "ERROR: no suitable *.csv file found in $PWD/data"
  echo "For this script to work, put *.csv files in $PWD/data"
  echo "-------------------------------------------------------"
  exit 0;
fi

# loop through the files one by one
for ((j=0; j < ${fileno}; j++)); do
  # look at the header, split them by "," then store in array
  arr=($(head -1 ${files[$j]} | tr "," "\n"))
  arrLen=${#arr[*]}
  col=-99

  # loop through array of headers and count which column it is in
  for ((i=0; i < ${arrLen}; i++)); do
    x=${arr[$i]}
    if [ $x == "ArrDelay" ] || [ $x == "\"ARR_DELAY\"" ]; then
      # definition of column count is off by 1 between bash array and cut
      col=$i
      (( col += 1 ))
      # if it is a monthly csv, have to add two to the column count for the
      # column value to be fetched correctly by cut.
      # This is because some values in the monthly CSV contain commas
      if [ $x == "\"ARR_DELAY\"" ]; then
        (( col += 2 ))
      fi
      break
    fi
  done

  #echo "Writing the file-$j delay time to freq_count.txt"
  # store the frequency count of all the csv files in one single file
  # possible time improvement:
  # use an array to hold these instead of writing it to a file first
  # the col-finding and the writing out takes around 1 min 8 s real time
  if [ $j -eq 0 ]; then
    cut -d',' -f${col} ${files[$j]} > freq_count.txt
  else
    cut -d',' -f${col} ${files[$j]} >> freq_count.txt
  fi
done

# pipe the content of frequency count to sed
# use sed to remove the trailing decimal places for some delay time entries
# sort them then find unique counts
# remove header --- could have just used grep -v
cat freq_count.txt | sed -E 's/([0-9]+)\.00/\1/g' |\
sort -n | uniq -c |\
sed -e '/ArrDelay/d' -e '/ARR_DEL15/d'
# can output to a text file to check values
#> sorted_freq.txt
!cat ../method1.R
#----------------------------------------------------------------------
# Author: Karen Ng
# Purpose:
# reads in a two column file with frequency of each entry and entry value
#----------------------------------------------------------------------
run <- function(){
  col.names <- c('freq', 'delay')
  print("First method: compute_stat.R: ")
  print("computing and reading in sorted frequency table from bash script")
  print("This takes ~5 mins for a 3.5 GHz machine with sufficient RAM")

  # call the shell script for cutting columns and doing frequency count
  DF <- try(read.table(pipe("./freq_count.sh"), col.names = col.names,
                       fill = TRUE), silent=TRUE)
  if (class(DF) == "try-error"){
    print("ERROR: failed to read in $PWD/data/*.csv")
    print("USAGE: place data in $PWD/data/*.csv for this script to work")
    q("no", 1, FALSE)
  }

  # REMOVE NAN!!!
  DF <- na.omit(DF)

  print("compute_stat.R")
  print("compute_stat.R: computing total frequencies")
  w.total <- sum(DF[['freq']])
  print("total number of valid total frequency count")
  print(w.total)

  print("compute_stat.R: computing mean")
  t.mean <- sum(DF[['freq']] * (DF[['delay']] / w.total), na.rm = TRUE)

  print("compute_stat.R: computing median")
  i <- 1
  Sum <- DF[['freq']][i]
  medianFreqCount <- floor(w.total / 2)
  ## sorry, don't know better than to write a loop...
  while(Sum < medianFreqCount) {
    i <- i + 1
    # this vectorized operation
    Sum <- sum(DF[['freq']][1:i], na.rm = TRUE)
    # is faster than
    ## if ( !is.na(DF[['freq']][i]) ) {
    ##   Sum <- Sum + DF[['freq']][i]
    ## }
  }
  ## check for the corner case, or else the median may be off
  if( Sum == medianFreqCount && w.total %% 2 == 0){
    # print("going through special case")
    t.median <- (DF[['delay']][i] + DF[['delay']][i+1]) / 2
  }else{
    t.median <- DF[['delay']][i]
  }

  print("compute_stat.R: computing standard dev.")
  # after reading John Cook's entry on the computation of std. dev.
  # I decided to go with the (two-pass) direct method because it can be
  # written entirely in vectorized form
  std.dev <- sqrt(sum(DF[['freq']] * (DF[['delay']] - t.mean) ^ 2 /
                      (w.total - 1)))

  results1 <- c(t.mean, t.median, std.dev)
  #print(results1)
  results1
}

time.method1 <- system.time(results1 <- run())
save(results1, time.method1, file="results1.rda")
!cat ../compute_stat.py
#!/usr/bin/env python
import pandas as pd
import numpy as np

data_path = "./data/"

# First read in the data from 1987 to 2007
year = [ str(i) for i in range(1987, 2008) ]

# create empty dataframe
delay1 = pd.DataFrame()

# loop through the year-by-year csvs
for yr in year:
    # read in relevant column from csv file using pandas
    file = yr + '.csv'
    temp = pd.read_csv(data_path + file, usecols=["ArrDelay"])
    # append the dataframes (note: append returns a new frame each time)
    delay1 = delay1.append(temp)
    print 'appending ' + file + ' - total lines = ' + \
        '{0}'.format(delay1.shape[0])

# create another empty dataframe for handling month by month csv
delay2 = pd.DataFrame()
year = [ str(i) for i in range(2008, 2013) ]
month = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
         'August', 'September', 'October', 'November', 'December']

# loop through all the month-by-month csv
for yr in year:
    for mth in month:
        file = yr + '_' + mth + '.csv'
        # tell pandas to read only the relevant column in the csv
        temp = pd.read_csv(data_path + file, usecols=["ARR_DELAY"])
        # append them to the dataframe
        delay2 = delay2.append(temp)
        print 'appending ' + file + ' - total lines = ' + \
            '{0}'.format(delay2.shape[0] + delay1.shape[0])

# hackish way to remove the column name of the dataframe to append
# the two types of csv columns together
# so I can compute the statistics in one pass later on
delay1 = np.array(delay1)
delay2 = np.array(delay2)
delay = np.append(delay1, delay2)
delay = pd.DataFrame(delay)
print 'total number of valid lines = {0}'.format(delay.dropna().shape[0])

# note that pandas ignores nans automatically while computing stats
#print 'mean = {0} \n'.format(delay[0].mean()) +\
#      'median = {0} \n'.format(delay[0].median()) +\
#      'std. dev. = {0}\n'.format(delay[0].std())

print 'saving to results2.txt'
f = open('results2.txt', 'w')
f.write('mean = {0}\n'.format(delay[0].mean()))
f.write('median = {0}\n'.format(delay[0].median()))
f.write('std = {0}\n'.format(delay[0].std()))
f.close()
In this Python code I don't even have to do exception handling, since pandas is so smart.
!cat ../method2.R
#----------------------------------------------------------------------
# Author: Karen Ng
# Purpose:
# R wrapper around python code so there would be consistent profiling
#----------------------------------------------------------------------
time.method2 <- system.time(system("./compute_stat.py"))

# read in results from python script
con = file("./results2.txt", "r")
results2 <- readLines(con)
save(results2, time.method2, file="results2.rda")
In this method I forked Duncan's FastCSVSample R package and added some R code (which might have made it not so fast).
Since we are sampling from files of quite different sizes, it is important that we sample approximately the same fixed percentage of lines from each file.
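A sketch of that bookkeeping (the file names and line counts below are hypothetical): take the same fraction of line numbers from every file, whatever its size:

import random

sample_percent = 0.01
file_lengths = {'1987.csv': 1311826, '2008_January.csv': 605765}   # made up
for fname, n_lines in sorted(file_lengths.items()):
    n_sample = int(n_lines * sample_percent)
    rows = sorted(random.sample(xrange(1, n_lines + 1), n_sample))
    print fname, ': sampling', n_sample, 'of', n_lines, 'lines'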
!cat ../method3.R
#----------------------------------------------------------------------
# Author: Karen Ng <karenyng@ucdavis.edu>
# Method 3.R
# Purpose: test what percentage we need to sample for us to be able to
# compute statistics that are close enough to those of an exact approach
#----------------------------------------------------------------------
require(NotSoFastCSVSample)

run <- function() {
  # initialize some options
  dataDir <- "./data/"
  samplePercent <- .01
  files <- list.files(path = dataDir, pattern = "*.csv")

  # could actually run these in parallel by splitting the files in half
  # for a dual core machine and having each cpu run half of the files
  # hold all the sampled values in one growing vector ...
  # this might be stupid but I don't care, I want to be done
  for(i in 1:length(files)){
    filepath <- paste(dataDir, files[i], sep = "")
    filelength <- getNumLines(filepath)
    sampleNo <- as.integer(floor(filelength * samplePercent))

    # try to tell the month-by-month csv apart from the year-by-year csv
    if (grepl("([0-9_]+)([A-Za-z]+).csv", files[i])) {
      print(paste("going through mnth-by-mnth csv", files[i]))
      col.Name <- "\"ARR_DELAY\""
    } else {
      print(paste("going through yr-by-yr csv", files[i]))
      col.Name <- "ArrDelay"
    }

    # kludgy way of suppressing warnings about NAs ...
    Iduncare <- capture.output(temp <- csvSample(filepath, n = sampleNo,
                                                 colName = col.Name))

    # start the vector if this is the first file being processed,
    # otherwise append to it
    if(i == 1) {
      data <- temp
    } else {
      data <- c(data, temp)
    }
  }

  results3 <- c(mean(data, na.rm = T), median(data, na.rm = T),
                sd(data, na.rm = T))
}

time.method3 <- system.time(results3 <- run())
save(results3, time.method3, file="results3.rda")
!cat ../NotSoFastCSVSample/R/csvSample.R
#@example tt = csvSample("~/Data/Airline/Airlines/2002.csv", 100)
# read.csv(textConnection(tt), header = FALSE)
csvSample = function(file, n, rows = sample(1:numRows, n),
                     numRows = getNumLines(file), randomize = FALSE,
                     header = TRUE, colName = NULL)
{
  file = path.expand(file)
  if(!file.exists(file))
    stop("file does not exist")

  rows = sort(as.integer(rows))
  if(header)
    rows = rows + 1L

  ans = .Call("R_csv_sample", file, rows)
  names(ans) = rows

  if(header & !is.null(colName))
    ans <- getCol(file, colName, ans)

  if(randomize)
    sample(ans)
  else
    ans
}

getNumLines = function(file)
{
  txt = system(sprintf("wc -l %s", file), intern = TRUE)
  as.integer(gsub("^ ?([0-9]+) .*", "\\1", txt))
}

getColNum = function(file, colName)
{
  colNum <- -99L
  csvHeader <- system(sprintf('head -1 %s | tr "," "\n"', file), intern = TRUE)
  for(i in 1:length(csvHeader)) {
    if(csvHeader[[i]] == colName)
    # not sure if this is the best way to do string comparison in R
    #if(grepl(csvHeader[[i]], colName) & !grepl("^$", csvHeader[[i]]))
    {
      colNum <- i
      #print(paste("found ", colName, " at column", colNum))
      break
    }
  }
  if(colNum == -99L)
    stop(paste(colName, "does not exist in the csv header"))
  colNum
}

getCol = function(file, colName, ans)
{
  colNum <- getColNum(file, colName)
  ans <- extractCol(colNum, ans, colName)
}

extractCol = function(colNum, ans, colName)
# have not found time to write a fancy one that would match the double quotes
# in the csv intelligently
# if there is time, should write one that compares consecutive elements
# in the split list to see if \" and \" has been split between two elements
# then join them back together
# ideally this should also be written in C or C++ for speed
{
  splitAns <- strsplit(ans, ",")
  if (colName == '\"ARR_DELAY\"') {
    splitAns <- sapply(splitAns, "[", c(colNum + 2L))
    splitAns <- gsub("*NA*", "NaN", splitAns)
  } else if(colName == "ArrDelay") {
    splitAns <- sapply(splitAns, "[", c(colNum))
    splitAns <- gsub("\"([-0-9]+)\"", "\\1", splitAns)
    # replace all the blank entries with NaN
    splitAns <- gsub("^$", "NaN", splitAns)
  }
  # trying to make R stop complaining about NAs introduced by coercion
  #splitAns <- as.data.frame(splitAns)
  #splitAns <- na.omit(splitAns)
  splitAns <- sapply(splitAns, as.integer)
}