Author: Karen Yin-Yee Ng karenyng@ucdavis.edu
Stat 250 Winter 2014 with Prof. Duncan Temple Lang
Assignment 1
Written with an IPython notebook (v 1.1.0). GitHub repository: https://github.com/karenyng/HW1_Stat250_Winter14/tree/master
We benchmark different approaches for computing the statistics of airline delays between 1987 and 2012.
Design considerations: due to the large size of the data, we have to:
avoid holding all the data in memory at the same time, and extract only the relevant columns
avoid numerical instabilities
make it fast, avoiding expensive operations such as copying data or I/O to the hard disk
Other possible considerations if this were made into production code:
A major consideration if this is only to be used once .........
Pros:
(other than probably Python, for me)
Cons:
Other thoughts about implementation:
I am not going to write the code in the most general / extensible / parallelizable way; instead I focus on getting an estimate of the answer from this first method.
I usually write code as functions and use OOP concepts, but I don't have time for that here.
Observation: there are two types of files with two different headers.
Extracting columns from the csv files is quite straightforward, except that in the month-by-month csv some values contain "," in them, so the naive use of cut with "," as the delimiter fails to fetch the correct column. My quick and dirty solution is to increment the column count by 2 as a correction.
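To see concretely why the naive split fails, here is a minimal Python sketch (the sample line is made up, not taken from the real data); a quote-aware parser such as the csv module gets it right:

import csv
from StringIO import StringIO

# a made-up line in the style of the month-by-month csv files:
# the quoted city field contains a comma
line = '2008,"January","Washington, DC",-5\n'

# a naive split miscounts the fields: the quoted city is broken in two
print line.strip().split(',')
# -> ['2008', '"January"', '"Washington', ' DC"', '-5']

# the csv module respects the double quotes and returns the intended 4 fields
print csv.reader(StringIO(line)).next()
# -> ['2008', 'January', 'Washington, DC', '-5']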
I made sure that the functions work on individual files before looping over more files. The test code is in 'exam_col.sh' and 'test.ipynb'.
On second thought, I should have written test code in Python or R that writes out the mean, median and standard deviation for each of the csv files, then used my own code to write output in the same format, and simply run diff in the shell to check for differences, covering as many exceptional cases as possible.
Right now I am just manually comparing statistics of individual files and writing out fake data for testing.
I really should have done a better job planning for unit tests...
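A sketch of what that regression test could look like, using pandas for the reference statistics (the file layout and output format here are my own invention):

import glob
import pandas as pd

# write one line of reference stats per year-by-year csv; my own code would
# write a file in the same format, and `diff` then flags any mismatch
with open('reference_stats.txt', 'w') as out:
    for fname in sorted(glob.glob('./data/[0-9][0-9][0-9][0-9].csv')):
        col = pd.read_csv(fname, usecols=['ArrDelay'])['ArrDelay']
        out.write('%s %.6f %.6f %.6f\n' %
                  (fname, col.mean(), col.median(), col.std()))
# then in the shell: diff reference_stats.txt my_stats.txt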
When I write in a scripting language, I usually try to play to the language's strengths. Since R is not particularly strong in string manipulation, or "developers of R did not have string manipulation in mind when they designed R" (Temple Lang 2014), I try to keep all string operations within the shell script.
To get a sense of the order of magnitude of the numbers that we are computing:
Of course these stats were obtained after I had run the data through my scripts. On second thought, I should have estimated the properties of the data before deciding on what languages / tools to use for this homework.
For R: since R stores the ArrDelay field as doubles by default and the data take integer form, we really shouldn't have to worry too much about numerical inaccuracies such as overflow (when summing numbers), underflow (dividing by a large number) or cancellation error (subtraction). I argue that we do not need more rigorous algorithms for computing the statistics of ArrDelay with the frequency table approach. If we compute the statistics with other methods, we may have to worry more, especially when computing the standard deviation.
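As a quick aside, here is a small numerical illustration (with made-up numbers, not the airline data) of the cancellation error that a naive one-pass variance formula can suffer, which the frequency table approach sidesteps:

import numpy as np

# data with a large mean relative to its spread; the true sd is ~5.48
x = 1e9 + np.array([4., 7., 13., 16.])
n = float(len(x))

# textbook one-pass formula: subtracts two huge, nearly equal numbers
naive_var = (np.sum(x ** 2) - n * np.mean(x) ** 2) / (n - 1)

# two-pass formula: subtract the mean first, then square
two_pass_var = np.sum((x - np.mean(x)) ** 2) / (n - 1)

# the naive result is garbage (it can even come out negative)
print naive_var, two_pass_var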
Is there any danger of loss of accuracy with this naive formula?
$$ \bar{t} = \frac{\sum_i^n t_i \, \omega_i}{\omega_{total}} $$
where $t_i$ denotes the delay time and $\omega_i$ denotes the corresponding count for that $t_i$. The numerator could overflow if we were summing many large positive numbers, but some of the numbers in the summation are negative, which keeps the partial sums in check.
As a safety precaution I will compute the mean using
$$ \bar{t} = \sum_i^n \frac{t_i \, \omega_i}{\omega_{total}} $$
Note that $t_i, \omega_i \ll \omega_{total}$; by multiplying before dividing, there shouldn't be any underflow problems. Also, I am using R's default sum() method, which uses partial sums, so I think it should be accurate enough.
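A minimal sketch of this mean computation (the toy frequency table below is made up; the real one comes from freq_count.sh):

import numpy as np

freq = np.array([120., 300., 80.])   # made-up counts, the omega_i
delay = np.array([-5., 0., 30.])     # the corresponding delay times t_i
w_total = freq.sum()

# divide inside the sum, mirroring the second formula above
t_mean = np.sum(freq * (delay / w_total))
print t_mean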
After reading John Cook's entry on computing the standard deviation, I decided to use an adapted version of the two-pass formula, because it can be written entirely in vectorized form.
I also tested whether the total frequency count is odd or even to avoid grabbing the wrong median.
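Continuing the toy frequency table from the sketch above, the median parity check and the two-pass standard deviation look roughly like this:

# running frequency count across the sorted delay values
csum = np.cumsum(freq)
half = int(w_total // 2)
i = int(np.argmax(csum >= half))   # first bin where the count reaches half

# if the half-count lands exactly on a bin boundary and the total is even,
# the median straddles two adjacent delay values
if csum[i] == half and w_total % 2 == 0:
    t_median = (delay[i] + delay[i + 1]) / 2.
else:
    t_median = delay[i]

# two-pass formula: reuse the already-computed mean, fully vectorized
t_sd = np.sqrt(np.sum(freq * (delay - t_mean) ** 2) / (w_total - 1))
print t_median, t_sd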
mean = 6.(566504) min
median = 0.(00000) min
standard deviation = 31.(556326) min
These should probably be reported to only 1 or 2 significant figures; the digits in parentheses are not really significant.
Eyeballing whether the answers make sense by plotting the frequency counts: maybe I should have made bar plots, but I am just trying to visualize this quickly.
Being able to visualize the results is a great strength of the frequency table approach.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# somehow pandas has trouble reading the sorted frequency table directly,
# so parse each line by hand
a = pd.read_table("sorted_freq.txt", names = ['col'])
c = []
for i in range(a.shape[0]):
    b = a['col'][i].split()
    # keep only well-formed (freq, delay) pairs, dropping NA and header rows
    if (len(b) == 2):
        if (b[1] != 'NA') and (b[1] != 'ArrDelay'):
            c.append([float(b[0]), float(b[1])])
c = np.array(c).transpose()
delay = c[1]
freq = c[0]

plt.plot(delay, freq, '.b')
plt.xlabel('Delay time (min)', size=14)
plt.ylabel('Frequency', size=14)
xlim_low = -100
xlim_high = 100
plt.title('Truncated view of frequency table from ' +
          '{0} to {1}'.format(xlim_low, xlim_high), size=15)
plt.xlim(xlim_low, xlim_high)
plt.axvline(6.566504, label='mean', color='r', ls=':', lw=3)
plt.axvline(0.00, label='median', color='g', ls='--', lw=3)
plt.legend(loc='best')
plt.plot(delay, freq, '.b')
plt.xlabel('Delay time (min)', size=14)
plt.ylabel('Frequency', size=14)
plt.title('Total view of frequency table', size=15)
plt.axvline(6.566504, label='mean', color='r', ls=':', lw=3)
plt.axvline(0.00, label='median', color='g', ls='--', lw=3)
plt.legend(loc='best')
It makes sense that the mean is slightly positive; this is explained by the longer tail on the positive end.
It also makes sense that the median is at zero: looking at the truncated plot, the positive tail is offset by the taller, fatter peak on the negative end.
The standard deviation is also plausible given the long tails.
This is probably not the fastest approach in terms of CPU time.
Possible improvement in the bash script: the script grabs the column from each csv and writes them all out to one big file before sorting and counting frequencies. I expect the I/O to the hard disk to be one of the bigger bottlenecks. Dynamic arrays in the shell could remedy this, but I am not familiar enough with how bash manages memory for growing arrays to say whether that would avoid heavy memory use.
The computation of the frequency table is another place for possible improvement. Currently I sort all 500 MB of ArrDelay column data before using "uniq" to compute the frequency table. If we implemented a fast hashing method, we could avoid holding all 500 MB of data at once. Although 500 MB should not be a limiting factor even for a laptop as old as 5 years, this is not an extensible solution if the data grow bigger.
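A sketch of the hashing idea, reading the same freq_count.txt that the bash script produces: a hash table of counts needs memory proportional to the number of distinct delay values, not to the 500 MB column:

from collections import Counter

counts = Counter()
with open('freq_count.txt') as f:
    for line in f:
        counts[line.strip()] += 1   # one streaming pass, no sort needed

# header and NA lines end up as keys too; filter them as in method1.R.
# sorting only happens at the end, over the few thousand distinct keys
for value, freq in sorted(counts.iteritems()):
    print freq, value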
A negligible amount of swap was used, as far as I could tell.
This implementation is probably the most straightforward, at only around 47 lines.
Python pandas is a data analysis package that imitates R syntax and has a C backend highly optimized for speed. I took a quick look at the relevant source code on GitHub. The relevant C / Cython code is scattered mainly in https://github.com/pydata/pandas/tree/master/pandas and https://github.com/pydata/pandas/tree/master/pandas/src
The actual "Python" implementation of the read_csv function looks like it is mostly in Python: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py BUT it is naive to think that read_csv is really written in Python, since it imports a bunch of wrapped C libraries to replace the original Python parts. For example, the imported pandas.lib https://github.com/pydata/pandas/blob/master/pandas/lib.pyx seems to replace Python data values such as integers, nans and infs with C variables and makes calls to the C standard libraries.
Here's an excerpt of parsers.py where the calls to the wrapped libraries are made:
import pandas.lib as lib
import pandas.tslib as tslib
import pandas.parser as _parser
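A quick way to feel the C backend at work (assuming a pandas version that exposes the engine keyword of read_csv, which lets you force the pure-Python parser for comparison):

import time
import pandas as pd

# time the same read with the C parser and with the pure-Python parser;
# the 'c' engine is typically several times faster
for engine in ('c', 'python'):
    t0 = time.time()
    pd.read_csv('./data/1987.csv', engine=engine)
    print engine, time.time() - t0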
Pros of this method:
Cons of this method:
   user  system elapsed
 89.872  16.604 185.723
The idea is to reduce the amount of data to be analyzed by sampling uniformly across all the csv files using https://github.com/duncantl/FastCSVSample.
Therefore, in the coding process it is important to ensure that (almost) the same percentage of lines is sampled from each csv file. Since, as a physics student, I do not usually do sampling, I would like to see what percentage of lines I need to sample in order to get results close to those of the exact method.
The total number of valid lines in all the files is $\sim 1.3 \times 10^8$, so even if we sample just $1\%$ of the lines, we still have to go over $\sim 10^6$ lines. My patience runs out for a computation that takes more than half an hour, so let's see how fast we need to be per line for me not to explode from impatience:
print 'each line should at most take '+\
'{0}s'.format( 30 * 60 / (1e6))
each line should at most take 0.0018s
And that is a long time for the computation of each line, which is a good sign, reassuring me that I am going to finish this on time: the R functions for grabbing the column and removing NaNs take only $\sim 10~\mu s$.
This is similar to method 1, but the number of lines should be even fewer, so I expect the total runtime to be less than 5 minutes unless I write some awfully slow loops.
It is fine to hold all the sampled lines in memory ($500~MB \times 1\% \approx 5~MB$) and just use R's default mean, median and standard deviation functions.
So I will piece all the sampled lines together into one big vector, then use the default functions.
Alternatively, do the statistics completely on the run, so we only have to go through the files once:
on the first pass through the data, use the recurrence formula provided by Duncan for the mean and standard deviation (a sketch of this kind of recurrence follows this list)
for the median, use a selection algorithm
but I really want to avoid writing loops in R...
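For reference, a minimal sketch of such a one-pass recurrence; I am using Welford's recurrence here, which may or may not be the same as the one Duncan provided:

def running_stats(values):
    # Welford's recurrence: update the mean and the sum of squared
    # deviations (M2) one value at a time, in a single pass
    n, mean, M2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        d = x - mean
        mean += d / n
        M2 += d * (x - mean)
    sd = (M2 / (n - 1)) ** 0.5 if n > 1 else float('nan')
    return mean, sd

print running_stats([-5., 0., 0., 30.])   # -> (6.25, ~16.01)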
mean = 6.(591345) min
median = 0.(000000) min
sd = 31.(465116) min
Sampling only 1% of the lines gives results that agree with the exact method to 2 significant figures.
Rscript "PATH TO GIT DIR"/methodN.R
replace N with the method number.
Thanks to an anonymous physicist in Portland who kindly allowed me to use his gaming desktop for the computations in this homework. He also helped me repair the partition table of my Linux partition and installed an HDD for my use.
y = [1, 2, 3]
result1 = [6.566504, 0.00000, 31.556326]
result2 = [6.56650421703, 0.0, 31.5563262623]
result3 = [6.591345, 0.000000, 31.465116]
plt.plot(result1, y, 'rs', markeredgewidth=1,
alpha = .25, markersize = 10, label="Method 1")
plt.plot(result2, y, 'gx', markeredgewidth=3,
alpha = .25, markersize = 10, label="Method 2")
plt.plot(result3, y, 'b*', markeredgewidth=1,
alpha = .25, markersize = 10, label="Method 3")
plt.legend(loc='lower right')
plt.title('Comparison of statistics from different approaches', size = 15)
label = ['mean', 'median', 'std. dev',]
plt.yticks( y, label, size = 15)
plt.margins(0.2)
plt.xlabel('Arrival delay (min)', size = 15)
<matplotlib.text.Text at 0x98958d0>
I have no idea how to make the plot above better; the differences between the data points are too small to be noticeable.
plt.title('Comparison of runtime of different methods', size = 16)
y = [1, 2, 3]
time_taken = [282.900/60., 181.094/60., 269.420/60.]
plt.plot(time_taken, y, 'o')
label = ['Freq table - Shell + R ', 'Python Pandas', 'FastCSVSample + R',]
plt.yticks( y, label, size = 12)
plt.margins(0.2)
plt.xlim(0,6)
plt.xlabel('Wall clock time (min)', size = 15)
After googling around, it seems there are also packages in R that can read csv files with specified columns, such as:
So it would be possible to do the computation purely in R without having to use the shell. It would be interesting to compare the speed of a "pure R" implementation with these packages against the "pure Python" implementation with pandas and numpy. It concerns me that R seems to pass by value if I try the same approach as in my Python code for appending the data frames; the memory usage would definitely be higher than in my Python implementation. Or maybe I would just have to write the R version in a different style from my Python code.
How did Duncan download the data for use in this homework? http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
Short answer: download and start reading Duncan's new book from p. 333 of http://link.springer.com/book/10.1007%2F978-1-4614-7900-0
Long answer: I could guesstimate how to fill in the dynamic form on that website with some scripts and pull the links from those forms for downloading the data. But I will just get back to working on the other parts of the homework...
(I do not like how the cat command messes with the indentation of my bash script, but here it is)
!cat ../freq_count.sh
#!/bin/bash
#---------------------------------------------------------------------------
# Author: Karen Ng
# This script
# * reads in airline csv data files
# * finds the column that corresponds to delay time
# * cut the column then count the frequencies of a certain delay time
# * returns frequency count in files named 1.txt 2.txt etc.
# * the frequency count files still has header and NA values mixed inside
# * have to use R or other language to discard those
# version 1: no growing arrays but have slow I/O
#---------------------------------------------------------------------------
shopt -s nullglob
dir="./data"

# store all file names in a bash array
files=( "$dir"/*.csv )
fileno=${#files[*]}

# check for csv input error
if [ ${fileno} -eq 0 ]; then
  echo "--------SHELL SCRIPT ERROR-----------------------------"
  echo "ERROR: no suitable *.csv file found in $PWD/data"
  echo "For this script to work, put *.csv files in $PWD/data"
  echo "-------------------------------------------------------"
  exit 0;
fi

# loop through the files one by one
for ((j=0; j < ${fileno}; j++)); do
  # look at the header, split them by "," then store in array
  arr=($(head -1 ${files[$j]} | tr "," "\n"))
  arrLen=${#arr[*]}
  col=-99

  # loop through array of headers and count which column it is in
  for ((i=0; i < ${arrLen}; i++)); do
    x=${arr[$i]}
    if [ $x == "ArrDelay" ] || [ $x == "\"ARR_DELAY\"" ]; then
      # definition of column count is off by 1 between bash array and cut
      col=$i
      (( col += 1 ))
      # if it is a monthly csv, have to add two to the column count for the
      # column value to be fetched correctly by cut.
      # This is because some values in the monthly CSV contain commas
      if [ $x == "\"ARR_DELAY\"" ]; then
        (( col += 2 ))
      fi
      break
    fi
  done

  #echo "Writing the file-$j delay time to freq_count.txt"
  # store the frequency count of all the csv files in one single file
  # possible time improvement:
  # use an array to hold these instead of writing it to a file first
  # the col-finding and the writing out takes around 1 min 8 s real time
  if [ $j -eq 0 ]; then
    cut -d',' -f${col} ${files[$j]} > freq_count.txt
  else
    cut -d',' -f${col} ${files[$j]} >> freq_count.txt
  fi
done

# pipe the content of frequency count to sed
# use sed to remove the trailing decimal places for some delay time entries
# sort them then find unique counts
# remove header --- could have just used grep -v
cat freq_count.txt | sed -E 's/([0-9]+)\.00/\1/g' |\
sort -n | uniq -c |\
sed -e '/ArrDelay/d' -e '/ARR_DEL15/d'
# can output to a text file to check values
#> sorted_freq.txt
!cat ../method1.R
#----------------------------------------------------------------------
# Author: Karen Ng
# Purpose:
# reads in a two column file with frequency of each entry and entry value
#----------------------------------------------------------------------
run <- function(){
  col.names <- c('freq', 'delay')
  print("First method: compute_stat.R: ")
  print("computing and reading in sorted frequency table from bash script")
  print("This takes ~5 mins for a 3.5 GHz machine with sufficient RAM")

  # call the shell script for cutting columns and doing frequency count
  DF <- try(read.table(pipe("./freq_count.sh"), col.names = col.names,
                       fill = TRUE), silent=TRUE)
  if (class(DF) == "try-error"){
    print("ERROR: failed to read in $PWD/data/*.csv")
    print("USAGE: place data in $PWD/data/*.csv for this script to work")
    q("no", 1, FALSE)
  }

  # REMOVE NAN!!!
  DF <- na.omit(DF)

  print("compute_stat.R")
  print("compute_stat.R: computing total frequencies")
  w.total <- sum(DF[['freq']])
  print("total number of valid total frequency count")
  print(w.total)

  print("compute_stat.R: computing mean")
  t.mean <- sum(DF[['freq']] * (DF[['delay']] / w.total), na.rm = TRUE)

  print("compute_stat.R: computing median")
  i <- 1
  Sum <- DF[['freq']][i]
  medianFreqCount <- floor(w.total / 2)
  ## sorry, don't know better than to write a loop...
  while(Sum < medianFreqCount) {
    i <- i + 1
    # this vectorized operation
    Sum <- sum(DF[['freq']][1:i], na.rm = TRUE)
    # is faster than
    ## if ( !is.na(DF[['freq']][i]) ) {
    ##   Sum <- Sum + DF[['freq']][i]
    ## }
  }
  ## check for the corner case, or else the median may be off
  if( Sum == medianFreqCount && w.total %% 2 == 0){
    # print("going through special case")
    t.median <- (DF[['delay']][i] + DF[['delay']][i+1]) / 2
  }else{
    t.median <- DF[['delay']][i]
  }

  print("compute_stat.R: computing standard dev.")
  # after reading John Cook's entry on the computation of std. dev.
  # I decided to go with the (two-pass) direct method because it can be
  # written entirely in vectorized form
  std.dev <- sqrt(sum(DF[['freq']] * (DF[['delay']] - t.mean) ^ 2 /
                      (w.total - 1)))

  results1 <- c(t.mean, t.median, std.dev)
  #print(results1)
  results1
}

time.method1 <- system.time(results1 <- run())
save(results1, time.method1, file="results1.rda")
!cat ../compute_stat.py
#!/usr/bin/env python
import pandas as pd
import numpy as np

data_path = "./data/"

# First read in the data from 1987 to 2007
year = [ str(i) for i in range(1987, 2008) ]

# create empty dataframe
delay1 = pd.DataFrame()

# loop through the year-by-year csvs
for yr in year:
    # read in relevant column from csv file using pandas
    file = yr + '.csv'
    temp = pd.read_csv(data_path + file, usecols=["ArrDelay"])
    # append the dataframes (note: append returns a new frame each time)
    delay1 = delay1.append(temp)
    print 'appending ' + file + ' - total lines = ' + \
        '{0}'.format(delay1.shape[0])

# create another empty dataframe for handling month by month csv
delay2 = pd.DataFrame()
year = [ str(i) for i in range(2008, 2013) ]
month = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
         'August', 'September', 'October', 'November', 'December']

# loop through all the month-by-month csv
for yr in year:
    for mth in month:
        file = yr + '_' + mth + '.csv'
        # tell pandas to read only the relevant column in the csv
        temp = pd.read_csv(data_path + file, usecols=["ARR_DELAY"])
        # append them to the dataframe
        delay2 = delay2.append(temp)
        print 'appending ' + file + ' - total lines = ' + \
            '{0}'.format(delay2.shape[0] + delay1.shape[0])

# hackish way to remove the column name of the dataframe to append
# the two types of csv columns together
# so I can compute the statistics in one pass later on
delay1 = np.array(delay1)
delay2 = np.array(delay2)
delay = np.append(delay1, delay2)
delay = pd.DataFrame(delay)
print 'total number of valid lines = {0}'.format(delay.dropna().shape[0])

# note that pandas ignores nans automatically while computing stats
#print 'mean = {0} \n'.format(delay[0].mean()) +\
#      'median = {0} \n'.format(delay[0].median()) +\
#      'std. dev. = {0}\n'.format(delay[0].std())

print 'saving to results2.txt'
f = open('results2.txt', 'w')
f.write('mean = {0}\n'.format(delay[0].mean()))
f.write('median = {0}\n'.format(delay[0].median()))
f.write('std = {0}\n'.format(delay[0].std()))
f.close()
In this Python code I don't even have to do exception handling, since pandas is so smart.
!cat ../method2.R
#----------------------------------------------------------------------
# Author: Karen Ng
# Purpose:
# R wrapper around python code so there would be consistent profiling
#----------------------------------------------------------------------
time.method2 <- system.time(system("./compute_stat.py"))

# read in results from python script
con = file("./results2.txt", "r")
results2 <- readLines(con)
save(results2, time.method2, file="results2.rda")
In this method I forked Duncan's FastCSVSample R package and added some R code (which might have made it not so fast).
Since we are sampling from files of quite different sizes, it is important that we sample approximately the same fixed percentage of lines from each file.
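A sketch of that bookkeeping (the file names and line counts below are hypothetical): take the same fraction of line numbers from every file, whatever its size:

import random

sample_percent = 0.01
file_lengths = {'1987.csv': 1311826, '2008_January.csv': 605765}   # made up
for fname, n_lines in sorted(file_lengths.items()):
    n_sample = int(n_lines * sample_percent)
    rows = sorted(random.sample(xrange(1, n_lines + 1), n_sample))
    print fname, ': sampling', n_sample, 'of', n_lines, 'lines'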
!cat ../method3.R
#----------------------------------------------------------------------
# Author: Karen Ng <karenyng@ucdavis.edu>
# Method 3.R
# Purpose: test what percentage we need to sample for us to be able to
# compute statistics that are close enough to those of an exact approach
#----------------------------------------------------------------------
require(NotSoFastCSVSample)

run <- function() {
  # initialize some options
  dataDir <- "./data/"
  samplePercent <- .01
  files <- list.files(path = dataDir, pattern = "*.csv")

  # could actually run these in parallel by splitting the files in half
  # for a dual core machine and having each cpu run half of the files
  # hold all the sampled values in one growing vector ...
  # this might be stupid but I don't care, I want to be done
  for(i in 1:length(files)){
    filepath <- paste(dataDir, files[i], sep = "")
    filelength <- getNumLines(filepath)
    sampleNo <- as.integer(floor(filelength * samplePercent))

    # try to tell the month-by-month csv apart from the year-by-year csv
    if (grepl("([0-9_]+)([A-Za-z]+).csv", files[i])) {
      print(paste("going through mnth-by-mnth csv", files[i]))
      col.Name <- "\"ARR_DELAY\""
    } else {
      print(paste("going through yr-by-yr csv", files[i]))
      col.Name <- "ArrDelay"
    }

    # kludgy way of suppressing warnings about NAs ...
    Iduncare <- capture.output(temp <- csvSample(filepath, n = sampleNo,
                                                 colName = col.Name))

    # start the vector if this is the first file being processed,
    # otherwise append to it
    if(i == 1) {
      data <- temp
    } else {
      data <- c(data, temp)
    }
  }

  results3 <- c(mean(data, na.rm = T), median(data, na.rm = T),
                sd(data, na.rm = T))
}

time.method3 <- system.time(results3 <- run())
save(results3, time.method3, file="results3.rda")
!cat ../NotSoFastCSVSample/R/csvSample.R
#@example tt = csvSample("~/Data/Airline/Airlines/2002.csv", 100)
# read.csv(textConnection(tt), header = FALSE)
csvSample = function(file, n, rows = sample(1:numRows, n),
                     numRows = getNumLines(file), randomize = FALSE,
                     header = TRUE, colName = NULL)
{
  file = path.expand(file)
  if(!file.exists(file))
    stop("file does not exist")

  rows = sort(as.integer(rows))
  if(header)
    rows = rows + 1L

  ans = .Call("R_csv_sample", file, rows)
  names(ans) = rows

  if(header & !is.null(colName))
    ans <- getCol(file, colName, ans)

  if(randomize)
    sample(ans)
  else
    ans
}

getNumLines = function(file)
{
  txt = system(sprintf("wc -l %s", file), intern = TRUE)
  as.integer(gsub("^ ?([0-9]+) .*", "\\1", txt))
}

getColNum = function(file, colName)
{
  colNum <- -99L
  csvHeader <- system(sprintf('head -1 %s | tr "," "\n"', file), intern = TRUE)
  for(i in 1:length(csvHeader)) {
    if(csvHeader[[i]] == colName)
    # not sure if this is the best way to do string comparison in R
    #if(grepl(csvHeader[[i]], colName) & !grepl("^$", csvHeader[[i]]))
    {
      colNum <- i
      #print(paste("found ", colName, " at column", colNum))
      break
    }
  }
  if(colNum == -99L)
    stop(paste(colName, "does not exist in the csv header"))
  colNum
}

getCol = function(file, colName, ans)
{
  colNum <- getColNum(file, colName)
  ans <- extractCol(colNum, ans, colName)
}

extractCol = function(colNum, ans, colName)
# have not found time to write a fancy one that would match the double quotes
# in the csv intelligently
# if there is time, should write one that compares consecutive elements
# in the split list to see if \" and \" has been split between two elements
# then join them back together
# ideally this should also be written in C or C++ for speed
{
  splitAns <- strsplit(ans, ",")
  if (colName == '\"ARR_DELAY\"') {
    splitAns <- sapply(splitAns, "[", c(colNum + 2L))
    splitAns <- gsub("*NA*", "NaN", splitAns)
  } else if(colName == "ArrDelay") {
    splitAns <- sapply(splitAns, "[", c(colNum))
    splitAns <- gsub("\"([-0-9]+)\"", "\\1", splitAns)
    # replace all the blank entries with NaN
    splitAns <- gsub("^$", "NaN", splitAns)
  }
  # trying to make R stop complaining about NAs introduced by coercion
  #splitAns <- as.data.frame(splitAns)
  #splitAns <- na.omit(splitAns)
  splitAns <- sapply(splitAns, as.integer)
}