Notebook

Data Science with Hadoop - predicting airline delays - part 1¶

Introduction¶

This IPython notebook complements and expounds the implementation details outlined in our series on Data Science and Hadoop part 1 blog, demonstrating how to apply data science techniques with Apache Hadoop.

In this first blog post we will demonstrate a step by step solution to a supervised learning problem, including:

How to use Python and Matplotlib to explore the raw dataset
How to use PIG with Python UDFs to transform our raw data into a feature matrix
How to use Python's excellent Scikit-learn machine learning library for building a predictive model.

So let's begin.

Pig and Python Can’t Fly But Can Predict Flight Delays¶

Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travellers and airlines. As our example use-case, we will build a supervised learning model that predicts airline delay from historial flight data and weather information.

Let's begin by exploring the airline delay dataset available here: http://stat-computing.org/dataexpo/2009/the-data.html This dataset includes details about flights in the US from the years 1987-2008. Every row in the dataset includes 29 variables:

	Name	Description
1	Year	1987-2008
2	Month	1-12
3	DayofMonth	1-31
4	DayOfWeek	1 (Monday) - 7 (Sunday)
5	DepTime	actual departure time (local, hhmm)
6	CRSDepTime	scheduled departure time (local, hhmm)
7	ArrTime	actual arrival time (local, hhmm)
8	CRSArrTime	scheduled arrival time (local, hhmm)
9	UniqueCarrier	unique carrier code
10	FlightNum	flight number
11	TailNum	plane tail number
12	ActualElapsedTime	in minutes
13	CRSElapsedTime	in minutes
14	AirTime	in minutes
15	ArrDelay	arrival delay, in minutes
16	DepDelay	departure delay, in minutes
17	Origin	origin
18	Dest	destination
19	Distance	in miles
20	TaxiIn	taxi in time, in minutes
21	TaxiOut	taxi out time in minutes
22	Cancelled	was the flight cancelled?
23	CancellationCode	reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24	Diverted	1 = yes, 0 = no
25	CarrierDelay	in minutes
26	WeatherDelay	in minutes
27	NASDelay	in minutes
28	SecurityDelay	in minutes
29	LateAircraftDelay	in minutes

To simplify, we will build a supervised learning model to predict flight delays for flights leaving O'Hare International airport (ORD), where we "learn" the model using data from 2007, and evaluate its performance using data from 2008.

But first, let's do some exploration of this dataset. Exploration is a common step in building a predictive model -- our goal is to better understand the data we have and get some clues as to which features might be good for the predictive model.

We start by importing some useful python libraries that we will need later like Pandas, Numpy, Scikit-learn and Matplotlib.

In [1]:

# Python library imports: numpy, random, sklearn, pandas, etc

import warnings
warnings.filterwarnings('ignore')

import sys
import random
import numpy as np

from sklearn import linear_model, cross_validation, metrics, svm
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

We now define a utility function to read an HDFS file into a Pandas dataframe using Pydoop. Pydoop is a package that provides a Python API for Hadoop MapReduce and HDFS.

Pydoop's hdfs.open() function reads a single file from HDFS. However many HDFS output files are actually multi-part files, so our read_csv_from_hdfs() function uses hdfs.ls() to grab all the needed file names, and then read each one separately. Finally, it concatenates the resulting Pandas dataframes of each file into a Pandas dataframe.

In [2]:

# function to read HDFS file into dataframe using PyDoop
import pydoop.hdfs as hdfs
def read_csv_from_hdfs(path, cols, col_types=None):
  files = hdfs.ls(path);
  pieces = []
  for f in files:
    fhandle = hdfs.open(f)
    pieces.append(pd.read_csv(fhandle, names=cols, dtype=col_types))
    fhandle.close()
  return pd.concat(pieces, ignore_index=True)

Great. Now we got the logistics out of the way, so let's explore this dataset further.

First, let's read the raw data for 2007 from HDFS into a Pandas dataframe. We use our utility function read_csv_from_hdfs() and provide it with column names since this is a raw file, not a HIVE table with meta-data. Let's see how it works:

In [3]:

# read 2007 year file
cols = ['year', 'month', 'day', 'dow', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'Carrier', 'FlightNum', 
        'TailNum', 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 
        'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay', 
        'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'];
flt_2007 = read_csv_from_hdfs('airline/delay/2007.csv', cols)

flt_2007.shape

Out[3]:

(7453216, 29)

We see 7.4M+ flights in 2007 and 29 variables.

Our "target" variable will be DepDelay (scheduled departure delay in minutes). To build a classifier, we further refine our target variable into a binary variable by defining a "delay" as having 15 mins or more of delay, and "non-delay" otherwise. We thus create a new binary variable that we name 'DepDelayed'.

Let's look at some basic statistics, after limiting ourselves to flights originating from ORD:

In [4]:

df = flt_2007[flt_2007['Origin']=='ORD'].dropna(subset=['DepDelay'])
df['DepDelayed'] = df['DepDelay'].apply(lambda x: x>=15)
print "total flights: " + str(df.shape[0])
print "total delays: " + str(df['DepDelayed'].sum())

total flights: 359169
total delays: 109346

Let's see how delayed flights are distributed by month:

In [5]:

# Select a Pandas dataframe with flight originating from ORD

# Compute average number of delayed flights per month
grouped = df[['DepDelayed', 'month']].groupby('month').mean()

# plot average delays by month
grouped.plot(kind='bar')

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fa749875990>

We see that the average number of delays is highest in December and February, which is what we would expect.

Now let's look at the hour-of-day:

In [6]:

# Compute average number of delayed flights by hour
df['hour'] = df['CRSDepTime'].map(lambda x: int(str(int(x)).zfill(4)[:2]))
grouped = df[['DepDelayed', 'hour']].groupby('hour').mean()

# plot average delays by hour of day
grouped.plot(kind='bar')

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fa6affb24d0>

A clear pattern here - flights tend to be delayed later in the day. Perhaps this is because delays tend to pile up as the day progresses and the problem tends to compound later in the day.

Now let's look at delays by carrier:

In [7]:

# Compute average number of delayed flights per carrier
grouped1 = df[['DepDelayed', 'Carrier']].groupby('Carrier').filter(lambda x: len(x)>10)
grouped2 = grouped1.groupby('Carrier').mean()
carrier = grouped2.sort(['DepDelayed'], ascending=False)

# display top 15 destination carriers by delay (from ORD)
carrier[:15].plot(kind='bar')

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fa6afed6310>

As expected, some airlines are better than others.

Pre-processing: using Hadoop to build a feature matrix¶

After exploring the data for a bit, we now move to building the feature matrix for our predictive model.

Let's look at possible predictive variables for our model:

month: winter months should have more delays than summer months
day of month: this is likely not a very predictive variable, but let's keep it in anyway
day of week: weekend vs. weekday
hour of the day: later hours tend to have more delays
Carrier: we might expect some carriers to be more prone to delays than others
Destination airport: we expect some airports to be more prone to delays than others
Distance: interesting to see if this variable is a good predictor of delay

We will also generate another feature: number of days from closest national holiday, with the assumption that holidays tend to be associated with more delays.

We implement this "feature generation" process using PIG and some simply Python user-defined-functions (UDFs). First, let's implement some Python UDFs:

#
# Python UDFs for our PIG script
#
from datetime import date

# get hour-of-day from HHMM field
@outputSchema("value: int")
def get_hour(val):
  return int(val.zfill(4)[:2])

# this array defines the dates of holiday in 2007 and 2008
holidays = [
        date(2007, 1, 1), date(2007, 1, 15), date(2007, 2, 19), date(2007, 5, 28), date(2007, 6, 7), date(2007, 7, 4), \
        date(2007, 9, 3), date(2007, 10, 8), date(2007, 11, 11), date(2007, 11, 22), date(2007, 12, 25), \
        date(2008, 1, 1), date(2008, 1, 21), date(2008, 2, 18), date(2008, 5, 22), date(2008, 5, 26), date(2008, 7, 4), \
        date(2008, 9, 1), date(2008, 10, 13), date(2008, 11, 11), date(2008, 11, 27), date(2008, 12, 25) \
     ]
# get number of days from nearest holiday
@outputSchema("days: int")
def days_from_nearest_holiday(year, month, day):
  d = date(year, month, day)
  x = [(abs(d-h)).days for h in holidays]
  return min(x)

Our PIG script is relatively simple:

Load the dataset (2007 or 2008)
Filter out flights that were cancelled or that are NOT originating in ORD
Project only variables that we want to use in the analysis
Generate the output feature matrix, using the Python UDFs

We can execute this script directly from IPython (the Python UDFs are separately stored in "util.py"):

In [8]:

%%writefile preprocess1.pig

Register 'util.py' USING jython as util;
DEFINE preprocess(year_str, airport_code) returns data
{
        -- load airline data from specified year (need to specify fields since it's not in HCat)
        airline = load 'airline/delay/$year_str.csv' using PigStorage(',') 
            as (Year: int, Month: int, DayOfMonth: int, DayOfWeek: int, DepTime: chararray, 
                CRSDepTime: chararray, ArrTime, CRSArrTime, Carrier: chararray, FlightNum, TailNum, ActualElapsedTime, 
                CRSElapsedTime, AirTime, ArrDelay, DepDelay: int, Origin: chararray, Dest: chararray, Distance: int, 
                TaxiIn, TaxiOut, Cancelled: int, CancellationCode, Diverted, CarrierDelay, WeatherDelay, 
                NASDelay, SecurityDelay, LateAircraftDelay);

        -- keep only instances where flight was not cancelled and originate at ORD
        airline_flt = filter airline by Cancelled == 0 and Origin == '$airport_code';

        -- Keep only fields I need
        $data = foreach airline_flt generate DepDelay as delay, Month, DayOfMonth, DayOfWeek, 
                                             util.get_hour(CRSDepTime) as hour, Distance, Carrier, Dest,
                                             util.days_from_nearest_holiday(Year, Month, DayOfMonth) as hdays;
};

ORD_2007 = preprocess('2007', 'ORD');
rmf airline/fm/ord_2007_1
store ORD_2007 into 'airline/fm/ord_2007_1' using PigStorage(',');

ORD_2008 = preprocess('2008', 'ORD');
rmf airline/fm/ord_2008_1
store ORD_2008 into 'airline/fm/ord_2008_1' using PigStorage(',');

Overwriting preprocess1.pig

Let's look at the output as the script continues to process...

In [9]:

%%bash --err pig_out --bg 
pig -f preprocess1.pig

Starting job # 0 in a separate thread.

In [10]:

while True:
    line = pig_out.readline()
    if not line: 
        break
    sys.stdout.write("%s" % line)
    sys.stdout.flush()

15/03/11 10:08:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/03/11 10:08:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/03/11 10:08:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-03-11 10:08:41,358 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0.2.2.0.0-2041 (rexported) compiled Nov 19 2014, 15:24:46
2015-03-11 10:08:41,359 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/demo/airline-demo/pig_1426093721356.log
2015-03-11 10:08:42,495 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/demo/.pigbootup not found
2015-03-11 10:08:42,770 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ds-master.cloud.hortonworks.com:8020
2015-03-11 10:08:43,872 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_3861695012397846592
2015-03-11 10:08:45,674 [main] WARN  org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
2015-03-11 10:08:47,988 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.get_hour
2015-03-11 10:08:47,990 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.to_date
2015-03-11 10:08:47,992 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.days_from_nearest_holiday
2015-03-11 10:08:52,833 [main] INFO  org.apache.pig.scripting.jython.JythonFunction - Schema 'value: int' defined for func get_hour
2015-03-11 10:08:52,994 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file
2015-03-11 10:08:54,283 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2015-03-11 10:08:54,656 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-03-11 10:08:54,727 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2015-03-11 10:08:54,783 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_0: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28
2015-03-11 10:08:55,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-03-11 10:08:55,774 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-03-11 10:08:55,777 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-03-11 10:08:57,494 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:08:57,703 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:08:58,629 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2015-03-11 10:08:58,644 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-03-11 10:08:58,684 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2015-03-11 10:08:59,405 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp-1751751362/tmp1493467348/pig-0.14.0.2.2.0.0-2041-core-h2.jar
2015-03-11 10:09:01,690 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp-1751751362/tmp2080441925/jython-standalone-2.5.3.jar
2015-03-11 10:09:01,908 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1751751362/tmp518987527/automaton-1.11-8.jar
2015-03-11 10:09:02,064 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1751751362/tmp524180625/antlr-runtime-3.4.jar
2015-03-11 10:09:02,658 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1751751362/tmp-459080211/guava-11.0.2.jar
2015-03-11 10:09:02,702 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp-1751751362/tmp-1371032075/joda-time-2.5.jar
2015-03-11 10:09:02,800 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp-1751751362/tmp344686184/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar
2015-03-11 10:09:02,862 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-03-11 10:09:02,877 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2015-03-11 10:09:02,878 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2015-03-11 10:09:02,879 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2015-03-11 10:09:02,978 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-03-11 10:09:03,230 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:09:03,232 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:09:04,006 [JobControl] WARN  org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2015-03-11 10:09:04,112 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:09:04,114 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:09:04,354 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6
2015-03-11 10:09:04,896 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:6
2015-03-11 10:09:05,411 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1424904779802_0602
2015-03-11 10:09:06,317 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2015-03-11 10:09:06,452 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1424904779802_0602
2015-03-11 10:09:06,722 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1424904779802_0602/
2015-03-11 10:09:06,724 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1424904779802_0602
2015-03-11 10:09:06,725 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2007,macro_preprocess_airline_0,macro_preprocess_airline_flt_0
2015-03-11 10:09:06,726 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_airline_0[6,18],macro_preprocess_airline_0[-1,-1],macro_preprocess_airline_flt_0[14,22],ORD_2007[17,19] C:  R: 
2015-03-11 10:09:06,746 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-03-11 10:09:06,747 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:09:46,097 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 6% complete
2015-03-11 10:09:46,100 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:09:54,122 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete
2015-03-11 10:09:54,125 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:10:01,145 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 15% complete
2015-03-11 10:10:01,146 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:10:06,160 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete
2015-03-11 10:10:06,160 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:10:14,179 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 25% complete
2015-03-11 10:10:14,180 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:10:19,192 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 29% complete
2015-03-11 10:10:19,192 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:10:26,209 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
2015-03-11 10:10:26,210 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:10:37,236 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 38% complete
2015-03-11 10:10:37,237 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:10:44,255 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 43% complete
2015-03-11 10:10:44,255 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:10:54,282 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 47% complete
2015-03-11 10:10:54,283 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602]
2015-03-11 10:11:07,523 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:11:07,526 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:11:07,539 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:11:08,434 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:11:08,436 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:11:08,446 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:11:08,788 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:11:08,790 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:11:08,799 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:11:08,929 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-03-11 10:11:08,933 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.6.0.2.2.0.0-2041	0.14.0.2.2.0.0-2041	demo	2015-03-11 10:08:58	2015-03-11 10:11:08	FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTime	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1424904779802_0602	6	0	106	50	85	94	0	0	0	0	ORD_2007,macro_preprocess_airline_0,macro_preprocess_airline_flt_0	MAP_ONLY	hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_1,

Input(s):
Successfully read 7453216 records (703535971 bytes) from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2007.csv"

Output(s):
Successfully stored 359169 records (9421186 bytes) in: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_1"

Counters:
Total records written : 359169
Total bytes written : 9421186
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1424904779802_0602


2015-03-11 10:11:09,102 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:11:09,104 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:11:09,112 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:11:09,362 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:11:09,364 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:11:09,373 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:11:09,575 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:11:09,577 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:11:09,587 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:11:09,650 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 160755 time(s).
2015-03-11 10:11:09,652 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-03-11 10:11:09,790 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file
2015-03-11 10:11:10,146 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2015-03-11 10:11:10,189 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-03-11 10:11:10,191 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2015-03-11 10:11:10,204 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_1: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28
2015-03-11 10:11:10,333 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-03-11 10:11:10,340 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-03-11 10:11:10,341 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-03-11 10:11:10,519 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:11:10,521 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:11:10,529 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2015-03-11 10:11:10,534 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-03-11 10:11:10,537 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2015-03-11 10:11:11,086 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp-1751751362/tmp-2067043078/pig-0.14.0.2.2.0.0-2041-core-h2.jar
2015-03-11 10:11:11,264 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp-1751751362/tmp-908608899/jython-standalone-2.5.3.jar
2015-03-11 10:11:11,340 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1751751362/tmp560249020/automaton-1.11-8.jar
2015-03-11 10:11:11,385 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1751751362/tmp884892759/antlr-runtime-3.4.jar
2015-03-11 10:11:11,437 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1751751362/tmp-2092270325/guava-11.0.2.jar
2015-03-11 10:11:11,502 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp-1751751362/tmp-431570757/joda-time-2.5.jar
2015-03-11 10:11:11,540 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp-1751751362/tmp229632211/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar
2015-03-11 10:11:11,570 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-03-11 10:11:11,573 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2015-03-11 10:11:11,574 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2015-03-11 10:11:11,574 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2015-03-11 10:11:11,628 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-03-11 10:11:11,790 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:11:11,792 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:11:11,827 [JobControl] WARN  org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2015-03-11 10:11:11,902 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:11:11,903 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:11:11,909 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6
2015-03-11 10:11:12,026 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:6
2015-03-11 10:11:12,145 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1424904779802_0603
2015-03-11 10:11:12,155 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2015-03-11 10:11:12,223 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1424904779802_0603
2015-03-11 10:11:12,230 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1424904779802_0603/
2015-03-11 10:11:12,231 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1424904779802_0603
2015-03-11 10:11:12,231 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2008,macro_preprocess_airline_1,macro_preprocess_airline_flt_1
2015-03-11 10:11:12,231 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_airline_1[6,18],macro_preprocess_airline_1[-1,-1],macro_preprocess_airline_flt_1[14,22],ORD_2008[17,19] C:  R: 
2015-03-11 10:11:12,241 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-03-11 10:11:12,242 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:11:44,403 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 6% complete
2015-03-11 10:11:44,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:11:49,419 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 11% complete
2015-03-11 10:11:49,421 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:11:57,444 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
2015-03-11 10:11:57,446 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:12:02,462 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 21% complete
2015-03-11 10:12:02,463 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:12:09,484 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 26% complete
2015-03-11 10:12:09,484 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:12:14,501 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 32% complete
2015-03-11 10:12:14,501 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:12:19,518 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 36% complete
2015-03-11 10:12:19,518 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:12:24,532 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 41% complete
2015-03-11 10:12:24,532 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:12:31,552 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 46% complete
2015-03-11 10:12:31,552 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603]
2015-03-11 10:12:42,755 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:12:42,758 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:12:42,773 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:12:43,041 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:12:43,043 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:12:43,051 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:12:43,269 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:12:43,271 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:12:43,282 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:12:43,329 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-03-11 10:12:43,331 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.6.0.2.2.0.0-2041	0.14.0.2.2.0.0-2041	demo	2015-03-11 10:11:10	2015-03-11 10:12:43	FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTime	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1424904779802_0603	6	0	73	23	59	69	0	0	0	0	ORD_2008,macro_preprocess_airline_1,macro_preprocess_airline_flt_1	MAP_ONLY	hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_1,

Input(s):
Successfully read 7009729 records (690071122 bytes) from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2008.csv"

Output(s):
Successfully stored 335330 records (8795284 bytes) in: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_1"

Counters:
Total records written : 335330
Total bytes written : 8795284
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1424904779802_0603


2015-03-11 10:12:43,483 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:12:43,484 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:12:43,496 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:12:43,706 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:12:43,708 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:12:43,716 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:12:43,917 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:12:43,919 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:12:43,929 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:12:43,978 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 136253 time(s).
2015-03-11 10:12:43,980 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-03-11 10:12:44,011 [main] INFO  org.apache.pig.Main - Pig script completed in 4 minutes, 2 seconds and 838 milliseconds (242838 ms)

Now that PIG finished processing, we have two new file generated:

airline/fm/ord_2007_1
airline/fm/ord_2008_1

(the "1" indicates this is the first iteration; we will work on a second iteration later).

PIG is great for pre-procesing raw data into a feature matrix, but it's not the only choice. We can use other tools such as HIVE, Cascading, Scalding or Spark for this type of pre-processing. We will show how to do the same type of pre-processing using Spark in the second part of this blog post series.

Iteration #1: building a Logistic Regression and Random Forest models¶

Now we have the files ord_2007_1 and ord_2008_1 under 'airline/fm' folder in HDFS. Let's read those files into Python, and prepare the training and testing (validation) datasets as Pandas DataFrame objects.

Initially, we use only the numerical variables:

In [11]:

# read files
cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday']
col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, 
             'carrier': str, 'dest': str, 'days_from_holiday': int}
data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_1', cols, col_types)
data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_1', cols, col_types)

# Create training set and test set
cols = ['month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday']
train_y = data_2007['delay'] >= 15
train_x = data_2007[cols]

test_y = data_2008['delay'] >= 15
test_x = data_2008[cols]

print train_x.shape

(359169, 6)

So we have ~359K rows and 6 features in our model.

Now we use Python's excellent Scikit-learn machine learning package to to build two predictive models (Logistic regression and Random Forest) and compare their performance. First we print the confusion matrix, which counts the true positive, true negatives, false positives and false negatives. Then from the confusion matrix, we compute precision, recall, F1 metric and accuracy. Let's start with a logistic regression model and evaluate its performance on the testing dataset.

In [12]:

# Create logistic regression model with L2 regularization
clf_lr = linear_model.LogisticRegression(penalty='l2', class_weight='auto')
clf_lr.fit(train_x, train_y)

# Predict output labels on test set
pr = clf_lr.predict(test_x)

# display evaluation metrics
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_lr = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
        (report_lr[0], report_lr[1], report_lr[2], accuracy_score(list(test_y), list(pr)))

Confusion matrix
        0      1
0  143858  96036
1   36987  58449

precision = 0.38, recall = 0.61, F1 = 0.47, accuracy = 0.60

Our logistic regression model got overall accuracy of 60%. Now let's try Random Forest:

In [13]:

# Create Random Forest classifier with 50 trees
clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
clf_rf.fit(train_x, train_y)

# Evaluate on test set
pr = clf_rf.predict(test_x)

# print results
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_svm = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
        (report_svm[0], report_svm[1], report_svm[2], accuracy_score(list(test_y), list(pr)))

Confusion matrix
        0      1
0  197890  42004
1   65978  29458

precision = 0.41, recall = 0.31, F1 = 0.35, accuracy = 0.68

As we can see, Random Forest has overall better accuracy, but lower F1 score. For our problem -- we are trying to predict delays, so the higher level of true positives (197K vs. 143K) is better.

With any supervised learnign algorithm, one typically needs to choose values for the parameters of the model. For example, we chose "L1" regularization for the logistic regression model, and 50 trees for the Random Forest. Such choices are based on some experimentation and hyperparameter tuning (http://en.wikipedia.org/wiki/Hyperparameter_optimization). We are not addressing this topic in this demo, although such choices are important to achieve the overall best model.

Improving our predictive model with "One Hot Encoding" - Iteration #2¶

It is very common in data science to work iteratively, and improve the model with each iteration. Let's see how this works.

In this iteration, we improve our feature by converting existing variables that are categorical in nature (such as "hour", or "month") as well as categorical variables that are strings (like "carrier" and "dest"), into what is known as "dummy variables". Each "dummy variable" is a binary (0 or 1) that indicates whether a certain category value is "on" or "off.

Fortunately, scikit-learn has the OneHotEncoder functionality to make this easy:

In [14]:

from sklearn.preprocessing import OneHotEncoder

# read files
cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday']
col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, 
             'carrier': str, 'dest': str, 'days_from_holiday': int}
data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_1', cols, col_types)
data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_1', cols, col_types)

# Create training set and test set
train_y = data_2007['delay'] >= 15
categ = [cols.index(x) for x in 'hour', 'month', 'day', 'dow', 'carrier', 'dest']
enc = OneHotEncoder(categorical_features = categ)
df = data_2007.drop('delay', axis=1)
df['carrier'] = pd.factorize(df['carrier'])[0]
df['dest'] = pd.factorize(df['dest'])[0]
train_x = enc.fit_transform(df)

test_y = data_2008['delay'] >= 15
df = data_2008.drop('delay', axis=1)
df['carrier'] = pd.factorize(df['carrier'])[0]
df['dest'] = pd.factorize(df['dest'])[0]
test_x = enc.transform(df)

print train_x.shape

(359169, 409)

So we can see the first 5 lines of the feature matrix. Overall, we have ~359K rows and 409 features in our model. Let's re-run the Random Forest model and see if this improved our model:

In [15]:

# Create Random Forest classifier with 50 trees
clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
clf_rf.fit(train_x.toarray(), train_y)

# Evaluate on test set
pr = clf_rf.predict(test_x.toarray())

# print results
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_svm = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
        (report_svm[0], report_svm[1], report_svm[2], accuracy_score(list(test_y), list(pr)))

Confusion matrix
        0      1
0  216743  23151
1   75678  19758

precision = 0.46, recall = 0.21, F1 = 0.29, accuracy = 0.71

This clearly helped -- accuracy is higher at ~70%, and true positive are also better at 216K (vs 197K previously).

Enriching the model -- how more data gets a better modeling - Iteration #3¶

Another common path to improve accuracy is by bringing in new types of data - enriching our dataset - and generating more features. Our idea is to layer-in weather data. We can get this data from a publicly available dataset here: http://www.ncdc.noaa.gov/cdo-web/datasets/

We will look at daily temperatures (min/max), wind speed, snow conditions and precipitation in the flight origin airport (ORD). Clearly, weather conditions in the destination airport also affect delays, but for simplicity of this demo we just include weather at the origin (ORD).

First, let's re-write our PIG script to add these new features to our feature matrix:

In [16]:

%%writefile preprocess2.pig

register 'util.py' USING jython as util;

-- Helper macro to load data and join into a feature vector per instance
DEFINE preprocess(year_str, airport_code) returns data
{
    -- load airline data from specified year (need to specify fields since it's not in HCat)
    airline = load 'airline/delay/$year_str.csv' using PigStorage(',') 
                    as (Year: int, Month: int, DayOfMonth: int, DayOfWeek: int, DepTime: chararray, CRSDepTime:chararray, 
                        ArrTime, CRSArrTime, Carrier: chararray, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, 
                        ArrDelay, DepDelay: int, Origin: chararray, Dest: chararray, Distance: int, TaxiIn, TaxiOut, 
                        Cancelled: int, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, 
                        SecurityDelay, LateAircraftDelay);

    -- keep only instances where flight was not cancelled and originate at ORD
    airline_flt = filter airline by Cancelled == 0 and Origin == '$airport_code';

    -- Keep only fields I need
    airline2 = foreach airline_flt generate Year as year, Month as month, DayOfMonth as day, DayOfWeek as dow,
                        Carrier as carrier, Origin as origin, Dest as dest, Distance as distance,
                        CRSDepTime as time, DepDelay as delay, util.to_date(Year, Month, DayOfMonth) as date;

    -- load weather data
    weather = load 'airline/weather/$year_str.csv' using PigStorage(',') 
                    as (station: chararray, date: chararray, metric, value, t1, t2, t3, time);

    -- keep only TMIN and TMAX weather observations from ORD
    weather_tmin = filter weather by station == 'USW00094846' and metric == 'TMIN';
    weather_tmax = filter weather by station == 'USW00094846' and metric == 'TMAX';
    weather_prcp = filter weather by station == 'USW00094846' and metric == 'PRCP';
    weather_snow = filter weather by station == 'USW00094846' and metric == 'SNOW';
    weather_awnd = filter weather by station == 'USW00094846' and metric == 'AWND';

    joined = join airline2 by date, weather_tmin by date, weather_tmax by date, weather_prcp by date, 
                                    weather_snow by date, weather_awnd by date;
    $data = foreach joined generate delay, month, day, dow, util.get_hour(airline2::time) as tod, distance, carrier, dest,
                                    util.days_from_nearest_holiday(year, month, day) as hdays,
                                    weather_tmin::value as temp_min, weather_tmax::value as temp_max,
                                    weather_prcp::value as prcp, weather_snow::value as snow, weather_awnd::value as wind;
};

ORD_2007 = preprocess('2007', 'ORD');
rmf airline/fm/ord_2007_2;
store ORD_2007 into 'airline/fm/ord_2007_2' using PigStorage(',');

ORD_2008 = preprocess('2008', 'ORD');
rmf airline/fm/ord_2008_2;
store ORD_2008 into 'airline/fm/ord_2008_2' using PigStorage(',');

Overwriting preprocess2.pig

In [17]:

%%bash --bg --err pig_out2 
pig -f preprocess2.pig

Starting job # 2 in a separate thread.

In [18]:

while True:
    line = pig_out2.readline()
    if not line:
        break
    sys.stdout.write("%s" % line)
    sys.stdout.flush()

15/03/11 10:14:44 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/03/11 10:14:44 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/03/11 10:14:44 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-03-11 10:14:44,144 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0.2.2.0.0-2041 (rexported) compiled Nov 19 2014, 15:24:46
2015-03-11 10:14:44,146 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/demo/airline-demo/pig_1426094084142.log
2015-03-11 10:14:45,348 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/demo/.pigbootup not found
2015-03-11 10:14:45,625 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ds-master.cloud.hortonworks.com:8020
2015-03-11 10:14:46,776 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_1255674992043946473
2015-03-11 10:14:48,660 [main] WARN  org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
2015-03-11 10:14:50,983 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.get_hour
2015-03-11 10:14:50,986 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.to_date
2015-03-11 10:14:50,987 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.days_from_nearest_holiday
2015-03-11 10:14:51,695 [main] INFO  org.apache.pig.scripting.jython.JythonFunction - Schema 'date: chararray' defined for func to_date
2015-03-11 10:14:52,041 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file
2015-03-11 10:14:52,829 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 5 time(s).
2015-03-11 10:14:53,049 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,FILTER
2015-03-11 10:14:53,094 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-03-11 10:14:53,149 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2015-03-11 10:14:53,226 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_weather_0: $4, $5, $6, $7
2015-03-11 10:14:53,231 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_0: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28
2015-03-11 10:14:53,932 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-03-11 10:14:53,989 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)
2015-03-11 10:14:54,004 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2
2015-03-11 10:14:54,009 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 diamond splitter.
2015-03-11 10:14:54,010 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 2 MR operators.
2015-03-11 10:14:54,010 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-03-11 10:14:55,037 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:14:55,246 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:14:55,583 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2015-03-11 10:14:55,596 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-03-11 10:14:55,602 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2015-03-11 10:14:55,604 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2015-03-11 10:14:55,639 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6190005583
2015-03-11 10:14:55,640 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 7
2015-03-11 10:14:55,640 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2015-03-11 10:14:56,029 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp-1528785190/tmp-30784569/pig-0.14.0.2.2.0.0-2041-core-h2.jar
2015-03-11 10:14:56,348 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp-1528785190/tmp1015136214/jython-standalone-2.5.3.jar
2015-03-11 10:14:56,420 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1528785190/tmp-1916985169/automaton-1.11-8.jar
2015-03-11 10:14:56,491 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1528785190/tmp-1682649805/antlr-runtime-3.4.jar
2015-03-11 10:14:56,544 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1528785190/tmp1549937395/guava-11.0.2.jar
2015-03-11 10:14:56,611 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp-1528785190/tmp-711652979/joda-time-2.5.jar
2015-03-11 10:14:58,504 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp-1528785190/tmp-1849048894/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar
2015-03-11 10:14:58,566 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-03-11 10:14:59,547 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2015-03-11 10:14:59,549 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2015-03-11 10:14:59,550 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2015-03-11 10:15:00,736 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-03-11 10:15:01,006 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:15:01,009 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:15:01,137 [JobControl] WARN  org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2015-03-11 10:15:01,241 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:15:01,242 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:15:01,278 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:15:01,286 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:15:01,286 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:15:01,294 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:15:01,299 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:15:01,300 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:15:01,307 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:15:01,313 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:15:01,314 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:15:01,320 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:15:01,325 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:15:01,326 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:15:01,332 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6
2015-03-11 10:15:01,337 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:15:01,338 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:15:01,344 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:15:01,900 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:51
2015-03-11 10:15:02,189 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1424904779802_0604
2015-03-11 10:15:02,424 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2015-03-11 10:15:02,543 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1424904779802_0604
2015-03-11 10:15:02,611 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1424904779802_0604/
2015-03-11 10:15:02,613 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1424904779802_0604
2015-03-11 10:15:02,613 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2007,macro_preprocess_airline2_0,macro_preprocess_airline_0,macro_preprocess_airline_flt_0,macro_preprocess_joined_0,macro_preprocess_weather_0,macro_preprocess_weather_awnd_0,macro_preprocess_weather_prcp_0,macro_preprocess_weather_snow_0,macro_preprocess_weather_tmax_0,macro_preprocess_weather_tmin_0
2015-03-11 10:15:02,613 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_tmax_0[29,19],macro_preprocess_weather_tmax_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_airline_0[8,14],macro_preprocess_airline_0[-1,-1],macro_preprocess_airline_flt_0[16,18],macro_preprocess_airline2_0[19,15],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_snow_0[31,19],macro_preprocess_weather_snow_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_tmin_0[28,19],macro_preprocess_weather_tmin_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_prcp_0[30,19],macro_preprocess_weather_prcp_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_awnd_0[32,19],macro_preprocess_weather_awnd_0[-1,-1],macro_preprocess_joined_0[34,13] C:  R: ORD_2007[36,15]
2015-03-11 10:15:02,633 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-03-11 10:15:02,634 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:04,915 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete
2015-03-11 10:16:04,916 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:11,936 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 8% complete
2015-03-11 10:16:11,937 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:17,953 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete
2015-03-11 10:16:17,955 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:21,967 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
2015-03-11 10:16:21,967 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:26,978 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 21% complete
2015-03-11 10:16:26,978 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:32,991 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 25% complete
2015-03-11 10:16:32,991 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:37,003 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 30% complete
2015-03-11 10:16:37,003 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:42,017 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 34% complete
2015-03-11 10:16:42,017 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:47,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete
2015-03-11 10:16:47,031 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:16:55,051 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 44% complete
2015-03-11 10:16:55,052 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:00,067 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 49% complete
2015-03-11 10:17:00,068 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:02,077 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 54% complete
2015-03-11 10:17:02,077 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:05,086 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete
2015-03-11 10:17:05,089 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:10,103 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 63% complete
2015-03-11 10:17:10,105 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:13,113 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 68% complete
2015-03-11 10:17:13,113 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:15,119 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 83% complete
2015-03-11 10:17:15,120 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:30,157 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 88% complete
2015-03-11 10:17:30,158 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:38,178 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 92% complete
2015-03-11 10:17:38,180 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:45,196 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 97% complete
2015-03-11 10:17:45,196 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604]
2015-03-11 10:17:53,425 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:17:53,428 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:17:53,442 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:17:54,067 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:17:54,069 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:17:54,078 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:17:54,529 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:17:54,531 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:17:54,542 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:17:54,623 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-03-11 10:17:54,629 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.6.0.2.2.0.0-2041	0.14.0.2.2.0.0-2041	demo	2015-03-11 10:14:55	2015-03-11 10:17:54	HASH_JOIN,FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTime	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1424904779802_0604	51	7	117	37	95	104	88	83	85	85	ORD_2007,macro_preprocess_airline2_0,macro_preprocess_airline_0,macro_preprocess_airline_flt_0,macro_preprocess_joined_0,macro_preprocess_weather_0,macro_preprocess_weather_awnd_0,macro_preprocess_weather_prcp_0,macro_preprocess_weather_snow_0,macro_preprocess_weather_tmax_0,macro_preprocess_weather_tmin_0	HASH_JOIN	hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_2,

Input(s):
Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv"
Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv"
Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv"
Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv"
Successfully read 7453216 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2007.csv"
Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv"

Output(s):
Successfully stored 359169 records (14789642 bytes) in: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_2"

Counters:
Total records written : 359169
Total bytes written : 14789642
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1424904779802_0604


2015-03-11 10:17:54,800 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:17:54,800 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:17:54,809 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:17:55,194 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:17:55,196 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:17:55,204 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:17:55,444 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:17:55,446 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:17:55,455 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:17:55,511 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 160755 time(s).
2015-03-11 10:17:55,512 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-03-11 10:17:55,538 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file
2015-03-11 10:17:56,000 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 5 time(s).
2015-03-11 10:17:56,015 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,FILTER
2015-03-11 10:17:56,051 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-03-11 10:17:56,052 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2015-03-11 10:17:56,084 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_weather_1: $4, $5, $6, $7
2015-03-11 10:17:56,086 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_1: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28
2015-03-11 10:17:56,392 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-03-11 10:17:56,399 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)
2015-03-11 10:17:56,400 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2
2015-03-11 10:17:56,403 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 diamond splitter.
2015-03-11 10:17:56,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 2 MR operators.
2015-03-11 10:17:56,405 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-03-11 10:17:56,577 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:17:56,579 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:17:56,585 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2015-03-11 10:17:56,588 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-03-11 10:17:56,590 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2015-03-11 10:17:56,591 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2015-03-11 10:17:56,606 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6417173424
2015-03-11 10:17:56,607 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 7
2015-03-11 10:17:56,608 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2015-03-11 10:17:56,727 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp-1528785190/tmp1946848258/pig-0.14.0.2.2.0.0-2041-core-h2.jar
2015-03-11 10:17:56,897 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp-1528785190/tmp230810937/jython-standalone-2.5.3.jar
2015-03-11 10:17:57,303 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1528785190/tmp1969810629/automaton-1.11-8.jar
2015-03-11 10:17:57,358 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1528785190/tmp-613290889/antlr-runtime-3.4.jar
2015-03-11 10:17:57,424 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1528785190/tmp-340380009/guava-11.0.2.jar
2015-03-11 10:17:57,611 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp-1528785190/tmp-1936706949/joda-time-2.5.jar
2015-03-11 10:17:57,662 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp-1528785190/tmp420591900/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar
2015-03-11 10:17:57,677 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-03-11 10:17:57,679 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2015-03-11 10:17:57,680 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2015-03-11 10:17:57,681 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2015-03-11 10:17:57,794 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-03-11 10:17:57,955 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:17:57,957 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:17:57,995 [JobControl] WARN  org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2015-03-11 10:17:58,068 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:17:58,070 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:17:58,076 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:17:58,081 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:17:58,081 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:17:58,088 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:17:58,093 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:17:58,094 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:17:58,099 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:17:58,104 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:17:58,105 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:17:58,109 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:17:58,115 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:17:58,116 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:17:58,120 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9
2015-03-11 10:17:58,125 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-11 10:17:58,126 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-11 10:17:58,130 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6
2015-03-11 10:17:58,254 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:51
2015-03-11 10:17:58,321 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1424904779802_0605
2015-03-11 10:17:58,328 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2015-03-11 10:17:58,390 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1424904779802_0605
2015-03-11 10:17:58,398 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1424904779802_0605/
2015-03-11 10:17:58,400 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1424904779802_0605
2015-03-11 10:17:58,401 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2008,macro_preprocess_airline2_1,macro_preprocess_airline_1,macro_preprocess_airline_flt_1,macro_preprocess_joined_1,macro_preprocess_weather_1,macro_preprocess_weather_awnd_1,macro_preprocess_weather_prcp_1,macro_preprocess_weather_snow_1,macro_preprocess_weather_tmax_1,macro_preprocess_weather_tmin_1
2015-03-11 10:17:58,401 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_tmin_1[28,19],macro_preprocess_weather_tmin_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_airline_1[8,14],macro_preprocess_airline_1[-1,-1],macro_preprocess_airline_flt_1[16,18],macro_preprocess_airline2_1[19,15],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_awnd_1[32,19],macro_preprocess_weather_awnd_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_tmax_1[29,19],macro_preprocess_weather_tmax_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_snow_1[31,19],macro_preprocess_weather_snow_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_prcp_1[30,19],macro_preprocess_weather_prcp_1[-1,-1],macro_preprocess_joined_1[34,13] C:  R: ORD_2008[36,15]
2015-03-11 10:17:58,413 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-03-11 10:17:58,414 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:18:57,671 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete
2015-03-11 10:18:57,672 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:07,700 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 8% complete
2015-03-11 10:19:07,702 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:15,721 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete
2015-03-11 10:19:15,722 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:22,744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete
2015-03-11 10:19:22,745 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:30,765 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 21% complete
2015-03-11 10:19:30,765 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:37,781 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 26% complete
2015-03-11 10:19:37,781 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:42,793 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 35% complete
2015-03-11 10:19:42,793 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:45,801 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 39% complete
2015-03-11 10:19:45,801 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:50,813 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 44% complete
2015-03-11 10:19:50,813 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:19:55,827 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 51% complete
2015-03-11 10:19:55,827 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:20:05,852 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 56% complete
2015-03-11 10:20:05,853 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:20:27,912 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 61% complete
2015-03-11 10:20:27,913 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:20:40,942 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 69% complete
2015-03-11 10:20:40,943 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:20:42,948 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 79% complete
2015-03-11 10:20:42,948 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:20:50,970 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 83% complete
2015-03-11 10:20:50,970 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:21:00,998 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 89% complete
2015-03-11 10:21:00,999 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:21:06,010 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 93% complete
2015-03-11 10:21:06,011 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:21:13,031 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 98% complete
2015-03-11 10:21:13,033 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605]
2015-03-11 10:21:24,213 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:21:24,215 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:21:24,225 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:21:24,574 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:21:24,576 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:21:24,585 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:21:24,851 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:21:24,852 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:21:24,862 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:21:24,907 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-03-11 10:21:24,909 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.6.0.2.2.0.0-2041	0.14.0.2.2.0.0-2041	demo	2015-03-11 10:17:56	2015-03-11 10:21:24	HASH_JOIN,FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTime	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1424904779802_0605	51	7	151	16	104	101	114	106	110	109	ORD_2008,macro_preprocess_airline2_1,macro_preprocess_airline_1,macro_preprocess_airline_flt_1,macro_preprocess_joined_1,macro_preprocess_weather_1,macro_preprocess_weather_awnd_1,macro_preprocess_weather_prcp_1,macro_preprocess_weather_snow_1,macro_preprocess_weather_tmax_1,macro_preprocess_weather_tmin_1	HASH_JOIN	hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_2,

Input(s):
Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv"
Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv"
Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv"
Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv"
Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv"
Successfully read 7009729 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2008.csv"

Output(s):
Successfully stored 335330 records (13817679 bytes) in: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_2"

Counters:
Total records written : 335330
Total bytes written : 13817679
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1424904779802_0605


2015-03-11 10:21:25,059 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:21:25,061 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:21:25,073 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:21:25,438 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:21:25,440 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:21:25,448 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:21:25,666 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/
2015-03-11 10:21:25,669 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050
2015-03-11 10:21:25,679 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-11 10:21:25,730 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 136253 time(s).
2015-03-11 10:21:25,732 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-03-11 10:21:25,766 [main] INFO  org.apache.pig.Main - Pig script completed in 6 minutes, 41 seconds and 825 milliseconds (401825 ms)

We now read this data in, convert temparatures to Fahrenheit (note original temp is in Celcius*10), and prepare the training and testing datasets for modeling.

In [19]:

from sklearn.preprocessing import OneHotEncoder

# Convert Celsius to Fahrenheit
def fahrenheit(x): return(x*1.8 + 32.0)

# read files
cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday',
        'origin_tmin', 'origin_tmax', 'origin_prcp', 'origin_snow', 'origin_wind']
col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, 
             'carrier': str, 'dest': str, 'days_from_holiday': int,
             'origin_tmin': float, 'origin_tmax': float, 'origin_prcp': float, 'origin_snow': float, 'origin_wind': float}

data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_2', cols, col_types)
data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_2', cols, col_types)

data_2007['origin_tmin'] = data_2007['origin_tmin'].apply(lambda x: fahrenheit(x/10.0))
data_2007['origin_tmax'] = data_2007['origin_tmax'].apply(lambda x: fahrenheit(x/10.0))
data_2008['origin_tmin'] = data_2008['origin_tmin'].apply(lambda x: fahrenheit(x/10.0))
data_2008['origin_tmax'] = data_2008['origin_tmax'].apply(lambda x: fahrenheit(x/10.0))

# Create training set and test set
train_y = data_2007['delay'] >= 15
categ = [cols.index(x) for x in 'hour', 'month', 'day', 'dow', 'carrier', 'dest']
enc = OneHotEncoder(categorical_features = categ)
df = data_2007.drop('delay', axis=1)
df['carrier'] = pd.factorize(df['carrier'])[0]
df['dest'] = pd.factorize(df['dest'])[0]
train_x = enc.fit_transform(df)

test_y = data_2008['delay'] >= 15
df = data_2008.drop('delay', axis=1)
df['carrier'] = pd.factorize(df['carrier'])[0]
df['dest'] = pd.factorize(df['dest'])[0]
test_x = enc.transform(df)

print train_x.shape

(359169, 414)

Good. So now that we have the training and test (validation) set ready, let's try Random Forest with the new features:

In [20]:

# Create Random Forest classifier with 100 trees
clf_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf_rf.fit(train_x.toarray(), train_y)

# Evaluate on test set
pr = clf_rf.predict(test_x.toarray())

# print results
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_rf = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "precision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
        (report_rf[0], report_rf[1], report_rf[2], accuracy_score(list(test_y), list(pr)))

Confusion matrix
        0      1
0  226595  13299
1   73098  22338
precision = 0.63, recall = 0.23, F1 = 0.34, accuracy = 0.74

with the new weather features, accuracy went up again from 0.70 to 0.74.

Clearly with more iterations, we are likely going to improve accuracy even further. For example, we can add weather information at the Origin, or explore the number of seats on the plan as a predictive feature (we can get that from the tail number), and so on.

Summary¶

In this blog post we have demonstrated how to build a predictive model with Hadoop and Python. We have used Hadoop to perform various types of data pre-processing and feature engineering tasks. We then applied Scikit-learn machine learning algorithm on the resulting datasets and have shown how via iterations we continuously add new and improved features resulting in better model performance.

In the next part of this multi-part blog post we will show how to perform the same learning task with Spark and ML-Lib.