This IPython notebook complements and expounds the implementation details outlined in our series on Data Science and Hadoop part 1 blog, demonstrating how to apply data science techniques with Apache Hadoop.
In this first blog post we will demonstrate a step by step solution to a supervised learning problem, including:
So let's begin.
Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travellers and airlines. As our example use-case, we will build a supervised learning model that predicts airline delay from historial flight data and weather information.
Let's begin by exploring the airline delay dataset available here: http://stat-computing.org/dataexpo/2009/the-data.html This dataset includes details about flights in the US from the years 1987-2008. Every row in the dataset includes 29 variables:
Name | Description | |
---|---|---|
1 | Year | 1987-2008 |
2 | Month | 1-12 |
3 | DayofMonth | 1-31 |
4 | DayOfWeek | 1 (Monday) - 7 (Sunday) |
5 | DepTime | actual departure time (local, hhmm) |
6 | CRSDepTime | scheduled departure time (local, hhmm) |
7 | ArrTime | actual arrival time (local, hhmm) |
8 | CRSArrTime | scheduled arrival time (local, hhmm) |
9 | UniqueCarrier | unique carrier code |
10 | FlightNum | flight number |
11 | TailNum | plane tail number |
12 | ActualElapsedTime | in minutes |
13 | CRSElapsedTime | in minutes |
14 | AirTime | in minutes |
15 | ArrDelay | arrival delay, in minutes |
16 | DepDelay | departure delay, in minutes |
17 | Origin | origin |
18 | Dest | destination |
19 | Distance | in miles |
20 | TaxiIn | taxi in time, in minutes |
21 | TaxiOut | taxi out time in minutes |
22 | Cancelled | was the flight cancelled? |
23 | CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |
24 | Diverted | 1 = yes, 0 = no |
25 | CarrierDelay | in minutes |
26 | WeatherDelay | in minutes |
27 | NASDelay | in minutes |
28 | SecurityDelay | in minutes |
29 | LateAircraftDelay | in minutes |
To simplify, we will build a supervised learning model to predict flight delays for flights leaving O'Hare International airport (ORD), where we "learn" the model using data from 2007, and evaluate its performance using data from 2008.
But first, let's do some exploration of this dataset. Exploration is a common step in building a predictive model -- our goal is to better understand the data we have and get some clues as to which features might be good for the predictive model.
We start by importing some useful python libraries that we will need later like Pandas, Numpy, Scikit-learn and Matplotlib.
# Python library imports: numpy, random, sklearn, pandas, etc
import warnings
warnings.filterwarnings('ignore')
import sys
import random
import numpy as np
from sklearn import linear_model, cross_validation, metrics, svm
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
We now define a utility function to read an HDFS file into a Pandas dataframe using Pydoop. Pydoop is a package that provides a Python API for Hadoop MapReduce and HDFS.
Pydoop's hdfs.open() function reads a single file from HDFS. However many HDFS output files are actually multi-part files, so our read_csv_from_hdfs() function uses hdfs.ls() to grab all the needed file names, and then read each one separately. Finally, it concatenates the resulting Pandas dataframes of each file into a Pandas dataframe.
# function to read HDFS file into dataframe using PyDoop
import pydoop.hdfs as hdfs
def read_csv_from_hdfs(path, cols, col_types=None):
files = hdfs.ls(path);
pieces = []
for f in files:
fhandle = hdfs.open(f)
pieces.append(pd.read_csv(fhandle, names=cols, dtype=col_types))
fhandle.close()
return pd.concat(pieces, ignore_index=True)
Great. Now we got the logistics out of the way, so let's explore this dataset further.
First, let's read the raw data for 2007 from HDFS into a Pandas dataframe. We use our utility function read_csv_from_hdfs() and provide it with column names since this is a raw file, not a HIVE table with meta-data. Let's see how it works:
# read 2007 year file
cols = ['year', 'month', 'day', 'dow', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'Carrier', 'FlightNum',
'TailNum', 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest',
'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'];
flt_2007 = read_csv_from_hdfs('airline/delay/2007.csv', cols)
flt_2007.shape
(7453216, 29)
We see 7.4M+ flights in 2007 and 29 variables.
Our "target" variable will be DepDelay (scheduled departure delay in minutes). To build a classifier, we further refine our target variable into a binary variable by defining a "delay" as having 15 mins or more of delay, and "non-delay" otherwise. We thus create a new binary variable that we name 'DepDelayed'.
Let's look at some basic statistics, after limiting ourselves to flights originating from ORD:
df = flt_2007[flt_2007['Origin']=='ORD'].dropna(subset=['DepDelay'])
df['DepDelayed'] = df['DepDelay'].apply(lambda x: x>=15)
print "total flights: " + str(df.shape[0])
print "total delays: " + str(df['DepDelayed'].sum())
total flights: 359169 total delays: 109346
Let's see how delayed flights are distributed by month:
# Select a Pandas dataframe with flight originating from ORD
# Compute average number of delayed flights per month
grouped = df[['DepDelayed', 'month']].groupby('month').mean()
# plot average delays by month
grouped.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7fa749875990>
We see that the average number of delays is highest in December and February, which is what we would expect.
Now let's look at the hour-of-day:
# Compute average number of delayed flights by hour
df['hour'] = df['CRSDepTime'].map(lambda x: int(str(int(x)).zfill(4)[:2]))
grouped = df[['DepDelayed', 'hour']].groupby('hour').mean()
# plot average delays by hour of day
grouped.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6affb24d0>
A clear pattern here - flights tend to be delayed later in the day. Perhaps this is because delays tend to pile up as the day progresses and the problem tends to compound later in the day.
Now let's look at delays by carrier:
# Compute average number of delayed flights per carrier
grouped1 = df[['DepDelayed', 'Carrier']].groupby('Carrier').filter(lambda x: len(x)>10)
grouped2 = grouped1.groupby('Carrier').mean()
carrier = grouped2.sort(['DepDelayed'], ascending=False)
# display top 15 destination carriers by delay (from ORD)
carrier[:15].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6afed6310>
As expected, some airlines are better than others.
After exploring the data for a bit, we now move to building the feature matrix for our predictive model.
Let's look at possible predictive variables for our model:
We will also generate another feature: number of days from closest national holiday, with the assumption that holidays tend to be associated with more delays.
We implement this "feature generation" process using PIG and some simply Python user-defined-functions (UDFs). First, let's implement some Python UDFs:
#
# Python UDFs for our PIG script
#
from datetime import date
# get hour-of-day from HHMM field
@outputSchema("value: int")
def get_hour(val):
return int(val.zfill(4)[:2])
# this array defines the dates of holiday in 2007 and 2008
holidays = [
date(2007, 1, 1), date(2007, 1, 15), date(2007, 2, 19), date(2007, 5, 28), date(2007, 6, 7), date(2007, 7, 4), \
date(2007, 9, 3), date(2007, 10, 8), date(2007, 11, 11), date(2007, 11, 22), date(2007, 12, 25), \
date(2008, 1, 1), date(2008, 1, 21), date(2008, 2, 18), date(2008, 5, 22), date(2008, 5, 26), date(2008, 7, 4), \
date(2008, 9, 1), date(2008, 10, 13), date(2008, 11, 11), date(2008, 11, 27), date(2008, 12, 25) \
]
# get number of days from nearest holiday
@outputSchema("days: int")
def days_from_nearest_holiday(year, month, day):
d = date(year, month, day)
x = [(abs(d-h)).days for h in holidays]
return min(x)
Our PIG script is relatively simple:
We can execute this script directly from IPython (the Python UDFs are separately stored in "util.py"):
%%writefile preprocess1.pig
Register 'util.py' USING jython as util;
DEFINE preprocess(year_str, airport_code) returns data
{
-- load airline data from specified year (need to specify fields since it's not in HCat)
airline = load 'airline/delay/$year_str.csv' using PigStorage(',')
as (Year: int, Month: int, DayOfMonth: int, DayOfWeek: int, DepTime: chararray,
CRSDepTime: chararray, ArrTime, CRSArrTime, Carrier: chararray, FlightNum, TailNum, ActualElapsedTime,
CRSElapsedTime, AirTime, ArrDelay, DepDelay: int, Origin: chararray, Dest: chararray, Distance: int,
TaxiIn, TaxiOut, Cancelled: int, CancellationCode, Diverted, CarrierDelay, WeatherDelay,
NASDelay, SecurityDelay, LateAircraftDelay);
-- keep only instances where flight was not cancelled and originate at ORD
airline_flt = filter airline by Cancelled == 0 and Origin == '$airport_code';
-- Keep only fields I need
$data = foreach airline_flt generate DepDelay as delay, Month, DayOfMonth, DayOfWeek,
util.get_hour(CRSDepTime) as hour, Distance, Carrier, Dest,
util.days_from_nearest_holiday(Year, Month, DayOfMonth) as hdays;
};
ORD_2007 = preprocess('2007', 'ORD');
rmf airline/fm/ord_2007_1
store ORD_2007 into 'airline/fm/ord_2007_1' using PigStorage(',');
ORD_2008 = preprocess('2008', 'ORD');
rmf airline/fm/ord_2008_1
store ORD_2008 into 'airline/fm/ord_2008_1' using PigStorage(',');
Overwriting preprocess1.pig
Let's look at the output as the script continues to process...
%%bash --err pig_out --bg
pig -f preprocess1.pig
Starting job # 0 in a separate thread.
while True:
line = pig_out.readline()
if not line:
break
sys.stdout.write("%s" % line)
sys.stdout.flush()
15/03/11 10:08:41 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL 15/03/11 10:08:41 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE 15/03/11 10:08:41 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType 2015-03-11 10:08:41,358 [main] INFO org.apache.pig.Main - Apache Pig version 0.14.0.2.2.0.0-2041 (rexported) compiled Nov 19 2014, 15:24:46 2015-03-11 10:08:41,359 [main] INFO org.apache.pig.Main - Logging error messages to: /home/demo/airline-demo/pig_1426093721356.log 2015-03-11 10:08:42,495 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/demo/.pigbootup not found 2015-03-11 10:08:42,770 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ds-master.cloud.hortonworks.com:8020 2015-03-11 10:08:43,872 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_3861695012397846592 2015-03-11 10:08:45,674 [main] WARN org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing. 2015-03-11 10:08:47,988 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.get_hour 2015-03-11 10:08:47,990 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.to_date 2015-03-11 10:08:47,992 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.days_from_nearest_holiday 2015-03-11 10:08:52,833 [main] INFO org.apache.pig.scripting.jython.JythonFunction - Schema 'value: int' defined for func get_hour 2015-03-11 10:08:52,994 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file 2015-03-11 10:08:54,283 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER 2015-03-11 10:08:54,656 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code. 2015-03-11 10:08:54,727 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]} 2015-03-11 10:08:54,783 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_0: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28 2015-03-11 10:08:55,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2015-03-11 10:08:55,774 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2015-03-11 10:08:55,777 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2015-03-11 10:08:57,494 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:08:57,703 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:08:58,629 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 2015-03-11 10:08:58,644 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2015-03-11 10:08:58,684 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process 2015-03-11 10:08:59,405 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp-1751751362/tmp1493467348/pig-0.14.0.2.2.0.0-2041-core-h2.jar 2015-03-11 10:09:01,690 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp-1751751362/tmp2080441925/jython-standalone-2.5.3.jar 2015-03-11 10:09:01,908 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1751751362/tmp518987527/automaton-1.11-8.jar 2015-03-11 10:09:02,064 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1751751362/tmp524180625/antlr-runtime-3.4.jar 2015-03-11 10:09:02,658 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1751751362/tmp-459080211/guava-11.0.2.jar 2015-03-11 10:09:02,702 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp-1751751362/tmp-1371032075/joda-time-2.5.jar 2015-03-11 10:09:02,800 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp-1751751362/tmp344686184/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar 2015-03-11 10:09:02,862 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2015-03-11 10:09:02,877 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code. 2015-03-11 10:09:02,878 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche 2015-03-11 10:09:02,879 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize [] 2015-03-11 10:09:02,978 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 2015-03-11 10:09:03,230 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:09:03,232 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:09:04,006 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String). 2015-03-11 10:09:04,112 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:09:04,114 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:09:04,354 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6 2015-03-11 10:09:04,896 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:6 2015-03-11 10:09:05,411 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1424904779802_0602 2015-03-11 10:09:06,317 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources. 2015-03-11 10:09:06,452 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1424904779802_0602 2015-03-11 10:09:06,722 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1424904779802_0602/ 2015-03-11 10:09:06,724 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1424904779802_0602 2015-03-11 10:09:06,725 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2007,macro_preprocess_airline_0,macro_preprocess_airline_flt_0 2015-03-11 10:09:06,726 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_airline_0[6,18],macro_preprocess_airline_0[-1,-1],macro_preprocess_airline_flt_0[14,22],ORD_2007[17,19] C: R: 2015-03-11 10:09:06,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2015-03-11 10:09:06,747 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:09:46,097 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 6% complete 2015-03-11 10:09:46,100 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:09:54,122 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete 2015-03-11 10:09:54,125 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:10:01,145 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 15% complete 2015-03-11 10:10:01,146 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:10:06,160 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete 2015-03-11 10:10:06,160 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:10:14,179 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 25% complete 2015-03-11 10:10:14,180 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:10:19,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 29% complete 2015-03-11 10:10:19,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:10:26,209 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete 2015-03-11 10:10:26,210 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:10:37,236 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 38% complete 2015-03-11 10:10:37,237 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:10:44,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 43% complete 2015-03-11 10:10:44,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:10:54,282 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 47% complete 2015-03-11 10:10:54,283 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0602] 2015-03-11 10:11:07,523 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:11:07,526 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:11:07,539 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:11:08,434 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:11:08,436 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:11:08,446 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:11:08,788 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:11:08,790 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:11:08,799 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:11:08,929 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2015-03-11 10:11:08,933 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.6.0.2.2.0.0-2041 0.14.0.2.2.0.0-2041 demo 2015-03-11 10:08:58 2015-03-11 10:11:08 FILTER Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_1424904779802_0602 6 0 106 50 85 94 0 0 0 0 ORD_2007,macro_preprocess_airline_0,macro_preprocess_airline_flt_0 MAP_ONLY hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_1, Input(s): Successfully read 7453216 records (703535971 bytes) from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2007.csv" Output(s): Successfully stored 359169 records (9421186 bytes) in: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_1" Counters: Total records written : 359169 Total bytes written : 9421186 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_1424904779802_0602 2015-03-11 10:11:09,102 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:11:09,104 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:11:09,112 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:11:09,362 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:11:09,364 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:11:09,373 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:11:09,575 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:11:09,577 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:11:09,587 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:11:09,650 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 160755 time(s). 2015-03-11 10:11:09,652 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 2015-03-11 10:11:09,790 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file 2015-03-11 10:11:10,146 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER 2015-03-11 10:11:10,189 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code. 2015-03-11 10:11:10,191 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]} 2015-03-11 10:11:10,204 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_1: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28 2015-03-11 10:11:10,333 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2015-03-11 10:11:10,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2015-03-11 10:11:10,341 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2015-03-11 10:11:10,519 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:11:10,521 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:11:10,529 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 2015-03-11 10:11:10,534 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2015-03-11 10:11:10,537 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process 2015-03-11 10:11:11,086 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp-1751751362/tmp-2067043078/pig-0.14.0.2.2.0.0-2041-core-h2.jar 2015-03-11 10:11:11,264 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp-1751751362/tmp-908608899/jython-standalone-2.5.3.jar 2015-03-11 10:11:11,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1751751362/tmp560249020/automaton-1.11-8.jar 2015-03-11 10:11:11,385 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1751751362/tmp884892759/antlr-runtime-3.4.jar 2015-03-11 10:11:11,437 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1751751362/tmp-2092270325/guava-11.0.2.jar 2015-03-11 10:11:11,502 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp-1751751362/tmp-431570757/joda-time-2.5.jar 2015-03-11 10:11:11,540 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp-1751751362/tmp229632211/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar 2015-03-11 10:11:11,570 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2015-03-11 10:11:11,573 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code. 2015-03-11 10:11:11,574 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche 2015-03-11 10:11:11,574 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize [] 2015-03-11 10:11:11,628 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 2015-03-11 10:11:11,790 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:11:11,792 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:11:11,827 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String). 2015-03-11 10:11:11,902 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:11:11,903 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:11:11,909 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6 2015-03-11 10:11:12,026 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:6 2015-03-11 10:11:12,145 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1424904779802_0603 2015-03-11 10:11:12,155 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources. 2015-03-11 10:11:12,223 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1424904779802_0603 2015-03-11 10:11:12,230 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1424904779802_0603/ 2015-03-11 10:11:12,231 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1424904779802_0603 2015-03-11 10:11:12,231 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2008,macro_preprocess_airline_1,macro_preprocess_airline_flt_1 2015-03-11 10:11:12,231 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_airline_1[6,18],macro_preprocess_airline_1[-1,-1],macro_preprocess_airline_flt_1[14,22],ORD_2008[17,19] C: R: 2015-03-11 10:11:12,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2015-03-11 10:11:12,242 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:11:44,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 6% complete 2015-03-11 10:11:44,404 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:11:49,419 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 11% complete 2015-03-11 10:11:49,421 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:11:57,444 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete 2015-03-11 10:11:57,446 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:12:02,462 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 21% complete 2015-03-11 10:12:02,463 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:12:09,484 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 26% complete 2015-03-11 10:12:09,484 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:12:14,501 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 32% complete 2015-03-11 10:12:14,501 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:12:19,518 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 36% complete 2015-03-11 10:12:19,518 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:12:24,532 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 41% complete 2015-03-11 10:12:24,532 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:12:31,552 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 46% complete 2015-03-11 10:12:31,552 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0603] 2015-03-11 10:12:42,755 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:12:42,758 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:12:42,773 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:12:43,041 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:12:43,043 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:12:43,051 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:12:43,269 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:12:43,271 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:12:43,282 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:12:43,329 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2015-03-11 10:12:43,331 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.6.0.2.2.0.0-2041 0.14.0.2.2.0.0-2041 demo 2015-03-11 10:11:10 2015-03-11 10:12:43 FILTER Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_1424904779802_0603 6 0 73 23 59 69 0 0 0 0 ORD_2008,macro_preprocess_airline_1,macro_preprocess_airline_flt_1 MAP_ONLY hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_1, Input(s): Successfully read 7009729 records (690071122 bytes) from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2008.csv" Output(s): Successfully stored 335330 records (8795284 bytes) in: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_1" Counters: Total records written : 335330 Total bytes written : 8795284 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_1424904779802_0603 2015-03-11 10:12:43,483 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:12:43,484 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:12:43,496 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:12:43,706 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:12:43,708 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:12:43,716 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:12:43,917 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:12:43,919 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:12:43,929 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:12:43,978 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 136253 time(s). 2015-03-11 10:12:43,980 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 2015-03-11 10:12:44,011 [main] INFO org.apache.pig.Main - Pig script completed in 4 minutes, 2 seconds and 838 milliseconds (242838 ms)
Now that PIG finished processing, we have two new file generated:
(the "1" indicates this is the first iteration; we will work on a second iteration later).
PIG is great for pre-procesing raw data into a feature matrix, but it's not the only choice. We can use other tools such as HIVE, Cascading, Scalding or Spark for this type of pre-processing. We will show how to do the same type of pre-processing using Spark in the second part of this blog post series.
Now we have the files ord_2007_1 and ord_2008_1 under 'airline/fm' folder in HDFS. Let's read those files into Python, and prepare the training and testing (validation) datasets as Pandas DataFrame objects.
Initially, we use only the numerical variables:
# read files
cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday']
col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int,
'carrier': str, 'dest': str, 'days_from_holiday': int}
data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_1', cols, col_types)
data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_1', cols, col_types)
# Create training set and test set
cols = ['month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday']
train_y = data_2007['delay'] >= 15
train_x = data_2007[cols]
test_y = data_2008['delay'] >= 15
test_x = data_2008[cols]
print train_x.shape
(359169, 6)
So we have ~359K rows and 6 features in our model.
Now we use Python's excellent Scikit-learn machine learning package to to build two predictive models (Logistic regression and Random Forest) and compare their performance. First we print the confusion matrix, which counts the true positive, true negatives, false positives and false negatives. Then from the confusion matrix, we compute precision, recall, F1 metric and accuracy. Let's start with a logistic regression model and evaluate its performance on the testing dataset.
# Create logistic regression model with L2 regularization
clf_lr = linear_model.LogisticRegression(penalty='l2', class_weight='auto')
clf_lr.fit(train_x, train_y)
# Predict output labels on test set
pr = clf_lr.predict(test_x)
# display evaluation metrics
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_lr = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
(report_lr[0], report_lr[1], report_lr[2], accuracy_score(list(test_y), list(pr)))
Confusion matrix 0 1 0 143858 96036 1 36987 58449 precision = 0.38, recall = 0.61, F1 = 0.47, accuracy = 0.60
Our logistic regression model got overall accuracy of 60%. Now let's try Random Forest:
# Create Random Forest classifier with 50 trees
clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
clf_rf.fit(train_x, train_y)
# Evaluate on test set
pr = clf_rf.predict(test_x)
# print results
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_svm = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
(report_svm[0], report_svm[1], report_svm[2], accuracy_score(list(test_y), list(pr)))
Confusion matrix 0 1 0 197890 42004 1 65978 29458 precision = 0.41, recall = 0.31, F1 = 0.35, accuracy = 0.68
As we can see, Random Forest has overall better accuracy, but lower F1 score. For our problem -- we are trying to predict delays, so the higher level of true positives (197K vs. 143K) is better.
With any supervised learnign algorithm, one typically needs to choose values for the parameters of the model. For example, we chose "L1" regularization for the logistic regression model, and 50 trees for the Random Forest. Such choices are based on some experimentation and hyperparameter tuning (http://en.wikipedia.org/wiki/Hyperparameter_optimization). We are not addressing this topic in this demo, although such choices are important to achieve the overall best model.
It is very common in data science to work iteratively, and improve the model with each iteration. Let's see how this works.
In this iteration, we improve our feature by converting existing variables that are categorical in nature (such as "hour", or "month") as well as categorical variables that are strings (like "carrier" and "dest"), into what is known as "dummy variables". Each "dummy variable" is a binary (0 or 1) that indicates whether a certain category value is "on" or "off.
Fortunately, scikit-learn has the OneHotEncoder functionality to make this easy:
from sklearn.preprocessing import OneHotEncoder
# read files
cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday']
col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int,
'carrier': str, 'dest': str, 'days_from_holiday': int}
data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_1', cols, col_types)
data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_1', cols, col_types)
# Create training set and test set
train_y = data_2007['delay'] >= 15
categ = [cols.index(x) for x in 'hour', 'month', 'day', 'dow', 'carrier', 'dest']
enc = OneHotEncoder(categorical_features = categ)
df = data_2007.drop('delay', axis=1)
df['carrier'] = pd.factorize(df['carrier'])[0]
df['dest'] = pd.factorize(df['dest'])[0]
train_x = enc.fit_transform(df)
test_y = data_2008['delay'] >= 15
df = data_2008.drop('delay', axis=1)
df['carrier'] = pd.factorize(df['carrier'])[0]
df['dest'] = pd.factorize(df['dest'])[0]
test_x = enc.transform(df)
print train_x.shape
(359169, 409)
So we can see the first 5 lines of the feature matrix. Overall, we have ~359K rows and 409 features in our model. Let's re-run the Random Forest model and see if this improved our model:
# Create Random Forest classifier with 50 trees
clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
clf_rf.fit(train_x.toarray(), train_y)
# Evaluate on test set
pr = clf_rf.predict(test_x.toarray())
# print results
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_svm = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
(report_svm[0], report_svm[1], report_svm[2], accuracy_score(list(test_y), list(pr)))
Confusion matrix 0 1 0 216743 23151 1 75678 19758 precision = 0.46, recall = 0.21, F1 = 0.29, accuracy = 0.71
This clearly helped -- accuracy is higher at ~70%, and true positive are also better at 216K (vs 197K previously).
Another common path to improve accuracy is by bringing in new types of data - enriching our dataset - and generating more features. Our idea is to layer-in weather data. We can get this data from a publicly available dataset here: http://www.ncdc.noaa.gov/cdo-web/datasets/
We will look at daily temperatures (min/max), wind speed, snow conditions and precipitation in the flight origin airport (ORD). Clearly, weather conditions in the destination airport also affect delays, but for simplicity of this demo we just include weather at the origin (ORD).
First, let's re-write our PIG script to add these new features to our feature matrix:
%%writefile preprocess2.pig
register 'util.py' USING jython as util;
-- Helper macro to load data and join into a feature vector per instance
DEFINE preprocess(year_str, airport_code) returns data
{
-- load airline data from specified year (need to specify fields since it's not in HCat)
airline = load 'airline/delay/$year_str.csv' using PigStorage(',')
as (Year: int, Month: int, DayOfMonth: int, DayOfWeek: int, DepTime: chararray, CRSDepTime:chararray,
ArrTime, CRSArrTime, Carrier: chararray, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime,
ArrDelay, DepDelay: int, Origin: chararray, Dest: chararray, Distance: int, TaxiIn, TaxiOut,
Cancelled: int, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay,
SecurityDelay, LateAircraftDelay);
-- keep only instances where flight was not cancelled and originate at ORD
airline_flt = filter airline by Cancelled == 0 and Origin == '$airport_code';
-- Keep only fields I need
airline2 = foreach airline_flt generate Year as year, Month as month, DayOfMonth as day, DayOfWeek as dow,
Carrier as carrier, Origin as origin, Dest as dest, Distance as distance,
CRSDepTime as time, DepDelay as delay, util.to_date(Year, Month, DayOfMonth) as date;
-- load weather data
weather = load 'airline/weather/$year_str.csv' using PigStorage(',')
as (station: chararray, date: chararray, metric, value, t1, t2, t3, time);
-- keep only TMIN and TMAX weather observations from ORD
weather_tmin = filter weather by station == 'USW00094846' and metric == 'TMIN';
weather_tmax = filter weather by station == 'USW00094846' and metric == 'TMAX';
weather_prcp = filter weather by station == 'USW00094846' and metric == 'PRCP';
weather_snow = filter weather by station == 'USW00094846' and metric == 'SNOW';
weather_awnd = filter weather by station == 'USW00094846' and metric == 'AWND';
joined = join airline2 by date, weather_tmin by date, weather_tmax by date, weather_prcp by date,
weather_snow by date, weather_awnd by date;
$data = foreach joined generate delay, month, day, dow, util.get_hour(airline2::time) as tod, distance, carrier, dest,
util.days_from_nearest_holiday(year, month, day) as hdays,
weather_tmin::value as temp_min, weather_tmax::value as temp_max,
weather_prcp::value as prcp, weather_snow::value as snow, weather_awnd::value as wind;
};
ORD_2007 = preprocess('2007', 'ORD');
rmf airline/fm/ord_2007_2;
store ORD_2007 into 'airline/fm/ord_2007_2' using PigStorage(',');
ORD_2008 = preprocess('2008', 'ORD');
rmf airline/fm/ord_2008_2;
store ORD_2008 into 'airline/fm/ord_2008_2' using PigStorage(',');
Overwriting preprocess2.pig
%%bash --bg --err pig_out2
pig -f preprocess2.pig
Starting job # 2 in a separate thread.
while True:
line = pig_out2.readline()
if not line:
break
sys.stdout.write("%s" % line)
sys.stdout.flush()
15/03/11 10:14:44 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL 15/03/11 10:14:44 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE 15/03/11 10:14:44 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType 2015-03-11 10:14:44,144 [main] INFO org.apache.pig.Main - Apache Pig version 0.14.0.2.2.0.0-2041 (rexported) compiled Nov 19 2014, 15:24:46 2015-03-11 10:14:44,146 [main] INFO org.apache.pig.Main - Logging error messages to: /home/demo/airline-demo/pig_1426094084142.log 2015-03-11 10:14:45,348 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/demo/.pigbootup not found 2015-03-11 10:14:45,625 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ds-master.cloud.hortonworks.com:8020 2015-03-11 10:14:46,776 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_1255674992043946473 2015-03-11 10:14:48,660 [main] WARN org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing. 2015-03-11 10:14:50,983 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.get_hour 2015-03-11 10:14:50,986 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.to_date 2015-03-11 10:14:50,987 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.days_from_nearest_holiday 2015-03-11 10:14:51,695 [main] INFO org.apache.pig.scripting.jython.JythonFunction - Schema 'date: chararray' defined for func to_date 2015-03-11 10:14:52,041 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file 2015-03-11 10:14:52,829 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 5 time(s). 2015-03-11 10:14:53,049 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,FILTER 2015-03-11 10:14:53,094 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code. 2015-03-11 10:14:53,149 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]} 2015-03-11 10:14:53,226 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_weather_0: $4, $5, $6, $7 2015-03-11 10:14:53,231 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_0: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28 2015-03-11 10:14:53,932 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2015-03-11 10:14:53,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager) 2015-03-11 10:14:54,004 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2 2015-03-11 10:14:54,009 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 diamond splitter. 2015-03-11 10:14:54,010 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 2 MR operators. 2015-03-11 10:14:54,010 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2015-03-11 10:14:55,037 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:14:55,246 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:14:55,583 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 2015-03-11 10:14:55,596 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2015-03-11 10:14:55,602 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers. 2015-03-11 10:14:55,604 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator 2015-03-11 10:14:55,639 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6190005583 2015-03-11 10:14:55,640 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 7 2015-03-11 10:14:55,640 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process 2015-03-11 10:14:56,029 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp-1528785190/tmp-30784569/pig-0.14.0.2.2.0.0-2041-core-h2.jar 2015-03-11 10:14:56,348 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp-1528785190/tmp1015136214/jython-standalone-2.5.3.jar 2015-03-11 10:14:56,420 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1528785190/tmp-1916985169/automaton-1.11-8.jar 2015-03-11 10:14:56,491 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1528785190/tmp-1682649805/antlr-runtime-3.4.jar 2015-03-11 10:14:56,544 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1528785190/tmp1549937395/guava-11.0.2.jar 2015-03-11 10:14:56,611 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp-1528785190/tmp-711652979/joda-time-2.5.jar 2015-03-11 10:14:58,504 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp-1528785190/tmp-1849048894/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar 2015-03-11 10:14:58,566 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2015-03-11 10:14:59,547 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code. 2015-03-11 10:14:59,549 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche 2015-03-11 10:14:59,550 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize [] 2015-03-11 10:15:00,736 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 2015-03-11 10:15:01,006 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:15:01,009 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:15:01,137 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String). 2015-03-11 10:15:01,241 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:15:01,242 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:15:01,278 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:15:01,286 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:15:01,286 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:15:01,294 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:15:01,299 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:15:01,300 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:15:01,307 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:15:01,313 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:15:01,314 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:15:01,320 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:15:01,325 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:15:01,326 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:15:01,332 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6 2015-03-11 10:15:01,337 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:15:01,338 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:15:01,344 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:15:01,900 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:51 2015-03-11 10:15:02,189 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1424904779802_0604 2015-03-11 10:15:02,424 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources. 2015-03-11 10:15:02,543 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1424904779802_0604 2015-03-11 10:15:02,611 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1424904779802_0604/ 2015-03-11 10:15:02,613 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1424904779802_0604 2015-03-11 10:15:02,613 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2007,macro_preprocess_airline2_0,macro_preprocess_airline_0,macro_preprocess_airline_flt_0,macro_preprocess_joined_0,macro_preprocess_weather_0,macro_preprocess_weather_awnd_0,macro_preprocess_weather_prcp_0,macro_preprocess_weather_snow_0,macro_preprocess_weather_tmax_0,macro_preprocess_weather_tmin_0 2015-03-11 10:15:02,613 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_tmax_0[29,19],macro_preprocess_weather_tmax_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_airline_0[8,14],macro_preprocess_airline_0[-1,-1],macro_preprocess_airline_flt_0[16,18],macro_preprocess_airline2_0[19,15],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_snow_0[31,19],macro_preprocess_weather_snow_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_tmin_0[28,19],macro_preprocess_weather_tmin_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_prcp_0[30,19],macro_preprocess_weather_prcp_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_awnd_0[32,19],macro_preprocess_weather_awnd_0[-1,-1],macro_preprocess_joined_0[34,13] C: R: ORD_2007[36,15] 2015-03-11 10:15:02,633 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2015-03-11 10:15:02,634 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:04,915 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete 2015-03-11 10:16:04,916 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:11,936 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 8% complete 2015-03-11 10:16:11,937 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:17,953 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete 2015-03-11 10:16:17,955 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:21,967 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete 2015-03-11 10:16:21,967 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:26,978 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 21% complete 2015-03-11 10:16:26,978 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:32,991 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 25% complete 2015-03-11 10:16:32,991 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:37,003 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 30% complete 2015-03-11 10:16:37,003 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:42,017 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 34% complete 2015-03-11 10:16:42,017 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:47,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete 2015-03-11 10:16:47,031 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:16:55,051 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 44% complete 2015-03-11 10:16:55,052 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:00,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 49% complete 2015-03-11 10:17:00,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:02,077 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 54% complete 2015-03-11 10:17:02,077 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:05,086 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete 2015-03-11 10:17:05,089 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:10,103 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 63% complete 2015-03-11 10:17:10,105 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:13,113 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 68% complete 2015-03-11 10:17:13,113 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:15,119 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 83% complete 2015-03-11 10:17:15,120 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:30,157 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 88% complete 2015-03-11 10:17:30,158 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:38,178 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 92% complete 2015-03-11 10:17:38,180 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:45,196 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 97% complete 2015-03-11 10:17:45,196 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0604] 2015-03-11 10:17:53,425 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:17:53,428 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:17:53,442 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:17:54,067 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:17:54,069 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:17:54,078 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:17:54,529 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:17:54,531 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:17:54,542 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:17:54,623 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2015-03-11 10:17:54,629 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.6.0.2.2.0.0-2041 0.14.0.2.2.0.0-2041 demo 2015-03-11 10:14:55 2015-03-11 10:17:54 HASH_JOIN,FILTER Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_1424904779802_0604 51 7 117 37 95 104 88 83 85 85 ORD_2007,macro_preprocess_airline2_0,macro_preprocess_airline_0,macro_preprocess_airline_flt_0,macro_preprocess_joined_0,macro_preprocess_weather_0,macro_preprocess_weather_awnd_0,macro_preprocess_weather_prcp_0,macro_preprocess_weather_snow_0,macro_preprocess_weather_tmax_0,macro_preprocess_weather_tmin_0 HASH_JOIN hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_2, Input(s): Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv" Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv" Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv" Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv" Successfully read 7453216 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2007.csv" Successfully read 31065125 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv" Output(s): Successfully stored 359169 records (14789642 bytes) in: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_2" Counters: Total records written : 359169 Total bytes written : 14789642 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_1424904779802_0604 2015-03-11 10:17:54,800 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:17:54,800 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:17:54,809 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:17:55,194 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:17:55,196 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:17:55,204 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:17:55,444 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:17:55,446 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:17:55,455 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:17:55,511 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 160755 time(s). 2015-03-11 10:17:55,512 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 2015-03-11 10:17:55,538 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file 2015-03-11 10:17:56,000 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 5 time(s). 2015-03-11 10:17:56,015 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,FILTER 2015-03-11 10:17:56,051 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code. 2015-03-11 10:17:56,052 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]} 2015-03-11 10:17:56,084 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_weather_1: $4, $5, $6, $7 2015-03-11 10:17:56,086 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_1: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28 2015-03-11 10:17:56,392 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2015-03-11 10:17:56,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager) 2015-03-11 10:17:56,400 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2 2015-03-11 10:17:56,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 diamond splitter. 2015-03-11 10:17:56,404 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 2 MR operators. 2015-03-11 10:17:56,405 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2015-03-11 10:17:56,577 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:17:56,579 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:17:56,585 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 2015-03-11 10:17:56,588 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2015-03-11 10:17:56,590 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers. 2015-03-11 10:17:56,591 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator 2015-03-11 10:17:56,606 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6417173424 2015-03-11 10:17:56,607 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 7 2015-03-11 10:17:56,608 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process 2015-03-11 10:17:56,727 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp-1528785190/tmp1946848258/pig-0.14.0.2.2.0.0-2041-core-h2.jar 2015-03-11 10:17:56,897 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp-1528785190/tmp230810937/jython-standalone-2.5.3.jar 2015-03-11 10:17:57,303 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1528785190/tmp1969810629/automaton-1.11-8.jar 2015-03-11 10:17:57,358 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1528785190/tmp-613290889/antlr-runtime-3.4.jar 2015-03-11 10:17:57,424 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp-1528785190/tmp-340380009/guava-11.0.2.jar 2015-03-11 10:17:57,611 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp-1528785190/tmp-1936706949/joda-time-2.5.jar 2015-03-11 10:17:57,662 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp-1528785190/tmp420591900/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar 2015-03-11 10:17:57,677 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2015-03-11 10:17:57,679 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code. 2015-03-11 10:17:57,680 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche 2015-03-11 10:17:57,681 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize [] 2015-03-11 10:17:57,794 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 2015-03-11 10:17:57,955 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:17:57,957 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:17:57,995 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String). 2015-03-11 10:17:58,068 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:17:58,070 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:17:58,076 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:17:58,081 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:17:58,081 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:17:58,088 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:17:58,093 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:17:58,094 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:17:58,099 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:17:58,104 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:17:58,105 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:17:58,109 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:17:58,115 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:17:58,116 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:17:58,120 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9 2015-03-11 10:17:58,125 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2015-03-11 10:17:58,126 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2015-03-11 10:17:58,130 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6 2015-03-11 10:17:58,254 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:51 2015-03-11 10:17:58,321 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1424904779802_0605 2015-03-11 10:17:58,328 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources. 2015-03-11 10:17:58,390 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1424904779802_0605 2015-03-11 10:17:58,398 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1424904779802_0605/ 2015-03-11 10:17:58,400 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1424904779802_0605 2015-03-11 10:17:58,401 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2008,macro_preprocess_airline2_1,macro_preprocess_airline_1,macro_preprocess_airline_flt_1,macro_preprocess_joined_1,macro_preprocess_weather_1,macro_preprocess_weather_awnd_1,macro_preprocess_weather_prcp_1,macro_preprocess_weather_snow_1,macro_preprocess_weather_tmax_1,macro_preprocess_weather_tmin_1 2015-03-11 10:17:58,401 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_tmin_1[28,19],macro_preprocess_weather_tmin_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_airline_1[8,14],macro_preprocess_airline_1[-1,-1],macro_preprocess_airline_flt_1[16,18],macro_preprocess_airline2_1[19,15],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_awnd_1[32,19],macro_preprocess_weather_awnd_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_tmax_1[29,19],macro_preprocess_weather_tmax_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_snow_1[31,19],macro_preprocess_weather_snow_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_prcp_1[30,19],macro_preprocess_weather_prcp_1[-1,-1],macro_preprocess_joined_1[34,13] C: R: ORD_2008[36,15] 2015-03-11 10:17:58,413 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2015-03-11 10:17:58,414 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:18:57,671 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete 2015-03-11 10:18:57,672 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:07,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 8% complete 2015-03-11 10:19:07,702 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:15,721 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete 2015-03-11 10:19:15,722 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:22,744 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 16% complete 2015-03-11 10:19:22,745 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:30,765 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 21% complete 2015-03-11 10:19:30,765 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:37,781 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 26% complete 2015-03-11 10:19:37,781 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:42,793 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 35% complete 2015-03-11 10:19:42,793 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:45,801 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 39% complete 2015-03-11 10:19:45,801 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:50,813 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 44% complete 2015-03-11 10:19:50,813 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:19:55,827 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 51% complete 2015-03-11 10:19:55,827 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:20:05,852 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 56% complete 2015-03-11 10:20:05,853 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:20:27,912 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 61% complete 2015-03-11 10:20:27,913 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:20:40,942 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 69% complete 2015-03-11 10:20:40,943 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:20:42,948 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 79% complete 2015-03-11 10:20:42,948 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:20:50,970 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 83% complete 2015-03-11 10:20:50,970 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:21:00,998 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 89% complete 2015-03-11 10:21:00,999 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:21:06,010 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 93% complete 2015-03-11 10:21:06,011 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:21:13,031 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 98% complete 2015-03-11 10:21:13,033 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1424904779802_0605] 2015-03-11 10:21:24,213 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:21:24,215 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:21:24,225 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:21:24,574 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:21:24,576 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:21:24,585 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:21:24,851 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:21:24,852 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:21:24,862 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:21:24,907 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2015-03-11 10:21:24,909 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.6.0.2.2.0.0-2041 0.14.0.2.2.0.0-2041 demo 2015-03-11 10:17:56 2015-03-11 10:21:24 HASH_JOIN,FILTER Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_1424904779802_0605 51 7 151 16 104 101 114 106 110 109 ORD_2008,macro_preprocess_airline2_1,macro_preprocess_airline_1,macro_preprocess_airline_flt_1,macro_preprocess_joined_1,macro_preprocess_weather_1,macro_preprocess_weather_awnd_1,macro_preprocess_weather_prcp_1,macro_preprocess_weather_snow_1,macro_preprocess_weather_tmax_1,macro_preprocess_weather_tmin_1 HASH_JOIN hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_2, Input(s): Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv" Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv" Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv" Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv" Successfully read 32534244 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv" Successfully read 7009729 records from: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2008.csv" Output(s): Successfully stored 335330 records (13817679 bytes) in: "hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_2" Counters: Total records written : 335330 Total bytes written : 13817679 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_1424904779802_0605 2015-03-11 10:21:25,059 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:21:25,061 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:21:25,073 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:21:25,438 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:21:25,440 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:21:25,448 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:21:25,666 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/ 2015-03-11 10:21:25,669 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050 2015-03-11 10:21:25,679 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2015-03-11 10:21:25,730 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 136253 time(s). 2015-03-11 10:21:25,732 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 2015-03-11 10:21:25,766 [main] INFO org.apache.pig.Main - Pig script completed in 6 minutes, 41 seconds and 825 milliseconds (401825 ms)
We now read this data in, convert temparatures to Fahrenheit (note original temp is in Celcius*10), and prepare the training and testing datasets for modeling.
from sklearn.preprocessing import OneHotEncoder
# Convert Celsius to Fahrenheit
def fahrenheit(x): return(x*1.8 + 32.0)
# read files
cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday',
'origin_tmin', 'origin_tmax', 'origin_prcp', 'origin_snow', 'origin_wind']
col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int,
'carrier': str, 'dest': str, 'days_from_holiday': int,
'origin_tmin': float, 'origin_tmax': float, 'origin_prcp': float, 'origin_snow': float, 'origin_wind': float}
data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_2', cols, col_types)
data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_2', cols, col_types)
data_2007['origin_tmin'] = data_2007['origin_tmin'].apply(lambda x: fahrenheit(x/10.0))
data_2007['origin_tmax'] = data_2007['origin_tmax'].apply(lambda x: fahrenheit(x/10.0))
data_2008['origin_tmin'] = data_2008['origin_tmin'].apply(lambda x: fahrenheit(x/10.0))
data_2008['origin_tmax'] = data_2008['origin_tmax'].apply(lambda x: fahrenheit(x/10.0))
# Create training set and test set
train_y = data_2007['delay'] >= 15
categ = [cols.index(x) for x in 'hour', 'month', 'day', 'dow', 'carrier', 'dest']
enc = OneHotEncoder(categorical_features = categ)
df = data_2007.drop('delay', axis=1)
df['carrier'] = pd.factorize(df['carrier'])[0]
df['dest'] = pd.factorize(df['dest'])[0]
train_x = enc.fit_transform(df)
test_y = data_2008['delay'] >= 15
df = data_2008.drop('delay', axis=1)
df['carrier'] = pd.factorize(df['carrier'])[0]
df['dest'] = pd.factorize(df['dest'])[0]
test_x = enc.transform(df)
print train_x.shape
(359169, 414)
Good. So now that we have the training and test (validation) set ready, let's try Random Forest with the new features:
# Create Random Forest classifier with 100 trees
clf_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf_rf.fit(train_x.toarray(), train_y)
# Evaluate on test set
pr = clf_rf.predict(test_x.toarray())
# print results
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_rf = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "precision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
(report_rf[0], report_rf[1], report_rf[2], accuracy_score(list(test_y), list(pr)))
Confusion matrix 0 1 0 226595 13299 1 73098 22338 precision = 0.63, recall = 0.23, F1 = 0.34, accuracy = 0.74
with the new weather features, accuracy went up again from 0.70 to 0.74.
Clearly with more iterations, we are likely going to improve accuracy even further. For example, we can add weather information at the Origin, or explore the number of seats on the plan as a predictive feature (we can get that from the tail number), and so on.
In this blog post we have demonstrated how to build a predictive model with Hadoop and Python. We have used Hadoop to perform various types of data pre-processing and feature engineering tasks. We then applied Scikit-learn machine learning algorithm on the resulting datasets and have shown how via iterations we continuously add new and improved features resulting in better model performance.
In the next part of this multi-part blog post we will show how to perform the same learning task with Spark and ML-Lib.