Caterpillar Tube Pricing

In this competition we have data about tube assemblies which suppliers deliver to Caterpillar. The data consists of many files describing these tube assemblies.

train_set and test_set contain information on price quotes from suppliers. Prices can be quoted in 2 ways: bracket and non-bracket pricing. Bracket pricing has multiple levels of purchase based on quantity (in other words, the cost is given assuming a purchase of quantity tubes). Non-bracket pricing has a minimum order amount (min_order) for which the price would apply. Each quote is issued with an annual_usage, an estimate of how many tube assemblies will be purchased in a given year;
tube contains information on physical parameters of tube assemblies;
bill_of_materials contains the list of components, and their quantities, used on each tube assembly;
specs contains the list of unique specifications for the tube assembly;
tube_end_form contains the list of end types which are physically formed utilizing only the wall of the tube;
components contains the list of all of the components used;
type... files contain the names for each feature;
comp... files contain information on physical parameters of components by their type.

The goal of the competition is to predict prices of tube assemblies based on this information.

The main challenge of this competition is combining the data and choosing/creating features. After this I use XGBoost for prediction.

The metric used to score predictions is the Root Mean Squared Logarithmic Error (RMSLE). Because it compares logarithms, it penalizes an under-predicted estimate more heavily than an over-predicted estimate of the same absolute size.

The RMSLE is calculated as

$$ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } $$

Where:

$\epsilon$ is the RMSLE value (score); $n$ is the number of observations; $p_i$ is the prediction for observation $i$; $a_i$ is the actual response for observation $i$; $\log(x)$ is the natural logarithm of $x$.
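
As a quick check of the formula, here is a minimal sketch of the metric in NumPy. The function name `rmsle` and the sample numbers are my own illustration, not part of the competition code. It also shows the asymmetry: the same absolute error costs more when the prediction is too low.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error, following the formula above."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # np.log1p(x) computes log(x + 1) with the natural logarithm.
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))

# Same absolute error of 50 on an actual price of 100 (made-up numbers).
under = rmsle([50], [100])   # prediction too low
over = rmsle([150], [100])   # prediction too high
print(under > over)          # True: under-prediction is penalized more
```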

Data exploration

1.1 Bill of materials

1.2 Comp Adaptor

1.3 Comp Boss

1.4 Comp Hfl

1.5 Comp Elbow

1.6 Comp Float

1.7 Comp Nut

1.8 Comp Other

1.9 Comp Sleeve

1.10 Comp Straight

1.11 Comp Tee

1.12 Comp Threaded

1.13 Tube
Data preparation
Model

In [1]:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from scipy.stats import skew
import glob

Data exploration

An easy way to read all the files into separate variables is to get the list of files and loop through it. Here the list is used only to see which files exist; each file is then read explicitly.

In [2]:

glob.glob('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/*.csv')

Out[2]:

['Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\bill_of_materials.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\components.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_adaptor.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_boss.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_elbow.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_float.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_hfl.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_nut.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_other.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_sleeve.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_straight.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_tee.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\comp_threaded.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\sample_submission.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\specs.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\test_set.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\train_set.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\tube.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\tube_end_form.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\type_component.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\type_connection.csv',
 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\type_end_form.csv']

In [3]:

#Read each file into a separate data frame.
bill_of_materials = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/bill_of_materials.csv')
components = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/components.csv')
comp_adaptor = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_adaptor.csv')
comp_boss = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_boss.csv')
comp_elbow = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_elbow.csv')
comp_float = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_float.csv')
comp_hfl = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_hfl.csv')
comp_nut = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_nut.csv')
comp_other = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_other.csv')
comp_sleeve = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_sleeve.csv')
comp_straight = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_straight.csv')
comp_tee = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_tee.csv')
comp_threaded = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_threaded.csv')
specs = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/specs.csv')
tube = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/tube.csv')
tube_end_form = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/tube_end_form.csv')
type_component = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/type_component.csv')
type_connection = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/type_connection.csv')
type_end_form = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/type_end_form.csv')
train = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/train_set.csv', parse_dates=[2,])
test = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/test_set.csv', parse_dates=[3,])
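
The loop mentioned at the start of this section could look like the sketch below. It stores the frames in a dictionary keyed by file name instead of creating separate variables. The helper name `read_csv_folder` is hypothetical, and the sketch does not parse dates, so train_set and test_set would still need `parse_dates` as in the cell above.

```python
import glob
import os

import pandas as pd

def read_csv_folder(folder):
    """Read every CSV in `folder` into a dict keyed by file name without extension."""
    frames = {}
    for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        name = os.path.splitext(os.path.basename(path))[0]
        frames[name] = pd.read_csv(path)
    return frames

# Example usage with the folder from the cells above:
# data = read_csv_folder('Kaggle/Data_Kaggle/Caterpillar Tube Pricing')
# bill_of_materials = data['bill_of_materials']
```

A dictionary keeps the namespace clean, at the cost of slightly longer access (`data['tube']` instead of `tube`).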

Bill of materials

In [4]:

#The file lists the components of each tube assembly; all columns are needed.
#Missing values could only sensibly be filled with 0, but filling isn't necessary.
bill_of_materials.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21198 entries, 0 to 21197
Data columns (total 17 columns):
tube_assembly_id    21198 non-null object
component_id_1      19149 non-null object
quantity_1          19149 non-null float64
component_id_2      14786 non-null object
quantity_2          14786 non-null float64
component_id_3      4791 non-null object
quantity_3          4798 non-null float64
component_id_4      607 non-null object
quantity_4          608 non-null float64
component_id_5      92 non-null object
quantity_5          92 non-null float64
component_id_6      26 non-null object
quantity_6          26 non-null float64
component_id_7      7 non-null object
quantity_7          7 non-null float64
component_id_8      1 non-null object
quantity_8          1 non-null float64
dtypes: float64(8), object(9)
memory usage: 2.7+ MB

In [5]:

#Show the single row that uses all 8 component slots.
bill_of_materials.loc[bill_of_materials['quantity_8'].notnull()]

Out[5]:

	tube_assembly_id	component_id_1	quantity_1	component_id_2	quantity_2	component_id_3	quantity_3	component_id_4	quantity_4	component_id_5	quantity_5	component_id_6	quantity_6	component_id_7	quantity_7	component_id_8	quantity_8
11523	TA-11524	C-1253	1.0	C-1496	2.0	C-1500	1.0	C-1708	1.0	C-1811	1.0	C-1920	1.0	C-1921	1.0	C-1981	1.0

Comp Adaptor

In [6]:

#Some columns have very few non-null values, so look at the full table before deciding.
comp_adaptor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 20 columns):
component_id            25 non-null object
component_type_id       25 non-null object
adaptor_angle           1 non-null float64
overall_length          24 non-null float64
end_form_id_1           25 non-null object
connection_type_id_1    24 non-null object
length_1                1 non-null float64
thread_size_1           17 non-null float64
thread_pitch_1          17 non-null float64
nominal_size_1          8 non-null float64
end_form_id_2           25 non-null object
connection_type_id_2    24 non-null object
length_2                1 non-null float64
thread_size_2           23 non-null float64
thread_pitch_2          23 non-null float64
nominal_size_2          2 non-null float64
hex_size                17 non-null float64
unique_feature          25 non-null object
orientation             25 non-null object
weight                  23 non-null float64
dtypes: float64(12), object(8)
memory usage: 4.0+ KB

In [7]:

comp_adaptor

Out[7]:

	component_id	component_type_id	adaptor_angle	overall_length	end_form_id_1	connection_type_id_1	length_1	thread_size_1	thread_pitch_1	nominal_size_1	end_form_id_2	connection_type_id_2	length_2	thread_size_2	thread_pitch_2	nominal_size_2	hex_size	unique_feature	orientation	weight
0	C-0005	CP-028	NaN	58.40	A-001	B-001	NaN	1.312	12.0	NaN	A-001	B-004	NaN	1.000	11.5	NaN	34.93	No	No	0.206
1	C-0006	CP-028	NaN	34.80	A-001	B-001	NaN	0.437	20.0	NaN	A-001	B-005	NaN	0.750	16.0	NaN	22.20	No	No	0.083
2	C-1435	CP-028	NaN	20.30	A-007	B-004	NaN	NaN	NaN	15.88	A-001	B-007	NaN	0.875	18.0	NaN	22.22	No	No	0.023
3	C-1546	CP-028	NaN	26.40	A-007	B-004	NaN	0.125	27.0	NaN	A-001	B-004	NaN	0.125	27.0	NaN	15.88	No	No	0.026
4	C-1583	CP-028	NaN	44.50	A-001	B-005	NaN	1.312	12.0	NaN	A-007	B-005	NaN	1.062	12.0	NaN	38.10	No	No	0.256
5	C-1634	CP-028	NaN	34.50	A-001	B-005	NaN	0.750	16.0	NaN	A-001	B-002	NaN	0.687	16.0	NaN	22.23	No	No	0.060
6	C-1975	CP-028	NaN	13.20	A-007	B-007	NaN	NaN	NaN	3.18	A-001	B-007	NaN	0.312	28.0	NaN	NaN	No	No	0.005
7	C-0428	CP-028	NaN	26.99	A-001	B-004	NaN	0.250	18.0	NaN	A-007	NaN	NaN	NaN	NaN	9.52	17.46	No	No	0.032
8	C-0443	CP-028	NaN	22.35	A-007	B-007	NaN	NaN	NaN	19.05	9999	9999	NaN	1.062	16.0	NaN	26.97	No	No	NaN
9	C-0823	CP-028	NaN	16.80	A-007	B-007	NaN	NaN	NaN	9.52	A-001	9999	NaN	0.625	18.0	9.52	15.75	No	No	0.014
10	C-0354	CP-028	NaN	36.30	A-001	B-004	NaN	1.250	11.5	NaN	A-007	B-004	NaN	1.000	11.5	NaN	44.45	No	No	0.136
11	C-0361	CP-028	NaN	41.00	A-005	NaN	NaN	NaN	NaN	6.35	A-006	B-002	NaN	0.562	18.0	NaN	15.88	No	No	0.005
12	C-1312	CP-028	NaN	14.20	A-002	B-007	NaN	0.437	24.0	NaN	9999	9999	NaN	NaN	NaN	NaN	11.11	No	No	0.009
13	C-1313	CP-028	NaN	15.70	A-007	B-007	NaN	NaN	NaN	7.94	A-001	B-007	NaN	0.500	20.0	NaN	12.57	No	No	0.010
14	C-1194	CP-028	NaN	15.00	A-007	B-004	NaN	0.125	27.0	NaN	A-007	B-004	NaN	0.125	27.0	NaN	NaN	No	No	0.014
15	C-1195	CP-028	NaN	30.00	A-007	B-004	NaN	0.250	18.0	NaN	A-007	B-004	NaN	0.250	18.0	NaN	NaN	Yes	No	0.031
16	C-1196	CP-028	NaN	40.00	A-007	B-004	NaN	0.500	14.0	NaN	A-007	B-004	NaN	0.500	14.0	NaN	NaN	No	No	0.095
17	C-1197	CP-028	NaN	41.00	A-007	B-004	NaN	0.750	14.0	NaN	A-007	B-004	NaN	0.750	14.0	NaN	NaN	No	No	0.091
18	C-1198	CP-028	NaN	51.00	A-007	B-004	NaN	1.000	11.5	NaN	A-007	B-004	NaN	1.000	11.5	NaN	NaN	No	No	0.181
19	C-1229	CP-028	NaN	18.80	A-007	B-004	NaN	NaN	NaN	12.70	A-001	B-007	NaN	0.750	18.0	NaN	18.92	No	No	0.009
20	C-1230	CP-028	NaN	14.20	A-007	B-007	NaN	NaN	NaN	4.76	A-001	B-005	NaN	0.375	24.0	NaN	9.40	No	No	0.009
21	C-1695	CP-028	NaN	30.00	A-001	B-006	NaN	0.750	16.0	NaN	A-007	9999	NaN	9999.000	9999.0	NaN	25.40	No	No	NaN
22	C-1812	CP-028	NaN	35.10	A-007	B-004	NaN	0.375	18.0	NaN	A-007	B-004	NaN	0.375	18.0	NaN	NaN	No	No	0.091
23	C-1828	CP-028	NaN	28.40	A-001	B-005	NaN	1.312	12.0	NaN	A-007	B-005	NaN	0.562	18.0	NaN	38.10	No	No	0.181
24	C-1868	CP-029	90.0	NaN	A-001	B-005	65.5	0.750	16.0	NaN	A-001	B-002	28.0	0.812	16.0	NaN	NaN	No	Yes	0.226

component_type_id, end_form_id_1, connection_type_id_1, end_form_id_2, connection_type_id_2 - descriptive identifiers. I treat them as uninformative in all component files. Drop.

adaptor_angle - only one non-null value. Drop.

unique_feature, orientation - categorical descriptions. Drop.

C-1695 has abnormal values (9999 for thread_size_2 and thread_pitch_2) and is used in only one tube assembly, so I'll drop it.

C-0443 has NaN weight and belongs to only one tube assembly, which has two components, both with NaN weight. So drop.

C-1868: it seems that for some reason overall_length wasn't calculated. I'll calculate it as the sum of length_1 and length_2 (65.5 + 28.0 = 93.5), then drop length_1 and length_2.

In [8]:

comp_adaptor.drop(['adaptor_angle', 'component_type_id', 'end_form_id_1', 'connection_type_id_1', 'length_1', 'length_2',
                   'unique_feature', 'orientation', 'end_form_id_2', 'connection_type_id_2'], axis=1, inplace=True)

#A formula could be used, but only a single value is missing: 65.5 + 28.0 = 93.5.
comp_adaptor.loc[comp_adaptor['overall_length'].isnull(), 'overall_length'] = 93.5
comp_adaptor.drop(comp_adaptor.index[[8, 21]], inplace=True)
comp_adaptor

Out[8]:

	component_id	overall_length	thread_size_1	thread_pitch_1	nominal_size_1	thread_size_2	thread_pitch_2	nominal_size_2	hex_size	weight
0	C-0005	58.40	1.312	12.0	NaN	1.000	11.5	NaN	34.93	0.206
1	C-0006	34.80	0.437	20.0	NaN	0.750	16.0	NaN	22.20	0.083
2	C-1435	20.30	NaN	NaN	15.88	0.875	18.0	NaN	22.22	0.023
3	C-1546	26.40	0.125	27.0	NaN	0.125	27.0	NaN	15.88	0.026
4	C-1583	44.50	1.312	12.0	NaN	1.062	12.0	NaN	38.10	0.256
5	C-1634	34.50	0.750	16.0	NaN	0.687	16.0	NaN	22.23	0.060
6	C-1975	13.20	NaN	NaN	3.18	0.312	28.0	NaN	NaN	0.005
7	C-0428	26.99	0.250	18.0	NaN	NaN	NaN	9.52	17.46	0.032
9	C-0823	16.80	NaN	NaN	9.52	0.625	18.0	9.52	15.75	0.014
10	C-0354	36.30	1.250	11.5	NaN	1.000	11.5	NaN	44.45	0.136
11	C-0361	41.00	NaN	NaN	6.35	0.562	18.0	NaN	15.88	0.005
12	C-1312	14.20	0.437	24.0	NaN	NaN	NaN	NaN	11.11	0.009
13	C-1313	15.70	NaN	NaN	7.94	0.500	20.0	NaN	12.57	0.010
14	C-1194	15.00	0.125	27.0	NaN	0.125	27.0	NaN	NaN	0.014
15	C-1195	30.00	0.250	18.0	NaN	0.250	18.0	NaN	NaN	0.031
16	C-1196	40.00	0.500	14.0	NaN	0.500	14.0	NaN	NaN	0.095
17	C-1197	41.00	0.750	14.0	NaN	0.750	14.0	NaN	NaN	0.091
18	C-1198	51.00	1.000	11.5	NaN	1.000	11.5	NaN	NaN	0.181
19	C-1229	18.80	NaN	NaN	12.70	0.750	18.0	NaN	18.92	0.009
20	C-1230	14.20	NaN	NaN	4.76	0.375	24.0	NaN	9.40	0.009
22	C-1812	35.10	0.375	18.0	NaN	0.375	18.0	NaN	NaN	0.091
23	C-1828	28.40	1.312	12.0	NaN	0.562	18.0	NaN	38.10	0.181
24	C-1868	93.50	0.750	16.0	NaN	0.812	16.0	NaN	NaN	0.226
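
For reference, the hard-coded 93.5 can be replaced by the formula itself, as long as it runs before length_1 and length_2 are dropped. The sketch below uses a small toy frame (two rows copied from the table above) rather than the real comp_adaptor, so the variable name `adaptor` is only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with the relevant comp_adaptor columns (values taken from the table above).
adaptor = pd.DataFrame({
    'component_id': ['C-0005', 'C-1868'],
    'overall_length': [58.4, np.nan],
    'length_1': [np.nan, 65.5],
    'length_2': [np.nan, 28.0],
})

# Fill overall_length only where it is missing, using length_1 + length_2.
missing = adaptor['overall_length'].isnull()
adaptor.loc[missing, 'overall_length'] = (
    adaptor.loc[missing, 'length_1'] + adaptor.loc[missing, 'length_2']
)

# The partial lengths are no longer needed once the total is filled in.
adaptor = adaptor.drop(['length_1', 'length_2'], axis=1)
print(adaptor)
```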

Comp Boss

In [9]:

#Descriptive and categorical features will be dropped.
comp_boss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 15 columns):
component_id          147 non-null object
component_type_id     147 non-null object
type                  124 non-null object
connection_type_id    147 non-null object
outside_shape         124 non-null object
base_type             124 non-null object
height_over_tube      147 non-null float64
bolt_pattern_long     23 non-null float64
bolt_pattern_wide     17 non-null float64
groove                147 non-null object
base_diameter         57 non-null float64
shoulder_diameter     30 non-null float64
unique_feature        147 non-null object
orientation           147 non-null object
weight                145 non-null float64
dtypes: float64(6), object(9)
memory usage: 17.3+ KB

In [10]:

#Use only important information. Copy so the later in-place drop doesn't act on a slice.
comp_boss = comp_boss[['component_id', 'height_over_tube', 'weight']].copy()
comp_boss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 3 columns):
component_id        147 non-null object
height_over_tube    147 non-null float64
weight              145 non-null float64
dtypes: float64(2), object(1)
memory usage: 3.5+ KB

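I wrote the following check for outliers. Some values are far too big (for example 9999, which looks like a placeholder for a missing value); they are either mistakes or have some unknown cause. For every numeric column in which some value exceeds four times the column mean, the loop prints the row holding the column maximum. I'll drop such rows. The threshold is arbitrary.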

In [11]:

for i in comp_boss.columns:
    if comp_boss[i].dtype != object:
        if (comp_boss[i] > 4 * comp_boss[i].mean()).any():
            print(i)
            print(comp_boss.loc[comp_boss[i] == comp_boss[i].max()])

height_over_tube
   component_id  height_over_tube  weight
31       C-0307            9999.0    1.05
weight
   component_id  height_over_tube  weight
21       C-0243              36.0   4.034
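
Since the same check is repeated for several component files below, it could be wrapped in a small helper. This is only a sketch: the name `report_suspect_max` and the `factor` parameter are my own, and the demo frame reuses a few comp_boss rows shown in this section.

```python
import pandas as pd

def report_suspect_max(df, factor=4):
    """Return {column: rows holding the column maximum} for every numeric
    column in which at least one value exceeds `factor` times the column mean."""
    suspects = {}
    for col in df.select_dtypes(include='number').columns:
        if (df[col] > factor * df[col].mean()).any():
            suspects[col] = df.loc[df[col] == df[col].max()]
    return suspects

# Demo with a handful of comp_boss rows, including the 9999 placeholder.
demo = pd.DataFrame({
    'component_id': ['C-0008', 'C-0009', 'C-0020', 'C-0054', 'C-0071', 'C-0307'],
    'height_over_tube': [17.0, 13.0, 28.4, 27.1, 20.0, 9999.0],
})
for col, rows in report_suspect_max(demo).items():
    print(col)
    print(rows)
```

Note that a maximum can only exceed four times the mean when the column has more than four non-null values, so the check is silent on very small tables.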

In [12]:

#Drop the row with the too-big height_over_tube. I keep the large weight, because it could be reasonable.
comp_boss.drop(comp_boss.index[31], inplace=True)
comp_boss.head()

Out[12]:

	component_id	height_over_tube	weight
0	C-0008	17.0	0.032
1	C-0009	13.0	0.033
2	C-0020	28.4	0.070
3	C-0054	27.1	0.180
4	C-0071	20.0	0.080

Comp Hfl

In [13]:

comp_hfl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 9 columns):
component_id           6 non-null object
component_type_id      6 non-null object
hose_diameter          6 non-null float64
corresponding_shell    6 non-null object
coupling_class         6 non-null object
material               6 non-null object
plating                6 non-null object
orientation            6 non-null object
weight                 6 non-null float64
dtypes: float64(2), object(7)
memory usage: 512.0+ bytes

In [14]:

comp_hfl

Out[14]:

	component_id	component_type_id	hose_diameter	corresponding_shell	coupling_class	material	plating	orientation	weight
0	C-0872	CP-023	4.8	C-0855	SP-0098	SP-0016	Yes	No	0.010
1	C-0873	CP-023	4.8	C-0856	SP-0098	SP-0016	Yes	No	0.010
2	C-0874	CP-023	4.8	C-0857	SP-0098	SP-0038	Yes	No	0.001
3	C-1039	CP-023	15.9	C-1040	SP-0097	SP-0095	No	No	0.052
4	C-1041	CP-023	15.9	C-1042	SP-0099	SP-0095	No	No	0.065
5	C-1043	CP-023	25.4	C-1044	SP-0099	SP-0095	No	No	0.196

In [15]:

#It seems that only three columns are necessary.
comp_hfl = comp_hfl[['component_id', 'hose_diameter', 'weight']]
comp_hfl

Out[15]:

	component_id	hose_diameter	weight
0	C-0872	4.8	0.010
1	C-0873	4.8	0.010
2	C-0874	4.8	0.001
3	C-1039	15.9	0.052
4	C-1041	15.9	0.065
5	C-1043	25.4	0.196

Comp Elbow

In [16]:

comp_elbow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 16 columns):
component_id          178 non-null object
component_type_id     178 non-null object
bolt_pattern_long     171 non-null float64
bolt_pattern_wide     138 non-null float64
extension_length      170 non-null float64
overall_length        175 non-null float64
thickness             171 non-null float64
drop_length           171 non-null float64
elbow_angle           130 non-null float64
mj_class_code         41 non-null object
mj_plug_class_code    40 non-null object
plug_diameter         7 non-null float64
groove                178 non-null object
unique_feature        178 non-null object
orientation           178 non-null object
weight                176 non-null float64
dtypes: float64(9), object(7)
memory usage: 22.3+ KB

In [17]:

comp_elbow.head()

Out[17]:

	component_id	component_type_id	bolt_pattern_long	bolt_pattern_wide	extension_length	overall_length	thickness	drop_length	elbow_angle	mj_class_code	mj_plug_class_code	plug_diameter	groove	unique_feature	orientation	weight
0	C-0013	CP-008	152.4	92.08	105.0	185.0	113.0	75.0	90.0	NaN	NaN	NaN	Yes	No	Yes	8.890
1	C-0016	CP-009	57.2	27.80	42.0	69.0	44.0	24.0	90.0	NaN	NaN	NaN	No	No	Yes	1.172
2	C-0017	CP-009	57.2	27.80	42.0	69.0	47.0	26.0	90.0	NaN	NaN	NaN	Yes	No	Yes	1.245
3	C-0018	CP-009	66.6	31.80	50.0	80.0	57.0	31.5	90.0	NaN	NaN	NaN	Yes	No	Yes	1.863
4	C-0021	CP-010	75.0	NaN	31.5	70.0	25.0	12.5	90.0	NaN	NaN	NaN	No	Yes	Yes	0.903

In [18]:

#Most of the columns aren't necessary.
comp_elbow.drop(['component_type_id', 'mj_class_code', 'mj_plug_class_code', 'plug_diameter', 'groove', 'unique_feature',
                 'orientation',], axis=1, inplace=True)

In [19]:

for i in comp_elbow.columns:
    if comp_elbow[i].dtype != object:
        if (comp_elbow[i] > 4 * comp_elbow[i].mean()).any():
            print(i)
            print(comp_elbow.loc[comp_elbow[i] == comp_elbow[i].max()])

drop_length
   component_id  bolt_pattern_long  bolt_pattern_wide  extension_length  \
52       C-0537               78.2                NaN              32.0   

    overall_length  thickness  drop_length  elbow_angle  weight  
52            64.0       36.0       9999.0       169.25    0.42  
weight
   component_id  bolt_pattern_long  bolt_pattern_wide  extension_length  \
62       C-0646             106.38              61.93              55.5   

    overall_length  thickness  drop_length  elbow_angle  weight  
62           111.0      115.0         57.5         90.0   10.19

In [20]:

comp_elbow.drop(comp_elbow.index[52], inplace=True)

Comp Float

In [21]:

comp_float.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 7 columns):
component_id         16 non-null object
component_type_id    16 non-null object
bolt_pattern_long    16 non-null float64
bolt_pattern_wide    16 non-null float64
thickness            16 non-null float64
orientation          16 non-null object
weight               16 non-null float64
dtypes: float64(4), object(3)
memory usage: 976.0+ bytes

In [22]:

#Drop description.
comp_float.drop(['component_type_id', 'orientation'], axis=1, inplace=True)
comp_float

Out[22]:

	component_id	bolt_pattern_long	bolt_pattern_wide	thickness	weight
0	C-0027	148.00	96.00	18.00	2.230
1	C-0454	58.72	30.18	28.00	0.590
2	C-0455	58.72	30.18	28.00	0.525
3	C-0494	52.40	26.20	15.85	0.230
4	C-0496	58.80	30.20	14.20	0.284
5	C-0508	77.76	42.88	25.40	1.144
6	C-0572	69.85	35.71	15.70	0.447
7	C-0797	120.65	69.85	20.00	1.983
8	C-0891	47.62	22.22	28.00	0.465
9	C-1096	76.40	69.90	25.00	2.565
10	C-1102	130.20	77.80	36.50	4.060
11	C-1135	116.00	17.00	22.00	1.880
12	C-1149	52.36	26.19	36.00	0.783
13	C-1452	52.38	26.19	22.00	0.276
14	C-1453	58.72	30.18	22.00	0.389
15	C-1584	106.38	61.93	25.40	0.516

Comp Nut

In [23]:

comp_nut.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 11 columns):
component_id         65 non-null object
component_type_id    65 non-null object
hex_nut_size         42 non-null float64
seat_angle           15 non-null float64
length               65 non-null float64
thread_size          65 non-null object
thread_pitch         65 non-null float64
diameter             23 non-null float64
blind_hole           23 non-null object
orientation          65 non-null object
weight               64 non-null float64
dtypes: float64(6), object(5)
memory usage: 5.7+ KB

In [24]:

comp_nut.drop(['component_type_id', 'seat_angle', 'diameter', 'blind_hole', 'orientation'], axis=1, inplace=True)
comp_nut.head()

Out[24]:

	component_id	hex_nut_size	length	thread_size	thread_pitch	weight
0	C-1621	20.64	17.0	.687	16.0	0.015
1	C-1624	34.92	26.5	1.187	12.0	0.035
2	C-1623	28.58	23.5	1.000	14.0	0.044
3	C-1622	23.81	20.0	.812	16.0	0.036
4	C-1625	41.28	27.5	1.437	12.0	0.129

In [25]:

for i in comp_nut.columns:
    if comp_nut[i].dtype != object:
        if (comp_nut[i] > 4 * comp_nut[i].mean()).any():
            print(i)
            print(comp_nut.loc[comp_nut[i] == comp_nut[i].max()])

weight
   component_id  hex_nut_size  length thread_size  thread_pitch  weight
27       C-1442         57.15    36.1       1.875          12.0   0.343

Comp Other

In [26]:

comp_other.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 3 columns):
component_id    1001 non-null object
part_name       1001 non-null object
weight          945 non-null float64
dtypes: float64(1), object(2)
memory usage: 23.5+ KB

In [27]:

#Drop description.
comp_other.drop(['part_name'], axis=1, inplace=True)
comp_other.head()

Out[27]:

	component_id	weight
0	C-1385	0.014
1	C-1386	0.005
2	C-1369	0.003
3	C-0422	0.003
4	C-1817	0.014

Comp Sleeve

In [28]:

comp_sleeve.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 10 columns):
component_id           50 non-null object
component_type_id      50 non-null object
connection_type_id     50 non-null object
length                 50 non-null float64
intended_nut_thread    50 non-null float64
intended_nut_pitch     50 non-null int64
unique_feature         50 non-null object
plating                50 non-null object
orientation            50 non-null object
weight                 50 non-null float64
dtypes: float64(3), int64(1), object(6)
memory usage: 4.0+ KB

In [29]:

comp_sleeve.drop(['component_type_id', 'connection_type_id', 'unique_feature', 'plating', 'orientation'], axis=1, inplace=True)
comp_sleeve.head()

Out[29]:

	component_id	length	intended_nut_thread	intended_nut_pitch	weight
0	C-0001	17.3	1.062	12	0.013
1	C-0002	11.2	0.500	20	0.005
2	C-0003	19.3	1.187	12	0.014
3	C-0048	9.5	0.562	18	0.006
4	C-0049	9.5	0.812	16	0.012

In [30]:

for i in comp_sleeve.columns:
    if comp_sleeve[i].dtype != object:
        if (comp_sleeve[i] > 4 * comp_sleeve[i].mean()).any():
            print(i)
            print(comp_sleeve.loc[comp_sleeve[i] == comp_sleeve[i].max()])

length
   component_id  length  intended_nut_thread  intended_nut_pitch  weight
28       C-1627  9999.0                0.562                  18   0.005
29       C-1628  9999.0                0.687                  16   0.006
30       C-1629  9999.0                0.812                  16   0.012
31       C-1630  9999.0                1.000                  14   0.018
32       C-1631  9999.0                1.187                  12   0.026
33       C-1632  9999.0                1.437                  12   0.033
34       C-1633  9999.0                1.687                  12   0.040
48       C-1866  9999.0                2.000                  12   0.035

In [31]:

comp_sleeve.drop(comp_sleeve.index[[28, 29, 30, 31, 32, 33, 34, 48]], inplace=True)
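
Dropping by hard-coded positions works for this file but would silently remove the wrong rows if the file were reordered. An alternative is to filter on the value itself. The sketch below assumes that 9999 is a placeholder rather than a real measurement and uses a toy frame with a few comp_sleeve rows.

```python
import pandas as pd

SENTINEL = 9999  # assumption: 9999 marks an unknown value, not a real length

# Toy frame with a few comp_sleeve rows from the outputs above.
sleeve = pd.DataFrame({
    'component_id': ['C-0001', 'C-0002', 'C-1627', 'C-1628'],
    'length': [17.3, 11.2, 9999.0, 9999.0],
    'weight': [0.013, 0.005, 0.005, 0.006],
})

# Keep only rows whose length is not the sentinel; no row positions are hard-coded.
sleeve = sleeve.loc[sleeve['length'] != SENTINEL]
print(sleeve)
```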

Comp Straight

In [32]:

comp_straight.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 361 entries, 0 to 360
Data columns (total 12 columns):
component_id         361 non-null object
component_type_id    361 non-null object
bolt_pattern_long    291 non-null float64
bolt_pattern_wide    204 non-null float64
head_diameter        70 non-null float64
overall_length       41 non-null float64
thickness            361 non-null float64
mj_class_code        120 non-null object
groove               361 non-null object
unique_feature       361 non-null object
orientation          361 non-null object
weight               354 non-null float64
dtypes: float64(6), object(6)
memory usage: 33.9+ KB

In [33]:

comp_straight.drop(['component_type_id', 'overall_length', 'mj_class_code', 'head_diameter', 'unique_feature', 'groove',
                    'orientation'], axis=1, inplace=True)
comp_straight.head()

Out[33]:

	component_id	bolt_pattern_long	bolt_pattern_wide	thickness	weight
0	C-0012	66.68	31.75	20.0	0.788
1	C-0014	47.60	22.20	15.0	0.339
2	C-0015	66.70	31.80	20.0	0.788
3	C-0019	77.80	42.90	36.5	1.533
4	C-0029	47.63	22.23	16.0	0.286

In [34]:

for i in comp_straight.columns:
    if comp_straight[i].dtype != object:
        if (comp_straight[i] > 4 * comp_straight[i].mean()).any():
            print(i)
            print(comp_straight.loc[comp_straight[i] == comp_straight[i].max()])

weight
    component_id  bolt_pattern_long  bolt_pattern_wide  thickness  weight
171       C-1066               38.1                NaN       13.0   9.693

Comp Tee

In [35]:

comp_tee.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 14 columns):
component_id          4 non-null object
component_type_id     4 non-null object
bolt_pattern_long     4 non-null float64
bolt_pattern_wide     4 non-null float64
extension_length      4 non-null float64
overall_length        4 non-null float64
thickness             4 non-null int64
drop_length           4 non-null float64
mj_class_code         4 non-null object
mj_plug_class_code    4 non-null object
groove                4 non-null object
unique_feature        4 non-null object
orientation           4 non-null object
weight                4 non-null float64
dtypes: float64(6), int64(1), object(7)
memory usage: 528.0+ bytes

In [36]:

comp_tee.drop(['component_type_id', 'mj_class_code', 'mj_plug_class_code', 'groove', 'unique_feature', 'orientation'],
              axis=1, inplace=True)
comp_tee

Out[36]:

	component_id	bolt_pattern_long	bolt_pattern_wide	extension_length	overall_length	thickness	drop_length	weight
0	C-0271	58.70	30.20	57.10	93.0	57	28.5	1.526
1	C-1809	58.72	30.18	57.09	108.0	57	28.5	2.184
2	C-1830	52.40	26.20	43.50	78.5	51	25.5	1.135
3	C-1865	58.70	30.20	57.10	107.0	57	28.5	1.953

In [37]:

for i in comp_tee.columns:
    if comp_tee[i].dtype != object:
        if (comp_tee[i] > 4 * comp_tee[i].mean()).any():
            print(i)
            print(comp_tee.loc[comp_tee[i] == comp_tee[i].max()])

Comp Threaded

In [38]:

comp_threaded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 32 columns):
component_id            194 non-null object
component_type_id       194 non-null object
adaptor_angle           54 non-null float64
overall_length          121 non-null float64
hex_size                112 non-null float64
end_form_id_1           194 non-null object
connection_type_id_1    135 non-null object
length_1                73 non-null float64
thread_size_1           135 non-null float64
thread_pitch_1          135 non-null float64
nominal_size_1          60 non-null object
end_form_id_2           194 non-null object
connection_type_id_2    63 non-null object
length_2                73 non-null float64
thread_size_2           63 non-null float64
thread_pitch_2          63 non-null float64
nominal_size_2          131 non-null float64
end_form_id_3           19 non-null object
connection_type_id_3    10 non-null object
length_3                19 non-null float64
thread_size_3           10 non-null float64
thread_pitch_3          10 non-null float64
nominal_size_3          9 non-null float64
end_form_id_4           1 non-null object
connection_type_id_4    1 non-null object
length_4                1 non-null float64
thread_size_4           1 non-null float64
thread_pitch_4          1 non-null float64
nominal_size_4          0 non-null float64
unique_feature          194 non-null object
orientation             194 non-null object
weight                  193 non-null float64
dtypes: float64(19), object(13)
memory usage: 48.6+ KB

In [39]:

comp_threaded.drop(['component_type_id', 'adaptor_angle', 'end_form_id_1', 'connection_type_id_1', 'end_form_id_2',
                    'connection_type_id_2', 'end_form_id_3', 'connection_type_id_3', 'end_form_id_4', 'connection_type_id_4',
                    'nominal_size_4', 'unique_feature', 'orientation'], axis=1, inplace=True)
comp_threaded.head()

Out[39]:

	component_id	overall_length	hex_size	length_1	thread_size_1	thread_pitch_1	nominal_size_1	length_2	thread_size_2	thread_pitch_2	nominal_size_2	length_3	thread_size_3	thread_pitch_3	nominal_size_3	length_4	thread_size_4	thread_pitch_4	weight
0	C-0007	24.4	12.70	NaN	0.437	20.0	NaN	NaN	NaN	NaN	6.35	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.028
1	C-0030	NaN	NaN	47.0	NaN	NaN	42	47.0	NaN	NaN	42.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.914
2	C-0041	28.0	32.00	NaN	NaN	NaN	21.7	NaN	0.812	16.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.120
3	C-0043	38.0	27.00	NaN	NaN	NaN	21.7	NaN	0.812	16.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.090
4	C-0044	30.5	41.28	NaN	NaN	NaN	34	NaN	1.437	12.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.182

In [40]:

#There are five columns with length. So I fill NA with 0, sum the lengths into overall_length and drop the extra columns.
comp_threaded['length_1'] = comp_threaded['length_1'].fillna(0)
comp_threaded['length_2'] = comp_threaded['length_2'].fillna(0)
comp_threaded['length_3'] = comp_threaded['length_3'].fillna(0)
comp_threaded['length_4'] = comp_threaded['length_4'].fillna(0)
comp_threaded['overall_length'] = comp_threaded['overall_length'].fillna(0)
comp_threaded['overall_length'] = comp_threaded['overall_length'] + comp_threaded['length_1'] + comp_threaded['length_2'] \
+ comp_threaded['length_3'] + comp_threaded['length_4']

comp_threaded.drop(['length_1', 'length_2', 'length_3', 'length_4'], axis=1, inplace=True)
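
The fill-and-add steps above can also be written as one row-wise sum. In the sketch below, `min_count=1` keeps the result as NaN (instead of 0) if a row had no length information at all. The third row is hypothetical and exists only to show that case; I have not checked whether such rows occur in comp_threaded.

```python
import numpy as np
import pandas as pd

# Toy frame: two rows from the table above plus one hypothetical all-missing row.
threaded = pd.DataFrame({
    'component_id': ['C-0007', 'C-0030', 'C-HYPO'],
    'overall_length': [24.4, np.nan, np.nan],
    'length_1': [np.nan, 47.0, np.nan],
    'length_2': [np.nan, 47.0, np.nan],
})

length_cols = ['overall_length', 'length_1', 'length_2']
# NaN values are skipped; min_count=1 makes the sum NaN when every length is missing.
threaded['overall_length'] = threaded[length_cols].sum(axis=1, min_count=1)
threaded = threaded.drop(['length_1', 'length_2'], axis=1)
print(threaded)
```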

In [41]:

#Print every non-object column in which some value exceeds four times the column mean, along with the row(s) holding the maximum.
for i in comp_threaded.columns:
    if comp_threaded[i].dtype != object:
        if (comp_threaded[i] > 4 * comp_threaded[i].mean()).any():
            print(i)
            print(comp_threaded.loc[comp_threaded[i] == comp_threaded[i].max()])

nominal_size_2
   component_id  overall_length  hex_size  thread_size_1  thread_pitch_1  \
40       C-0386            42.7      60.3            2.0            12.0   

   nominal_size_1  thread_size_2  thread_pitch_2  nominal_size_2  \
40            NaN            NaN             NaN          9999.0   

    thread_size_3  thread_pitch_3  nominal_size_3  thread_size_4  \
40            NaN             NaN             NaN            NaN   

    thread_pitch_4  weight  
40             NaN   0.543  
nominal_size_3
   component_id  overall_length  hex_size  thread_size_1  thread_pitch_1  \
90       C-0971           124.1       NaN          1.187            12.0   

   nominal_size_1  thread_size_2  thread_pitch_2  nominal_size_2  \
90            NaN          1.187            12.0             NaN   

    thread_size_3  thread_pitch_3  nominal_size_3  thread_size_4  \
90            NaN             NaN          9999.0            NaN   

    thread_pitch_4  weight  
90             NaN   0.545  
weight
   component_id  overall_length  hex_size  thread_size_1  thread_pitch_1  \
91       C-0974           112.7       NaN          1.437            12.0   

   nominal_size_1  thread_size_2  thread_pitch_2  nominal_size_2  \
91            NaN            NaN             NaN           31.75   

    thread_size_3  thread_pitch_3  nominal_size_3  thread_size_4  \
91            NaN             NaN             NaN            NaN   

    thread_pitch_4  weight  
91             NaN    1.17

In [42]:

#Rows 40 and 90 have a nominal size of 9999, so I drop them.
comp_threaded.drop(comp_threaded.index[[40, 90]], inplace=True)

Tube

In [43]:

tube.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21198 entries, 0 to 21197
Data columns (total 16 columns):
tube_assembly_id    21198 non-null object
material_id         20919 non-null object
diameter            21198 non-null float64
wall                21198 non-null float64
length              21198 non-null float64
num_bends           21198 non-null int64
bend_radius         21198 non-null float64
end_a_1x            21198 non-null object
end_a_2x            21198 non-null object
end_x_1x            21198 non-null object
end_x_2x            21198 non-null object
end_a               21198 non-null object
end_x               21198 non-null object
num_boss            21198 non-null int64
num_bracket         21198 non-null int64
other               21198 non-null int64
dtypes: float64(4), int64(4), object(8)
memory usage: 2.6+ MB

In [44]:

tube.drop(['material_id', 'end_a_1x', 'end_a_2x', 'end_x_1x', 'end_x_2x', 'end_a', 'end_x', 'num_boss', 'num_bracket', 'other'],
          axis=1, inplace=True)
tube.head()

Out[44]:

	tube_assembly_id	diameter	wall	length	num_bends	bend_radius
0	TA-00001	12.70	1.65	164.0	5	38.10
1	TA-00002	6.35	0.71	137.0	8	19.05
2	TA-00003	6.35	0.71	127.0	7	19.05
3	TA-00004	6.35	0.71	137.0	9	19.05
4	TA-00005	19.05	1.24	109.0	4	50.80

In [45]:

#Same check for tube: print the columns with values above four times the mean and the row(s) holding the maximum.
for i in tube.columns:
    if tube[i].dtype != object:
        if (tube[i] > 4 * tube[i].mean()).any():
            print(i)
            print(tube.loc[tube[i] == tube[i].max()])

diameter
      tube_assembly_id  diameter  wall  length  num_bends  bend_radius
11306         TA-11307     203.2   1.9    62.0          0          0.0
11309         TA-11310     203.2   1.9    48.0          0          0.0
11341         TA-11342     203.2   1.9    40.0          0          0.0
13049         TA-13050     203.2   1.9    70.0          0          0.0
14709         TA-14710     203.2   1.9    64.0          0          0.0
15619         TA-15620     203.2   1.9    26.0          0          0.0
16762         TA-16763     203.2   1.9   129.0          2        203.2
17421         TA-17422     203.2   1.9    22.0          0          0.0
wall
      tube_assembly_id  diameter  wall  length  num_bends  bend_radius
389           TA-00390      66.7   7.9   310.0          2       177.80
390           TA-00391      66.7   7.9   393.0          2       177.80
391           TA-00392      66.7   7.9   174.0          2       177.80
392           TA-00393      66.7   7.9   188.0          2       177.80
431           TA-00432      66.7   7.9   146.0          2       177.80
432           TA-00433      66.7   7.9   146.0          2       177.80
1228          TA-01229      66.7   7.9   430.0          5       177.80
1620          TA-01621      66.7   7.9   160.0          2       177.80
3835          TA-03836      66.7   7.9   114.0          3       180.00
4184          TA-04185      66.7   7.9   466.0          5       177.80
4185          TA-04186      66.7   7.9   149.0          2       177.80
4186          TA-04187      66.7   7.9   130.0          2       177.80
4613          TA-04614      66.7   7.9   139.0          2       177.80
5289          TA-05290      66.7   7.9   225.0          0         0.00
5290          TA-05291      66.7   7.9   161.0          2       180.00
5644          TA-05645      66.7   7.9   134.0          1       180.00
6446          TA-06447      66.7   7.9   211.0          2       180.00
6447          TA-06448      66.7   7.9   211.0          2       180.00
6448          TA-06449      66.7   7.9   351.0          2       180.00
6449          TA-06450      66.7   7.9   351.0          2       180.00
6518          TA-06519      66.7   7.9   141.0          2       180.00
8602          TA-08603      66.7   7.9    37.0          1       180.00
8932          TA-08933      66.7   7.9   221.0          2       180.00
9026          TA-09027      66.7   7.9    68.0          2       171.45
9027          TA-09028      66.7   7.9    68.0          2       171.45
11542         TA-11543      66.7   7.9   158.0          2       180.00
11543         TA-11544      66.7   7.9   158.0          2       180.00
11546         TA-11547      66.7   7.9    78.0          2       180.00
11637         TA-11638      66.7   7.9    78.0          2       180.00
11690         TA-11691      66.7   7.9   112.0          2       180.00
11691         TA-11692      66.7   7.9   110.0          2       180.00
15323         TA-15324      66.7   7.9   195.0          3       180.00
17688         TA-17689      66.7   7.9    88.0          2      9999.00
17689         TA-17690      66.7   7.9    71.0          2      9999.00
18002         TA-18003      66.7   7.9    63.0          2      9999.00
18003         TA-18004      66.7   7.9    81.0          2      9999.00
18093         TA-18094      66.7   7.9   172.0          2       180.00
18094         TA-18095      66.7   7.9    96.0          2       180.00
18095         TA-18096      66.7   7.9    83.0          2       180.00
18096         TA-18097      66.7   7.9   124.0          2       180.00
18097         TA-18098      66.7   7.9   130.0          2       180.00
18098         TA-18099      66.7   7.9   124.0          2       180.00
18099         TA-18100      66.7   7.9   130.0          2       180.00
length
     tube_assembly_id  diameter  wall  length  num_bends  bend_radius
4216         TA-04217     15.88  1.65  1333.0          6        31.75
num_bends
      tube_assembly_id  diameter   wall  length  num_bends  bend_radius
20087         TA-20089      6.35  2.000    93.0         17        19.05
20105         TA-20107      6.35  2.375    83.0         17        19.05
bend_radius
      tube_assembly_id  diameter  wall  length  num_bends  bend_radius
15132         TA-15133     31.75  1.65    49.0          2       9999.0
15174         TA-15175     31.75  1.65    36.0          1       9999.0
15175         TA-15176     31.75  1.65    76.0          1       9999.0
17688         TA-17689     66.70  7.90    88.0          2       9999.0
17689         TA-17690     66.70  7.90    71.0          2       9999.0
18002         TA-18003     66.70  7.90    63.0          2       9999.0
18003         TA-18004     66.70  7.90    81.0          2       9999.0
19320         TA-19321      6.35  0.71    32.0          3       9999.0

In [46]:

#Drop the rows where bend_radius is 9999.
tube.drop(tube.index[[15132, 15174, 15175, 17688, 17689, 18002, 18003, 19320]], inplace=True)
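
The same check and drop are repeated for every file, so they could be wrapped in small helpers. The sketch below is one possible way to do it and not the code used in this notebook: `flag_outlier_columns` and `drop_sentinel_rows` are hypothetical names, the toy data is invented, and treating 9999 as a placeholder rather than a real measurement is my assumption.

```python
import pandas as pd

def flag_outlier_columns(df, factor=4):
    """Return numeric columns with at least one value above factor * column mean."""
    flagged = []
    for col in df.select_dtypes(include='number').columns:
        if (df[col] > factor * df[col].mean()).any():
            flagged.append(col)
    return flagged

def drop_sentinel_rows(df, sentinel=9999):
    """Drop rows in which any numeric column equals the sentinel value."""
    numeric = df.select_dtypes(include='number')
    mask = (numeric == sentinel).any(axis=1)
    return df.loc[~mask]

# Invented toy data shaped like the tube table.
toy = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-2', 'TA-3', 'TA-4', 'TA-5'],
    'diameter': [6.35, 6.35, 12.7, 6.35, 12.7],
    'bend_radius': [19.05, 19.05, 38.1, 9999.0, 38.1],
})
flagged = flag_outlier_columns(toy)
cleaned = drop_sentinel_rows(toy)
```

Selecting rows by value avoids hard-coding index positions, which would silently point at the wrong rows if the file were reordered.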

These files contain only descriptions, so I don't use them:

tube_end_form

type_component

type_connection

type_end_form

components

Data preparation

In [47]:

#Create several features from dates for additional information.
train['year'] = train.quote_date.dt.year
train['month'] = train.quote_date.dt.month
train['dayofyear'] = train.quote_date.dt.dayofyear
train['dayofweek'] = train.quote_date.dt.dayofweek
train['day'] = train.quote_date.dt.day

test['year'] = test.quote_date.dt.year
test['month'] = test.quote_date.dt.month
test['dayofyear'] = test.quote_date.dt.dayofyear
test['dayofweek'] = test.quote_date.dt.dayofweek
test['day'] = test.quote_date.dt.day

train = train.drop('quote_date',axis=1)
test = test.drop('quote_date',axis=1)
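
Since train and test get exactly the same date features, the step can be expressed once as a function. A minimal sketch, assuming `quote_date` is already parsed as a datetime column; `add_date_features` is a hypothetical helper name and the two dates are just examples:

```python
import pandas as pd

def add_date_features(df, date_col='quote_date'):
    """Return a copy with calendar features derived from date_col; date_col itself is dropped."""
    out = df.copy()
    dates = out[date_col].dt
    out['year'] = dates.year
    out['month'] = dates.month
    out['dayofyear'] = dates.dayofyear
    out['dayofweek'] = dates.dayofweek
    out['day'] = dates.day
    return out.drop(columns=[date_col])

# Example dates only; the real frames would be passed in the same way.
toy = pd.DataFrame({'quote_date': pd.to_datetime(['2013-07-07', '2014-01-01'])})
toy = add_date_features(toy)
```

With this helper the cell would reduce to `train = add_date_features(train)` and `test = add_date_features(test)`.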

In [48]:

#Combine all the component files into one DataFrame.
all_comp = pd.concat([comp_adaptor, comp_boss, comp_elbow, comp_float, comp_hfl, comp_nut, comp_other,
                      comp_sleeve, comp_straight, comp_tee, comp_threaded])

In [49]:

all_comp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2033 entries, 0 to 193
Data columns (total 29 columns):
bolt_pattern_long      481 non-null float64
bolt_pattern_wide      362 non-null float64
component_id           2033 non-null object
drop_length            174 non-null float64
elbow_angle            129 non-null float64
extension_length       173 non-null float64
height_over_tube       146 non-null float64
hex_nut_size           42 non-null float64
hex_size               126 non-null float64
hose_diameter          6 non-null float64
intended_nut_pitch     42 non-null float64
intended_nut_thread    42 non-null float64
length                 107 non-null float64
nominal_size_1         67 non-null object
nominal_size_2         132 non-null float64
nominal_size_3         8 non-null float64
overall_length         393 non-null float64
thickness              551 non-null float64
thread_pitch           65 non-null float64
thread_pitch_1         149 non-null float64
thread_pitch_2         83 non-null float64
thread_pitch_3         10 non-null float64
thread_pitch_4         1 non-null float64
thread_size            65 non-null object
thread_size_1          149 non-null float64
thread_size_2          83 non-null float64
thread_size_3          10 non-null float64
thread_size_4          1 non-null float64
weight                 1964 non-null float64
dtypes: float64(26), object(3)
memory usage: 476.5+ KB

In [50]:

#Some columns have very few non-null values and some mix strings with numbers, so I keep only the general parameters.
all_comp = all_comp[['component_id', 'weight', 'length', 'overall_length', 'thickness']]
all_comp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2033 entries, 0 to 193
Data columns (total 5 columns):
component_id      2033 non-null object
weight            1964 non-null float64
length            107 non-null float64
overall_length    393 non-null float64
thickness         551 non-null float64
dtypes: float64(4), object(1)
memory usage: 95.3+ KB

In [51]:

#Combine two length columns.
all_comp['overall_length'] = all_comp['overall_length'].fillna(0)
all_comp['length'] = all_comp['length'].fillna(0)
all_comp['length'] = all_comp['length'] + all_comp['overall_length']
all_comp = all_comp.drop(['overall_length'], axis=1)

all_comp['weight'] = all_comp['weight'].fillna(0)
all_comp['thickness'] = all_comp['thickness'].fillna(0)

In [52]:

#This is what the combined components table looks like.
all_comp.head()

Out[52]:

	component_id	weight	length
0	C-0005	0.206	58.4
1	C-0006	0.083	34.8
2	C-1435	0.023	20.3
3	C-1546	0.026	26.4
4	C-1583	0.256	44.5

In [53]:

#Add information about tube itself and the list of components to main files.
train = pd.merge(train, tube, on='tube_assembly_id', how='left')
train = pd.merge(train, bill_of_materials, on ='tube_assembly_id', how='left')
test = pd.merge(test, tube, on='tube_assembly_id', how='left')
test = pd.merge(test, bill_of_materials, on ='tube_assembly_id', how='left')

In [54]:

#Rename the tube length column so that it differs from the component length columns.
train.rename(columns={'length': 'length_t'}, inplace = True)
test.rename(columns={'length': 'length_t'}, inplace = True)

In [55]:

#Merge in the component parameters for each of the eight component slots, two slots per iteration.
for i in range(1, 9, 2):
    suffix1 = '_' + str(i)
    suffix2 = '_' + str(i + 1)
    component_1 = 'component_id' + suffix1
    component_2 = 'component_id' + suffix2
    
    train = pd.merge(train, all_comp, left_on = component_1, right_on = 'component_id', how='left')
    train = pd.merge(train, all_comp, left_on = component_2, right_on = 'component_id', suffixes=(suffix1, suffix2), how='left')
    
    test = pd.merge(test, all_comp, left_on = component_1, right_on = 'component_id', how='left')
    test = pd.merge(test, all_comp, left_on = component_2, right_on = 'component_id', suffixes=(suffix1, suffix2), how='left')
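
To make the per-slot merge easier to follow, here is a small sketch of the same idea on invented data. It renames the component columns with the slot number before each merge instead of relying on `suffixes`; this is an alternative formulation, not the code above, and `attach_components` is a hypothetical name.

```python
import pandas as pd

# Invented toy versions of the bill of materials and the component table.
bom = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-2'],
    'component_id_1': ['C-1', 'C-2'],
    'component_id_2': ['C-2', None],
})
comps = pd.DataFrame({
    'component_id': ['C-1', 'C-2'],
    'weight': [0.2, 0.5],
    'length': [10.0, 30.0],
})

def attach_components(df, comps, n_slots):
    """Left-join component parameters once per slot, tagging columns with the slot number."""
    out = df.copy()
    for i in range(1, n_slots + 1):
        slot = 'component_id_{}'.format(i)
        # component_id -> component_id_i, weight -> weight_i, length -> length_i
        renamed = comps.rename(columns={c: '{}_{}'.format(c, i) for c in comps.columns})
        out = out.merge(renamed, on=slot, how='left')
    return out

merged = attach_components(bom, comps, n_slots=2)
```

Empty slots simply produce NaN in the corresponding weight and length columns, which is the same pattern visible in the merged table.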

In [56]:

#Drop unnecessary columns
train.drop(['component_id_1', 'component_id_2', 'component_id_3', 'component_id_4', 'component_id_5', 'component_id_6',
            'component_id_7', 'component_id_8'], axis=1, inplace=True)
test.drop(['component_id_1', 'component_id_2', 'component_id_3', 'component_id_4', 'component_id_5', 'component_id_6',
            'component_id_7', 'component_id_8'], axis=1, inplace=True)
train.head()

Out[56]:

	tube_assembly_id	supplier	bracket_pricing	quantity	cost	year	month	dayofyear	...	thickness_5	weight_6	length_6	thickness_6	weight_7	length_7	thickness_7	weight_8	length_8	thickness_8
0	TA-00002	S-0066	Yes	1	21.905933	2013	7	188	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	TA-00002	S-0066	Yes	2	12.341214	2013	7	188	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	TA-00002	S-0066	Yes	5	6.601826	2013	7	188	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	TA-00002	S-0066	Yes	10	4.687770	2013	7	188	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	TA-00002	S-0066	Yes	25	3.541561	2013	7	188	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 49 columns

In [57]:

#Add descriptive information about specs.
train = pd.merge(train, specs, on='tube_assembly_id', how='left')
test = pd.merge(test, specs, on='tube_assembly_id', how='left')

In [58]:

#It may seem strange, but the numeric part of the tube id turned out to be quite a good feature. This looks like a data leak.
train['ta_id'] = train['tube_assembly_id'].apply(lambda x: int(x.split('-')[1]))
test['ta_id'] = test['tube_assembly_id'].apply(lambda x: int(x.split('-')[1]))
train.drop(['tube_assembly_id'], axis=1, inplace=True)
test.drop(['tube_assembly_id'], axis=1, inplace=True)
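
The id parsing can also be done with the vectorized string methods of pandas instead of `apply`. A minimal sketch on two made-up ids in the same `TA-xxxxx` format:

```python
import pandas as pd

# Made-up ids in the same format as tube_assembly_id.
toy = pd.DataFrame({'tube_assembly_id': ['TA-00002', 'TA-21197']})
# Split on '-' and keep the numeric part as an integer feature.
toy['ta_id'] = toy['tube_assembly_id'].str.split('-').str[1].astype(int)
toy = toy.drop(columns=['tube_assembly_id'])
```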

I created a lot of derived features and tried them. The following features turned out to be good.

In [59]:

[col for col in list(train.columns) if 'thickness' in col]

Out[59]:

['thickness_1',
 'thickness_2',
 'thickness_3',
 'thickness_4',
 'thickness_5',
 'thickness_6',
 'thickness_7',
 'thickness_8']

In [60]:

#Calculate various additional features on physical parameters. They turned out to be useful.
length_columns = [col for col in list(train.columns) if 'length' in col]
weight_columns = [col for col in list(train.columns) if 'weight' in col]
thickness_columns = [col for col in list(train.columns) if 'thickness' in col]
train['avg_w'] = train[weight_columns].mean(axis=1)
train['avg_l'] = train[length_columns].mean(axis=1)
train['avg_th'] = train[thickness_columns].mean(axis=1)
train['min_w'] = train[weight_columns].min(axis=1)
train['min_l'] = train[length_columns].min(axis=1)
train['min_th'] = train[thickness_columns].min(axis=1)
train['max_w'] = train[weight_columns].max(axis=1)
train['max_l'] = train[length_columns].max(axis=1)
train['max_th'] = train[thickness_columns].max(axis=1)
test['avg_w'] = test[weight_columns].mean(axis=1)
test['avg_l'] = test[length_columns].mean(axis=1)
test['avg_th'] = test[thickness_columns].mean(axis=1)
test['min_w'] = test[weight_columns].min(axis=1)
test['min_l'] = test[length_columns].min(axis=1)
test['min_th'] = test[thickness_columns].min(axis=1)
test['max_w'] = test[weight_columns].max(axis=1)
test['max_l'] = test[length_columns].max(axis=1)
test['max_th'] = test[thickness_columns].max(axis=1)
train['tot_w'] = train[weight_columns].sum(axis=1)
train['tot_l'] = train[length_columns].sum(axis=1)
test['tot_w'] = test[weight_columns].sum(axis=1)
test['tot_l'] = test[length_columns].sum(axis=1)
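
The block above repeats the same four aggregations for each group of columns and for each frame. One way to shorten it is sketched below; `add_row_stats` is a hypothetical helper and the toy values are invented. Note that matching on the substring 'length' also picks up the tube's own length_t column, exactly as in the cell above.

```python
import numpy as np
import pandas as pd

def add_row_stats(df, stem, short, with_total=True):
    """Add row-wise avg/min/max (and optionally total) over columns whose name contains stem."""
    cols = [c for c in df.columns if stem in c]
    out = df.copy()
    out['avg_' + short] = df[cols].mean(axis=1)
    out['min_' + short] = df[cols].min(axis=1)
    out['max_' + short] = df[cols].max(axis=1)
    if with_total:
        out['tot_' + short] = df[cols].sum(axis=1)
    return out

# Invented toy frame with two component slots.
toy = pd.DataFrame({
    'weight_1': [1.0, 2.0], 'weight_2': [3.0, np.nan],
    'thickness_1': [0.5, np.nan], 'thickness_2': [1.5, np.nan],
})
toy = add_row_stats(toy, 'weight', 'w')
# The notebook computes no total for thickness, hence with_total=False.
toy = add_row_stats(toy, 'thickness', 'th', with_total=False)
```

The row-wise mean, min and max skip NaN, so a tube with a single component still gets meaningful values, while a row with no values at all stays NaN until the later fillna step.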

In [61]:

#Take the log of skewed columns to smooth them and fill NA with 0.
for col in train.columns:
    if train[col].dtype != 'object':
        if skew(train[col]) > 0.75:
            train[col] = np.log1p(train[col])
            train[col] = train[col].apply(lambda x: 0 if x == -np.inf else x)

        train[col] = train[col].fillna(0)
        
for col in test.columns:
    if test[col].dtype != 'object':
        if skew(test[col]) > 0.75:
            test[col] = np.log1p(test[col])
            test[col] = test[col].apply(lambda x: 0 if x == -np.inf else x)

        test[col] = test[col].fillna(0)
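
The two loops above measure skewness separately on train and on test, so a column could in principle be log-transformed in one frame and not in the other. A possible variation, sketched below on invented data, picks the skewed columns on train only and applies the same list to test. This is not what the notebook does: `log_transform_skewed` is a hypothetical helper, and it uses `Series.skew()`, which skips NaN and uses a bias-corrected estimator, so it will not select exactly the same columns as a `skew()` function that propagates NaN.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(train, test, threshold=0.75):
    """Apply log1p to columns that are skewed in train, using the same columns for test."""
    train, test = train.copy(), test.copy()
    numeric = [c for c in train.columns if pd.api.types.is_numeric_dtype(train[c])]
    # Decide on the training data only.
    skewed = [c for c in numeric if train[c].skew() > threshold]
    for c in skewed:
        train[c] = np.log1p(train[c])
        if c in test.columns:  # the target column exists only in train
            test[c] = np.log1p(test[c])
    # Fill the remaining gaps in numeric columns with 0, as in the notebook.
    for df in (train, test):
        num = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
        df[num] = df[num].fillna(0)
    return train, test, skewed

# Invented toy data: 'a' is heavily right-skewed, 'b' is symmetric.
toy_train = pd.DataFrame({'a': [1.0, 1.0, 1.0, 1.0, 100.0],
                          'b': [1.0, 2.0, 3.0, 4.0, 5.0],
                          'name': ['x', 'y', 'x', 'y', 'x']})
toy_test = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                         'b': [5.0, 5.0, 6.0],
                         'name': ['x', 'x', 'y']})
train_t, test_t, skewed_cols = log_transform_skewed(toy_train, toy_test)
```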

In [62]:

#Fill NA in the categorical (object) columns with a blank string.
for col in train.columns:
    if train[col].dtype == 'object':
        train[col].replace(np.nan,' ', regex=True, inplace= True)
for col in test.columns:
    if test[col].dtype == 'object':
        test[col].replace(np.nan,' ', regex=True, inplace= True)

In [63]:

X_train = train.drop('cost',axis=1)
Y_train = train['cost']
X_test  = test.drop('id', axis=1)

In [64]:

#Check that the columns are the same
(X_test.columns == X_train.columns).all()

Out[64]:

True

In [65]:

#Convert to arrays for easier transformation
X_train = np.array(X_train)
X_test = np.array(X_test)

In [66]:

#Label encode the categorical variables
for i in range(X_train.shape[1]):
    if i in [0, 3, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56]:
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_train[:,i]) + list(X_test[:,i]))
        X_train[:,i] = lbl.transform(X_train[:,i])
        X_test[:,i] = lbl.transform(X_test[:,i])
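
The column positions above are hard-coded, which breaks as soon as a feature is added or removed. A sketch of an alternative that finds the categorical columns by dtype while the data is still a DataFrame; `label_encode_objects` is a hypothetical helper and the supplier values are invented:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode_objects(train, test):
    """Label-encode every non-numeric column, fitting on train and test values together."""
    train, test = train.copy(), test.copy()
    for col in train.columns:
        if not pd.api.types.is_numeric_dtype(train[col]):
            enc = LabelEncoder()
            # Fit on both frames so that labels seen only in test are known too.
            enc.fit(train[col].tolist() + test[col].tolist())
            train[col] = enc.transform(train[col].tolist())
            test[col] = enc.transform(test[col].tolist())
    return train, test

# Invented toy data.
toy_train = pd.DataFrame({'supplier': ['S-2', 'S-1', 'S-2'], 'quantity': [1, 2, 5]})
toy_test = pd.DataFrame({'supplier': ['S-3', 'S-1'], 'quantity': [10, 25]})
enc_train, enc_test = label_encode_objects(toy_train, toy_test)
```

LabelEncoder assigns integers in sorted order of the labels, so the codes carry no meaning beyond identity; tree-based models such as XGBoost usually cope with that.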

In [67]:

#XGBoost needs float input.
X_train = X_train.astype(float)
X_test = X_test.astype(float)

Model

In [68]:

params = {}
params['objective'] = 'reg:linear'
params['eta'] = 0.1
params['min_child_weight'] = 5
params['subsample'] = 1.0
params['scale_pos_weight'] = 1.0
params['silent'] = 1
params['max_depth'] = 7

param = list(params.items())

In [69]:

xgtrain = xgb.DMatrix(X_train, label=Y_train)
xgtest = xgb.DMatrix(X_test)

In [70]:

num_rounds = 1200
model = xgb.train(param, xgtrain, num_rounds)
preds = np.expm1(model.predict(xgtest))
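
The predictions are converted back with `np.expm1`, which suggests that cost went through `log1p` in the skewness step; in that case an ordinary squared-error objective on the transformed target corresponds to optimizing RMSLE on the original scale. For checking a model locally, the RMSLE formula given in the introduction can be sketched as follows (the numbers are invented and `rmsle` is my own helper, not part of XGBoost):

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error, following the formula in the introduction."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Invented example: the same absolute error of 50 in both directions.
under = rmsle([50.0], [100.0])   # under-prediction
over = rmsle([150.0], [100.0])   # over-prediction
perfect = rmsle([21.9, 12.3], [21.9, 12.3])
```

Here `under` comes out larger than `over`, which illustrates that the metric penalizes an under-predicted estimate more than an over-predicted one.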

In [71]:

preds_df = pd.DataFrame({'id': test['id'], 'cost': preds})
preds_df.to_csv('Caterpillar.csv', index=False)
#My score is 0.229153; the top places are at ~0.19.

This competition has already ended, but people can still submit their solutions and see their scores. The top places have a score of ~0.19.

My model got a score of 0.229153.