Requirements: pandas, numpy, scikit-learn, matplotlib
Download day 15 of the Criteo Click Logs dataset in a terminal: wget http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_15.gz (then decompress it, e.g. with gunzip day_15.gz)
# optional: install the following libraries if they are not already available in the cluster
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install scikit-learn
file = '/data/day_15'  # after downloading the dataset, decompress it first; day_15 is plain text
# readline() reads only the first line
with open(file) as f:
    print(f.readline())
0 2 9 1 0 0 3 1 0 1036 4db5cd76 310b1fd7 bfbe69f6 bc892e1f 1315f676 6fcd6dcb e7222fbe b2a2bd17 25dd8f9a 2d40282b 4f91b406 a81c2672 a77a4a56 be4ee537 57469cbd 4cdc3efa 1f7fc70b b8170bba 9512c20b 31a9f3b3 228aee9b b74c6548 59f9dd38 165fbf32 0b3c06d0 2ccea557
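For reference, each row follows the Criteo Terabyte layout: one click label, 13 integer features, and 26 hashed categorical features, tab-separated, with empty strings for missing values. A minimal parsing sketch (the helper name and sample row below are illustrative, not part of the notebook):

```python
# Split one Criteo line into its three field groups:
# fields[0] = click label, fields[1:14] = 13 integer features,
# fields[14:40] = 26 hex-hashed categorical features.
def parse_criteo_line(line: str):
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0]) if fields[0] else None
    ints = [int(v) if v else None for v in fields[1:14]]  # blanks mean missing
    cats = [v if v else None for v in fields[14:40]]
    return label, ints, cats

# Synthetic 40-field row in the same layout, just for demonstration.
sample = "\t".join(["1"] + ["7"] * 13 + ["4db5cd76"] * 26)
label, ints, cats = parse_criteo_line(sample)
```

This is the same layout that `pd.read_csv(..., delimiter='\t')` relies on below.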
%%time
import pandas as pd
import numpy as np
header = ['col'+str(i) for i in range(1, 41)]  # 40 columns; per Criteo, the first column is the click label
first_row_taken = 50_000_000  # pass this as nrows in pd.read_csv() if your compute resources are limited
# total number of rows in day_15 is 20B
# smaller samples (e.g. 20M or 30M rows) also work
"""
Read data & display the following metrics:
1. Total number of rows per day
2. df loading time in the cluster
3. Train a random forest model
"""
df = pd.read_csv(file, nrows=first_row_taken, delimiter='\t', names=header)
# keep the label plus the 13 numerical columns
df_sliced = df.iloc[:, 0:14]
# separate the label from the features
Y = df_sliced.pop('col1')  # first column is binary (click or not)
# change df_sliced data types & fillna
df_sliced = df_sliced.astype(np.float32).fillna(0)
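Loading 50M rows in one call can exhaust memory on a small node. One alternative is `chunksize` in `pd.read_csv()`, which yields bounded-size DataFrames. A self-contained sketch using an in-memory TSV as a stand-in for `/data/day_15`:

```python
import io
import numpy as np
import pandas as pd

header = ['col' + str(i) for i in range(1, 41)]

# Illustrative stand-in for the real file: three tab-separated Criteo-style rows.
tsv = "\n".join("\t".join(["0"] + ["1"] * 13 + ["aa"] * 26) for _ in range(3))

chunks = []
# chunksize makes read_csv yield DataFrames of at most that many rows,
# so peak memory is bounded by the chunk size, not the whole file.
for chunk in pd.read_csv(io.StringIO(tsv), delimiter='\t', names=header, chunksize=2):
    part = chunk.iloc[:, 0:14].astype(np.float32).fillna(0)  # keep label + 13 ints
    chunks.append(part)

df_small = pd.concat(chunks, ignore_index=True)
```

On the real file, replace `io.StringIO(tsv)` with the path and raise `chunksize` to a few million rows.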
from sklearn.ensemble import RandomForestClassifier
# Random Forest parameters
# n_streams = 8  # n_streams and n_bins apply to cuML's RandomForestClassifier; scikit-learn ignores them
max_depth = 10
n_bins = 16
n_trees = 10
rf_model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_trees)
rf_model.fit(df_sliced, Y)
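The notebook evaluates on a separate test file below; during development, a quick sanity check is a holdout split on the rows already in memory via `train_test_split` (from the `sklearn.model_selection` requirement above). A sketch on synthetic data, since the Criteo file isn't assumed present here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13)).astype(np.float32)  # stand-in for the 13 integer features
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # stand-in click label

# Hold out 20% so accuracy is measured on rows the model never saw during fit.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(max_depth=10, n_estimators=10, random_state=0)
model.fit(X_tr, y_tr)
holdout_acc = model.score(X_te, y_te)
```

`stratify=y` keeps the click rate of the split consistent with the full sample, which matters for imbalanced CTR data.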
# testing data: the last 1M rows of day_15
test_file = '/data/day_15_test'
with open(test_file) as g:
    print(g.readline())
# dataFrame processing for test data
test_df = pd.read_csv(test_file, delimiter='\t', names=header)
test_df_sliced = test_df.iloc[:, 0:14]
test_Y = test_df_sliced.pop('col1')
test_df_sliced = test_df_sliced.astype(np.float32).fillna(0)
# prediction & calculating error
pred_df = rf_model.predict(test_df_sliced)
from sklearn import metrics
# Model Accuracy
print("Accuracy:", metrics.accuracy_score(test_Y, pred_df))
0 1 3 0 7 5 0 0 3575 6 0 4 11976 1 f438eac0 e7c8f4b4 5b913d0f f2463ffb 729e35ab 6fcd6dcb 27f43f86 312aa74b 25dd8f9a 96bd225a 3861b8d7 f1b49bb9 a77a4a56 672e9cf8 96fd88a3 ae30c32c 1f7fc70b b6bc86c5 108a0699 5865ea16 d55ec182 f11ef8d0 483383ee d7b3dff0 321935cd 2ba8d787
Accuracy: 0.96592
CPU times: user 43min 50s, sys: 26.8 s, total: 44min 17s
Wall time: 47min 21s
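One caveat on the 0.966 accuracy: clicks are rare in CTR data, so a model that always predicts "no click" can already score close to the base rate. AUC and log loss on predicted probabilities are more informative. A sketch on synthetic imbalanced labels (the ~3% positive rate is illustrative, not measured from day_15):

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(1)
y_true = (rng.random(10000) < 0.03).astype(int)    # ~3% positive rate, CTR-like
p_const = np.full_like(y_true, 0.03, dtype=float)  # trivial model: constant base-rate score

# Thresholding the constant score predicts "no click" for everyone,
# which already yields ~97% accuracy on these labels...
acc = metrics.accuracy_score(y_true, (p_const >= 0.5).astype(int))
# ...while AUC exposes that constant scores have no ranking power (AUC = 0.5).
auc = metrics.roc_auc_score(y_true, p_const)
```

In the notebook, `rf_model.predict_proba(test_df_sliced)[:, 1]` would supply the probabilities for `metrics.roc_auc_score` and `metrics.log_loss`.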
Data Source: https://labs.criteo.com/2013/12/download-terabyte-click-logs/
Project Inspiration: https://towardsdatascience.com/mobile-ads-click-through-rate-ctr-prediction-44fdac40c6ff
Mapping objects to sets of integers with a hash function, before using them in XGBoost: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087
Regularization, Variance, and Overfitting Concepts: https://www.youtube.com/watch?v=Q81RR3yKn30
XGBoost Playlist by StatQuest: https://www.youtube.com/watch?v=OtD8wVaFm6E&list=PLblh5JKOoLULU0irPgs1SnKO6wqVjKUsQ
Visualizing XGBClassifier with eval_metric Error & LogLoss: https://setscholars.net/wp-content/uploads/2019/02/visualise-XgBoost-model-with-learning-curves-in-Python.html