Mobile Price Classification

Author: Andrey Trefilov


Part 1. Feature and data explanation

Bob has started his own mobile company. He wants to give a tough fight to big companies like Apple, Samsung, etc.

He does not know how to estimate the price of the mobiles his company creates. In this competitive mobile phone market you cannot simply assume things, so to solve this problem he collects sales data on mobile phones from various companies.

Bob wants to find some relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price.

In this project we have to predict the price range, which indicates how high the price is.

Download the dataset from the Kaggle page.
The dataset contains a train sample (with the target variable) and a test sample (without the target variable).
For the train sample we will solve a multiclass classification problem with 4 classes, and for the test sample we will solve a clustering problem.
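The two tasks call for different metrics: a classification score such as accuracy for the 4-class problem, and a label-agnostic score such as the adjusted Rand index for clustering (ARI is already imported below). A toy sketch with made-up labels, for illustration only:

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Toy labels for illustration only (not the real dataset)
y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_pred = np.array([0, 1, 2, 3, 0, 1, 2, 2])  # one misclassified object

print(accuracy_score(y_true, y_pred))  # 0.875

# ARI compares partitions while ignoring the actual label names,
# which is exactly what we need for clustering:
clusters = np.array([1, 0, 3, 2, 1, 0, 3, 2])  # same grouping, relabelled
print(adjusted_rand_score(y_true, clusters))   # 1.0
```

Accuracy punishes the wrong label directly, while ARI gives a perfect score to any relabelling of the correct partition.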

The dataset has the following features (copied from Kaggle):

Every object is a unique mobile phone.

  • battery_power - Total energy a battery can store in one time measured in mAh (quantitative);
  • blue - Has bluetooth or not (binary);
  • clock_speed - speed at which microprocessor executes instructions (quantitative);
  • dual_sim - Has dual sim support or not (binary);
  • fc - Front Camera mega pixels (categorical);
  • four_g - Has 4G or not (binary);
  • int_memory - Internal Memory in Gigabytes (quantitative);
  • m_dep - Mobile Depth in cm (categorical);
  • mobile_wt - Weight of mobile phone (quantitative);
  • n_cores - Number of cores of processor (categorical);
  • pc - Primary Camera mega pixels (categorical);
  • px_height - Pixel Resolution Height (quantitative);
  • px_width - Pixel Resolution Width (quantitative);
  • ram - Random Access Memory in Megabytes (quantitative);
  • sc_h - Screen Height of mobile in cm (categorical);
  • sc_w - Screen Width of mobile in cm (categorical);
  • talk_time - longest time that a single battery charge will last when you are on a call (quantitative);
  • three_g - Has 3G or not (binary);
  • touch_screen - Has touch screen or not (binary);
  • wifi - Has wifi or not (binary);

  • price_range - the target variable, with values 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost). Present only in the train sample.

Part 2. Primary data analysis

Importing libraries:

In [241]:
import numpy as np
import pandas as pd
import seaborn as sns

from pylab import rcParams
rcParams['figure.figsize'] = 10, 8
#%config InlineBackend.figure_format = 'svg'
import warnings
warnings.simplefilter('ignore')
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, StratifiedKFold, validation_curve
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score,\
                            f1_score, make_scorer, classification_report, confusion_matrix
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 21)
from sklearn import metrics
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation, SpectralClustering
from tqdm import tqdm_notebook
from sklearn.metrics.cluster import adjusted_rand_score
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

Let's look at the data:

In [2]:
data_train = pd.read_csv('../data/mobile/train.csv')
data_test = pd.read_csv('../data/mobile/test.csv')
data_test.drop(columns='id', inplace=True)
In [4]:
data_train.head()
Out[4]:
battery_power blue clock_speed dual_sim fc four_g int_memory m_dep mobile_wt n_cores pc px_height px_width ram sc_h sc_w talk_time three_g touch_screen wifi price_range
0 842 0 2.2 0 1 0 7 0.6 188 2 2 20 756 2549 9 7 19 0 0 1 1
1 1021 1 0.5 1 0 1 53 0.7 136 3 6 905 1988 2631 17 3 7 1 1 0 2
2 563 1 0.5 1 2 1 41 0.9 145 5 6 1263 1716 2603 11 2 9 1 1 0 2
3 615 1 2.5 0 0 0 10 0.8 131 6 9 1216 1786 2769 16 8 11 1 0 0 2
4 1821 1 1.2 0 13 1 44 0.6 141 2 14 1208 1212 1411 8 2 15 1 1 0 1
In [5]:
data_test.head()
Out[5]:
battery_power blue clock_speed dual_sim fc four_g int_memory m_dep mobile_wt n_cores pc px_height px_width ram sc_h sc_w talk_time three_g touch_screen wifi
0 1043 1 1.8 1 14 0 5 0.1 193 3 16 226 1412 3476 12 7 2 0 1 0
1 841 1 0.5 1 4 1 61 0.8 191 5 12 746 857 3895 6 0 7 1 0 0
2 1807 1 2.8 0 1 0 27 0.9 186 3 4 1270 1366 2396 17 10 10 0 1 1
3 1546 0 0.5 1 18 1 25 0.5 96 8 20 295 1752 3893 10 0 7 1 1 0
4 1434 0 1.4 0 11 1 49 0.5 108 6 18 749 810 1773 15 8 7 1 0 1

Our samples contain quantitative, categorical and binary features.


Neither sample has missing values:

In [23]:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
battery_power    2000 non-null int64
blue             2000 non-null int64
clock_speed      2000 non-null float64
dual_sim         2000 non-null int64
fc               2000 non-null int64
four_g           2000 non-null int64
int_memory       2000 non-null int64
m_dep            2000 non-null float64
mobile_wt        2000 non-null int64
n_cores          2000 non-null int64
pc               2000 non-null int64
px_height        2000 non-null int64
px_width         2000 non-null int64
ram              2000 non-null int64
sc_h             2000 non-null int64
sc_w             2000 non-null int64
talk_time        2000 non-null int64
three_g          2000 non-null int64
touch_screen     2000 non-null int64
wifi             2000 non-null int64
price_range      2000 non-null int64
dtypes: float64(2), int64(19)
memory usage: 328.2 KB
In [25]:
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 20 columns):
battery_power    1000 non-null int64
blue             1000 non-null int64
clock_speed      1000 non-null float64
dual_sim         1000 non-null int64
fc               1000 non-null int64
four_g           1000 non-null int64
int_memory       1000 non-null int64
m_dep            1000 non-null float64
mobile_wt        1000 non-null int64
n_cores          1000 non-null int64
pc               1000 non-null int64
px_height        1000 non-null int64
px_width         1000 non-null int64
ram              1000 non-null int64
sc_h             1000 non-null int64
sc_w             1000 non-null int64
talk_time        1000 non-null int64
three_g          1000 non-null int64
touch_screen     1000 non-null int64
wifi             1000 non-null int64
dtypes: float64(2), int64(18)
memory usage: 156.3 KB
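`info()` already shows 2000/1000 non-null entries everywhere; the same check can be written as a one-liner. A minimal sketch on a small synthetic frame (the real call would be `data_train.isnull().sum().sum()`):

```python
import pandas as pd

# Synthetic stand-in for data_train: two complete columns, no gaps
df = pd.DataFrame({'battery_power': [842, 1021, 563],
                   'ram': [2549, 2631, 2603]})

# Total number of missing cells across the whole frame
n_missing = df.isnull().sum().sum()
print(n_missing)  # 0
```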

Look at the distribution of target feature:

In [39]:
data_train.groupby('price_range')[['price_range']].count().rename(columns={'price_range': 'count'}).T
Out[39]:
price_range 0 1 2 3
count 500 500 500 500

Ok, it is a toy dataset :) We see that the target variable is uniformly distributed.

Part 3. Primary visual data analysis

Let's draw plot of correlation matrix (before this, drop a boolean variables):

In [6]:
corr_matrix = data_train.drop(['blue', 'dual_sim', 'four_g', 'three_g', 'touch_screen', 'wifi'], axis=1).corr()
fig, ax = plt.subplots(figsize=(16,12))
sns.heatmap(corr_matrix,annot=True,fmt='.1f',linewidths=0.5);

Ok, we see that there is a correlation between the target variable and four features: battery_power, px_height, px_width and ram.

And some variables are correlated with each other: pc and fc (camera modules), sc_w and sc_h (screen width and height), px_width and px_height (pixel resolution width and height).
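The same ranking can be pulled out numerically by sorting the target's column of the correlation matrix; with the real frame this would be `corr_matrix['price_range']`. A sketch on synthetic data where only `ram` (plus noise) drives the target:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
ram = rng.uniform(256, 4000, n)
noise = rng.normal(0, 300, n)

df = pd.DataFrame({
    'ram': ram,
    'clock_speed': rng.uniform(0.5, 3.0, n),              # unrelated to price
    'price_range': pd.cut(ram + noise, 4, labels=False),  # driven by ram
})

# Sort features by absolute correlation with the target
corr_with_target = df.corr()['price_range'].drop('price_range')
print(corr_with_target.abs().sort_values(ascending=False))
```

On the real data the same sort surfaces battery_power, px_height, px_width and ram at the top.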

Draw plot of distribution of target variable:

In [42]:
data_train['price_range'].value_counts().plot(kind='bar',figsize=(14,6))
plt.title('Distribution of target variable');

Ok, we again see that the target variable is uniformly distributed.

Look at the distribution of quantitative features:

In [8]:
features = list(data_train.drop(['price_range', 'blue', 'dual_sim',\
                                     'four_g', 'fc', 'm_dep', 'n_cores',\
                                     'pc', 'sc_h', 'sc_w', 'three_g', 'wifi', 'touch_screen'], axis=1).columns)
data_train[features].hist(figsize=(20,12));

Let's look at the interaction of different features among themselves with sns.pairplot:

In [9]:
sns.pairplot(data_train[features + ['price_range']], hue='price_range');
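A natural next visual step (the imports above already include PCA and TSNE) is to project the scaled quantitative features to 2-D and colour the points by class. A sketch of the pipeline on synthetic data, assuming the real call would take `data_train[features]`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in: 200 objects, 8 quantitative features on very
# different scales (as in our data: mAh, GHz, grams, megabytes, ...)
X = rng.normal(size=(200, 8)) * [1000, 1, 50, 100, 500, 500, 2000, 10]

# Scale first so no single feature dominates the components,
# then project to two principal components
pca_2d = Pipeline([('scale', StandardScaler()),
                   ('pca', PCA(n_components=2))])
X_2d = pca_2d.fit_transform(X)
print(X_2d.shape)  # (200, 2)
```

With the real frame, `plt.scatter(X_2d[:, 0], X_2d[:, 1], c=data_train['price_range'])` would give the coloured projection.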