PROJECT N. 3 - Exploring the data details of used cars sold in eBay

1. Introduction ¶

1.1 Dataset information:¶

This dataset contains information about used cars. This dataset came from a section of German eBay (eBay Kleinanzeigen). The original is here. We will work with a modified version with 50 000 data points and 20 columns, scrawled in 2016 by one Kaggle user.

1.2. Dataset dictionary:¶

dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.

1.3. Project objectives¶

The aim of this project is to clean the data and analyze the included used car listings.

2. Dataset cleaning - format data, identify outliers, identify wrong values¶

Let us start to import the library that we probably will need:

In [1]:

import pandas as pd
import numpy as np

In [2]:

autos=pd.read_csv("autos.csv")

In [3]:

autos

Out[3]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___T�V_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [4]:

autos.head()

Out[4]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [5]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

2.1 - First notes:¶

Checking the above information, we can take the following notes:

we have 20 columns and 50 000 rows.
columns labels:
- we can check that there are a mix of upper and lower cases (camelcase). To make it easier work with dataset, we should pass all labels to lower cases and replace the space between words by underscore (snakecase). (please check following point 2.2)
columns data type vs information:
- checking datatype, we only have 5 columns which type is int64, the remaining 16 columns has data type object.
- however we can check that columns like price and odometer even contains numeric information are object because contains "$" and "km". We can transformed them in numeric type, removing this special characters (please check following point 2.3)
- there are also 3 columns with date information, dateCrawled, dateCreated and lastSee, which type is object. Probably it is better format those information as date time. (please check following point 2.4)
columns with null values:
- as we can see, we have some columns which contains nulls values vehicleType, gearbox, model, fuelType, notRepairedDamage. We will need to inspect and decide what we can do, if it can be revised, or if we will need to delete those rows.
text information is in German. We may need to translate it to English (using.map, to understand easily the meaning).
column name contains a lot information, which we can split and insert in different new columns. For example before first underscore we have the brand name - as we have already one column with this information, we don't need to duplicate this information. Between next two underscore we have the car model, however we also have one column with car model. Anyway, if we identify any interesting info, we can isolate it in one new column, and do some specific analyse

2.2 - Format columns labels to snakecase:¶

In [6]:

#list of columns names:
autos.columns

Out[6]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [7]:

#reanme following columns:

autos.rename({
    "dateCrawled":"date_crawled",
    "offerType":"offer_type",
    "vehicleType":"vehicle_type",
    "yearOfRegistration":"registration_year",
    "monthOfRegistration":"registration_month",
    "fuelType":"fuel_type",
    "notRepairedDamage":"unrepaired_damage",
    "dateCreated":"ad_created", 
    "nrOfPictures":"pic_num",
    "postalCode":"postal_code",
    "lastSeen":"last_seen"},
    axis=1, inplace=True)

autos.columns

Out[7]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'pic_num', 'postal_code',
       'last_seen'],
      dtype='object')

In [8]:

autos.head()

Out[8]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	powerPS	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

2.3 - Convert text column to numeric type¶

2.3.1- Column `price`¶

As we detected before, we can conver price and odemeter to numeric type, just removing "$" and "km", respectively.

In [9]:

autos["price"]=autos["price"].str.replace("$", "")
autos["price"]=autos["price"].str.replace(",", "").astype(float)

<ipython-input-9-48ac3c5553ea>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will*not* be treated as literal strings when regex=True.
  autos["price"]=autos["price"].str.replace("$", "")

In [10]:

#print(autos["price"])
print(autos["price"].dtype)

float64

In [11]:

#autos["price"]="{:,}".format(autos["price"])
#print(autos["price"])

In [12]:

autos.rename({"price":"price_dollar"}, axis=1, inplace=True)
print(autos["price_dollar"])

0         5000.0
1         8500.0
2         8990.0
3         4350.0
4         1350.0
          ...   
49995    24900.0
49996     1980.0
49997    13200.0
49998    22900.0
49999     1250.0
Name: price_dollar, Length: 50000, dtype: float64

In [13]:

print(autos['price_dollar'].describe())

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollar, dtype: float64

When we tried to inspect the price column with series.descibe() we found the result came in scientific notation. To solve this question we will use format the float numbers with comma on thousand position and with 3 decimal digits:

In [14]:

pd.options.display.float_format = '{:,.3f}'.format

In [15]:

#rechecking price column:
print(autos['price_dollar'].describe())

count       50,000.000
mean         9,840.044
std        481,104.381
min              0.000
25%          1,100.000
50%          2,950.000
75%          7,200.000
max     99,999,999.000
Name: price_dollar, dtype: float64

2.3.2- Column `odometer`¶

In [16]:

autos["odometer"]=autos["odometer"].str.replace("km", "")
autos["odometer"]=autos["odometer"].str.replace(",", "").astype(float)
#print(autos["odometer"])
print(autos["odometer"].dtype)

float64

In [17]:

autos.rename({"odometer":"odometer_km"}, axis=1, inplace=True)
print(autos["odometer_km"])

0       150,000.000
1       150,000.000
2        70,000.000
3        70,000.000
4       150,000.000
            ...    
49995   100,000.000
49996   150,000.000
49997     5,000.000
49998    40,000.000
49999   150,000.000
Name: odometer_km, Length: 50000, dtype: float64

2.4 - Converts date columns to datetime type¶

In [18]:

import datetime

In [19]:

autos[['date_crawled','ad_created','last_seen','registration_year','registration_month']][0:5]

Out[19]:

	date_crawled	ad_created	last_seen	registration_year	registration_month
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54	2004	3
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08	1997	6
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37	2009	7
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28	2007	6
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50	2003	7

In [20]:

#(autos['date_crawled'].str[:10]).value_counts(dropna=False).sort_index()
autos['date_crawled']=pd.to_datetime(autos['date_crawled'], format='%Y-%m-%d %H:%M:%S')

In [21]:

#(autos['ad_created'].str[:10]).value_counts(dropna=False).sort_index()
autos['ad_created']=pd.to_datetime(autos['ad_created'], format='%Y-%m-%d %H:%M:%S')

In [22]:

#(autos['last_seen'].str[:10]).value_counts(dropna=False).sort_index()
autos['last_seen']=pd.to_datetime(autos['last_seen'], format='%Y-%m-%d %H:%M:%S')

In [23]:

autos['registration_year']=pd.to_datetime(autos['registration_year'], format='%Y',errors='ignore')
#we cannot converter to datetime format while we have wrong information on this column

In [24]:

#as we have some rows without month information, we need to replace the zero by nan:
#we can do it with the following:
#autos.loc[autos['registration_month']==0, "registration_month"]=np.nan
#or using error"coerce"

autos['registration_month']=pd.to_datetime(autos['registration_month'], format='%m',errors='coerce').dt.month

#we will not leave this column in datetime format because we want only see the month information, so it will be int84

In [25]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   date_crawled        50000 non-null  datetime64[ns]
 1   name                50000 non-null  object        
 2   seller              50000 non-null  object        
 3   offer_type          50000 non-null  object        
 4   price_dollar        50000 non-null  float64       
 5   abtest              50000 non-null  object        
 6   vehicle_type        44905 non-null  object        
 7   registration_year   50000 non-null  int64         
 8   gearbox             47320 non-null  object        
 9   powerPS             50000 non-null  int64         
 10  model               47242 non-null  object        
 11  odometer_km         50000 non-null  float64       
 12  registration_month  44925 non-null  float64       
 13  fuel_type           45518 non-null  object        
 14  brand               50000 non-null  object        
 15  unrepaired_damage   40171 non-null  object        
 16  ad_created          50000 non-null  datetime64[ns]
 17  pic_num             50000 non-null  int64         
 18  postal_code         50000 non-null  int64         
 19  last_seen           50000 non-null  datetime64[ns]
dtypes: datetime64[ns](3), float64(3), int64(4), object(10)
memory usage: 7.6+ MB

Checking statistics of datetime type:

In [26]:

autos.describe(include="datetime64", datetime_is_numeric=True)

Out[26]:

	date_crawled	ad_created	last_seen
count	50000	50000	50000
mean	2016-03-21 13:43:25.697280	2016-03-20 19:41:34.656000	2016-03-30 04:04:36.309079808
min	2016-03-05 14:06:30	2015-06-11 00:00:00	2016-03-05 14:45:46
25%	2016-03-13 14:51:25.750000128	2016-03-13 00:00:00	2016-03-23 11:11:33.500000
50%	2016-03-21 17:53:56	2016-03-21 00:00:00	2016-04-04 01:17:24
75%	2016-03-29 14:36:47.750000128	2016-03-29 00:00:00	2016-04-06 10:45:28.249999872
max	2016-04-07 14:36:56	2016-04-07 00:00:00	2016-04-07 14:58:50

Checking above information, we don't see anything strange. So, we will not remove anything fromt these columns

2.5 - Checking statistics of numeric columns (exept datetime type)¶

In [27]:

autos.describe()

Out[27]:

	price_dollar	registration_year	powerPS	odometer_km	registration_month	pic_num	postal_code
count	50,000.000	50,000.000	50,000.000	50,000.000	44,925.000	50,000.000	50,000.000
mean	9,840.044	2,005.073	116.356	125,732.700	6.370	0.000	50,813.627
std	481,104.381	105.713	209.217	40,042.212	3.349	0.000	25,779.748
min	0.000	1,000.000	0.000	5,000.000	1.000	0.000	1,067.000
25%	1,100.000	1,999.000	70.000	125,000.000	3.000	0.000	30,451.000
50%	2,950.000	2,003.000	105.000	150,000.000	6.000	0.000	49,577.000
75%	7,200.000	2,008.000	150.000	150,000.000	9.000	0.000	71,540.000
max	99,999,999.000	9,999.000	17,700.000	150,000.000	12.000	0.000	99,998.000

Using the function dataframe.describe we can see some relevant information, which we will need to solve.

Checking the results of price_dollar, we can see that the minimum price is 0 (zero), which should be a mistake.
Checking the results of registration_year, we can see that the minimum year is 1000 and maximum ist 9999, which are obviously incorrect.
Checking the results of registration_month, we can see that the minimum month is 0 (zero), which is not correct.
Checking the results of postal_code information, we can see that the minimum postal code has only 4 digits, and other statistics has 5 digits, which could mean some mistake.

2.5.1 - Checking price column details¶

In [28]:

autos["price_dollar"].value_counts().sort_index()

Out[28]:

0.000             1421
1.000              156
2.000                3
3.000                1
5.000                2
                  ... 
10,000,000.000       1
11,111,111.000       2
12,345,678.000       3
27,322,222.000       1
99,999,999.000       1
Name: price_dollar, Length: 2357, dtype: int64

In [29]:

autos["price_dollar"].value_counts().sort_index().head(25)

Out[29]:

0.000     1421
1.000      156
2.000        3
3.000        1
5.000        2
8.000        1
9.000        1
10.000       7
11.000       2
12.000       3
13.000       2
14.000       1
15.000       2
17.000       3
18.000       1
20.000       4
25.000       5
29.000       1
30.000       7
35.000       1
40.000       6
45.000       4
47.000       1
49.000       4
50.000      49
Name: price_dollar, dtype: int64

In [30]:

autos["price_dollar"].value_counts().sort_index(ascending=False).head(12)

Out[30]:

99,999,999.000    1
27,322,222.000    1
12,345,678.000    3
11,111,111.000    2
10,000,000.000    1
3,890,000.000     1
1,300,000.000     1
1,234,566.000     1
999,999.000       2
999,990.000       1
350,000.000       1
345,000.000       1
Name: price_dollar, dtype: int64

In [31]:

autos.loc[autos["price_dollar"]<100,("price_dollar","registration_year")]

Out[31]:

	price_dollar	registration_year
25	90.000	1996
27	0.000	2005
30	80.000	2002
55	1.000	2017
64	40.000	2011
...	...	...
49884	0.000	1999
49943	0.000	2016
49960	0.000	1999
49974	0.000	1983
49984	0.000	2000

1762 rows × 2 columns

--> We have 1421 rows without price information, some rows with very low price like 1usd, 2usd, 5usd. otherwise we have columns with very high price like 99,999,999usd which should not make sense. We must remove at least the values lower than 50 and higher or equal than 999990

We tried to understand if prices equals to zero have some correlation with registration_year (if cars are too old, but we didn't find any correlation.

To take a decision which values we will remove we will try to check price distribution for intervals.

We we will create following price ranges:

prices equals to zero (because we already identify a big number of results here)
until 50 dollars
between 100 to 1000 dollars
between 1000 to 10000
between 10000 to 350000 dollars
more than 350000 dollars

Let's see how this categories behave.

In [32]:

# Creating the empty dictionary
price_dist = {'p_0':0,'p_until_50':0,'p_100_1000':0, 'p_1000_10000':0, 'p10000_350000':0,
              'p_more_350000':0}

# Populating it
for row in autos['price_dollar']:
    if row==0:
        price_dist['p_0'] += 1
    elif row <= 50:
        price_dist['p_until_50'] += 1
    elif 100 < row <= 1000:
        price_dist['p_100_1000'] += 1
    elif 1000 < row <= 10000:
        price_dist['p_1000_10000'] += 1
    elif 10000 < row <= 350000:
        price_dist['p10000_350000'] += 1
    else:
        price_dist['p_more_350000'] += 1
print(price_dist)

{'p_0': 1421, 'p_until_50': 267, 'p_100_1000': 10103, 'p_1000_10000': 29928, 'p10000_350000': 8059, 'p_more_350000': 222}

--> Based on this results we will remove pricpointes lower than 50 and higher than 350k dollars. (check point 4)

In [33]:

print("percentage of price lower than 50 is :{:.2f}%".format(((price_dist["p_0"]+price_dist["p_until_50"])/50000)*100))
print("percentage of price higher than 350000 is :{:.2f}%".format(((price_dist["p_more_350000"])/50000)*100))

percentage of price lower than 50 is :3.38%
percentage of price higher than 350000 is :0.44%

2.5.2 - Checking`registration_year` column details¶

In [34]:

autos["registration_year"].value_counts(dropna=False).sort_index().head(10)

Out[34]:

1000    1
1001    1
1111    1
1500    1
1800    2
1910    9
1927    1
1929    1
1931    1
1934    2
Name: registration_year, dtype: int64

-->We must remove year lower than 1910

In [35]:

autos["registration_year"].value_counts(dropna=False).sort_index(ascending=False).head(20)

Out[35]:

9999       4
9996       1
9000       2
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       3
2018     492
2017    1453
2016    1316
2015     399
2014     666
2013     806
2012    1323
2011    1634
Name: registration_year, dtype: int64

-->We must remove year higher than 2016 (date of this dataset)

2.5.3 - Checking`registration_month` column details¶

In [36]:

#autos["registration_month"].dt.month.value_counts()
autos["registration_month"].value_counts(dropna=False)

Out[36]:

NaN       5075
3.000     5071
6.000     4368
5.000     4107
4.000     4102
7.000     3949
10.000    3651
12.000    3447
9.000     3389
11.000    3360
1.000     3282
8.000     3191
2.000     3008
Name: registration_month, dtype: int64

--> We have 5075 rows, without month infortmation! However it represents 10% of values. We should not just remove it... Let us proceed to try to understand if this will have impact on our study.

2.5.4 - Checking`postal_code` column details¶

In [37]:

autos["postal_code"].value_counts().sort_index().head(10)

Out[37]:

1067    11
1069     7
1097     4
1099    10
1108     2
1109     9
1127     3
1129     8
1139    15
1156     4
Name: postal_code, dtype: int64

--> We can conclude that it is usual to have zip code only with 4 digits, so it should not represent one problem. We will not remove anything here

2.5.5 - Checking`odometer_km` column details¶

In [38]:

autos["odometer_km"].value_counts().sort_index()

Out[38]:

5,000.000        967
10,000.000       264
20,000.000       784
30,000.000       789
40,000.000       819
50,000.000      1027
60,000.000      1164
70,000.000      1230
80,000.000      1436
90,000.000      1757
100,000.000     2169
125,000.000     5170
150,000.000    32424
Name: odometer_km, dtype: int64

--> Following what we can see, there is nothing strange. so we will not remove anything here.

2.6 - Checking statistics of non-numeric columns¶

In [39]:

#autos.describe(include="all")
autos.describe(include="object")

Out[39]:

	name	seller	offer_type	abtest	vehicle_type	gearbox	model	fuel_type	brand	unrepaired_damage
count	50000	50000	50000	50000	44905	47320	47242	45518	50000	40171
unique	38754	2	2	2	8	2	245	7	40	2
top	Ford_Fiesta	privat	Angebot	test	limousine	manuell	golf	benzin	volkswagen	nein
freq	78	49999	49999	25756	12859	36993	4024	30107	10687	35232

Using the function dataframe.describe with include=object, we can see statistcs for non-numerical columns, and we can see some relevant information, which we will need to solve.

Checking the results of seller and offer_type, we can see that almost all values in these columns has the same value (only one seller is not privat and only one offer is not Angebot type).
Checking the results of abtest,gearbox and unrepaired_damage , we can see that the results are split between 2 values, which is correct, if there is not nulls in these columns:
- abtest: should be test or control;
- gearbox: should be manuell or automatic;
- unrepaired_damage: should be nein or ja.
Checking the results of vehicle_type and fuel_type, we can see that we have 8 and 7 different types of each. We should inspect this uniques values and confirm if it is correct.

2.6.1 - Checking `seller` and `offer tyoe` columns details¶

In [40]:

print(autos["offer_type"].value_counts(dropna=False))

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

In [41]:

print(autos["seller"].value_counts(dropna=False))

privat        49999
gewerblich        1
Name: seller, dtype: int64

2.6.2 - Checking `abtest`, `gearbox` and `offer tyoe` columns details¶

In [42]:

print(autos["abtest"].value_counts(dropna=False))

test       25756
control    24244
Name: abtest, dtype: int64

In [43]:

print(autos["gearbox"].value_counts(dropna=False))

manuell      36993
automatik    10327
NaN           2680
Name: gearbox, dtype: int64

In [44]:

print(autos["unrepaired_damage"].value_counts(dropna=False))

nein    35232
NaN      9829
ja       4939
Name: unrepaired_damage, dtype: int64

--> According to our results, we can see that:

we have big qty of NaN values in gearboxand unrepaired_damaged, so we cannot remove it directly. Let us see if it has impact on our analyze. If don't have we will keep this NaN values here.

In [45]:

print(autos["vehicle_type"].value_counts(dropna=False))

limousine     12859
kleinwagen    10822
kombi          9127
NaN            5095
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64

In [46]:

print(autos["fuel_type"].value_counts(dropna=False))

benzin     30107
diesel     14567
NaN         4482
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64

--> According to our following inspection,except NaN values, other values are regarding real different types of vehicles or fuel.

2.7 - Translate text from german to english¶

In [47]:

dic_gearbox={"manuell":"manual", "automatik":"automatic"}
autos["gearbox"]=autos["gearbox"].map(dic_gearbox)
print(autos["gearbox"].value_counts(dropna=False))

manual       36993
automatic    10327
NaN           2680
Name: gearbox, dtype: int64

In [48]:

dic_unrepaired={"nein":"no", "ja":"yes"}
autos["unrepaired_damage"]=autos["unrepaired_damage"].map(dic_unrepaired)
print(autos["unrepaired_damage"].value_counts(dropna=False))

no     35232
NaN     9829
yes     4939
Name: unrepaired_damage, dtype: int64

In [49]:

dic_vehicle_type=({
    "limousine":"sedan",
    "kleinwagen":"small car",
    "kombi":"station wagon",
    "NaN":"NaN",
    "bus":"bus",
    "cabrio":"convertible",
    "coupe":"coupe",
    "suv":"suv",    
    "andere":"other"})
autos["vehicle_type"]=autos["vehicle_type"].map(dic_vehicle_type)
print(autos["vehicle_type"].value_counts(dropna=False))

sedan            12859
small car        10822
station wagon     9127
NaN               5095
bus               4093
convertible       3061
coupe             2537
suv               1986
other              420
Name: vehicle_type, dtype: int64

In [50]:

dic_fuel_type=({
    "benzin":"petrol",
    "diesel":"diesel",
    "NaN":"NaN",
    "lpg":"lpg",
    "cng":"cng",
    "hybrid":"hybrid",
    "andere":"other",
    "elektro":"electro"})
autos["fuel_type"]=autos["fuel_type"].map(dic_fuel_type)
print(autos["fuel_type"].value_counts(dropna=False))

petrol     30107
diesel     14567
NaN         4482
lpg          691
cng           75
hybrid        37
other         22
electro       19
Name: fuel_type, dtype: int64

In [51]:

#autos["model"]=autos["name"].str.split("_").str[1]
#autos["model"]

3 - Dataset cleaning - remove outliers values¶

In [52]:

print("Dataset original shape:")
print(autos.shape)

Dataset original shape:
(50000, 20)

In [53]:

#from price_dollar-  as we conclude we will remove some rows
#autos_select=autos[autos["price_dollar"]==0].remove()
autos=autos.loc[autos["price_dollar"].between(50,350000)]
print("Shape after delete outliers of price:")
print(autos.shape)

Shape after delete outliers of price:
(48347, 20)

In [54]:

#from odometer_km - based on results that we got, we will not remove any rows
#from registration_year - basde on our previous conclusions we will remove some rows:
autos=autos.loc[autos["registration_year"].between(1910,2016)]
print("Shape after delete outliers of registration_year:")
print(autos.shape)

Shape after delete outliers of registration_year:
(46473, 20)

In [55]:

#autos_select["price_dollar"].value_counts(normalize=True).sort_index()
autos.describe(include="all", datetime_is_numeric=True)

Out[55]:

	date_crawled	name	seller	offer_type	price_dollar	abtest	vehicle_type	registration_year	gearbox	powerPS	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	pic_num	postal_code	last_seen
count	46473	46473	46473	46473	46,473.000	46473	43872	46,473.000	44453	46,473.000	44321	46,473.000	42,554.000	43252	46473	38311	46473	46,473.000	46,473.000	46473
unique	NaN	35613	2	1	NaN	2	8	NaN	2	NaN	244	NaN	NaN	7	40	2	NaN	NaN	NaN	NaN
top	NaN	BMW_316i	privat	Angebot	NaN	test	sedan	NaN	manual	NaN	golf	NaN	NaN	petrol	volkswagen	no	NaN	NaN	NaN	NaN
freq	NaN	75	46472	46473	NaN	23948	12567	NaN	34619	NaN	3695	NaN	NaN	28449	9821	33794	NaN	NaN	NaN	NaN
mean	2016-03-21 13:21:44.805177344	NaN	NaN	NaN	6,004.439	NaN	NaN	2,002.931	NaN	118.168	NaN	125,688.787	6.378	NaN	NaN	NaN	2016-03-20 19:19:03.141178880	0.000	51,118.913	2016-03-30 05:52:13.654874880
min	2016-03-05 14:06:30	NaN	NaN	NaN	50.000	NaN	NaN	1,910.000	NaN	0.000	NaN	5,000.000	1.000	NaN	NaN	NaN	2015-06-11 00:00:00	0.000	1,067.000	2016-03-05 14:45:46
25%	2016-03-13 13:52:31	NaN	NaN	NaN	1,250.000	NaN	NaN	1,999.000	NaN	75.000	NaN	100,000.000	3.000	NaN	NaN	NaN	2016-03-13 00:00:00	0.000	30,853.000	2016-03-23 13:48:53
50%	2016-03-21 17:46:58	NaN	NaN	NaN	3,149.000	NaN	NaN	2,003.000	NaN	109.000	NaN	150,000.000	6.000	NaN	NaN	NaN	2016-03-21 00:00:00	0.000	49,832.000	2016-04-04 07:47:19
75%	2016-03-29 14:46:30	NaN	NaN	NaN	7,500.000	NaN	NaN	2,008.000	NaN	150.000	NaN	150,000.000	9.000	NaN	NaN	NaN	2016-03-29 00:00:00	0.000	72,070.000	2016-04-06 11:15:36
max	2016-04-07 14:36:56	NaN	NaN	NaN	350,000.000	NaN	NaN	2,016.000	NaN	17,700.000	NaN	150,000.000	12.000	NaN	NaN	NaN	2016-04-07 00:00:00	0.000	99,998.000	2016-04-07 14:58:50
std	NaN	NaN	NaN	NaN	9,189.710	NaN	NaN	7.129	NaN	185.230	NaN	39,687.056	3.351	NaN	NaN	NaN	NaN	0.000	25,751.370	NaN

4 - Checking interesing facts¶

4.1 - What are the most commons brands?¶

First, we will aggregate the items by car brands

In [56]:

# Checking the unique brand values
print("What are the uniques brandes in our dataset: \n", autos["brand"].unique(),"\n")
print("How many brands do we have in our dataset? \n", autos["brand"].unique().shape[0])

What are the uniques brandes in our dataset: 
 ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia'] 

How many brands do we have in our dataset? 
 40

In [57]:

print("TOP 15 of quantity of cars per brand:")
print(autos["brand"].value_counts().head(15))


print("\n")
print("TOP 15 percentage of cars per brand:")
print((((autos["brand"].value_counts())/autos["brand"].shape)*100).head(15))

TOP 15 of quantity of cars per brand:
volkswagen       9821
bmw              5110
opel             4987
mercedes_benz    4485
audi             4025
ford             3252
renault          2193
peugeot          1390
fiat             1195
seat              848
skoda             763
nissan            712
mazda             709
smart             659
citroen           653
Name: brand, dtype: int64


TOP 15 percentage of cars per brand:
volkswagen      21.133
bmw             10.996
opel            10.731
mercedes_benz    9.651
audi             8.661
ford             6.998
renault          4.719
peugeot          2.991
fiat             2.571
seat             1.825
skoda            1.642
nissan           1.532
mazda            1.526
smart            1.418
citroen          1.405
Name: brand, dtype: float64

Based on this values, we can easily conclude that the brands with more cars in this dataset are volkswagen,bmwandopel<\span>

4.2 - What are the relation between the most commons brands and its mean price?¶

In [58]:

freq_perc=((autos["brand"].value_counts()/autos["brand"].shape)*100)

In [59]:

print(freq_perc["jaguar"])

0.15708045531814172

In [60]:

#creating empty dictionary to strore our aggregate data

brand_avg_price={}


#populating our dictionary with average price per brand (because we think it can be useful, we will also add the brand freq)
for each in autos["brand"].unique():
    avg_price=autos.loc[autos["brand"]==each, "price_dollar"].mean()
    brand_avg_price[each]=[avg_price,freq_perc[each]]
print(brand_avg_price)

{'peugeot': [3100.6928057553955, 2.9909840122221505], 'bmw': [8376.805088062622, 10.99563187226992], 'volkswagen': [5424.9185419000105, 21.132700707937943], 'smart': [3591.0212443095597, 1.4180276719815805], 'ford': [3762.1165436654364, 6.997611516364341], 'chrysler': [3465.743902439024, 0.35289307770103073], 'seat': [4423.116745283019, 1.824715426161427], 'renault': [2483.889192886457, 4.718869020721709], 'mercedes_benz': [8663.059308807135, 9.65076496029953], 'audi': [9373.771428571428, 8.66094291308932], 'sonstige_autos': [12642.134228187919, 0.9618488154412239], 'opel': [2996.0850210547424, 10.730962063994147], 'mazda': [4112.596614950635, 1.525617024939212], 'porsche': [46455.89323843416, 0.604652163621888], 'mini': [10639.450980392157, 0.8779291201342715], 'toyota': [5167.091062394604, 1.2760097260775074], 'dacia': [5915.528455284553, 0.264669808275773], 'nissan': [4750.063202247191, 1.5320723861166698], 'jeep': [11650.5, 0.22808942827017836], 'saab': [3211.6493506493507, 0.16568760355475223], 'volvo': [4993.208037825059, 0.9102059260215609], 'mitsubishi': [3430.25, 0.817679082477998], 'jaguar': [11635.493150684932, 0.15708045531814172], 'fiat': [2818.456066945607, 2.5713855356873885], 'skoda': [6393.018348623853, 1.6418135261334537], 'subaru': [4033.7551020408164, 0.21087513179695738], 'kia': [5982.330303030303, 0.7100897295203666], 'citroen': [3784.924961715161, 1.405116949626665], 'chevrolet': [6709.358490566037, 0.5702235706754459], 'hyundai': [5376.740899357602, 1.0048845566242766], 'honda': [4119.109589041096, 0.7854022765907086], 'daewoo': [1064.0579710144928, 0.14847330708153123], 'suzuki': [4111.572463768116, 0.5938932283261249], 'trabant': [1818.609375, 0.1377143717857681], 'land_rover': [19108.091836734693, 0.21087513179695738], 'alfa_romeo': [4100.915857605178, 0.6649022012781616], 'lada': [2688.296296296296, 0.05809825059712091], 'rover': [1602.2903225806451, 0.13341079766746283], 'daihatsu': [1636.1965811965813, 0.2517590859208573], 'lancia': [3444.877551020408, 0.10543756589847869]}

However, it is not easy read the values print in a dictionary format.

Let us see, how will be the dictionary if we use pretty print from json:

In [61]:

import json
print(json.dumps(brand_avg_price, indent=4))

{
    "peugeot": [
        3100.6928057553955,
        2.9909840122221505
    ],
    "bmw": [
        8376.805088062622,
        10.99563187226992
    ],
    "volkswagen": [
        5424.9185419000105,
        21.132700707937943
    ],
    "smart": [
        3591.0212443095597,
        1.4180276719815805
    ],
    "ford": [
        3762.1165436654364,
        6.997611516364341
    ],
    "chrysler": [
        3465.743902439024,
        0.35289307770103073
    ],
    "seat": [
        4423.116745283019,
        1.824715426161427
    ],
    "renault": [
        2483.889192886457,
        4.718869020721709
    ],
    "mercedes_benz": [
        8663.059308807135,
        9.65076496029953
    ],
    "audi": [
        9373.771428571428,
        8.66094291308932
    ],
    "sonstige_autos": [
        12642.134228187919,
        0.9618488154412239
    ],
    "opel": [
        2996.0850210547424,
        10.730962063994147
    ],
    "mazda": [
        4112.596614950635,
        1.525617024939212
    ],
    "porsche": [
        46455.89323843416,
        0.604652163621888
    ],
    "mini": [
        10639.450980392157,
        0.8779291201342715
    ],
    "toyota": [
        5167.091062394604,
        1.2760097260775074
    ],
    "dacia": [
        5915.528455284553,
        0.264669808275773
    ],
    "nissan": [
        4750.063202247191,
        1.5320723861166698
    ],
    "jeep": [
        11650.5,
        0.22808942827017836
    ],
    "saab": [
        3211.6493506493507,
        0.16568760355475223
    ],
    "volvo": [
        4993.208037825059,
        0.9102059260215609
    ],
    "mitsubishi": [
        3430.25,
        0.817679082477998
    ],
    "jaguar": [
        11635.493150684932,
        0.15708045531814172
    ],
    "fiat": [
        2818.456066945607,
        2.5713855356873885
    ],
    "skoda": [
        6393.018348623853,
        1.6418135261334537
    ],
    "subaru": [
        4033.7551020408164,
        0.21087513179695738
    ],
    "kia": [
        5982.330303030303,
        0.7100897295203666
    ],
    "citroen": [
        3784.924961715161,
        1.405116949626665
    ],
    "chevrolet": [
        6709.358490566037,
        0.5702235706754459
    ],
    "hyundai": [
        5376.740899357602,
        1.0048845566242766
    ],
    "honda": [
        4119.109589041096,
        0.7854022765907086
    ],
    "daewoo": [
        1064.0579710144928,
        0.14847330708153123
    ],
    "suzuki": [
        4111.572463768116,
        0.5938932283261249
    ],
    "trabant": [
        1818.609375,
        0.1377143717857681
    ],
    "land_rover": [
        19108.091836734693,
        0.21087513179695738
    ],
    "alfa_romeo": [
        4100.915857605178,
        0.6649022012781616
    ],
    "lada": [
        2688.296296296296,
        0.05809825059712091
    ],
    "rover": [
        1602.2903225806451,
        0.13341079766746283
    ],
    "daihatsu": [
        1636.1965811965813,
        0.2517590859208573
    ],
    "lancia": [
        3444.877551020408,
        0.10543756589847869
    ]
}

It is much more readable. However, it is still not easy to compare the values. So, we decide to pass this values to a data frame

In [62]:

#Transform the dictionary in the dataframe:
df_brand_price=pd.DataFrame.from_dict(brand_avg_price,orient="index", columns=["mean_price_dol", "freq_percentage"])

#if we don't include the columns argument in DataFrame constructor, we can put the label in the columns in the following way:
#df_brand_price.rename({0:"mean_price_dol", 1:"freq_percentage"},axis=1, inplace=True)

print("TOP 5 cheaper brands:")
df_brand_price.sort_values(by=["mean_price_dol"]).head(5)

TOP 5 cheaper brands:

Out[62]:

	mean_price_dol	freq_percentage
daewoo	1,064.058	0.148
rover	1,602.290	0.133
daihatsu	1,636.197	0.252
trabant	1,818.609	0.138
renault	2,483.889	4.719

In [63]:

print("TOP 5 more expensive brands:")
df_brand_price.sort_values(by=["mean_price_dol"], ascending=False).head(5)

TOP 5 more expensive brands:

Out[63]:

	mean_price_dol	freq_percentage
porsche	46,455.893	0.605
land_rover	19,108.092	0.211
sonstige_autos	12,642.134	0.962
jeep	11,650.500	0.228
jaguar	11,635.493	0.157

If we check price by brand, from total cars in this dataset, we can see that the cheaper cars are: daewoo,rover,daihatsu,trabant and renault. And the more expensive brands are porsche,land_rover,sonstige_autos,jeep and jaguar.

However if at same time we check, how many cars from this brands we have in our dataset, we can see we have very few cars(almost all have less than 1% in our dataset (the only exception is renault which has 4.7%).

So it could be interesting to check what happens in the brands, for what we have more cars in this dataset:

In [64]:

df_brand_price[freq_perc>5].sort_values(by=["mean_price_dol"], ascending=False)

<ipython-input-64-204e483a8890>:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  df_brand_price[freq_perc>5].sort_values(by=["mean_price_dol"], ascending=False)

Out[64]:

	mean_price_dol	freq_percentage
audi	9,373.771	8.661
mercedes_benz	8,663.059	9.651
bmw	8,376.805	10.996
volkswagen	5,424.919	21.133
ford	3,762.117	6.998
opel	2,996.085	10.731

In [65]:

print("How can we read the values in this table?: \n")
print("The {:.2f}% of cars in this dataset are from volkswagen and its price in average is {:.2f}$.".format(df_brand_price.loc['volkswagen','freq_percentage'],df_brand_price.loc['volkswagen','mean_price_dol'] ))

print("The {:.2f}% of cars in this dataset are from opel and its price in average is {:.2f}$.".format(df_brand_price.loc['opel','freq_percentage'],df_brand_price.loc['opel','mean_price_dol'] ))

How can we read the values in this table?: 

The 21.13% of cars in this dataset are from volkswagen and its price in average is 5424.92$.
The 10.73% of cars in this dataset are from opel and its price in average is 2996.09$.

In the above table, we can see the price of most common brands in our dataset, i.e., the price of brands with more than 5% of frequence in the dataset.

From this data selection we can also see that:

the most expensive cars are from audi, mercedes_benz and `bmw´
the less expensive cars are ford and opel
Volkswagen even is the most common brand in this dataset, is in the middle in of above brands in what concerns mean price.

4.3 - What are the relation between the most commons brands and its mean odometer?¶

In [66]:

#creating empty dictionary to strore our aggregate data
brand_avg_odom={}

#populating our dictionary with average odemeter per brand
for each in autos["brand"].unique():
    avg_odom=autos.loc[autos["brand"]==each, "odometer_km"].mean()
    brand_avg_odom[each]=avg_odom
#print(brand_avg_odom)

#converting dictionary in series and dataframe using serie and dataframe constructor:
s_brand_odom=pd.Series(brand_avg_odom)
df_brand_odom=pd.DataFrame(s_brand_odom, columns=["mean_odometer_km"])
#print(df_brand_odom)

print("TOP 5 brands with km:")
df_brand_odom.sort_values(by=["mean_odometer_km"]).head(5)

TOP 5 brands with km:

Out[66]:

	mean_odometer_km
trabant	55,312.500
lada	83,518.519
dacia	84,268.293
mini	88,308.824
sonstige_autos	90,313.199

In [67]:

#As we think it can be useful check odometer information comparing with price and brand frequence, we will agregate the new values to our previous dataframe
df_brand_price_odom=df_brand_price.copy()
df_brand_price_odom["mean_odometer_km"]=df_brand_odom
df_brand_price_odom.head(5)

Out[67]:

	mean_price_dol	freq_percentage	mean_odometer_km
peugeot	3,100.693	2.991	127,122.302
bmw	8,376.805	10.996	132,651.663
volkswagen	5,424.919	21.133	128,771.001
smart	3,591.021	1.418	99,590.288
ford	3,762.117	6.998	124,295.818

In [68]:

print("TOP 5 brands with less km:")
df_brand_price_odom.sort_values(by=["mean_odometer_km"], ascending=False).head(5)

TOP 5 brands with less km:

Out[68]:

	mean_price_dol	freq_percentage	mean_odometer_km
saab	3,211.649	0.166	144,415.584
volvo	4,993.208	0.910	138,581.560
rover	1,602.290	0.133	137,661.290
bmw	8,376.805	10.996	132,651.663
chrysler	3,465.744	0.353	132,378.049

In [69]:

print("Brands with more than 5% of frequence:")
df_brand_price_odom[freq_perc>5].sort_values(by=["freq_percentage"], ascending=False)

Brands with more than 5% of frequence:

<ipython-input-69-0ff77f17394f>:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  df_brand_price_odom[freq_perc>5].sort_values(by=["freq_percentage"], ascending=False)

Out[69]:

	mean_price_dol	freq_percentage	mean_odometer_km
volkswagen	5,424.919	21.133	128,771.001
bmw	8,376.805	10.996	132,651.663
opel	2,996.085	10.731	129,440.545
mercedes_benz	8,663.059	9.651	130,923.077
audi	9,373.771	8.661	129,254.658
ford	3,762.117	6.998	124,295.818

Checking above values we can conclude that the most common brands in our dataset has in average as more or less same mileage, in average it is more than 124k km per brand.

4.3 - What are the relation between the most price and its mean odometer?¶

Let us to see what happen if we split the odometer into groups and use the aggregation to see if average prices follows any pattern based on the mileage

In [70]:

print(autos['odometer_km'].describe())

count    46,473.000
mean    125,688.787
std      39,687.056
min       5,000.000
25%     100,000.000
50%     150,000.000
75%     150,000.000
max     150,000.000
Name: odometer_km, dtype: float64

Our odometer groups will be:

less than 10000km
between 10000km and 30000km
between 30000km and 50000km
between 50000km and 75000km
between 75000km and 10000km
between 100000km and 14000km
more than 140000km

In [71]:

price_by_odometer = {'odom_until_10k':0,'odom_10k_30k':0, 'odom_30k_50k':0, 'odom_50k_75k':0,'odom_75k_100k':0,'odom_100k_140k':0,
              'odom_more_140k':0}
# Populating it
for each in autos['odometer_km']:
    if each<=10000:
        price_by_odometer['odom_until_10k'] += 1
    elif 10000 < each <= 30000:
        price_by_odometer['odom_10k_30k'] += 1
    elif 30000 < each <= 50000:
        price_by_odometer['odom_30k_50k'] += 1
    elif 30000 < each <= 75000:
        price_by_odometer['odom_50k_75k'] += 1
    elif 10000 < each <= 100000:
        price_by_odometer['odom_75k_100k'] += 1
    elif 10000 < each <= 140000:
        price_by_odometer['odom_100k_140k'] += 1
    else:
        price_by_odometer['odom_more_140k'] += 1
price_by_odometer
df_price_by_odometer=pd.DataFrame.from_dict(price_by_odometer, orient="index", columns=["freq_odom_group"])
df_price_by_odometer

Out[71]:

	freq_odom_group
odom_until_10k	970
odom_10k_30k	1499
odom_30k_50k	1789
odom_50k_75k	2311
odom_75k_100k	5094
odom_100k_140k	4846
odom_more_140k	29964

In [72]:

df_price_by_odometer["mean_price"]=0

In [73]:

df_price_by_odometer

Out[73]:

	freq_odom_group	mean_price
odom_until_10k	970	0
odom_10k_30k	1499	0
odom_30k_50k	1789	0
odom_50k_75k	2311	0
odom_75k_100k	5094	0
odom_100k_140k	4846	0
odom_more_140k	29964	0

In [74]:

df_price_by_odometer.iloc[0,1]=autos.loc[autos["odometer_km"] <=10000, "price_dollar"].mean()
df_price_by_odometer.iloc[1,1]=autos.loc[autos["odometer_km"] <=30000, "price_dollar"].mean()
df_price_by_odometer.iloc[2,1]=autos.loc[autos["odometer_km"] <=50000, "price_dollar"].mean()
df_price_by_odometer.iloc[3,1]=autos.loc[autos["odometer_km"] <=75000, "price_dollar"].mean()
df_price_by_odometer.iloc[4,1]=autos.loc[autos["odometer_km"] <=100000, "price_dollar"].mean()
df_price_by_odometer.iloc[5,1]=autos.loc[autos["odometer_km"] <=140000, "price_dollar"].mean()
df_price_by_odometer.iloc[6,1]=autos.loc[autos["odometer_km"] >140000, "price_dollar"].mean()

In [75]:

df_price_by_odometer

Out[75]:

	freq_odom_group	mean_price
odom_until_10k	970	12,286.290
odom_10k_30k	1499	15,483.654
odom_30k_50k	1789	15,100.466
odom_50k_75k	2311	13,889.271
odom_75k_100k	5094	11,618.389
odom_100k_140k	4846	10,036.145
odom_more_140k	29964	3,783.125

The results are interesting, as more milleage the car has, lower is its price. The only exception for this rule are the cares with 10k km or less, which has lower price than others. This can be explain by some typing mistake on odometer or on price, or car has other kind of problems.

4.4 - What are the most common brand/model combination?¶

In [76]:

#creating empty dictionary to strore our aggregate data
brand_model={}

#populating our dictionary with average odemeter per brand
for each in autos["brand"].unique():

    brand_filter=autos.loc[autos["brand"]==each,"model"]
    model_freq=brand_filter.value_counts().head(3) #for each brand we will have the most common 3 models
    model_dic= model_freq.to_dict()
    brand_model[each]=model_dic
    
    
s_brand_model=pd.Series(brand_model)
df_brand_model=pd.DataFrame(s_brand_model, columns=["model"])
df_brand_model["freq_percentage"]=freq_perc #we will add the frequence of each brand
df_brand_model[freq_perc>5].sort_values(by=["freq_percentage"], ascending=False)

<ipython-input-76-98d7d4861509>:16: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  df_brand_model[freq_perc>5].sort_values(by=["freq_percentage"], ascending=False)

Out[76]:

	model	freq_percentage
volkswagen	{'golf': 3695, 'polo': 1600, 'passat': 1345}	21.133
bmw	{'3er': 2604, '5er': 1123, '1er': 521}	10.996
opel	{'corsa': 1582, 'astra': 1339, 'vectra': 538}	10.731
mercedes_benz	{'c_klasse': 1136, 'e_klasse': 954, 'a_klasse'...	9.651
audi	{'a4': 1226, 'a3': 825, 'a6': 795}	8.661
ford	{'focus': 757, 'fiesta': 722, 'mondeo': 479}	6.998

4.5 - How much cheaper are cars with damage than their non-damaged counterparts?¶

In [ ]:

"The cars with deffect not repaired cost in average {:,.2f}$, while the other cost in average {:,.2f}$ ".format(autos.loc[autos["unrepaired_damage"]=="yes", "price_dollar"].mean(),autos.loc[autos["unrepaired_damage"]=="no", "price_dollar"].mean())

This mean that, as expected the cars with unrepaired damages are in average much more cheaper than the cars that don't have damage to repaire.

We can see that in average cars with unrepaired damages cost less 3 times the price.

4.6 - How much expensive are cars with automatic gearbox?¶

In [ ]:

"The cars with automatic gearbox cost in average {:,.2f}$, while the manual gearbox cost in average {:,.2f}$ ".format(autos.loc[autos["gearbox"]=="automatic", "price_dollar"].mean(),autos.loc[autos["gearbox"]=="manual", "price_dollar"].mean())

As expected the cars with automatic gearbox are in average much more expensive than the cars with manual gearbox.

The automatic gearbox cost in average more than double price than the manual gearbox cars.

4.7 - How is the price evoluation according to the vehicle type?¶

In [ ]:

#creating empty dictionary to strore our aggregate data
vehicle_avg_price={}


#populating our dictionary with average price per brand (because we think it can be useful, we will also add the brand freq)
for each in autos["vehicle_type"].unique():
    avg_price=autos.loc[autos["vehicle_type"]==each, "price_dollar"].mean()
    vehicle_avg_price[each]=float(avg_price)
#print(vehicle_avg_price)

print(json.dumps(vehicle_avg_price, indent=4))

The results we got are according to our expectation:

the most expensive types of cars are coupe and suv
the most cheapest types of cars are small cars and convertible

5 - Conclusions¶

After a brief overview of each column data of our dataset, we identified what could be clean, and what could be improve in order to take the analyse more easier and eficient.

Then in point 4, we made some analyzes, and we found very interesting results.

We could conclude how vehicle details have impact on price.

If we want to buy a car, we must understand if price makes sense attending average price for the same details:

brand
gearbox
unrepaired damages
vehicle car
millenage

PROJECT N. 3 - Exploring the data details of used cars sold in eBay

1. Introduction¶

1.1 Dataset information:¶

1.2. Dataset dictionary:¶

1.3. Project objectives¶

2. Dataset cleaning - format data, identify outliers, identify wrong values¶

2.1 - First notes:¶

2.2 - Format columns labels to snakecase:¶

2.3 - Convert text column to numeric type¶

2.3.1- Column price¶

2.3.2- Column odometer¶

2.4 - Converts date columns to datetime type¶

2.5 - Checking statistics of numeric columns (exept datetime type)¶

2.5.1 - Checking price column details¶

2.5.2 - Checkingregistration_year column details¶

2.5.3 - Checkingregistration_month column details¶

2.5.4 - Checkingpostal_code column details¶

2.5.5 - Checkingodometer_km column details¶

2.6 - Checking statistics of non-numeric columns¶

2.6.1 - Checking seller and offer tyoe columns details¶

2.6.2 - Checking abtest, gearbox and offer tyoe columns details¶

2.7 - Translate text from german to english¶

3 - Dataset cleaning - remove outliers values¶

4 - Checking interesing facts¶

4.1 - What are the most commons brands?¶

4.2 - What are the relation between the most commons brands and its mean price?¶

4.3 - What are the relation between the most commons brands and its mean odometer?¶

4.3 - What are the relation between the most price and its mean odometer?¶

4.4 - What are the most common brand/model combination?¶

4.5 - How much cheaper are cars with damage than their non-damaged counterparts?¶

4.6 - How much expensive are cars with automatic gearbox?¶

4.7 - How is the price evoluation according to the vehicle type?¶

5 - Conclusions¶

1. Introduction ¶

2.3.1- Column `price`¶

2.3.2- Column `odometer`¶

2.5.2 - Checking`registration_year` column details¶

2.5.3 - Checking`registration_month` column details¶

2.5.4 - Checking`postal_code` column details¶

2.5.5 - Checking`odometer_km` column details¶

2.6.1 - Checking `seller` and `offer tyoe` columns details¶

2.6.2 - Checking `abtest`, `gearbox` and `offer tyoe` columns details¶