import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv',encoding = 'Latin-1')
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
There are null values in several columns (vehicleType, gearbox, model, fuelType, notRepairedDamage), so we will likely need to clean those columns.
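A quick way to quantify the missing values is isnull().sum(); a minimal sketch on a hypothetical two-row frame (on the real data the call would simply be autos.isnull().sum()):

```python
import pandas as pd

# Hypothetical miniature frame standing in for 'autos'; None marks a missing value.
df = pd.DataFrame({"vehicle_type": ["bus", None], "brand": ["bmw", "ford"]})

# Count of nulls per column.
null_counts = df.isnull().sum()
print(null_counts["vehicle_type"])  # 1 missing value in vehicle_type
```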
Convert the column names from camelCase to snake_case:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
corrected_column = {'dateCrawled':'date_crawled', 'offerType':'offer_type',
                    'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year',
                    'monthOfRegistration':'registration_month', 'fuelType':'fuel_type',
                    'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created',
                    'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code',
                    'lastSeen':'last_seen'}
autos = autos.rename(corrected_column,axis=1)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | powerPS | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
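As an alternative to a hand-written mapping, camelCase names can be converted mechanically with a regex. This is only a sketch: it cannot produce the semantic renames chosen above (e.g. 'yearOfRegistration' to 'registration_year'), and it mishandles runs of capitals such as 'powerPS'.

```python
import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter (except at the
    # start of the string), then lowercase the result.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("dateCrawled"))  # date_crawled
print(camel_to_snake("offerType"))    # offer_type
print(camel_to_snake("powerPS"))      # power_p_s  (a known limitation)
```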
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | powerPS | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 15:49:30 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Columns whose number of unique values shows NaN are the numeric ones; they need further investigation.
autos.loc[:,'nr_of_pictures'].value_counts()
0    50000
Name: nr_of_pictures, dtype: int64
The 'nr_of_pictures' column contains only the value 0 for every row, so it is not useful for analysis.
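A constant column like this could simply be dropped; a sketch on a hypothetical miniature frame (assuming the column is not needed later):

```python
import pandas as pd

# Hypothetical frame with one constant column and one informative column.
autos = pd.DataFrame({"nr_of_pictures": [0, 0, 0], "price": [5000, 8500, 8990]})

# DataFrame.drop with columns= removes the uninformative column.
autos = autos.drop(columns=["nr_of_pictures"])
print(list(autos.columns))  # ['price']
```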
autos.loc[:,'price'].value_counts()
$0         1421
$500        781
$1,500      734
$2,500      643
$1,000      639
           ...
$11,780       1
$47,997       1
$7,373        1
$11,149       1
$277          1
Name: price, Length: 2357, dtype: int64
To convert the 'price' column to numeric values, we need to remove the '$' and ',' characters before casting to int.
autos.loc[:,'odometer'].value_counts()
150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64
Similarly, to convert the 'odometer' column to numeric values, we need to remove the 'km' suffix and ',' before casting to int.
We define a single function shared by both conversions:
def replace_convert(col):
    # Strip the thousands separator and the '$'/'km' markers, then cast to int.
    col = col.replace(',','')
    col = col.replace('$','')
    col = col.replace('km','')
    return int(col)
price1 = []
for element in autos.loc[:,'price']:
    price1.append(replace_convert(element))
autos.loc[:,'price'] = price1

odometer1 = []
for element in autos.loc[:,'odometer']:
    odometer1.append(replace_convert(element))
autos.loc[:,'odometer'] = odometer1
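The element-wise loops above work, but pandas' vectorized string methods do the same cleaning in a few chained calls; a sketch on hypothetical sample values mirroring the raw formats:

```python
import pandas as pd

# Hypothetical sample rows in the raw 'price' and 'odometer' formats.
autos = pd.DataFrame({"price": ["$5,000", "$8,500"],
                      "odometer": ["150,000km", "70,000km"]})

# Strip the markers with vectorized str.replace, then cast the whole column.
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))
autos["odometer"] = (autos["odometer"]
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int))
print(autos["price"].tolist())  # [5000, 8500]
```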
Since we have removed the 'km' suffix, we rename 'odometer' to 'odometer_km' to keep the unit visible:
autos = autos.rename({'odometer':'odometer_km'},axis=1)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | powerPS | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Examine the values in the 'price' column:
autos.loc[:,'price'].value_counts(dropna=False).sort_index()
0           1421
1            156
2              3
3              1
5              2
            ...
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price, Length: 2357, dtype: int64
There are 1,421 rows with a price of 0; since they make up under 3% of the data, we will remove them.
autos.loc[:,'price'].value_counts(dropna=False).sort_index(ascending=False).head(30)
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
194000      1
190000      1
180000      1
175000      1
169999      1
169000      1
163991      1
163500      1
155000      1
151990      1
Name: price, dtype: int64
We will also remove prices above 350,000: beyond that point the values jump abruptly into the millions, so they are most likely outliers or junk entries.
autos = autos.loc[autos.loc[:,'price'].between(1,350000),:]
autos.loc[:,'price'].describe()
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64
Examine the values in the 'odometer_km' column:
autos.loc[:,'odometer_km'].describe()
count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
autos.loc[:,'odometer_km'].value_counts()
150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
5000        836
40000       815
30000       780
20000       762
10000       253
Name: odometer_km, dtype: int64
There is no significant problem with the 'odometer_km' values: the range is plausible, though values are rounded to a small set of levels, suggesting sellers chose from preset options.
Now we check the datetime columns 'date_crawled', 'ad_created' and 'last_seen':
autos.loc[:,['date_crawled','ad_created','last_seen']].head(20)
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | 2016-03-21 00:00:00 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | 2016-03-20 00:00:00 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | 2016-03-16 00:00:00 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | 2016-03-22 00:00:00 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | 2016-03-16 00:00:00 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | 2016-03-14 00:00:00 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | 2016-03-16 00:00:00 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | 2016-03-31 00:00:00 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | 2016-03-23 00:00:00 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | 2016-03-23 00:00:00 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | 2016-04-01 00:00:00 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | 2016-03-16 00:00:00 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | 2016-03-29 00:00:00 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | 2016-03-26 00:00:00 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | 2016-03-17 00:00:00 | 2016-03-17 14:45:58 |
autos.loc[:,'date_crawled'].value_counts(dropna=False,normalize=True)
2016-03-16 21:50:53    0.000062
2016-04-04 16:40:33    0.000062
2016-03-23 18:39:34    0.000062
2016-03-05 16:57:05    0.000062
2016-03-11 22:38:16    0.000062
                         ...
2016-03-09 10:52:53    0.000021
2016-03-05 22:43:54    0.000021
2016-04-01 16:56:53    0.000021
2016-03-31 19:55:35    0.000021
2016-04-01 21:52:59    0.000021
Name: date_crawled, Length: 46882, dtype: float64
Since almost every timestamp is unique, we simplify all three columns from full datetimes to dates to make aggregation easier.
autos.loc[:,'date_crawled'] = autos.loc[:,'date_crawled'].str[:10]
autos.loc[:,'ad_created'] = autos.loc[:,'ad_created'].str[:10]
autos.loc[:,'last_seen'] = autos.loc[:,'last_seen'].str[:10]
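The string slicing above works because the timestamps are fixed-width. An alternative sketch that parses real datetimes first (which also keeps date arithmetic available before converting back to strings):

```python
import pandas as pd

# Hypothetical sample of one timestamp column.
crawled = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parse to datetime64, drop the time-of-day component, render as strings.
dates = pd.to_datetime(crawled).dt.date.astype(str)
print(dates.tolist())  # ['2016-03-26', '2016-04-04']
```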
autos.loc[:,'date_crawled'].value_counts(dropna=False,normalize=True).sort_index()
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64
autos.loc[:,'ad_created'].value_counts(dropna=False,normalize=True).sort_index()
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64
autos.loc[:,'last_seen'].value_counts(dropna=False,normalize=True).sort_index()
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64
The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.
Next we check the 'registration_year' column:
autos.loc[:,'registration_year'].describe()
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
autos.loc[:,'registration_year'].value_counts(dropna=False).sort_index(ascending=False).head(20)
9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
Name: registration_year, dtype: int64
autos.loc[:,'registration_year'].value_counts(dropna=False).sort_index(ascending=True).head(20)
1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
1950    3
1951    2
1952    1
1953    1
1954    2
Name: registration_year, dtype: int64
Some registration years are later than 2016, the year the data was crawled, so they must be errors; others fall long before the automobile was invented. Let's check what percentage of rows have a registration year outside 1900-2016.
(~autos.loc[:,'registration_year'].between(1900,2016)).sum() / autos.shape[0]
0.038793369710697
We will remove rows with a registration year before 1900 or after 2016, since they constitute only about 3.9% of the data.
autos = autos.loc[autos.loc[:,'registration_year'].between(1900,2016),:]
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | powerPS | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 46681 | 46681 | 46681 | 46681 | 46681.000000 | 46681 | 43977 | 46681.000000 | 44571 | 46681.000000 | 44488 | 46681.000000 | 46681.000000 | 43363 | 46681 | 38374 | 46681 | 46681.0 | 46681.000000 | 46681 |
unique | 34 | 35812 | 2 | 1 | NaN | 2 | 8 | NaN | 2 | NaN | 244 | NaN | NaN | 7 | 40 | 2 | 74 | NaN | NaN | 34 |
top | 2016-04-03 | BMW_316i | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 | NaN | NaN | 2016-04-06 |
freq | 1810 | 75 | 46680 | 46681 | NaN | 24062 | 12598 | NaN | 34715 | NaN | 3707 | NaN | NaN | 28540 | 9862 | 33834 | 1821 | NaN | NaN | 10425 |
mean | NaN | NaN | NaN | NaN | 5977.716801 | NaN | NaN | 2002.910756 | NaN | 117.892933 | NaN | 125586.855466 | 5.827125 | NaN | NaN | NaN | NaN | 0.0 | 51097.434181 | NaN |
std | NaN | NaN | NaN | NaN | 9177.909479 | NaN | NaN | 7.185103 | NaN | 184.922911 | NaN | 39852.528628 | 3.670300 | NaN | NaN | NaN | NaN | 0.0 | 25755.387192 | NaN |
min | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN | 1910.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 1250.000000 | NaN | NaN | 1999.000000 | NaN | 75.000000 | NaN | 100000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30827.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 3100.000000 | NaN | NaN | 2003.000000 | NaN | 109.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49828.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 7500.000000 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71732.000000 | NaN |
max | NaN | NaN | NaN | NaN | 350000.000000 | NaN | NaN | 2016.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos.loc[:,'registration_year'].value_counts(dropna=False,normalize=True).sort_index(ascending=False).head(20)
2016    0.026135
2015    0.008397
2014    0.014203
2013    0.017202
2012    0.028063
2011    0.034768
2010    0.034040
2009    0.044665
2008    0.047450
2007    0.048778
2006    0.057197
2005    0.062895
2004    0.057904
2003    0.057818
2002    0.053255
2001    0.056468
2000    0.067608
1999    0.062060
1998    0.050620
1997    0.041794
Name: registration_year, dtype: float64
Most of the cars were registered between 1997 and 2016, with the distribution peaking roughly between 1999 and 2007.
To calculate the mean price per top brand, we first decide which brands to include:
autos.loc[:,'brand'].value_counts(dropna=False,normalize=True).head(20)
volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
Name: brand, dtype: float64
We choose the top six brands, since each accounts for more than 5% of the listings.
top_brand = autos.loc[:,'brand'].value_counts(dropna=False).head(6).index
top_brand_price = {}
for brand in top_brand:
    mean_price = int(autos.loc[autos.loc[:,'brand'] == brand,'price'].mean())
    top_brand_price[brand] = mean_price
top_brand_price
{'volkswagen': 5402, 'bmw': 8332, 'opel': 2975, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749}
We have calculated the mean price for each top brand. There is a distinct price gap: Audi, BMW and Mercedes-Benz are notably more expensive than Opel and Ford, with Volkswagen in between.
Calculate the mean mileage for the top brands:
top_brand_mileage = {}
for brand in top_brand:
    mean_mileage = int(autos.loc[autos.loc[:,'brand'] == brand,'odometer_km'].mean())
    top_brand_mileage[brand] = mean_mileage
top_brand_mileage
{'volkswagen': 128707, 'bmw': 132572, 'opel': 129310, 'mercedes_benz': 130788, 'audi': 129157, 'ford': 124266}
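The two dictionary-building loops can also be expressed as a single groupby aggregation; a sketch on a hypothetical miniature frame standing in for the cleaned data:

```python
import pandas as pd

# Hypothetical miniature data standing in for the cleaned 'autos' frame.
autos = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price": [8000, 9000, 3000, 4000],
    "odometer_km": [150000, 125000, 100000, 150000],
})

# Named aggregation computes both per-brand summaries in one pass.
summary = autos.groupby("brand").agg(
    mean_price=("price", "mean"),
    mean_mileage=("odometer_km", "mean"),
)
print(summary.loc["bmw", "mean_price"])  # 8500.0
```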
Convert 'top_brand_price' from a dictionary to a pandas Series, then to a pandas DataFrame:
tb_price = pd.Series(top_brand_price)
df = pd.DataFrame(tb_price,columns=['mean_price'])
print(df)
               mean_price
volkswagen           5402
bmw                  8332
opel                 2975
mercedes_benz        8628
audi                 9336
ford                 3749
Convert 'top_brand_mileage' from a dictionary to a pandas Series:
tb_mileage = pd.Series(top_brand_mileage)
Add the series 'tb_mileage' to the DataFrame 'df' as a new column:
df.loc[:,'mean_mileage'] = tb_mileage
print(df)
               mean_price  mean_mileage
volkswagen           5402        128707
bmw                  8332        132572
opel                 2975        129310
mercedes_benz        8628        130788
audi                 9336        129157
ford                 3749        124266
Conclusion: mean mileage varies little across the top brands (roughly 124,000 to 133,000 km), while mean prices differ by a factor of three, so brand is a much stronger driver of price than mileage in this data.