The aim of this project is to clean the data and analyze the included used car listings.
The data set consists of 50,000 data points, organized in the following fields:
Column Name | Col Description |
---|---|
dateCrawled | When this ad was first crawled. All field-values are taken from this date. |
name | Name of the car. |
seller | Whether the seller is private or a dealer. |
offerType | The type of listing |
price | The price on the ad to sell the car. |
abtest | Whether the listing is included in an A/B test. |
vehicleType | The vehicle Type. |
yearOfRegistration | The year in which the car was first registered. |
gearbox | The transmission type. |
powerPS | The power of the car in PS. |
model | The car model name. |
kilometer | How many kilometers the car has driven. |
monthOfRegistration | The month in which the car was first registered. |
fuelType | What type of fuel the car uses. |
brand | The brand of the car. |
notRepairedDamage | If the car has a damage which is not yet repaired. |
dateCreated | The date on which the eBay listing was created. |
nrOfPictures | The number of pictures in the ad. |
postalCode | The postal code for the location of the vehicle. |
lastSeenOnline | When the crawler saw this ad last online. |
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos.head(3)
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
for i in range(len(autos.columns)):
print(i, autos.columns[i])
0 dateCrawled 1 name 2 seller 3 offerType 4 price 5 abtest 6 vehicleType 7 yearOfRegistration 8 gearbox 9 powerPS 10 model 11 odometer 12 monthOfRegistration 13 fuelType 14 brand 15 notRepairedDamage 16 dateCreated 17 nrOfPictures 18 postalCode 19 lastSeen
#Populate a dictionary with mapping criteria for renaming columns.
c = autos.columns.copy()
m = {
'yearOfRegistration' : 'registration_year',
'monthOfRegistration' : 'registration_month',
'notRepairedDamage' : 'unrepaired_damage',
'dateCreated' : 'ad_created'
}
for i in c:
if i not in m:
split_points = []
start = 0
for j in range(len(i)):
if i[j] in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
split_points.append(i[start:j].lower())
start = j
split_points.append(i[start:].lower())
if len(split_points) > 0:
m[i]='_'.join(split_points)
else:
m[i] = i
m
{'abtest': 'abtest', 'brand': 'brand', 'dateCrawled': 'date_crawled', 'dateCreated': 'ad_created', 'fuelType': 'fuel_type', 'gearbox': 'gearbox', 'lastSeen': 'last_seen', 'model': 'model', 'monthOfRegistration': 'registration_month', 'name': 'name', 'notRepairedDamage': 'unrepaired_damage', 'nrOfPictures': 'nr_of_pictures', 'odometer': 'odometer', 'offerType': 'offer_type', 'postalCode': 'postal_code', 'powerPS': 'power_p_s', 'price': 'price', 'seller': 'seller', 'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year'}
#Use the mapping dic to change column names
autos.rename(columns = m, inplace = True)
autos.head(2)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
for i in range(len(autos.columns)):
print(i, autos.columns[i])
0 date_crawled 1 name 2 seller 3 offer_type 4 price 5 abtest 6 vehicle_type 7 registration_year 8 gearbox 9 power_p_s 10 model 11 odometer 12 registration_month 13 fuel_type 14 brand 15 unrepaired_damage 16 ad_created 17 nr_of_pictures 18 postal_code 19 last_seen
Thus far we have reviewed the dataset and changed the column headings. For the sake of simplicity and clarity we converted 'camelcase' to 'snakecase' and shortened some of the names. All headings are lower case now.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-09 11:54:38 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos['seller' ].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos[ 'offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos.loc[(autos['registration_year']<1900) | (autos['registration_year']>2020),'registration_year'].sort_values(ascending = False)
14341 9999 38076 9999 33950 9999 8012 9999 6308 9996 49910 9000 13559 9000 25003 8888 8360 6200 27618 5911 49153 5000 4164 5000 24519 5000 22799 5000 42079 4800 453 4500 4549 4100 27578 2800 10556 1800 32585 1800 35238 1500 24511 1111 49283 1001 22316 1000 Name: registration_year, dtype: int64
autos['nr_of_pictures'].value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
autos[['price', 'registration_year', 'power_p_s', 'odometer','registration_month', 'nr_of_pictures']].head()
price | registration_year | power_p_s | odometer | registration_month | nr_of_pictures | |
---|---|---|---|---|---|---|
0 | $5,000 | 2004 | 158 | 150,000km | 3 | 0 |
1 | $8,500 | 1997 | 286 | 150,000km | 6 | 0 |
2 | $8,990 | 2009 | 102 | 70,000km | 7 | 0 |
3 | $4,350 | 2007 | 71 | 70,000km | 6 | 0 |
4 | $1,350 | 2003 | 0 | 150,000km | 7 | 0 |
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): date_crawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offer_type 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicle_type 44905 non-null object registration_year 50000 non-null int64 gearbox 47320 non-null object power_p_s 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object registration_month 50000 non-null int64 fuel_type 45518 non-null object brand 50000 non-null object unrepaired_damage 40171 non-null object ad_created 50000 non-null object nr_of_pictures 50000 non-null int64 postal_code 50000 non-null int64 last_seen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
The examination above reveals the following issues:
'seller', 'offer_type', 'nr_of_pictures'
# replaceing non-numeric characters and converting the columns to 'int' type.
autos['price'] = autos['price'].str.replace("$","")
autos['price'] = autos['price'].str.replace(",","").astype(int)
autos['odometer'] = autos['odometer'].str.replace(",","")
autos['odometer'] = autos['odometer'].str.replace("km","").astype(int)
autos[['price', 'odometer']].head()
price | odometer | |
---|---|---|
0 | 5000 | 150000 |
1 | 8500 | 150000 |
2 | 8990 | 70000 |
3 | 4350 | 70000 |
4 | 1350 | 150000 |
# Rename the 'odometer' column to keep the information
autos.rename(columns = {'odometer': 'odometer_km'}, inplace = True)
autos[['price', 'odometer_km']].head()
price | odometer_km | |
---|---|---|
0 | 5000 | 150000 |
1 | 8500 | 150000 |
2 | 8990 | 70000 |
3 | 4350 | 70000 |
4 | 1350 | 150000 |
print('Odometer minimum = ',autos['odometer_km'].min())
print('Odometer maximum = ',autos['odometer_km'].max())
print('Price minimum = ',autos['price'].min())
print('Price maximum = ',autos['price'].max())
Odometer minimum = 5000 Odometer maximum = 150000 Price minimum = 0 Price maximum = 99999999
While the 'odometer' data seems okay, 'price' data has some erroneous min and max values. Let's investigate it further.
print('Unique values in "odometer_km" column: ',autos['odometer_km'].unique().shape)
print('Unique values in "price" column: ',autos['price'].unique().shape)
Unique values in "odometer_km" column: (13,) Unique values in "price" column: (2357,)
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['odometer_km'].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
Inadequately high or low values in the 'price' column.
autos.loc[autos['price']<501,'price'].value_counts().sort_index(ascending = True)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 ... 375 7 378 1 379 1 380 29 385 1 388 1 390 26 395 2 399 71 400 321 410 1 414 1 420 9 425 4 430 13 435 1 440 1 444 8 449 12 450 265 459 1 460 5 470 7 475 4 480 25 485 2 490 34 495 4 499 92 500 781 Name: price, Length: 140, dtype: int64
autos.loc[autos['price']>100000,'price'].value_counts().sort_index(ascending = True)
104900 1 105000 2 109999 1 114400 1 115000 1 115991 1 116000 1 119500 1 119900 1 120000 2 128000 1 129000 1 130000 1 135000 1 137999 1 139997 1 145000 1 151990 1 155000 1 163500 1 163991 1 169000 1 169999 1 175000 1 180000 1 190000 1 194000 1 197000 1 198000 1 220000 1 250000 1 259000 1 265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, dtype: int64
autos.loc[autos['price']>100000,['price', 'name', 'registration_year', 'vehicle_type']].sort_values('price')
price | name | registration_year | vehicle_type | |
---|---|---|---|---|
29286 | 104900 | Porsche_911_Turbo_PDK__Sport_Chrono_Garantie_D... | 2011 | coupe |
17540 | 105000 | Porsche_930_Turbo_classic_analytics_Gutachten_2+ | 1988 | coupe |
16964 | 105000 | Bentley_Continental_Supersports | 2010 | coupe |
49391 | 109999 | Lamborghini_Gallardo_LP560_4_E_Gear_"Callisto_... | 2008 | coupe |
22060 | 114400 | Tesla_Model_X90D_Autopilot_Leder_AHK_Kaltwette... | 2016 | suv |
7402 | 115000 | Porsche_911_Carrera_4S_Cabrio_PDK__BOSE__NEU__... | 2016 | cabrio |
21783 | 115991 | Porsche_911_Carrera_4S_Cabriolet_PDK | 2013 | cabrio |
33884 | 116000 | Porsche_911_Carrera_4S_Cabriolet_PDK | 2013 | cabrio |
38814 | 119500 | Porsche_Porsche_964_TURBO_S_X33_WLS_ab_Werk_35... | 1992 | coupe |
43282 | 119900 | Porsche_911_Carrera_4S_Cabriolet_PDK | 2014 | cabrio |
2751 | 120000 | Porsche_911___993_4S | 1998 | coupe |
44406 | 120000 | Mercedes_Benz_170_S_Cabriolet_A__Startklar_fue... | 1951 | cabrio |
8232 | 128000 | Porsche_993_S_Schalter_BRD_neuwertig | 1997 | coupe |
1878 | 129000 | Porsche_911_Turbo | 1995 | coupe |
49815 | 130000 | SUCHE_TIPPS___Ford_Mustang_Shelby_GT_350_500_K... | 1968 | coupe |
14268 | 135000 | Porsche_997_GT3_Approved_Lift_Unfallfrei_Mj.2011 | 2010 | coupe |
2454 | 137999 | Porsche_911_GT3 | 2010 | coupe |
32185 | 139997 | Porsche_997_GT3_MKII_EIN_SAMMLERSTÜCK_IN_VOLLA... | 2010 | coupe |
49668 | 145000 | Porsche_911__993__Carrera_S_"SONDERMODELL" | 1998 | coupe |
33638 | 151990 | Porsche_911_GT3_RS | 2007 | coupe |
10500 | 155000 | Porsche_991 | 2013 | coupe |
45387 | 163500 | Porsche_911_GT3__Clubsp._Lift_nur_bis_25.03._v... | 2015 | coupe |
22673 | 163991 | Porsche_911_991_GT3_LED_Licht_Chrono_Navi_Club... | 2014 | coupe |
18509 | 169000 | Porsche_991_GT3 | 2013 | coupe |
32840 | 169999 | Mercedes_Benz_G_500_7G_TRONIC_Cabrio_Desingno_... | 2012 | cabrio |
11433 | 175000 | Audi_R8_V10_plus_5.2_FSI_quattro_S_tronic_Akra... | 2016 | coupe |
17140 | 180000 | Mercedes_Benz_SLS_AMG_Roadster_AMG_SPEEDSHIFT_DCT | 2012 | cabrio |
20351 | 190000 | BMW_Z8_roadster | 2000 | cabrio |
28090 | 194000 | Tesla_Model_X_P90D_Signature_Sondermodel__Neuw... | 2016 | suv |
43668 | 197000 | Porsche_993/911_Turbo_WLS_II_Exclusive_S_deuts... | 1998 | coupe |
40918 | 198000 | Porsche_911_991_GT3_RS | 2015 | coupe |
37840 | 220000 | Porsche_997 | 2008 | coupe |
38299 | 250000 | Glas_BMW_mit_Wasser | 2015 | NaN |
47337 | 259000 | BMW_Z8_roadster | 2001 | cabrio |
12682 | 265000 | Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_... | 2016 | coupe |
35923 | 295000 | Porsche_911_Targa_Exclusive_Edition__1_von_15_... | 2015 | cabrio |
34723 | 299000 | Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au... | 1977 | coupe |
14715 | 345000 | Rolls_Royce_Phantom_Drophead_Coupe | 2012 | cabrio |
36818 | 350000 | Porsche_991 | 2016 | coupe |
37585 | 999990 | Volkswagen_Jetta_GT | 1985 | limousine |
43049 | 999999 | 2_VW_Busse_T3 | 1981 | bus |
514 | 999999 | Ford_Focus_Turnier_1.6_16V_Style | 2009 | kombi |
22947 | 1234566 | Bmw_530d_zum_ausschlachten | 1999 | kombi |
7814 | 1300000 | Ferrari_F40 | 1992 | coupe |
47634 | 3890000 | Ferrari_FXX | 2006 | coupe |
11137 | 10000000 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | 1960 | coupe |
2897 | 11111111 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | 1973 | limousine |
24384 | 11111111 | Schlachte_Golf_3_gt_tdi | 1995 | NaN |
27371 | 12345678 | Fiat_Punto | 2017 | NaN |
47598 | 12345678 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | 2001 | limousine |
39377 | 12345678 | Tausche_volvo_v40_gegen_van | 2018 | NaN |
42221 | 27322222 | Leasinguebernahme | 2014 | limousine |
39705 | 99999999 | Tausch_gegen_gleichwertiges | 1999 | limousine |
We can conclude that data enrties containing prices below 399 and above 350000 can be removed
autos = autos[autos['price'].between(399,351000)]
autos.sort_values('price').head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5179 | 2016-03-14 00:48:15 | Fiat_Punto_60_Selecta_SX | privat | Angebot | 399 | control | kleinwagen | 1997 | manuell | 60 | punto | 150000 | 5 | benzin | fiat | nein | 2016-03-13 00:00:00 | 0 | 53117 | 2016-04-07 03:16:34 |
6263 | 2016-03-30 11:52:01 | BMW_316i_LEDER_UND_KLIMA_BITTE_KEINE_MAILS | privat | Angebot | 399 | test | kleinwagen | 1998 | manuell | 102 | 3er | 150000 | 12 | benzin | bmw | nein | 2016-03-30 00:00:00 | 0 | 10827 | 2016-04-03 01:21:43 |
45964 | 2016-04-05 11:57:35 | Corsa_opel | privat | Angebot | 399 | test | NaN | 2000 | NaN | 0 | corsa | 150000 | 3 | NaN | opel | NaN | 2016-04-05 00:00:00 | 0 | 33334 | 2016-04-05 12:39:37 |
12965 | 2016-03-29 01:55:57 | Opel_Astra_Kombi_mit_AHK | privat | Angebot | 399 | control | kombi | 1998 | automatik | 101 | astra | 150000 | 3 | benzin | opel | ja | 2016-03-29 00:00:00 | 0 | 36304 | 2016-04-05 17:26:08 |
42442 | 2016-03-30 15:50:26 | Fiat_Punto | privat | Angebot | 399 | test | kleinwagen | 1997 | manuell | 50 | punto | 150000 | 11 | benzin | fiat | NaN | 2016-03-30 00:00:00 | 0 | 48369 | 2016-04-07 06:44:43 |
autos.sort_values('price', ascending = False).head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36818 | 2016-03-27 18:37:37 | Porsche_991 | privat | Angebot | 350000 | control | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-27 00:00:00 | 0 | 70499 | 2016-03-27 18:37:37 |
14715 | 2016-03-30 08:37:24 | Rolls_Royce_Phantom_Drophead_Coupe | privat | Angebot | 345000 | control | cabrio | 2012 | automatik | 460 | NaN | 20000 | 8 | benzin | sonstige_autos | nein | 2016-03-30 00:00:00 | 0 | 73525 | 2016-04-07 00:16:26 |
34723 | 2016-03-23 16:37:29 | Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au... | privat | Angebot | 299000 | test | coupe | 1977 | manuell | 260 | 911 | 100000 | 7 | benzin | porsche | nein | 2016-03-23 00:00:00 | 0 | 61462 | 2016-04-06 16:44:50 |
35923 | 2016-04-03 07:56:23 | Porsche_911_Targa_Exclusive_Edition__1_von_15_... | privat | Angebot | 295000 | test | cabrio | 2015 | automatik | 400 | 911 | 5000 | 6 | benzin | porsche | nein | 2016-04-03 00:00:00 | 0 | 74078 | 2016-04-03 08:56:20 |
12682 | 2016-03-28 22:48:01 | Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_... | privat | Angebot | 265000 | control | coupe | 2016 | automatik | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-28 00:00:00 | 0 | 70193 | 2016-04-05 03:44:51 |
Remaining data entries containg only prices between 399 and 350000.
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.025534 2016-03-06 0.014137 2016-03-07 0.036235 2016-03-08 0.033081 2016-03-09 0.032864 2016-03-10 0.032559 2016-03-11 0.032820 2016-03-12 0.037105 2016-03-13 0.015573 2016-03-14 0.036235 2016-03-15 0.034234 2016-03-16 0.029514 2016-03-17 0.031319 2016-03-18 0.012854 2016-03-19 0.034495 2016-03-20 0.038105 2016-03-21 0.037627 2016-03-22 0.032994 2016-03-23 0.032429 2016-03-24 0.028992 2016-03-25 0.031211 2016-03-26 0.032385 2016-03-27 0.031080 2016-03-28 0.034865 2016-03-29 0.033668 2016-03-30 0.033560 2016-03-31 0.031950 2016-04-01 0.033908 2016-04-02 0.035691 2016-04-03 0.038779 2016-04-04 0.036539 2016-04-05 0.013137 2016-04-06 0.003132 2016-04-07 0.001392 Name: date_crawled, dtype: float64
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2015-06-11 0.000022 2015-08-10 0.000022 2015-09-09 0.000022 2015-11-10 0.000022 2015-12-05 0.000022 2015-12-30 0.000022 2016-01-03 0.000022 2016-01-07 0.000022 2016-01-10 0.000043 2016-01-13 0.000022 2016-01-14 0.000022 2016-01-16 0.000022 2016-01-22 0.000022 2016-01-27 0.000065 2016-01-29 0.000022 2016-02-01 0.000022 2016-02-02 0.000043 2016-02-05 0.000043 2016-02-07 0.000022 2016-02-08 0.000022 2016-02-09 0.000022 2016-02-11 0.000022 2016-02-12 0.000043 2016-02-14 0.000043 2016-02-16 0.000022 2016-02-17 0.000022 2016-02-18 0.000043 2016-02-19 0.000065 2016-02-20 0.000043 2016-02-21 0.000043 ... 2016-03-09 0.032951 2016-03-10 0.032298 2016-03-11 0.033125 2016-03-12 0.036887 2016-03-13 0.017052 2016-03-14 0.034799 2016-03-15 0.033995 2016-03-16 0.030014 2016-03-17 0.030971 2016-03-18 0.013463 2016-03-19 0.033407 2016-03-20 0.038214 2016-03-21 0.037844 2016-03-22 0.032777 2016-03-23 0.032255 2016-03-24 0.028949 2016-03-25 0.031363 2016-03-26 0.032407 2016-03-27 0.031015 2016-03-28 0.034908 2016-03-29 0.033690 2016-03-30 0.033386 2016-03-31 0.032015 2016-04-01 0.033864 2016-04-02 0.035386 2016-04-03 0.039019 2016-04-04 0.036909 2016-04-05 0.011853 2016-04-06 0.003219 2016-04-07 0.001240 Name: ad_created, Length: 76, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.001087 2016-03-06 0.004219 2016-03-07 0.005263 2016-03-08 0.007134 2016-03-09 0.009483 2016-03-10 0.010353 2016-03-11 0.012114 2016-03-12 0.023859 2016-03-13 0.008809 2016-03-14 0.012354 2016-03-15 0.015703 2016-03-16 0.016225 2016-03-17 0.027839 2016-03-18 0.007308 2016-03-19 0.015507 2016-03-20 0.020532 2016-03-21 0.020553 2016-03-22 0.021293 2016-03-23 0.018509 2016-03-24 0.019553 2016-03-25 0.018813 2016-03-26 0.016399 2016-03-27 0.015594 2016-03-28 0.020749 2016-03-29 0.021728 2016-03-30 0.024142 2016-03-31 0.023576 2016-04-01 0.022881 2016-04-02 0.024860 2016-04-03 0.024947 2016-04-04 0.024359 2016-04-05 0.126343 2016-04-06 0.224325 2016-04-07 0.133586 Name: last_seen, dtype: float64
We confirm that there are no missing values in the three columns, shown in the distributions above.
autos['registration_year'].describe()
count 45978.000000 mean 2004.937427 std 88.905588 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
As we saw earlier the 'registration_year' column contains some inacurate entries lower than 1900 and higher than current year.
We'll remove all data enries containing a registration year earlier than 1900 and later than 2020.
autos = autos[autos['registration_year'].between(1900, 2016)]
autos.sort_values('registration_year').head(3)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28693 | 2016-03-22 17:48:41 | Renault_Twingo | privat | Angebot | 599 | control | kleinwagen | 1910 | manuell | 0 | NaN | 5000 | 0 | benzin | renault | NaN | 2016-03-22 00:00:00 | 0 | 70376 | 2016-04-06 09:16:59 |
22659 | 2016-03-14 08:51:18 | Opel_Corsa_B | privat | Angebot | 500 | test | NaN | 1910 | NaN | 0 | corsa | 150000 | 0 | NaN | opel | NaN | 2016-03-14 00:00:00 | 0 | 52393 | 2016-04-03 07:53:55 |
21416 | 2016-03-12 08:36:21 | Essex_super_six__Ford_A | privat | Angebot | 16500 | control | cabrio | 1927 | manuell | 40 | andere | 5000 | 5 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 74821 | 2016-03-15 12:45:12 |
autos.sort_values('registration_year', ascending = False).head(3)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19492 | 2016-03-09 19:43:08 | Mercedes_Benz_C_180 | privat | Angebot | 1700 | test | NaN | 2016 | manuell | 115 | c_klasse | 150000 | 6 | benzin | mercedes_benz | nein | 2016-03-09 00:00:00 | 0 | 6217 | 2016-04-06 04:46:05 |
2906 | 2016-03-09 17:57:44 | Golf_4_sehr_sparsam | privat | Angebot | 2200 | test | NaN | 2016 | manuell | 0 | golf | 150000 | 6 | diesel | volkswagen | NaN | 2016-03-09 00:00:00 | 0 | 27283 | 2016-03-14 10:44:23 |
13898 | 2016-03-16 16:57:08 | Audi_A4_B5_mit_TÜV!!! | privat | Angebot | 750 | control | NaN | 2016 | manuell | 174 | a4 | 150000 | 12 | NaN | audi | NaN | 2016-03-16 00:00:00 | 0 | 26169 | 2016-03-19 15:18:16 |
Here is the description of our data set after removing erroneous car registration years.
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 44161 entries, 0 to 49999 Data columns (total 20 columns): date_crawled 44161 non-null object name 44161 non-null object seller 44161 non-null object offer_type 44161 non-null object price 44161 non-null int64 abtest 44161 non-null object vehicle_type 42122 non-null object registration_year 44161 non-null int64 gearbox 42484 non-null object power_p_s 44161 non-null int64 model 42267 non-null object odometer_km 44161 non-null int64 registration_month 44161 non-null int64 fuel_type 41476 non-null object brand 44161 non-null object unrepaired_damage 37037 non-null object ad_created 44161 non-null object nr_of_pictures 44161 non-null int64 postal_code 44161 non-null int64 last_seen 44161 non-null object dtypes: int64(7), object(13) memory usage: 7.1+ MB
autos['registration_year'].value_counts(normalize = True).sort_index()
1910 0.000045 1927 0.000023 1929 0.000023 1931 0.000023 1934 0.000045 1937 0.000091 1938 0.000023 1939 0.000023 1941 0.000045 1943 0.000023 1948 0.000023 1950 0.000023 1951 0.000045 1952 0.000023 1953 0.000023 1954 0.000045 1955 0.000045 1956 0.000091 1957 0.000045 1958 0.000091 1959 0.000136 1960 0.000430 1961 0.000136 1962 0.000091 1963 0.000181 1964 0.000272 1965 0.000385 1966 0.000476 1967 0.000589 1968 0.000589 ... 1987 0.001562 1988 0.002966 1989 0.003533 1990 0.006454 1991 0.006839 1992 0.007292 1993 0.008333 1994 0.011956 1995 0.021512 1996 0.025950 1997 0.036661 1998 0.047599 1999 0.060823 2000 0.063495 2001 0.058174 2002 0.055230 2003 0.060415 2004 0.060868 2005 0.065239 2006 0.060393 2007 0.051448 2008 0.049954 2009 0.047055 2010 0.035937 2011 0.036616 2012 0.029596 2013 0.018048 2014 0.014809 2015 0.008265 2016 0.023188 Name: registration_year, Length: 78, dtype: float64
First we are going to summarize the mean price for all brands and store the data in a dictionary called "brand_pivot"
brand = autos['brand'].unique()
brand_pivot = {}
for i in brand:
brand_pivot[i] = round(autos.loc[autos['brand'] == i, 'price'].mean(),0)
brand_pivot
{'alfa_romeo': 4269.0, 'audi': 9516.0, 'bmw': 8521.0, 'chevrolet': 6810.0, 'chrysler': 3567.0, 'citroen': 3955.0, 'dacia': 5916.0, 'daewoo': 1207.0, 'daihatsu': 1860.0, 'fiat': 3152.0, 'ford': 4167.0, 'honda': 4296.0, 'hyundai': 5571.0, 'jaguar': 12130.0, 'jeep': 11650.0, 'kia': 6090.0, 'lada': 2780.0, 'lancia': 3726.0, 'land_rover': 19108.0, 'mazda': 4321.0, 'mercedes_benz': 8731.0, 'mini': 10691.0, 'mitsubishi': 3721.0, 'nissan': 5083.0, 'opel': 3281.0, 'peugeot': 3300.0, 'porsche': 46788.0, 'renault': 2723.0, 'rover': 1720.0, 'saab': 3416.0, 'seat': 4679.0, 'skoda': 6542.0, 'smart': 3601.0, 'sonstige_autos': 13317.0, 'subaru': 4415.0, 'suzuki': 4354.0, 'toyota': 5250.0, 'trabant': 2222.0, 'volkswagen': 5688.0, 'volvo': 5143.0}
Next we see how brands are distributed by the number of adds for each brand, sorted in descending order.
br_pr = autos['brand']
top_br = br_pr.value_counts(ascending = False)
top_br
volkswagen 9347 bmw 5021 opel 4519 mercedes_benz 4449 audi 3963 ford 2917 renault 1982 peugeot 1299 fiat 1058 seat 799 skoda 745 mazda 673 nissan 663 smart 657 citroen 623 toyota 583 hyundai 450 sonstige_autos 424 volvo 410 mini 406 honda 349 mitsubishi 348 kia 324 alfa_romeo 296 porsche 279 chevrolet 261 suzuki 260 chrysler 159 dacia 123 jeep 106 daihatsu 101 land_rover 98 subaru 89 saab 72 jaguar 70 daewoo 59 rover 57 trabant 51 lancia 45 lada 26 Name: brand, dtype: int64
Finally we can choose what portion of the brands we want to examine more closely. Lets say we want to see the mean prices for each brand that has more than 1000 adds:
popular_brands = []
for i in brand_pivot:
if top_br[i] > 1000:
popular_brands.append((brand_pivot[i], i))
sorted(popular_brands, reverse = True)
[(9516.0, 'audi'), (8731.0, 'mercedes_benz'), (8521.0, 'bmw'), (5688.0, 'volkswagen'), (4167.0, 'ford'), (3300.0, 'peugeot'), (3281.0, 'opel'), (3152.0, 'fiat'), (2723.0, 'renault')]
No surprises there - the top three most expensive brands (at least as far as mass production cars go) are Audi, Mercedes and BMW. The fourth - Volkswagen is far behind.
mpr_series = pd.Series(brand_pivot).sort_values(ascending = False)
mpr_series
porsche 46788.0 land_rover 19108.0 sonstige_autos 13317.0 jaguar 12130.0 jeep 11650.0 mini 10691.0 audi 9516.0 mercedes_benz 8731.0 bmw 8521.0 chevrolet 6810.0 skoda 6542.0 kia 6090.0 dacia 5916.0 volkswagen 5688.0 hyundai 5571.0 toyota 5250.0 volvo 5143.0 nissan 5083.0 seat 4679.0 subaru 4415.0 suzuki 4354.0 mazda 4321.0 honda 4296.0 alfa_romeo 4269.0 ford 4167.0 citroen 3955.0 lancia 3726.0 mitsubishi 3721.0 smart 3601.0 chrysler 3567.0 saab 3416.0 peugeot 3300.0 opel 3281.0 fiat 3152.0 lada 2780.0 renault 2723.0 trabant 2222.0 daihatsu 1860.0 rover 1720.0 daewoo 1207.0 dtype: float64
df_mean = pd.DataFrame(mpr_series, columns = ['mean_price'])
df_mean
mean_price | |
---|---|
porsche | 46788.0 |
land_rover | 19108.0 |
sonstige_autos | 13317.0 |
jaguar | 12130.0 |
jeep | 11650.0 |
mini | 10691.0 |
audi | 9516.0 |
mercedes_benz | 8731.0 |
bmw | 8521.0 |
chevrolet | 6810.0 |
skoda | 6542.0 |
kia | 6090.0 |
dacia | 5916.0 |
volkswagen | 5688.0 |
hyundai | 5571.0 |
toyota | 5250.0 |
volvo | 5143.0 |
nissan | 5083.0 |
seat | 4679.0 |
subaru | 4415.0 |
suzuki | 4354.0 |
mazda | 4321.0 |
honda | 4296.0 |
alfa_romeo | 4269.0 |
ford | 4167.0 |
citroen | 3955.0 |
lancia | 3726.0 |
mitsubishi | 3721.0 |
smart | 3601.0 |
chrysler | 3567.0 |
saab | 3416.0 |
peugeot | 3300.0 |
opel | 3281.0 |
fiat | 3152.0 |
lada | 2780.0 |
renault | 2723.0 |
trabant | 2222.0 |
daihatsu | 1860.0 |
rover | 1720.0 |
daewoo | 1207.0 |
miles_pivot = {}
for i in brand:
miles_pivot[i] = round(autos.loc[autos['brand'] == i, 'odometer_km'].mean(),0)
miles_pivot
{'alfa_romeo': 131436.0, 'audi': 129045.0, 'bmw': 132881.0, 'chevrolet': 100441.0, 'chrysler': 132736.0, 'citroen': 118965.0, 'dacia': 84268.0, 'daewoo': 118220.0, 'daihatsu': 115198.0, 'fiat': 115369.0, 'ford': 123615.0, 'honda': 122135.0, 'hyundai': 105656.0, 'jaguar': 125429.0, 'jeep': 127123.0, 'kia': 111836.0, 'lada': 84808.0, 'lancia': 119889.0, 'land_rover': 118010.0, 'mazda': 124264.0, 'mercedes_benz': 131068.0, 'mini': 88682.0, 'mitsubishi': 125690.0, 'nissan': 117526.0, 'opel': 128522.0, 'peugeot': 126532.0, 'porsche': 98477.0, 'renault': 126975.0, 'rover': 136579.0, 'saab': 144028.0, 'seat': 120776.0, 'skoda': 110792.0, 'smart': 99658.0, 'sonstige_autos': 90401.0, 'subaru': 125955.0, 'suzuki': 106654.0, 'toyota': 115669.0, 'trabant': 52647.0, 'volkswagen': 128417.0, 'volvo': 138573.0}
miles_series = pd.Series(miles_pivot, name = 'mean_km').sort_values(ascending = False)
count_series = pd.Series(top_br)
import pprint
df_mean ['mean_km'] = miles_series
df_mean['num_adds'] = count_series
top_6 = df_mean[df_mean['num_adds']>2000]
pprint.pprint(top_6.sort_values('mean_price', ascending = False))
mean_price mean_km num_adds audi 9516.0 129045.0 3963 mercedes_benz 8731.0 131068.0 4449 bmw 8521.0 132881.0 5021 volkswagen 5688.0 128417.0 9347 ford 4167.0 123615.0 2917 opel 3281.0 128522.0 4519
This is the summary data for the top 6 most popular brands, collected in a single data frame. As the milage column contains only 13 unique milage numbers (probably ranges and not real values) we cannot see if milage can explain the wide differene in mean prices.
Columns containing categorical data: abtest | vehicle_type | offer_type | seller | gearbox | model | fuel_type | brand | unrepaired_damage
Let's review the content of these columns and translate into English.
autos['abtest'].value_counts()
test 22785 control 21376 Name: abtest, dtype: int64
No need for translation.
print(autos['vehicle_type'].value_counts(dropna = False))
limousine 12224 kleinwagen 9558 kombi 8672 bus 3973 cabrio 2983 coupe 2395 NaN 2039 suv 1959 andere 358 Name: vehicle_type, dtype: int64
map_dict_1 ={#'limousine':'limousine'
'kleinwagen': 'small car',
'kombi': 'station wagon',
'bus': 'van',
# 'coupe':'coupe',
# 'suv':'suv',
'cabrio':'convertible',
'andere':'other'}
eng_type = autos['vehicle_type'].replace(map_dict_1, inplace = True)
autos['vehicle_type'].value_counts(dropna = False)
limousine 12224 small car 9558 station wagon 8672 van 3973 convertible 2983 coupe 2395 NaN 2039 suv 1959 other 358 Name: vehicle_type, dtype: int64
print(autos['offer_type'].value_counts())
Angebot 44161 Name: offer_type, dtype: int64
autos['offer_type'].replace({'Angebot':'Offer'}, inplace = True)
autos['offer_type'].value_counts()
Offer 44161 Name: offer_type, dtype: int64
print(autos['seller'].value_counts())
privat 44161 Name: seller, dtype: int64
autos['seller'].replace({'privat':'private'}, inplace = True)
autos['seller'].value_counts()
private 44161 Name: seller, dtype: int64
print(autos['gearbox'].value_counts())
manuell 32771 automatik 9713 Name: gearbox, dtype: int64
autos['gearbox'].replace({'manuell':'manual', 'automatik':'automatic'}, inplace = True)
autos['gearbox'].value_counts()
manual 32771 automatic 9713 Name: gearbox, dtype: int64
print(autos['fuel_type'].value_counts())
benzin 26774 diesel 13932 lpg 637 cng 67 hybrid 37 elektro 18 andere 11 Name: fuel_type, dtype: int64
autos['fuel_type'].replace({'benzin':'gasoline',
'elektro':'electric',
'andere':'other'}, inplace = True)
autos['fuel_type'].value_counts()
gasoline 26774 diesel 13932 lpg 637 cng 67 hybrid 37 electric 18 other 11 Name: fuel_type, dtype: int64
print(autos['unrepaired_damage'].value_counts())
nein 33251 ja 3786 Name: unrepaired_damage, dtype: int64
autos['unrepaired_damage'].replace({'nein':'no',
'ja':'yes'}, inplace = True)
autos['unrepaired_damage'].value_counts()
no 33251 yes 3786 Name: unrepaired_damage, dtype: int64
The tree columns, containing dates are: date_crawled; ad_created; last_seen.
autos['date_crawled'] = autos['date_crawled'].str[:10].str.replace('-','').astype(int)
autos['ad_created']=autos['ad_created'].str[:10].str.replace('-','').astype(int)
autos['last_seen']=autos['last_seen'].str[:10].str.replace('-','').astype(int)
autos[['date_crawled', 'ad_created', 'last_seen']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 44161 entries, 0 to 49999 Data columns (total 3 columns): date_crawled 44161 non-null int64 ad_created 44161 non-null int64 last_seen 44161 non-null int64 dtypes: int64(3) memory usage: 1.3 MB
autos[['date_crawled', 'ad_created','last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 20160326 | 20160326 | 20160406 |
1 | 20160404 | 20160404 | 20160406 |
2 | 20160326 | 20160326 | 20160406 |
3 | 20160312 | 20160312 | 20160315 |
4 | 20160401 | 20160401 | 20160401 |
autos.loc[autos['name'].str.find('Klima')> 0,'AC'] = 'Yes'
autos.sort_values('AC', ascending = False).head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | ... | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | AC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | private | Offer | 4350 | control | small car | 2007 | automatic | 71 | ... | 70000 | 6 | gasoline | smart | no | 20160312 | 0 | 33729 | 20160315 | Yes |
31743 | 20160320 | BMW_116d_Klimautomatik_/_PDC_/_LM_/_Sitzheizung | private | Offer | 15290 | test | limousine | 2013 | manual | 116 | ... | 100000 | 11 | diesel | bmw | no | 20160320 | 0 | 28325 | 20160407 | Yes |
32412 | 20160305 | Ford_Fusion_1_4_Klima | private | Offer | 2590 | control | limousine | 2006 | manual | 80 | ... | 125000 | 3 | gasoline | ford | yes | 20160305 | 0 | 61267 | 20160311 | Yes |
32371 | 20160326 | Audi_A3_Ambition_1.6_FSI_S_Line__Xenon/SSD/Kli... | private | Offer | 4990 | test | small car | 2006 | manual | 115 | ... | 150000 | 11 | gasoline | audi | no | 20160326 | 0 | 41334 | 20160326 | Yes |
32350 | 20160306 | Mercedes_Benz_SLK_230_Kompressor__Automatik__K... | private | Offer | 6400 | control | convertible | 1999 | automatic | 193 | ... | 150000 | 5 | gasoline | mercedes_benz | no | 20160306 | 0 | 76846 | 20160406 | Yes |
5 rows × 21 columns
We explored the text in column 'name' and if we found a string containing 'Klima' we put a value of 'Yes' in a new column called 'AC'
brand_mod = autos['brand']+'|'+autos['model']
brand_mod.value_counts().sort_values(ascending = False).head(10)
volkswagen|golf 3504 bmw|3er 2546 volkswagen|polo 1407 opel|corsa 1343 volkswagen|passat 1324 opel|astra 1260 audi|a4 1210 mercedes_benz|c_klasse 1128 bmw|5er 1120 mercedes_benz|e_klasse 949 dtype: int64
autos['odometer_km'].value_counts()
150000 28133 125000 4709 100000 1994 90000 1634 80000 1356 70000 1165 60000 1110 50000 980 40000 791 30000 748 20000 713 5000 601 10000 227 Name: odometer_km, dtype: int64
It seems that mileage data has already been aggregated but lets group it further in the following buckets and then ckeck what's the mean price for each bucket:
buckets = (20000, 50000, 70000, 90000, 100000, 125000)
miles_prices = {}
lim1 = 0
r = '{0}k-{1}k km'
r_final = 'above {0}k km'
for i in buckets:
miles_prices[r.format(lim1/1000, i/1000)] = round(\
autos.loc[autos['odometer_km'].between(lim1, i),'price'].mean(), 0)
lim1 = i + 1000
miles_prices[r_final.format(lim1/1000)] = round(\
autos.loc[autos['odometer_km']>lim1,'price'].mean(),0)
m_p_df = pd.Series(miles_prices)
m_p_df.sort_values()
pd.DataFrame(m_p_df, columns = ['Mean price']).sort_values('Mean price')
Mean price | |
---|---|
above 126.0k km | 4013.0 |
101.0k-125.0k km | 6403.0 |
91.0k-100.0k km | 8388.0 |
71.0k-90.0k km | 9203.0 |
51.0k-70.0k km | 11838.0 |
21.0k-50.0k km | 15357.0 |
0.0k-20.0k km | 16594.0 |
Of course there is no surprise here - prices are perfectly negatively correlated to mileage - the higher the mileage the lower the mean price of vehicles.
autos['unrepaired_damage'].value_counts()
no 33251 yes 3786 Name: unrepaired_damage, dtype: int64
damage = {}
damage['Damaged cars'] = round(\
autos.loc[autos['unrepaired_damage'] == 'yes','price'].mean())
damage['Cars with no defects'] = round(\
autos.loc[autos['unrepaired_damage'] == 'no', 'price'].mean())
pd.Series(damage)
Cars with no defects 7285.0 Damaged cars 2642.0 dtype: float64
Again - a predictable outcome! Damaged cars are cheaper on average than cars with no defects.