Total Data points - 50,0000
dateCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the car.seller
- Whether the seller is private or a dealer.offerType
- The type of listingprice
- The price on the ad to sell the car.abtest
- Whether the listing is included in an A/B test.vehicleType
- The vehicle Type.yearOfRegistration
- The year in which the car was first registered.gearbox
- The transmission type.powerPS
- The power of the car in PS.model
- The car model name.kilometer
- How many kilometers the car has driven.monthOfRegistration
- The month in which the car was first registered.fuelType
- What type of fuel the car uses.brand
- The brand of the car.notRepairedDamage
- If the car has a damage which is not yet repaired.dateCreated
- The date on which the eBay listing was created.nrOfPictures
- The number of pictures in the ad.postalCode
- The postal code for the location of the vehicle.lastSeenOnline
- When the crawler saw this ad last online.The full dataset can be found here
#Importing required packages
import pandas as pd
import numpy as np
#Read the csv
autos = pd.read_csv("autos.csv", encoding = "Windows-1252")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info() # TO observe the data types and null values if any
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
From the above, there seems to be null velues present in VehicleType
, gearbox
, model
, fuelType
, notRepairedDamage
fields. The dataset has both string value columns and integer value columns
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#Renaming blowe columns for readability
autos.rename({"yearOfRegistration":"registration_year", "monthOfRegistration":"registration_month", "notRepairedDamage":"unrepaired_damage", "dateCreated":"ad_created"}, axis = 1, inplace = True)
# Creating a list of columns with lowercase and replacing the original columns for easier accessibility
lower_col = []
for col in autos.columns:
col = col.strip()
col = col.replace(" ", "_")
col = col.lower()
lower_col.append(col)
autos.columns = lower_col
autos.head()
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
#Exploring the Data
autos.describe(include = "all")
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-10 15:36:24 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos["seller"].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos["nrofpictures"].value_counts()
0 50000 Name: nrofpictures, dtype: int64
autos["gearbox"].value_counts()
manuell 36993 automatik 10327 Name: gearbox, dtype: int64
autos["fueltype"].value_counts()
benzin 30107 diesel 14567 lpg 691 cng 75 hybrid 37 andere 22 elektro 19 Name: fueltype, dtype: int64
autos["offertype"].value_counts()
Angebot 49999 Gesuch 1 Name: offertype, dtype: int64
autos["unrepaired_damage"].value_counts(dropna = False)
nein 35232 NaN 9829 ja 4939 Name: unrepaired_damage, dtype: int64
price
and odometer
seems to be in string types. They need to be convertged into int/float.seller
, nrofpictures
, offertype
field doesnt give much info. We can drop these columnsgearbox
, unrepaired_damage
and fueltype
are in german. Need to tranlate to english.autos["odometer"] = autos["odometer"].str.replace("km", "").str.replace(",", "").astype(int)
autos["price"] = autos["price"].str.replace("$", "").str.replace(",", "").astype(int)
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)
print(autos["odometer_km"].dtype)
print(autos["price"].dtype)
int64 int64
autos.head()
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer_km | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
drop_columns = ["seller", "nrofpictures", "offertype"]
autos.drop(drop_columns, axis = 1, inplace = True)
autos["gearbox"].unique()
array(['manuell', 'automatik', nan], dtype=object)
gearbox_dict = {
'manuell' : 'manual',
'automatik' : 'automatic',
np.nan : 'unknown'
}
autos["gearbox"] = autos["gearbox"].map(gearbox_dict)
autos["gearbox"].value_counts(dropna = False)
manual 36993 automatic 10327 unknown 2680 Name: gearbox, dtype: int64
autos["fueltype"].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
fueltype_dict = {
'lpg' : 'lpg',
'benzin' : 'petrol',
np.nan : 'unknown',
'cng' : 'cng',
'hybrid' : 'hybrid',
'elektro' : 'electric',
'andere' : 'other'
}
autos["fueltype"] = autos["fueltype"].map(fueltype_dict)
autos["fueltype"].value_counts(dropna = False)
petrol 30107 NaN 14567 unknown 4482 lpg 691 cng 75 hybrid 37 other 22 electric 19 Name: fueltype, dtype: int64
autos["unrepaired_damage"].unique()
array(['nein', nan, 'ja'], dtype=object)
unrepaired_damage_dict = {
'nein' : 'no',
np.nan : 'unknown',
'ja' : 'yes'
}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(unrepaired_damage_dict)
autos["unrepaired_damage"].value_counts(dropna = False)
no 35232 unknown 9829 yes 4939 Name: unrepaired_damage, dtype: int64
autos["odometer_km"].unique().shape[0] #No.of unique values
13
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos["odometer_km"].value_counts(dropna = False)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
autos["odometer_km"].value_counts().sort_index(ascending = False).head()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 Name: odometer_km, dtype: int64
The odometer doesn't seem to have any outliers to be concerning.
autos["price"].unique().shape[0] #No.of unique values
2357
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos["price"].value_counts(dropna = False)
0 1421 500 781 1500 734 2500 643 1000 639 ... 20790 1 8970 1 846 1 2895 1 33980 1 Name: price, Length: 2357, dtype: int64
autos["price"].value_counts(dropna = False).sort_index(ascending = True).tail(20)
197000 1 198000 1 220000 1 250000 1 259000 1 265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, dtype: int64
The price seems to have outliers at the higher end. So we will just proceed with the price ranges from 1000 to 100000
autos = autos[autos["price"].between(1000, 100000)]
autos["price"].value_counts(dropna = False).sort_index(ascending = True).head(20)
1000 639 1039 1 1040 1 1049 6 1050 95 1059 1 1070 1 1080 6 1090 4 1095 3 1098 1 1099 44 1100 376 1111 39 1112 1 1119 1 1120 2 1149 10 1150 226 1169 1 Name: price, dtype: int64
cols = ["datecrawled", "ad_created", "lastseen"]
for c in cols:
print("The counts of the column {} are".format(c), "\n")
print(autos[c].str[:10].value_counts(normalize = True, dropna = False))
print("\n")
The counts of the column datecrawled are 2016-04-03 0.039132 2016-03-20 0.038173 2016-03-12 0.037396 2016-03-21 0.037266 2016-04-04 0.036904 2016-03-14 0.036670 2016-04-02 0.036333 2016-03-28 0.035349 2016-03-07 0.035167 2016-03-19 0.035167 2016-04-01 0.034571 2016-03-29 0.034001 2016-03-15 0.033612 2016-03-10 0.033353 2016-03-26 0.033094 2016-03-30 0.033016 2016-03-11 0.032835 2016-03-08 0.032602 2016-03-09 0.032472 2016-03-22 0.032420 2016-03-23 0.032213 2016-03-31 0.031410 2016-03-27 0.031384 2016-03-25 0.030528 2016-03-17 0.030477 2016-03-16 0.029077 2016-03-24 0.029025 2016-03-05 0.025527 2016-03-13 0.016016 2016-03-06 0.013891 2016-04-05 0.013346 2016-03-18 0.012802 2016-04-06 0.003265 2016-04-07 0.001503 Name: datecrawled, dtype: float64 The counts of the column ad_created are 2016-04-03 0.039443 2016-03-20 0.038277 2016-03-21 0.037552 2016-04-04 0.037318 2016-03-12 0.037137 ... 2015-11-10 0.000026 2016-01-13 0.000026 2016-01-07 0.000026 2015-08-10 0.000026 2015-12-30 0.000026 Name: ad_created, Length: 74, dtype: float64 The counts of the column lastseen are 2016-04-06 0.234742 2016-04-07 0.139969 2016-04-05 0.130977 2016-03-17 0.026408 2016-04-02 0.024905 2016-04-03 0.024438 2016-03-30 0.023479 2016-04-04 0.023402 2016-04-01 0.023220 2016-03-31 0.022754 2016-03-12 0.022210 2016-03-22 0.020810 2016-03-29 0.020784 2016-03-20 0.019799 2016-03-21 0.019670 2016-03-28 0.019411 2016-03-24 0.018504 2016-03-23 0.017908 2016-03-25 0.017778 2016-03-26 0.016016 2016-03-16 0.015446 2016-03-15 0.015005 2016-03-19 0.014616 2016-03-27 0.014072 2016-03-14 0.011999 2016-03-11 0.011740 2016-03-10 0.009822 2016-03-09 0.008889 2016-03-13 0.008397 2016-03-18 0.007360 2016-03-08 0.006246 2016-03-07 0.004561 2016-03-06 0.003576 2016-03-05 0.001088 Name: lastseen, dtype: float64
import datetime as dt
cols = ["datecrawled", "ad_created", "lastseen"]
for c in cols:
print("The counts of the column {} are".format(c), "\n")
min_val = dt.datetime.strptime(autos[c].str[:10].min(), "%Y-%m-%d")
max_val = dt.datetime.strptime(autos[c].str[:10].max(), "%Y-%m-%d")
print("The min value of the column {} is {} are".format(c, min_val.strftime("%Y-%m-%d")))
print("The max value of the column {} is {} are".format(c, max_val.strftime("%Y-%m-%d")))
print("The difference in days are {}".format(str(max_val - min_val)[:7]))
print("\n")
The counts of the column datecrawled are The min value of the column datecrawled is 2016-03-05 are The max value of the column datecrawled is 2016-04-07 are The difference in days are 33 days The counts of the column ad_created are The min value of the column ad_created is 2015-06-11 are The max value of the column ad_created is 2016-04-07 are The difference in days are 301 day The counts of the column lastseen are The min value of the column lastseen is 2016-03-05 are The max value of the column lastseen is 2016-04-07 are The difference in days are 33 days
Insights from the datecrawled
, ad_created
and lastseen
columns are:
autos["registration_year"].describe()
count 38587.000000 mean 2005.680125 std 86.727780 min 1000.000000 25% 2001.000000 50% 2005.000000 75% 2009.000000 max 9999.000000 Name: registration_year, dtype: float64
Max and Min values seems to be off in the registration year
autos["registration_year"].value_counts(dropna = False).sort_index(ascending = True).head(50)
1000 1 1001 1 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 1 1951 1 1952 1 1953 1 1954 2 1955 2 1956 4 1957 2 1958 3 1959 6 1960 17 1961 6 1962 4 1963 8 1964 9 1965 17 1966 21 1967 25 1968 25 1969 18 1970 32 1971 24 1972 30 1973 23 1974 23 1975 18 1976 21 1977 19 1978 41 1979 31 1980 71 1981 26 1982 38 1983 50 1984 47 1985 79 1986 56 1987 51 Name: registration_year, dtype: int64
autos["registration_year"].value_counts(dropna = False).sort_index(ascending = True).tail(20)
2009 2076 2010 1580 2011 1616 2012 1303 2013 791 2014 647 2015 357 2016 645 2017 1008 2018 398 2019 1 2800 1 4100 1 4500 1 5000 2 5911 1 6200 1 8888 1 9000 1 9999 2 Name: registration_year, dtype: int64
# As the last ad created was in 2016, we will just cosider max register year until 2016
autos = autos[autos["registration_year"].between(1954, 2016)]
We have considered the max date as 2016 depending upon the ads created dates, and min date depending upon the online research
autos["registration_year"].value_counts(dropna = False).sort_index(ascending = True).tail(20)
1997 786 1998 1280 1999 1724 2000 2000 2001 2064 2002 2135 2003 2477 2004 2608 2005 2785 2006 2651 2007 2258 2008 2202 2009 2076 2010 1580 2011 1616 2012 1303 2013 791 2014 647 2015 357 2016 645 Name: registration_year, dtype: int64
autos["registration_year"].value_counts(normalize = True, dropna = False)
2005 0.074970 2006 0.071363 2004 0.070206 2003 0.066679 2007 0.060784 ... 1956 0.000108 1958 0.000081 1957 0.000054 1955 0.000054 1954 0.000054 Name: registration_year, Length: 63, dtype: float64
autos["brand"].describe()
count 37148 unique 40 top volkswagen freq 7842 Name: brand, dtype: object
autos["brand"].value_counts(dropna = False, normalize = True)
volkswagen 0.211102 bmw 0.125444 mercedes_benz 0.111635 audi 0.097717 opel 0.089184 ford 0.058684 renault 0.037337 peugeot 0.027942 fiat 0.021105 skoda 0.019086 seat 0.017309 smart 0.016636 toyota 0.014644 mazda 0.014267 citroen 0.013917 nissan 0.013648 mini 0.010902 hyundai 0.010768 sonstige_autos 0.010095 volvo 0.008991 kia 0.007699 honda 0.007349 mitsubishi 0.006891 porsche 0.006784 chevrolet 0.006622 alfa_romeo 0.006218 suzuki 0.005734 dacia 0.003284 chrysler 0.003176 jeep 0.002773 land_rover 0.002611 jaguar 0.001857 subaru 0.001723 daihatsu 0.001696 saab 0.001373 daewoo 0.000915 trabant 0.000861 rover 0.000727 lancia 0.000673 lada 0.000619 Name: brand, dtype: float64
Lets explore top 4 brands as they consolidate to nearly 50% of the data
brands_req = autos["brand"].value_counts(dropna = False, normalize = True).index[:4]
brand_price = {}
for brand in brands_req:
mean_price = autos[autos["brand"] == brand]["price"].mean()
brand_price[brand] = mean_price
brand_price
{'volkswagen': 6640.636954858454, 'bmw': 8974.562017167382, 'mercedes_benz': 9192.851700024114, 'audi': 10276.903581267217}
The top brand volkswagen
has a mean price of $6641
and it might be its affordability and not being a luxury brand, there are lot of ads for this brand of ebay
bmp_df = pd.DataFrame.from_dict(brand_price, orient='index', columns = ["mean_price"])
bmp_df
mean_price | |
---|---|
volkswagen | 6640.636955 |
bmw | 8974.562017 |
mercedes_benz | 9192.851700 |
audi | 10276.903581 |
brand_mileage = {}
for brand in brands_req:
mean_mil = autos[autos["brand"] == brand]["odometer_km"].mean()
brand_mileage[brand] = mean_mil
brand_mileage
{'volkswagen': 125768.93649579189, 'bmw': 132095.49356223177, 'mercedes_benz': 130177.23655654691, 'audi': 127524.79338842975}
bmp_df["mean_mileage"] = pd.Series(brand_mileage)
bmp_df
mean_price | mean_mileage | |
---|---|---|
volkswagen | 6640.636955 | 125768.936496 |
bmw | 8974.562017 | 132095.493562 |
mercedes_benz | 9192.851700 | 130177.236557 |
audi | 10276.903581 | 127524.793388 |
All the brands seems to have almost same mileage, which points to affordability as we mentioned above might be the reason for high popularity on ebay
print(sorted(autos["odometer_km"].unique()))
[5000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 125000, 150000]
# create a list of our conditions
conditions = [
(autos['odometer_km'] <= 30000),
(autos['odometer_km'] > 30000) & (autos['odometer_km'] <= 60000),
(autos['odometer_km'] > 60000) & (autos['odometer_km'] <= 90000),
(autos['odometer_km'] > 90000)
]
# create a list of the values we want to assign for each condition
values = ['low', 'avg', 'high', 'very_high']
# create a new column and use np.select to assign values to it using our lists as arguments
autos['mileage_cat'] = np.select(conditions, values)
# display updated DataFrame
autos.head()
datecrawled | name | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer_km | registration_month | fueltype | brand | unrepaired_damage | ad_created | postalcode | lastseen | mileage_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 | very_high |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | petrol | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 | very_high |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | petrol | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 | high |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatic | 71 | fortwo | 70000 | 6 | petrol | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 | high |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manual | 0 | focus | 150000 | 7 | petrol | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 | very_high |
autos["mileage_cat"].value_counts(dropna = False, normalize = True)
very_high 0.762706 high 0.106062 avg 0.075536 low 0.055696 Name: mileage_cat, dtype: float64
76% of the cars in the dataset belong to very_high
mileage category with mileage over 100,000 km
unique_mil_cat = autos["mileage_cat"].unique()
mil_price = {}
for cat in unique_mil_cat:
mean_price = autos[autos["mileage_cat"] == cat]["price"].mean()
mil_price[cat] = mean_price
mil_price
{'very_high': 5437.589101048247, 'high': 10207.797208121827, 'avg': 13921.566286528867, 'low': 16079.809569840503}
From the above, we can observe that the price of the cars with negatively correlated to the mileage. i.e. If the mileage is high, the price is low, and if the mileage is low, the price seems to be high
brand_model = autos["brand"] + "/" + autos["model"]
brand_model.value_counts(dropna = False, normalize = True)
volkswagen/golf 0.078551 bmw/3er 0.061511 NaN 0.038225 volkswagen/passat 0.031011 audi/a4 0.029396 ... dacia/andere 0.000027 alfa_romeo/145 0.000027 audi/200 0.000027 rover/rangerover 0.000027 rover/discovery 0.000027 Length: 288, dtype: float64
Volkswagen Golf is the mosst common car accounting for 8% of the dataset.
damage_unique = (autos[autos["unrepaired_damage"].isin(["yes", "no"])]["unrepaired_damage"]
.value_counts()
.index
)
is_damage_price = {}
for var in damage_unique:
mean_price = autos[autos["unrepaired_damage"] == var]["price"].mean()
is_damage_price[var] = mean_price
is_damage_price
{'no': 7861.325567281432, 'yes': 3866.8081244598097}
price_diff = round((is_damage_price["yes"] - is_damage_price["no"])/is_damage_price["no"],2)
price_diff
-0.51
The cars that are damaged are ~51% cheaper
than their non-damaged counterparts.