Data:used cars from eBay Kleinanzeigen, a classified section of the German eBay website. But we will use a modified Dataset having 50,000 data points from the original dataset.
Data Dictionary: Most fields are self explanatory.
datacrawled
- When this ad was first crawled
name
- Name of the car
seller
- private or dealer
offerType
- type of listing
price
- selling price
abtest
- A/B Test
vehicleType
yearofRegistration
gearbox
- Transmission type
powerPS
model
kilometer
- how many kilometers the car has driven
monthofRegistration
fuelType
brand
notRepairedDamage
nrOfPictures
- The number of pictures in the ad
postalCode
lastSeenOnline
- When the crawler saw this ad last online\
Aim of this Project: Clean the data and analyze the cars listed in the dataset. This Project gives hands-on Data Cleaning Basics which includes NumPy and Pandas
Step 1: Import the pandas and NumPy libraries
Step 2: Read the autos.CSV file
Step 3: Create a new cell with variable autos
Step 4: Print information about the dataset and first few rows
# Step 1:
import pandas as pd
import numpy as np
# Step 2:
autos = pd.read_csv('autos.csv',encoding='Latin-1')
# Step 3:
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
# Step 4:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
vehicleType
gearbox
model
fuelType
notrepairedDamage
has null valuesStep 1: Print array of existing column names
Step 2: Make changes to column names from camelCase to snakecase
Step 3: Change the Column names for better readibility
Step 4: Look at the current state of the autos
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.rename(columns={'dateCrawled':'date_crawled',
'offerType':'offer_type',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'powerPS':'power_ps',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'no_of_pics',
'postalCode':'postal_code',
'lastSeen':'last_seen'},inplace=True)
autos.columns = autos.columns.str.lower()
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Step 1: Identify Text columns where all or almost all values are same. These columns can be dropped from the analysis.
Step 2: Identify numeric data stored as text and convert to numeric values with proper column names.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-12 16:06:22 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos["seller"].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos["offer_type"].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos["abtest"].value_counts()
test 25756 control 24244 Name: abtest, dtype: int64
autos["vehicle_type"].value_counts()
limousine 12859 kleinwagen 10822 kombi 9127 bus 4093 cabrio 3061 coupe 2537 suv 1986 andere 420 Name: vehicle_type, dtype: int64
autos["unrepaired_damage"].value_counts()
nein 35232 ja 4939 Name: unrepaired_damage, dtype: int64
autos["no_of_pics"].value_counts()
0 50000 Name: no_of_pics, dtype: int64
seller
has only one entry with value "gewerblich". We can drop this column from our analysis/offertype
has only one entry with value "gesuch". We can drop this column from our analysis./no_of_pics
has values only 0. This column can be dropped from the analysisautos = autos.drop(["seller","offer_type","no_of_pics"],axis=1)
price
is in dollar and thus can be edited to have only numeric values and column name can be changed to price_$
odometer
values are in km and thus can be edited to have only numeric values and column name can be changed to odometer_km
autos["price"] = autos["price"].str.replace("$","").str.replace(',','').astype(int)
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(',','').astype(int)
autos.rename(columns={'price':'price_$','odometer':'odometer_km'},inplace=True)
autos.head()
date_crawled | name | price_$ | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
Analyze the columns and identify any outliers
autos["price_$"].unique().shape
(2357,)
autos["price_$"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_$, dtype: float64
autos["price_$"].value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price_$, dtype: int64
Looking at the above data let's keep the maximum value as 350000. Any value above this looks unrealistic with steep increase in price and thus can be ignored from analysis.
autos["price_$"].value_counts().sort_index(ascending=True).head(50)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 70 10 75 5 79 1 80 15 89 1 90 5 99 19 100 134 110 3 111 2 115 2 117 1 120 39 122 1 125 8 129 1 130 15 135 1 139 1 140 9 Name: price_$, dtype: int64
price value 0 can be ignored from our analysis.
autos = autos[autos["price_$"].between(1,350000)]
autos.describe(include='all')
date_crawled | name | price_$ | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 48565 | 48565 | 48565.000000 | 48565 | 43979 | 48565.000000 | 46222 | 48565.000000 | 46107 | 48565.000000 | 48565.000000 | 44535 | 48565 | 39464 | 48565 | 48565.000000 | 48565 |
unique | 46882 | 37470 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 7 | 40 | 2 | 76 | NaN | 38474 |
top | 2016-03-23 19:38:20 | Ford_Fiesta | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | 2016-04-07 06:17:27 |
freq | 3 | 76 | NaN | 25019 | 12598 | NaN | 36102 | NaN | 3900 | NaN | NaN | 29368 | 10336 | 34775 | 1887 | NaN | 8 |
mean | NaN | NaN | 5888.935591 | NaN | NaN | 2004.755421 | NaN | 117.197158 | NaN | 125770.101925 | 5.782251 | NaN | NaN | NaN | NaN | 50975.745207 | NaN |
std | NaN | NaN | 9059.854754 | NaN | NaN | 88.643887 | NaN | 200.649618 | NaN | 39788.636804 | 3.685595 | NaN | NaN | NaN | NaN | 25746.968398 | NaN |
min | NaN | NaN | 1.000000 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 1067.000000 | NaN |
25% | NaN | NaN | 1200.000000 | NaN | NaN | 1999.000000 | NaN | 71.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 30657.000000 | NaN |
50% | NaN | NaN | 3000.000000 | NaN | NaN | 2004.000000 | NaN | 107.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 49716.000000 | NaN |
75% | NaN | NaN | 7490.000000 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 71665.000000 | NaN |
max | NaN | NaN | 350000.000000 | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 99998.000000 | NaN |
Fuel Type
benzin is the most listed one.autos["odometer_km"].value_counts().sort_index(ascending=False).head(20)
150000 31414 125000 5057 100000 2115 90000 1734 80000 1415 70000 1217 60000 1155 50000 1012 40000 815 30000 780 20000 762 10000 253 5000 836 Name: odometer_km, dtype: int64
There are no changes recommended for this column as all seems to be legitimate. Value for 150000km seems to be very high. Probably it is for all the values '150000+'
registration_year and registration_month are already numeric values. date_crawled, ad_created and last_seen contains string values. Thus we need to extract Date from these fields.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
The dates are stored in string format with date in first 10 characters.
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
Data is crawled between 5th March 2016 and 7th April 2016.
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
The dates are in between June 2015 and April 2016. We can analyze further based on month and year (without date)
autos["ad_created"].str[:7].value_counts(normalize=True, dropna=False).sort_index()
2015-06 0.000021 2015-08 0.000021 2015-09 0.000021 2015-11 0.000021 2015-12 0.000041 2016-01 0.000247 2016-02 0.001256 2016-03 0.837496 2016-04 0.160877 Name: ad_created, dtype: float64
The maximum number of ads were created in the month of March 2016.
autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
On 5.April, 6.April and 7.April, we can see the spike in last seen entries,
autos["registration_year"].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
registration year 1000 and 9999 looks unrealistic. So let's find out the minimum logical registration year.
autos["registration_year"].value_counts(normalize=True, dropna=False).sort_index()
1000 0.000021 1001 0.000021 1111 0.000021 1800 0.000041 1910 0.000103 ... 5911 0.000021 6200 0.000021 8888 0.000021 9000 0.000021 9999 0.000062 Name: registration_year, Length: 95, dtype: float64
1910 looks realistic minimum year and 2016 looks realistic maximum year because the registration date cannot be later than ad_created. So we can delete the entries having invalid data in registration year
autos = autos[autos["registration_year"].between(1910,2016)]
autos.describe()
price_$ | registration_year | power_ps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 46681.000000 | 46681.000000 | 46681.000000 | 46681.000000 | 46681.000000 | 46681.000000 |
mean | 5977.716801 | 2002.910756 | 117.892933 | 125586.855466 | 5.827125 | 51097.434181 |
std | 9177.909479 | 7.185103 | 184.922911 | 39852.528628 | 3.670300 | 25755.387192 |
min | 1.000000 | 1910.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1250.000000 | 1999.000000 | 75.000000 | 100000.000000 | 3.000000 | 30827.000000 |
50% | 3100.000000 | 2003.000000 | 109.000000 | 150000.000000 | 6.000000 | 49828.000000 |
75% | 7500.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71732.000000 |
max | 350000.000000 | 2016.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
Around 1884 entries have been removed from the analysis.
price
by brand
¶We will use aggregation to understand the brand
column.
autos["brand"].value_counts(normalize=True)
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
Let us consider the brands that form at least 1 % of our data.
brands_all = autos["brand"].value_counts(normalize=True)
brands_selected = brands_all[brands_all > 0.01].index
brand_price = {}
for brand in brands_selected:
brand_autos = autos[autos["brand"]==brand]
mean_price = brand_autos["price_$"].mean()
brand_price[brand] = int(mean_price)
brand_price
{'volkswagen': 5402, 'bmw': 8332, 'opel': 2975, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749, 'renault': 2474, 'peugeot': 3094, 'fiat': 2813, 'seat': 4397, 'skoda': 6368, 'nissan': 4743, 'mazda': 4112, 'smart': 3580, 'citroen': 3779, 'toyota': 5167, 'hyundai': 5365}
Calculate mean mileage and mean price for the above selected brands which constitute more than 0.05% of our data and create the data frame. We will be caluclating these separately for automatic and manual.
brands_all = autos["brand"].value_counts(normalize=True)
brands_selected = brands_all[brands_all > 0.05].index
brand_price_automatik = {}
brand_price_manuell = {}
brand_odometer_automatik = {}
brand_odometer_manuell = {}
brand_count_manuell = {}
brand_count_automatik = {}
for brand in brands_selected:
brand_autos = autos[(autos["brand"] == brand) & (autos["gearbox"] == "manuell")]
brand_autos_count = brand_autos["brand"].value_counts()
mean_price = brand_autos["price_$"].mean()
mean_mileage = brand_autos["odometer_km"].mean()
brand_price_manuell[brand] = int(mean_price)
brand_odometer_manuell[brand] = int(mean_mileage)
brand_count_manuell[brand] = int(brand_autos_count)
brand_autos = autos[(autos["brand"] == brand) & (autos["gearbox"] == "automatik")]
brand_autos_count = brand_autos["brand"].value_counts()
brand_autos_automatik_count = brand_autos["brand"].value_counts()
mean_price = brand_autos["price_$"].mean()
mean_mileage = brand_autos["odometer_km"].mean()
brand_price_automatik[brand] = int(mean_price)
brand_odometer_automatik[brand] = int(mean_mileage)
brand_count_automatik[brand] = int(brand_autos_count)
bpa_series = pd.Series(brand_price_automatik)
bpm_series = pd.Series(brand_price_manuell)
boa_series = pd.Series(brand_odometer_automatik)
bom_series = pd.Series(brand_odometer_manuell)
bcm_series = pd.Series(brand_count_manuell)
bca_series = pd.Series(brand_count_automatik)
brand_summary = pd.DataFrame(bpa_series, columns=['mean_price_automatik'])
brand_summary["mean_odometer_automatik"] = boa_series
brand_summary["count_automatik"] = bca_series
brand_summary["mean_price_manuell"] = bpm_series
brand_summary["mean_odometer_manuell"] = bom_series
brand_summary["count_manuell"] = bcm_series
brand_summary
mean_price_automatik | mean_odometer_automatik | count_automatik | mean_price_manuell | mean_odometer_manuell | count_manuell | |
---|---|---|---|---|---|---|
volkswagen | 10489 | 118499 | 1296 | 4738 | 130328 | 8077 |
bmw | 12376 | 129085 | 1716 | 6313 | 134865 | 3275 |
opel | 3630 | 124960 | 511 | 2955 | 130059 | 4264 |
mercedes_benz | 11216 | 129301 | 2477 | 5474 | 133835 | 1881 |
audi | 13836 | 125135 | 1330 | 7341 | 131123 | 2560 |
ford | 8614 | 114089 | 258 | 3324 | 125373 | 2849 |
brand
or gearbox