The purpose of this project is importing, cleaning and exploratory analysis of used car listings from the German ebay website. Results are visualized to help easily understand and compare the data. Observations from the analysis are summarized at the end.
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib
# Read the input CSV data file into a dataframe
input_file = 'autos.csv'
autos = pd.read_csv(input_file, encoding='Latin-1' )
# Setup the NB to display plots inline
%matplotlib inline
Some options to look at the dataframe and get familiar with the dataset
def print_some_rows(csv_file, num_rows, enc='Latin-1'):
return pd.read_csv(csv_file, nrows=num_rows, encoding=enc)
print_some_rows(input_file, 8)
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
#autos
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# Details about the dataframe - number of rows and columns, names of columns and types of data they contain
autos.info()
autos.dtypes
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled object name object seller object offerType object price object abtest object vehicleType object yearOfRegistration int64 gearbox object powerPS int64 model object odometer object monthOfRegistration int64 fuelType object brand object notRepairedDamage object dateCreated object nrOfPictures int64 postalCode int64 lastSeen object dtype: object
There are 50,000 rows with 20 columns of datapoints which are mostly strings.
There are 5 numeric columns.
Some columns have missing values.
More observations on patterns and the need to clean:
Replace
yearOfRegistration
with registration_year
monthOfRegistration
with registration_month
notRepairedDamage
with unrepaired_damage
dateCreated
with ad_created
lower_case_with_underscores
snake formatDefine functions to make the two changes to them.
# Get column names using .columns attribute
print (autos.columns)
print (list(autos.columns))
# While this is useful for looping over in cleaning the column names, df.head() is also useful for a quick view of the names in a table format
autos.head(0)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object') ['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen']
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen |
---|
# Function that edits existing column names to be readable
def edit_cols(col):
col = col.replace('yearOfRegistration', 'registration_year')
col = col.replace('monthOfRegistration', 'registration_month')
col = col.replace('notRepairedDamage', 'unrepaired_damage')
col = col.replace('dateCreated', 'ad_created')
return col
# Implement the above function on the columns
ed_cols = []
for col in list(autos.columns):
ed_cols.append(edit_cols(col))
# Assign modified column names list back to the .columns attribute
autos.columns = ed_cols
print (autos.columns)
autos.head(0)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'registration_year', 'gearbox', 'powerPS', 'model', 'odometer', 'registration_month', 'fuelType', 'brand', 'unrepaired_damage', 'ad_created', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
dateCrawled | name | seller | offerType | price | abtest | vehicleType | registration_year | gearbox | powerPS | model | odometer | registration_month | fuelType | brand | unrepaired_damage | ad_created | nrOfPictures | postalCode | lastSeen |
---|
# Function that converts column names to python snake case
def to_snake(col):
snake = col[0].lower()
return (snake + ''.join( '_'+l.lower() if l.isupper() else l for l in col[1:]) )
# Implement the above function on the columns
snaked_cols = []
for c in autos.columns:
snaked_cols.append(to_snake(c))
# Assign modified column names list back to the .columns attribute
autos.columns = snaked_cols
# Column names after the modifications
autos.head(0)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen |
---|
# Summary stats of all columns - numeric and non-numeric
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-05 16:57:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
# Summary stats of just numeric columns to better explore how good data quality is
autos.describe()
registration_year | power_p_s | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 2005.073280 | 116.355920 | 5.723360 | 0.0 | 50813.627300 |
std | 105.712813 | 209.216627 | 3.711984 | 0.0 | 25779.747957 |
min | 1000.000000 | 0.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1999.000000 | 70.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2003.000000 | 105.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 2008.000000 | 150.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 9999.000000 | 17700.000000 | 12.000000 | 0.0 | 99998.000000 |
# Summary stats of non-numeric columns
autos.describe(include=['O'])
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | gearbox | model | odometer | fuel_type | brand | unrepaired_damage | ad_created | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 47320 | 47242 | 50000 | 45518 | 50000 | 40171 | 50000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | 2 | 245 | 13 | 7 | 40 | 2 | 76 | 39481 |
top | 2016-03-05 16:57:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | manuell | golf | 150,000km | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | 36993 | 4024 | 32424 | 30107 | 10687 | 35232 | 1946 | 8 |
registration_year
values 1000 and 9999 need further digging intoregistration_month
is 0 for over 5K data points. It needs to be looked into.power_in_ps
for some listings is 0. That means car is a dud or for scrap onlynr_of_pictures
has 0 for all 50K values, which is not useful for analysis and can be droppedprice
has non-digit characters ($ and punctuation - commas). Removing those characters can help converting it to numeric data typeprice
has values of 0 which is strange and needs further studyodometer
has non-digit characters (km and punctuation - commas), which can be removed and column data type converted to numeric as wellGet a closer look at the counts on numeric columns
autos['registration_month'].value_counts()
0 5075 3 5071 6 4368 5 4107 4 4102 7 3949 10 3651 12 3447 9 3389 11 3360 1 3282 8 3191 2 3008 Name: registration_month, dtype: int64
autos['power_p_s'].value_counts().head(10)
0 5500 75 3171 60 2195 150 2046 140 1884 101 1756 90 1746 116 1646 170 1492 105 1410 Name: power_p_s, dtype: int64
autos['nr_of_pictures'].value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
# Convert string columns to numeric data type
autos['price'] = autos['price'].str.replace('$','')
autos['price'] = autos['price'].str.replace(',','')
autos['odometer'] =autos['odometer'].str.replace('km','')
autos['odometer'] =autos['odometer'].str.replace(',','')
autos[ ['price', 'odometer'] ] = autos[ ['price', 'odometer'] ].astype(int)
# Rename odometer column to odometer_km, since, km is a useful unit, for information
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
# dataframe.dtypes will help verify the changed data types and column names
autos.dtypes
date_crawled object name object seller object offer_type object price int64 abtest object vehicle_type object registration_year int64 gearbox object power_p_s int64 model object odometer_km int64 registration_month int64 fuel_type object brand object unrepaired_damage object ad_created object nr_of_pictures int64 postal_code int64 last_seen object dtype: object
autos.isnull().sum()
date_crawled 0 name 0 seller 0 offer_type 0 price 0 abtest 0 vehicle_type 5095 registration_year 0 gearbox 2680 power_p_s 0 model 2758 odometer_km 0 registration_month 0 fuel_type 4482 brand 0 unrepaired_damage 9829 ad_created 0 nr_of_pictures 0 postal_code 0 last_seen 0 dtype: int64
As observed earlier, some columns have null values, however, not more than about 20%
price
and odometer_km
¶print (autos['price'].unique().shape)
print (autos['odometer_km'].unique().shape)
(2357,) (13,)
autos['odometer_km'].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 100000, 40000])
autos.odometer_km.value_counts()
150000 25936 125000 4441 100000 1848 90000 1522 80000 1279 70000 1087 60000 1043 50000 932 40000 754 30000 707 20000 673 5000 469 10000 202 Name: odometer_km, dtype: int64
autos.odometer_km.plot.box(vert=False,grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7efeb11e2e48>
autos.odometer_km.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7efeb11c73c8>
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
Listings range from low-mileage cars (5000 km) to high-mileage ones (150000 km)
Most cars' mileage is on the higher side - over 120000KM. Visualizing it above indicates that the maximum, median value and 3rd quartile are all coinciding at 150K. It also shows that there is a wide range in the values of mileage (i.e more spread out data) in the under 100K area and about 50% of the listings are close together in the higher range.
When we split the odometer_km
into bins later on in the analysis, it will be clearer which bins most mileages fall into.
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
As found earlier, price
column has some outliers - minimum of 0 and maximum of ~100M dollars. Let's study the price list further
pd.DataFrame(autos['price'].value_counts(dropna=False).head(10))
price | |
---|---|
0 | 1421 |
500 | 781 |
1500 | 734 |
2500 | 643 |
1000 | 639 |
1200 | 639 |
600 | 531 |
800 | 498 |
3500 | 498 |
2000 | 460 |
The value_counts index above (which is actually the price) is in ascending order, so, by reversing the order of the index, let's see what the price distribution looks like on the higher end.
pd.DataFrame(autos['price'].value_counts().sort_index(ascending=False).head(20))
price | |
---|---|
99999999 | 1 |
27322222 | 1 |
12345678 | 3 |
11111111 | 2 |
10000000 | 1 |
3890000 | 1 |
1300000 | 1 |
1234566 | 1 |
999999 | 2 |
999990 | 1 |
350000 | 1 |
345000 | 1 |
299000 | 1 |
295000 | 1 |
265000 | 1 |
259000 | 1 |
250000 | 1 |
220000 | 1 |
198000 | 1 |
197000 | 1 |
There are no cars in range 350000 < price
< 999990
A used car has nearly 100M price tag - find what that is (i.e row with the MAX value of price
)
pd.DataFrame(autos.loc[autos['price'].idxmax()])
39705 | |
---|---|
date_crawled | 2016-03-22 14:58:27 |
name | Tausch_gegen_gleichwertiges |
seller | privat |
offer_type | Angebot |
price | 99999999 |
abtest | control |
vehicle_type | limousine |
registration_year | 1999 |
gearbox | automatik |
power_p_s | 224 |
model | s_klasse |
odometer_km | 150000 |
registration_month | 9 |
fuel_type | benzin |
brand | mercedes_benz |
unrepaired_damage | NaN |
ad_created | 2016-03-22 00:00:00 |
nr_of_pictures | 0 |
postal_code | 73525 |
last_seen | 2016-04-06 05:15:30 |
It's an intriguing listing name (when translated to english) - Exchange for something of the same value.
It's also unrealistic a limo with 150K miles would be worth a 100M dollars.
Let's see how many and of what type used cars are in millions of dollars
millions = autos['price'] > 999999
car_is = autos[millions]
pd.DataFrame(car_is[ ['vehicle_type', 'model','odometer_km', 'brand', 'price' ] ])
vehicle_type | model | odometer_km | brand | price | |
---|---|---|---|---|---|
2897 | limousine | escort | 50000 | ford | 11111111 |
7814 | coupe | NaN | 50000 | sonstige_autos | 1300000 |
11137 | coupe | NaN | 100000 | sonstige_autos | 10000000 |
22947 | kombi | NaN | 150000 | bmw | 1234566 |
24384 | NaN | NaN | 150000 | volkswagen | 11111111 |
27371 | NaN | punto | 150000 | fiat | 12345678 |
39377 | NaN | v40 | 150000 | volvo | 12345678 |
39705 | limousine | s_klasse | 150000 | mercedes_benz | 99999999 |
42221 | limousine | c4 | 40000 | citroen | 27322222 |
47598 | limousine | vectra | 150000 | opel | 12345678 |
47634 | coupe | NaN | 5000 | sonstige_autos | 3890000 |
Even millions price range does not seem to contain any of the expensive brands like Bugatti or Lamborghini, which could sell for that much in used condition.
There are many that are missing vehicle type or model. A buyer would need that information before bidding.
Therefore, it's reasonable to drop these and explore data points upto 1M.
# Look at the distribution when the price is between 500 and 1M.
upto_1mil = autos['price'].between(500,999999)
autos_1mil = autos[upto_1mil]
autos_1mil['price'].describe()
count 45100.000000 mean 6386.757738 std 12306.631453 min 500.000000 25% 1500.000000 50% 3500.000000 75% 7900.000000 max 999999.000000 Name: price, dtype: float64
Checking distribution again using an upper price limit of 350K, not many data points are lost compared to 1M, because there are no listings between 350K and 1M.
Also, $6K is more realistic for average price of a used car.
# Clean the data so that prices are between 500 and 350K. Look at the distribution
upto_350k = autos['price'].between(500,350000)
autos_350k = autos[upto_350k]
print (autos_350k['price'].describe())
autos = autos_350k
count 45097.000000 mean 6320.659600 std 9261.841444 min 500.000000 25% 1500.000000 50% 3500.000000 75% 7900.000000 max 350000.000000 Name: price, dtype: float64
# Lines left after cleaning
autos.shape
(45097, 20)
These columns are string values right now
date_crawled
ad_created
last_seen
They can be better understood and useful for analysis when converted to numeric type
autos[['date_crawled','ad_created','last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
These columns have non-numeric values so, Series.describe() gets us different stats like so:
autos['date_crawled'].str[:10].describe()
count 45097 unique 34 top 2016-04-03 freq 1751 Name: date_crawled, dtype: object
# Look at the distribution of values in percentages
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025567 2016-03-06 0.014125 2016-03-07 0.036189 2016-03-08 0.033173 2016-03-09 0.032907 2016-03-10 0.032707 2016-03-11 0.033018 2016-03-12 0.037320 2016-03-13 0.015522 2016-03-14 0.036300 2016-03-15 0.034016 2016-03-16 0.029359 2016-03-17 0.031155 2016-03-18 0.012883 2016-03-19 0.034747 2016-03-20 0.038073 2016-03-21 0.037741 2016-03-22 0.033018 2016-03-23 0.032397 2016-03-24 0.028982 2016-03-25 0.031089 2016-03-26 0.032641 2016-03-27 0.031177 2016-03-28 0.034836 2016-03-29 0.033262 2016-03-30 0.033328 2016-03-31 0.031665 2016-04-01 0.033905 2016-04-02 0.035767 2016-04-03 0.038827 2016-04-04 0.036610 2016-04-05 0.013172 2016-04-06 0.003171 2016-04-07 0.001353 Name: date_crawled, dtype: float64
autos['ad_created'].str[:10].describe()
count 45097 unique 76 top 2016-04-03 freq 1761 Name: ad_created, dtype: object
Ads are created starting 2015-06-11. Most listings occured on 2016-04-03
# Look at the distribution of values in percentages
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(20)
2015-06-11 0.000022 2015-08-10 0.000022 2015-09-09 0.000022 2015-11-10 0.000022 2015-12-05 0.000022 2015-12-30 0.000022 2016-01-03 0.000022 2016-01-07 0.000022 2016-01-10 0.000044 2016-01-13 0.000022 2016-01-14 0.000022 2016-01-16 0.000022 2016-01-22 0.000022 2016-01-27 0.000067 2016-01-29 0.000022 2016-02-01 0.000022 2016-02-02 0.000044 2016-02-05 0.000044 2016-02-07 0.000022 2016-02-08 0.000022 Name: ad_created, dtype: float64
date_crawled
and last_seen
are the same dates
autos['last_seen'].str[:10].describe()
count 45097 unique 34 top 2016-04-06 freq 10161 Name: last_seen, dtype: object
# Look at the distribution of values in percentages
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001087 2016-03-06 0.004169 2016-03-07 0.005211 2016-03-08 0.007007 2016-03-09 0.009468 2016-03-10 0.010289 2016-03-11 0.012041 2016-03-12 0.023904 2016-03-13 0.008870 2016-03-14 0.012285 2016-03-15 0.015677 2016-03-16 0.016165 2016-03-17 0.027674 2016-03-18 0.007406 2016-03-19 0.015411 2016-03-20 0.020423 2016-03-21 0.020667 2016-03-22 0.021243 2016-03-23 0.018405 2016-03-24 0.019536 2016-03-25 0.018582 2016-03-26 0.016476 2016-03-27 0.015456 2016-03-28 0.020534 2016-03-29 0.021354 2016-03-30 0.024148 2016-03-31 0.023438 2016-04-01 0.022862 2016-04-02 0.024880 2016-04-03 0.024946 2016-04-04 0.024303 2016-04-05 0.126616 2016-04-06 0.225314 2016-04-07 0.134155 Name: last_seen, dtype: float64
registration_year
and registration_month
¶autos['registration_year'].describe()
count 45097.000000 mean 2005.064173 std 89.652017 min 1000.000000 25% 2000.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Registration years of 1000 or 9999 do not make sense and are irrelevant to analysis.
The ad was first crawled starting in 2016 so, the latest valid registration year could not be after 2016.
For lower limit on the registration year, I would utilize publicly available knowledge about when first cars came out.
Cars became widely available in the early 20th century, even though there were some first ones in late 1800s.
Hence, selecting 1920 for lower limit.
year_bool = autos['registration_year'].between(1000, 1920)
autos.loc[year_bool, ['registration_year', 'brand', 'price']]
registration_year | brand | price | |
---|---|---|---|
22316 | 1000 | volkswagen | 1500 |
22659 | 1910 | opel | 500 |
28693 | 1910 | renault | 599 |
49283 | 1001 | citroen | 7750 |
Based on 1920 for lower cutoff, we won't lose much of valid data
include_reg_year = autos[autos['registration_year'].between(1920,2016)]
include_reg_year.shape
(43321, 20)
include_reg_year['registration_year'].describe()
count 43321.000000 mean 2003.231574 std 7.046978 min 1927.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
Looking at the percentages (instead of counts) of the unique registration year values, distribution prior to 1987 needs some looking into whether we can drop them.
include_reg_year['registration_year'].value_counts(normalize=True, dropna=False).sort_index()
1927 0.000023 1929 0.000023 1931 0.000023 1934 0.000046 1937 0.000092 1938 0.000023 1939 0.000023 1941 0.000046 1943 0.000023 1948 0.000023 1950 0.000023 1951 0.000046 1952 0.000023 1953 0.000023 1954 0.000046 1955 0.000046 1956 0.000092 1957 0.000046 1958 0.000092 1959 0.000139 1960 0.000439 1961 0.000139 1962 0.000092 1963 0.000185 1964 0.000254 1965 0.000392 1966 0.000485 1967 0.000600 1968 0.000600 1969 0.000439 ... 1987 0.001570 1988 0.002978 1989 0.003578 1990 0.006325 1991 0.006763 1992 0.007040 1993 0.007848 1994 0.011542 1995 0.019852 1996 0.024746 1997 0.034902 1998 0.046767 1999 0.059371 2000 0.062672 2001 0.058170 2002 0.055839 2003 0.061333 2004 0.061933 2005 0.066111 2006 0.061541 2007 0.052400 2008 0.050922 2009 0.047967 2010 0.036610 2011 0.037326 2012 0.030170 2013 0.018398 2014 0.015050 2015 0.008402 2016 0.021929 Name: registration_year, Length: 77, dtype: float64
include_reg_year2 = include_reg_year[include_reg_year['registration_year'].between(1969,2016)]
include_reg_year2.shape
(43143, 20)
include_reg_year2['registration_year'].describe()
count 43143.000000 mean 2003.404492 std 6.502568 min 1969.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
include_reg_year2['registration_year'].value_counts(normalize=True, dropna=False).sort_index(ascending=False)
2016 0.022020 2015 0.008437 2014 0.015113 2013 0.018473 2012 0.030295 2011 0.037480 2010 0.036761 2009 0.048165 2008 0.051132 2007 0.052616 2006 0.061794 2005 0.066384 2004 0.062189 2003 0.061586 2002 0.056069 2001 0.058410 2000 0.062930 1999 0.059616 1998 0.046960 1997 0.035046 1996 0.024848 1995 0.019934 1994 0.011589 1993 0.007881 1992 0.007070 1991 0.006791 1990 0.006351 1989 0.003593 1988 0.002990 1987 0.001576 1986 0.001507 1985 0.001970 1984 0.001136 1983 0.001159 1982 0.000950 1981 0.000626 1980 0.001762 1979 0.000788 1978 0.000974 1977 0.000510 1976 0.000487 1975 0.000417 1974 0.000556 1973 0.000533 1972 0.000719 1971 0.000579 1970 0.000788 1969 0.000440 Name: registration_year, dtype: float64
include_reg_year2['registration_year'].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head(23).sum()
0.9478478548084279
Distribution of the data for years of registration 1994-2016, with 94% of listings in that range, indicates it's reasonable to keep that data for analysis.
include_1994_2016 = include_reg_year2[include_reg_year2['registration_year'].between(1994,2016)]
include_1994_2016.shape
(40893, 20)
include_1994_2016['registration_year'].describe()
count 40893.000000 mean 2004.287996 std 5.248485 min 1994.000000 25% 2000.000000 50% 2004.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
include_1994_2016['registration_year'].value_counts(normalize=True, dropna=False).sort_index()
1994 0.012227 1995 0.021030 1996 0.026215 1997 0.036975 1998 0.049544 1999 0.062896 2000 0.066393 2001 0.061624 2002 0.059154 2003 0.064974 2004 0.065610 2005 0.070036 2006 0.065195 2007 0.055511 2008 0.053946 2009 0.050816 2010 0.038784 2011 0.039542 2012 0.031961 2013 0.019490 2014 0.015944 2015 0.008901 2016 0.023231 Name: registration_year, dtype: float64
autos = include_1994_2016
# Taking stock of cleaned data
print (autos.shape)
autos.head()
(40893, 20)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
brand
column¶Understand the mean prices across brands
# List and number of all the unique brands
brand_list = autos["brand"].unique()
print (len(brand_list))
print (brand_list)
39 ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'renault' 'audi' 'mazda' 'porsche' 'mini' 'mercedes_benz' 'seat' 'toyota' 'dacia' 'nissan' 'opel' 'saab' 'volvo' 'jaguar' 'fiat' 'skoda' 'subaru' 'sonstige_autos' 'kia' 'citroen' 'mitsubishi' 'chevrolet' 'hyundai' 'honda' 'daewoo' 'suzuki' 'land_rover' 'jeep' 'alfa_romeo' 'rover' 'daihatsu' 'lancia' 'lada']
counts = pd.DataFrame(autos['brand'].value_counts())
counts
brand | |
---|---|
volkswagen | 8509 |
bmw | 4744 |
opel | 4208 |
mercedes_benz | 3996 |
audi | 3746 |
ford | 2713 |
renault | 1881 |
peugeot | 1258 |
fiat | 976 |
seat | 768 |
skoda | 739 |
smart | 654 |
mazda | 630 |
nissan | 627 |
citroen | 595 |
toyota | 556 |
hyundai | 440 |
mini | 402 |
volvo | 361 |
mitsubishi | 324 |
kia | 318 |
honda | 307 |
alfa_romeo | 264 |
sonstige_autos | 264 |
suzuki | 243 |
porsche | 228 |
chevrolet | 217 |
chrysler | 146 |
dacia | 123 |
jeep | 97 |
land_rover | 93 |
daihatsu | 92 |
subaru | 78 |
saab | 66 |
jaguar | 62 |
daewoo | 58 |
rover | 53 |
lancia | 37 |
lada | 20 |
I am selecting brands that have atleast 100 listings, including Porsche in the mix makes it interesting to analyze mean prices
#counts[counts.brand > 100] # Note: counts.brand is same as counts['brand']
brands_gt_100 = counts[counts.brand > 100].index
print (brands_gt_100)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'smart', 'mazda', 'nissan', 'citroen', 'toyota', 'hyundai', 'mini', 'volvo', 'mitsubishi', 'kia', 'honda', 'alfa_romeo', 'sonstige_autos', 'suzuki', 'porsche', 'chevrolet', 'chrysler', 'dacia'], dtype='object')
# Aggregate data by `brand` column
# Assign brands and their mean prices as key-value pairs to a dictionary
brand_price = {}
for b in brands_gt_100:
# Select only rows that correspond to a specific brand
brand_rows = autos[autos["brand"] == b]
# Calculate the mean price for those rows
mean_price = brand_rows["price"].mean()
brand_price[b] = mean_price
print (len(brand_price))
print (brand_price)
29 {'mini': 10742.965174129353, 'fiat': 3159.2715163934427, 'hyundai': 5686.940909090909, 'chevrolet': 6307.451612903225, 'ford': 3970.638039071139, 'porsche': 49661.149122807015, 'renault': 2762.946836788942, 'sonstige_autos': 14265.818181818182, 'bmw': 8787.180227655987, 'mitsubishi': 3929.061728395062, 'chrysler': 3632.0, 'mercedes_benz': 8956.977477477478, 'volkswagen': 5941.851686449641, 'smart': 3614.0428134556573, 'dacia': 5915.528455284553, 'mazda': 4459.720634920635, 'audi': 9946.883609183129, 'opel': 3395.6494771863117, 'citroen': 3890.435294117647, 'kia': 6196.229559748428, 'nissan': 5248.81658692185, 'toyota': 5328.836330935252, 'suzuki': 4559.061728395061, 'alfa_romeo': 3643.2083333333335, 'honda': 4513.166123778502, 'peugeot': 3374.4252782193958, 'seat': 4845.5546875, 'volvo': 5318.81163434903, 'skoda': 6577.7997293640055}
# Construct a dataframe from this dictionary
brand_price_df = pd.DataFrame(list(brand_price.items()),columns = ['brand','mean_price'])
print (type(brand_price_df))
# Sort the mean_price column from highest to lowest, to find its corresponding brand
brand_price_df = brand_price_df.sort_values('mean_price', ascending=False)
brand_price_df
<class 'pandas.core.frame.DataFrame'>
brand | mean_price | |
---|---|---|
5 | porsche | 49661.149123 |
7 | sonstige_autos | 14265.818182 |
0 | mini | 10742.965174 |
16 | audi | 9946.883609 |
11 | mercedes_benz | 8956.977477 |
8 | bmw | 8787.180228 |
28 | skoda | 6577.799729 |
3 | chevrolet | 6307.451613 |
19 | kia | 6196.229560 |
12 | volkswagen | 5941.851686 |
14 | dacia | 5915.528455 |
2 | hyundai | 5686.940909 |
21 | toyota | 5328.836331 |
27 | volvo | 5318.811634 |
20 | nissan | 5248.816587 |
26 | seat | 4845.554688 |
22 | suzuki | 4559.061728 |
24 | honda | 4513.166124 |
15 | mazda | 4459.720635 |
4 | ford | 3970.638039 |
9 | mitsubishi | 3929.061728 |
18 | citroen | 3890.435294 |
23 | alfa_romeo | 3643.208333 |
10 | chrysler | 3632.000000 |
13 | smart | 3614.042813 |
17 | opel | 3395.649477 |
25 | peugeot | 3374.425278 |
1 | fiat | 3159.271516 |
6 | renault | 2762.946837 |
brand_price_df.describe()
mean_price | |
---|---|
count | 29.000000 |
mean | 7194.221476 |
std | 8577.023981 |
min | 2762.946837 |
25% | 3890.435294 |
50% | 5248.816587 |
75% | 6307.451613 |
max | 49661.149123 |
# Top 6 brands by listings
counts.head(6)
brand | |
---|---|
volkswagen | 8509 |
bmw | 4744 |
opel | 4208 |
mercedes_benz | 3996 |
audi | 3746 |
ford | 2713 |
# Brands by mean price
brand_price_df.head(6)
brand | mean_price | |
---|---|---|
5 | porsche | 49661.149123 |
7 | sonstige_autos | 14265.818182 |
0 | mini | 10742.965174 |
16 | audi | 9946.883609 |
11 | mercedes_benz | 8956.977477 |
8 | bmw | 8787.180228 |
counts.head(6).plot.barh(title='Top 6 popular brands', legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7efeacfb05c0>
brand_price_df.head(6).plot.barh(x='brand',y='mean_price',title='Top 6 brands by mean price', legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7efeacfb0d30>
We may infer that there are so many Volkswagen listings because they can sell easily due to their lower average price.
However, it would be interesting to see what the average mileage on it is. That tells us if it's not a well liked car and people want to dump quickly or it's so popular that despite high mileage, people expect it to be an easy sale.
With mean mileage and mean price for each of the top brands, understand if there's any visible link between them
# Aggregate data by `odometer_km' column
# Assign brands and their mean mileages as key-value pairs to a dictionary
brand_miles = {}
for b in list(counts.head(6).index):
# Select only rows that correspond to a specific brand
brand_rows = autos[autos["brand"] == b]
# Calculate the mean mileage for those rows
mean_mileage = round(brand_rows["odometer_km"].mean(),2)
brand_miles[b] = mean_mileage
print (len(brand_miles))
print (brand_miles)
6 {'bmw': 132718.17, 'opel': 128688.21, 'mercedes_benz': 130481.73, 'volkswagen': 128295.33, 'ford': 124699.59, 'audi': 128183.4}
# Convert `brand_miles` dictionary to a series object; don't sort values
brand_miles_series = pd.Series(brand_miles)
print (brand_miles_series)
audi 128183.40 bmw 132718.17 ford 124699.59 mercedes_benz 130481.73 opel 128688.21 volkswagen 128295.33 dtype: float64
# Create a dataframe from the series brand_miles_series
brand_miles_df = pd.DataFrame(brand_miles_series,columns = ['avg_miles'])
brand_miles_df
avg_miles | |
---|---|
audi | 128183.40 |
bmw | 132718.17 |
ford | 124699.59 |
mercedes_benz | 130481.73 |
opel | 128688.21 |
volkswagen | 128295.33 |
# Similarly, calculate the average price for the top 6 brands, using aggregation
brand_avg_pr = {}
for b in list(counts.head(6).index):
# Select only rows that correspond to a specific brand
brand_rows = autos[autos["brand"] == b]
# Calculate the mean price for those rows
avg_pr = round(brand_rows["price"].mean(),2)
# Assign the mean price to the dictionary brand_avg_pr, using the brand name as the key
brand_avg_pr[b] = avg_pr
print (len(brand_avg_pr))
print (brand_avg_pr)
6 {'bmw': 8787.18, 'opel': 3395.65, 'mercedes_benz': 8956.98, 'volkswagen': 5941.85, 'ford': 3970.64, 'audi': 9946.88}
# Convert `brand_avg_pr` dictionary to a Series object
brand_avgpr_series = pd.Series(brand_avg_pr)
print (type(brand_avgpr_series))
brand_avgpr_series
<class 'pandas.core.series.Series'>
audi 9946.88 bmw 8787.18 ford 3970.64 mercedes_benz 8956.98 opel 3395.65 volkswagen 5941.85 dtype: float64
# Add the Series object as a new column named `mean_price` to the dataframe `brand_miles_df`
brand_miles_df['mean_price'] = brand_avgpr_series
brand_miles_df
avg_miles | mean_price | |
---|---|---|
audi | 128183.40 | 9946.88 |
bmw | 132718.17 | 8787.18 |
ford | 124699.59 | 3970.64 |
mercedes_benz | 130481.73 | 8956.98 |
opel | 128688.21 | 3395.65 |
volkswagen | 128295.33 | 5941.85 |
# Sort mileage values for seeing any connection to mean price values
brand_miles_df.sort_values('avg_miles', ascending=False)
avg_miles | mean_price | |
---|---|---|
bmw | 132718.17 | 8787.18 |
mercedes_benz | 130481.73 | 8956.98 |
opel | 128688.21 | 3395.65 |
volkswagen | 128295.33 | 5941.85 |
audi | 128183.40 | 9946.88 |
ford | 124699.59 | 3970.64 |
brand_miles_df.plot.scatter(x='avg_miles',y='mean_price',title='Impact of mileage on avg.price of car')
<matplotlib.axes._subplots.AxesSubplot at 0x7efeb10f2a90>
For the high-end brands Audi, BMW and Mercedes(in the top right of the scatter plot), as mileage increases, the average price clearly decreases. Mileage is a indeed factor when it comes to price, even among high-end brands.
Ford and Opel (in the bottom left) are much less expensive used cars unlike German brands. However, as mileage goes up, their price goes down consistent with the trend seen with the expensive brands.
Volkswagen with average mileage similar to Opel and higher than Ford still costs more than both. That could be because it's just a popular brand. It's just not in the league of expensive brands, nor comparable to the inexpensive ones.
There are text columns where all or almost all values are the same or unique values are 1 or very few
nr_of_pictures
as identified earlier contains all values = 0
seller
and offer_type
seem to have very few unique valuesLet's see
print (autos[['seller','offer_type',]].dtypes)
autos.shape
seller object offer_type object dtype: object
(40893, 20)
num_uniq_seller = len(autos['seller'].unique())
num_uniq_offer = len(autos['offer_type'].unique())
print (num_uniq_seller)
print (num_uniq_offer)
1 1
print (autos['seller'].value_counts())
print (autos['offer_type'].value_counts())
privat 40893 Name: seller, dtype: int64 Angebot 40893 Name: offer_type, dtype: int64
#Drop them 3 columns
autos = autos.drop(['seller','offer_type','nr_of_pictures'],axis=1)
autos.shape
(40893, 17)
These 3 columns have categorical data that needs translation from German
gearbox
unrepaired_damage
fuel_type
autos['gearbox'].value_counts()
manuell 30283 automatik 9184 Name: gearbox, dtype: int64
autos['unrepaired_damage'].value_counts()
nein 31417 ja 3336 Name: unrepaired_damage, dtype: int64
autos['fuel_type'].value_counts()
benzin 24215 diesel 13685 lpg 606 cng 64 hybrid 36 elektro 18 andere 8 Name: fuel_type, dtype: int64
autos['gearbox'] = autos['gearbox'].str.replace('manuell','manual')
autos['gearbox'] = autos['gearbox'].str.replace('automatik','automatic')
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein','no')
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('ja','yes')
autos['fuel_type'] = autos['fuel_type'].str.replace('benzin', 'gasoline')
autos['fuel_type'] = autos['fuel_type'].str.replace('elektro', 'electric')
autos['fuel_type'] = autos['fuel_type'].str.replace('andere', 'other')
print(autos['gearbox'].value_counts())
print ('\n')
print(autos['unrepaired_damage'].value_counts())
print ('\n')
print(autos['fuel_type'].value_counts())
manual 30283 automatic 9184 Name: gearbox, dtype: int64 no 31417 yes 3336 Name: unrepaired_damage, dtype: int64 gasoline 24215 diesel 13685 lpg 606 cng 64 hybrid 36 electric 18 other 8 Name: fuel_type, dtype: int64
name
column worthy of extracting as new columnsautos[ ['name', 'brand','model']].tail(50)
name | brand | model | |
---|---|---|---|
49939 | Audi_TT_Roadster_2.0_TFSI | audi | tt |
49940 | Omas_Liebling | bmw | 3er |
49941 | Maserati_Ghibli_Diesel_Automatik | sonstige_autos | NaN |
49942 | Audi_A3_2.0_TDI_Sportback_DPF_Ambition | audi | a3 |
49944 | SAAB_9_3__2_0t_SE__Automatik__Tempomat__Regens... | saab | andere |
49945 | omega_2_2_facelift_mit_neuem_tuev | opel | omega |
49947 | Audi_A5_2.7_TDI_+S_Line_Vollausstattung/ABT/B&... | audi | a5 |
49948 | Hyundai_ix35_2.0_CRDi_4WD_Automatik | hyundai | i_reihe |
49950 | Volvo_V70_2_II__2002__AHK__Klima__Volleder | volvo | NaN |
49951 | Opel_Corsa_D1.0_weiss_/Scheckheftgepflegt/_Unf... | opel | corsa |
49952 | VW_POLO__9N3_Silber_1_2_Ltr._5_tuerer__96_tsd.... | volkswagen | polo |
49954 | 328_Cabrio_Schalter | bmw | NaN |
49955 | Golf_1.9_TDI_DPF_4mot_GT_Sport | volkswagen | golf |
49956 | Toyota_Yaris_1.3_VVT_i_Executive_MODEL_2007 | toyota | yaris |
49957 | Bmw_mit_LPG__super_zustand_2_Jahre_Tuev_klima | bmw | 3er |
49958 | Golf_3_2.8_vr6_highline | volkswagen | golf |
49959 | Mercedes_Benz_C_180_T_Kompressor_Classic | mercedes_benz | c_klasse |
49961 | Golf_2.0_TDI_Sportline_Sonderproduktion | volkswagen | golf |
49962 | Mitsubishi_Space_Star_1__3_L__Bj_2004_Standhei... | mitsubishi | andere |
49963 | Mercedes_Benz_B_200_CDI_Special_Edition | mercedes_benz | b_klasse |
49964 | Audi_2_7_TDI_AVANT | audi | a4 |
49965 | Opel_Astra_1.6_Lenkradheizung~Sitzheizung | opel | astra |
49966 | Citroën_C1_1.0_**Euro4**TÜV_OKT_2017**Scheiten... | citroen | c1 |
49967 | VW_Passat_2_0_TDI_comfortline | volkswagen | passat |
49969 | Nissan_X_Trail_2.2_dCi_4x4_Sport_m.AHZ | nissan | x_trail |
49970 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | citroen | c4 |
49971 | W.Lupo_1.0 | volkswagen | lupo |
49972 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | mercedes_benz | vito |
49973 | Mercedes_Benz_SLK_200_Kompressor | mercedes_benz | slk |
49975 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | honda | jazz |
49977 | Mercedes_Benz_C200_Cdi_W203 | mercedes_benz | c_klasse |
49978 | Mercedes_Benz_E_200_Classic | mercedes_benz | e_klasse |
49979 | Volkswagen_Polo_1.6_TDI_Style | volkswagen | polo |
49981 | Opel_Astra_Kombi_mit_Anhaengerkupplung | opel | astra |
49982 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | skoda | fabia |
49983 | Ford_focus_99 | ford | focus |
49985 | Verkaufe_meinen_vw_vento! | volkswagen | NaN |
49986 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | chrysler | 300c |
49987 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | audi | a3 |
49988 | BMW_330_Ci | bmw | 3er |
49990 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | mercedes_benz | a_klasse |
49991 | Kleinwagen | renault | twingo |
49992 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | fiat | andere |
49993 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | audi | NaN |
49994 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | audi | a6 |
49995 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | audi | q5 |
49996 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | opel | astra |
49997 | Fiat_500_C_1.2_Dualogic_Lounge | fiat | 500 |
49998 | Audi_A3_2.0_TDI_Sportback_Ambition | audi | a3 |
49999 | Opel_Vectra_1.6_16V | opel | vectra |
_
indicate the brand
and model
of the car respectively. However, those columns already exist so, it's not worth extracting them as new columnsOmas_Liebling
or W.Lupo_1.0
)single_names =[]
for row in list(autos['name']):
if '_' not in row:
single_names.append(row)
pd.DataFrame(single_names)
0 | |
---|---|
0 | 017668128439 |
1 | Anfàngerauto |
2 | TOOPPPWAGENDIESEL |
3 | Schnaeppchen...... |
4 | Schnaeppchen |
5 | STTOOOPPP!!!! |
6 | Privatanbieter |
7 | *RENAULT*MEGANE*1.5dCi*GRANDTOUR*EMOTION*1HAND... |
8 | 0178/8055184 |
9 | Schlachtfest |
10 | 0178/8055184 |
11 | Finanzierunguebernahme |
12 | Volswagen |
13 | Reddbeat1976 |
14 | ~.A.U.D.I.~A6~3.0.TDI~QUATTRO~BOSE~ |
15 | wolsfagen |
16 | Ausschlachten |
17 | Gebrauchtwagen |
18 | 74828364829187482929 |
19 | Fahrberrit |
20 | Unfallfahrzeug |
21 | Mazda.121 |
22 | Unfallauto |
23 | Polo...... |
24 | Autoverkauf |
25 | Smart |
26 | Notverkauf |
27 | Motorschaden |
28 | Smart.klima |
29 | Hobbyaufgabe |
... | ... |
71 | Klima/5Tuerer/Alufelgen |
72 | 015780886026 |
73 | Gelaendewagen |
74 | Youngtimer |
75 | Notverkauf |
76 | Compfortline |
77 | 0040728824593 |
78 | TOOPWAGENNNDIESEL |
79 | WaldemarFunk |
80 | Ggggggnffkfklff |
81 | BMW116"8"fach"bereift"Start"stop"Tausch""RESER... |
82 | Zuverkaufen |
83 | Autobeschreibung |
84 | BMW116"8"fach"bereift"Start"stop"Tausch"moegli... |
85 | Gebrauchtfahrzeug |
86 | Diesel..... |
87 | 01788177890 |
88 | TOPPP/ZUSTAND//DIESEL |
89 | Anfaengerauto |
90 | Schlachtfest |
91 | Anfaengerauto |
92 | Anfaengerauto |
93 | Gelegenheit |
94 | Fiat.Punto.1.2 |
95 | Motorschaden |
96 | Autoverkauf |
97 | Vvvvvvvvvv |
98 | Beschaeftigt |
99 | Unfallauto! |
100 | Kleinwagen |
101 rows × 1 columns
There are 101 names that don't convey anything meaningful or they need translation to English. Overall, there are no consistent key words in the name
column that we can extract as is into separate columns that will provide value to analysis. In fact, it's a good candidate to drop from the dataframe.
autos.drop(['name'], axis=1, inplace=True)
autos.shape
(40893, 16)
autos.shape
(40893, 16)
autos[['date_crawled', 'ad_created', 'last_seen']].dtypes
date_crawled object ad_created object last_seen object dtype: object
autos[['date_crawled', 'ad_created', 'last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['ad_created'] = autos['ad_created'].str[:10]
autos['last_seen'] = autos['last_seen'].str[:10]
autos['date_crawled'] = autos['date_crawled'].str.replace('-','')
autos['ad_created'] = autos['ad_created'].str.replace('-','')
autos['last_seen'] = autos['last_seen'].str.replace('-','')
# Convert to data type integer
autos[['date_crawled', 'ad_created', 'last_seen']] = autos[['date_crawled', 'ad_created', 'last_seen']].astype(int)
autos[['date_crawled', 'ad_created', 'last_seen']].dtypes
date_crawled int64 ad_created int64 last_seen int64 dtype: object
autos[['date_crawled', 'ad_created', 'last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 20160326 | 20160326 | 20160406 |
1 | 20160404 | 20160404 | 20160406 |
2 | 20160326 | 20160326 | 20160406 |
3 | 20160312 | 20160312 | 20160315 |
4 | 20160401 | 20160401 | 20160401 |
** What are some of the most common brand/model combinations?**
autos[['brand','model']].head()
brand | model | |
---|---|---|
0 | peugeot | andere |
1 | bmw | 7er |
2 | volkswagen | golf |
3 | smart | fortwo |
4 | ford | focus |
# Concatenate `brand`, `model` strings with a `_`
autos['brand_model'] = autos['brand']+'_'+autos['model']
brand_model_df = pd.DataFrame(autos['brand_model'].value_counts())
# The brand/model combinations that make the tope 10 most common listings
brand_model_df.head(10)
brand_model | |
---|---|
volkswagen_golf | 3134 |
bmw_3er | 2373 |
volkswagen_polo | 1291 |
volkswagen_passat | 1276 |
opel_corsa | 1254 |
opel_astra | 1210 |
audi_a4 | 1199 |
mercedes_benz_c_klasse | 1113 |
bmw_5er | 1074 |
mercedes_benz_e_klasse | 856 |
The most common cars by far are Volkswagen models.
How much cheaper are cars with damage than their non-damaged counterparts?
autos.head(0)
date_crawled | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | brand_model |
---|
autos.unrepaired_damage.value_counts()
no 31417 yes 3336 Name: unrepaired_damage, dtype: int64
autos['unrepaired_damage'].isnull().sum()
6140
cars_with_damage = autos[autos['unrepaired_damage'] == 'yes']
cars_no_damage = autos[autos['unrepaired_damage'] == 'no']
print (cars_with_damage.shape)
print (cars_no_damage.shape)
(3336, 17) (31417, 17)
# Calculate the average price for just this set of rows
damage_mean = cars_with_damage['price'].mean()
no_damage_mean = cars_no_damage['price'].mean()
print (damage_mean)
print (no_damage_mean)
2803.3744004796163 7336.2024700003185
Do average prices follow any patterns based on the mileage?
mileages = pd.DataFrame(autos['odometer_km'].unique())
print (mileages.max() - mileages.min())
mileages
0 145000 dtype: int64
0 | |
---|---|
0 | 150000 |
1 | 70000 |
2 | 50000 |
3 | 10000 |
4 | 30000 |
5 | 90000 |
6 | 125000 |
7 | 20000 |
8 | 60000 |
9 | 5000 |
10 | 40000 |
11 | 80000 |
12 | 100000 |
Split the odometer_km into bins like this: (Note that left bin edge is exclusive and right bin edge is inclusive)
[1000, 31000, 61000, 91000, 121000, 151000]
# Create a new column `odometer_km_bins` that sets the x argument to the `odometer_km` column and sets the bins argument to the list of bin edge values
# `odometer_km` values will get assigned to the proper bin
autos['odometer_km_bins'] = pd.cut(x=autos['odometer_km'], bins=[1000, 31000, 61000, 91000, 121000, 151000])
# Print out some rows with the new column and checkout the bins
autos[['price', 'odometer_km', 'odometer_km_bins']].head()
price | odometer_km | odometer_km_bins | |
---|---|---|---|
0 | 5000 | 150000 | (121000, 151000] |
1 | 8500 | 150000 | (121000, 151000] |
2 | 8990 | 70000 | (61000, 91000] |
3 | 4350 | 70000 | (61000, 91000] |
4 | 1350 | 150000 | (121000, 151000] |
autos['odometer_km_bins'].value_counts()
(121000, 151000] 30377 (61000, 91000] 3888 (31000, 61000] 2729 (1000, 31000] 2051 (91000, 121000] 1848 Name: odometer_km_bins, dtype: int64
# Unique mileage bins to loop over, using index labels
autos['odometer_km_bins'].value_counts().index
CategoricalIndex([(121000, 151000], (61000, 91000], (31000, 61000], (1000, 31000], (91000, 121000]], categories=[(1000, 31000], (31000, 61000], (61000, 91000], (91000, 121000], (121000, 151000]], ordered=True, dtype='category')
# Produce a dictionary of average price for each mileage bin
avg_price_by_odo_bin = {}
for b in autos['odometer_km_bins'].value_counts().index:
# select rows that correspond to each mileage bin
b_rows = autos[autos['odometer_km_bins'] == b]
# Calculate the average price for just those rows
avg_price = b_rows["price"].mean()
# Assign the avg price value to the dictionary, using the odometer_km_bins as the key
avg_price_by_odo_bin[b] = avg_price
print (avg_price_by_odo_bin)
{Interval(31000, 61000, closed='right'): 14044.41920117259, Interval(1000, 31000, closed='right'): 17594.77815699659, Interval(61000, 91000, closed='right'): 9906.484567901234, Interval(91000, 121000, closed='right'): 8237.034632034633, Interval(121000, 151000, closed='right'): 4443.746584587023}
# Create a dataframe from the dictionary
odo_price_df = pd.DataFrame(list(avg_price_by_odo_bin.items()),columns = ['odometer_km_groups','avg_price'])
# Sort column in descending order
odo_price_df.sort_values('odometer_km_groups', ascending=False)
odometer_km_groups | avg_price | |
---|---|---|
4 | (121000, 151000] | 4443.746585 |
3 | (91000, 121000] | 8237.034632 |
2 | (61000, 91000] | 9906.484568 |
0 | (31000, 61000] | 14044.419201 |
1 | (1000, 31000] | 17594.778157 |
autos.odometer_km.plot.hist()
These are the top 6 brands by listings:
Volkswagen
BMW
Opel
Mercedes-Benz
Audi
Ford
Some listings are priced at millions of dollars for coupes and Limousines. That aspect gives the site an impression that not every listing is realistic and buyers could be in for a surprise or a long bidding process
Top brands Audi, Mercedes Benz and BMW, all German, on average, are also among the highest priced cars
Ford and Opel are top brands but much cheaper and don't make the list of top brands by price
Porsche, also a German car, is the priciest used car at an average of 49661. It's understandable because it is world's #1 luxury brand and has over 200 listings on this site
Sonstige autos is a distant second priciest at 14265, followed by
Mini, possibly, because it's owned by BMW
Renault is the least expensive used car at 2762
The average used car price is 7194
Volkswagen is by far the most popular brand and 3 of its models golf, polo and passat are the most commonly listed. However, it's not among the priciest. Its price falls below that of the average used car
Almost 75% of the listings are high-mileage cars, having mileages in 121000 - 151000 KM range
It's also worth noting that damaged cars do not dominate the listings. 8% of the listings are with (unrepaired) damages while 76% are with no (unrepaired) damages
On an average, cars with unrepaired damages cost over 4500 less than their non-damaged counterparts. Knowing average repair costs for a brand would help a buyer decide if it's worth paying higher price for a car without damages or if the price difference
covers the damages for that brand so he or she can get it for cheaper
Visualizing the data for Impact of mileage on avg.price of car
(Scatter plot)corraborated what splitting mileage into separate bins showed - that the average prices increased steadily for lower mileage bins:
For the high-end brands Audi, BMW and Mercedes, as mileage increased, the average price clearly decreased. Mileage is a indeed factor when it comes to price, even among high-end brands
Ford and Opel are much less expensive used cars unlike German brands. However, as mileage went up, their price went down, consistent with the trend seen with the expensive brands.
Volkswagen with average mileage similar to Opel and higher than Ford still costs more than both. That could be because it's just a popular brand, even though it's not in the league of expensive brands or comparable to the inexpensive ones
We may also infer that there are so many Volkswagen listings because they can sell easily due to their lower average price