This is a Guided Project from one of the DataQuest.io courses. It is a data cleaning project in which we take a web-scraped dataset from the German eBay website, more specifically the eBay Kleinanzeigen classifieds section.
The data was altered by DataQuest to make the job a little more difficult; the original dataset was cleaner, to make analysis easier. It was uploaded to Kaggle by the user orgesleka but is not available there anymore. However, it can be found here.
We are going to clean the data and answer some questions that come up along the way.
The content of the data is in German, so one has to translate it first if one cannot speak German. The following fields are included:
The fields lastSeen and dateCrawled could be used to estimate how long a car will be at least online before it is sold.
The aim of this project is to clean the data and analyze the included used car listings.
# Import the libraries we will use
import numpy as np
import pandas as pd
# Read the CSV file, specifying the Latin-1 encoding
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
# view first 5 rows of dataset
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# view last 5 rows in dataset
autos.tail()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
Since everything has worked fine so far, we are going to take a closer look at the entries, looking for possible mistakes that could lead to a wrong analysis.
# show the information details about dataframe
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
The dataset contains 50,000 rows and 20 columns; a few columns have null values; 5 columns are of integer dtype and 15 of object dtype; and the column names use camel case.
We need to convert the column names from camel case to snake case and rewrite some of them to be more readable.
With the help of the dataset dictionary, we are going to change them to more descriptive snake_case names.
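For reference, the camel-to-snake conversion can also be done mechanically with a small regex helper. This is just a sketch; it won't reproduce every manual choice below (e.g. powerPS would become power_p_s rather than power_ps), which is why we assign the final list by hand.
import re

def camel_to_snake(name):
    # Insert an underscore before each capital letter that is not at the
    # start of the string, then lowercase the whole name
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

camel_to_snake("yearOfRegistration")  # -> 'year_of_registration'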
# Printing the autos dataframe columns
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Renaming a few columns with a dictionary, as the instructions suggest
autos.rename(columns = {"yearOfRegistration" : "registration_year",
"monthOfRegistration" : "registration_month",
"notRepairedDamage" : "unrepaired_damage",
"dateCreated" : "ad_created"},inplace = True)
# Checking the changed columns
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'registration_year', 'gearbox', 'powerPS', 'model', 'odometer', 'registration_month', 'fuelType', 'brand', 'unrepaired_damage', 'ad_created', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Changing the name to more meaningful snakecase ones
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created','n_of_pictures', 'postal_code',
'last_seen']
# Checking it
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'n_of_pictures', 'postal_code', 'last_seen'], dtype='object')
In this step we converted the column names from camel case to snake case (lowercase, with underscores) and gave some columns more meaningful names.
Now that we have fixed the first problems we saw with the column names, we are going to check the data once more for duplicates (or anything else that catches our attention).
# Print descriptive statistics for all columns
autos.describe(include = "all")
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the statistical description of the dataset, it is clear that the price and odometer columns are stored as text but should be numeric, so we have to convert them.
The n_of_pictures column contains only zeros, so it has no importance for further analysis.
We also have to look into the registration_year, registration_month and postal_code columns.
The power_ps column shows a suspiciously large gap between its minimum and maximum values.
On top of that, the seller and offer_type columns are almost constant, with only 2 unique values each.
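We can also confirm these observations programmatically; a quick sketch using the number of unique values per column (near-constant columns are candidates for dropping):
# Columns with very few unique values carry little information
autos.nunique().sort_values().head()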
# Let's see how many unique values the seller column has
autos["seller"].unique().shape
(2,)
# show the unique values
autos["seller"].unique()
array(['privat', 'gewerblich'], dtype=object)
# Let's see the number of unique values in the offer_type column
autos["offer_type"].unique().shape
(2,)
# show the unique values
autos["offer_type"].unique()
array(['Angebot', 'Gesuch'], dtype=object)
Translating these words, a "gewerblich" seller is a commercial seller, and a "Gesuch" offer type is a wanted/requested ad. Each of these values appears only once in the data (see the frequencies above), so neither the commercial seller nor the "Gesuch" offer is relevant to what we are looking for.
Now let's convert the price and odometer columns to numbers, and give the odometer column a better name.
# Convert the price column to float; regex=False treats "$" as a literal
# character rather than a regex end-of-string anchor
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
autos["price"].dtype
dtype('float64')
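As a side note, a more defensive version of this conversion (a sketch, not run here) strips every non-numeric character with a regex and coerces anything unparseable to NaN instead of raising an error:
# Keep only digits and the decimal point, then coerce bad entries to NaN
cleaned = autos["price"].astype(str).str.replace(r"[^\d.]", "", regex=True)
autos["price"] = pd.to_numeric(cleaned, errors="coerce")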
# Let's see how many unique values the price column has
autos["price"].unique().shape
(2357,)
# Let's see the statistical summary
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
# show the variation in values for the price column
autos["price"].value_counts()
# value_counts lists every value, so we chain head() to show only a few
autos["price"].value_counts().head()
# value_counts returns a Series, so we can chain Series.sort_index with
# ascending=True/False to view the highest or lowest values together with
# their counts, again using head() to truncate the output
autos["price"].value_counts().sort_index(ascending = False).head()
99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 Name: price, dtype: int64
# Show the highest and lowest value from price column
highest_value = autos["price"].value_counts().sort_index(ascending = False).head()
lowest_value = autos["price"].value_counts().sort_index(ascending = True).head()
print("higher price frequency:\n{} \n\nlower price frquency:\n{}"
.format(highest_value, lowest_value))
higher price frequency: 99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 Name: price, dtype: int64 lower price frquency: 0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 Name: price, dtype: int64
# Convert odometer column and rename column as odometer_km
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","").astype(float)
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)
# show the variation in values for the odometer column
autos["odometer_km"].value_counts()
# value_counts lists every value, so we chain head() to show only a few
autos["odometer_km"].value_counts().head()
# value_counts returns a Series, so we can chain Series.sort_index with
# ascending=True/False to view the highest or lowest values together with
# their counts, again using head() to truncate the output
print("Higher values ", autos["odometer_km"].value_counts().sort_index(ascending = False).head())
print("\n")
print("Lower values ", autos["odometer_km"].value_counts().sort_index(ascending = True).head())
Higher values 150000.0 32424 125000.0 5170 100000.0 2169 90000.0 1757 80000.0 1436 Name: odometer_km, dtype: int64 Lower values 5000.0 967 10000.0 264 20000.0 784 30000.0 789 40000.0 819 Name: odometer_km, dtype: int64
# Show the statistical details of odometer_km column
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos.isnull().sum()
date_crawled 0 name 0 seller 0 offer_type 0 price 0 ab_test 0 vehicle_type 5095 registration_year 0 gearbox 2680 power_ps 0 model 2758 odometer_km 0 registration_month 0 fuel_type 4482 brand 0 unrepaired_damage 9829 ad_created 0 n_of_pictures 0 postal_code 0 last_seen 0 dtype: int64
missing_value_columns = ["vehicle_type", "gearbox", "model", "fuel_type", "unrepaired_damage"]
for col in missing_value_columns:
pct_missing = np.mean(autos[col].isnull())
print('{} - {}%'.format(col, round(pct_missing*100)))
vehicle_type - 10.0% gearbox - 5.0% model - 6.0% fuel_type - 9.0% unrepaired_damage - 20.0%
The missing-value percentages are low (20% at most), so we could mostly neglect them; however, the unrepaired_damage column, which has the most missing values, can be imputed with its mode.
# fill the categorical column null values with mode
autos['unrepaired_damage'] = autos['unrepaired_damage'].fillna(autos['unrepaired_damage'].mode()[0])
# Fill the remaining columns' null values with 0 (note: this puts the
# integer 0 into otherwise-string columns such as vehicle_type)
autos.fillna(0, inplace=True)
# Confirm there are no null values left
autos.isnull().sum()
date_crawled 0 name 0 seller 0 offer_type 0 price 0 ab_test 0 vehicle_type 0 registration_year 0 gearbox 0 power_ps 0 model 0 odometer_km 0 registration_month 0 fuel_type 0 brand 0 unrepaired_damage 0 ad_created 0 n_of_pictures 0 postal_code 0 last_seen 0 dtype: int64
# Outlier detection
Q1=autos["price"].quantile(0.25)
Q3=autos["price"].quantile(0.75)
IQR=Q3-Q1
low_bound=Q1 - 1.5 * IQR
upper_bound=Q3 + 1.5 * IQR
print(low_bound,upper_bound)
-8050.0 16350.0
From the outlier detection we determined the lower and upper bounds of the price distribution; values above or below that range are considered outliers. We will remove those outliers and keep the data between the bounds. Let's look at the interquartile price range without the outliers.
IQR_price=autos.price[(autos.price > low_bound) & (autos.price < upper_bound)]
IQR_price.shape
(46207,)
IQR_price.describe()
count 46207.000000 mean 3961.283550 std 3843.727282 min 0.000000 25% 1000.000000 50% 2500.000000 75% 5900.000000 max 16333.000000 Name: price, dtype: float64
# Keep only prices from 1 to 16333: this drops the zero-price listings and the values above the filtered maximum
autos = autos[autos['price'].between(1,16333)]
autos['price'].describe()
count 44786.000000 mean 4086.969790 std 3837.880854 min 1.000000 25% 1150.000000 50% 2700.000000 75% 5999.000000 max 16333.000000 Name: price, dtype: float64
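The same Tukey-fence logic can be wrapped in a small helper so it is reusable for other numeric columns, such as power_ps; a minimal sketch:
def iqr_bounds(series, k=1.5):
    # Tukey fences: k * IQR below Q1 and above Q3
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(autos["power_ps"])  # e.g. for the power column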
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
  * date_crawled: added by the crawler
  * last_seen: added by the crawler
  * ad_created: from the website
  * registration_month: from the website
  * registration_year: from the website

Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.
Now we are going to see how the date columns are formatted, how the data behaves and see if we need to drop any rows or correct any possible wrong entries. Let's start printing our date columns:
# Printing all date columns
autos.loc[:5, ['date_crawled', 'last_seen', 'ad_created', 'registration_month', 'registration_year']]
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 | 3 | 2004 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 | 6 | 1997 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 | 7 | 2009 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 | 6 | 2007 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 7 | 2003 |
5 | 2016-03-21 13:47:45 | 2016-04-06 09:45:21 | 2016-03-21 00:00:00 | 4 | 2006 |
As we can see above, the registration_month and registration_year columns are already numbers, so let's analyze their content right away.
# Checking if there are more than 12 months
autos['registration_month'].unique()
array([ 3, 6, 7, 4, 8, 12, 10, 0, 9, 11, 5, 2, 1])
Apart from month 0, which most likely stands for an unknown month, everything seems normal, so let's look at the year column.
# Checking the unique values from the year column
autos['registration_year'].unique()
array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010, 1999, 1990, 1996, 1992, 2002, 2012, 2014, 2011, 2005, 2008, 1985, 2016, 1994, 1986, 2001, 2018, 1972, 1993, 1988, 2013, 1989, 1973, 1967, 1976, 4500, 1987, 1991, 2015, 1960, 1983, 1969, 1950, 1978, 1980, 1984, 1963, 1977, 1961, 1934, 1982, 1966, 1979, 1968, 1981, 1970, 1910, 1971, 1975, 5000, 4100, 2019, 1965, 1956, 9999, 1958, 1800, 1964, 1974, 1959, 1962, 1937, 1929, 1000, 1952, 1111, 8888, 1954, 1941, 1938, 2800, 5911, 1953, 4800, 1001])
We can see clearly that there are years that make no sense at all, like 9999 or 4800, or any year greater than 2018, given that 2016 is the year the data was crawled (maybe we can use some of this data; let's see below what percentage falls outside the range). It also makes no sense to keep values from before 1900, so they will be removed too. With that in mind, let's drop these rows.
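First, a quick check of what share of rows the cut would remove:
# Percentage of listings registered outside 1900-2018
(~autos["registration_year"].between(1900, 2018)).mean() * 100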
autos = autos.loc[autos['registration_year'].between(1900,2018)]
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.0 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000.0 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Next we will work with the still-unformatted date columns. To include missing values in the distribution and to use percentages instead of counts, we chain the Series.value_counts(normalize=True, dropna=False) method.
# The following 3 columns are still unstructured strings; we will convert them to datetimes
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
The first 10 characters represent the day (e.g. 2016-03-12). To understand the date range, we can extract just the date values, use Series.value_counts() to generate a distribution, and then sort by the index.
To select the first 10 characters in each column, we can use Series.str[:10]:
print(autos['date_crawled'].str[:10])
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 5 2016-03-21 6 2016-03-20 7 2016-03-16 8 2016-03-22 9 2016-03-16 10 2016-03-15 11 2016-03-16 12 2016-03-31 13 2016-03-23 14 2016-03-23 16 2016-03-16 17 2016-03-29 18 2016-03-26 19 2016-03-17 20 2016-03-05 23 2016-03-10 25 2016-03-21 26 2016-04-03 28 2016-03-19 29 2016-04-02 30 2016-03-14 31 2016-03-14 32 2016-03-20 33 2016-03-15 34 2016-04-05 ... 49964 2016-03-10 49965 2016-03-11 49966 2016-04-02 49967 2016-03-12 49968 2016-04-01 49969 2016-03-17 49970 2016-03-21 49971 2016-03-29 49972 2016-03-26 49973 2016-03-27 49975 2016-03-27 49976 2016-03-19 49977 2016-03-31 49978 2016-04-04 49979 2016-03-20 49980 2016-03-12 49981 2016-03-15 49982 2016-03-29 49983 2016-03-06 49985 2016-04-02 49986 2016-04-04 49988 2016-03-28 49989 2016-03-11 49991 2016-03-06 49992 2016-03-10 49993 2016-03-15 49994 2016-03-22 49996 2016-03-28 49997 2016-04-02 49999 2016-03-14 Name: date_crawled, Length: 44767, dtype: object
# Convert string columns to datetime
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'])
# the timestamp is lost during the conversion
autos['ad_created'] = pd.to_datetime(autos['ad_created'])
autos['last_seen'] = pd.to_datetime(autos['last_seen'])
print("Counts of unique values for 'date_crawled' column, in percentages:", '\n',
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index().head(10) * 100)
Counts of unique values for 'date_crawled' column, in percentages: 2016-03-05 14:06:30 0.002234 2016-03-05 14:06:40 0.002234 2016-03-05 14:07:04 0.002234 2016-03-05 14:07:08 0.002234 2016-03-05 14:07:21 0.002234 2016-03-05 14:07:26 0.002234 2016-03-05 14:07:40 0.002234 2016-03-05 14:07:45 0.002234 2016-03-05 14:08:00 0.004468 2016-03-05 14:08:05 0.004468 Name: date_crawled, dtype: float64
No anomalies in the 'date_crawled' column (something one would expect from a column created by the crawler).
print("Counts of unique values for 'ad_created' column, in percentages:", '\n',
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index().head(10) * 100)
Counts of unique values for 'ad_created' column, in percentages: 2015-08-10 0.002234 2015-09-09 0.002234 2015-11-10 0.002234 2015-12-05 0.002234 2015-12-30 0.002234 2016-01-03 0.002234 2016-01-07 0.002234 2016-01-10 0.004468 2016-01-13 0.002234 2016-01-16 0.002234 Name: ad_created, dtype: float64
The mystery is solved: there are only 5 ads created in 2015, which can hardly be legitimate. This is most likely an artifact of how the data was crawled; to confirm that, the documentation for the original data set should be reviewed.
print("Counts of unique values for 'last_seen' column, in percentages:", '\n',
autos['last_seen'].dt.date.value_counts(normalize=True, dropna=False).sort_index() * 100)
Counts of unique values for 'last_seen' column, in percentages: 2016-03-05 0.116157 2016-03-06 0.455693 2016-03-07 0.569616 2016-03-08 0.790761 2016-03-09 1.007439 2016-03-10 1.116894 2016-03-11 1.304532 2016-03-12 2.492908 2016-03-13 0.938191 2016-03-14 1.286662 2016-03-15 1.628432 2016-03-16 1.697679 2016-03-17 2.903925 2016-03-18 0.757254 2016-03-19 1.648536 2016-03-20 2.124333 2016-03-21 2.119865 2016-03-22 2.191346 2016-03-23 1.912123 2016-03-24 2.046150 2016-03-25 2.001474 2016-03-26 1.724485 2016-03-27 1.621730 2016-03-28 2.171242 2016-03-29 2.303036 2016-03-30 2.546519 2016-03-31 2.428128 2016-04-01 2.329841 2016-04-02 2.524181 2016-04-03 2.566623 2016-04-04 2.504077 2016-04-05 12.125003 2016-04-06 21.428731 2016-04-07 12.616436 Name: last_seen, dtype: float64
The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.
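Since date_crawled and last_seen are now proper datetimes, we can also sketch the idea from the introduction and estimate how long a listing stays online (keeping in mind that for ads still online when the crawl ended, this only gives a lower bound):
# Approximate number of days between first crawl and last sighting
online_days = (autos["last_seen"] - autos["date_crawled"]).dt.days
online_days.describe()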
autos["registration_year"].describe()
count 44767.000000 mean 2002.996292 std 7.232618 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2007.000000 max 2018.000000 Name: registration_year, dtype: float64
Since we earlier restricted registration_year to the 1900-2018 range, there is no longer any doubtful data in this column.
autos.head(2)
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 14:45:08 |
autos.loc[:,["seller", "offer_type", "n_of_pictures"]].head()
seller | offer_type | n_of_pictures | |
---|---|---|---|
0 | privat | Angebot | 0 |
1 | privat | Angebot | 0 |
2 | privat | Angebot | 0 |
3 | privat | Angebot | 0 |
4 | privat | Angebot | 0 |
As these columns contain essentially a single value each (all zeros in n_of_pictures, and one dominant value in seller and offer_type), they will not be important for further analysis, so we can remove them from the main dataframe.
autos.drop(["seller", "offer_type", "n_of_pictures"], axis = 1, inplace = True)
autos.head()
date_crawled | name | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.0 | 6 | benzin | smart | nein | 2016-03-12 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000.0 | 7 | benzin | ford | nein | 2016-04-01 | 39218 | 2016-04-01 14:38:50 |
autos["registration_year"].value_counts(normalize = True)
2000 0.070186 2005 0.064512 1999 0.064512 2003 0.059620 2004 0.059419 2001 0.058436 2006 0.057699 2002 0.054951 1998 0.052494 2007 0.047490 2008 0.044318 1997 0.043425 2009 0.040387 1996 0.030424 2017 0.030380 2010 0.027386 1995 0.027252 2016 0.026314 2011 0.024639 2012 0.017848 1994 0.013849 2018 0.010208 1993 0.009293 2013 0.008801 1992 0.008019 1990 0.007595 1991 0.007372 2014 0.005316 1989 0.003820 1988 0.002993 ... 1981 0.000581 1960 0.000447 1971 0.000424 1977 0.000424 1973 0.000424 1968 0.000380 1976 0.000380 1975 0.000357 1967 0.000357 1969 0.000335 1974 0.000335 1965 0.000246 1966 0.000246 1964 0.000201 1961 0.000112 1910 0.000112 1962 0.000089 1956 0.000089 1963 0.000089 1959 0.000089 1937 0.000067 1958 0.000067 1950 0.000067 1954 0.000045 1934 0.000045 1938 0.000022 1941 0.000022 1953 0.000022 1929 0.000022 1952 0.000022 Name: registration_year, Length: 72, dtype: float64
After the earlier filtering, the registration_year values all fall between 1900 and 2018, with the most frequent years clustered around 1999-2005.
Now that we have made most of the cleaning, we are going to aggregate the items by car brands. This will bring more clarity to the dataset.
# Identify the unique values as we want to aggregate by car brand
autos["brand"].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'opel', 'mazda', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'fiat', 'skoda', 'subaru', 'sonstige_autos', 'kia', 'citroen', 'porsche', 'hyundai', 'chevrolet', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'jaguar', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
# Checking the frequencies of car brands
brands = autos["brand"].value_counts()
brands.head()
volkswagen 9680 opel 5199 bmw 4576 mercedes_benz 3926 audi 3453 Name: brand, dtype: int64
# Convert the brand frequencies to percentages
brands_percentage = ((brands)/(autos["brand"].shape[0]))*100
# Printing the top 10
brands_percentage.head(10)
volkswagen 21.623071 opel 11.613465 bmw 10.221815 mercedes_benz 8.769853 audi 7.713271 ford 7.273215 renault 5.148882 peugeot 3.169746 fiat 2.807872 seat 1.976903 Name: brand, dtype: float64
Looking at the data above, we selected the top 10 brands. The ones we are going to aggregate are: volkswagen (21.62%), opel (11.61%), bmw (10.22%), mercedes_benz (8.77%), audi (7.71%), ford (7.27%), renault (5.15%), peugeot (3.17%), fiat (2.81%) and seat (1.98%).
Now we'll make a dictionary to store the mean price information of each brand.
# choosing our top 10 brands in a list
auto_brands = ['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi',
'ford', 'renault', 'peugeot', 'fiat', 'seat']
# Creating empty dictionary to hold aggregate data
# by mean price and mean mileage
brand_mean_price = {}
brand_mean_mileage = {}
# Populating our dictionary
for brand in auto_brands:
mean_price = autos.loc[autos['brand'] == brand, 'price'].mean()
mean_mileage = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
brand_mean_price[brand] = mean_price
brand_mean_mileage[brand] = mean_mileage
print("brand_mean_price:\n\n",brand_mean_price)
print("\nbrand_mean_mileage:\n\n", brand_mean_mileage)
brand_mean_price: {'volkswagen': 4145.926859504132, 'fiat': 2724.6340493237867, 'mercedes_benz': 5207.541008660214, 'opel': 2684.1500288517022, 'renault': 2247.817787418655, 'audi': 5653.443672169128, 'bmw': 5617.18444055944, 'seat': 3701.974011299435, 'ford': 2939.946253071253, 'peugeot': 2942.6067653276955} brand_mean_mileage: {'volkswagen': 132967.45867768594, 'fiat': 117879.87271280827, 'mercedes_benz': 138285.7870606215, 'opel': 130619.34987497596, 'renault': 128839.47939262472, 'audi': 139842.1662322618, 'bmw': 138809.0034965035, 'seat': 124734.46327683616, 'ford': 127040.84766584767, 'peugeot': 128047.92107117688}
As we can see above, the most expensive car brands are audi, bmw and mercedes_benz, with mean prices over 5,000 dollars, followed by volkswagen and seat with prices between roughly 3,700 and 4,200 dollars. The least expensive are the ford, fiat, peugeot and renault cars.
Below we are going to combine the price and mileage values in a single dataframe so we can compare both pieces of information. We'll abbreviate brand_mean_price as bmp and brand_mean_mileage as bmm.
# Turning the dictionary into a dataframe
bmp_series = pd.Series(brand_mean_price)
bmm_series = pd.Series(brand_mean_mileage)
# Turning both series into a dataframe
brand_df = pd.DataFrame(bmp_series, columns = ['mean_price'])
brand_df['mean_mileage'] = bmm_series
brand_df
mean_price | mean_mileage | |
---|---|---|
audi | 5653.443672 | 139842.166232 |
bmw | 5617.184441 | 138809.003497 |
fiat | 2724.634049 | 117879.872713 |
ford | 2939.946253 | 127040.847666 |
mercedes_benz | 5207.541009 | 138285.787061 |
opel | 2684.150029 | 130619.349875 |
peugeot | 2942.606765 | 128047.921071 |
renault | 2247.817787 | 128839.479393 |
seat | 3701.974011 | 124734.463277 |
volkswagen | 4145.926860 | 132967.458678 |
As we can see above, mean mileage varies far less across brands than mean price does. The most expensive brands are also the ones with the highest mean mileage, except opel, which has high mileage but a comparatively low price; renault likewise has high mileage relative to its mean price. This data could yield many more insights with more hours of work.
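For reference, the same per-brand aggregation can be expressed in a single groupby call instead of a loop (a sketch; the named-aggregation syntax requires pandas 0.25 or newer):
# Mean price and mileage for the top 10 brands in one pass
top_brands = autos[autos["brand"].isin(auto_brands)]
brand_summary = top_brands.groupby("brand").agg(
    mean_price=("price", "mean"),
    mean_mileage=("odometer_km", "mean"),
)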
Four columns still contain German words: vehicle_type, gearbox, fuel_type and unrepaired_damage. Let's map all of those columns to English.
# To print all the column levels
autos.columns
Index(['date_crawled', 'name', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
# Finding German words in vehicle_type column
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', 0, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
# Translating German Values to English by Mapping Technique
vehicle_type = autos["vehicle_type"]
map_vehicle = {
"limousine":"limousine",
"kleinwagen":"small car",
"kombi":"combi",
"bus": "bus",
"cabrio":"convertible",
"coupe":"coupe",
"suv":"suv",
"andere":"other"
}
autos["vehicle_type"] = pd.Series(vehicle_type).map(map_vehicle)
# For gearbox column
autos["gearbox"].unique()
array(['manuell', 'automatik', 0], dtype=object)
# Translating German Values to English by Mapping Technique
gear_box = autos["gearbox"]
map_gearbox = {
"manuell":"manual",
"automatik":"automatic",
}
autos["gearbox"] = pd.Series(gear_box).map(map_gearbox)
# for fuel_type column
autos["fuel_type"].unique()
array(['lpg', 'benzin', 'diesel', 0, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
# Translating German Values to English by Mapping Technique
fuel_type = autos["fuel_type"]
map_fuel_type = {
"benzin" :"petrol",
"diesel":"diesel",
"lpg" :"lpg",
"cpg":"cpg",
"hybrid" :"hybrid",
"andere":"other",
"elektro":"electric",
}
autos["fuel_type"] = pd.Series(fuel_type).map(map_fuel_type)
# for unrepaired_damage
autos["unrepaired_damage"].unique()
array(['nein', 'ja'], dtype=object)
# Translating German Values to English by Mapping Technique
damage_type = autos["unrepaired_damage"]
map_damage_type = {
"nein" :"no",
"ja":"yes",
}
autos["unrepaired_damage"] = pd.Series(damage_type).map(map_damage_type)
# Convert the date_crawled column to a numeric yyyymmdd representation
autos["date_crawled"] = autos["date_crawled"].astype(str).str[:10].str.replace("-","").astype(int)
print(autos["date_crawled"].head())
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 Name: date_crawled, dtype: int64
# Convert the ad_created column to a numeric yyyymmdd representation
autos["ad_created"] = autos["ad_created"].astype(str).str[:10].str.replace("-","").astype(int)
print(autos["ad_created"].head())
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 Name: ad_created, dtype: int64
# Convert the last_seen column to a numeric yyyymmdd representation
autos["last_seen"] = autos["last_seen"].astype(str).str[:10].str.replace("-","").astype(int)
print(autos["last_seen"].head())
0 20160406 1 20160406 2 20160406 3 20160315 4 20160401 Name: last_seen, dtype: int64
autos["name"].unique()
array(['Peugeot_807_160_NAVTECH_ON_BOARD', 'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik', 'Volkswagen_Golf_1.6_United', ..., 'Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausch_moegl', 'Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+Reifen_neu_!!', 'Fiat_500_C_1.2_Dualogic_Lounge'], dtype=object)
autos["name"].value_counts().head()
Ford_Fiesta 76 Volkswagen_Golf_1.4 75 BMW_316i 75 Volkswagen_Polo 72 BMW_318i 72 Name: name, dtype: int64
autos["name"] = autos["name"].str.replace("_", ",")
print(autos["name"].head().str.split())
0 [Peugeot,807,160,NAVTECH,ON,BOARD] 1 [BMW,740i,4,4,Liter,HAMANN,UMBAU,Mega,Optik] 2 [Volkswagen,Golf,1.6,United] 3 [Smart,smart,fortwo,coupe,softouch/F1/Klima/Pa... 4 [Ford,Focus,1,6,Benzin,TÜV,neu,ist,sehr,gepfle... Name: name, dtype: object
# grouping the dataset by model and brand
brand_mod_group = autos.groupby(["brand","model"])
# find the count and sort descending
brand_model = brand_mod_group["date_crawled"].count().sort_values(ascending=False)
print(brand_model[:10])
brand model volkswagen golf 3674 bmw 3er 2499 volkswagen polo 1683 opel corsa 1677 astra 1400 volkswagen passat 1331 audi a4 1151 mercedes_benz c_klasse 1030 bmw 5er 1008 mercedes_benz e_klasse 831 Name: date_crawled, dtype: int64
# grouping the dataset by mileage
df_by_km = autos.groupby(['odometer_km'])
df_by_km
<pandas.core.groupby.DataFrameGroupBy object at 0x7fbe3d8ae160>
# sort values by mean price
df_by_km['price'].mean().sort_values(ascending=False)
odometer_km 40000.0 8474.518095 30000.0 8151.000000 50000.0 7794.866287 10000.0 7470.352459 20000.0 7244.222477 60000.0 7239.508591 70000.0 6941.411333 80000.0 6610.982173 90000.0 5835.658683 100000.0 5539.574342 125000.0 4814.489199 150000.0 3270.599993 5000.0 2935.272222 Name: price, dtype: float64
We can clearly observe a pattern: the higher the mileage, the lower the mean price, with one outlier at 5,000 km. Cars with mileage between 10,000 and 40,000 km fetch the highest mean prices, whereas cars with mileage above 125,000 km have the lowest.
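The strength of this pattern can be summarized in a single number with the correlation coefficient between mileage and price, which we would expect to be negative:
# Pearson correlation between odometer reading and price
autos["odometer_km"].corr(autos["price"])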
autos['unrepaired_damage'].unique()
array(['no', 'yes'], dtype=object)
df_by_damage = autos.groupby(['unrepaired_damage'])
df_by_damage
<pandas.core.groupby.DataFrameGroupBy object at 0x7fbe3e3e5eb8>
df_by_damage['price'].mean().sort_values(ascending=False)
unrepaired_damage no 4332.857520 yes 1961.651348 Name: price, dtype: float64
In the code above, we calculated the mean prices for cars with and without unrepaired damage. As we can see, cars with no damage sell for roughly twice as much as damaged cars.
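Since means are sensitive to the remaining expensive listings, a quick robustness check is to compare the medians of the two groups as well:
# Median price per damage group, as a check on the mean comparison
df_by_damage["price"].median()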
The top brands by listings: the Germans clearly love local automakers, with German brands taking the top 5 spots and making up roughly 60% of all listings.
We can see that Audi, BMW and Mercedes-Benz vehicles are the most expensive amongst the top brands, with similar mean prices in the 5,200-5,650 dollar range. Mercedes-Benz alone still makes up nearly 9% of all vehicles in the dataset, which indicates many people are trying to sell their Mercedes-Benz vehicles.
More research would need to be done in order to state whether or not the demand for Mercedes Benz vehicles is greater or less than the supply.
We also see that the three most expensive top brands (Audi, BMW, Mercedes-Benz) together make up over a quarter of all listings, which could indicate that there is not a strong correlation between price and popularity. However, more research would be required to determine whether or not this is true.
Finally, mean mileage varies only modestly amongst the top brands, so there is no strong apparent relationship between mean price and mean mileage.