This is a Guided Project from one of the DataQuest.io courses. It is a data cleaning project in which we take a web-scraped dataset from the German eBay website, more specifically the eBay Kleinanzeigen classifieds section.
The data was altered by DataQuest to make the job a little more difficult; the original dataset was cleaner, to make analysis easier. It was uploaded to Kaggle by the user orgesleka but is not available there anymore. However, it can be found here.
We are going to clean the data and answer some questions that come up along the way.
The content of the data is in German, so one has to translate it first if one cannot speak German. The following fields are included:
The fields lastSeen and dateCrawled could be used to estimate how long a car will be at least online before it is sold.
The aim of this project is to clean the data and analyze the included used car listings.
# Import the libraries we will use
import numpy as np
import pandas as pd
# Read the CSV file, specifying the Latin-1 encoding
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
# view first 5 rows of dataset
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# view last 5 rows in dataset
autos.tail()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
Since everything has worked fine so far, we are going to take a closer look at the entries, looking for possible mistakes that could lead to a wrong analysis.
# show the information details about dataframe
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
The dataset contains 50,000 rows and 20 columns; a few columns have null values; 5 columns are of integer dtype and 15 of object dtype; and the column names use camel case.
We need to convert the column names from camel case to snake case and rewrite some of them to be more readable.
With the help of the dataset dictionary, we are going to change them to more descriptive snake_case names.
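For reference, the camel-to-snake conversion can also be done mechanically with a small regex helper. This is just a sketch; it won't reproduce every manual choice below (e.g. powerPS would become power_p_s rather than power_ps), which is why we assign the final list by hand.
import re

def camel_to_snake(name):
    # Insert an underscore before each capital letter that is not at the
    # start of the string, then lowercase the whole name
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

camel_to_snake("yearOfRegistration")  # -> 'year_of_registration'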
# Printing the autos dataframe columns
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Renaming a few columns with a dictionary, as the instructions suggest
autos.rename(columns = {"yearOfRegistration" : "registration_year",
"monthOfRegistration" : "registration_month",
"notRepairedDamage" : "unrepaired_damage",
"dateCreated" : "ad_created"},inplace = True)
# Checking the changed columns
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'registration_year', 'gearbox', 'powerPS', 'model', 'odometer', 'registration_month', 'fuelType', 'brand', 'unrepaired_damage', 'ad_created', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Changing the name to more meaningful snakecase ones
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created','n_of_pictures', 'postal_code',
'last_seen']
# Checking it
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'n_of_pictures', 'postal_code', 'last_seen'], dtype='object')
In this step we converted the column names from camel case to snake case (lowercase, with underscores) and gave some columns more meaningful names.
Now that we have fixed the first problems we saw with the column names, we are going to check the data once more for duplicates (or anything else that catches our attention).
# Print descriptive statistics for all columns
autos.describe(include = "all")
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the statistical description of the dataset, it is clear that the price and odometer columns are stored as text but should be numeric, so we have to convert them.
The n_of_pictures column contains only zeros, so it has no importance for further analysis.
We also have to look into the registration_year, registration_month and postal_code columns.
The power_ps column shows a suspiciously large gap between its minimum and maximum values.
On top of that, the seller and offer_type columns are almost constant, with only 2 unique values each.
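We can also confirm these observations programmatically; a quick sketch using the number of unique values per column (near-constant columns are candidates for dropping):
# Columns with very few unique values carry little information
autos.nunique().sort_values().head()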
# Let's see how many unique values the seller column has
autos["seller"].unique().shape
(2,)
# show the unique values
autos["seller"].unique()
array(['privat', 'gewerblich'], dtype=object)
# Let's see the number of unique values in the offer_type column
autos["offer_type"].unique().shape
(2,)
# show the unique values
autos["offer_type"].unique()
array(['Angebot', 'Gesuch'], dtype=object)
Translating these words, a "gewerblich" seller is a commercial seller, and a "Gesuch" offer type is a wanted/requested ad. Each of these values appears only once in the data (see the frequencies above), so neither the commercial seller nor the "Gesuch" offer is relevant to what we are looking for.
Now let's convert the price and odometer columns to numbers, and give the odometer column a better name.
# Convert the price column to float; regex=False treats "$" as a literal
# character rather than a regex end-of-string anchor
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
autos["price"].dtype
dtype('float64')
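As a side note, a more defensive version of this conversion (a sketch, not run here) strips every non-numeric character with a regex and coerces anything unparseable to NaN instead of raising an error:
# Keep only digits and the decimal point, then coerce bad entries to NaN
cleaned = autos["price"].astype(str).str.replace(r"[^\d.]", "", regex=True)
autos["price"] = pd.to_numeric(cleaned, errors="coerce")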
# Let's see how many unique values the price column has
autos["price"].unique().shape
(2357,)
# Let's see the statistical summary
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
# show the variation in values for the price column
autos["price"].value_counts()
# value_counts lists every value, so we chain head() to show only a few
autos["price"].value_counts().head()
# value_counts returns a Series, so we can chain Series.sort_index with
# ascending=True/False to view the highest or lowest values together with
# their counts, again using head() to truncate the output
autos["price"].value_counts().sort_index(ascending = False).head()
99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 Name: price, dtype: int64
# Show the highest and lowest value from price column
highest_value = autos["price"].value_counts().sort_index(ascending = False).head()
lowest_value = autos["price"].value_counts().sort_index(ascending = True).head()
print("higher price frequency:\n{} \n\nlower price frquency:\n{}"
.format(highest_value, lowest_value))
higher price frequency: 99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 Name: price, dtype: int64 lower price frquency: 0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 Name: price, dtype: int64
# Convert odometer column and rename column as odometer_km
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","").astype(float)
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)
# show the variation in values for the odometer column
autos["odometer_km"].value_counts()
# value_counts lists every value, so we chain head() to show only a few
autos["odometer_km"].value_counts().head()
# value_counts returns a Series, so we can chain Series.sort_index with
# ascending=True/False to view the highest or lowest values together with
# their counts, again using head() to truncate the output
print("Higher values ", autos["odometer_km"].value_counts().sort_index(ascending = False).head())
print("\n")
print("Lower values ", autos["odometer_km"].value_counts().sort_index(ascending = True).head())
Higher values 150000.0 32424 125000.0 5170 100000.0 2169 90000.0 1757 80000.0 1436 Name: odometer_km, dtype: int64 Lower values 5000.0 967 10000.0 264 20000.0 784 30000.0 789 40000.0 819 Name: odometer_km, dtype: int64
# Show the statistical details of odometer_km column
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos.isnull().sum()
date_crawled 0 name 0 seller 0 offer_type 0 price 0 ab_test 0 vehicle_type 5095 registration_year 0 gearbox 2680 power_ps 0 model 2758 odometer_km 0 registration_month 0 fuel_type 4482 brand 0 unrepaired_damage 9829 ad_created 0 n_of_pictures 0 postal_code 0 last_seen 0 dtype: int64
missing_value_columns = ["vehicle_type", "gearbox", "model", "fuel_type", "unrepaired_damage"]
for col in missing_value_columns:
pct_missing = np.mean(autos[col].isnull())
print('{} - {}%'.format(col, round(pct_missing*100)))
vehicle_type - 10.0% gearbox - 5.0% model - 6.0% fuel_type - 9.0% unrepaired_damage - 20.0%
The missing-value percentages are low (20% at most), so we could mostly neglect them; however, the unrepaired_damage column, which has the most missing values, can be imputed with its mode.
# fill the categorical column null values with mode
autos['unrepaired_damage'] = autos['unrepaired_damage'].fillna(autos['unrepaired_damage'].mode()[0])
# Fill the remaining columns' null values with 0 (note: this puts the
# integer 0 into otherwise-string columns such as vehicle_type)
autos.fillna(0, inplace=True)
# Confirm there are no null values left
autos.isnull().sum()
date_crawled 0 name 0 seller 0 offer_type 0 price 0 ab_test 0 vehicle_type 0 registration_year 0 gearbox 0 power_ps 0 model 0 odometer_km 0 registration_month 0 fuel_type 0 brand 0 unrepaired_damage 0 ad_created 0 n_of_pictures 0 postal_code 0 last_seen 0 dtype: int64
# Outlier detection
Q1=autos["price"].quantile(0.25)
Q3=autos["price"].quantile(0.75)
IQR=Q3-Q1
low_bound=Q1 - 1.5 * IQR
upper_bound=Q3 + 1.5 * IQR
print(low_bound,upper_bound)
-8050.0 16350.0
From the outlier detection we determined the lower and upper bounds of the price distribution; values above or below that range are considered outliers. We will remove those outliers and keep the data between the bounds. Let's look at the interquartile price range without the outliers.
IQR_price=autos.price[(autos.price > low_bound) & (autos.price < upper_bound)]
IQR_price.shape
(46207,)
IQR_price.describe()
count 46207.000000 mean 3961.283550 std 3843.727282 min 0.000000 25% 1000.000000 50% 2500.000000 75% 5900.000000 max 16333.000000 Name: price, dtype: float64
# Keep only prices from 1 to 16333: this drops the zero-price listings and the values above the filtered maximum
autos = autos[autos['price'].between(1,16333)]
autos['price'].describe()
count 44786.000000 mean 4086.969790 std 3837.880854 min 1.000000 25% 1150.000000 50% 2700.000000 75% 5999.000000 max 16333.000000 Name: price, dtype: float64
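The same Tukey-fence logic can be wrapped in a small helper so it is reusable for other numeric columns, such as power_ps; a minimal sketch:
def iqr_bounds(series, k=1.5):
    # Tukey fences: k * IQR below Q1 and above Q3
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(autos["power_ps"])  # e.g. for the power column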
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
  * date_crawled: added by the crawler
  * last_seen: added by the crawler
  * ad_created: from the website
  * registration_month: from the website
  * registration_year: from the website

Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.
Now we are going to see how the date columns are formatted, how the data behaves and see if we need to drop any rows or correct any possible wrong entries. Let's start printing our date columns:
# Printing all date columns
autos.loc[:5, ['date_crawled', 'last_seen', 'ad_created', 'registration_month', 'registration_year']]
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 | 3 | 2004 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 | 6 | 1997 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 | 7 | 2009 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 | 6 | 2007 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 7 | 2003 |
5 | 2016-03-21 13:47:45 | 2016-04-06 09:45:21 | 2016-03-21 00:00:00 | 4 | 2006 |
As we can see above, the registration_month and registration_year columns are already numbers, so let's analyze their content right away.
# Checking if there are more than 12 months
autos['registration_month'].unique()
array([ 3, 6, 7, 4, 8, 12, 10, 0, 9, 11, 5, 2, 1])
Apart from month 0, which most likely stands for an unknown month, everything seems normal, so let's look at the year column.
# Checking the unique values from the year column
autos['registration_year'].unique()
array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010, 1999, 1990, 1996, 1992, 2002, 2012, 2014, 2011, 2005, 2008, 1985, 2016, 1994, 1986, 2001, 2018, 1972, 1993, 1988, 2013, 1989, 1973, 1967, 1976, 4500, 1987, 1991, 2015, 1960, 1983, 1969, 1950, 1978, 1980, 1984, 1963, 1977, 1961, 1934, 1982, 1966, 1979, 1968, 1981, 1970, 1910, 1971, 1975, 5000, 4100, 2019, 1965, 1956, 9999, 1958, 1800, 1964, 1974, 1959, 1962, 1937, 1929, 1000, 1952, 1111, 8888, 1954, 1941, 1938, 2800, 5911, 1953, 4800, 1001])
We can see clearly that there are years that make no sense at all, like 9999 or 4800, or any year greater than 2018, given that 2016 is the year the data was crawled (maybe we can use some of this data; let's see below what percentage falls outside the range). It also makes no sense to keep values from before 1900, so they will be removed too. With that in mind, let's drop these rows.
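First, a quick check of what share of rows the cut would remove:
# Percentage of listings registered outside 1900-2018
(~autos["registration_year"].between(1900, 2018)).mean() * 100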
autos = autos.loc[autos['registration_year'].between(1900,2018)]
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.0 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000.0 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Next we will work with the still-unformatted date columns. To include missing values in the distribution and to use percentages instead of counts, we chain the Series.value_counts(normalize=True, dropna=False) method.
# The following 3 columns are still unstructured strings; we will convert them to datetimes
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
The first 10 characters represent the day (e.g. 2016-03-12). To understand the date range, we can extract just the date values, use Series.value_counts() to generate a distribution, and then sort by the index.
To select the first 10 characters in each column, we can use Series.str[:10]:
print(autos['date_crawled'].str[:10])
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 5 2016-03-21 6 2016-03-20 7 2016-03-16 8 2016-03-22 9 2016-03-16 10 2016-03-15 11 2016-03-16 12 2016-03-31 13 2016-03-23 14 2016-03-23 16 2016-03-16 17 2016-03-29 18 2016-03-26 19 2016-03-17 20 2016-03-05 23 2016-03-10 25 2016-03-21 26 2016-04-03 28 2016-03-19 29 2016-04-02 30 2016-03-14 31 2016-03-14 32 2016-03-20 33 2016-03-15 34 2016-04-05 ... 49964 2016-03-10 49965 2016-03-11 49966 2016-04-02 49967 2016-03-12 49968 2016-04-01 49969 2016-03-17 49970 2016-03-21 49971 2016-03-29 49972 2016-03-26 49973 2016-03-27 49975 2016-03-27 49976 2016-03-19 49977 2016-03-31 49978 2016-04-04 49979 2016-03-20 49980 2016-03-12 49981 2016-03-15 49982 2016-03-29 49983 2016-03-06 49985 2016-04-02 49986 2016-04-04 49988 2016-03-28 49989 2016-03-11 49991 2016-03-06 49992 2016-03-10 49993 2016-03-15 49994 2016-03-22 49996 2016-03-28 49997 2016-04-02 49999 2016-03-14 Name: date_crawled, Length: 44767, dtype: object
# Convert string columns to datetime
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'])
# the timestamp is lost during the conversion
autos['ad_created'] = pd.to_datetime(autos['ad_created'])
autos['last_seen'] = pd.to_datetime(autos['last_seen'])
print("Counts of unique values for 'date_crawled' column, in percentages:", '\n',
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index().head(10) * 100)
Counts of unique values for 'date_crawled' column, in percentages: 2016-03-05 14:06:30 0.002234 2016-03-05 14:06:40 0.002234 2016-03-05 14:07:04 0.002234 2016-03-05 14:07:08 0.002234 2016-03-05 14:07:21 0.002234 2016-03-05 14:07:26 0.002234 2016-03-05 14:07:40 0.002234 2016-03-05 14:07:45 0.002234 2016-03-05 14:08:00 0.004468 2016-03-05 14:08:05 0.004468 Name: date_crawled, dtype: float64
No anomalies in the 'date_crawled' column (something one would expect from a column created by the crawler).
print("Counts of unique values for 'ad_created' column, in percentages:", '\n',
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index().head(10) * 100)
Counts of unique values for 'ad_created' column, in percentages: 2015-08-10 0.002234 2015-09-09 0.002234 2015-11-10 0.002234 2015-12-05 0.002234 2015-12-30 0.002234 2016-01-03 0.002234 2016-01-07 0.002234 2016-01-10 0.004468 2016-01-13 0.002234 2016-01-16 0.002234 Name: ad_created, dtype: float64
The mystery is solved: there are only 5 ads created in 2015, which can hardly be legitimate. This is most likely an artifact of how the data was crawled; to confirm that, the documentation for the original data set should be reviewed.
print("Counts of unique values for 'last_seen' column, in percentages:", '\n',
autos['last_seen'].dt.date.value_counts(normalize=True, dropna=False).sort_index() * 100)
Counts of unique values for 'last_seen' column, in percentages: 2016-03-05 0.116157 2016-03-06 0.455693 2016-03-07 0.569616 2016-03-08 0.790761 2016-03-09 1.007439 2016-03-10 1.116894 2016-03-11 1.304532 2016-03-12 2.492908 2016-03-13 0.938191 2016-03-14 1.286662 2016-03-15 1.628432 2016-03-16 1.697679 2016-03-17 2.903925 2016-03-18 0.757254 2016-03-19 1.648536 2016-03-20 2.124333 2016-03-21 2.119865 2016-03-22 2.191346 2016-03-23 1.912123 2016-03-24 2.046150 2016-03-25 2.001474 2016-03-26 1.724485 2016-03-27 1.621730 2016-03-28 2.171242 2016-03-29 2.303036 2016-03-30 2.546519 2016-03-31 2.428128 2016-04-01 2.329841 2016-04-02 2.524181 2016-04-03 2.566623 2016-04-04 2.504077 2016-04-05 12.125003 2016-04-06 21.428731 2016-04-07 12.616436 Name: last_seen, dtype: float64
The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.
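Since date_crawled and last_seen are now proper datetimes, we can also sketch the idea from the introduction and estimate how long a listing stays online (keeping in mind that for ads still online when the crawl ended, this only gives a lower bound):
# Approximate number of days between first crawl and last sighting
online_days = (autos["last_seen"] - autos["date_crawled"]).dt.days
online_days.describe()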
autos["registration_year"].describe()
count 44767.000000 mean 2002.996292 std 7.232618 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2007.000000 max 2018.000000 Name: registration_year, dtype: float64
Since we earlier restricted registration_year to the 1900-2018 range, there is no longer any doubtful data in this column.
autos.head(2)
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 14:45:08 |
autos.loc[:,["seller", "offer_type", "n_of_pictures"]].head()
seller | offer_type | n_of_pictures | |
---|---|---|---|
0 | privat | Angebot | 0 |
1 | privat | Angebot | 0 |
2 | privat | Angebot | 0 |
3 | privat | Angebot | 0 |
4 | privat | Angebot | 0 |
As these columns contain essentially a single value each (all zeros in n_of_pictures, and one dominant value in seller and offer_type), they will not be important for further analysis, so we can remove them from the main dataframe.
autos.drop(["seller", "offer_type", "n_of_pictures"], axis = 1, inplace = True)
autos.head()
date_crawled | name | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.0 | 6 | benzin | smart | nein | 2016-03-12 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000.0 | 7 | benzin | ford | nein | 2016-04-01 | 39218 | 2016-04-01 14:38:50 |
autos["registration_year"].value_counts(normalize = True)
2000 0.070186 2005 0.064512 1999 0.064512 2003 0.059620 2004 0.059419 2001 0.058436 2006 0.057699 2002 0.054951 1998 0.052494 2007 0.047490 2008 0.044318 1997 0.043425 2009 0.040387 1996 0.030424 2017 0.030380 2010 0.027386 1995 0.027252 2016 0.026314 2011 0.024639 2012 0.017848 1994 0.013849 2018 0.010208 1993 0.009293 2013 0.008801 1992 0.008019 1990 0.007595 1991 0.007372 2014 0.005316 1989 0.003820 1988 0.002993 ... 1981 0.000581 1960 0.000447 1971 0.000424 1977 0.000424 1973 0.000424 1968 0.000380 1976 0.000380 1975 0.000357 1967 0.000357 1969 0.000335 1974 0.000335 1965 0.000246 1966 0.000246 1964 0.000201 1961 0.000112 1910 0.000112 1962 0.000089 1956 0.000089 1963 0.000089 1959 0.000089 1937 0.000067 1958 0.000067 1950 0.000067 1954 0.000045 1934 0.000045 1938 0.000022 1941 0.000022 1953 0.000022 1929 0.000022 1952 0.000022 Name: registration_year, Length: 72, dtype: float64
After the earlier filtering, the registration_year values all fall between 1900 and 2018, with the most frequent years clustered around 1999-2005.
Now that we have made most of the cleaning, we are going to aggregate the items by car brands. This will bring more clarity to the dataset.
# Identify the unique values as we want to aggregate by car brand
autos["brand"].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'opel', 'mazda', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'fiat', 'skoda', 'subaru', 'sonstige_autos', 'kia', 'citroen', 'porsche', 'hyundai', 'chevrolet', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'jaguar', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
# Checking the frequencies of car brands
brands = autos["brand"].value_counts()
brands.head()
volkswagen 9680 opel 5199 bmw 4576 mercedes_benz 3926 audi 3453 Name: brand, dtype: int64
# Convert the brand frequencies to percentages
brands_percentage = ((brands)/(autos["brand"].shape[0]))*100
# Printing the top 10
brands_percentage.head(10)
volkswagen 21.623071 opel 11.613465 bmw 10.221815 mercedes_benz 8.769853 audi 7.713271 ford 7.273215 renault 5.148882 peugeot 3.169746 fiat 2.807872 seat 1.976903 Name: brand, dtype: float64
Looking at the data above, we selected the top 10 brands. The ones we are going to aggregate are: volkswagen (21.62%), opel (11.61%), bmw (10.22%), mercedes_benz (8.77%), audi (7.71%), ford (7.27%), renault (5.15%), peugeot (3.17%), fiat (2.81%) and seat (1.98%).
Now we'll make a dictionary to store the mean price information of each brand.
# choosing our top 10 brands in a list
auto_brands = ['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi',
'ford', 'renault', 'peugeot', 'fiat', 'seat']
# Creating empty dictionary to hold aggregate data
# by mean price and mean mileage
brand_mean_price = {}
brand_mean_mileage = {}
# Populating our dictionary
for brand in auto_brands:
mean_price = autos.loc[autos['brand'] == brand, 'price'].mean()
mean_mileage = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
brand_mean_price[brand] = mean_price
brand_mean_mileage[brand] = mean_mileage
print("brand_mean_price:\n\n",brand_mean_price)
print("\nbrand_mean_mileage:\n\n", brand_mean_mileage)
brand_mean_price: {'volkswagen': 4145.926859504132, 'fiat': 2724.6340493237867, 'mercedes_benz': 5207.541008660214, 'opel': 2684.1500288517022, 'renault': 2247.817787418655, 'audi': 5653.443672169128, 'bmw': 5617.18444055944, 'seat': 3701.974011299435, 'ford': 2939.946253071253, 'peugeot': 2942.6067653276955} brand_mean_mileage: {'volkswagen': 132967.45867768594, 'fiat': 117879.87271280827, 'mercedes_benz': 138285.7870606215, 'opel': 130619.34987497596, 'renault': 128839.47939262472, 'audi': 139842.1662322618, 'bmw': 138809.0034965035, 'seat': 124734.46327683616, 'ford': 127040.84766584767, 'peugeot': 128047.92107117688}
As we can see above, the most expensive car brands are audi, bmw and mercedes_benz, with mean prices over 5,000 dollars, followed by volkswagen and seat with prices between roughly 3,700 and 4,200 dollars. The least expensive are the ford, fiat, peugeot and renault cars.
Below we are going to combine the price and mileage values in a single dataframe so we can compare both pieces of information. We'll abbreviate brand_mean_price as bmp and brand_mean_mileage as bmm.
# Turning the dictionary into a dataframe
bmp_series = pd.Series(brand_mean_price)
bmm_series = pd.Series(brand_mean_mileage)
# Turning both series into a dataframe
brand_df = pd.DataFrame(bmp_series, columns = ['mean_price'])
brand_df['mean_mileage'] = bmm_series
brand_df
mean_price | mean_mileage | |
---|---|---|
audi | 5653.443672 | 139842.166232 |
bmw | 5617.184441 | 138809.003497 |
fiat | 2724.634049 | 117879.872713 |
ford | 2939.946253 | 127040.847666 |
mercedes_benz | 5207.541009 | 138285.787061 |
opel | 2684.150029 | 130619.349875 |
peugeot | 2942.606765 | 128047.921071 |
renault | 2247.817787 | 128839.479393 |
seat | 3701.974011 | 124734.463277 |
volkswagen | 4145.926860 | 132967.458678 |
As we can see above, mean mileage varies far less across brands than mean price does. The most expensive brands are also the ones with the highest mean mileage, except opel, which has high mileage but a comparatively low price; renault likewise has high mileage relative to its mean price. This data could yield many more insights with more hours of work.
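For reference, the same per-brand aggregation can be expressed in a single groupby call instead of a loop (a sketch; the named-aggregation syntax requires pandas 0.25 or newer):
# Mean price and mileage for the top 10 brands in one pass
top_brands = autos[autos["brand"].isin(auto_brands)]
brand_summary = top_brands.groupby("brand").agg(
    mean_price=("price", "mean"),
    mean_mileage=("odometer_km", "mean"),
)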
Four columns still contain German words: vehicle_type, gearbox, fuel_type and unrepaired_damage. Let's map all of those columns to English.
# To print all the column levels
autos.columns
Index(['date_crawled', 'name', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
# Finding German words in vehicle_type column
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', 0, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
# Translating German Values to English by Mapping Technique
vehicle_type = autos["vehicle_type"]
map_vehicle = {
"limousine":"limousine",
"kleinwagen":"small car",
"kombi":"combi",
"bus": "bus",
"cabrio":"convertible",
"coupe":"coupe",
"suv":"suv",
"andere":"other"
}
autos["vehicle_type"] = pd.Series(vehicle_type).map(map_vehicle)
# For gearbox column
autos["gearbox"].unique()
array(['manuell', 'automatik', 0], dtype=object)
# Translating German Values to English by Mapping Technique
gear_box = autos["gearbox"]
map_gearbox = {
"manuell":"manual",
"automatik":"automatic",
}
autos["gearbox"] = pd.Series(gear_box).map(map_gearbox)
# for fuel_type column
autos["fuel_type"].unique()
array(['lpg', 'benzin', 'diesel', 0, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
# Translating German Values to English by Mapping Technique
fuel_type = autos["fuel_type"]
map_fuel_type = {
"benzin" :"petrol",
"diesel":"diesel",
"lpg" :"lpg",
"cpg":"cpg",
"hybrid" :"hybrid",
"andere":"other",
"elektro":"electric",
}
autos["fuel_type"] = pd.Series(fuel_type).map(map_fuel_type)
# for unrepaired_damage
autos["unrepaired_damage"].unique()
array(['nein', 'ja'], dtype=object)
# Translating German Values to English by Mapping Technique
damage_type = autos["unrepaired_damage"]
map_damage_type = {
"nein" :"no",
"ja":"yes",
}
autos["unrepaired_damage"] = pd.Series(damage_type).map(map_damage_type)
# Convert the date_crawled column to a numeric yyyymmdd representation
autos["date_crawled"] = autos["date_crawled"].astype(str).str[:10].str.replace("-","").astype(int)
print(autos["date_crawled"].head())
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 Name: date_crawled, dtype: int64
# Convert the ad_created column to a numeric yyyymmdd representation
autos["ad_created"] = autos["ad_created"].astype(str).str[:10].str.replace("-","").astype(int)
print(autos["ad_created"].head())
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 Name: ad_created, dtype: int64
# Convert the last_seen column to a numeric yyyymmdd representation
autos["last_seen"] = autos["last_seen"].astype(str).str[:10].str.replace("-","").astype(int)
print(autos["last_seen"].head())
0 20160406 1 20160406 2 20160406 3 20160315 4 20160401 Name: last_seen, dtype: int64
autos["name"].unique()
array(['Peugeot_807_160_NAVTECH_ON_BOARD', 'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik', 'Volkswagen_Golf_1.6_United', ..., 'Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausch_moegl', 'Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+Reifen_neu_!!', 'Fiat_500_C_1.2_Dualogic_Lounge'], dtype=object)
autos["name"].value_counts().head()
Ford_Fiesta 76 Volkswagen_Golf_1.4 75 BMW_316i 75 Volkswagen_Polo 72 BMW_318i 72 Name: name, dtype: int64
autos["name"] = autos["name"].str.replace("_", ",")
print(autos["name"].head().str.split())
0 [Peugeot,807,160,NAVTECH,ON,BOARD] 1 [BMW,740i,4,4,Liter,HAMANN,UMBAU,Mega,Optik] 2 [Volkswagen,Golf,1.6,United] 3 [Smart,smart,fortwo,coupe,softouch/F1/Klima/Pa... 4 [Ford,Focus,1,6,Benzin,TÜV,neu,ist,sehr,gepfle... Name: name, dtype: object
# grouping the dataset by model and brand
brand_mod_group = autos.groupby(["brand","model"])
# find the count and sort descending
brand_model = brand_mod_group["date_crawled"].count().sort_values(ascending=False)
print(brand_model[:10])
brand model volkswagen golf 3674 bmw 3er 2499 volkswagen polo 1683 opel corsa 1677 astra 1400 volkswagen passat 1331 audi a4 1151 mercedes_benz c_klasse 1030 bmw 5er 1008 mercedes_benz e_klasse 831 Name: date_crawled, dtype: int64
# grouping the dataset by mileage
df_by_km = autos.groupby(['odometer_km'])
df_by_km
<pandas.core.groupby.DataFrameGroupBy object at 0x7fbe3d8ae160>
# sort values by mean price
df_by_km['price'].mean().sort_values(ascending=False)
odometer_km 40000.0 8474.518095 30000.0 8151.000000 50000.0 7794.866287 10000.0 7470.352459 20000.0 7244.222477 60000.0 7239.508591 70000.0 6941.411333 80000.0 6610.982173 90000.0 5835.658683 100000.0 5539.574342 125000.0 4814.489199 150000.0 3270.599993 5000.0 2935.272222 Name: price, dtype: float64
We can clearly observe a pattern: the higher the mileage, the lower the mean price, with one outlier at 5,000 km. Cars with mileage between 10,000 and 40,000 km fetch the highest mean prices, whereas cars with mileage above 125,000 km have the lowest.
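The strength of this pattern can be summarized in a single number with the correlation coefficient between mileage and price, which we would expect to be negative:
# Pearson correlation between odometer reading and price
autos["odometer_km"].corr(autos["price"])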
autos['unrepaired_damage'].unique()
array(['no', 'yes'], dtype=object)
df_by_damage = autos.groupby(['unrepaired_damage'])
df_by_damage
<pandas.core.groupby.DataFrameGroupBy object at 0x7fbe3e3e5eb8>
df_by_damage['price'].mean().sort_values(ascending=False)
unrepaired_damage no 4332.857520 yes 1961.651348 Name: price, dtype: float64
In the code above, we calculated the mean prices for cars with and without unrepaired damage. As we can see, cars with no damage sell for roughly twice as much as damaged cars.
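Since means are sensitive to the remaining expensive listings, a quick robustness check is to compare the medians of the two groups as well:
# Median price per damage group, as a check on the mean comparison
df_by_damage["price"].median()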
The top brands by listings: the Germans clearly love local automakers, with German brands taking the top 5 spots and making up roughly 60% of all listings.
We can see that Audi, BMW and Mercedes-Benz vehicles are the most expensive amongst the top brands, with similar mean prices in the 5,200-5,650 dollar range. Mercedes-Benz alone still makes up nearly 9% of all vehicles in the dataset, which indicates many people are trying to sell their Mercedes-Benz vehicles.
More research would need to be done in order to state whether or not the demand for Mercedes Benz vehicles is greater or less than the supply.
We also see that the three most expensive top brands (Audi, BMW, Mercedes-Benz) together make up over a quarter of all listings, which could indicate that there is not a strong correlation between price and popularity. However, more research would be required to determine whether or not this is true.
Finally, mean mileage varies only modestly amongst the top brands, so there is no strong apparent relationship between mean price and mean mileage.