eBay Kleinanzeigen is a classifieds website of ebay Germany. Just like any other classifieds website, the site allows users to post products and services which other users maybe interested in. The site has specific sections for different product including a services section for cars, boats and bikes Link. The section is quite broad and its listings include:
Data related to vehicle listings was scrapped from the above mentioned source to assess and glean insights related to the listings. However, the data is significantly muddled and will require cleaning before any meaningful information can be derived from the same.
The collection of data was mainly aimed at vehicle listing. A data dictionary of the data collected and their relative definitions have been provided below.
The amount of data is significant and moreover is aimed at a German audience.
The goal of this project is to clean the dataset to the point at which it becomes useful to extract vehicular related insights from the listing. By vehicular, I mean four-wheel vehicles.
Reading the data
A few functions have been defined below to assist with data analysis.
import pandas as pd
import numpy as np
import datetime
# Function name: fix_column(a_column_name)
# Input: The name of a column of the dataset
# Output: A standardized version of the column_name
# Description: The column names in the dataset are not consistent. This function will rename some of the columns and set all the
# column names to a common standard
def fix_column(a_column_name):
a_column_name = a_column_name.replace("yearOfRegistration","registration_year")
a_column_name = a_column_name.replace("monthOfRegistration","registration_month")
a_column_name = a_column_name.replace("notRepairedDamage", "unrepaired_damage")
a_column_name = a_column_name.replace("dateCreated", "ad_created")
a_column_name = camel_to_snake(a_column_name)
return a_column_name
# Function name: camel_to_snake(a_column_name)
# Input: The name of a column of the dataset
# Output: The given name in snake case
# Description: The function converts a given column name to snake case to keep it consistent with python standards
def camel_to_snake(a_column_name):
pos = 0
for letter in a_column_name:
if letter.isupper() == True:
pos = a_column_name.index(letter)
string1 = a_column_name[:pos]
string2 = a_column_name[pos:]
a_column_name = string1+"_"+string2
return (a_column_name.lower())
# Function name: print_full(a_list)
# Input: A python list
# Output: The full list
# Description: Jupyter notebooks by default do not display all the data for large datasets. It shows a few lines and summarizes
# the rest using ellipsis. This function helps to see the full python list
def print_full(a_list):
pd.set_option('display.max_rows', len(a_list))
print(a_list)
pd.reset_option('display.max_rows')
# Function name: column_details(the_dataframe, a_column)
# Input: A python dataframe and one of its columns
# Output: Details related to the column
# Description: This function provides analysis information related to a column associated to a python dataframe.
# It handles only string and numerical columns
def column_details(the_dataframe, a_column):
print("Number of unique values in the {0} column:".format(a_column), the_dataframe[a_column].unique().shape[0])
print("---------------------")
if the_dataframe[a_column].dtype == "object":
print("Statistical data related to the {0} column:\n".format(a_column),the_dataframe[a_column].describe())
else:
print("Statistical data related to the {0} column:\n".format(a_column),the_dataframe[a_column].describe().map('{:,.2f}'.format))
print("---------------------")
print("Unique counts for each entry in the {0} column:\n".format(a_column))
print(the_dataframe[a_column].value_counts().sort_index(ascending=True))
# Function name: date_column_details(the_dataframe, a_date_column)
# Input: A python dataframe and one of its date related columns
# Output: Details related to the column
# Description: This function provides analysis information related to a column associated to a python dataframe.
# It handles only date columns and is tailored for use in this project
def date_column_details(the_dataframe, a_date_column):
print("Unique counts in percentage for each entry in the {0} column:\n".format(a_date_column))
date = the_dataframe[a_date_column].str[:10]
date_counts = date.value_counts(normalize=True, dropna=False).sort_index(ascending=True)
print(date_counts)
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Below is a quick view of what the data in the file looks like.
autos.head(5)
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Issue in the data
The first issue that comes up when looking at the data is the column name. They all have camel casing and some of the names are too long. These will be fixed using the fix_column() function defined earlier
autos_columns_fixed = []
autos.columns
for a_column in autos.columns:
autos_columns_fixed.append(fix_column(a_column))
autos.columns = autos_columns_fixed
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe()
registration_year | power_p_s | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 2005.073280 | 116.355920 | 5.723360 | 0.0 | 50813.627300 |
std | 105.712813 | 209.216627 | 3.711984 | 0.0 | 25779.747957 |
min | 1000.000000 | 0.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1999.000000 | 70.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2003.000000 | 105.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 2008.000000 | 150.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 9999.000000 | 17700.000000 | 12.000000 | 0.0 | 99998.000000 |
Two columns related to the price and odometer reading of a vehicle are not displayed above. A quick look of the data will reveal that these columns contain string values and as such have been treated as strings. A fix will be required to consider them as numerical.
fixed_column_odometer = autos["odometer"].str.strip("km").str.replace(",","").astype(int)
fixed_column_price = autos["price"].str.strip("$").str.replace(",","").astype(int)
autos["odometer"] = fixed_column_odometer
autos.rename({"odometer":"odometer_km"}, axis=1, inplace=True)
autos["price"] = fixed_column_price
autos.describe(include="all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 5.000000e+04 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000.000000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-11 22:38:16 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | NaN | 25756 | 12859 | NaN | 36993 | NaN | 4024 | NaN | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | 9.840044e+03 | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | 125732.700000 | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | 4.811044e+05 | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | 40042.211706 | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 1.100000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | 1.000000e+08 | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Three columns have excactly the same data for almost every row which will have no significant use for analysis and are therefore being removed.
autos.drop(["seller", "offer_type", "nr_of_pictures"], axis=1)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | 24900 | control | limousine | 2011 | automatik | 239 | q5 | 100000 | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | 1980 | control | cabrio | 1996 | manuell | 75 | astra | 150000 | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | 13200 | test | cabrio | 2014 | automatik | 69 | 500 | 5000 | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | 22900 | control | kombi | 2013 | manuell | 150 | a3 | 40000 | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | 1250 | control | limousine | 1996 | manuell | 101 | vectra | 150000 | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 17 columns
Detailed analysis of column data
A thorough analysis of each column will be required before making an assessment of how muddled the data is. Assigning a column name in the cell below gives a statistical view of the data in the given column. The cell below that should give a full view of the spread. This second cell is only required if the count of unique data in the given column is big.
#Details related to a column:
column_name = "registration_month"
column_details(autos,column_name)
Number of unique values in the registration_month column: 13 --------------------- Statistical data related to the registration_month column: count 50,000.00 mean 5.72 std 3.71 min 0.00 25% 3.00 50% 6.00 75% 9.00 max 12.00 Name: registration_month, dtype: object --------------------- Unique counts for each entry in the registration_month column: 0 5075 1 3282 2 3008 3 5071 4 4102 5 4107 6 4368 7 3949 8 3191 9 3389 10 3651 11 3360 12 3447 Name: registration_month, dtype: int64
# Full view of the unique counts:
print_full(autos[column_name].value_counts().sort_index(ascending=True))
0 5075 1 3282 2 3008 3 5071 4 4102 5 4107 6 4368 7 3949 8 3191 9 3389 10 3651 11 3360 12 3447 Name: registration_month, dtype: int64
After an in-depth analysis of every column, the following issues have been realised:
Definition of a vehicle
Before going forward a definition is required for a vehicle listing. After a detailed analysis, the following is my definition for the listing of a vehicle for sale on the classifieds:
It must have a price above $0.00 otherwise it was not meant for a sale
It must have a valid registration year and month. Most car buyers look out for that information and most sellers make sure to put that up to ensure the car gets sold
The year of registration must be between 1949-2018. One could argue that there were cars before 1949 but I'm assuming they would be mostly collector's items. Also the number of cars below this year range are less.
The more pressing question is for cars with the years 2017 and 2018. The assumption is that this data has been manually tampered with. I doubt the system itself would have given the user the ability to set these year. Moreover there are about 800 vehicles with these years and removing them all could affect analysis considering other eliminations that are being carried out. Lastly the vehicles with these years have all their other data in order and a random sampling using their names, to search on Google reveals they are vehicles manufactured pre 2016
While I could also stipulate that the listing must have a vehicle_type, I have noticed that there are vehicle listings without vehicle types that have all their other information in order. For now I've left this out.
cleaned_autos = autos[autos["price"]!=0]
cleaned_autos = cleaned_autos[cleaned_autos["registration_year"].between(1949,2018)]
cleaned_autos = cleaned_autos[cleaned_autos["registration_month"]>0]
#cleaned_autos = cleaned_autos[cleaned_autos["vehicle_type"].notnull()]
print(cleaned_autos)
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... ... ... ... 49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V seller offer_type price abtest vehicle_type registration_year \ 0 privat Angebot 5000 control bus 2004 1 privat Angebot 8500 control limousine 1997 2 privat Angebot 8990 test limousine 2009 3 privat Angebot 4350 control kleinwagen 2007 4 privat Angebot 1350 test kombi 2003 ... ... ... ... ... ... ... 49995 privat Angebot 24900 control limousine 2011 49996 privat Angebot 1980 control cabrio 1996 49997 privat Angebot 13200 test cabrio 2014 49998 privat Angebot 22900 control kombi 2013 49999 privat Angebot 1250 control limousine 1996 gearbox power_p_s model odometer_km registration_month \ 0 manuell 158 andere 150000 3 1 automatik 286 7er 150000 6 2 manuell 102 golf 70000 7 3 automatik 71 fortwo 70000 6 4 manuell 0 focus 150000 7 ... ... ... ... ... ... 49995 automatik 239 q5 100000 1 49996 manuell 75 astra 150000 5 49997 automatik 69 500 5000 11 49998 manuell 150 a3 40000 11 49999 manuell 101 vectra 150000 1 fuel_type brand unrepaired_damage ad_created \ 0 lpg peugeot nein 2016-03-26 00:00:00 1 benzin bmw nein 2016-04-04 00:00:00 2 benzin volkswagen nein 2016-03-26 00:00:00 3 benzin smart nein 2016-03-12 00:00:00 4 benzin ford nein 2016-04-01 00:00:00 ... ... ... ... ... 49995 diesel audi nein 2016-03-27 00:00:00 49996 benzin opel nein 2016-03-28 00:00:00 49997 benzin fiat nein 2016-04-02 00:00:00 49998 diesel audi nein 2016-03-08 00:00:00 49999 benzin opel nein 2016-03-13 00:00:00 nr_of_pictures postal_code last_seen 0 0 79588 2016-04-06 06:45:54 1 0 71034 2016-04-06 14:45:08 2 0 35394 2016-04-06 20:15:37 3 0 33729 2016-03-15 03:16:28 4 0 39218 2016-04-01 14:38:50 ... ... ... ... 49995 0 82131 2016-04-01 13:47:40 49996 0 44807 2016-04-02 14:18:02 49997 0 73430 2016-04-04 11:47:27 49998 0 35683 2016-04-05 16:45:07 49999 0 45897 2016-04-06 21:18:48 [44074 rows x 20 columns]
Changing values from German to English
The data in some of the columns contain German. Further analysis could be slowed if we have to keep looking at the transalation each time. Values in German will therefore have to be changed to English.
cleaned_autos["vehicle_type"] = cleaned_autos["vehicle_type"].map({"andere":"others", "bus":"bus",
"cabrios":"convertible", "coupe":"coupe",
"kleinewagen":"small_car", "kombi":"combi",
"limousine":"limousine", "suv":"suv"},
na_action='ignore')
cleaned_autos["gearbox"] = cleaned_autos["gearbox"].map({"manuell":"manual", "automatik":"automatic"}, na_action = 'ignore')
cleaned_autos["fuel_type"] = cleaned_autos["fuel_type"].map({"andere":"others", "benzin":"gasoline",
"cng":"cng", "diesel":"diesel",
"elektro":"electric", "hybrid":"hybrid", "lpg":"lpg"},
na_action='ignore')
cleaned_autos["unrepaired_damage"] = cleaned_autos["unrepaired_damage"].map({"ja":"yes", "nein":"no"}, na_action = 'ignore')
print(cleaned_autos)
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... ... ... ... 49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V seller offer_type price abtest vehicle_type registration_year \ 0 privat Angebot 5000 control bus 2004 1 privat Angebot 8500 control limousine 1997 2 privat Angebot 8990 test limousine 2009 3 privat Angebot 4350 control NaN 2007 4 privat Angebot 1350 test combi 2003 ... ... ... ... ... ... ... 49995 privat Angebot 24900 control limousine 2011 49996 privat Angebot 1980 control NaN 1996 49997 privat Angebot 13200 test NaN 2014 49998 privat Angebot 22900 control combi 2013 49999 privat Angebot 1250 control limousine 1996 gearbox power_p_s model odometer_km registration_month \ 0 manual 158 andere 150000 3 1 automatic 286 7er 150000 6 2 manual 102 golf 70000 7 3 automatic 71 fortwo 70000 6 4 manual 0 focus 150000 7 ... ... ... ... ... ... 49995 automatic 239 q5 100000 1 49996 manual 75 astra 150000 5 49997 automatic 69 500 5000 11 49998 manual 150 a3 40000 11 49999 manual 101 vectra 150000 1 fuel_type brand unrepaired_damage ad_created \ 0 lpg peugeot no 2016-03-26 00:00:00 1 gasoline bmw no 2016-04-04 00:00:00 2 gasoline volkswagen no 2016-03-26 00:00:00 3 gasoline smart no 2016-03-12 00:00:00 4 gasoline ford no 2016-04-01 00:00:00 ... ... ... ... ... 49995 diesel audi no 2016-03-27 00:00:00 49996 gasoline opel no 2016-03-28 00:00:00 49997 gasoline fiat no 2016-04-02 00:00:00 49998 diesel audi no 2016-03-08 00:00:00 49999 gasoline opel no 2016-03-13 00:00:00 nr_of_pictures postal_code last_seen 0 0 79588 2016-04-06 06:45:54 1 0 71034 2016-04-06 14:45:08 2 0 35394 2016-04-06 20:15:37 3 0 33729 2016-03-15 03:16:28 4 0 39218 2016-04-01 14:38:50 ... ... ... ... 49995 0 82131 2016-04-01 13:47:40 49996 0 44807 2016-04-02 14:18:02 49997 0 73430 2016-04-04 11:47:27 49998 0 35683 2016-04-05 16:45:07 49999 0 45897 2016-04-06 21:18:48 [44074 rows x 20 columns]
Identifying a Price range
Identifying a valid price range will require further analysis of the price column.
#Details related to a column:
column_name = "price"
column_details(cleaned_autos,column_name)
Number of unique values in the price column: 2316 --------------------- Statistical data related to the price column: count 44,074.00 mean 10,374.52 std 506,282.97 min 1.00 25% 1,399.00 50% 3,400.00 75% 7,900.00 max 99,999,999.00 Name: price, dtype: object --------------------- Unique counts for each entry in the price column: 1 88 2 1 3 1 5 1 9 1 .. 10000000 1 11111111 1 12345678 2 27322222 1 99999999 1 Name: price, Length: 2316, dtype: int64
# Full view of the unique counts:
print_full(cleaned_autos[column_name].value_counts().sort_index(ascending=True))
1 88 2 1 3 1 5 1 9 1 10 2 11 1 13 1 17 1 20 1 30 2 40 1 45 1 47 1 49 1 50 26 55 1 60 5 65 1 66 1 70 3 75 2 79 1 80 9 89 1 90 4 99 15 100 83 110 2 111 2 115 2 117 1 120 22 125 3 129 1 130 7 135 1 139 1 140 6 145 1 149 7 150 149 156 2 160 5 170 3 173 1 175 9 179 1 180 24 185 1 188 1 190 11 193 1 195 2 198 1 199 30 200 167 215 1 217 1 219 1 220 24 222 10 225 5 230 7 235 2 238 1 240 3 248 1 249 12 250 193 255 1 260 3 269 1 270 3 275 6 280 24 290 17 295 1 299 41 300 252 310 1 320 5 325 2 329 2 330 5 333 14 340 8 349 14 350 249 356 1 359 1 360 7 370 18 375 4 379 1 380 20 385 1 388 1 390 23 395 2 399 58 400 225 410 1 414 1 420 4 425 3 430 8 435 1 440 1 444 5 449 12 450 204 459 1 460 3 470 5 475 4 480 19 485 2 490 26 495 3 499 74 500 630 501 1 510 2 517 1 520 7 525 4 530 6 540 2 549 10 550 296 554 1 555 25 560 4 566 1 570 6 575 5 579 1 580 16 590 48 595 2 598 2 599 110 600 432 606 1 620 6 625 3 628 1 630 2 640 3 644 1 648 1 649 11 650 353 660 4 666 11 669 2 670 6 675 5 679 1 680 23 686 1 689 2 690 62 695 3 699 119 700 320 710 1 719 1 720 7 725 2 729 1 730 4 740 3 744 1 745 2 749 19 750 360 755 1 760 6 770 6 777 17 780 17 785 1 789 1 790 43 795 2 799 111 800 407 810 2 820 6 825 3 829 1 830 5 834 1 839 1 840 5 846 1 849 12 850 363 855 2 860 1 870 11 875 3 879 1 880 12 885 1 888 28 889 1 890 64 895 3 898 1 899 117 900 341 910 1 919 1 920 1 925 1 930 4 940 2 945 2 949 11 950 344 951 1 958 1 960 1 965 2 970 6 975 2 980 44 985 4 989 1 990 131 995 4 996 1 998 4 999 388 1000 513 1039 1 1040 1 1049 5 1050 86 1059 1 1070 1 1080 5 1090 3 1095 3 1098 1 1099 37 1100 322 1111 33 1112 1 1120 2 1149 10 1150 197 1169 1 1170 1 1180 3 1189 1 1190 35 1195 1 1199 115 1200 542 1201 2 1209 1 1212 2 1221 1 1222 3 1234 2 1240 2 1247 1 1249 4 1250 297 1265 1 1270 3 1275 1 1280 12 1285 3 1290 36 1295 1 1299 121 1300 327 1310 1 1325 1 1330 1 1333 2 1340 2 1345 2 1349 5 1350 246 1355 1 1370 3 1375 2 1379 1 1380 9 1385 1 1390 65 1395 5 1398 2 1399 91 1400 255 1414 1 1420 1 1425 2 1430 3 1432 1 1440 2 1444 3 1448 2 1449 7 1450 231 1466 1 1470 2 1475 2 1480 8 1485 1 1490 83 1494 1 1495 3 1497 1 1498 1 1499 142 1500 633 1520 2 1525 2 1530 2 1545 1 1549 5 1550 136 1555 7 1560 1 1570 2 1575 2 1579 1 1580 5 1590 39 1595 3 1599 88 1600 274 1630 2 1640 1 1642 1 1645 1 1649 7 1650 219 1659 1 1660 2 1666 3 1670 1 1675 2 1680 8 1689 1 1690 41 1695 2 1698 1 1699 80 1700 236 1730 1 1749 4 1750 225 1755 1 1759 1 1760 1 1765 1 1770 1 1775 1 1777 4 1780 1 1790 40 1795 1 1797 1 1798 1 1799 69 1800 304 1820 1 1840 3 1845 1 1849 5 1850 206 1856 1 1860 1 1870 1 1875 2 1880 5 1888 3 1890 35 1895 1 1899 48 1900 209 1906 1 1925 1 1930 1 1933 1 1935 1 1945 1 1949 1 1950 189 1955 1 1960 1 1970 2 1975 1 1980 15 1981 1 1984 1 1985 1 1990 113 1995 4 1996 1 1998 4 1999 296 2000 394 2001 1 2004 1 2033 1 2035 1 2050 26 2070 1 2090 2 2095 1 2099 21 2100 186 2111 2 2128 1 2134 1 2140 1 2149 2 2150 93 2175 1 2180 1 2190 29 2195 1 2199 48 2200 345 2210 6 2222 24 2225 1 2241 1 2245 1 2249 2 2250 139 2265 2 2270 1 2280 6 2288 1 2290 28 2295 2 2299 38 2300 261 2321 2 2333 5 2340 3 2349 2 2350 123 2359 1 2380 2 2390 30 2398 2 2399 52 2400 179 2410 1 2430 1 2444 4 2449 3 2450 117 2455 1 2459 1 2460 1 2470 2 2479 1 2480 8 2490 65 2496 1 2498 1 2499 130 2500 569 2549 1 2550 60 2555 5 2569 1 2570 1 2580 2 2589 1 2590 19 2598 1 2599 46 2600 219 2649 3 2650 118 2651 1 2655 1 2660 2 2666 3 2670 1 2671 1 2680 2 2690 22 2695 1 2699 31 2700 197 2725 1 2749 7 2750 161 2775 2 2777 3 2780 2 2785 1 2789 1 2790 28 2798 1 2799 47 2800 263 2849 1 2850 124 2860 3 2870 1 2874 1 2880 1 2888 3 2890 26 2895 1 2899 29 2900 248 2910 1 2920 1 2944 1 2949 4 2950 168 2980 7 2985 1 2986 1 2989 1 2990 129 2993 1 2995 5 2997 1 2998 4 2999 220 3000 321 3001 2 3010 1 3012 1 3020 1 3049 1 3050 9 3055 1 3060 2 3075 1 3080 1 3099 11 3100 121 3111 2 3119 1 3120 1 3122 1 3125 1 3129 1 3149 3 3150 57 3178 1 3180 2 3190 12 3195 1 3199 37 3200 241 3211 1 3220 1 3222 2 3230 2 3240 2 3249 3 3250 105 3260 1 3266 1 3279 1 3280 3 3285 1 3290 21 3295 1 3299 35 3300 178 3318 1 3330 3 3333 34 3349 1 3350 58 3360 1 3380 4 3390 16 3395 1 3399 24 3400 130 3410 1 3420 1 3425 1 3444 2 3450 80 3455 1 3456 1 3470 2 3475 1 3480 5 3485 1 3490 42 3495 5 3498 2 3499 77 3500 450 3540 1 3549 3 3550 44 3555 4 3560 1 3580 1 3590 13 3599 20 3600 117 3620 1 3633 1 3650 79 3660 1 3666 6 3670 3 3680 2 3690 27 3699 35 3700 148 3710 1 3725 1 3749 2 3750 103 3760 1 3774 1 3777 3 3780 1 3790 24 3795 1 3799 39 3800 242 3849 1 3850 74 3876 1 3879 1 3880 2 3888 4 3890 17 3895 2 3899 28 3900 228 3940 2 3945 1 3949 1 3950 130 3952 1 3955 2 3957 1 3965 1 3969 1 3970 1 3975 1 3980 7 3989 1 3990 100 3993 1 3995 3 3996 1 3998 3 3999 198 4000 215 4004 1 4005 1 4050 2 4090 3 4099 8 4100 75 4111 1 4123 1 4125 1 4149 1 4150 36 4159 1 4180 1 4190 14 4199 28 4200 209 4201 1 4210 1 4220 1 4222 1 4239 1 4249 4 4250 90 4270 1 4275 1 4279 1 4280 1 4286 1 4290 14 4295 1 4299 24 4300 126 4320 1 4333 1 4335 1 4349 1 4350 43 4390 19 4398 1 4399 26 4400 84 4440 1 4444 22 4450 35 4470 2 4475 1 4480 2 4485 1 4490 37 4495 1 4497 2 4498 2 4499 55 4500 362 4510 1 4525 1 4545 1 4549 1 4550 27 4555 2 4567 1 4580 2 4590 16 4598 1 4599 13 4600 80 4649 1 4650 59 4655 1 4666 2 4680 2 4690 13 4699 20 4700 93 4740 1 4750 63 4755 1 4777 1 4780 1 4790 13 4799 14 4800 172 4830 2 4840 1 4850 58 4855 1 4860 1 4877 1 4888 2 4890 7 4895 1 4899 18 4900 182 4949 3 4950 122 4955 1 4980 10 4985 2 4990 95 4994 1 4995 7 4998 4 4999 155 5000 216 5012 1 5015 1 5049 1 5050 3 5099 4 5100 45 5120 1 5150 15 5180 1 5185 1 5190 4 5195 1 5198 1 5199 18 5200 148 5222 3 5248 1 5249 2 5250 54 5255 1 5290 15 5298 1 5299 15 5300 83 5333 3 5350 32 5380 2 5390 10 5399 14 5400 70 5444 1 5450 37 5454 1 5475 1 5485 1 5489 1 5490 46 5495 1 5499 47 5500 322 5540 1 5550 17 5555 30 5590 5 5599 5 5600 84 5634 1 5650 28 5666 2 5685 1 5689 1 5690 9 5695 2 5699 11 5700 61 5740 1 5749 1 5750 40 5790 7 5799 17 5800 118 5849 1 5850 39 5855 1 5870 2 5879 1 5880 2 5888 2 5890 13 5895 1 5899 10 5900 204 5913 1 5924 1 5949 1 5950 80 5980 3 5990 84 5995 6 5997 1 5998 1 5999 135 6000 175 6039 1 6050 1 6099 4 6100 40 6140 1 6149 2 6150 16 6155 1 6170 1 6190 4 6195 1 6199 17 6200 102 6202 1 6208 1 6222 2 6249 2 6250 46 6279 1 6290 11 6295 1 6299 24 6300 96 6333 1 6340 1 6350 27 6389 1 6390 9 6399 10 6400 58 6410 1 6444 1 6447 1 6450 42 6460 1 6470 1 6479 1 6480 5 6490 38 6495 1 6498 1 6499 48 6500 285 6530 1 6543 1 6545 1 6550 16 6578 1 6590 7 6599 8 6600 50 6650 17 6666 14 6680 3 6690 7 6695 1 6699 10 6700 64 6725 1 6745 1 6749 1 6750 57 6755 1 6770 1 6777 2 6790 11 6792 1 6795 2 6799 8 6800 125 6840 2 6849 2 6850 41 6870 2 6879 1 6888 2 6889 1 6890 16 6895 1 6899 16 6900 159 6911 1 6920 1 6940 1 6950 67 6969 1 6970 3 6977 1 6979 1 6980 5 6989 2 6990 74 6995 3 6998 1 6999 101 7000 154 7050 1 7085 1 7099 2 7100 23 7150 14 7190 7 7199 9 7200 100 7234 1 7240 1 7249 1 7250 49 7280 1 7290 7 7295 1 7299 15 7300 67 7333 1 7350 29 7380 1 7390 8 7395 2 7399 13 7400 60 7420 1 7430 1 7435 1 7449 1 7450 32 7460 1 7470 1 7490 43 7498 2 7499 56 7500 202 7550 6 7555 2 7590 8 7599 11 7600 67 7650 26 7655 1 7680 1 7690 8 7699 7 7700 56 7750 29 7777 22 7790 8 7795 1 7799 11 7800 97 7820 1 7825 1 7840 1 7849 2 7850 25 7880 2 7888 3 7890 3 7899 15 7900 148 7930 1 7949 2 7950 65 7980 6 7985 1 7990 67 7995 2 7999 109 8000 151 8008 1 8050 2 8098 1 8099 1 8100 24 8150 16 8180 1 8190 4 8199 7 8200 66 8222 2 8240 1 8249 2 8250 35 8285 1 8290 5 8299 13 8300 46 8333 2 8345 1 8350 19 8360 1 8390 8 8399 6 8400 42 8419 1 8420 1 8449 1 8450 31 8490 27 8495 1 8499 39 8500 214 8550 15 8555 2 8590 4 8595 1 8599 10 8600 36 8650 18 8655 1 8666 2 8690 5 8699 8 8700 58 8720 1 8750 38 8777 1 8790 6 8798 1 8799 9 8800 68 8850 16 8854 1 8875 1 8880 2 8888 16 8890 4 8899 2 8900 140 8930 1 8944 1 8950 51 8970 1 8980 1 8990 68 8995 2 8999 89 9000 106 9050 1 9099 2 9100 18 9130 1 9149 1 9150 10 9190 2 9195 1 9199 7 9200 62 9203 1 9249 1 9250 35 9270 1 9280 1 9290 9 9299 6 9300 43 9325 1 9349 2 9350 17 9390 4 9399 9 9400 28 9450 23 9455 1 9480 1 9490 7 9499 30 9500 196 9550 2 9577 1 9590 1 9599 3 9600 25 9650 11 9655 1 9666 1 9690 5 9699 6 9700 29 9725 1 9750 28 9773 1 9777 2 9790 7 9799 7 9800 76 9845 1 9850 22 9870 2 9890 11 9895 2 9899 2 9900 132 9950 32 9970 1 9975 2 9980 10 9989 1 9990 82 9995 5 9998 2 9999 120 10000 99 10009 1 10050 2 10080 1 10090 1 10100 9 10111 1 10150 3 10175 1 10199 7 10200 21 10230 1 10250 14 10280 2 10290 5 10299 9 10300 25 10333 1 10350 11 10390 1 10399 3 10400 30 10450 12 10480 1 10488 1 10490 24 10495 1 10498 1 10499 22 10500 179 10550 6 10555 1 10556 1 10590 3 10599 5 10600 21 10650 8 10666 2 10680 1 10690 1 10699 3 10700 24 10750 10 10777 1 10790 1 10799 5 10800 56 10830 1 10850 8 10880 2 10888 1 10890 2 10899 3 10900 90 10949 1 10950 25 10975 2 10980 5 10988 1 10989 1 10990 50 10995 1 10998 3 10999 51 11000 94 11090 1 11100 5 11111 5 11140 1 11149 1 11150 1 11199 2 11200 32 11240 1 11250 11 11270 1 11299 3 11300 26 11350 6 11360 1 11389 1 11390 3 11399 7 11400 28 11430 1 11448 1 11450 13 11480 3 11490 15 11495 1 11499 20 11500 139 11540 1 11550 4 11555 1 11590 1 11599 1 11600 11 11650 6 11666 1 11687 1 11690 3 11699 3 11700 15 11749 1 11750 13 11780 1 11790 1 11799 2 11800 33 11825 1 11850 14 11888 1 11890 4 11899 1 11900 93 11930 2 11950 29 11960 1 11970 1 11980 4 11982 1 11989 1 11990 41 11995 4 11999 41 12000 102 12100 8 12150 2 12190 2 12199 1 12200 24 12250 13 12290 3 12299 5 12300 22 12333 1 12340 1 12349 1 12350 7 12390 1 12395 1 12399 6 12400 24 12449 2 12450 12 12470 1 12480 1 12490 20 12499 20 12500 135 12549 2 12550 1 12560 1 12580 1 12590 2 12599 1 12600 13 12690 1 12696 1 12699 2 12700 25 12750 18 12777 1 12780 1 12790 6 12799 1 12800 59 12850 7 12860 1 12888 1 12889 1 12890 4 12898 1 12899 7 12900 106 12950 24 12970 1 12980 3 12986 1 12990 21 12995 1 12998 1 12999 50 13000 91 13049 1 13100 6 13150 2 13199 2 13200 30 13250 6 13290 1 13299 3 13300 15 13333 2 13349 1 13350 5 13362 1 13383 1 13387 1 13390 3 13399 1 13400 13 13450 11 13480 1 13490 12 13498 1 13499 15 13500 124 13520 1 13550 5 13555 2 13560 1 13599 4 13600 11 13650 4 13680 1 13700 6 13750 10 13780 2 13790 1 13798 1 13799 5 13800 36 13850 6 13888 1 13890 4 13899 2 13900 97 13911 1 13950 20 13970 2 13975 1 13980 4 13988 1 13990 34 13995 3 13999 37 14000 67 14050 1 14100 2 14190 2 14199 1 14200 27 14230 1 14250 11 14290 2 14299 3 14300 8 14321 1 14337 1 14350 3 14390 2 14399 3 14400 7 14444 2 14450 8 14489 1 14490 10 14498 1 14499 17 14500 111 14525 1 14550 3 14590 1 14599 3 14600 15 14699 5 14700 14 14750 9 14790 1 14800 25 14850 12 14863 1 14880 1 14888 1 14890 7 14899 1 14900 84 14950 20 14980 3 14986 2 14990 34 14995 2 14998 1 14999 53 15000 73 15099 1 15100 1 15186 1 15199 1 15200 22 15249 1 15250 9 15290 1 15298 1 15299 3 15300 15 15350 3 15390 1 15399 3 15400 13 15413 1 15450 7 15470 1 15480 1 15490 12 15492 1 15499 13 15500 80 15550 2 15555 3 15590 3 15600 11 15650 2 15670 1 15700 13 15749 1 15750 6 15777 1 15799 3 15800 26 15823 1 15850 5 15870 1 15880 3 15888 1 15890 5 15899 3 15900 68 15911 1 15950 12 15990 33 15999 42 16000 49 16100 2 16150 2 16190 1 16200 16 16250 9 16290 3 16299 4 16300 11 16333 2 16350 9 16375 1 16390 2 16399 3 16400 16 16450 6 16480 2 16490 11 16499 9 16500 88 16550 2 16590 4 16599 2 16600 6 16650 1 16660 1 16666 1 16680 1 16690 1 16699 1 16700 17 16750 5 16790 5 16799 3 16800 27 16830 1 16845 1 16850 3 16880 2 16890 4 16899 3 16900 80 16950 6 16960 1 16975 1 16980 2 16990 14 16995 1 16998 1 16999 26 17000 43 17100 3 17149 1 17190 2 17200 13 17249 1 17250 7 17290 1 17299 3 17300 13 17350 3 17400 13 17419 1 17450 6 17490 10 17499 10 17500 78 17520 1 17550 1 17590 2 17599 1 17600 6 17650 2 17695 1 17699 2 17700 12 17750 5 17777 3 17799 1 17800 17 17830 1 17850 9 17890 3 17899 1 17900 58 17950 11 17980 1 17989 1 17990 19 17996 1 17999 25 18000 39 18090 1 18100 2 18150 1 18200 8 18250 6 18299 3 18300 8 18310 1 18350 2 18399 1 18400 6 18430 1 18450 7 18470 1 18490 8 18499 8 18500 62 18599 2 18600 8 18650 2 18699 2 18700 11 18750 4 18790 3 18798 1 18799 2 18800 16 18850 3 18888 2 18890 1 18900 45 18940 1 18950 8 18977 1 18990 14 18999 20 19000 24 19100 1 19121 1 19199 2 19200 9 19299 4 19300 6 19350 2 19399 1 19400 4 19444 1 19450 2 19480 1 19490 4 19499 2 19500 82 19550 1 19599 1 19600 6 19650 3 19666 1 19690 1 19699 1 19700 4 19750 3 19777 1 19780 1 19800 20 19850 4 19890 2 19900 49 19911 1 19950 8 19968 1 19970 1 19980 1 19990 27 19995 4 19998 2 19999 27 20000 29 20123 1 20150 1 20199 1 20200 5 20222 1 20250 1 20290 1 20299 1 20300 1 20390 2 20399 1 20400 4 20450 1 20480 1 20490 6 20495 2 20499 4 20500 27 20550 3 20589 1 20590 1 20595 1 20650 3 20700 2 20750 4 20790 1 20799 1 20800 11 20850 4 20888 1 20890 3 20900 34 20911 1 20950 6 20980 1 20985 1 20987 1 20990 14 20999 10 21000 30 21049 1 21200 4 21250 2 21275 1 21299 1 21300 5 21400 8 21444 1 21450 2 21490 7 21499 8 21500 41 21590 2 21599 1 21600 3 21690 1 21700 1 21750 3 21780 2 21790 1 21800 4 21850 3 21888 1 21890 2 21900 37 21950 6 21959 1 21980 2 21990 14 21995 1 21999 14 22000 26 22199 1 22222 1 22250 3 22280 1 22290 1 22299 1 22300 6 22399 1 22400 8 22450 3 22490 6 22499 1 22500 47 22555 1 22580 1 22600 3 22650 2 22690 1 22700 2 22750 2 22790 1 22799 1 22800 10 22850 3 22880 1 22890 1 22899 1 22900 38 22950 8 22980 1 22989 1 22990 5 22996 1 22999 7 23000 26 23123 1 23200 4 23233 1 23250 3 23300 1 23333 2 23350 1 23400 4 23450 1 23490 5 23499 8 23500 46 23540 1 23590 1 23600 4 23650 1 23700 4 23750 2 23790 1 23799 1 23800 6 23850 1 23890 1 23899 1 23900 25 23950 5 23989 1 23990 6 23999 10 24000 20 24150 1 24200 6 24242 1 24290 1 24300 9 24350 1 24360 1 24400 3 24444 1 24450 2 24480 1 24490 3 24499 5 24500 32 24550 1 24600 3 24699 1 24750 2 24790 1 24800 3 24850 3 24888 1 24890 3 24895 1 24900 24 24950 7 24980 2 24987 1 24990 7 24999 19 25000 27 25100 1 25200 4 25250 2 25300 1 25350 3 25390 1 25400 2 25450 1 25490 1 25499 2 25500 24 25590 1 25600 2 25650 2 25698 1 25699 1 25750 2 25800 5 25850 1 25880 1 25900 23 25980 3 25987 2 25990 11 25999 9 26000 17 26150 1 26200 3 26250 2 26294 1 26300 7 26390 2 26400 3 26450 2 26499 6 26500 29 26600 1 26660 1 26700 2 26725 1 26750 2 26800 5 26849 1 26850 3 26899 2 26900 23 26950 3 26990 5 26999 12 27000 22 27020 1 27250 1 27280 1 27299 1 27300 5 27350 1 27400 2 27450 2 27490 4 27499 6 27500 26 27600 1 27650 1 27690 1 27700 3 27750 2 27777 3 27800 5 27850 2 27900 23 27950 3 27990 7 27999 8 28000 21 28200 3 28290 1 28300 1 28399 1 28400 1 28450 5 28470 1 28490 2 28499 3 28500 19 28600 3 28700 1 28750 2 28800 4 28850 1 28900 16 28950 3 28990 5 28999 5 29000 6 29200 3 29250 1 29300 1 29333 2 29350 2 29380 1 29400 1 29445 1 29450 1 29490 3 29499 2 29500 20 29550 1 29600 1 29699 1 29700 2 29750 2 29777 1 29800 4 29850 2 29890 1 29900 14 29950 5 29970 1 29980 1 29989 1 29990 9 29999 15 30000 10 30200 1 30400 1 30499 2 30500 7 30570 1 30650 1 30700 2 30800 1 30888 1 30899 1 30900 5 30933 1 30950 1 30987 1 30990 2 30999 2 31000 8 31111 1 31199 2 31200 1 31313 1 31400 1 31450 1 31490 3 31499 2 31500 13 31600 1 31799 2 31800 5 31900 12 31950 2 31990 4 31999 8 32000 10 32150 1 32222 2 32400 2 32490 1 32500 16 32700 2 32900 7 32949 1 32950 2 32990 3 32999 4 33000 8 33200 1 33300 1 33449 1 33490 2 33499 1 33500 8 33600 1 33650 1 33700 2 33750 1 33777 1 33800 3 33850 1 33900 10 33950 2 33980 1 33990 3 33999 3 34000 8 34150 1 34490 1 34499 2 34500 15 34550 1 34650 1 34750 1 34800 4 34850 1 34890 1 34900 12 34940 1 34950 2 34980 1 34990 3 34996 1 34999 4 35000 9 35222 1 35370 1 35450 1 35460 1 35499 1 35500 3 35555 1 35700 1 35790 2 35800 1 35890 2 35900 9 35911 1 35950 1 35980 1 35990 1 35999 2 36000 12 36500 4 36675 1 36800 1 36850 1 36900 4 36999 2 37000 4 37300 1 37400 1 37450 1 37480 1 37499 1 37500 4 37700 1 37750 1 37800 4 37850 2 37900 9 37950 1 37999 3 38000 4 38200 1 38400 1 38450 1 38500 7 38700 1 38760 1 38800 2 38850 1 38900 8 38950 1 38990 2 38999 2 39000 2 39300 2 39500 9 39700 1 39800 1 39890 1 39900 12 39911 1 39979 1 39990 1 39999 6 40000 3 40499 1 40500 2 40800 1 40850 1 40900 3 40990 1 40999 2 41000 3 41500 6 41800 1 41850 1 41900 5 41999 1 42000 3 42222 1 42500 3 42750 1 42760 1 42900 7 42990 2 42996 2 42999 1 43000 2 43461 1 43500 2 43900 5 44000 4 44200 2 44444 1 44497 1 44499 2 44500 1 44777 1 44900 5 44990 1 44996 1 45000 5 45500 4 45800 1 45900 3 45949 1 45950 1 46000 2 46200 1 46500 1 46800 1 46900 3 46911 1 46990 1 46999 1 47000 4 47499 1 47500 3 47800 1 47900 3 47950 1 47997 1 48000 4 48300 1 48490 1 48500 4 48600 1 48700 2 48850 1 48888 1 48900 3 48999 2 49000 3 49500 4 49900 4 49999 3 50000 1 50500 2 50900 1 51000 1 51500 1 51900 2 51990 1 51999 1 52500 3 52900 5 52911 1 53000 4 53500 2 53900 2 54500 2 54990 1 55000 5 55500 1 55555 1 55800 1 55900 1 55996 1 55999 1 56000 4 56500 2 56800 1 56900 1 57800 1 58500 2 58700 1 58900 1 59000 3 59500 3 59850 1 60000 2 61500 1 61900 1 61950 1 61999 1 62000 2 62900 2 63000 1 63499 1 63999 2 64280 1 64500 2 64600 1 64900 3 64990 1 64999 1 65000 1 65699 1 65700 1 65990 1 66500 1 66964 1 67000 1 67911 1 68000 1 68300 1 68500 1 68750 1 68900 1 69500 1 69900 1 69993 1 69997 1 69999 1 70000 1 70850 1 71000 1 72500 1 72600 1 72900 1 73500 1 73900 1 73996 1 74900 3 74999 2 75000 1 75900 1 75997 1 76997 1 78911 1 79500 1 79933 1 79980 1 79999 1 80000 3 82987 1 83000 1 84000 1 84997 1 85000 1 86500 1 88900 1 89000 1 89900 1 93000 2 93911 1 94999 1 98500 1 99000 2 99900 2 104900 1 105000 2 109999 1 114400 1 115000 1 115991 1 116000 1 119500 1 119900 1 120000 2 128000 1 129000 1 130000 1 135000 1 137999 1 139997 1 145000 1 151990 1 155000 1 163500 1 163991 1 169000 1 169999 1 175000 1 180000 1 190000 1 194000 1 197000 1 198000 1 220000 1 259000 1 265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 1 12345678 2 27322222 1 99999999 1 Name: price, dtype: int64
After analysis of the data I believe the ideal price range for a vehicle listing to be considered as valid is 200-350000. the reasons for the same are below
refined_autos = cleaned_autos[cleaned_autos["price"].between(200,350000)]
column_details(refined_autos, "price")
Number of unique values in the price column: 2250 --------------------- Statistical data related to the price column: count 43,509.00 mean 6,312.19 std 9,310.05 min 200.00 25% 1,450.00 50% 3,499.00 75% 7,990.00 max 350,000.00 Name: price, dtype: object --------------------- Unique counts for each entry in the price column: 200 167 215 1 217 1 219 1 220 24 ... 265000 1 295000 1 299000 1 345000 1 350000 1 Name: price, Length: 2250, dtype: int64
Analysis of Date columns
The following cells allow analysis of the date related columns
date_column = "ad_created"
date_column_details(autos,date_column)
Unique counts in percentage for each entry in the ad_created column: 2015-06-11 0.00002 2015-08-10 0.00002 2015-09-09 0.00002 2015-11-10 0.00002 2015-12-05 0.00002 ... 2016-04-03 0.03892 2016-04-04 0.03688 2016-04-05 0.01184 2016-04-06 0.00326 2016-04-07 0.00128 Name: ad_created, Length: 76, dtype: float64
print_full(refined_autos["ad_created"].value_counts(normalize=True).sort_index(ascending=True))
I have not been able to extract anything significant from date related data that could contribute to the above mentioned goal. A trend that I did notice was that during the month of March there was a linear increase in the creation of car ads. Its relevance, however would have to be part of another analysis.
Insights from the cleaned data
Since we now have a cleaner dataset from when we started, it is possible glean some insights from the data.
Question:How do German manufactured, most likely second hand vehicles, compare to each other when it comes to price?
print_full(refined_autos["brand"].value_counts())
volkswagen 9119 bmw 4866 opel 4508 mercedes_benz 4338 audi 3838 ford 2946 renault 2029 peugeot 1294 fiat 1092 seat 824 skoda 735 nissan 679 mazda 643 smart 641 citroen 611 toyota 578 hyundai 447 volvo 404 mini 404 sonstige_autos 389 honda 349 mitsubishi 348 kia 331 alfa_romeo 287 porsche 272 suzuki 254 chevrolet 252 chrysler 151 dacia 121 jeep 102 daihatsu 101 land_rover 93 subaru 84 saab 74 jaguar 68 daewoo 67 rover 58 lancia 46 trabant 42 lada 24 Name: brand, dtype: int64
german_brands = ["volkswagen", "bmw", "opel", "mercedes_benz", "audi", "ford", "porsche"]
brand_mean_price = {}
for a_brand in german_brands:
brand_mean_price[a_brand] = refined_autos[refined_autos["brand"] == a_brand]["price"].mean()
print(brand_mean_price)
mean_price_series = pd.Series(brand_mean_price)
{'volkswagen': 5757.820923346858, 'bmw': 8635.09946568023, 'opel': 3217.0858473824314, 'mercedes_benz': 8856.378284923929, 'audi': 9715.781136008338, 'ford': 4029.385268160217, 'porsche': 47803.96323529412}
Insights
Question: An earlier insight revealed the brands and their average prices. However, these prices do not take in to account the distance covered by these vehicles. The more distance a vehicle has covered, the less its asking price (vice-versa may not be true as the age of the car also causes it to depreciate in value). Does this stand?
brand_mean_mileage = {}
for a_brand in german_brands:
brand_mean_mileage[a_brand] = refined_autos[refined_autos["brand"] == a_brand ]["odometer_km"].mean()
print(brand_mean_mileage)
mean_mileage_series = pd.Series(brand_mean_mileage)
{'volkswagen': 128296.41408049127, 'bmw': 132672.62638717634, 'opel': 128738.90860692103, 'mercedes_benz': 130992.3928077455, 'audi': 128870.5054715998, 'ford': 123906.99253224711, 'porsche': 98198.5294117647}
brand_mean = pd.DataFrame(mean_price_series, columns = ["brand_mean_price"])
brand_mean["brand_mean_milage"] = mean_mileage_series
print(brand_mean)
brand_mean_price brand_mean_milage volkswagen 5757.820923 128296.414080 bmw 8635.099466 132672.626387 opel 3217.085847 128738.908607 mercedes_benz 8856.378285 130992.392808 audi 9715.781136 128870.505472 ford 4029.385268 123906.992532 porsche 47803.963235 98198.529412
Insight
Question: What are the most common brand-model pairs?
common_brand_model = {}
for a_brand in refined_autos["brand"].unique():
most_common_model=refined_autos[refined_autos["brand"] == a_brand]["model"].value_counts()
if most_common_model.empty != True:
common_brand_model[a_brand] = most_common_model.index[0]
print(common_brand_model)
{'peugeot': '2_reihe', 'bmw': '3er', 'volkswagen': 'golf', 'smart': 'fortwo', 'ford': 'focus', 'chrysler': 'andere', 'seat': 'ibiza', 'renault': 'twingo', 'audi': 'a4', 'opel': 'corsa', 'mazda': '3_reihe', 'porsche': '911', 'mini': 'cooper', 'mercedes_benz': 'c_klasse', 'toyota': 'yaris', 'dacia': 'sandero', 'nissan': 'micra', 'jeep': 'grand', 'saab': 'andere', 'volvo': 'v40', 'mitsubishi': 'colt', 'jaguar': 'andere', 'fiat': 'punto', 'skoda': 'octavia', 'subaru': 'legacy', 'kia': 'andere', 'citroen': 'andere', 'chevrolet': 'andere', 'hyundai': 'i_reihe', 'honda': 'civic', 'daewoo': 'matiz', 'suzuki': 'andere', 'trabant': '601', 'land_rover': 'freelander', 'alfa_romeo': '147', 'daihatsu': 'cuore', 'rover': 'andere', 'lancia': 'ypsilon', 'lada': 'niva'}
Insights
Question: How does damage affect the price of a vehicle?
repaired_mean_price = {}
non_repaired_mean_price = {}
#refined_autos["brand"].unique()
#german_brands
for a_brand in refined_autos["brand"].unique():
the_brand = refined_autos[refined_autos["brand"] == a_brand]
its_repaired_vehicles = the_brand[refined_autos["unrepaired_damage"] == "no"]
its_non_repaired_vehicles = the_brand[refined_autos["unrepaired_damage"] == "yes"]
repaired_mean_price[a_brand] = its_repaired_vehicles["price"].mean()
non_repaired_mean_price[a_brand] = its_non_repaired_vehicles["price"].mean()
repaired_mean_price_series = pd.Series(repaired_mean_price)
non_repaired_mean_price_series = pd.Series(non_repaired_mean_price)
repair_vs_non_repair = pd.DataFrame(repaired_mean_price_series, columns = ["repaired_mean_price"])
repair_vs_non_repair["non_repaired_mean_price"] = non_repaired_mean_price_series
repair_vs_non_repair["ratio"] = repair_vs_non_repair["repaired_mean_price"]/repair_vs_non_repair["non_repaired_mean_price"]
print(repair_vs_non_repair)
print("--------------------")
print("On average a repaired vehicle can be priced at {0} times its damaged counterpart".
format(repair_vs_non_repair["ratio"].mean()))
C:\Users\maxen.x\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:8: UserWarning: Boolean Series key will be reindexed to match DataFrame index. C:\Users\maxen.x\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:9: UserWarning: Boolean Series key will be reindexed to match DataFrame index. if __name__ == '__main__':
repaired_mean_price non_repaired_mean_price ratio peugeot 3693.942708 1544.211538 2.392122 bmw 9533.062916 3801.875346 2.507463 volkswagen 6594.260499 2413.531557 2.732204 smart 3924.225549 1461.609756 2.684865 ford 4739.693925 1569.277612 3.020303 chrysler 4013.972727 2239.000000 1.792752 seat 5208.316535 1854.594937 2.808331 renault 3151.620377 1278.918367 2.464286 audi 10962.620746 3510.127090 3.123141 sonstige_autos 16201.951890 5916.758621 2.738316 opel 3725.334341 1553.578495 2.397905 mazda 5195.745303 1605.015873 3.237192 porsche 51125.256098 14670.000000 3.485021 mini 11138.889488 4595.000000 2.424133 mercedes_benz 9825.391405 4097.339623 2.397993 toyota 5656.648421 3560.175000 1.588868 dacia 6217.416667 4459.625000 1.394157 nissan 5804.160156 2112.176471 2.747952 jeep 12537.595238 3074.875000 4.077432 saab 3942.258621 657.000000 6.000394 volvo 5861.224684 1669.209302 3.511378 mitsubishi 4274.434615 1568.195122 2.725703 jaguar 14118.357143 4993.625000 2.827276 fiat 3535.093590 1375.596491 2.569862 skoda 7143.399334 3735.671875 1.912213 subaru 5544.500000 2856.285714 1.941157 kia 6971.317647 2177.125000 3.202075 citroen 4293.597015 2067.507692 2.076702 chevrolet 7425.036269 3423.277778 2.168984 hyundai 6216.828652 2817.127660 2.206797 honda 5035.646154 1560.189189 3.227587 daewoo 1206.511111 524.800000 2.298992 suzuki 4861.124402 1726.000000 2.816410 trabant 3091.760000 817.250000 3.783126 land_rover 21688.512821 5223.750000 4.151905 alfa_romeo 4880.843602 2118.500000 2.303915 daihatsu 1828.564103 950.000000 1.924804 rover 1962.257143 888.500000 2.208506 lancia 3106.685714 4774.500000 0.650683 lada 3064.411765 1400.000000 2.188866 -------------------- On average a repaired vehicle can be priced at 2.6677943265188437 times its damaged counterpart
Insights
A quick summary
eBay Kleinanzeige has a lot of listings related to cars. Through this project we were able to extract, clean and review a small portion of their unclean listing data related to cars from 2016.
Through the course of cleaning I have had to remove a lot of rows based on assumptions which I have stated. After cleaning the data it was used to extract meaningful insights. Thus meeting the goal that was set at the start of this project.