The dataset contains used car listings from eBay Kleinanzeigen, the classifieds section of the German eBay website. It was originally scraped, cleaned, and uploaded to Kaggle; that version was then modified and deliberately dirtied for this guided project.
The working dataset has 50,000 rows and 20 columns.
The aim of this project is to clean and analyze the used car listings, to become familiar with some of the unique benefits Jupyter Notebook provides for pandas, and to build a better working understanding of the NumPy and pandas libraries.
*Section 1 - Import libraries, create a dataframe from the csv file, and display dataframe info*
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv",encoding="Latin-1")
# suppress scientific notation of values
pd.options.display.float_format = '{:.4f}'.format
# autos dataset
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26-03-2016 17:47 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 26-03-2016 00:00 | 0 | 79588 | 06-04-2016 06:45 |
1 | 04-04-2016 13:38 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 04-04-2016 00:00 | 0 | 71034 | 06-04-2016 14:45 |
2 | 26-03-2016 18:57 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 26-03-2016 00:00 | 0 | 35394 | 06-04-2016 20:15 |
3 | 12-03-2016 16:58 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 12-03-2016 00:00 | 0 | 33729 | 15-03-2016 03:16 |
4 | 01-04-2016 14:38 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 01-04-2016 00:00 | 0 | 39218 | 01-04-2016 14:38 |
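The `encoding="Latin-1"` argument matters because the listings contain German characters (e.g. TÜV) that were saved as Latin-1 and are not valid UTF-8. A minimal, self-contained sketch with a made-up one-row CSV illustrating the same call:

```python
import io
import pandas as pd

# Made-up one-row CSV, encoded as Latin-1 the way autos.csv is
raw = "name,price\nFord_Focus_TÜV_neu,1350\n".encode("latin-1")

# Reading with encoding="Latin-1" decodes the umlaut correctly;
# the default UTF-8 decoding would raise a UnicodeDecodeError here
df = pd.read_csv(io.BytesIO(raw), encoding="Latin-1")
print(df.loc[0, "name"])  # Ford_Focus_TÜV_neu
```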
Observations based on the dataframe.info() output:
- the dataset has 50,000 rows and 20 columns, stored as 15 object and 5 int64 columns;
- vehicleType, gearbox, model, fuelType and notRepairedDamage contain null values;
- price and odometer hold numeric data stored as text (e.g. $5,000, 150,000km) and will need conversion;
- the column names use camelCase rather than Python's usual snake_case.
*Section 2 - Rename the columns of the autos dataset to be more readable and logical*
# assign array of column names of autos dataset
autos_columns = autos.columns
# update columns names as specified for date related columns
updated_columns = []
for each_column in autos_columns:
    if each_column == "yearOfRegistration":
        each_column = "registration_year"
    elif each_column == "monthOfRegistration":
        each_column = "registration_month"
    elif each_column == "notRepairedDamage":
        each_column = "unrepaired_damage"
    elif each_column == "dateCreated":
        each_column = "ad_created"
    else:
        each_column = each_column.lower()
    updated_columns.append(each_column)
# assign updated columns names to autos columns
autos.columns = updated_columns
print("\033[1m old_name: \033[0m", autos.columns, "\n")
# update rest of the column names from camel case to snake case
autos.rename(columns={"datecrawled":"date_crawled", "offertype":"offer_type", "vehicletype":"vehicle_type",
"fueltype":"fuel_type", "nrofpictures":"nr_of_pictures", "postalcode":"postal_code",
"lastseen":"last_seen"}, inplace=True)
print("\033[1m new_names: \033[0m", autos.columns)
old_name: Index(['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest', 'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer', 'registration_month', 'fueltype', 'brand', 'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode', 'lastseen'], dtype='object')
new_names: Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
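The renaming above mixes targeted renames with a blanket lower(). For the purely mechanical camelCase to snake_case part, a small regex helper (a sketch, not part of the guided project) avoids listing every column by hand; custom renames such as yearOfRegistration to registration_year would still need the explicit dictionary:

```python
import re
import pandas as pd

def camel_to_snake(name):
    # insert "_" before every capital that is not at the start, then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# hypothetical frame with a few of the original camelCase column names
df = pd.DataFrame(columns=["dateCrawled", "offerType", "nrOfPictures"])
df.columns = [camel_to_snake(c) for c in df.columns]
print(list(df.columns))  # ['date_crawled', 'offer_type', 'nr_of_pictures']
```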
*Section 3 - Descriptive statistics of the autos dataset*
# describe with include='all' to get statistics for non-numeric columns as well
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | powerps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.0000 | 47320 | 50000.0000 | 47242 | 50000 | 50000.0000 | 45518 | 50000 | 40171 | 50000 | 50000.0000 | 50000.0000 | 50000 |
unique | 12073 | 38754 | 2 | 2 | 2357 | 2 | 8 | nan | 2 | nan | 245 | 13 | nan | 7 | 40 | 2 | 76 | nan | nan | 10432 |
top | 28-03-2016 14:49 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | nan | manuell | nan | golf | 150,000km | nan | benzin | volkswagen | nein | 03-04-2016 00:00 | nan | nan | 06-04-2016 02:45 |
freq | 15 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | nan | 36993 | nan | 4024 | 32424 | nan | 30107 | 10687 | 35232 | 1946 | nan | nan | 112 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.0733 | NaN | 116.3559 | NaN | NaN | 5.7234 | NaN | NaN | NaN | NaN | 0.0000 | 50813.6273 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.7128 | NaN | 209.2166 | NaN | NaN | 3.7120 | NaN | NaN | NaN | NaN | 0.0000 | 25779.7480 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.0000 | NaN | 0.0000 | NaN | NaN | 0.0000 | NaN | NaN | NaN | NaN | 0.0000 | 1067.0000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.0000 | NaN | 70.0000 | NaN | NaN | 3.0000 | NaN | NaN | NaN | NaN | 0.0000 | 30451.0000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.0000 | NaN | 105.0000 | NaN | NaN | 6.0000 | NaN | NaN | NaN | NaN | 0.0000 | 49577.0000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.0000 | NaN | 150.0000 | NaN | NaN | 9.0000 | NaN | NaN | NaN | NaN | 0.0000 | 71540.0000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.0000 | NaN | 17700.0000 | NaN | NaN | 12.0000 | NaN | NaN | NaN | NaN | 0.0000 | 99998.0000 | NaN |
*Observations based on the describe output for the autos dataset:*
- seller and offer_type each contain just two unique values, and in both cases one value covers 49,999 of the 50,000 rows;
- nr_of_pictures is 0 in every row;
- price has $0 as its most frequent value (1,421 listings), and registration_year ranges from an implausible 1000 to 9999.
*Section 4 - Process candidate columns: dtype conversions, updates/filters for null and non-null values, analysis of redundant or non-impactful data values, or complete deletion from the autos dataset*
# conversion of price and odometer columns
print("price top 10 rows values")
print(autos["price"].head(10))
print("------------------------")
print("odometer last 10 row values")
print(autos["odometer"].tail(10))
price top 10 rows values 0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 5 $7,900 6 $300 7 $1,990 8 $250 9 $590 Name: price, dtype: object ------------------------ odometer last 10 row values 49990 30,000km 49991 150,000km 49992 125,000km 49993 150,000km 49994 150,000km 49995 100,000km 49996 150,000km 49997 5,000km 49998 40,000km 49999 150,000km Name: odometer, dtype: object
Based on the results above, the following changes need to be made to the price and odometer columns:
- strip the $ symbol and comma separators from price and convert it to a numeric dtype;
- strip the km suffix and comma separators from odometer and convert it to a numeric dtype;
- rename both columns so that the unit becomes part of the column name.
# strip $ and , symbols from the price column and convert to float
# (regex=False makes the replacement literal; in older pandas the default regex=True treats "$" as an anchor and silently replaces nothing)
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
# strip km and , symbols from the odometer column and convert to float
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(float)
# rename price and odometer columns
autos.rename(columns={"price" : "price_USD", "odometer" : "odometer_km"}, inplace=True)
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   date_crawled        50000 non-null  object
 1   name                50000 non-null  object
 2   seller              50000 non-null  object
 3   offer_type          50000 non-null  object
 4   price_USD           50000 non-null  float64
 5   abtest              50000 non-null  object
 6   vehicle_type        44905 non-null  object
 7   registration_year   50000 non-null  int64
 8   gearbox             47320 non-null  object
 9   powerps             50000 non-null  int64
 10  model               47242 non-null  object
 11  odometer_km         50000 non-null  float64
 12  registration_month  50000 non-null  int64
 13  fuel_type           45518 non-null  object
 14  brand               50000 non-null  object
 15  unrepaired_damage   40171 non-null  object
 16  ad_created          50000 non-null  object
 17  nr_of_pictures      50000 non-null  int64
 18  postal_code         50000 non-null  int64
 19  last_seen           50000 non-null  object
dtypes: float64(2), int64(5), object(13)
memory usage: 7.6+ MB
Columns price and odometer have been converted from the object dtype to float and renamed to price_USD and odometer_km respectively.
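An alternative, slightly more defensive sketch of the same cleaning step: strip every non-digit character with one regex and let pd.to_numeric with errors="coerce" turn any malformed leftovers into NaN instead of raising. The sample values are hypothetical but mirror the raw formats shown above:

```python
import pandas as pd

# hypothetical samples mirroring the raw price/odometer formats
s_price = pd.Series(["$5,000", "$8,500", "$300"])
s_km = pd.Series(["150,000km", "70,000km", "5,000km"])

# strip everything that is not a digit or decimal point, then convert;
# errors="coerce" turns any malformed leftover into NaN instead of raising
price = pd.to_numeric(s_price.str.replace(r"[^\d.]", "", regex=True),
                      errors="coerce").astype(float)
km = pd.to_numeric(s_km.str.replace(r"[^\d.]", "", regex=True),
                   errors="coerce").astype(float)
print(price.tolist())  # [5000.0, 8500.0, 300.0]
```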
# analyse unique values in each column of autos dataset
print("Column Name : ","Unique Value Count")
print("-----------------------------------")
for each_column in autos.columns:
    print(each_column,":", autos[each_column].unique().shape[0])
print('\n')
# explore selective columns further which have only 1-3 unique values
print("Column Name : ","Unique Values")
print("-----------------------------------")
selective_list = ["seller","offer_type","abtest","gearbox","unrepaired_damage","nr_of_pictures"]
for each_name in selective_list:
    print(each_name,":", autos[each_name].unique())
Column Name : Unique Value Count ----------------------------------- date_crawled : 12073 name : 38754 seller : 2 offer_type : 2 price_USD : 2357 abtest : 2 vehicle_type : 9 registration_year : 97 gearbox : 3 powerps : 448 model : 246 odometer_km : 13 registration_month : 13 fuel_type : 8 brand : 40 unrepaired_damage : 3 ad_created : 76 nr_of_pictures : 1 postal_code : 7014 last_seen : 10432 Column Name : Unique Values ----------------------------------- seller : ['privat' 'gewerblich'] offer_type : ['Angebot' 'Gesuch'] abtest : ['control' 'test'] gearbox : ['manuell' 'automatik' nan] unrepaired_damage : ['nein' nan 'ja'] nr_of_pictures : [0]
Based on the earlier and above observations, the nr_of_pictures column is a candidate to be dropped from the autos dataset, as it contains 0 in every row and therefore provides no information for the analysis.
# delete nr_of_pictures column from autos dataframe
del autos["nr_of_pictures"]
# review columns present in autos dataset
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_USD', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
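del works, but DataFrame.drop(columns=...) is the more common idiom; a tiny sketch (toy frame) showing the errors="ignore" option that makes the cell safe to re-run:

```python
import pandas as pd

# toy frame; "nr_of_pictures" mirrors the column being removed
df = pd.DataFrame({"name": ["golf"], "nr_of_pictures": [0]})

# drop returns a new frame; errors="ignore" keeps a re-run from failing
df = df.drop(columns=["nr_of_pictures"], errors="ignore")
print(list(df.columns))  # ['name']
```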
Identify outliers and unusual data values in the price_USD and odometer_km columns
# display unique value shape for price and odometer
print("price unique shape:",autos["price_USD"].unique().shape[0])
print("odometer unique shape:",autos["odometer_km"].unique().shape[0])
print('----------------------')
# describe method for price_USD and odometer_km columns
print(autos["price_USD"].describe())
print('----------------------')
print(autos["odometer_km"].describe())
print('----------------------')
price unique shape: 2357 odometer unique shape: 13 ---------------------- count 50000.0000 mean 9840.0438 std 481104.3805 min 0.0000 25% 1100.0000 50% 2950.0000 75% 7200.0000 max 99999999.0000 Name: price_USD, dtype: float64 ---------------------- count 50000.0000 mean 125732.7000 std 40042.2117 min 5000.0000 25% 125000.0000 50% 150000.0000 75% 150000.0000 max 150000.0000 Name: odometer_km, dtype: float64 ----------------------
# value count for highest and lowest priced cars
print(autos["price_USD"].value_counts().sort_index(ascending=True).tail(25))
print('----------------------')
print(autos["price_USD"].value_counts().sort_index(ascending=True).head(25))
169999.0000 1 175000.0000 1 180000.0000 1 190000.0000 1 194000.0000 1 197000.0000 1 198000.0000 1 220000.0000 1 250000.0000 1 259000.0000 1 265000.0000 1 295000.0000 1 299000.0000 1 345000.0000 1 350000.0000 1 999990.0000 1 999999.0000 2 1234566.0000 1 1300000.0000 1 3890000.0000 1 10000000.0000 1 11111111.0000 2 12345678.0000 3 27322222.0000 1 99999999.0000 1 Name: price_USD, dtype: int64 ---------------------- 0.0000 1421 1.0000 156 2.0000 3 3.0000 1 5.0000 2 8.0000 1 9.0000 1 10.0000 7 11.0000 2 12.0000 3 13.0000 2 14.0000 1 15.0000 2 17.0000 3 18.0000 1 20.0000 4 25.0000 5 29.0000 1 30.0000 7 35.0000 1 40.0000 6 45.0000 4 47.0000 1 49.0000 4 50.0000 49 Name: price_USD, dtype: int64
*Based on the results for the highest and lowest priced cars, the following observations can be made:*
- 1,421 listings are priced at $0, followed by a long tail of token prices below $300, which are not realistic sale prices;
- above $350,000 the prices jump abruptly to placeholder-looking values such as $999,999, $12,345,678 and $99,999,999;
- restricting the price to the $300 - $350,000 range removes both tails.
# create price range between 300 and 350000
autos = autos[autos["price_USD"].between(300,350000)]
# display lowest and highest values for price_USD column
print("lowest price:",autos["price_USD"].min(), "highest price:",autos["price_USD"].max())
lowest price: 300.0 highest price: 350000.0
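Note that Series.between is inclusive on both ends by default, which is why the $300 and $350,000 boundary listings survive the filter. A toy demonstration:

```python
import pandas as pd

s = pd.Series([299, 300, 5000, 350000, 350001])

# between(left, right) keeps both endpoints, matching the 300-350000 filter
kept = s[s.between(300, 350000)]
print(kept.tolist())  # [300, 5000, 350000]
```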
# value count for highest and lowest odometer_km readings
print(autos["odometer_km"].value_counts().sort_index(ascending=True))
5000.0000 689 10000.0000 240 20000.0000 736 30000.0000 771 40000.0000 812 50000.0000 1003 60000.0000 1141 70000.0000 1204 80000.0000 1401 90000.0000 1705 100000.0000 2069 125000.0000 4953 150000.0000 30144 Name: odometer_km, dtype: int64
*Section 5 - Process columns with date type values represented as other dtype*
The following columns contain date information but are stored as object or int dtypes in the autos dataset: date_crawled, ad_created and last_seen (object), plus registration_year and registration_month (int64).
# first 5 records from date_crawled, ad_created and last_seen columns
print(autos[["date_crawled", "ad_created", "last_seen"]][0:5])
date_crawled ad_created last_seen 0 26-03-2016 17:47 26-03-2016 00:00 06-04-2016 06:45 1 04-04-2016 13:38 04-04-2016 00:00 06-04-2016 14:45 2 26-03-2016 18:57 26-03-2016 00:00 06-04-2016 20:15 3 12-03-2016 16:58 12-03-2016 00:00 15-03-2016 03:16 4 01-04-2016 14:38 01-04-2016 00:00 01-04-2016 14:38
Calculation of % wise data distribution for date_crawled, ad_created and last_seen columns
print("\033[1m% wise data distribution for date_crawled (inclusive of null values, arranged by oldest to newest)\033[0m")
print('--------------------------------------------------------------------')
print(autos["date_crawled"].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True).mul(100))
print('\n')
print("\033[1m% wise data distribution for date_crawled (inclusive of null values, arranged by highest to lowest)\033[0m")
print('--------------------------------------------------------------------')
print(autos["date_crawled"].str[:10].value_counts(normalize=True,dropna=False).sort_values(ascending=False).mul(100))
% wise data distribution for date_crawled (inclusive of null values, arranged by oldest to newest) -------------------------------------------------------------------- 01-04-2016 3.3861 02-04-2016 3.5632 03-04-2016 3.8619 04-04-2016 3.6528 05-03-2016 2.5369 05-04-2016 1.3122 06-03-2016 1.4082 06-04-2016 0.3158 07-03-2016 3.6208 07-04-2016 0.1366 08-03-2016 3.3093 09-03-2016 3.2837 10-03-2016 3.2474 11-03-2016 3.2688 12-03-2016 3.7040 13-03-2016 1.5640 14-03-2016 3.6464 15-03-2016 3.4181 16-03-2016 2.9466 17-03-2016 3.1450 18-03-2016 1.2866 19-03-2016 3.4693 20-03-2016 3.8022 21-03-2016 3.7488 22-03-2016 3.2922 23-03-2016 3.2282 24-03-2016 2.9274 25-03-2016 3.1279 26-03-2016 3.2325 27-03-2016 3.1109 28-03-2016 3.4992 29-03-2016 3.3776 30-03-2016 3.3776 31-03-2016 3.1919 Name: date_crawled, dtype: float64 % wise data distribution for date_crawled (inclusive of null values, arranged by highest to lowest) -------------------------------------------------------------------- 03-04-2016 3.8619 20-03-2016 3.8022 21-03-2016 3.7488 12-03-2016 3.7040 04-04-2016 3.6528 14-03-2016 3.6464 07-03-2016 3.6208 02-04-2016 3.5632 28-03-2016 3.4992 19-03-2016 3.4693 15-03-2016 3.4181 01-04-2016 3.3861 30-03-2016 3.3776 29-03-2016 3.3776 08-03-2016 3.3093 22-03-2016 3.2922 09-03-2016 3.2837 11-03-2016 3.2688 10-03-2016 3.2474 26-03-2016 3.2325 23-03-2016 3.2282 31-03-2016 3.1919 17-03-2016 3.1450 25-03-2016 3.1279 27-03-2016 3.1109 16-03-2016 2.9466 24-03-2016 2.9274 05-03-2016 2.5369 13-03-2016 1.5640 06-03-2016 1.4082 05-04-2016 1.3122 18-03-2016 1.2866 06-04-2016 0.3158 07-04-2016 0.1366 Name: date_crawled, dtype: float64
The results above indicate that the autos dataset covers listings crawled between March 2016 and April 2016. Barring a few days, the daily share of listings is similar, fluctuating between 3% and 3.9%.
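One caveat with the string-slicing approach: dd-mm-yyyy strings sort lexicographically, not chronologically, which is why 05-03-2016 appears between 04-04-2016 and 05-04-2016 in the output above. A sketch (with sample timestamps in the dataset's format) of parsing the strings with pd.to_datetime first so that sorting follows the calendar:

```python
import pandas as pd

# sample crawl timestamps in the dataset's dd-mm-yyyy hh:mm format
s = pd.Series(["04-04-2016 13:38", "05-03-2016 14:06", "26-03-2016 17:47"])

# parse with an explicit format, then sort: March now precedes April
dates = pd.to_datetime(s, format="%d-%m-%Y %H:%M")
ordered = dates.sort_values().dt.strftime("%d-%m-%Y").tolist()
print(ordered)  # ['05-03-2016', '26-03-2016', '04-04-2016']
```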
print("\033[1m% wise data distribution for last_seen (inclusive of null values)\033[0m")
print('--------------------------------------------------------------------')
print(autos["last_seen"].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True).mul(100))
% wise data distribution for last_seen (inclusive of null values)
--------------------------------------------------------------------
01-04-2016 2.2894
02-04-2016 2.4814
03-04-2016 2.5092
04-04-2016 2.4473
05-03-2016 0.1067
05-04-2016 12.5757
06-03-2016 0.4267
06-04-2016 22.3287
07-03-2016 0.5355
07-04-2016 13.2905
08-03-2016 0.7126
09-03-2016 0.9580
10-03-2016 1.0455
11-03-2016 1.2333
12-03-2016 2.3940
13-03-2016 0.8855
14-03-2016 1.2439
15-03-2016 1.5725
16-03-2016 1.6152
17-03-2016 2.7887
18-03-2016 0.7254
19-03-2016 1.5512
20-03-2016 2.0654
21-03-2016 2.0462
22-03-2016 2.1486
23-03-2016 1.8563
24-03-2016 1.9544
25-03-2016 1.8904
26-03-2016 1.6600
27-03-2016 1.5554
28-03-2016 2.0910
29-03-2016 2.1870
30-03-2016 2.4452
31-03-2016 2.3833
Name: last_seen, dtype: float64
The percentage-wise distribution for the last_seen column shows that three days in April 2016, i.e. the 5th, 6th and 7th, account for nearly half of all last_seen dates, which most likely marks the end of the crawling period rather than a surge in sales. For the rest of the days between March and April 2016, the values are more or less uniform.
print("\033[1m% wise data distribution for ad_created (inclusive of null values)\033[0m")
print('--------------------------------------------------------------------')
print(autos["ad_created"].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True).mul(100))
% wise data distribution for ad_created (inclusive of null values)
--------------------------------------------------------------------
01-02-2016 0.0021
01-03-2016 0.0085
01-04-2016 3.3818
02-02-2016 0.0043
02-03-2016 0.0107
...
29-02-2016 0.0171
29-03-2016 3.3797
30-03-2016 3.3584
30-12-2015 0.0021
31-03-2016 3.1962
Name: ad_created, Length: 76, dtype: float64
Evidently, the ad_created field spans more distinct dates (76, reaching back to December 2015) than the date_crawled and last_seen columns.
autos["registration_year"].describe()
count 46868.0000 mean 2004.8328 std 88.0658 min 1000.0000 25% 1999.0000 50% 2004.0000 75% 2008.0000 max 9999.0000 Name: registration_year, dtype: float64
The registration year can be read as the year the car was first purchased, registered and brought on the road. However, two values make no sense: the column minimum of 1000 and the maximum of 9999. Both years lie outside any valid time frame.
The dataset needs to be filtered down to valid registration year values.
*Section 6 - Process invalid registration years*
# unique years present in registration year column
print(autos["registration_year"].sort_values().unique(),'\n')
[1000 1001 1111 1800 1910 1927 1929 1931 1934 1937 1938 1939 1941 1943 1948 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2800 4100 4500 4800 5000 5911 6200 8888 9000 9999]
Since all the listings were created in 2016, any car with a registration year greater than 2016 is not a valid entry in the autos dataset. At the other end, the first documented automobile dates from 1769; the dataset contains years as early as 1800, but the earliest plausible registrations here are taken to start at 1910.
Hence, any registration year before 1910 or after 2016 makes no sense for this dataset.
# rows that fall outside the 1910 - 2016 bracket
autos["registration_year"].between(1910,2016).value_counts(normalize=True).mul(100)
True 96.0741 False 3.9259 Name: registration_year, dtype: float64
Based on the results above, it can be concluded that the listings with invalid registration year i.e. less than 1910 and more than 2016, contribute about 4% of the total autos dataset. Therefore these records can be removed.
# delete records with registration year outside the 1910 - 2016 bracket
autos = autos[autos["registration_year"].between(1910, 2016)]
# display record count of valid registration years
print("Records with valid registration year (1910 to 2016):",autos.shape[0])
Records with valid registration year (1910 to 2016): 45028
*Section 7 - Brand wise analysis of used car listings*
# unique brands in autos dataset
print("unique brands in autos dataset:")
print("-------------------------------")
print(autos["brand"].unique(),'\n')
# 10 most listed brands
top_10_listed_brands = autos["brand"].value_counts(normalize=True).head(10)
print("10 most listed brands in autos dataset:")
print("---------------------------------------")
print(top_10_listed_brands,"\n")
top_10_listed_brands = autos["brand"].value_counts(normalize=True).head(10).index
print("Top 10 brands as",top_10_listed_brands)
unique brands in autos dataset: ------------------------------- ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'renault' 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini' 'seat' 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover' 'daihatsu' 'lancia'] 10 most listed brands in autos dataset: --------------------------------------- volkswagen 0.2112 bmw 0.1124 opel 0.1042 mercedes_benz 0.0990 audi 0.0887 ford 0.0673 renault 0.0457 peugeot 0.0296 fiat 0.0248 seat 0.0182 Name: brand, dtype: float64 Top 10 brands as Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat'], dtype='object')
As indicated by the top 10 brands, Volkswagen is the most preferred brand, accounting for about 21% of all listings. The only American brand in the top 10 is Ford, with about 6.7% of listings.
Calculation of average price of top 10 brands
avg_cost_10brands = {}
temp_sort_table = []
for each_brand in top_10_listed_brands:
    brand_price = autos.loc[autos["brand"] == each_brand, ["brand", "price_USD"]]
    avg_cost = brand_price["price_USD"].mean()
    avg_cost_10brands[each_brand] = round(avg_cost)
    # steps to sort results for avg cost of top 10 brands in descending order
    key_val_combo = (avg_cost_10brands[each_brand], each_brand)
    temp_sort_table.append(key_val_combo)
temp_sort_table = sorted(temp_sort_table, reverse=True)
# display results for the avg price dictionary and its sorted form
print(avg_cost_10brands,"\n")
print("\033[1mSorted results for avg price\033[0m")
for each_val in temp_sort_table:
    print(each_val[1],":",each_val[0])
{'volkswagen': 5596, 'bmw': 8456, 'opel': 3172, 'mercedes_benz': 8711, 'audi': 9447, 'ford': 4023, 'renault': 2633, 'peugeot': 3222, 'fiat': 3008, 'seat': 4573}
Sorted results for avg price
audi : 9447
mercedes_benz : 8711
bmw : 8456
volkswagen : 5596
seat : 4573
ford : 4023
peugeot : 3222
opel : 3172
fiat : 3008
renault : 2633
Calculation of average mileage of top 10 brands
avg_mil_10brands = {}
temp_sort_table = []
for each_brand in top_10_listed_brands:
    brand_kms = autos.loc[autos["brand"] == each_brand, ["brand", "odometer_km"]]
    avg_kms = brand_kms["odometer_km"].mean()
    avg_mil_10brands[each_brand] = round(avg_kms)
    # steps to sort results for avg mileage of top 10 brands in descending order
    key_val_combo = (avg_mil_10brands[each_brand], each_brand)
    temp_sort_table.append(key_val_combo)
temp_sort_table = sorted(temp_sort_table, reverse=True)
# display results for the avg mileage dictionary and its sorted form
print(avg_mil_10brands,"\n")
print("\033[1mSorted results for avg mileage\033[0m")
for each_val in temp_sort_table:
    print(each_val[1],":",each_val[0])
{'volkswagen': 128571, 'bmw': 132763, 'opel': 128924, 'mercedes_benz': 131114, 'audi': 129190, 'ford': 123928, 'renault': 127421, 'peugeot': 126604, 'fiat': 116413, 'seat': 121429}
Sorted results for avg mileage
bmw : 132763
mercedes_benz : 131114
audi : 129190
opel : 128924
volkswagen : 128571
renault : 127421
peugeot : 126604
ford : 123928
seat : 121429
fiat : 116413
*Section 8 - Finalise analysis and display results for the 10 most popular brands on eBay Kleinanzeigen*
This section combines the two aggregation dictionaries using two pandas constructors, pd.Series and pd.DataFrame:
# convert the average price and average mileage dictionaries into pandas Series
price_sr = pd.Series(avg_cost_10brands)
mileage_sr = pd.Series(avg_mil_10brands)
# create dataframe from price_sr series
top10_brands = pd.DataFrame(price_sr, columns=["average_price"])
# add mileage column in top10_brands dataframe
top10_brands["average_mileage"] = mileage_sr
# display top10_brands dataframe sorted on descending average mileage
top10_brands.sort_values(["average_mileage"], ascending=False)
average_price | average_mileage | |
---|---|---|
bmw | 8456 | 132763 |
mercedes_benz | 8711 | 131114 |
audi | 9447 | 129190 |
opel | 3172 | 128924 |
volkswagen | 5596 | 128571 |
renault | 2633 | 127421 |
peugeot | 3222 | 126604 |
ford | 4023 | 123928 |
seat | 4573 | 121429 |
fiat | 3008 | 116413 |
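The dictionary-and-loop aggregation above can also be expressed in one pass with groupby and named aggregation; a sketch on a toy frame that reuses the notebook's column names:

```python
import pandas as pd

# toy frame using the notebook's column names
df = pd.DataFrame({
    "brand": ["bmw", "bmw", "fiat"],
    "price_USD": [9000.0, 7000.0, 3000.0],
    "odometer_km": [150000.0, 125000.0, 100000.0],
})

# one groupby pass replaces both loops and the Series/DataFrame assembly
summary = (df.groupby("brand")
             .agg(average_price=("price_USD", "mean"),
                  average_mileage=("odometer_km", "mean"))
             .sort_values("average_mileage", ascending=False))
print(summary)
```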
The top10_brands dataframe displays the 10 most preferred brands together with the average price and average mileage of the cars listed under each brand in the autos dataset.
Explore the categorical columns of the autos dataset for German words and convert them to English
(The model column has been skipped for this section.)
# unique values from categorical info columns (excludes column model)
column_list = ["seller", "offer_type", "vehicle_type", "gearbox", "fuel_type", "unrepaired_damage"]
for each_col in column_list:
    print(each_col,":",autos[each_col].unique(),"\n")
seller : ['privat'] offer_type : ['Angebot'] vehicle_type : ['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio' 'andere'] gearbox : ['manuell' 'automatik' nan] fuel_type : ['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere'] unrepaired_damage : ['nein' nan 'ja']
# create a mapping dictionary to map the German words to their English translations
ger_to_eng_map = { # for seller column
                   "privat":"private", "gewerblich":"commercial",
                   # for offer type column
                   "Angebot":"offer", "Gesuch":"request",
                   # for vehicle type column
                   "bus":"bus", "limousine":"sedan", "kleinwagen":"small car", "kombi":"station wagon",
                   "coupe":"coupe", "suv":"suv", "cabrio":"convertible",
                   # for gearbox column
                   "manuell":"manual", "automatik":"automatic",
                   # for fuel type column
                   "lpg":"lpg", "benzin":"petrol", "diesel":"diesel", "cng":"cng", "hybrid":"hybrid",
                   "elektro":"electric", "andere":"other",
                   # for unrepaired damage column
                   "nein":"no", "ja":"yes" }
# map German keywords to English and update the respective columns in the autos dataset
for each_col in column_list:
    autos[each_col] = autos[each_col].map(ger_to_eng_map)
    print(each_col,":",autos[each_col].unique(),"\n")
seller : ['private'] offer_type : ['offer'] vehicle_type : ['bus' 'sedan' 'small car' 'station wagon' nan 'coupe' 'suv' 'convertible' 'other'] gearbox : ['manual' 'automatic' nan] fuel_type : ['lpg' 'petrol' 'diesel' nan 'cng' 'hybrid' 'electric' 'other'] unrepaired_damage : ['no' nan 'yes']
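A caveat worth knowing about this step: Series.map returns NaN for any value missing from the mapping dictionary, so the dictionary must cover every category (existing NaN values simply pass through). A toy demonstration with a made-up unmapped value:

```python
import pandas as pd

mapping = {"nein": "no", "ja": "yes"}
# "kaputt" is a made-up value that is absent from the mapping
s = pd.Series(["nein", "ja", "kaputt"])

# unmapped values silently become NaN rather than raising
mapped = s.map(mapping)
print(mapped.tolist())  # ['no', 'yes', nan]
```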
*Update date formats from dd-mm-yyyy hh:mm to ddmmyyyy and convert the date values to int dtype*
# display date columns sample data before update
print(str("\33[1m before date formatting and conversion \33[0m").center(60))
print(autos[["date_crawled","ad_created","last_seen"]].head())
print("\n")
# update date columns format and convert to int type
autos["date_crawled"] = autos["date_crawled"].str.split(" ").str[0].str.replace("-","").astype(int)
autos["ad_created"] = autos["ad_created"].str.split(" ").str[0].str.replace("-","").astype(int)
autos["last_seen"] = autos["last_seen"].str.split(" ").str[0].str.replace("-","").astype(int)
# display date columns sample data after update
print("\33[1m after date formatting and conversion \33[0m")
print(autos[["date_crawled","ad_created","last_seen"]].head())
before date formatting and conversion date_crawled ad_created last_seen 0 26-03-2016 17:47 26-03-2016 00:00 06-04-2016 06:45 1 04-04-2016 13:38 04-04-2016 00:00 06-04-2016 14:45 2 26-03-2016 18:57 26-03-2016 00:00 06-04-2016 20:15 3 12-03-2016 16:58 12-03-2016 00:00 15-03-2016 03:16 4 01-04-2016 14:38 01-04-2016 00:00 01-04-2016 14:38 after date formatting and conversion date_crawled ad_created last_seen 0 26032016 26032016 6042016 1 4042016 4042016 6042016 2 26032016 26032016 6042016 3 12032016 12032016 15032016 4 1042016 1042016 1042016
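Note that the ddmmyyyy integers above drop the leading zero (06-04-2016 becomes 6042016) and do not compare chronologically. A sketch of an alternative that parses the dates first and formats them as yyyymmdd, which keeps the padding and sorts correctly:

```python
import pandas as pd

s = pd.Series(["26-03-2016 17:47", "06-04-2016 06:45"])

# parse, then format as yyyymmdd: zero padding survives and
# integer order matches chronological order
as_int = pd.to_datetime(s, format="%d-%m-%Y %H:%M").dt.strftime("%Y%m%d").astype(int)
print(as_int.tolist())  # [20160326, 20160406]
```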
Most common brand - model combinations
This section excludes null values in the model column of the autos dataset. For the brand sonstige_autos (German for "other cars"), the model column is entirely null, so that brand has been omitted from this section.
# brand name list in autos dataset
brand_list = autos["brand"].unique()
# dictionary to store brand name and model name combinations
brand_model = {}
# identify most common model listed for a given brand in autos dataset
for each_brand in brand_list:
    # exclude brand "sonstige_autos" from the brand - model combination
    if each_brand != "sonstige_autos":
        model_name = (autos.loc[autos["brand"] == each_brand, "model"].value_counts(dropna=True).
                      sort_values(ascending=False).head(1).index[0])
        brand_model[each_brand] = model_name
# display result of brand model combination
print("\33[1m Most common brand - model combination \33[0m")
print("---------------------------------------")
for each_brand in brand_model:
    print(each_brand,"-", brand_model[each_brand])
Most common brand - model combination
---------------------------------------
peugeot - 2_reihe
bmw - 3er
volkswagen - golf
smart - fortwo
ford - focus
chrysler - andere
renault - twingo
mercedes_benz - c_klasse
audi - a4
opel - corsa
mazda - 6_reihe
porsche - 911
mini - cooper
seat - ibiza
toyota - yaris
dacia - sandero
nissan - micra
jeep - grand
saab - andere
volvo - v70
mitsubishi - colt
jaguar - andere
fiat - punto
skoda - octavia
subaru - legacy
kia - andere
citroen - andere
chevrolet - andere
hyundai - i_reihe
honda - civic
daewoo - matiz
suzuki - andere
trabant - 601
land_rover - freelander
alfa_romeo - 156
lada - niva
rover - andere
daihatsu - cuore
lancia - ypsilon
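The same per-brand lookup can be written more compactly with groupby and value_counts().idxmax(); a sketch on a toy frame (hypothetical rows, same column names):

```python
import pandas as pd

# hypothetical listings; same column names as the autos dataset
df = pd.DataFrame({
    "brand": ["bmw", "bmw", "bmw", "ford"],
    "model": ["3er", "3er", "5er", "focus"],
})

# value_counts().idxmax() picks each brand's most frequent model;
# value_counts skips NaN models by default
top_model = df.groupby("brand")["model"].agg(lambda s: s.value_counts().idxmax())
print(top_model.to_dict())  # {'bmw': '3er', 'ford': 'focus'}
```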
Average price variations based on distance travelled
# identify unique values of the odometer_km column
autos["odometer_km"].value_counts().sort_index()
# set-up four ranges of kms covered by the used cars in autos dataset
kms_below_30K_avgUSD = autos.loc[autos["odometer_km"] <= 30000,"price_USD"].mean()
kms_40k_70K_avgUSD = autos.loc[autos["odometer_km"].between(40000,70000),"price_USD"].mean()
kms_80k_100K_avgUSD = autos.loc[autos["odometer_km"].between(80000,100000),"price_USD"].mean()
kms_above_100k_avgUSD = autos.loc[autos["odometer_km"] > 100000,"price_USD"].mean()
# display results of average price by kms covered
print("Average price of used cars based on distance travelled")
print("------------------------------------------------------")
print("distance travelled below 30000 kms:", int(kms_below_30K_avgUSD))
print("distance travelled between 40000 - 70000 kms:", int(kms_40k_70K_avgUSD))
print("distance travelled between 80000 - 100000 kms:", int(kms_80k_100K_avgUSD))
print("distance travelled above 100000 kms:", int(kms_above_100k_avgUSD))
Average price of used cars based on distance travelled ------------------------------------------------------ distance travelled below 30000 kms: 16321 distance travelled between 40000 - 70000 kms: 13035 distance travelled between 80000 - 100000 kms: 8812 distance travelled above 100000 kms: 4265
The average price of the listed cars declines as the distance travelled increases.
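The four hand-written .between() selections can also be expressed with pd.cut, which assigns every odometer reading to a distance band in one pass; a sketch with made-up sample values (the band edges and labels are illustrative):

```python
import pandas as pd

# made-up sample readings and prices, one per band
km = pd.Series([5000.0, 50000.0, 90000.0, 150000.0])
price = pd.Series([16000.0, 13000.0, 9000.0, 4000.0])

# pd.cut buckets each reading into a labelled distance band
bands = pd.cut(km, bins=[0, 30000, 70000, 100000, 200000],
               labels=["<=30k", "40k-70k", "80k-100k", ">100k"])

# group the prices by band and average them
means = price.groupby(bands).mean()
print(means.to_dict())
```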
Comparison between average price of cars with damage and cars without damage
# record count of cars with and without damage
print(autos["unrepaired_damage"].value_counts(dropna=False),"\n")
no_damage_avg_price = autos.loc[autos["unrepaired_damage"] == "no", "price_USD"].mean()
print("average price of cars without damage:",int(no_damage_avg_price))
damage_avg_price = autos.loc[autos["unrepaired_damage"] == "yes", "price_USD"].mean()
print("average price of cars with damage:", int(damage_avg_price))
print("\33[1mCars without damage cost", round(no_damage_avg_price/damage_avg_price,4),"times more than damaged cars. \33[0m")
no 33490
NaN 7484
yes 4054
Name: unrepaired_damage, dtype: int64
average price of cars without damage: 7235
average price of cars with damage: 2488
Cars without damage cost 2.9075 times more than damaged cars.