Our dataset contains listings of used cars offered for sale on eBay Kleinanzeigen, the classifieds section of the German eBay website. It was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset is no longer available on Kaggle, but you can find a copy here. The goal of this work is to analyze the dataset and examine general trends, if there are any.
# Import required modules - pandas and numpy
import pandas as pd
import numpy as np
# Read the 'autos.csv' file as 'autos'
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
# Print the first five rows
print(autos.head())
# Get general information about the dataset
autos.info()
# Check that the dataset does not carry its own index column
autos.index
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
RangeIndex(start=0, stop=50000, step=1)
As we see, the dataset has 50,000 rows and 20 columns. Five columns ('vehicleType', 'gearbox', 'model', 'fuelType', 'notRepairedDamage') contain null values; 'price' and 'odometer' are stored as strings; and the column names use camelCase rather than Python's preferred snake_case.
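The per-column null counts reported by info() can also be obtained directly with isnull().sum(). A minimal sketch on a hypothetical mini-frame using a few of the dataset's columns:

```python
import pandas as pd

# Hypothetical mini-frame standing in for autos (column names from the dataset).
autos_sample = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine"],
    "gearbox": ["manuell", "automatik", None],
    "price": ["$5,000", "$8,500", "$8,990"],
})

# Number of nulls per column, mirroring what info() reports as non-null counts
null_counts = autos_sample.isnull().sum()
print(null_counts)
```

On the real data the same call highlights 'vehicleType', 'gearbox', 'model', 'fuelType', and 'notRepairedDamage' as the columns with missing values.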
Get the column names:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Now rename them manually to snake_case by calling the rename() method, also renaming 'price' to 'price_usd' and 'odometer' to 'odometer_km':
autos.rename({'dateCrawled':'date_crawled', 'offerType':'offer_type', 'price':'price_usd',
'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'powerPS':'power_ps',
'odometer':'odometer_km', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code',
'lastSeen':'last_seen'}, axis=1, inplace=True)
and check it:
autos.head()
date_crawled | name | seller | offer_type | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Let's look at the data with the describe() function.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-10 15:36:24 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
As we see: 'seller' and 'offer_type' are nearly constant (49,999 of 50,000 values are identical) and 'nr_of_pictures' is always zero, so these three columns carry no information; 'price_usd' and 'odometer_km' are stored as text and need conversion to numbers; 'registration_year' ranges from 1000 to 9999 and 'power_ps' from 0 to 17,700, so both clearly contain outliers.
Let's examine the range and format of the values in the 'odometer_km', 'price_usd', and 'registration_year' columns to determine the conversion criteria.
s = autos['odometer_km'].unique()
print("odometer_km", s, s.shape, sep = "\n")
p = autos['price_usd'].unique()
print("price_usd", p, p.shape, sep = "\n")
y = autos['registration_year'].unique()
print("registration_year", y, y.shape, sep = "\n")
odometer_km
['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km' '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km' '40,000km']
(13,)
price_usd
['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995']
(2357,)
registration_year
[2004 1997 2009 2007 2003 2006 1995 1998 2000 2017 2010 1999 1982 1990 2015 2014 1996 1992 2005 2002 2012 2011 2008 1985 2016 1994 1986 2001 2018 2013 1972 1993 1988 1989 1967 1973 1956 1976 4500 1987 1991 1983 1960 1969 1950 1978 1980 1984 1963 1977 1961 1968 1934 1965 1971 1966 1979 1981 1970 1974 1910 1975 5000 4100 2019 1959 9996 9999 6200 1964 1958 1800 1948 1931 1943 9000 1941 1962 1927 1937 1929 1000 1957 1952 1111 1955 1939 8888 1954 1938 2800 5911 1500 1953 1951 4800 1001]
(97,)
Let's begin cleaning the dataset. First, drop the three uninformative columns:
autos.drop(columns = ['nr_of_pictures','seller','offer_type'], inplace=True)
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price_usd             50000 non-null object
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(4), object(13)
memory usage: 6.5+ MB
# Strip the '$' sign and thousands separators from price and cast to integer
autos['price_usd'] = autos['price_usd'].str.replace(',','').str.replace('$','').astype('int64')
# Strip the 'km' suffix and thousands separators from odometer and cast to integer
autos['odometer_km'] = autos['odometer_km'].str.replace('km','').str.replace(',','').astype('int64')
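Note that '$' is a regular-expression metacharacter, so passing regex=False to str.replace makes the literal replacement explicit and silences warnings in recent pandas versions. A minimal sketch on hypothetical sample strings in the dataset's format:

```python
import pandas as pd

# Hypothetical sample of the raw string columns (format taken from the dataset).
s = pd.DataFrame({
    "price_usd": ["$5,000", "$8,500", "$1,350"],
    "odometer_km": ["150,000km", "70,000km", "150,000km"],
})

# regex=False treats '$', ',' and 'km' as literal characters, not regex patterns
s["price_usd"] = (s["price_usd"]
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype("int64"))
s["odometer_km"] = (s["odometer_km"]
                    .str.replace("km", "", regex=False)
                    .str.replace(",", "", regex=False)
                    .astype("int64"))
print(s)
```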
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
odometer_km | price_usd | registration_year | power_ps | |
---|---|---|---|---|
count | 50000.000000 | 5.000000e+04 | 50000.000000 | 50000.000000 |
mean | 125732.700000 | 9.840044e+03 | 2005.073280 | 116.355920 |
std | 40042.211706 | 4.811044e+05 | 105.712813 | 209.216627 |
min | 5000.000000 | 0.000000e+00 | 1000.000000 | 0.000000 |
25% | 125000.000000 | 1.100000e+03 | 1999.000000 | 70.000000 |
50% | 150000.000000 | 2.950000e+03 | 2003.000000 | 105.000000 |
75% | 150000.000000 | 7.200000e+03 | 2008.000000 | 150.000000 |
max | 150000.000000 | 1.000000e+08 | 9999.000000 | 17700.000000 |
Let's analyze these values. The maxima for price (100,000,000 USD), registration year (9999), and power (17,700 PS) are clearly unrealistic, as are the zero minima for price and power, so we need outlier limits.
Let's calculate the IQR (interquartile range) for these values:
iqr_reg_year = autos['registration_year'].quantile(0.75) - autos['registration_year'].quantile(0.25)
print('IQR reg_year =', iqr_reg_year)
iqr_power_ps = autos['power_ps'].quantile(0.75) - autos['power_ps'].quantile(0.25)
print('IQR power_ps =', iqr_power_ps)
iqr_price_usd = autos['price_usd'].quantile(0.75) - autos['price_usd'].quantile(0.25)
print('IQR price_usd =', iqr_price_usd)
IQR reg_year = 9.0
IQR power_ps = 80.0
IQR price_usd = 6100.0
Remember the general rule for outlier values (Tukey's fences): the low limit is Q1 - 1.5 * IQR and the high limit is Q3 + 1.5 * IQR.
If we apply these formulas blindly, we get paradoxical limits: negative power, negative price, registration years in the future, and so on. Hence, to define reliable limits, we will fall back on real values in some cases.
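The fence rule can be sketched as a small helper function. The sample values below are hypothetical, but they reproduce the dataset's power quartiles (70 and 150 PS) and hence the paradoxical negative low fence:

```python
import pandas as pd

def tukey_fences(series, k=1.5):
    """Return (low, high) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical power values whose quartiles match the dataset (Q1=70, Q3=150)
power = pd.Series([0, 70, 105, 150, 17700])
low, high = tukey_fences(power)
print(low, high)  # the formal low fence is negative, which is impossible for power
```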
Before estimating the high outlier limit for the year, let's look at the maximum and minimum values of the 'date_crawled' and 'ad_created' columns to understand when the data was published:
ad_created_max = max(pd.to_datetime(autos['ad_created']))
ad_created_min = min(pd.to_datetime(autos['ad_created']))
print("ad_created_max = ", ad_created_max, ", ad_created_min = ", ad_created_min )
date_crawled_max = max(pd.to_datetime(autos['date_crawled']))
date_crawled_min = min(pd.to_datetime(autos['date_crawled']))
print("date_crawled_max = ",date_crawled_max, ", date_crawled_min", date_crawled_min )
ad_created_max =  2016-04-07 00:00:00 , ad_created_min =  2015-06-11 00:00:00
date_crawled_max =  2016-04-07 14:36:56 , date_crawled_min 2016-03-05 14:06:30
Define high and low outlier limits for year, power and price:
1. Define the low and high outlier limits for 'registration_year':
For the low limit, use the formula:
year_low_limit = 1999 - 1.5 * 9 = 1985 (1985.5 rounded down)
For the high limit, look at the time span of the sale advertisements in the dataset (see above): June 2015 to April 2016. Automakers record the real date of manufacture in the vehicle registration certificate, hence the high outlier limit cannot be greater than 2016:
year_high_limit = 2016
2. Define the low and high outlier limits for 'power_ps':
Formally applying the formula gives -50 PS, and the 'power_ps' column also contains zero values. Extract the unique values from 'power_ps' and sort them:
power = sorted(autos['power_ps'].unique())
print (power[0:10])
[0, 1, 2, 3, 4, 5, 6, 8, 9, 10]
Take the smallest value that is not zero (rather than the formal -50 PS) as the low limit:
power_low_limit = 1
For the high limit, use the formula:
power_high_limit = 150 + 80 * 1.5 = 270
N.B. The low limit of 1 PS is generous; implausibly small power values may remain.
3. Define the low and high outlier limits for 'price_usd':
Formally applying the formula gives a low limit of about -8,000 USD (1,100 - 1.5 * 6,100 = -8,050). Instead, use the smallest real value that is not zero.
price = sorted(autos['price_usd'].unique())
print(price[0:10])
[0, 1, 2, 3, 5, 8, 9, 10, 11, 12]
So define:
price_low_limit = 1
For the high outlier limit, use the formula:
price_high_limit = 7200 + 6100 * 1.5 = 16350
We have defined the low and high limits for the necessary values; now let's modify the autos dataset. Create a copy of the current autos named 'autos_raw_copy', then remove the outliers in several steps. First, describe autos_raw_copy for later comparison.
autos_raw_copy = autos.copy()
autos_raw_copy[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
odometer_km | price_usd | registration_year | power_ps | |
---|---|---|---|---|
count | 50000.000000 | 5.000000e+04 | 50000.000000 | 50000.000000 |
mean | 125732.700000 | 9.840044e+03 | 2005.073280 | 116.355920 |
std | 40042.211706 | 4.811044e+05 | 105.712813 | 209.216627 |
min | 5000.000000 | 0.000000e+00 | 1000.000000 | 0.000000 |
25% | 125000.000000 | 1.100000e+03 | 1999.000000 | 70.000000 |
50% | 150000.000000 | 2.950000e+03 | 2003.000000 | 105.000000 |
75% | 150000.000000 | 7.200000e+03 | 2008.000000 | 150.000000 |
max | 150000.000000 | 1.000000e+08 | 9999.000000 | 17700.000000 |
Clean 'registration_year':
autos= autos[autos['registration_year'].between(1985, 2016)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
odometer_km | price_usd | registration_year | power_ps | |
---|---|---|---|---|
count | 47215.000000 | 4.721500e+04 | 47215.000000 | 47215.000000 |
mean | 126118.288679 | 9.103618e+03 | 2003.319814 | 117.634650 |
std | 39607.366404 | 4.836483e+05 | 6.035886 | 196.367757 |
min | 5000.000000 | 0.000000e+00 | 1985.000000 | 0.000000 |
25% | 125000.000000 | 1.111000e+03 | 1999.000000 | 75.000000 |
50% | 150000.000000 | 2.950000e+03 | 2003.000000 | 109.000000 |
75% | 150000.000000 | 7.250000e+03 | 2008.000000 | 150.000000 |
max | 150000.000000 | 1.000000e+08 | 2016.000000 | 17700.000000 |
As we see, 2,785 records (5.57% of the original) are outliers by year.
Clean 'power_ps':
autos= autos[autos['power_ps'].between(1, 270)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
odometer_km | price_usd | registration_year | power_ps | |
---|---|---|---|---|
count | 41045.000000 | 4.104500e+04 | 41045.000000 | 41045.000000 |
mean | 126401.266902 | 8.942654e+03 | 2003.420563 | 119.698648 |
std | 38795.620424 | 5.153831e+05 | 5.950197 | 47.966269 |
min | 5000.000000 | 0.000000e+00 | 1985.000000 | 1.000000 |
25% | 125000.000000 | 1.250000e+03 | 1999.000000 | 80.000000 |
50% | 150000.000000 | 3.100000e+03 | 2003.000000 | 115.000000 |
75% | 150000.000000 | 7.300000e+03 | 2008.000000 | 150.000000 |
max | 150000.000000 | 1.000000e+08 | 2016.000000 | 270.000000 |
As we see, a further 6,170 records (12.34% of the original) are outliers by power_ps.
Clean 'price_usd':
autos= autos[autos['price_usd'].between(1, 16350)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
odometer_km | price_usd | registration_year | power_ps | |
---|---|---|---|---|
count | 37486.000000 | 37486.000000 | 37486.000000 | 37486.000000 |
mean | 130087.632716 | 4247.285413 | 2002.919063 | 115.380329 |
std | 35291.177041 | 3849.517064 | 5.657016 | 45.229916 |
min | 5000.000000 | 1.000000 | 1985.000000 | 1.000000 |
25% | 125000.000000 | 1214.250000 | 1999.000000 | 75.000000 |
50% | 150000.000000 | 2900.000000 | 2003.000000 | 109.000000 |
75% | 150000.000000 | 6300.000000 | 2007.000000 | 143.000000 |
max | 150000.000000 | 16350.000000 | 2016.000000 | 270.000000 |
As we see, **3,559 records (7.12% of the original)** are outliers by price_usd.
In total, 12,514 records (25.03% of the original) are outliers by year, power, or price.
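The three sequential between() filters above are equivalent to a single combined boolean mask; a minimal sketch on a hypothetical mini-frame using the limits derived above:

```python
import pandas as pd

# Hypothetical mini-frame; one valid row plus one outlier of each kind.
df = pd.DataFrame({
    "registration_year": [2004, 9999, 2000, 1970],
    "power_ps": [158, 100, 0, 90],
    "price_usd": [5000, 3000, 2000, 100000],
})

# Keep only rows inside all three intervals at once
mask = (df["registration_year"].between(1985, 2016)
        & df["power_ps"].between(1, 270)
        & df["price_usd"].between(1, 16350))
clean = df[mask]
print(len(clean))
```

The sequential approach used in this notebook has the advantage of showing how many records each criterion removes.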
Check the cleaned dataset:
autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37486 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          37486 non-null object
name                  37486 non-null object
price_usd             37486 non-null int64
abtest                37486 non-null object
vehicle_type          36062 non-null object
registration_year     37486 non-null int64
gearbox               36822 non-null object
power_ps              37486 non-null int64
model                 36098 non-null object
odometer_km           37486 non-null int64
registration_month    37486 non-null int64
fuel_type             35431 non-null object
brand                 37486 non-null object
unrepaired_damage     31858 non-null object
ad_created            37486 non-null object
postal_code           37486 non-null int64
last_seen             37486 non-null object
dtypes: int64(6), object(11)
memory usage: 5.1+ MB
Formally we could also remove the records containing nulls in the remaining columns, but given the free-style nature of eBay Kleinanzeigen sale advertisements that would be excessive: we already have a reliable dataset with non-zero, in-range values for registration year, price, and power. Let's think about which values have real significance for further analysis.
**In summary, we will analyze:** the distribution of ads by creation month, the distribution by registration year, and the mean price, power, and year of make for the most popular brands.
# Extract datetime values
year_month_dt = pd.to_datetime(autos['ad_created'])
# Convert to string Month Year
year_month_str = year_month_dt.dt.strftime('%B, %Y')
# Create dictionary
month_year_dic = {}
for value in year_month_str:
if value not in month_year_dic:
month_year_dic[value] = 1
else:
month_year_dic[value] += 1
# Convert dictionary to lists
month_year = []
values = []
for k, v in month_year_dic.items ():
month_year.append(k)
values.append(v)
# Convert list to dataframe
month_year_df = pd.DataFrame(list(zip(month_year, values)),
columns =['Month, Year', 'Ad numbers'])
# Calculate sum of column Ad numbers
total = month_year_df['Ad numbers'].sum()
# Add new column Percent
month_year_df['Percents %'] = month_year_df['Ad numbers'] / total
# Sort by descending (inplace, so the sorted order is kept)
month_year_df.sort_values(by =['Percents %'], ascending = False, inplace = True)
# Format output for dataframe
month_year_df.style.format({"Ad numbers":"{:,}", "Percents %":"{:.3%}"})
Month, Year | Ad numbers | Percents % | |
---|---|---|---|
2 | March, 2016 | 31,403 | 83.773% |
7 | April, 2016 | 6,019 | 16.057% |
1 | February, 2016 | 52 | 0.139% |
3 | January, 2016 | 7 | 0.019% |
5 | December, 2015 | 2 | 0.005% |
0 | September, 2015 | 1 | 0.003% |
4 | November, 2015 | 1 | 0.003% |
6 | August, 2015 | 1 | 0.003% |
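As an aside, the manual dictionary counting above can be replaced by pandas' value_counts, which with normalize=True also yields the shares directly; a sketch on hypothetical month labels:

```python
import pandas as pd

# Hypothetical month labels standing in for year_month_str above.
labels = pd.Series(["March, 2016", "April, 2016", "March, 2016",
                    "March, 2016", "February, 2016"])

# Counts and shares in two lines instead of the manual dictionary loop
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(counts["March, 2016"], round(shares["March, 2016"], 2))
```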
Remember that ad_created_max = 07 April 2016 and ad_created_min = 11 June 2015. Almost all ads (99.8%) were created in March and April 2016, the last two months before the crawl ended; so, on average, a car takes no more than about two months to sell.
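This can also be checked per ad: last_seen minus ad_created approximates how long an ad stayed live (truncated by the crawl window). A sketch using the two timestamp pairs from rows 0 and 3 of the head() output shown earlier:

```python
import pandas as pd

# Two ad_created / last_seen pairs taken from the head() output above.
ads = pd.DataFrame({
    "ad_created": ["2016-03-26 00:00:00", "2016-03-12 00:00:00"],
    "last_seen":  ["2016-04-06 06:45:54", "2016-03-15 03:16:28"],
})

# Days each ad was visible to the crawler (an upper bound is the crawl window)
lifetime_days = (pd.to_datetime(ads["last_seen"])
                 - pd.to_datetime(ads["ad_created"])).dt.days
print(lifetime_days.tolist())
```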
# Extract values from column 'registration_year'
reg_year = autos['registration_year']
# Create a dictionary for counting registration years
reg_year_dic = {}
for value in reg_year:
if value not in reg_year_dic:
reg_year_dic[value] = 1
else:
reg_year_dic[value] += 1
# Calculate sum of values
total = sum(reg_year_dic.values())
year = []
values = []
for k, v in reg_year_dic.items ():
year.append(k)
values.append(v)
# Convert list to dataframe
year_df = pd.DataFrame(list(zip(year, values)),
columns =['Year reg', 'Ad numbers'])
# Calculate sum of column Ad numbers
total = year_df['Ad numbers'].sum()
# Sort by descending
year_df.sort_values(by = ['Ad numbers'], inplace = True, ascending=False)
# Add new column Percent
year_df['Percents %'] = year_df['Ad numbers'] / total
year_df.style.format({"Ad numbers":"{:,}", "Percents %":"{:.3%}"})
Year reg | Ad numbers | Percents % | |
---|---|---|---|
14 | 1999 | 2,516 | 6.712% |
15 | 2000 | 2,494 | 6.653% |
20 | 2005 | 2,452 | 6.541% |
21 | 2006 | 2,408 | 6.424% |
18 | 2003 | 2,408 | 6.424% |
19 | 2004 | 2,403 | 6.410% |
16 | 2001 | 2,346 | 6.258% |
17 | 2002 | 2,205 | 5.882% |
13 | 1998 | 2,070 | 5.522% |
22 | 2007 | 1,958 | 5.223% |
23 | 2008 | 1,857 | 4.954% |
12 | 1997 | 1,724 | 4.599% |
24 | 2009 | 1,716 | 4.578% |
25 | 2010 | 1,182 | 3.153% |
11 | 1996 | 1,178 | 3.143% |
26 | 2011 | 1,066 | 2.844% |
10 | 1995 | 994 | 2.652% |
31 | 2016 | 828 | 2.209% |
27 | 2012 | 779 | 2.078% |
9 | 1994 | 533 | 1.422% |
28 | 2013 | 385 | 1.027% |
8 | 1993 | 364 | 0.971% |
7 | 1992 | 325 | 0.867% |
6 | 1991 | 288 | 0.768% |
5 | 1990 | 260 | 0.694% |
29 | 2014 | 227 | 0.606% |
4 | 1989 | 145 | 0.387% |
3 | 1988 | 113 | 0.301% |
30 | 2015 | 85 | 0.227% |
0 | 1985 | 63 | 0.168% |
2 | 1987 | 58 | 0.155% |
1 | 1986 | 56 | 0.149% |
As we see, 76.18% of the cars offered for sale were made between 1997 and 2009.
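The 76.18% figure can be verified with a boolean mean instead of summing table rows; a sketch on hypothetical registration years (on the real data, the same expression runs on the cleaned autos frame):

```python
import pandas as pd

# Hypothetical registration years standing in for autos['registration_year'].
years = pd.Series([1999, 2005, 2012, 1996, 2003, 2000])

# Share of values inside the interval: mean of a boolean mask
share = years.between(1997, 2009).mean()
print(round(share * 100, 2))
```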
Get the list of the top 15 brands:
autos['brand'].value_counts().head(15).sort_values(ascending=False)
volkswagen       8182
opel             4337
bmw              3943
mercedes_benz    3118
audi             2970
ford             2736
renault          1857
peugeot          1235
fiat             1021
seat              743
skoda             681
mazda             628
smart             584
nissan            567
citroen           567
Name: brand, dtype: int64
As we see, most of the cars offered for sale are made by the big German car trinity: VAG (which also owns the Audi, SEAT, and Škoda marques), BMW, and Mercedes-Benz.
As we know, mileage contains only discrete and, in my opinion, artificial values, so instead I examine the mean price, power, and year of make for these brands. I don't use a dictionary here, but I apply the same idea to create a small dataframe. Let's begin:
# Create a list of rows: each brand with its mean price, power, and year of make
agregate = []
for brand in autos['brand'].value_counts().head(15).index:
row_list = []
row_list.append(brand)
row_list.append(int(round(autos[autos['brand'] == brand]
['price_usd'].mean(), 0)))
row_list.append(int(round(autos[autos['brand'] == brand]
['power_ps'].mean(), 0)))
row_list.append(int(round(autos[autos['brand'] == brand]
['registration_year'].mean(), 0)))
agregate.append(row_list)
# Convert list of list to dataframe
agregate_df = pd.DataFrame.from_records(agregate)
# Assign name of columns
agregate_df.columns = ["Brand", "Price", "Power PS", "Made year"]
# Descending sort by price
agregate_df.sort_values(by =['Price'], ascending = False, inplace = True)
# Custom format to price
agregate_df.style.format({"Price":"{:,}"})
Brand | Price | Power PS | Made year | |
---|---|---|---|---|
4 | audi | 5,842 | 151 | 2003 |
10 | skoda | 5,833 | 104 | 2007 |
2 | bmw | 5,707 | 159 | 2002 |
3 | mercedes_benz | 5,206 | 147 | 2002 |
0 | volkswagen | 4,348 | 102 | 2002 |
14 | nissan | 4,013 | 99 | 2003 |
9 | seat | 3,965 | 94 | 2004 |
12 | smart | 3,746 | 63 | 2006 |
11 | mazda | 3,745 | 118 | 2003 |
13 | citroen | 3,657 | 95 | 2005 |
7 | peugeot | 3,108 | 99 | 2004 |
5 | ford | 3,102 | 102 | 2003 |
1 | opel | 2,846 | 99 | 2003 |
8 | fiat | 2,810 | 80 | 2004 |
6 | renault | 2,468 | 90 | 2003 |
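The brand loop above can also be written with groupby in a single chain; a minimal sketch on hypothetical rows (on the real data one would group the cleaned autos frame by 'brand'):

```python
import pandas as pd

# Hypothetical mini-frame; two rows per brand is enough to show the idea.
df = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel"],
    "price_usd": [6000, 5400, 3000, 2700],
    "power_ps": [160, 158, 100, 98],
    "registration_year": [2002, 2002, 2003, 2003],
})

# Per-brand means, rounded to integers and sorted by price descending
means = (df.groupby("brand")[["price_usd", "power_ps", "registration_year"]]
           .mean()
           .round(0)
           .astype("int64")
           .sort_values(by="price_usd", ascending=False))
print(means)
```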
We see the dispersion by price: Audi, BMW, and Mercedes-Benz are among the most expensive brands (with almost the same mean year of make), with Škoda close behind thanks to its newer cars. Let's see the dispersion by power:
agregate_df.sort_values(by =['Power PS'], ascending = False, inplace = True)
# Custom format to price
agregate_df.style.format({"Price":"{:,}"})
Brand | Price | Power PS | Made year | |
---|---|---|---|---|
2 | bmw | 5,707 | 159 | 2002 |
4 | audi | 5,842 | 151 | 2003 |
3 | mercedes_benz | 5,206 | 147 | 2002 |
11 | mazda | 3,745 | 118 | 2003 |
10 | skoda | 5,833 | 104 | 2007 |
0 | volkswagen | 4,348 | 102 | 2002 |
5 | ford | 3,102 | 102 | 2003 |
14 | nissan | 4,013 | 99 | 2003 |
7 | peugeot | 3,108 | 99 | 2004 |
1 | opel | 2,846 | 99 | 2003 |
13 | citroen | 3,657 | 95 | 2005 |
9 | seat | 3,965 | 94 | 2004 |
6 | renault | 2,468 | 90 | 2003 |
8 | fiat | 2,810 | 80 | 2004 |
12 | smart | 3,746 | 63 | 2006 |
As we see, Audi, BMW, and Mercedes-Benz have the most powerful engines (with almost the same mean year of make). Let's see the dispersion by year of make:
agregate_df.sort_values(by =['Made year'], ascending = False, inplace = True)
# Custom format to price
agregate_df.style.format({"Price":"{:,}"})
Brand | Price | Power PS | Made year | |
---|---|---|---|---|
10 | skoda | 5,833 | 104 | 2007 |
12 | smart | 3,746 | 63 | 2006 |
13 | citroen | 3,657 | 95 | 2005 |
7 | peugeot | 3,108 | 99 | 2004 |
9 | seat | 3,965 | 94 | 2004 |
8 | fiat | 2,810 | 80 | 2004 |
4 | audi | 5,842 | 151 | 2003 |
11 | mazda | 3,745 | 118 | 2003 |
5 | ford | 3,102 | 102 | 2003 |
14 | nissan | 4,013 | 99 | 2003 |
1 | opel | 2,846 | 99 | 2003 |
6 | renault | 2,468 | 90 | 2003 |
2 | bmw | 5,707 | 159 | 2002 |
3 | mercedes_benz | 5,206 | 147 | 2002 |
0 | volkswagen | 4,348 | 102 | 2002 |
As we see, Škoda, Smart, and Citroën cars are the newest on average.
We identified three statistically uninformative columns ('seller', 'offer_type', and 'nr_of_pictures') and dropped them.
We defined valid intervals covering all vehicle types (motorcycles, scooters, cars) and removed 12,514 outliers (25.03% of the original). The task description does not ask us to single out cars only; if it did, I would use 20 PS (the engine power of the Ford Model T, 1908-1927) as the low outlier limit.
We learned that limits calculated by formula must be checked and adjusted against real-world conditions; otherwise we get cars from the future or from antiquity, with negative or zero price and power.
We learned that mileage has artificial, discrete values which cannot be used for defining outlier limits.
We saw that the most popular cars in Germany are made by the big German car trinity: VAG (which also owns the Audi, SEAT, and Škoda marques), BMW, and Mercedes-Benz. The luxury models of these brands have almost the same price and power (and almost the same year of make).
Created on Jan 27, 2021
@author: Vadim Maklakov, used some ideas from public Internet resources.
© 3-clause BSD License
Software environment:
Debian 10.7
Python 3.8.7
Required preinstalled Python modules:
csv
numpy
pandas