eBay is an online ecommerce market that allows consumers to sell anything ranging from kitchen appliances to toys to even cars. In this report we will be taking a look at the used car sales from eBay Kleinanzeigen (a classifieds section of the German eBay site). The original dataset was retried on Kaggle, but the version we will be working with is provided by Dataquest and has been dirtied to more closely resemble a scraped dataset.
The dataset dictionary is provided as follows:
dateCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the carseller
- Whether the seller is private or a dealerofferType
- The type of listingprice
- The price on the ad to sell the carabtest
- Whether the listing is included in an A/B testvehicleType
- The year in which the car was first registeredyearOfRegistration
- The year in which the car was first registeredgearbox
- The transmission typepowerPS
- The power of the car in PSmodel
- The car model namekilometer
- How many kilometers the car has drivenmonthOfRegistration
- The month in which the car was first registeredfuelType
- What type of fuel the car usesbrand
- The brand of the carnotRepairedDamage
- If the car has a damage which is not yet repaireddateCreated
- The date on which the eBay listing was creatednrOfPictures
- The number of pictures in the adpostalCode
- The postal code for the location of the vehiclelastSeenOnline
- When the crawler saw this ad last onlineimport numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline
autos_df = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos_df.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
The dataset contains 50000 entries with 20 columns. We can see we are missing data for 5 columns.
To make the columns more readable we will be convert the column names form camelcase to snake case.
We will also rename several columns to make them more descriptive.
yearOfRegistration
to registration_year
monthOfRegistration
to registration_month
notRepairedDamage
to unrepaired_damage
dateCreated
to ad_created
autos_df.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price',
'ab_test', 'vehicle_type', 'registration_year', 'gearbox',
'power_ps','model', 'odometer', 'registration_month',
'fuel_type', 'brand', 'unrepaired_damage', 'ad_created',
'num_photos', 'postal_code', 'last_seen']
autos_df.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos_df.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-04 16:40:33 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offer_type 50000 non-null object 4 price 50000 non-null object 5 ab_test 50000 non-null object 6 vehicle_type 44905 non-null object 7 registration_year 50000 non-null int64 8 gearbox 47320 non-null object 9 power_ps 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 registration_month 50000 non-null int64 13 fuel_type 45518 non-null object 14 brand 50000 non-null object 15 unrepaired_damage 40171 non-null object 16 ad_created 50000 non-null object 17 num_photos 50000 non-null int64 18 postal_code 50000 non-null int64 19 last_seen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Looking at the dataset, we observe that seller
and offer_type
appear to be quite similar in their summary statistics. Both only contain two unique values and the frequency of the more popular value is 49999 for both columns. This may imply that there is redundant data in these two column.
There are several columns that have their datatype converted for easier analysis.
num_photos
seems to contain large number of missing values.
autos_df['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos_df['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
Translated from German to English:
autos_df[autos_df['seller'] == 'gewerblich']
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7738 | 2016-03-15 18:06:22 | Verkaufe_mehrere_Fahrzeuge_zum_Verschrotten | gewerblich | Angebot | $100 | control | kombi | 2000 | manuell | 0 | megane | 150,000km | 8 | benzin | renault | NaN | 2016-03-15 00:00:00 | 0 | 65232 | 2016-04-06 17:15:37 |
autos_df[autos_df['offer_type'] == 'Gesuch']
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17541 | 2016-04-03 15:48:33 | Suche_VW_T5_Multivan | privat | Gesuch | $0 | test | bus | 2005 | NaN | 0 | transporter | 150,000km | 0 | NaN | volkswagen | NaN | 2016-04-03 00:00:00 | 0 | 29690 | 2016-04-05 15:16:06 |
Despite looking as it there was redundant values, we now see that "commercial" does not imply "formal request". We will be leaving these two columns in the dataset.
date_crawled
,last_seen
,ad_created
to datetimeprice
,odometer
to integerautos_df['date_crawled'] = pd.to_datetime(autos_df['date_crawled'])
autos_df['last_seen'] = pd.to_datetime(autos_df['last_seen'])
autos_df['ad_created'] = pd.to_datetime(autos_df['ad_created'])
For price and odometer, we want to removed any notations to be able to convert them into integers.
autos_df.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 | 0 | 39218 | 2016-04-01 14:38:50 |
autos_df['price'] = autos_df['price'].str.replace('$','').str.replace(',','').astype(int)
autos_df['price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
autos_df['odometer'] = autos_df['odometer'].str.replace('km', '').str.replace(',', '').astype(int)
autos_df.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)
autos_df['odometer_km'].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer_km, dtype: int64
autos_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null datetime64[ns] 1 name 50000 non-null object 2 seller 50000 non-null object 3 offer_type 50000 non-null object 4 price 50000 non-null int64 5 ab_test 50000 non-null object 6 vehicle_type 44905 non-null object 7 registration_year 50000 non-null int64 8 gearbox 47320 non-null object 9 power_ps 50000 non-null int64 10 model 47242 non-null object 11 odometer_km 50000 non-null int64 12 registration_month 50000 non-null int64 13 fuel_type 45518 non-null object 14 brand 50000 non-null object 15 unrepaired_damage 40171 non-null object 16 ad_created 50000 non-null datetime64[ns] 17 num_photos 50000 non-null int64 18 postal_code 50000 non-null int64 19 last_seen 50000 non-null datetime64[ns] dtypes: datetime64[ns](3), int64(7), object(10) memory usage: 7.6+ MB
autos_df['price'].describe().apply(lambda x: format(x, 'f'))
count 50000.000000 mean 9840.043760 std 481104.380500 min 0.000000 25% 1100.000000 50% 2950.000000 75% 7200.000000 max 99999999.000000 Name: price, dtype: object
Looking the price and odometer values, we observe there are several questionable values.
autos_df["price"].value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
autos_df["price"].value_counts().sort_index().head(20)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price, dtype: int64
At the very top of the price range, there includes prices well over a million, while only a couple dollars at the bottom. One explanations to these abnormally low prices can be attributed to auctions, which may start at 1 dollar. Another explanation is that the price is a place holder for negotiations .
To prevent these values from skewing the data, we will be removing the entries with price less than 100 and greater than 350000.
autos_df = autos_df[autos_df['price'].between(100,350001)]
autos_df['price'].describe()
count 48224.000000 mean 5930.371433 std 9078.372762 min 100.000000 25% 1250.000000 50% 3000.000000 75% 7499.000000 max 350000.000000 Name: price, dtype: float64
Compared to what we had before the mean dropped significantly, from 9840 to 5930, and the standard deviation has shrunk from 481104 to 9078.
import seaborn as sns
print(sns.__version__)
0.10.1
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
ax1 = sns.distplot(autos_df['price'])
ax2 = fig.add_subplot(122)
ax2 = sns.distplot(np.log10(autos_df['price']))
log10_prices = np.log10(autos_df['price'])
log10_prices.describe()
count 48224.000000 mean 3.477612 std 0.527608 min 2.000000 25% 3.096910 50% 3.477121 75% 3.875003 max 5.544068 Name: price, dtype: float64
# Student t 95% confidence interveral
stats.t.interval(0.95, len(log10_prices)-1, loc=np.mean(log10_prices), scale= stats.sem(log10_prices))
(3.472902555662607, 3.4823207767984217)
Using a log base 10 regression on the price of vehicles sold, we obtain a log normal distribution of the price of cars sold. We are able to observe:
We are also able to construct a 95% confidence interval indicating that there is 95% chance the price of cars sold on eBay Kleinanzeigen is in has a mean between \$2951 to \$3019.
print(autos_df["odometer_km"].describe())
autos_df["odometer_km"].value_counts()
count 48224.000000 mean 125919.148142 std 39543.339640 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
150000 31212 125000 5037 100000 2101 90000 1733 80000 1412 70000 1214 60000 1153 50000 1009 40000 814 30000 777 5000 760 20000 757 10000 245 Name: odometer_km, dtype: int64
fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot()
ax = plt.hist(autos_df['odometer_km'])
We see all the odometer values are rounded in the thousands and the vast majority of cars have mileage over 150k+ km.
autos_df['date_crawled'].dt.year.value_counts()
2016 48224 Name: date_crawled, dtype: int64
autos_df['date_crawled'].dt.month.value_counts()
3 40408 4 7816 Name: date_crawled, dtype: int64
All of the entries were crawled during March 2016 and into April 2016.
autos_df['ad_created'].dt.year.value_counts()
2016 48218 2015 6 Name: ad_created, dtype: int64
autos_df['ad_created'].dt.month.value_counts()
3 40380 4 7765 2 61 1 12 12 2 11 1 9 1 8 1 6 1 Name: ad_created, dtype: int64
Much of the ad posting were also created during March 2016 to April 2016. With only 61 listing posted on Feburary. This can either indicate many of the cars were sold off within a month or two, or that sellers down old postings within a month or two.
autos_df['registration_year'].describe()
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
We can see some suspicious values mixed in in the registration years; we have minimum value of 1000 and maximum value of 9999. We will be capping the range from 1950 to 2016.
autos_df = autos_df[autos_df['registration_year'].between(1950,2016)]
fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot()
ax = plt.hist(autos_df['registration_year'], bins=15)
We can see most of the cars were registered from late 1990s to 2010s, with a spike in 2016 indicating new registrations.
autos_df['brand'].value_counts(normalize=True)
volkswagen 0.211460 bmw 0.110197 opel 0.107241 mercedes_benz 0.096644 audi 0.086803 ford 0.069796 renault 0.047070 peugeot 0.029869 fiat 0.025618 seat 0.018258 skoda 0.016424 nissan 0.015345 mazda 0.015237 smart 0.014201 citroen 0.014050 toyota 0.012798 hyundai 0.010014 sonstige_autos 0.009388 volvo 0.009129 mini 0.008805 mitsubishi 0.008180 honda 0.007877 kia 0.007079 alfa_romeo 0.006669 porsche 0.006021 suzuki 0.005935 chevrolet 0.005676 chrysler 0.003518 dacia 0.002655 daihatsu 0.002504 jeep 0.002288 land_rover 0.002115 subaru 0.002115 saab 0.001662 jaguar 0.001532 daewoo 0.001489 trabant 0.001360 rover 0.001338 lancia 0.001058 lada 0.000583 Name: brand, dtype: float64
len(autos_df['brand'].unique())
40
As expected, as we are analyzing data from eBay Kleinanzeigen, German cars dominate the top 5 car brands, occupying more than 50% of the listings. Despite this there are a wide array of car in the postings, with a total of 40 unique car brands.
There are many brands that do not have significant percentage of listings, so we will limit our analysis to the top 10 brands in the listings.
brands_10 = autos_df['brand'].value_counts().head(10).index
autos_10 = autos_df[autos_df['brand'].isin(brands_10)]
autos_10_gb = autos_10.groupby('brand')
autos_10_gb['price'].describe().reindex(brands_10)
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
volkswagen | 9798.0 | 5433.228618 | 6216.493931 | 100.0 | 1250.00 | 3000.0 | 7500.0 | 64500.0 |
bmw | 5106.0 | 8381.066588 | 9864.113876 | 100.0 | 2399.00 | 5990.0 | 11237.5 | 259000.0 |
opel | 4969.0 | 3005.930972 | 3613.537077 | 100.0 | 800.00 | 1650.0 | 3800.0 | 38990.0 |
mercedes_benz | 4478.0 | 8669.448638 | 10188.212637 | 100.0 | 2350.00 | 5200.0 | 11787.5 | 180000.0 |
audi | 4022.0 | 9380.718548 | 10029.181115 | 100.0 | 2312.50 | 6400.0 | 12999.0 | 175000.0 |
ford | 3234.0 | 3767.004947 | 5792.612410 | 100.0 | 800.00 | 1800.0 | 4300.0 | 130000.0 |
renault | 2181.0 | 2496.940394 | 3592.998395 | 100.0 | 750.00 | 1390.0 | 2990.0 | 93000.0 |
peugeot | 1384.0 | 3113.860549 | 3132.992093 | 100.0 | 999.75 | 2100.0 | 4000.0 | 25500.0 |
fiat | 1187.0 | 2836.873631 | 3038.786980 | 100.0 | 799.50 | 1690.0 | 3800.0 | 25000.0 |
seat | 846.0 | 4433.419622 | 4777.410241 | 100.0 | 1000.00 | 2800.0 | 6300.0 | 33500.0 |
We can see here that there are three distince catigories of cars:
autos_10_gb['odometer_km'].describe().reindex(brands_10)
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
volkswagen | 9798.0 | 128802.816901 | 38353.245288 | 5000.0 | 125000.0 | 150000.0 | 150000.0 | 150000.0 |
bmw | 5106.0 | 132720.329025 | 34581.206872 | 5000.0 | 125000.0 | 150000.0 | 150000.0 | 150000.0 |
opel | 4969.0 | 129405.312940 | 36969.592163 | 5000.0 | 125000.0 | 150000.0 | 150000.0 | 150000.0 |
mercedes_benz | 4478.0 | 131069.673962 | 35901.474208 | 5000.0 | 125000.0 | 150000.0 | 150000.0 | 150000.0 |
audi | 4022.0 | 129245.400298 | 38743.748370 | 5000.0 | 125000.0 | 150000.0 | 150000.0 | 150000.0 |
ford | 3234.0 | 124386.209029 | 39625.163863 | 5000.0 | 100000.0 | 150000.0 | 150000.0 | 150000.0 |
renault | 2181.0 | 128337.918386 | 36731.107945 | 5000.0 | 125000.0 | 150000.0 | 150000.0 | 150000.0 |
peugeot | 1384.0 | 127127.890173 | 36008.940436 | 5000.0 | 100000.0 | 150000.0 | 150000.0 | 150000.0 |
fiat | 1187.0 | 116950.294861 | 42557.966296 | 5000.0 | 90000.0 | 125000.0 | 150000.0 | 150000.0 |
seat | 846.0 | 121536.643026 | 40847.564282 | 5000.0 | 100000.0 | 150000.0 | 150000.0 | 150000.0 |
Regardless of brands, all cars have high mileage on them, expected of second hand cars. More expensive brands on average have slightly higher mileage over the less expensive brands.