In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle. We've made a few modifications from the original dataset that was uploaded to Kaggle:
The aim of this project is to clean the data and analyze the included used car listings.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
import re
autos.rename({"yearOfRegistration":"registration_year",
"monthOfRegistration":"registration_month",
"notRepairedDamage":"unrepaired_damage",
"dateCreated":"ad_created"}, axis = 1, inplace = True)
columns = autos.columns
def cleaning(name):
name = re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
return name
new_columns = []
for name in columns:
new_column = cleaning(name)
new_columns.append(new_column)
autos.columns = new_columns
print(autos.columns)
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Above, We edited some columns name:
autos.describe(include = "all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-30 17:37:35 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): date_crawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offer_type 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicle_type 44905 non-null object registration_year 50000 non-null int64 gearbox 47320 non-null object power_p_s 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object registration_month 50000 non-null int64 fuel_type 45518 non-null object brand 50000 non-null object unrepaired_damage 40171 non-null object ad_created 50000 non-null object nr_of_pictures 50000 non-null int64 postal_code 50000 non-null int64 last_seen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
We can see that , "Seller" and "offer_type" column has almost one type i.e "private" and "Angebot" respectively. So we can asume that its just one time and we can no longer need it .
As we see "nr_of_picture" seems a bit more supicious. And it required bit more investigation.
#Dropping Seller and offer_type columns
autos.drop(columns= {"seller","offer_type"}, inplace = True)
autos.nr_of_pictures.value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
Since all the value In "nr_of_pictures" columns has 0 pictures, We can remove it.
# Removing nr_of_pictures columns
autos.drop(columns= {"nr_of_pictures"}, inplace = True)
# Changing data type form string to int
autos["price"]= autos["price"].str.replace("$","").str.replace(",","").astype(int)
autos["odometer"]= autos["odometer"].str.replace(",","").str.replace("km","").astype(int)
#Renaming odometer to odometer_km
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)
# Total number of Unique values in price
autos.price.unique().shape
(2357,)
#Finding maximum and minimum of price columns
autos.price.describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos.price.value_counts().sort_index().head(20)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price, dtype: int64
autos.price.value_counts().sort_index(ascending = False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
There are 1421 riws in which price for a car is 0 . We need to remove whole rows where price is equal to 0 .
The highest Price for a Car in Ebay was almost $ 1 million doller.
#Removing rows where price of a car os equal to 0
autos.price= autos.price[autos.price !=0 ]
autos.price.value_counts().sort_index().head(20)
1.0 156 2.0 3 3.0 1 5.0 2 8.0 1 9.0 1 10.0 7 11.0 2 12.0 3 13.0 2 14.0 1 15.0 2 17.0 3 18.0 1 20.0 4 25.0 5 29.0 1 30.0 7 35.0 1 40.0 6 Name: price, dtype: int64
autos.odometer_km.describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos.odometer_km.value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
It seems like about two third of car in ebay has travelled abouT 150000 km and few of the vehicel are below 50000 km .
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself.
Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.
autos[["date_crawled","last_seen","ad_created"]][0:5]
date_crawled | last_seen | ad_created | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 |
# In frequency
autos["date_crawled"].str[:10].value_counts()
2016-04-03 1934 2016-03-20 1891 2016-03-21 1876 2016-03-12 1839 2016-03-14 1831 2016-04-04 1826 2016-03-07 1798 2016-04-02 1770 2016-03-19 1745 2016-03-28 1742 2016-03-29 1709 2016-03-15 1699 2016-04-01 1690 2016-03-30 1681 2016-03-08 1665 2016-03-09 1661 2016-03-22 1647 2016-03-26 1624 2016-03-11 1624 2016-03-23 1619 2016-03-10 1606 2016-03-31 1596 2016-03-25 1587 2016-03-17 1576 2016-03-27 1552 2016-03-16 1475 2016-03-24 1455 2016-03-05 1269 2016-03-13 778 2016-03-06 697 2016-04-05 655 2016-03-18 653 2016-04-06 159 2016-04-07 71 Name: date_crawled, dtype: int64
# In percentage
autos["date_crawled"].str[:10].value_counts(normalize = True, dropna= False)
2016-04-03 0.03868 2016-03-20 0.03782 2016-03-21 0.03752 2016-03-12 0.03678 2016-03-14 0.03662 2016-04-04 0.03652 2016-03-07 0.03596 2016-04-02 0.03540 2016-03-19 0.03490 2016-03-28 0.03484 2016-03-29 0.03418 2016-03-15 0.03398 2016-04-01 0.03380 2016-03-30 0.03362 2016-03-08 0.03330 2016-03-09 0.03322 2016-03-22 0.03294 2016-03-26 0.03248 2016-03-11 0.03248 2016-03-23 0.03238 2016-03-10 0.03212 2016-03-31 0.03192 2016-03-25 0.03174 2016-03-17 0.03152 2016-03-27 0.03104 2016-03-16 0.02950 2016-03-24 0.02910 2016-03-05 0.02538 2016-03-13 0.01556 2016-03-06 0.01394 2016-04-05 0.01310 2016-03-18 0.01306 2016-04-06 0.00318 2016-04-07 0.00142 Name: date_crawled, dtype: float64
# Ranking the date in ascending order(earliest to latest)
autos["date_crawled"].str[:10].value_counts(normalize = True, dropna= False).sort_index()
2016-03-05 0.02538 2016-03-06 0.01394 2016-03-07 0.03596 2016-03-08 0.03330 2016-03-09 0.03322 2016-03-10 0.03212 2016-03-11 0.03248 2016-03-12 0.03678 2016-03-13 0.01556 2016-03-14 0.03662 2016-03-15 0.03398 2016-03-16 0.02950 2016-03-17 0.03152 2016-03-18 0.01306 2016-03-19 0.03490 2016-03-20 0.03782 2016-03-21 0.03752 2016-03-22 0.03294 2016-03-23 0.03238 2016-03-24 0.02910 2016-03-25 0.03174 2016-03-26 0.03248 2016-03-27 0.03104 2016-03-28 0.03484 2016-03-29 0.03418 2016-03-30 0.03362 2016-03-31 0.03192 2016-04-01 0.03380 2016-04-02 0.03540 2016-04-03 0.03868 2016-04-04 0.03652 2016-04-05 0.01310 2016-04-06 0.00318 2016-04-07 0.00142 Name: date_crawled, dtype: float64
autos["last_seen"].str[:10].value_counts(normalize = True, dropna= False)
2016-04-06 0.22100 2016-04-07 0.13092 2016-04-05 0.12428 2016-03-17 0.02792 2016-04-03 0.02536 2016-04-02 0.02490 2016-03-30 0.02484 2016-04-04 0.02462 2016-03-31 0.02384 2016-03-12 0.02382 2016-04-01 0.02310 2016-03-29 0.02234 2016-03-22 0.02158 2016-03-28 0.02086 2016-03-21 0.02074 2016-03-20 0.02070 2016-03-24 0.01956 2016-03-25 0.01920 2016-03-23 0.01858 2016-03-26 0.01696 2016-03-16 0.01644 2016-03-27 0.01602 2016-03-15 0.01588 2016-03-19 0.01574 2016-03-14 0.01280 2016-03-11 0.01252 2016-03-10 0.01076 2016-03-09 0.00986 2016-03-13 0.00898 2016-03-08 0.00760 2016-03-18 0.00742 2016-03-07 0.00536 2016-03-06 0.00442 2016-03-05 0.00108 Name: last_seen, dtype: float64
The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.
The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.
autos["ad_created"].str[:10].value_counts(normalize = True, dropna= False)
2016-04-03 0.03892 2016-03-20 0.03786 2016-03-21 0.03772 2016-04-04 0.03688 2016-03-12 0.03662 2016-03-14 0.03522 2016-04-02 0.03508 2016-03-28 0.03496 2016-03-07 0.03474 2016-03-29 0.03414 2016-03-19 0.03384 2016-04-01 0.03380 2016-03-15 0.03374 2016-03-30 0.03344 2016-03-08 0.03334 2016-03-09 0.03324 2016-03-22 0.03280 2016-03-11 0.03278 2016-03-26 0.03256 2016-03-23 0.03218 2016-03-31 0.03192 2016-03-25 0.03188 2016-03-10 0.03186 2016-03-17 0.03120 2016-03-27 0.03090 2016-03-16 0.03000 2016-03-24 0.02908 2016-03-05 0.02304 2016-03-13 0.01692 2016-03-06 0.01512 ... 2016-02-19 0.00006 2016-02-24 0.00004 2016-02-20 0.00004 2016-02-26 0.00004 2016-02-09 0.00004 2016-02-02 0.00004 2016-02-14 0.00004 2016-01-10 0.00004 2016-02-18 0.00004 2016-02-05 0.00004 2016-02-08 0.00002 2015-06-11 0.00002 2016-01-13 0.00002 2016-02-11 0.00002 2015-11-10 0.00002 2016-02-22 0.00002 2015-12-05 0.00002 2015-12-30 0.00002 2016-01-16 0.00002 2016-01-22 0.00002 2016-02-01 0.00002 2016-01-14 0.00002 2015-08-10 0.00002 2016-02-16 0.00002 2015-09-09 0.00002 2016-02-07 0.00002 2016-01-07 0.00002 2016-01-29 0.00002 2016-02-17 0.00002 2016-01-03 0.00002 Name: ad_created, Length: 76, dtype: float64
There is a large variety of ad created dates. Most fall within 1-2 months of the listing date, but a few are quite old, with the oldest at around 9 months.
autos.registration_year.describe()
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
We can see that ,registration_year column contains some odd values:
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
(~autos["registration_year"].between(1900,2016)).sum()/autos.shape[0]
0.03944
Since there are just 4% of rows has registration year other than 1900 to 2016 , we will remove it.
autos = autos[autos["registration_year"].between(1900,2016)]
# TOP 20 YEARSFOR REGISTRATION
autos.registration_year.value_counts(normalize = True, dropna= False).head(20)
2000 0.069834 2005 0.062776 1999 0.062464 2004 0.056988 2003 0.056779 2006 0.056384 2001 0.056280 2002 0.052740 1998 0.051074 2007 0.047972 2008 0.046452 2009 0.043683 1997 0.042225 2011 0.034022 2010 0.033251 1996 0.030066 2012 0.027546 2016 0.027401 1995 0.027338 2013 0.016782 Name: registration_year, dtype: float64
It seems like most of vehicles were registered within 20 years.
# Peforiming Aggreagtion for top 10 brands mean price in ebay
print(autos["brand"].value_counts().head(10))
print(autos["brand"].value_counts().tail(10))
volkswagen 10188 bmw 5284 opel 5195 mercedes_benz 4580 audi 4149 ford 3352 renault 2274 peugeot 1418 fiat 1242 seat 873 Name: brand, dtype: int64 jeep 108 subaru 105 land_rover 98 saab 77 jaguar 76 trabant 75 daewoo 72 rover 65 lancia 52 lada 29 Name: brand, dtype: int64
Almost one fift of the car are of "volkswagen" brand where as just 29 of cars are from "lada" brand.
# Finding average price of top 10 branded car
brand_price = {}
top_10 = autos["brand"].value_counts().head(10).index
for key in top_10:
selected_rows = autos[autos["brand"]==key]
mean_value = selected_rows["price"].mean()
brand_price[key] =int( mean_value)
print(brand_price)
price_series = pd.Series(brand_price)
price_df= pd.DataFrame(price_series, columns=["mean_price"]).sort_values(by = "mean_price", ascending = False)
print(price_df)
{'bmw': 8571, 'volkswagen': 6729, 'renault': 2474, 'audi': 9336, 'opel': 5432, 'fiat': 2813, 'ford': 7456, 'mercedes_benz': 30829, 'seat': 4397, 'peugeot': 3094} mean_price mercedes_benz 30829 audi 9336 bmw 8571 ford 7456 volkswagen 6729 opel 5432 seat 4397 peugeot 3094 fiat 2813 renault 2474
The highest mean price of top 10 brand in ebay was of "mercedes_benz" with about $$31000 and the least was of "renalut" brand with just about $2500 per average. And also:
#Finding average milage for top 10 brand cars
brand_milage = {}
top_10 = autos["brand"].value_counts().head(10).index
for key in top_10:
selected_rows = autos[autos["brand"]==key]
mean_value = selected_rows["odometer_km"].mean()
brand_milage[key] = int(mean_value)
print(brand_milage)
milage_series = pd.Series(brand_milage)
milage_df= pd.DataFrame(milage_series, columns=["mean_milage"]).sort_values(by= "mean_milage", ascending = False)
print(milage_df)
{'bmw': 132434, 'volkswagen': 128730, 'renault': 128183, 'audi': 129287, 'opel': 129227, 'fiat': 116553, 'ford': 124046, 'mercedes_benz': 130860, 'seat': 121563, 'peugeot': 127136} mean_milage bmw 132434 mercedes_benz 130860 audi 129287 opel 129227 volkswagen 128730 renault 128183 peugeot 127136 ford 124046 seat 121563 fiat 116553
#Compairing mean_milage and mean_price of the top 10 brand cars
brand_info = milage_df
brand_info["mean_price"]= price_series
print(brand_info)
mean_milage mean_price bmw 132434 8571 mercedes_benz 130860 30829 audi 129287 9336 opel 129227 5432 volkswagen 128730 6729 renault 128183 2474 peugeot 127136 3094 ford 124046 7456 seat 121563 4397 fiat 116553 2813
We can see that price of a brand car does not depends mostly on the milage of that brand car i.e less the mean milage(odometer), less the price is not applied to this. There may be other condtions that made the price of the car less or more eg Year of Registration , unrepaired_damage etc