The aim of this project is to clean the data and analyze the included used car listings.
Our dataset represents used cars from eBay Kleinanzeigen, a classifields section of the German eBay website. The dataset was originally scrapped by user orgesleka and modified by DQ for learning purpose.
The data dictionary provided with data is as follows:
dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.
So, let's move step by step:
# import libraries:
import numpy as np
import pandas as pd
# reading csv file:
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
# getting general idea about the dataset:
autos.info()
autos.head(7)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
Notes:
Our dataset have 50000 entries and 20 columns, five of columns are integer type and 15 are of object type. Not all columns are filled with information , as in "vehicleType 44905 non-null object". Some object columns can be transformed to integer type , for example "price" or "odometr".
from yearOfRegistration to registration_year
from monthOfRegistration to registration_month
from notRepairedDamage to unrepaired_damage
from dateCreated to ad_created
from camelcase to snakecase
# existing column names:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# column labels copy:
autos_copy = autos.columns.copy()
# function to format labels:
def change(label):
label = label.replace ("yearOfRegistration", "registration_year")
label = label.replace("monthOfRegistration", "registration_month")
label = label.replace("notRepairedDamage", "unrepaired_damage")
label = label.replace("dateCreated", "ad_created")
label = label.replace("dateCrawled", "date_crawled")
label = label.replace("offerType", "offer_type")
label = label.replace("vehicleType", "vehicle_type")
label = label.replace("powerPS", "power_ps")
label = label.replace("fuelType", "fuel_type")
label = label.replace("postalCode", "postal_code")
label = label.replace("lastSeen", "last_seen")
label = label.replace("nrOfPictures", "nr_of_pictures")
return label
# append function to empty list:
new_lab =[]
for name in autos_copy:
n = change(name)
new_lab.append(n)
# append list to the copy arrey:
autos_copy = new_lab
# append our copy arrey to existing names arrey:
autos.columns = autos_copy
# check current view of column labels:
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Notes:
Now we've changed the format of column labels from camelcase to snakecase and reworded some names for better visual reading
# descriptive statistics:
autos.describe(include = "all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-16 21:50:53 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Notes:
#remove non-numeric characters from columns "price" and "odometer"
autos["price"] = autos["price"].str.replace("$","").str.replace(",","")
autos["odometer"] = autos["odometer"].str.replace("km", "").str.replace(",","")
#convert string to integer:
autos["price"] = autos["price"].astype(int)
autos["odometer"] = autos["odometer"].astype(int)
#check data types:
print(autos[["price","odometer"]].dtypes)
#rename column name:
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)
price int32 odometer int32 dtype: object
# unique values:
print("Price unique values: ",autos["price"].unique().shape)
print("Odometer_km unique values: ",autos["odometer_km"].unique().shape)
Price unique values: (2357,) Odometer_km unique values: (13,)
# columns description:
print("Price description:","\n",autos["price"].describe())
print("\n")
print("Odometer_km description:","\n", autos["odometer_km"].describe())
Price description: count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 Odometer_km description: count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
Notes:
as we see above, statistics for price column described in E-notion format. So we will look for min,max,mean with the help of methods
# min, max, mean for price column:
print("Price min: ",autos["price"].min())
print("Price max: ",autos["price"].max())
print("Price mean: ", autos["price"].mean())
Price min: 0 Price max: 99999999 Price mean: 9840.04376
#value counts for price column:
autos["price"].value_counts().sort_index(ascending= False).head(15)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 Name: price, dtype: int64
autos["price"].value_counts().sort_index(ascending= True).head(10)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 Name: price, dtype: int64
#value counts for odometer_km:
autos["odometer_km"].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
# removing entries with no value in price column:
autos = autos[autos["price"].between(10,350000)]
autos["price"].describe()
count 48401.000000 mean 5908.885457 std 9068.695231 min 10.000000 25% 1250.000000 50% 3000.000000 75% 7498.000000 max 350000.000000 Name: price, dtype: float64
Notes:
Column "Odometer_km" have been left as it is, but in column "Price" entries from 0 to 10 and from 350 000 to 99999999 removed, because it is not common for car to have no cost or too high price.
There are 5 columns with date information in the dataset: 'data_crewled' ,'last_seen', 'ad_created' are strings and 'registration_month', 'registration_year' are integers.
For numeric data we will use series.describe() method to understand the distribution, but firstly we need to convert strings into numeric.
# view on our string dates:
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
# value counts for date_crawled:
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.025371 2016-03-06 0.014070 2016-03-07 0.036053 2016-03-08 0.033264 2016-03-09 0.033037 2016-03-10 0.032210 2016-03-11 0.032603 2016-03-12 0.036962 2016-03-13 0.015661 2016-03-14 0.036631 2016-03-15 0.034276 2016-03-16 0.029504 2016-03-17 0.031611 2016-03-18 0.012913 2016-03-19 0.034751 2016-03-20 0.037809 2016-03-21 0.037293 2016-03-22 0.032933 2016-03-23 0.032272 2016-03-24 0.029400 2016-03-25 0.031570 2016-03-26 0.032251 2016-03-27 0.031115 2016-03-28 0.034855 2016-03-29 0.034111 2016-03-30 0.033739 2016-03-31 0.031818 2016-04-01 0.033739 2016-04-02 0.035495 2016-04-03 0.038574 2016-04-04 0.036487 2016-04-05 0.013078 2016-04-06 0.003161 2016-04-07 0.001384 Name: date_crawled, dtype: float64
#value counts for ad_created:
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
2016-04-07 0.001240 2016-04-06 0.003244 2016-04-05 0.011797 2016-04-04 0.036859 2016-04-03 0.038822 ... 2015-12-05 0.000021 2015-11-10 0.000021 2015-09-09 0.000021 2015-08-10 0.000021 2015-06-11 0.000021 Name: ad_created, Length: 76, dtype: float64
#value counts for last_seen:
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
2016-04-07 0.132043 2016-04-06 0.221607 2016-04-05 0.124956 2016-04-04 0.024504 2016-04-03 0.025185 2016-04-02 0.024855 2016-04-01 0.022871 2016-03-31 0.023842 2016-03-30 0.024752 2016-03-29 0.022334 2016-03-28 0.020867 2016-03-27 0.015599 2016-03-26 0.016818 2016-03-25 0.019194 2016-03-24 0.019752 2016-03-23 0.018595 2016-03-22 0.021384 2016-03-21 0.020599 2016-03-20 0.020640 2016-03-19 0.015826 2016-03-18 0.007335 2016-03-17 0.028057 2016-03-16 0.016446 2016-03-15 0.015867 2016-03-14 0.012624 2016-03-13 0.008863 2016-03-12 0.023781 2016-03-11 0.012376 2016-03-10 0.010620 2016-03-09 0.009628 2016-03-08 0.007355 2016-03-07 0.005413 2016-03-06 0.004339 2016-03-05 0.001074 Name: last_seen, dtype: float64
Notes:
This informmation represents that active period is for about a month. Ad_created includes information of later periods but with low percentage, because some sellers are still in process of distribution their cars for half a year.
# registration_year description:
autos['registration_year'].describe()
count 48401.000000 mean 2004.774715 std 88.790201 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Notes:
Very strange data for min = 1000 and max =9999, that are unrealistic.
The year 1886 is regarded as the birth year of the modern car when German inventor Karl Benz patented his Benz Patent-Motorwagen. Cars became widely available in the early 20th century. One of the first cars accessible to the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company (Wikipedia)
So the min year - 1886
Max year - 2016 (the year of data gathered)
# selecting range between 1886 and 2016:
autos = autos[autos["registration_year"].between(1886,2016)]
# new description of registration_year column:
autos["registration_year"].describe()
count 46524.000000 mean 2002.926468 std 7.155528 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
Notes:
Now we see , that the oldest car is of 1910 year.
# value counts for registration_year:
autos['registration_year'].value_counts(normalize = True)
2000 0.067234 2005 0.062914 1999 0.062032 2004 0.058013 2003 0.057949 ... 1948 0.000021 1931 0.000021 1950 0.000021 1927 0.000021 1952 0.000021 Name: registration_year, Length: 78, dtype: float64
** Notes:** From statistics above we can came to counclusion that the most ads on the site with cars in range 10-15 years old (2000 - 2005 +- 1 year)
#unique brands:
autos['brand'].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
# value counts:
autos['brand'].value_counts(ascending = False)
volkswagen 9833 bmw 5116 opel 4994 mercedes_benz 4488 audi 4030 ford 3256 renault 2193 peugeot 1390 fiat 1195 seat 849 skoda 764 nissan 712 mazda 709 smart 661 citroen 653 toyota 593 hyundai 467 sonstige_autos 448 volvo 425 mini 408 mitsubishi 381 honda 365 kia 330 alfa_romeo 309 porsche 283 suzuki 276 chevrolet 265 chrysler 164 dacia 123 daihatsu 117 jeep 106 subaru 99 land_rover 98 saab 77 jaguar 73 daewoo 70 trabant 65 rover 62 lancia 50 lada 27 Name: brand, dtype: int64
** Notes:** Our next step will be analysis of prices of japan brands in Germany. Let's group the most 5 popular brands : "toyota", "nissan", "mazda", "mitsubishi", "honda" and see the mean prices for it.
# select 5 japan brands:
japan_group = autos[(autos['brand'] == "toyota") | (autos['brand'] == "mazda")| (autos['brand'] == 'nissan')|(autos['brand'] == 'mitsubishi')|(autos['brand']== 'honda')]
# check our new arrey
japan_group.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | 4150 | control | suv | 2004 | manuell | 124 | andere | 150000 | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
45 | 2016-03-12 17:39:55 | Toyota_Starlet_1.3 | privat | Angebot | 1000 | test | kleinwagen | 1995 | manuell | 75 | andere | 150000 | 3 | benzin | toyota | nein | 2016-03-12 00:00:00 | 0 | 74597 | 2016-04-07 08:45:19 |
61 | 2016-04-04 22:58:31 | Nissan_Primera | privat | Angebot | 995 | test | limousine | 1994 | manuell | 101 | primera | 150000 | 4 | benzin | nissan | nein | 2016-04-04 00:00:00 | 0 | 19057 | 2016-04-07 02:17:18 |
81 | 2016-03-12 19:46:34 | Nissan_Micra_K12___super_Kleinwagen! | privat | Angebot | 2000 | test | kleinwagen | 2007 | manuell | 65 | micra | 150000 | 9 | benzin | nissan | ja | 2016-03-12 00:00:00 | 0 | 52156 | 2016-03-26 05:15:53 |
85 | 2016-03-11 12:36:21 | Toyota_yaris_cool | privat | Angebot | 5000 | test | kleinwagen | 2009 | manuell | 0 | yaris | 60000 | 12 | NaN | toyota | NaN | 2016-03-11 00:00:00 | 0 | 30916 | 2016-04-07 08:17:01 |
# list of unique names:
japan_u = japan_group['brand'].unique()
print(japan_u)
['mazda' 'toyota' 'nissan' 'mitsubishi' 'honda']
# empty dictionary for final data:
japan_b = {}
#iterating our list of brands and calculating mean of the price:
for c in japan_u:
rows = japan_group[japan_group['brand'] == c ]
mean = rows["price"].mean()
japan_b[c] = mean
#iterating dictionary:
for k , v in japan_b.items():
print( k,": ", v)
mazda : 4112.596614950635 toyota : 5167.091062394604 nissan : 4750.063202247191 mitsubishi : 3421.293963254593 honda : 4119.109589041096
Notes: As a result, the most expensive is "Toyota" and "Nissan". "Mazda" and "Honda" in the middle.
# storing a new dictionary with mileage values:
japan_m ={}
for c in japan_u:
rows = japan_group[japan_group['brand'] == c ]
mileage = rows["odometer_km"].mean()
japan_m[c] = mileage
# converting dictionaries into series:
price_s = pd.Series(japan_b).astype(int)
mileage_s = pd.Series(japan_m).astype(int)
# converting series into dataframe:
jb = pd.DataFrame(price_s, columns = ['mean_price'])
jb
mean_price | |
---|---|
mazda | 4112 |
toyota | 5167 |
nissan | 4750 |
mitsubishi | 3421 |
honda | 4119 |
# adding a new column with mileage values:
jb['mean_mileage'] = mileage_s
jb
mean_price | mean_mileage | |
---|---|---|
mazda | 4112 | 124464 |
toyota | 5167 | 115944 |
nissan | 4750 | 118370 |
mitsubishi | 3421 | 126837 |
honda | 4119 | 122493 |
** Conclusion:**
Avarage prices for japan cars in Gernamy are in the range between $4000 -$5000. The price depends from brand. If the buyer want Toyota, he should be ready to pay about $5000 , if the budget less, he can look for such brand as Mitsubishi.
Average mileage is in the range 115000 -125000 kilometers and do not have much impact on the price.