In this project we will be cleaning the data and analyze the listings. Looking at the data dictionary, I can see that the column names are upper and lower case, this will have to be fixed. First, importingg the libraries and reading the file.
import pandas as pd
import csv
The UTF-8 encoding did not work hence tried Latin-1 that read the file in.
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
Just viewing 3 rows since it was too big to get the column headings and data in the screen.
I see that Pandas has provided its own indexing as did not specify it.
autos.head(3)
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
Checking out the columns and data types.
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Most of the columns are objects or integers.
autos.shape
(50000, 20)
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Most of the values are available in columns. Looks like the worst column notRepairedDamage is ~80% non null.
Changing the column headings to lowercase. Note since I only see the upper and lowercase in the column labels and no other significant issue. Not writing a fucntion but just a for loop, to clean it.
Tried various ideas but can not convert the column labels to lower case which is strange!
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Going as per the instructions of the project.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos = autos.copy()
autos.columns = ['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',
'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',
'odometer', 'registration_month', 'fueltype', 'brand',
'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',
'lastseen']
autos.columns
Index(['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest', 'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer', 'registration_month', 'fueltype', 'brand', 'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode', 'lastseen'], dtype='object')
So apparently due to snakecase vs camelcase the replace method does not work. Hence done manually, this is not a problem since it is not goign to be done again and again.
autos.head(3)
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
autos["registration_year"] = autos["registration_year"].astype(int)
autos["powerps"] = autos["powerps"].astype(int)
autos["registration_month"] = autos["registration_month"].astype(int)
autos["nrofpictures"] = autos["nrofpictures"].astype(int)
autos["postalcode"] = autos["postalcode"].astype(int)
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datecrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offertype 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicletype 44905 non-null object 7 registration_year 50000 non-null int64 8 gearbox 47320 non-null object 9 powerps 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 registration_month 50000 non-null int64 13 fueltype 45518 non-null object 14 brand 50000 non-null object 15 unrepaired_damage 40171 non-null object 16 ad_created 50000 non-null object 17 nrofpictures 50000 non-null int64 18 postalcode 50000 non-null int64 19 lastseen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Below, making the price and odometer readings float. First stripping the km or $ and commas and then converting to float.
autos["odometer"] = autos["odometer"].str.replace(",","")
autos["odometer"] = autos["odometer"].str.replace("km","")
autos["odometer"] = autos["odometer"].astype(float)
autos["price"] = autos["price"].str.replace(",","")
autos["price"] = autos["price"].str.replace("$","")
autos["price"] = autos["price"].astype(float)
autos.head(3)
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
Now follwinf as per project guidelines - screen 3.
autos.describe(include = "all")
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 5.000000e+04 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000.000000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | NaN | 25756 | 12859 | NaN | 36993 | NaN | 4024 | NaN | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | 9.840044e+03 | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | 125732.700000 | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | 4.811044e+05 | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | 40042.211706 | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 1.100000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | 1.000000e+08 | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Upon analyzing the above results, it seems like seller and offertype have pretty much all similar values, hence both columns can be dropped. Similarly, nrofpictures is also mostly 0, hence can be dropped.
Further investigation should be conducted on registration_month since there are a few 0s there and registration_year has a max of 9999.
autos["registration_year"].value_counts()
2000 3354 2005 3015 1999 3000 2004 2737 2003 2727 ... 1931 1 1929 1 1001 1 9996 1 1952 1 Name: registration_year, Length: 97, dtype: int64
autos["registration_year"].unique().shape
(97,)
The years do have some incorrect values such as 1001, 9996. I think if the year is not between 1901 and 2010, it should be considered as "Not Known".
autos["registration_month"].value_counts()
0 5075 3 5071 6 4368 5 4107 4 4102 7 3949 10 3651 12 3447 9 3389 11 3360 1 3282 8 3191 2 3008 Name: registration_month, dtype: int64
autos["registration_year"].describe()
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The month seems very interesting, looks like there are 5075 values of 0 which essentially means that these months are not known.
autos["registration_month"].describe()
count 50000.000000 mean 5.723360 std 3.711984 min 0.000000 25% 3.000000 50% 6.000000 75% 9.000000 max 12.000000 Name: registration_month, dtype: float64
Identifying very high and low prices and odometer readings.
print(autos["price"].max())
print(autos["price"].min())
print(autos["price"].mean())
autos["price"].describe()
99999999.0 0.0 9840.04376
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
When I look above at the prices, there are obviously issues since there are outliers. Since these are cars selling on eBay and the 25th to 75th percentile lies between 1100 and 7200 and the mean is 9840, this indicates tha there are outliers which are skewing the mean. Median should also be checked.
print(autos["price"].median())
2950.0
The median above is 2950, which seems reasonable. Hence, let us drop rows where cars are priced less than 500 and more than 15000.
print(autos.shape)
(50000, 20)
autos_test = autos[autos["price"].between(500,15000)]
print(autos_test.shape)
(40771, 20)
Hmmm ... close to 10k cars were dropped, ot good let us increase the band to [250,25000]
autos_test_2 = autos[autos["price"].between(250,25000)]
print(autos_test_2.shape)
(45911, 20)
so we have 5k more in the sample. Let me also check how many have 0 value.
autos["price"].value_counts()
0.0 1421 500.0 781 1500.0 734 2500.0 643 1200.0 639 ... 6202.0 1 18310.0 1 898.0 1 11240.0 1 789.0 1 Name: price, Length: 2357, dtype: int64
So 1421 cars have 0 price, still we have 3.5k cars not accounted for this is a lot. We need to increase the price range for filtering. Some rich folks are selling on eBay!
autos_test_3 = autos[autos["price"].between(250,50000)]
print(autos_test_3.shape)
(47102, 20)
Looks like after 50k the gain decreases significantly, this is a question to p
autos = autos[autos["price"].between(250,50000)]
print(autos.shape)
(47102, 20)
Odometer readings look good, since they are used cars.
print(autos["odometer"].max())
print(autos["odometer"].min())
150000.0 5000.0
Working with dates and times - Screen 5 onwards for instructions.
Calculating the distirbution of dates in - date_crawled, ad_created, and last_seen columns.
autos
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.0 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000.0 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | 24900.0 | control | limousine | 2011 | automatik | 239 | q5 | 100000.0 | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | 1980.0 | control | cabrio | 1996 | manuell | 75 | astra | 150000.0 | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | 13200.0 | test | cabrio | 2014 | automatik | 69 | 500 | 5000.0 | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | 22900.0 | control | kombi | 2013 | manuell | 150 | a3 | 40000.0 | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | 1250.0 | control | limousine | 1996 | manuell | 101 | vectra | 150000.0 | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
47102 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 47102 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datecrawled 47102 non-null object 1 name 47102 non-null object 2 seller 47102 non-null object 3 offertype 47102 non-null object 4 price 47102 non-null float64 5 abtest 47102 non-null object 6 vehicletype 42947 non-null object 7 registration_year 47102 non-null int64 8 gearbox 45054 non-null object 9 powerps 47102 non-null int64 10 model 44823 non-null object 11 odometer 47102 non-null float64 12 registration_month 47102 non-null int64 13 fueltype 43470 non-null object 14 brand 47102 non-null object 15 unrepaired_damage 38657 non-null object 16 ad_created 47102 non-null object 17 nrofpictures 47102 non-null int64 18 postalcode 47102 non-null int64 19 lastseen 47102 non-null object dtypes: float64(2), int64(5), object(13) memory usage: 7.5+ MB
Below, develop frequencies for datecrawled, ad_created and Lastseen. Includes missing values put dropna and for percentages normalize.
autos["datecrawled"].str[:10].value_counts(normalize = True,dropna = False)
2016-04-03 0.038788 2016-03-20 0.038045 2016-03-21 0.037387 2016-03-12 0.037047 2016-04-04 0.036559 2016-03-14 0.036516 2016-03-07 0.036177 2016-04-02 0.035625 2016-03-28 0.035073 2016-03-19 0.034627 2016-03-15 0.034181 2016-03-30 0.033884 2016-04-01 0.033820 2016-03-29 0.033799 2016-03-08 0.033098 2016-03-09 0.032971 2016-03-22 0.032737 2016-03-11 0.032737 2016-03-10 0.032568 2016-03-26 0.032313 2016-03-23 0.032207 2016-03-31 0.031825 2016-03-17 0.031485 2016-03-25 0.031273 2016-03-27 0.031039 2016-03-16 0.029447 2016-03-24 0.029256 2016-03-05 0.025328 2016-03-13 0.015668 2016-03-06 0.014118 2016-04-05 0.013036 2016-03-18 0.012823 2016-04-06 0.003185 2016-04-07 0.001359 Name: datecrawled, dtype: float64
The date crawled is generally evenly distributed as you woudl expect from the computer.
autos["ad_created"].str[:10].value_counts(normalize = True,dropna = False)
2016-04-03 0.039022 2016-03-20 0.038151 2016-03-21 0.037599 2016-04-04 0.036941 2016-03-12 0.036877 ... 2016-01-22 0.000021 2015-09-09 0.000021 2016-02-01 0.000021 2016-02-17 0.000021 2016-01-29 0.000021 Name: ad_created, Length: 76, dtype: float64
autos["lastseen"].str[:10].value_counts(normalize = True,dropna = False)
2016-04-06 0.222538 2016-04-07 0.132436 2016-04-05 0.125303 2016-03-17 0.028130 2016-04-03 0.025179 2016-04-02 0.024861 2016-04-04 0.024627 2016-03-30 0.024606 2016-03-12 0.023948 2016-03-31 0.023842 2016-04-01 0.022971 2016-03-29 0.022016 2016-03-22 0.021485 2016-03-28 0.020848 2016-03-20 0.020594 2016-03-21 0.020530 2016-03-24 0.019617 2016-03-25 0.018959 2016-03-23 0.018471 2016-03-26 0.016645 2016-03-16 0.016305 2016-03-15 0.015796 2016-03-19 0.015604 2016-03-27 0.015541 2016-03-14 0.012505 2016-03-11 0.012377 2016-03-10 0.010530 2016-03-09 0.009596 2016-03-13 0.008959 2016-03-18 0.007240 2016-03-08 0.007197 2016-03-07 0.005350 2016-03-06 0.004331 2016-03-05 0.001062 Name: lastseen, dtype: float64
The projects is asking us to clean reg_year now. As I was trying to do earlier.
autos["registration_year"].describe()
count 47102.000000 mean 2004.842661 std 88.923813 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos_test_4 = autos[autos["registration_year"].between(1900,2016)]
print(autos_test_4.shape)
(45254, 20)
As showsn above the 1900 to 2016 range covers, 45.254k of the ~47k entries, which seems good. As I had mentioned earlier this seems reasonable since cars which are sold second hand at eBay would be too old otherwise (infact any car after 1990s shoudl be old), and as explained in the project max year can be 2016.
Thus, using the above cut offs and removing the rows outside this range.
autos = autos[autos["registration_year"].between(1900,2016)]
print(autos.shape)
(45254, 20)
Below I calculated the average price for each brand. But after that I will calculate average for select few brands.
unique_brands = autos["brand"].unique()
avg_brand_price = {}
for brand in unique_brands:
avg_price = autos.loc[autos["brand"] == brand,"price"].mean()
avg_brand_price[brand] = avg_price
print(avg_brand_price)
{'peugeot': 3167.9941133186167, 'bmw': 8181.520323599053, 'volkswagen': 5532.8090625, 'smart': 3596.40273556231, 'ford': 3815.6854734111544, 'chrysler': 3506.864197530864, 'seat': 4531.4691656590085, 'renault': 2550.726230291448, 'mercedes_benz': 8257.55262564796, 'audi': 9105.847602309817, 'sonstige_autos': 9923.65083135392, 'opel': 3126.904552129222, 'mazda': 4227.840348330915, 'porsche': 23866.584615384614, 'mini': 10691.06157635468, 'toyota': 5175.555743243243, 'dacia': 5915.528455284553, 'nissan': 4889.322720694646, 'jeep': 11230.028571428571, 'saab': 3292.6266666666666, 'volvo': 5016.275534441806, 'mitsubishi': 3509.730458221024, 'jaguar': 10840.764705882353, 'fiat': 2962.0123456790125, 'skoda': 6459.275496688741, 'subaru': 4197.95744680851, 'kia': 6036.388379204893, 'citroen': 3847.1105919003116, 'hyundai': 5445.095444685467, 'chevrolet': 6594.827586206897, 'honda': 4163.08864265928, 'daewoo': 1147.936507936508, 'suzuki': 4276.698113207547, 'trabant': 1990.551724137931, 'land_rover': 15481.445652173914, 'alfa_romeo': 4010.0496688741723, 'lada': 2688.296296296296, 'rover': 1626.9180327868853, 'daihatsu': 1758.0092592592594, 'lancia': 3512.5833333333335}
Calculating average price for the 5 most popular brands.
brands = autos["brand"].value_counts()
brands.shape
brands.index[4]
'audi'
top_7_brands = brands.index[:7]
print(top_7_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault'], dtype='object')
avg_brand_price_7 = {}
for brand in top_7_brands:
avg_price = autos.loc[autos["brand"] == brand,"price"].mean()
avg_brand_price_7[brand] = avg_price
print(avg_brand_price_7)
{'volkswagen': 5532.8090625, 'bmw': 8181.520323599053, 'opel': 3126.904552129222, 'mercedes_benz': 8257.55262564796, 'audi': 9105.847602309817, 'ford': 3815.6854734111544, 'renault': 2550.726230291448}
The above output is not sorted hence tough to understand which is the most and least avg expensive car by brand. Below, sorted descending gives a better picture. Audi is the most expensive.
sorted_top_7 = sorted(avg_brand_price_7.items(), key=lambda x:x[1], reverse = True)
print(sorted_top_7)
[('audi', 9105.847602309817), ('mercedes_benz', 8257.55262564796), ('bmw', 8181.520323599053), ('volkswagen', 5532.8090625), ('ford', 3815.6854734111544), ('opel', 3126.904552129222), ('renault', 2550.726230291448)]
Make a series out of the avg_brand_price dictionary (we created this earlier). Note the S in Series is capital!
bmp_series = pd.Series(avg_brand_price)
print(bmp_series)
peugeot 3167.994113 bmw 8181.520324 volkswagen 5532.809063 smart 3596.402736 ford 3815.685473 chrysler 3506.864198 seat 4531.469166 renault 2550.726230 mercedes_benz 8257.552626 audi 9105.847602 sonstige_autos 9923.650831 opel 3126.904552 mazda 4227.840348 porsche 23866.584615 mini 10691.061576 toyota 5175.555743 dacia 5915.528455 nissan 4889.322721 jeep 11230.028571 saab 3292.626667 volvo 5016.275534 mitsubishi 3509.730458 jaguar 10840.764706 fiat 2962.012346 skoda 6459.275497 subaru 4197.957447 kia 6036.388379 citroen 3847.110592 hyundai 5445.095445 chevrolet 6594.827586 honda 4163.088643 daewoo 1147.936508 suzuki 4276.698113 trabant 1990.551724 land_rover 15481.445652 alfa_romeo 4010.049669 lada 2688.296296 rover 1626.918033 daihatsu 1758.009259 lancia 3512.583333 dtype: float64
Since we did not create a dictionary for avg_odometer by brand doing it below.
avg_odometer = {}
for brand in unique_brands:
avg_odo = autos.loc[autos["brand"] == brand,"odometer"].mean()
avg_odometer[brand] = avg_odo
#print(avg_odometer)
Making a series out of the avg_odometer dictionary.
odometer_series = pd.Series(avg_odometer)
print(odometer_series)
peugeot 126725.533481 bmw 133011.049724 volkswagen 128710.937500 smart 99734.042553 ford 124134.241245 chrysler 133055.555556 seat 121608.222491 renault 127723.363593 mercedes_benz 131645.255803 audi 129789.103691 sonstige_autos 92482.185273 opel 129040.276904 mazda 124121.915820 porsche 112538.461538 mini 88682.266010 toyota 116106.418919 dacia 84268.292683 nissan 118002.894356 jeep 128047.619048 saab 144266.666667 volvo 138527.315914 mitsubishi 126738.544474 jaguar 128308.823529 fiat 116772.486772 skoda 111039.735099 subaru 126170.212766 kia 112186.544343 citroen 119657.320872 hyundai 106605.206074 chevrolet 100900.383142 honda 122659.279778 daewoo 119444.444444 suzuki 107283.018868 trabant 55431.034483 land_rover 122282.608696 alfa_romeo 132284.768212 lada 83518.518519 rover 137459.016393 daihatsu 117222.222222 lancia 121770.833333 dtype: float64
Making a dataframe out of any one series - I made it form prices.
df = pd.DataFrame(bmp_series,columns = ['mean_price'])
df
mean_price | |
---|---|
peugeot | 3167.994113 |
bmw | 8181.520324 |
volkswagen | 5532.809063 |
smart | 3596.402736 |
ford | 3815.685473 |
chrysler | 3506.864198 |
seat | 4531.469166 |
renault | 2550.726230 |
mercedes_benz | 8257.552626 |
audi | 9105.847602 |
sonstige_autos | 9923.650831 |
opel | 3126.904552 |
mazda | 4227.840348 |
porsche | 23866.584615 |
mini | 10691.061576 |
toyota | 5175.555743 |
dacia | 5915.528455 |
nissan | 4889.322721 |
jeep | 11230.028571 |
saab | 3292.626667 |
volvo | 5016.275534 |
mitsubishi | 3509.730458 |
jaguar | 10840.764706 |
fiat | 2962.012346 |
skoda | 6459.275497 |
subaru | 4197.957447 |
kia | 6036.388379 |
citroen | 3847.110592 |
hyundai | 5445.095445 |
chevrolet | 6594.827586 |
honda | 4163.088643 |
daewoo | 1147.936508 |
suzuki | 4276.698113 |
trabant | 1990.551724 |
land_rover | 15481.445652 |
alfa_romeo | 4010.049669 |
lada | 2688.296296 |
rover | 1626.918033 |
daihatsu | 1758.009259 |
lancia | 3512.583333 |
Added the odometer series to the df dataframe above.
df['mean_odometer'] = odometer_series
Print and check whether both the columns are populated.
df
mean_price | mean_odometer | |
---|---|---|
peugeot | 3167.994113 | 126725.533481 |
bmw | 8181.520324 | 133011.049724 |
volkswagen | 5532.809063 | 128710.937500 |
smart | 3596.402736 | 99734.042553 |
ford | 3815.685473 | 124134.241245 |
chrysler | 3506.864198 | 133055.555556 |
seat | 4531.469166 | 121608.222491 |
renault | 2550.726230 | 127723.363593 |
mercedes_benz | 8257.552626 | 131645.255803 |
audi | 9105.847602 | 129789.103691 |
sonstige_autos | 9923.650831 | 92482.185273 |
opel | 3126.904552 | 129040.276904 |
mazda | 4227.840348 | 124121.915820 |
porsche | 23866.584615 | 112538.461538 |
mini | 10691.061576 | 88682.266010 |
toyota | 5175.555743 | 116106.418919 |
dacia | 5915.528455 | 84268.292683 |
nissan | 4889.322721 | 118002.894356 |
jeep | 11230.028571 | 128047.619048 |
saab | 3292.626667 | 144266.666667 |
volvo | 5016.275534 | 138527.315914 |
mitsubishi | 3509.730458 | 126738.544474 |
jaguar | 10840.764706 | 128308.823529 |
fiat | 2962.012346 | 116772.486772 |
skoda | 6459.275497 | 111039.735099 |
subaru | 4197.957447 | 126170.212766 |
kia | 6036.388379 | 112186.544343 |
citroen | 3847.110592 | 119657.320872 |
hyundai | 5445.095445 | 106605.206074 |
chevrolet | 6594.827586 | 100900.383142 |
honda | 4163.088643 | 122659.279778 |
daewoo | 1147.936508 | 119444.444444 |
suzuki | 4276.698113 | 107283.018868 |
trabant | 1990.551724 | 55431.034483 |
land_rover | 15481.445652 | 122282.608696 |
alfa_romeo | 4010.049669 | 132284.768212 |
lada | 2688.296296 | 83518.518519 |
rover | 1626.918033 | 137459.016393 |
daihatsu | 1758.009259 | 117222.222222 |
lancia | 3512.583333 | 121770.833333 |
Sorting by mean_price to see how it compares with odometer.
df.sort_values("mean_price",axis = 0, ascending = False,inplace = True)
df
mean_price | mean_odometer | |
---|---|---|
porsche | 23866.584615 | 112538.461538 |
land_rover | 15481.445652 | 122282.608696 |
jeep | 11230.028571 | 128047.619048 |
jaguar | 10840.764706 | 128308.823529 |
mini | 10691.061576 | 88682.266010 |
sonstige_autos | 9923.650831 | 92482.185273 |
audi | 9105.847602 | 129789.103691 |
mercedes_benz | 8257.552626 | 131645.255803 |
bmw | 8181.520324 | 133011.049724 |
chevrolet | 6594.827586 | 100900.383142 |
skoda | 6459.275497 | 111039.735099 |
kia | 6036.388379 | 112186.544343 |
dacia | 5915.528455 | 84268.292683 |
volkswagen | 5532.809063 | 128710.937500 |
hyundai | 5445.095445 | 106605.206074 |
toyota | 5175.555743 | 116106.418919 |
volvo | 5016.275534 | 138527.315914 |
nissan | 4889.322721 | 118002.894356 |
seat | 4531.469166 | 121608.222491 |
suzuki | 4276.698113 | 107283.018868 |
mazda | 4227.840348 | 124121.915820 |
subaru | 4197.957447 | 126170.212766 |
honda | 4163.088643 | 122659.279778 |
alfa_romeo | 4010.049669 | 132284.768212 |
citroen | 3847.110592 | 119657.320872 |
ford | 3815.685473 | 124134.241245 |
smart | 3596.402736 | 99734.042553 |
lancia | 3512.583333 | 121770.833333 |
mitsubishi | 3509.730458 | 126738.544474 |
chrysler | 3506.864198 | 133055.555556 |
saab | 3292.626667 | 144266.666667 |
peugeot | 3167.994113 | 126725.533481 |
opel | 3126.904552 | 129040.276904 |
fiat | 2962.012346 | 116772.486772 |
lada | 2688.296296 | 83518.518519 |
renault | 2550.726230 | 127723.363593 |
trabant | 1990.551724 | 55431.034483 |
daihatsu | 1758.009259 | 117222.222222 |
rover | 1626.918033 | 137459.016393 |
daewoo | 1147.936508 | 119444.444444 |
Sorted by mean_odometer.
df.sort_values("mean_odometer",axis = 0, ascending = False,inplace = True)
df
mean_price | mean_odometer | |
---|---|---|
saab | 3292.626667 | 144266.666667 |
volvo | 5016.275534 | 138527.315914 |
rover | 1626.918033 | 137459.016393 |
chrysler | 3506.864198 | 133055.555556 |
bmw | 8181.520324 | 133011.049724 |
alfa_romeo | 4010.049669 | 132284.768212 |
mercedes_benz | 8257.552626 | 131645.255803 |
audi | 9105.847602 | 129789.103691 |
opel | 3126.904552 | 129040.276904 |
volkswagen | 5532.809063 | 128710.937500 |
jaguar | 10840.764706 | 128308.823529 |
jeep | 11230.028571 | 128047.619048 |
renault | 2550.726230 | 127723.363593 |
mitsubishi | 3509.730458 | 126738.544474 |
peugeot | 3167.994113 | 126725.533481 |
subaru | 4197.957447 | 126170.212766 |
ford | 3815.685473 | 124134.241245 |
mazda | 4227.840348 | 124121.915820 |
honda | 4163.088643 | 122659.279778 |
land_rover | 15481.445652 | 122282.608696 |
lancia | 3512.583333 | 121770.833333 |
seat | 4531.469166 | 121608.222491 |
citroen | 3847.110592 | 119657.320872 |
daewoo | 1147.936508 | 119444.444444 |
nissan | 4889.322721 | 118002.894356 |
daihatsu | 1758.009259 | 117222.222222 |
fiat | 2962.012346 | 116772.486772 |
toyota | 5175.555743 | 116106.418919 |
porsche | 23866.584615 | 112538.461538 |
kia | 6036.388379 | 112186.544343 |
skoda | 6459.275497 | 111039.735099 |
suzuki | 4276.698113 | 107283.018868 |
hyundai | 5445.095445 | 106605.206074 |
chevrolet | 6594.827586 | 100900.383142 |
smart | 3596.402736 | 99734.042553 |
sonstige_autos | 9923.650831 | 92482.185273 |
mini | 10691.061576 | 88682.266010 |
dacia | 5915.528455 | 84268.292683 |
lada | 2688.296296 | 83518.518519 |
trabant | 1990.551724 | 55431.034483 |
%matplotlib inline
ax1 = df.plot.scatter(x ="mean_price", y ="mean_odometer")
print(ax1)
AxesSubplot(0.125,0.125;0.775x0.755)
Each dot above represents a brand and its corresponding average price and average odometer reading. It seems like as the average miles increases across brands the avergae price of the car decreases. Of course there are exceptions such as the point at ~120k odometer and 25k price.