In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
import pandas as pd
autos = pd.read_csv('autos.csv', encoding ='Latin-1')
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
We see that the column names are all in camelcase, which capitalizes every word after the first one. We would like to replace these column names with the snakecase version of each name, which places an underscore between each word.
autos = autos.rename(columns={'yearOfRegistration':'registration_year'}).rename(columns={'monthOfRegistration':'registration_month'}).rename(columns={'notRepairedDamage':'unrepaired_damage'}).rename(columns={'dateCreated':'ad_created'})
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'registration_year', 'gearbox', 'powerPS', 'model', 'odometer', 'registration_month', 'fuelType', 'brand', 'unrepaired_damage', 'ad_created', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos = autos.rename(columns={'dateCrawled':'date_crawled'}).rename(columns = {'offerType':'offer_type'}).rename(columns = {'vehicleType':'vehicle_type'}).rename(columns = {'powerPS':'power_ps'}).rename(columns={'fuelType':'fuel_type'}).rename(columns={'nrOfPictures':'nr_of_pictures'}).rename(columns={'postalCode':'postal_code'}).rename(columns={'lastSeen':'last_seen'})
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The column names have now all been changed to reflect the style of snakecase.
autos.describe()
registration_year | power_ps | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 2005.073280 | 116.355920 | 5.723360 | 0.0 | 50813.627300 |
std | 105.712813 | 209.216627 | 3.711984 | 0.0 | 25779.747957 |
min | 1000.000000 | 0.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1999.000000 | 70.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2003.000000 | 105.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 2008.000000 | 150.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 9999.000000 | 17700.000000 | 12.000000 | 0.0 | 99998.000000 |
As we can see, the column for "nr_of_pictures" doesn't give any significant amount of data, as all values are equivalent, so we will need to explore this column to see if we can drop it from our analysis.
We also see that the registration year doesn't match what would be acceptable in terms of the minimum value of '1000' and the maximum value of '9999'. We will have to look into why these values come up in the column and adjust any info that is invalid.
Lastly, a few columns are listed as string values when they could be converted to integers. Namely, the prices and the odometer readings can be converted to numeric values and added to our analysis.
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)
autos['odometer'].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: int64
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)
autos['price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein','0').str.replace('ja','1').str.replace('nan','0').astype(bool)
autos['unrepaired_damage'].head()
0 True 1 True 2 True 3 True 4 True Name: unrepaired_damage, dtype: bool
autos.describe()
price | registration_year | power_ps | odometer | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 9.840044e+03 | 2005.073280 | 116.355920 | 125732.700000 | 5.723360 | 0.0 | 50813.627300 |
std | 4.811044e+05 | 105.712813 | 209.216627 | 40042.211706 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1.100000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 1.000000e+08 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
autos['registration_year'].unique()
array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010, 1999, 1982, 1990, 2015, 2014, 1996, 1992, 2005, 2002, 2012, 2011, 2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988, 1989, 1967, 1973, 1956, 1976, 4500, 1987, 1991, 1983, 1960, 1969, 1950, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971, 1966, 1979, 1981, 1970, 1974, 1910, 1975, 5000, 4100, 2019, 1959, 9996, 9999, 6200, 1964, 1958, 1800, 1948, 1931, 1943, 9000, 1941, 1962, 1927, 1937, 1929, 1000, 1957, 1952, 1111, 1955, 1939, 8888, 1954, 1938, 2800, 5911, 1500, 1953, 1951, 4800, 1001])
autos.rename(columns={'odometer':'odometer_km'},inplace=True)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | True | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | True | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | True | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | True | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | True | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos['seller'].unique()
array(['privat', 'gewerblich'], dtype=object)
autos['offer_type'].unique()
array(['Angebot', 'Gesuch'], dtype=object)
autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
autos['price'].unique().shape
(2357,)
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().head()
0 1421 500 781 1500 734 2500 643 1000 639 Name: price, dtype: int64
autos['price'].value_counts().sort_index().head()
0 1421 1 156 2 3 3 1 5 2 Name: price, dtype: int64
Now, we will look to remove any outliers in our data that may affect our analysis. We will start by removing any cars who have a listed price of 0, as this is not likely to be accurate. Similarly, we will eliminate any values that seem to be too high.
Then, we will look into the 'odometer_km' column for any values that are either too high or too low, then remove the rows that correspond with these outliers.
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos = autos[autos['price'].between(10,500000)]
autos['price'].sort_values()
27658 10 35009 10 17079 10 33893 10 44597 10 29246 10 42186 10 35555 11 9255 11 15747 12 18443 12 3416 12 5132 13 4598 13 19632 14 40856 15 45157 15 30587 17 38552 17 27813 17 8570 18 37141 20 26772 20 46835 20 33200 20 18592 25 33838 25 43572 25 36026 25 14322 25 ... 43282 119900 44406 120000 2751 120000 8232 128000 1878 129000 49815 130000 14268 135000 2454 137999 32185 139997 49668 145000 33638 151990 10500 155000 45387 163500 22673 163991 18509 169000 32840 169999 11433 175000 17140 180000 20351 190000 28090 194000 43668 197000 40918 198000 37840 220000 38299 250000 47337 259000 12682 265000 35923 295000 34723 299000 14715 345000 36818 350000 Name: price, Length: 48401, dtype: int64
autos['odometer_km'].describe()
count 48401.000000 mean 125799.570257 std 39722.502478 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].sort_values()
15428 5000 29518 5000 38385 5000 6750 5000 2040 5000 12971 5000 5031 5000 844 5000 18003 5000 2029 5000 33120 5000 22849 5000 18014 5000 20074 5000 26188 5000 43514 5000 2006 5000 31828 5000 37022 5000 41672 5000 20060 5000 38336 5000 26139 5000 3416 5000 26138 5000 30656 5000 10119 5000 6675 5000 12855 5000 33845 5000 ... 20234 150000 20235 150000 20237 150000 20238 150000 20239 150000 20240 150000 20241 150000 20242 150000 20243 150000 20223 150000 20370 150000 20222 150000 20220 150000 20193 150000 20195 150000 20196 150000 20198 150000 20199 150000 20200 150000 20201 150000 20202 150000 20203 150000 20206 150000 20207 150000 20208 150000 20214 150000 20216 150000 20217 150000 20221 150000 49999 150000 Name: odometer_km, Length: 48401, dtype: int64
We removed all prices under $10 and over $500,000 since these were both outliers in regards to our expectations. Additionally, we also analyzed the odometer readings after removing the outliers from the prices and found that there were no remaining outliers in this column, so we left the data set as it was.
Next, we will look to convert the 'date_crawled', 'last_seen', 'ad_created', 'registration_month', and 'registration_year' columns into data that we can understand and analyze quantitatively.
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 48401.000000 | 48401.000000 | 48401.000000 | 48401.000000 | 48401.000000 | 48401.0 | 48401.000000 |
mean | 5908.885457 | 2004.774715 | 117.362079 | 125799.570257 | 5.790665 | 0.0 | 50993.775418 |
std | 9068.695231 | 88.790201 | 200.923769 | 39722.502478 | 3.682315 | 0.0 | 25744.823947 |
min | 10.000000 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1250.000000 | 1999.000000 | 72.000000 | 125000.000000 | 3.000000 | 0.0 | 30823.000000 |
50% | 3000.000000 | 2004.000000 | 107.000000 | 150000.000000 | 6.000000 | 0.0 | 49733.000000 |
75% | 7498.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71679.000000 |
max | 350000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
autos['registration_year'].head()
0 2004 1 1997 2 2009 3 2007 4 2003 Name: registration_year, dtype: int64
autos['date_crawled'].head()
0 2016-03-26 17:47:46 1 2016-04-04 13:38:56 2 2016-03-26 18:57:24 3 2016-03-12 16:58:10 4 2016-04-01 14:38:50 Name: date_crawled, dtype: object
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)
2016-04-03 0.038574 2016-03-20 0.037809 2016-03-21 0.037293 2016-03-12 0.036962 2016-03-14 0.036631 2016-04-04 0.036487 2016-03-07 0.036053 2016-04-02 0.035495 2016-03-28 0.034855 2016-03-19 0.034751 2016-03-15 0.034276 2016-03-29 0.034111 2016-04-01 0.033739 2016-03-30 0.033739 2016-03-08 0.033264 2016-03-09 0.033037 2016-03-22 0.032933 2016-03-11 0.032603 2016-03-23 0.032272 2016-03-26 0.032251 2016-03-10 0.032210 2016-03-31 0.031818 2016-03-17 0.031611 2016-03-25 0.031570 2016-03-27 0.031115 2016-03-16 0.029504 2016-03-24 0.029400 2016-03-05 0.025371 2016-03-13 0.015661 2016-03-06 0.014070 2016-04-05 0.013078 2016-03-18 0.012913 2016-04-06 0.003161 2016-04-07 0.001384 Name: date_crawled, dtype: float64
The analysis performed above shows that the crawler created all of its entries between March 1, 2016 and April 30, 2016, with a greater percentage happening in late March/ early April.
autos['date_crawled'].str[:10].sort_values()
33784 2016-03-05 34224 2016-03-05 20939 2016-03-05 28933 2016-03-05 20958 2016-03-05 9850 2016-03-05 13612 2016-03-05 47733 2016-03-05 20965 2016-03-05 13620 2016-03-05 13602 2016-03-05 34272 2016-03-05 20991 2016-03-05 21011 2016-03-05 21016 2016-03-05 47746 2016-03-05 42671 2016-03-05 28960 2016-03-05 3456 2016-03-05 20977 2016-03-05 21039 2016-03-05 42785 2016-03-05 42849 2016-03-05 3304 2016-03-05 26925 2016-03-05 42954 2016-03-05 3311 2016-03-05 34139 2016-03-05 26924 2016-03-05 9853 2016-03-05 ... 8576 2016-04-07 41121 2016-04-07 12119 2016-04-07 33996 2016-04-07 44551 2016-04-07 31608 2016-04-07 44797 2016-04-07 49569 2016-04-07 6183 2016-04-07 42072 2016-04-07 2213 2016-04-07 9363 2016-04-07 29938 2016-04-07 29853 2016-04-07 4719 2016-04-07 44894 2016-04-07 24853 2016-04-07 26209 2016-04-07 19809 2016-04-07 21527 2016-04-07 4752 2016-04-07 7982 2016-04-07 21510 2016-04-07 37112 2016-04-07 23409 2016-04-07 4378 2016-04-07 16358 2016-04-07 39445 2016-04-07 3654 2016-04-07 38743 2016-04-07 Name: date_crawled, Length: 48401, dtype: object
Upon further inspection, it seems that every entry was entered between March 5 and April 7 of 2016.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
2016-04-03 0.038822 2016-03-20 0.037871 2016-03-21 0.037520 2016-04-04 0.036859 2016-03-12 0.036776 2016-03-14 0.035268 2016-04-02 0.035185 2016-03-28 0.034979 2016-03-07 0.034793 2016-03-29 0.034049 2016-03-15 0.034008 2016-04-01 0.033718 2016-03-19 0.033636 2016-03-30 0.033553 2016-03-08 0.033264 2016-03-09 0.033119 2016-03-11 0.032933 2016-03-22 0.032747 2016-03-26 0.032313 2016-03-23 0.032107 2016-03-10 0.031921 2016-03-31 0.031859 2016-03-25 0.031694 2016-03-17 0.031280 2016-03-27 0.031032 2016-03-16 0.029999 2016-03-24 0.029338 2016-03-05 0.022933 2016-03-13 0.017024 2016-03-06 0.015330 ... 2016-02-14 0.000041 2016-02-12 0.000041 2016-02-24 0.000041 2016-02-02 0.000041 2016-02-20 0.000041 2016-02-05 0.000041 2016-02-18 0.000041 2016-01-10 0.000041 2016-02-26 0.000041 2015-09-09 0.000021 2016-01-16 0.000021 2015-06-11 0.000021 2016-02-16 0.000021 2016-01-07 0.000021 2016-02-08 0.000021 2016-02-07 0.000021 2016-02-01 0.000021 2016-01-22 0.000021 2016-01-13 0.000021 2016-02-22 0.000021 2016-01-29 0.000021 2016-02-09 0.000021 2015-12-05 0.000021 2016-01-14 0.000021 2015-12-30 0.000021 2015-08-10 0.000021 2016-01-03 0.000021 2016-02-17 0.000021 2016-02-11 0.000021 2015-11-10 0.000021 Name: ad_created, Length: 76, dtype: float64
Similar to our observation before, a great deal of the ads in our data set were created in mid to late March and early April of 2016.
autos['ad_created'].str[:10].sort_values()
22781 2015-06-11 20649 2015-08-10 34883 2015-09-09 2243 2015-11-10 27986 2015-12-05 36993 2015-12-30 2232 2016-01-03 39270 2016-01-07 24851 2016-01-10 49743 2016-01-10 34224 2016-01-13 23708 2016-01-14 36429 2016-01-16 12518 2016-01-22 10942 2016-01-27 4696 2016-01-27 41508 2016-01-27 10735 2016-01-29 13756 2016-02-01 34058 2016-02-02 15081 2016-02-02 46108 2016-02-05 11432 2016-02-05 38294 2016-02-07 25855 2016-02-08 48887 2016-02-09 23602 2016-02-11 9706 2016-02-12 19356 2016-02-12 9467 2016-02-14 ... 39045 2016-04-07 1138 2016-04-07 29853 2016-04-07 30532 2016-04-07 23957 2016-04-07 7536 2016-04-07 8036 2016-04-07 20941 2016-04-07 21527 2016-04-07 49569 2016-04-07 12119 2016-04-07 10765 2016-04-07 4378 2016-04-07 44797 2016-04-07 37268 2016-04-07 42822 2016-04-07 45786 2016-04-07 24929 2016-04-07 47709 2016-04-07 22899 2016-04-07 47588 2016-04-07 48922 2016-04-07 11538 2016-04-07 13976 2016-04-07 37112 2016-04-07 41121 2016-04-07 16304 2016-04-07 29938 2016-04-07 39445 2016-04-07 48576 2016-04-07 Name: ad_created, Length: 48401, dtype: object
When we look at the values in the 'ad_created' column sorted by date, we notice that the earliest dates any ads were created is June 11, 2015 while the most recent ads recorded are from April 7, 2016.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)
2016-04-06 0.221607 2016-04-07 0.132043 2016-04-05 0.124956 2016-03-17 0.028057 2016-04-03 0.025185 2016-04-02 0.024855 2016-03-30 0.024752 2016-04-04 0.024504 2016-03-31 0.023842 2016-03-12 0.023781 2016-04-01 0.022871 2016-03-29 0.022334 2016-03-22 0.021384 2016-03-28 0.020867 2016-03-20 0.020640 2016-03-21 0.020599 2016-03-24 0.019752 2016-03-25 0.019194 2016-03-23 0.018595 2016-03-26 0.016818 2016-03-16 0.016446 2016-03-15 0.015867 2016-03-19 0.015826 2016-03-27 0.015599 2016-03-14 0.012624 2016-03-11 0.012376 2016-03-10 0.010620 2016-03-09 0.009628 2016-03-13 0.008863 2016-03-08 0.007355 2016-03-18 0.007335 2016-03-07 0.005413 2016-03-06 0.004339 2016-03-05 0.001074 Name: last_seen, dtype: float64
When we observe the percentages of ads based on the 'last_seen' column, we see that over half of the ads have been checked on between April 5-7, which corresponds to the latest dates that the data was collected.
autos['last_seen'].str[:10].sort_values()
24801 2016-03-05 27294 2016-03-05 34030 2016-03-05 13133 2016-03-05 39178 2016-03-05 1736 2016-03-05 22493 2016-03-05 48946 2016-03-05 9628 2016-03-05 30389 2016-03-05 11329 2016-03-05 4488 2016-03-05 40848 2016-03-05 33684 2016-03-05 3580 2016-03-05 1027 2016-03-05 36848 2016-03-05 48668 2016-03-05 3983 2016-03-05 18611 2016-03-05 33712 2016-03-05 37880 2016-03-05 34139 2016-03-05 2346 2016-03-05 10942 2016-03-05 19738 2016-03-05 669 2016-03-05 26529 2016-03-05 10142 2016-03-05 31117 2016-03-05 ... 18978 2016-04-07 18983 2016-04-07 18998 2016-04-07 19007 2016-04-07 42300 2016-04-07 19010 2016-04-07 19014 2016-04-07 19022 2016-04-07 19024 2016-04-07 19039 2016-04-07 19040 2016-04-07 19042 2016-04-07 19043 2016-04-07 19046 2016-04-07 19111 2016-04-07 19110 2016-04-07 42265 2016-04-07 19091 2016-04-07 19089 2016-04-07 42272 2016-04-07 18593 2016-04-07 19083 2016-04-07 42275 2016-04-07 19063 2016-04-07 19058 2016-04-07 42282 2016-04-07 42283 2016-04-07 19047 2016-04-07 19080 2016-04-07 17702 2016-04-07 Name: last_seen, Length: 48401, dtype: object
Upon further notice, every ad in our database has a last seen date between March 5 and April 7 of 2016, precisely equivalent to the date_crawled column in regards to the range.
autos['registration_year'].describe()
count 48401.000000 mean 2004.774715 std 88.790201 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
We notice that there are cars whose registration year is marked as '1000' and some are much higher than '2016', which cannot be the case. This means that we will have to remove some data, as the registration years column has outliers that do not make sense.
autos['registration_year'].sort_values(ascending=False)
8012 9999 33950 9999 38076 9999 49910 9000 25003 8888 8360 6200 27618 5911 24519 5000 22799 5000 4164 5000 49153 5000 42079 4800 453 4500 4549 4100 27578 2800 5763 2019 49185 2019 34800 2018 17069 2018 31810 2018 24334 2018 5950 2018 6321 2018 22339 2018 9575 2018 41840 2018 38978 2018 27672 2018 6597 2018 21379 2018 ... 25556 1954 35453 1953 23372 1952 44406 1951 35921 1951 14020 1950 11047 1948 11585 1943 25792 1941 13963 1941 24855 1939 26103 1938 21421 1937 23804 1937 39725 1937 26607 1937 2573 1934 2221 1934 11246 1931 22101 1929 21416 1927 28693 1910 45157 1910 22659 1910 30781 1910 10556 1800 32585 1800 24511 1111 49283 1001 22316 1000 Name: registration_year, Length: 48401, dtype: int64
autos = autos[autos['registration_year'].between(1899,2016)]
autos['registration_year'].describe()
count 46524.000000 mean 2002.926468 std 7.155528 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
Upon removing the outliers, now our data, in regards to registration year, makes more sense.
autos['brand'].value_counts()
volkswagen 9833 bmw 5116 opel 4994 mercedes_benz 4488 audi 4030 ford 3256 renault 2193 peugeot 1390 fiat 1195 seat 849 skoda 764 nissan 712 mazda 709 smart 661 citroen 653 toyota 593 hyundai 467 sonstige_autos 448 volvo 425 mini 408 mitsubishi 381 honda 365 kia 330 alfa_romeo 309 porsche 283 suzuki 276 chevrolet 265 chrysler 164 dacia 123 daihatsu 117 jeep 106 subaru 99 land_rover 98 saab 77 jaguar 73 daewoo 70 trabant 65 rover 62 lancia 50 lada 27 Name: brand, dtype: int64
top_auto_brands = autos['brand'].value_counts().sort_values(ascending=False).head(20).index
top_autos_price = {}
top_auto_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart', 'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'], dtype='object')
We have decided to aggregate over the top 20 brands that are in our data set. This should give us relevant info for each brand that shows most frequently.
for car in top_auto_brands:
brand_row = autos[autos['brand'] == car]
mean_price = brand_row['price'].mean()
mean_mileage = brand_row['odometer_km'].mean()
top_autos_price[car] = mean_price
sorted_top_autos_price = sorted(top_autos_price.items(), key=lambda x: x[1], reverse=True)
print(sorted_top_autos_price)
[('sonstige_autos', 12613.939732142857), ('mini', 10639.450980392157), ('audi', 9362.169478908188), ('mercedes_benz', 8657.285427807486), ('bmw', 8367.020719311962), ('skoda', 6384.667539267016), ('volkswagen', 5418.33997762636), ('hyundai', 5376.740899357602), ('toyota', 5167.091062394604), ('volvo', 4969.774117647059), ('nissan', 4750.063202247191), ('seat', 4417.942285041225), ('mazda', 4112.596614950635), ('citroen', 3784.924961715161), ('ford', 3757.5282555282556), ('smart', 3580.2239031770046), ('peugeot', 3100.6928057553955), ('opel', 2991.9164997997595), ('fiat', 2818.456066945607), ('renault', 2483.889192886457)]
In the code above, we created a dictionary that refers to each car brand from our top 20, along with the mean price for each brand. Looking at our results, we see that Sonstige Autos has our highest mean price, followed by Mini and Audi. The lowest mean price is shown to be Renault, with Fiat and Opel rounding out the other two lowest mean price.
Additionally, the mean price for all the car brands in the top 20 falls between $2,483 and $12,614.
top_autos_mileage = {}
for car in top_auto_brands:
brand_row = autos[autos['brand'] == car]
mean_price = brand_row['price'].mean()
mean_mileage = brand_row['odometer_km'].mean()
top_autos_mileage[car] = mean_mileage
print(top_autos_mileage)
{'audi': 129208.43672456576, 'toyota': 115944.35075885328, 'skoda': 110746.07329842931, 'peugeot': 127122.30215827338, 'fiat': 117066.94560669456, 'opel': 129314.1770124149, 'nissan': 118370.78651685393, 'volkswagen': 128706.39682701108, 'seat': 121166.0777385159, 'ford': 124210.68796068797, 'mercedes_benz': 130838.9037433155, 'citroen': 119647.77947932619, 'hyundai': 106541.75588865097, 'mazda': 124464.03385049365, 'volvo': 138294.11764705883, 'renault': 128144.09484724123, 'bmw': 132553.75293197812, 'mini': 88308.82352941176, 'sonstige_autos': 90446.42857142857, 'smart': 99326.77760968229}
mean_price_series = pd.Series(top_autos)
mean_mileage_series = pd.Series(top_autos_mileage)
print(mean_price_series)
audi 9362.169479 bmw 8367.020719 citroen 3784.924962 fiat 2818.456067 ford 3757.528256 hyundai 5376.740899 mazda 4112.596615 mercedes_benz 8657.285428 mini 10639.450980 nissan 4750.063202 opel 2991.916500 peugeot 3100.692806 renault 2483.889193 seat 4417.942285 skoda 6384.667539 smart 3580.223903 sonstige_autos 12613.939732 toyota 5167.091062 volkswagen 5418.339978 volvo 4969.774118 dtype: float64
print(mean_mileage_series)
audi 129208.436725 bmw 132553.752932 citroen 119647.779479 fiat 117066.945607 ford 124210.687961 hyundai 106541.755889 mazda 124464.033850 mercedes_benz 130838.903743 mini 88308.823529 nissan 118370.786517 opel 129314.177012 peugeot 127122.302158 renault 128144.094847 seat 121166.077739 skoda 110746.073298 smart 99326.777610 sonstige_autos 90446.428571 toyota 115944.350759 volkswagen 128706.396827 volvo 138294.117647 dtype: float64
brand_price_mileage_df = pd.DataFrame({'mean_price': mean_price_series, 'mean_mileage': mean_mileage_series})
brand_price_mileage_df
mean_mileage | mean_price | |
---|---|---|
audi | 129208.436725 | 9362.169479 |
bmw | 132553.752932 | 8367.020719 |
citroen | 119647.779479 | 3784.924962 |
fiat | 117066.945607 | 2818.456067 |
ford | 124210.687961 | 3757.528256 |
hyundai | 106541.755889 | 5376.740899 |
mazda | 124464.033850 | 4112.596615 |
mercedes_benz | 130838.903743 | 8657.285428 |
mini | 88308.823529 | 10639.450980 |
nissan | 118370.786517 | 4750.063202 |
opel | 129314.177012 | 2991.916500 |
peugeot | 127122.302158 | 3100.692806 |
renault | 128144.094847 | 2483.889193 |
seat | 121166.077739 | 4417.942285 |
skoda | 110746.073298 | 6384.667539 |
smart | 99326.777610 | 3580.223903 |
sonstige_autos | 90446.428571 | 12613.939732 |
toyota | 115944.350759 | 5167.091062 |
volkswagen | 128706.396827 | 5418.339978 |
volvo | 138294.117647 | 4969.774118 |
We have now added average mileage and average price to a data frame for the top 20 brands from our data set. We have created two columns to show the values as such, with the index being the brand names themselves.
Our analysis of the data shows a few observations. The brands Sonstige Autos and Mini, which were both in the top 3 for the highest average price, also happen to be the lowest two in average mileage.
Audi, which was in the top 3 in regards to average price, falls on the top 4 in regards to average mileage. This means that Audis have the highest cost per miles driven in comparison to the other cars in the list.
Opel and Renault both have very high average mileage, which makes the lower average price seem appropriate. In contrast, Fiat falls somewhere in the middle in regards to mileage, which means that the value with regards to price per mile is likely lower than many of the other cars in the list.