This project aims to clean and analyze a dataset from a German car classifieds section on eBay. Pandas will be used to analyze the data. 50,000 data points were sampled from the original data set. While the data is no longer available on Kaggle (where it was initially scraped and uploaded), it can be found here. The dataset was purposefully tampered with as a means of learning how to scrape and clean imperfect data.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
print(autos.info())
print(autos.head())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB None dateCrawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... seller offerType price abtest vehicleType yearOfRegistration \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 gearbox powerPS model odometer monthOfRegistration fuelType \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin brand notRepairedDamage dateCreated nrOfPictures \ 0 peugeot nein 2016-03-26 00:00:00 0 1 bmw nein 2016-04-04 00:00:00 0 2 volkswagen nein 2016-03-26 00:00:00 0 3 smart nein 2016-03-12 00:00:00 0 4 ford nein 2016-04-01 00:00:00 0 postalCode lastSeen 0 79588 2016-04-06 06:45:54 1 71034 2016-04-06 14:45:08 2 35394 2016-04-06 20:15:37 3 33729 2016-03-15 03:16:28 4 39218 2016-04-01 14:38:50
vehicleType
,gearbox
,model
,fuelType
,notRepairedDamage
contain roughly 2,500 - 10,000 null values. Also, certain column values are not integers. Some data cleaning will be required for columns like dateCrawled
,price
,odometer
,dateCreated
, and lastSeen
. In order to analyze the data in these columns, these values need to be evaluated as integers.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
The above columns need to be modified in the following way:
yearOfRegistration
,monthOfRegistration
,notRepairedDamage
, as well as dateCreated
will be reworded and all columns will be rewritten in snakecase.
The corrected snakecase columns will be reassigned back to autos.columns
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-08 10:40:35 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Only one column (nr_of_pictures
) appears like a good candidate to be dropped because the column description indicates 50,00 entries that are all NaN
.
Suitable columns for investigation look like name
, price
, vehicle_type
, registration_year
, gearbox
,model
,odometer
,fuel_type
,brand
,unrepaired_damage
, ad_created
, and last_seen
.
Numeric data stored as text which needs to be cleaned for further analysis would be the following columns: date_crawled
,price
,odometer
, ad_created
, and last_seen
.
vehicle_type_counts = autos["vehicle_type"].value_counts()
print(vehicle_type_counts)
limousine 12859 kleinwagen 10822 kombi 9127 bus 4093 cabrio 3061 coupe 2537 suv 1986 andere 420 Name: vehicle_type, dtype: int64
The data shows two columns with values with frequencies that are essentially one value: seller
and offer_type
. Other columns which need more investigation are vehicle_type
, model
, and brand
. Examples of columns which are stored as text that need further cleaning are: price
,registration_year
,odometer
, and nr_of_pictures
.
autos["price"] = autos["price"].str.replace('$','').str.replace(',','').astype(int)
autos.rename({"price":"price_$"}, axis = 1, inplace = True)
autos.head()
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos["odometer"] = autos["odometer"].str.replace(',','').str.replace('km','').astype(int)
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)
autos.head()
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
the values by category within the odometer_km
column and the price
column.
odometer_km = autos["odometer_km"]
odometer_km.unique().shape
(13,)
odometer_km.describe
<bound method NDFrame.describe of 0 150000 1 150000 2 70000 3 70000 4 150000 ... 49995 100000 49996 150000 49997 5000 49998 40000 49999 150000 Name: odometer_km, Length: 50000, dtype: int64>
odometer_km.value_counts().sort_index(ascending = False).head()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 Name: odometer_km, dtype: int64
odometer_km.value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
autos["price_$"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_$, dtype: float64
autos["price_$"].value_counts().sort_index(ascending = True).head(10)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 Name: price_$, dtype: int64
autos["price_$"].value_counts().sort_index(ascending = False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price_$, dtype: int64
autos = autos[autos["price_$"].between(1,100000)]
autos["price_$"].shape
(48526,)
## autos.loc[autos["price_$"] < 1, "price_$" ] = np.nan
## autos.loc[autos["price_$"] > 100000, "price_$" ] = np.nan
## price_ = autos["price_$"]
## price_no_null = price_.dropna(axis = 0 )
## price_no_null.describe()
## price_no_null.unique().shape
autos["price_$"].describe()
count 48526.000000 mean 5755.574146 std 7514.501772 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7450.000000 max 99900.000000 Name: price_$, dtype: float64
autos["price_$"].value_counts().sort_index(ascending=True).head(10)
1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 Name: price_$, dtype: int64
autos["price_$"].value_counts().sort_index(ascending = False).head(20)
99900 2 99000 2 98500 1 94999 1 93911 1 93000 2 89900 1 89000 1 88900 1 86500 1 85000 1 84997 1 84000 1 83000 1 82987 1 80000 3 79999 1 79980 1 79933 1 79500 1 Name: price_$, dtype: int64
the price values are upwards of 100 million dollars. The price column needed to be brought within a more reasonable range. More than 90% of the data (counts
after filtering divided by counts
before filtering) was found with values between 1 and 100,000 dollars. Since it's not unreasonable to assume that bids start at 1 dollar, those values were included. Value counts also indicates that this price (1 dollar) contains two orders of magnitude more value counts than the next nine subsequent prices. Conversely, the highest 20 values still contained within the price range, (1- 1,000,000 dollars) have a great deal fewer value counts. While the higher range is important to include, it's clear that the listings of interest should probably not exceed 1 million dollars.
## autos["date_crawled"] = autos['date_crawled'].str[:10]
autos["date_crawled"] = autos["date_crawled"].str[:10]
<ipython-input-32-d68c5cc0133a>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy autos["date_crawled"] = autos["date_crawled"].str[:10]
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index(ascending = True)
2016-03-05 0.025306 2016-03-06 0.014054 2016-03-07 0.036043 2016-03-08 0.033281 2016-03-09 0.033096 2016-03-10 0.032210 2016-03-11 0.032601 2016-03-12 0.036929 2016-03-13 0.015682 2016-03-14 0.036578 2016-03-15 0.034270 2016-03-16 0.029613 2016-03-17 0.031633 2016-03-18 0.012880 2016-03-19 0.034806 2016-03-20 0.037897 2016-03-21 0.037361 2016-03-22 0.032910 2016-03-23 0.032230 2016-03-24 0.029345 2016-03-25 0.031612 2016-03-26 0.032189 2016-03-27 0.031076 2016-03-28 0.034847 2016-03-29 0.034126 2016-03-30 0.033652 2016-03-31 0.031839 2016-04-01 0.033652 2016-04-02 0.035507 2016-04-03 0.038598 2016-04-04 0.036517 2016-04-05 0.013086 2016-04-06 0.003174 2016-04-07 0.001401 Name: date_crawled, dtype: float64
autos["ad_created"] = autos["ad_created"].str[:10]
<ipython-input-34-e8d1dccb314d>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy autos["ad_created"] = autos["ad_created"].str[:10]
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index(ascending = True)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038845 2016-04-04 0.036887 2016-04-05 0.011808 2016-04-06 0.003256 2016-04-07 0.001257 Name: ad_created, Length: 76, dtype: float64
However, there are a few that were created upwards of 9-10 months prior to this 1-2 month March/April span.
autos['last_seen'] = autos['last_seen'].str[:10]
<ipython-input-36-ba4857993d1d>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy autos['last_seen'] = autos['last_seen'].str[:10]
autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index(ascending = True)
2016-03-05 0.001072 2016-03-06 0.004328 2016-03-07 0.005399 2016-03-08 0.007419 2016-03-09 0.009582 2016-03-10 0.010675 2016-03-11 0.012385 2016-03-12 0.023802 2016-03-13 0.008902 2016-03-14 0.012612 2016-03-15 0.015888 2016-03-16 0.016445 2016-03-17 0.028109 2016-03-18 0.007336 2016-03-19 0.015847 2016-03-20 0.020649 2016-03-21 0.020628 2016-03-22 0.021391 2016-03-23 0.018526 2016-03-24 0.019742 2016-03-25 0.019227 2016-03-26 0.016754 2016-03-27 0.015641 2016-03-28 0.020855 2016-03-29 0.022359 2016-03-30 0.024791 2016-03-31 0.023802 2016-04-01 0.022813 2016-04-02 0.024914 2016-04-03 0.025203 2016-04-04 0.024502 2016-04-05 0.124634 2016-04-06 0.221840 2016-04-07 0.131929 Name: last_seen, dtype: float64
last_seen
¶is generally indicative of a car being sold, (and consequently the add being taken down) it may also indicate that the crawler had been spending a couple days taking inventory of what ads were stll listed on the site. Thus, the last time the ad was seen by human is not necessarily indicative of the higher frequencies shown April 5th through April 7th. It also more than likely indicates that there wasn't a spike in sales, but instead, a bot conflating differences in human and artificial last seen counts by perhaps taking a kind of monthly inventory.
autos['registration_year'].describe()
count 48526.000000 mean 2004.754935 std 88.678603 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
registration_year
above¶which clearly need to be removed from the data set. The automobile was invented c. 1885 and at the time of this dataset being pulled, the year wasn't later than 2016. Furthermore, to be more conservative with our estimates, let's assume a vast majority of consumers will probably NOT be shopping for a car before 1900.
autos = autos[autos['registration_year'].between(1900,2016)]
autos['registration_year'].describe()
count 46642.000000 mean 2002.908709 std 7.176230 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
describe what listings remain after being filtered by range of years. Since there is still a market for early 20th century cars, (given the first car was invented in 1886) and since the last listing was seen in April of 2016, it was determined that this range was best.
autos['registration_year'].value_counts(normalize = True).sort_index().tail(50)
1967 0.000557 1968 0.000536 1969 0.000407 1970 0.000815 1971 0.000557 1972 0.000708 1973 0.000493 1974 0.000515 1975 0.000386 1976 0.000450 1977 0.000450 1978 0.000943 1979 0.000729 1980 0.001822 1981 0.000600 1982 0.000879 1983 0.001093 1984 0.001093 1985 0.002037 1986 0.001544 1987 0.001544 1988 0.002873 1989 0.003731 1990 0.007440 1991 0.007268 1992 0.007911 1993 0.009112 1994 0.013486 1995 0.026285 1996 0.029437 1997 0.041808 1998 0.050598 1999 0.062111 2000 0.067643 2001 0.056494 2002 0.053300 2003 0.057866 2004 0.057952 2005 0.062948 2006 0.057245 2007 0.048797 2008 0.047447 2009 0.044702 2010 0.033982 2011 0.034776 2012 0.028022 2013 0.017130 2014 0.014172 2015 0.008319 2016 0.026028 Name: registration_year, dtype: float64
it's quite clear that with the exception of 2015, between 1994 and 2016 inclusive, that these registration years comprise greater than 1 percent of total listings with many of these years being well above 3 percent.
values = autos['brand'].value_counts(normalize=True)
print(values)
top_brands = values["volkswagen":"ford"].index ##help understand syntax?
## or top_brands = values[values > 0.05].index
print('\n')
print(values[values > 0.05])
print(top_brands)
volkswagen 0.211440 bmw 0.110072 opel 0.107671 mercedes_benz 0.096480 audi 0.086617 ford 0.069937 renault 0.047189 peugeot 0.029866 fiat 0.025664 seat 0.018288 skoda 0.016423 nissan 0.015287 mazda 0.015201 smart 0.014172 citroen 0.014022 toyota 0.012714 hyundai 0.010034 sonstige_autos 0.009712 volvo 0.009155 mini 0.008769 mitsubishi 0.008233 honda 0.007847 kia 0.007075 alfa_romeo 0.006646 suzuki 0.005939 chevrolet 0.005703 porsche 0.005574 chrysler 0.003516 dacia 0.002637 daihatsu 0.002508 jeep 0.002273 subaru 0.002144 land_rover 0.002101 saab 0.001651 jaguar 0.001565 daewoo 0.001501 trabant 0.001394 rover 0.001329 lancia 0.001072 lada 0.000579 Name: brand, dtype: float64 volkswagen 0.211440 bmw 0.110072 opel 0.107671 mercedes_benz 0.096480 audi 0.086617 ford 0.069937 Name: brand, dtype: float64 Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
in order to get a sense for what the mean cost is for any brand that comprises over 5 percent of the total listings. Thus, more popular well-known brands can be understood in this analysis as well as their mean costs.
MOT_automobiles = {} ##mean of top automobiles
for brands in top_brands:
Selected_rows = autos[autos["brand"] == brands]
mean = Selected_rows["price_$"].mean()
MOT_automobiles[brands] = int(mean)
MOT_auto_list = sorted(MOT_automobiles.items(), key = lambda x:x[1])
sort_MOT_auto_list = dict(MOT_auto_list)
MOT_autolist = dict(MOT_auto_list)
MOT_automobiles
{'volkswagen': 5402, 'bmw': 8201, 'opel': 2975, 'mercedes_benz': 8529, 'audi': 9295, 'ford': 3710}
and the mean price for the brand posted, there doesn't seem to be a correlation between how many posts a brand has with it's cost (i.e. more posts means higher cost). There is one example where a higher mean cost has the second lowest frequency as compared to the other six brands (audi) and also an example of where a relatively lower mean cost has the highest frequency of postings (volkswagen). Frequency is based on popularity and popularity seems to be multi-faceted; one component perhaps being due to affordability, another being due to perceived quality of the vehicle, and yet another perhaps being due to some combination of the aforementioned.
new_values = values[values > 0.05]
print(new_values)
volkswagen 0.211440 bmw 0.110072 opel 0.107671 mercedes_benz 0.096480 audi 0.086617 ford 0.069937 Name: brand, dtype: float64
new_series = pd.Series(MOT_autolist)
print(new_series)
opel 2975 ford 3710 volkswagen 5402 bmw 8201 mercedes_benz 8529 audi 9295 dtype: int64
new_df = pd.DataFrame(new_series, columns = ['mean_price'])
new_df
mean_price | |
---|---|
opel | 2975 |
ford | 3710 |
volkswagen | 5402 |
bmw | 8201 |
mercedes_benz | 8529 |
audi | 9295 |
Mean_MofTA = {} #mean mileage of top Automobiles dictionary
for brands in top_brands:
selected_rows_miles = autos[autos["brand"] == brands]
mean_miles = selected_rows_miles["odometer_km"].mean()
Mean_MofTA[brands] = int(mean_miles)
Mean_MofTA_list = sorted(Mean_MofTA.items(), key = lambda x:x[1])
sort_Mean_MofTA_list = dict(Mean_MofTA_list)
Mean_MofTA_list = dict(Mean_MofTA_list)
Mean_MofTA_list
{'ford': 124288, 'volkswagen': 128707, 'audi': 129188, 'opel': 129310, 'mercedes_benz': 130851, 'bmw': 132633}
MMofTA_series = pd.Series(Mean_MofTA_list)
print(MMofTA_series)
ford 124288 volkswagen 128707 audi 129188 opel 129310 mercedes_benz 130851 bmw 132633 dtype: int64
MMofTA_df = pd.DataFrame(MMofTA_series, columns = ['mean_mileage'])
MMofTA_df
mean_mileage | |
---|---|
ford | 124288 |
volkswagen | 128707 |
audi | 129188 |
opel | 129310 |
mercedes_benz | 130851 |
bmw | 132633 |
d = {'mean_price': pd.Series(MOT_autolist), 'mean_mileage': pd.Series(MMofTA_series)}
pd.DataFrame(data = d, index = top_brands)
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5402 | 128707 |
bmw | 8201 | 132633 |
opel | 2975 | 129310 |
mercedes_benz | 8529 | 130851 |
audi | 9295 | 129188 |
ford | 3710 | 124288 |
autos.head()
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 |
1 | 2016-04-04 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 |
2 | 2016-03-26 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 | 0 | 35394 | 2016-04-06 |
3 | 2016-03-12 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 | 0 | 33729 | 2016-03-15 |
4 | 2016-04-01 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 | 0 | 39218 | 2016-04-01 |
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
autos["vehicle_type"].isnull().value_counts()
False 43939 True 2703 Name: vehicle_type, dtype: int64
autos.head(20)
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 |
1 | 2016-04-04 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 |
2 | 2016-03-26 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 | 0 | 35394 | 2016-04-06 |
3 | 2016-03-12 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 | 0 | 33729 | 2016-03-15 |
4 | 2016-04-01 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 | 0 | 39218 | 2016-04-01 |
5 | 2016-03-21 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 2016-03-21 | 0 | 22962 | 2016-04-06 |
6 | 2016-03-20 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | 300 | test | limousine | 1995 | manuell | 90 | golf | 150000 | 8 | benzin | volkswagen | NaN | 2016-03-20 | 0 | 31535 | 2016-03-23 |
7 | 2016-03-16 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 2016-03-16 | 0 | 53474 | 2016-04-07 |
8 | 2016-03-22 | Seat_Arosa | privat | Angebot | 250 | test | NaN | 2000 | manuell | 0 | arosa | 150000 | 10 | NaN | seat | nein | 2016-03-22 | 0 | 7426 | 2016-03-26 |
9 | 2016-03-16 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | 590 | control | bus | 1997 | manuell | 90 | megane | 150000 | 7 | benzin | renault | nein | 2016-03-16 | 0 | 15749 | 2016-04-06 |
11 | 2016-03-16 | Mercedes_A140_Motorschaden | privat | Angebot | 350 | control | NaN | 2000 | NaN | 0 | NaN | 150000 | 0 | benzin | mercedes_benz | NaN | 2016-03-16 | 0 | 17498 | 2016-03-16 |
12 | 2016-03-31 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | 5299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50000 | 9 | benzin | smart | nein | 2016-03-31 | 0 | 34590 | 2016-04-06 |
13 | 2016-03-23 | Audi_A3_1.6_tuning | privat | Angebot | 1350 | control | limousine | 1999 | manuell | 101 | a3 | 150000 | 11 | benzin | audi | nein | 2016-03-23 | 0 | 12043 | 2016-04-01 |
14 | 2016-03-23 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | 3999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150000 | 9 | benzin | renault | NaN | 2016-03-23 | 0 | 81737 | 2016-04-01 |
15 | 2016-04-01 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | 18900 | test | coupe | 1982 | automatik | 203 | NaN | 80000 | 6 | benzin | sonstige_autos | nein | 2016-04-01 | 0 | 61276 | 2016-04-02 |
16 | 2016-03-16 | Opel_Vectra_B_Kombi | privat | Angebot | 350 | test | kombi | 1999 | manuell | 101 | vectra | 150000 | 5 | benzin | opel | nein | 2016-03-16 | 0 | 57299 | 2016-03-18 |
17 | 2016-03-29 | Volkswagen_Scirocco_2_G60 | privat | Angebot | 5500 | test | coupe | 1990 | manuell | 205 | scirocco | 150000 | 6 | benzin | volkswagen | nein | 2016-03-29 | 0 | 74821 | 2016-04-05 |
18 | 2016-03-26 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | 300 | control | bus | 1995 | manuell | 150 | 3er | 150000 | 0 | benzin | bmw | NaN | 2016-03-26 | 0 | 54329 | 2016-04-02 |
19 | 2016-03-17 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | 4150 | control | suv | 2004 | manuell | 124 | andere | 150000 | 2 | lpg | mazda | nein | 2016-03-17 | 0 | 40878 | 2016-03-17 |
20 | 2016-03-05 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | 3500 | test | kombi | 2003 | manuell | 131 | a4 | 150000 | 5 | diesel | audi | NaN | 2016-03-05 | 0 | 53913 | 2016-03-07 |
vehicle_type_dict = {"kleinwagen" : "small car",
"kombi" : "station wagon",
"cabrio" : "convertible",
"andere" : "other",
"bus" : "bus",
"limousine" : "limousine",
"coupe" : "coupe",
"suv" : "suv"
}
autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_type_dict)
autos["vehicle_type"].fillna("unknown_type", inplace = True)
autos.head(20)
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 | 0 | 79588 | 2016-04-06 |
1 | 2016-04-04 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 | 0 | 71034 | 2016-04-06 |
2 | 2016-03-26 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 | 0 | 35394 | 2016-04-06 |
3 | 2016-03-12 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | small car | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 | 0 | 33729 | 2016-03-15 |
4 | 2016-04-01 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | station wagon | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 | 0 | 39218 | 2016-04-01 |
5 | 2016-03-21 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 2016-03-21 | 0 | 22962 | 2016-04-06 |
6 | 2016-03-20 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | 300 | test | limousine | 1995 | manuell | 90 | golf | 150000 | 8 | benzin | volkswagen | NaN | 2016-03-20 | 0 | 31535 | 2016-03-23 |
7 | 2016-03-16 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 2016-03-16 | 0 | 53474 | 2016-04-07 |
8 | 2016-03-22 | Seat_Arosa | privat | Angebot | 250 | test | unknown_type | 2000 | manuell | 0 | arosa | 150000 | 10 | NaN | seat | nein | 2016-03-22 | 0 | 7426 | 2016-03-26 |
9 | 2016-03-16 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | 590 | control | bus | 1997 | manuell | 90 | megane | 150000 | 7 | benzin | renault | nein | 2016-03-16 | 0 | 15749 | 2016-04-06 |
11 | 2016-03-16 | Mercedes_A140_Motorschaden | privat | Angebot | 350 | control | unknown_type | 2000 | NaN | 0 | NaN | 150000 | 0 | benzin | mercedes_benz | NaN | 2016-03-16 | 0 | 17498 | 2016-03-16 |
12 | 2016-03-31 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | 5299 | control | small car | 2010 | automatik | 71 | fortwo | 50000 | 9 | benzin | smart | nein | 2016-03-31 | 0 | 34590 | 2016-04-06 |
13 | 2016-03-23 | Audi_A3_1.6_tuning | privat | Angebot | 1350 | control | limousine | 1999 | manuell | 101 | a3 | 150000 | 11 | benzin | audi | nein | 2016-03-23 | 0 | 12043 | 2016-04-01 |
14 | 2016-03-23 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | 3999 | test | small car | 2007 | manuell | 75 | clio | 150000 | 9 | benzin | renault | NaN | 2016-03-23 | 0 | 81737 | 2016-04-01 |
15 | 2016-04-01 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | 18900 | test | coupe | 1982 | automatik | 203 | NaN | 80000 | 6 | benzin | sonstige_autos | nein | 2016-04-01 | 0 | 61276 | 2016-04-02 |
16 | 2016-03-16 | Opel_Vectra_B_Kombi | privat | Angebot | 350 | test | station wagon | 1999 | manuell | 101 | vectra | 150000 | 5 | benzin | opel | nein | 2016-03-16 | 0 | 57299 | 2016-03-18 |
17 | 2016-03-29 | Volkswagen_Scirocco_2_G60 | privat | Angebot | 5500 | test | coupe | 1990 | manuell | 205 | scirocco | 150000 | 6 | benzin | volkswagen | nein | 2016-03-29 | 0 | 74821 | 2016-04-05 |
18 | 2016-03-26 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | 300 | control | bus | 1995 | manuell | 150 | 3er | 150000 | 0 | benzin | bmw | NaN | 2016-03-26 | 0 | 54329 | 2016-04-02 |
19 | 2016-03-17 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | 4150 | control | suv | 2004 | manuell | 124 | andere | 150000 | 2 | lpg | mazda | nein | 2016-03-17 | 0 | 40878 | 2016-03-17 |
20 | 2016-03-05 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | 3500 | test | station wagon | 2003 | manuell | 131 | a4 | 150000 | 5 | diesel | audi | NaN | 2016-03-05 | 0 | 53913 | 2016-03-07 |
autos["date_crawled"] = autos["date_crawled"].str.replace("-","")
autos["ad_created"] = autos["ad_created"].str.replace("-","")
autos["last_seen"] = autos["last_seen"].str.replace("-","")
autos.head(20)
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 20160326 | 0 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 20160404 | 0 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 20160326 | 0 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | small car | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 20160312 | 0 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | station wagon | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 20160401 | 0 | 39218 | 20160401 |
5 | 20160321 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 20160321 | 0 | 22962 | 20160406 |
6 | 20160320 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | 300 | test | limousine | 1995 | manuell | 90 | golf | 150000 | 8 | benzin | volkswagen | NaN | 20160320 | 0 | 31535 | 20160323 |
7 | 20160316 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 20160316 | 0 | 53474 | 20160407 |
8 | 20160322 | Seat_Arosa | privat | Angebot | 250 | test | unknown_type | 2000 | manuell | 0 | arosa | 150000 | 10 | NaN | seat | nein | 20160322 | 0 | 7426 | 20160326 |
9 | 20160316 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | 590 | control | bus | 1997 | manuell | 90 | megane | 150000 | 7 | benzin | renault | nein | 20160316 | 0 | 15749 | 20160406 |
11 | 20160316 | Mercedes_A140_Motorschaden | privat | Angebot | 350 | control | unknown_type | 2000 | NaN | 0 | NaN | 150000 | 0 | benzin | mercedes_benz | NaN | 20160316 | 0 | 17498 | 20160316 |
12 | 20160331 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | 5299 | control | small car | 2010 | automatik | 71 | fortwo | 50000 | 9 | benzin | smart | nein | 20160331 | 0 | 34590 | 20160406 |
13 | 20160323 | Audi_A3_1.6_tuning | privat | Angebot | 1350 | control | limousine | 1999 | manuell | 101 | a3 | 150000 | 11 | benzin | audi | nein | 20160323 | 0 | 12043 | 20160401 |
14 | 20160323 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | 3999 | test | small car | 2007 | manuell | 75 | clio | 150000 | 9 | benzin | renault | NaN | 20160323 | 0 | 81737 | 20160401 |
15 | 20160401 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | 18900 | test | coupe | 1982 | automatik | 203 | NaN | 80000 | 6 | benzin | sonstige_autos | nein | 20160401 | 0 | 61276 | 20160402 |
16 | 20160316 | Opel_Vectra_B_Kombi | privat | Angebot | 350 | test | station wagon | 1999 | manuell | 101 | vectra | 150000 | 5 | benzin | opel | nein | 20160316 | 0 | 57299 | 20160318 |
17 | 20160329 | Volkswagen_Scirocco_2_G60 | privat | Angebot | 5500 | test | coupe | 1990 | manuell | 205 | scirocco | 150000 | 6 | benzin | volkswagen | nein | 20160329 | 0 | 74821 | 20160405 |
18 | 20160326 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | 300 | control | bus | 1995 | manuell | 150 | 3er | 150000 | 0 | benzin | bmw | NaN | 20160326 | 0 | 54329 | 20160402 |
19 | 20160317 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | 4150 | control | suv | 2004 | manuell | 124 | andere | 150000 | 2 | lpg | mazda | nein | 20160317 | 0 | 40878 | 20160317 |
20 | 20160305 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | 3500 | test | station wagon | 2003 | manuell | 131 | a4 | 150000 | 5 | diesel | audi | NaN | 20160305 | 0 | 53913 | 20160307 |
autos["name"] = autos["name"].str.replace("_"," ")
autos.head(20)
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot 807 160 NAVTECH ON BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 20160326 | 0 | 79588 | 20160406 |
1 | 20160404 | BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 20160404 | 0 | 71034 | 20160406 |
2 | 20160326 | Volkswagen Golf 1.6 United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 20160326 | 0 | 35394 | 20160406 |
3 | 20160312 | Smart smart fortwo coupe softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | small car | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 20160312 | 0 | 33729 | 20160315 |
4 | 20160401 | Ford Focus 1 6 Benzin TÜV neu ist sehr gepfleg... | privat | Angebot | 1350 | test | station wagon | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 20160401 | 0 | 39218 | 20160401 |
5 | 20160321 | Chrysler Grand Voyager 2.8 CRD Aut.Limited Sto... | privat | Angebot | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 20160321 | 0 | 22962 | 20160406 |
6 | 20160320 | VW Golf III GT Special Electronic Green Metall... | privat | Angebot | 300 | test | limousine | 1995 | manuell | 90 | golf | 150000 | 8 | benzin | volkswagen | NaN | 20160320 | 0 | 31535 | 20160323 |
7 | 20160316 | Golf IV 1.9 TDI 90PS | privat | Angebot | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 20160316 | 0 | 53474 | 20160407 |
8 | 20160322 | Seat Arosa | privat | Angebot | 250 | test | unknown_type | 2000 | manuell | 0 | arosa | 150000 | 10 | NaN | seat | nein | 20160322 | 0 | 7426 | 20160326 |
9 | 20160316 | Renault Megane Scenic 1.6e RT Klimaanlage | privat | Angebot | 590 | control | bus | 1997 | manuell | 90 | megane | 150000 | 7 | benzin | renault | nein | 20160316 | 0 | 15749 | 20160406 |
11 | 20160316 | Mercedes A140 Motorschaden | privat | Angebot | 350 | control | unknown_type | 2000 | NaN | 0 | NaN | 150000 | 0 | benzin | mercedes_benz | NaN | 20160316 | 0 | 17498 | 20160316 |
12 | 20160331 | Smart smart fortwo coupe softouch pure MHD Pan... | privat | Angebot | 5299 | control | small car | 2010 | automatik | 71 | fortwo | 50000 | 9 | benzin | smart | nein | 20160331 | 0 | 34590 | 20160406 |
13 | 20160323 | Audi A3 1.6 tuning | privat | Angebot | 1350 | control | limousine | 1999 | manuell | 101 | a3 | 150000 | 11 | benzin | audi | nein | 20160323 | 0 | 12043 | 20160401 |
14 | 20160323 | Renault Clio 3 Dynamique 1.2 16 V; viele Ver... | privat | Angebot | 3999 | test | small car | 2007 | manuell | 75 | clio | 150000 | 9 | benzin | renault | NaN | 20160323 | 0 | 81737 | 20160401 |
15 | 20160401 | Corvette C3 Coupe T Top Crossfire Injection | privat | Angebot | 18900 | test | coupe | 1982 | automatik | 203 | NaN | 80000 | 6 | benzin | sonstige_autos | nein | 20160401 | 0 | 61276 | 20160402 |
16 | 20160316 | Opel Vectra B Kombi | privat | Angebot | 350 | test | station wagon | 1999 | manuell | 101 | vectra | 150000 | 5 | benzin | opel | nein | 20160316 | 0 | 57299 | 20160318 |
17 | 20160329 | Volkswagen Scirocco 2 G60 | privat | Angebot | 5500 | test | coupe | 1990 | manuell | 205 | scirocco | 150000 | 6 | benzin | volkswagen | nein | 20160329 | 0 | 74821 | 20160405 |
18 | 20160326 | Verkaufen mein bmw e36 320 i touring | privat | Angebot | 300 | control | bus | 1995 | manuell | 150 | 3er | 150000 | 0 | benzin | bmw | NaN | 20160326 | 0 | 54329 | 20160402 |
19 | 20160317 | mazda tribute 2.0 mit gas und tuev neu 2018 | privat | Angebot | 4150 | control | suv | 2004 | manuell | 124 | andere | 150000 | 2 | lpg | mazda | nein | 20160317 | 0 | 40878 | 20160317 |
20 | 20160305 | Audi A4 Avant 1.9 TDI *6 Gang*AHK*Klimatronik*... | privat | Angebot | 3500 | test | station wagon | 2003 | manuell | 131 | a4 | 150000 | 5 | diesel | audi | NaN | 20160305 | 0 | 53913 | 20160307 |
autos["name"].str.split().str[0:2]
0 [Peugeot, 807] 1 [BMW, 740i] 2 [Volkswagen, Golf] 3 [Smart, smart] 4 [Ford, Focus] ... 49995 [Audi, Q5] 49996 [Opel, Astra] 49997 [Fiat, 500] 49998 [Audi, A3] 49999 [Opel, Vectra] Name: name, Length: 46642, dtype: object
models = autos["model"].unique()
print(models)
['andere' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan 'a3' 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo' 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo' 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera' 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris' 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito' 'cordoba' 'galaxy' '100' 'octavia' 'm_klasse' 'lupo' 'fiesta' 'superb' 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg' 'seicento' 'avensis' 'vivaro' 'x_reihe' 'ducato' 'carnival' 'boxster' 'signum' 'sharan' 'zafira' 'rav' 'a5' 'beetle' 'c_reihe' 'phaeton' 'i_reihe' 'sl' 'insignia' 'up' 'civic' '80' 'mx_reihe' 'omega' 'sorento' 'z_reihe' 'berlingo' 'clk' 's_max' 'kalos' 'cx_reihe' 'grand' 'swift' 'tiguan' 'sprinter' 'mii' 'viano' 'kaefer' 'almera' 'picanto' 'espace' 'scenic' 'one' 'bora' 'fox' 'leon' 'transit' 'tucson' 'tt' 'qashqai' 'mustang' 'jazz' 'nubira' 'v70' 'duster' 'stilo' 'justy' 'kangoo' 'fabia' 'matiz' 'freelander' 'lancer' 'forester' 's_type' 'bravo' '500' '156' 'cr_reihe' 'toledo' 'panda' 'a8' 'ceed' 'caddy' 'aygo' 'citigo' 'exeo' 'c1' 'doblo' 'impreza' '147' 'agila' '145' 'c3' 'roomster' 'auris' '601' 'eos' 'aveo' 'slk' 'terios' 'fusion' 'c4' 'glk' 'materia' 'v50' 'combo' 'yeti' '900' '850' 'outlander' 'rio' 'escort' 'c2' 'jimny' 'forfour' 'a2' 'spider' 'kadett' 'b_klasse' 'carisma' 'cayenne' 'r19' 'cc' 'note' 'logan' 'xc_reihe' 'verso' 'v_klasse' 'jetta' 'c5' 'q5' 'getz' 'navara' 'x_type' 'clubman' 's60' '4_reihe' 'lybra' 'cherokee' 'q3' 'q7' 'roadster' 'captiva' 'cuore' '90' 'ptcruiser' 'kuga' 'x_trail' 'ypsilon' 'pajero' 'delta' 'rx_reihe' 'modus' 'sportage' 'calibra' 'sirion' '6er' 'gl' 'amarok' 'santa' '300c' 'antara' 'accord' 'crossfire' '159' 'niva' 'galant' 'legacy' 'defender' 'range_rover' 'juke' 'spark' 'lanos' 'v60' 'alhambra' 'discovery' 'range_rover_sport' 'range_rover_evoque' 'move' 'croma' 'g_klasse' 'lodgy' 'rangerover' 'samara' 'kappa' '9000' 'charade' 'i3' '200' 'b_max']
autos["model"].value_counts().head(20)
golf 3707 andere 3370 3er 2615 polo 1609 corsa 1592 passat 1349 astra 1348 a4 1231 c_klasse 1136 5er 1132 e_klasse 958 a3 825 a6 797 focus 762 fiesta 722 transporter 674 twingo 615 2_reihe 600 fortwo 550 vectra 544 Name: model, dtype: int64
autos["model"].value_counts().head(12).sum() / autos["model"].shape[0]
0.447493675228335
top_12_models = autos["model"].value_counts().head(12)
print(top_12_models)
golf 3707 andere 3370 3er 2615 polo 1609 corsa 1592 passat 1349 astra 1348 a4 1231 c_klasse 1136 5er 1132 e_klasse 958 a3 825 Name: model, dtype: int64
Top_12_models = top_12_models[top_12_models > 837].index
print(Top_12_models)
Index(['golf', 'andere', '3er', 'polo', 'corsa', 'passat', 'astra', 'a4', 'c_klasse', '5er', 'e_klasse'], dtype='object')
autos["model_sum"] = autos.groupby('model').cumcount() + 1
autos
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | registration_year | gearbox | power_ps | ... | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | model_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot 807 160 NAVTECH ON BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | ... | 150000 | 3 | lpg | peugeot | nein | 20160326 | 0 | 79588 | 20160406 | 1 |
1 | 20160404 | BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | ... | 150000 | 6 | benzin | bmw | nein | 20160404 | 0 | 71034 | 20160406 | 1 |
2 | 20160326 | Volkswagen Golf 1.6 United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | ... | 70000 | 7 | benzin | volkswagen | nein | 20160326 | 0 | 35394 | 20160406 | 1 |
3 | 20160312 | Smart smart fortwo coupe softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | small car | 2007 | automatik | 71 | ... | 70000 | 6 | benzin | smart | nein | 20160312 | 0 | 33729 | 20160315 | 1 |
4 | 20160401 | Ford Focus 1 6 Benzin TÜV neu ist sehr gepfleg... | privat | Angebot | 1350 | test | station wagon | 2003 | manuell | 0 | ... | 150000 | 7 | benzin | ford | nein | 20160401 | 0 | 39218 | 20160401 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 20160327 | Audi Q5 3.0 TDI qu. S tr. Navi Panorama Xenon | privat | Angebot | 24900 | control | limousine | 2011 | automatik | 239 | ... | 100000 | 1 | diesel | audi | nein | 20160327 | 0 | 82131 | 20160401 | 62 |
49996 | 20160328 | Opel Astra F Cabrio Bertone Edition TÜV neu+... | privat | Angebot | 1980 | control | convertible | 1996 | manuell | 75 | ... | 150000 | 5 | benzin | opel | nein | 20160328 | 0 | 44807 | 20160402 | 1348 |
49997 | 20160402 | Fiat 500 C 1.2 Dualogic Lounge | privat | Angebot | 13200 | test | convertible | 2014 | automatik | 69 | ... | 5000 | 11 | benzin | fiat | nein | 20160402 | 0 | 73430 | 20160404 | 120 |
49998 | 20160308 | Audi A3 2.0 TDI Sportback Ambition | privat | Angebot | 22900 | control | station wagon | 2013 | manuell | 150 | ... | 40000 | 11 | diesel | audi | nein | 20160308 | 0 | 35683 | 20160405 | 825 |
49999 | 20160314 | Opel Vectra 1.6 16V | privat | Angebot | 1250 | control | limousine | 1996 | manuell | 101 | ... | 150000 | 1 | benzin | opel | nein | 20160313 | 0 | 45897 | 20160406 | 544 |
46642 rows × 21 columns
brand_model = {}
autos["model"].fillna("unknown",inplace = True)
Top_12_models = top_12_models[top_12_models > 837].index
for m in Top_12_models:
selected_rows = autos[autos["model"] == m ]
sorted_rows = selected_rows.sort_values("model_sum",ascending = False)
top_sum = sorted_rows.iloc[0]
top_brands = top_sum["brand"]
brand_model[m] = top_brands
brand_model
{'golf': 'volkswagen', 'andere': 'fiat', '3er': 'bmw', 'polo': 'volkswagen', 'corsa': 'opel', 'passat': 'volkswagen', 'astra': 'opel', 'a4': 'audi', 'c_klasse': 'mercedes_benz', '5er': 'bmw', 'e_klasse': 'mercedes_benz'}
brand_model_series = pd.Series(brand_model)
autos["date_crawled"] = autos["date_crawled"].astype(int)
autos["ad_created"] = autos["ad_created"].astype(int)
autos["last_seen"] = autos["last_seen"].astype(int)
brand_model_df = pd.DataFrame(brand_model_series, columns = ['model'])
print(brand_model_df)
model golf volkswagen andere fiat 3er bmw polo volkswagen corsa opel passat volkswagen astra opel a4 audi c_klasse mercedes_benz 5er bmw e_klasse mercedes_benz
model_count_series = pd.Series(top_12_models)
print(model_count_series)
golf 3707 andere 3370 3er 2615 polo 1609 corsa 1592 passat 1349 astra 1348 a4 1231 c_klasse 1136 5er 1132 e_klasse 958 a3 825 Name: model, dtype: int64
d = {'brand': pd.Series(brand_model), 'count': pd.Series(top_12_models)}
pd.DataFrame(data = d, index = Top_12_models)
brand | count | |
---|---|---|
golf | volkswagen | 3707 |
andere | fiat | 3370 |
3er | bmw | 2615 |
polo | volkswagen | 1609 |
corsa | opel | 1592 |
passat | volkswagen | 1349 |
astra | opel | 1348 |
a4 | audi | 1231 |
c_klasse | mercedes_benz | 1136 |
5er | bmw | 1132 |
e_klasse | mercedes_benz | 958 |
autos['odometer_km'].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 40000, 100000])
AvgPrice = {}
miles_strata = autos["odometer_km"].unique()
for m in miles_strata:
selected_rows = autos[autos["odometer_km"] == m]
mean_price = selected_rows["price_$"].mean()
AvgPrice[m] = int(mean_price)
AvgPrice_list = sorted(AvgPrice.items(), key = lambda x:x[1])
sort_AvgPrice_list = dict(AvgPrice_list)
AvgPrice_list
[(150000, 3754), (125000, 6165), (5000, 6455), (100000, 7932), (90000, 8465), (80000, 9652), (70000, 10927), (60000, 12109), (50000, 13527), (40000, 15305), (30000, 15672), (20000, 16635), (10000, 18946)]
odometer_km.value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
As a trend, (with an exception for 5,000 miles) prices tend to go up as odometer values decrease. What could be driving the cost for 5,000 km values is that these listings may be more recent and so bids are initially lower but one might expect those values to ultimately increase the longer the listing stays on the site.
odometer_500 = autos[autos["odometer_km"] == 500]
odomoter_500["ad_created"].unique()
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-106-d87f6c8df839> in <module> ----> 1 odomoter_500["ad_created"].unique() NameError: name 'odomoter_500' is not defined
odometer_500_dict = {}
odometer_500["ad_created_500_sum"] = 0
odometer_500["ad_created_500_sum"] = odometer_500.groupby('ad_created').cumcount() + 1
for rows in odometer_500:
selected_rows = odometer_500[odometer_500["odometer_km"] == rows]
sorted_rows = selected_rows.sort_values("ad_created_500_sum", ascending = True)
top_row = sorted_rows.iloc[0]
ad_list_dates = top_row["ad_created"]
autos[rows] = ad_list_dates
odometer_500_dict
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-96-23b29866c556> in <module> 7 sorted_rows = selected_rows.sort_values("ad_created_500_sum", ascending = True) 8 ----> 9 top_row = sorted_rows.iloc[0] 10 ad_list_dates = top_row["ad_created"] 11 autos[rows] = ad_list_dates /dataquest/system/env/python3/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key) 1766 1767 maybe_callable = com.apply_if_callable(key, self.obj) -> 1768 return self._getitem_axis(maybe_callable, axis=axis) 1769 1770 def _is_scalar_access(self, key: Tuple): /dataquest/system/env/python3/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis) 2136 2137 # validate the location -> 2138 self._validate_integer(key, axis) 2139 2140 return self._get_loc(key, axis=axis) /dataquest/system/env/python3/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_integer(self, key, axis) 2061 len_axis = len(self.obj._get_axis(axis)) 2062 if key >= len_axis or key < -len_axis: -> 2063 raise IndexError("single positional indexer is out-of-bounds") 2064 2065 def _getitem_tuple(self, tup: Tuple): IndexError: single positional indexer is out-of-bounds
autos
autos['unrepaired_damage'].unique()
damage_cost = {}
Unrepaired = autos["unrepaired_damage"].value_counts(dropna = True).index
for u in Unrepaired:
selected_rows = autos[autos["unrepaired_damage"] == u]
mean = selected_rows["price_$"].mean()
damage_cost[u] = int(mean)
damage_cost
means there is no unrepaired damage. This makes sense (with my limited knowledge of German I do know 'nein' means no). The average price for cars with unrepaired damage is higher than the average cost for cars WITH unrepaired damage.