import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding='Windows-1252')
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Here we see that vehicle type, gearbox, model, fuel type, and unrepaired damage all have missing values. These will have to be handled later to produce a cleaner, more complete dataset.
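One quick way to confirm which columns have missing values is to sum the nulls per column. A minimal sketch of the pattern, using a toy DataFrame with hypothetical values in place of the full autos data:

```python
import pandas as pd

# Toy frame standing in for the autos data (hypothetical values)
df = pd.DataFrame({
    "vehicleType": ["bus", None, "coupe"],
    "gearbox": ["manuell", "automatik", None],
    "price": ["$5,000", "$8,990", "$1,200"],
})

# Count missing values per column and keep only columns that have any
missing = df.isnull().sum()
print(missing[missing > 0].to_dict())  # → {'vehicleType': 1, 'gearbox': 1}
```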
Need to change the spelling of the gearbox values to manual and automatic
Need to remove extra characters like dollar signs
Need to remove km from odometer column
Names of columns are in camelCase rather than snake_case.
#autos.head()
col_names = autos.columns
print(col_names)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
def clean_str(name):
    name = name.replace("yearOfRegistration", "registration_year")
    name = name.replace("monthOfRegistration", "registration_month")
    name = name.replace("notRepairedDamage", "unrepaired_damage")
    name = name.replace("dateCreated", "ad_created")
    name = name.replace("offerType", "offer_type")
    name = name.replace("vehicleType", "vehicle_type")
    name = name.replace("powerPS", "power_ps")
    name = name.replace("fuelType", "fuel_type")
    name = name.replace("nrOfPictures", "nr_of_pictures")
    name = name.replace("postalCode", "postal_code")
    name = name.replace("lastSeen", "last_seen")
    name = name.replace("dateCrawled", "date_crawled")
    return name
new_col_names = []
for c in autos.columns:
    clean_c = clean_str(c)
    new_col_names.append(clean_c)
autos.columns = new_col_names
print(autos.columns)
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
#autos.head()
There were no spaces in the column names to turn into underscores, so unfortunately most columns had to be renamed with explicit string replacements rather than a simple str.replace(" ", "_"). The remaining columns were already lowercase, or I would have used str.lower() as well. After creating this function, a for loop applied the changes to the original dataset, leaving cleaned column names.
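An alternative to the string-replacement function is to pass a mapping dict to DataFrame.rename, which renames only the columns listed and leaves the rest untouched. A sketch with a handful of the original column names:

```python
import pandas as pd

# Empty frame carrying a few of the original camelCase column names
df = pd.DataFrame(columns=["dateCrawled", "name", "offerType", "powerPS"])

# Map only the columns that need new names; unmapped ones pass through
rename_map = {
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "powerPS": "power_ps",
}
df = df.rename(columns=rename_map)
print(list(df.columns))  # → ['date_crawled', 'name', 'offer_type', 'power_ps']
```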
#autos.describe(include="all")
autos["nr_of_pictures"].unique()
array([0])
Some of these columns carry little information: nr_of_pictures, seller, and offer_type hold essentially the same value for every listing, and abtest doesn't seem necessary when analyzing this dataframe.
autos = autos.drop(["nr_of_pictures","seller","offer_type","abtest"], axis=1)
#autos.head()
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)
autos.rename({"price": "price_usd"}, axis=1, inplace=True)
autos["odometer"] = autos["odometer"].str.replace(",", "", regex=False).str.replace("km", "", regex=False).astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
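Chaining str.replace calls works, but the same cleanup can be done in one pass with a regex that strips every non-digit character. A sketch on toy price strings (hypothetical values):

```python
import pandas as pd

prices = pd.Series(["$5,000", "$8,990", "$0"])

# Strip every non-digit character in one pass, then cast to int
cleaned = prices.str.replace(r"[^\d]", "", regex=True).astype(int)
print(cleaned.tolist())  # → [5000, 8990, 0]
```

Note that without regex=False, "$" in str.replace would be treated as a regex anchor in recent pandas versions, which is why the chained version above passes that flag explicitly.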
#autos.describe(include="all")
In the price_usd column, the min is 0, which doesn't seem realistic; neither does the max, a used car priced in the millions. The power_ps column has a min of 0 horsepower, also unrealistic. We also need to understand what registration_month = 0 means: is it January, or missing data?
autos["price_usd"].describe(percentiles=[0.05,0.10,0.25,0.5,0.75,0.9,0.95])
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
5%       2.000000e+02
10%      5.000000e+02
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
90%      1.400000e+04
95%      1.990000e+04
max      1.000000e+08
Name: price_usd, dtype: float64
autos["price_usd"].value_counts().sort_index()
0           1421
1            156
2              3
3              1
5              2
8              1
9              1
10             7
11             2
12             3
13             2
14             1
15             2
17             3
18             1
20             4
25             5
29             1
30             7
35             1
40             6
45             4
47             1
49             4
50            49
55             2
59             1
60             9
65             5
66             1
            ...
151990         1
155000         1
163500         1
163991         1
169000         1
169999         1
175000         1
180000         1
190000         1
194000         1
197000         1
198000         1
220000         1
250000         1
259000         1
265000         1
295000         1
299000         1
345000         1
350000         1
999990         1
999999         2
1234566        1
1300000        1
3890000        1
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price_usd, Length: 2357, dtype: int64
I'm going to remove data points in the highest and lowest 5% of the data. The range is very large, and for used cars it seems that a handful of exorbitant prices is skewing the mean.
autos = autos[(autos["price_usd"].between(autos["price_usd"].quantile(0.05),autos["price_usd"].quantile(0.95)))]
autos["price_usd"].describe()
count    45207.000000
mean      4578.824076
std       4443.771737
min        200.000000
25%       1249.000000
50%       2900.000000
75%       6500.000000
max      19900.000000
Name: price_usd, dtype: float64
autos["odometer_km"].describe(percentiles=[0.05,0.10,0.25,0.5,0.75,0.9,0.95])
count     45207.000000
mean     128827.062181
std       36883.551984
min        5000.000000
5%        40000.000000
10%       70000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
90%      150000.000000
95%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
There's a huge spike of cars reading exactly 150,000 km. The most realistic interpretation is that the odometer field is capped, so 150,000 really means 150,000 km or more.
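To see how dominant that cap is, one can compute the share of listings sitting exactly at 150,000 km. A sketch with toy odometer values (hypothetical, not from the dataset):

```python
import pandas as pd

odometer_km = pd.Series([150000, 150000, 125000, 90000, 150000])

# Fraction of listings sitting exactly at the 150,000 km cap
share_at_cap = (odometer_km == 150000).mean()
print(share_at_cap)  # → 0.6
```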
autos["registration_year"].describe(percentiles=[0.05,0.10,0.25,0.5,0.75,0.9,0.95])
#autos["registration_year"].value_counts().sort_index()
count    45207.000000
mean      2004.322848
std         82.234293
min       1000.000000
5%        1993.000000
10%       1996.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
90%       2012.000000
95%       2016.000000
max       9999.000000
Name: registration_year, dtype: float64
We should remove years after 2017. The steep drop-off in car registrations in 2018 is probably due to when the data was scraped, so it seems wise to eliminate 2018 as well, since it wouldn't show a complete picture of that year. I'll also remove cars registered before 1997: realistically speaking, when shopping for a used car it's probably best not to look at anything older than 20 years (1997-2017).
autos = autos[autos["registration_year"].between(1997,2017)]
autos["registration_year"].describe()
count    39206.000000
mean      2004.771515
std          5.241971
min       1997.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       2017.000000
Name: registration_year, dtype: float64
autos["registration_month"].describe()
count    39206.000000
mean         5.897031
std          3.654661
min          0.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64
A value of 0 doesn't make sense for a month, so we'll remove those data points.
autos = autos[autos["registration_month"].between(1,12)]
autos["power_ps"].describe(percentiles=[0.05,0.10,0.25,0.5,0.75,0.9,0.95])
count    36140.000000
mean       118.685529
std        206.662289
min          0.000000
5%           0.000000
10%         54.000000
25%         75.000000
50%        109.000000
75%        147.000000
90%        184.000000
95%        224.000000
max      17700.000000
Name: power_ps, dtype: float64
Googling some facts on horsepower suggests the "normal" range to expect (depending on car size and type) is roughly 70 to 200-300 PS. I'm going to eliminate all data points below the lower quartile and above the 95th percentile.
autos = autos[autos["power_ps"].between(autos["power_ps"].quantile(0.25),autos["power_ps"].quantile(0.95))]
We need to convert the string values that currently represent the dates into a proper date representation. It would also be good to ensure the date formatting is consistent across all of these columns.
#autos["ad_created"].value_counts(normalize=True,dropna=False).sort_index()
It seems ad_created doesn't carry a meaningful time component; every timestamp reads "00:00:00", so it may be useful to drop the time portion. The other two date columns appear to share the same format.
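A sketch of that conversion with pd.to_datetime, which parses the full timestamp and then drops the time component via the .dt.date accessor (toy values in the same format as the dataset's date columns):

```python
import pandas as pd

# Toy ad_created values in the dataset's "YYYY-MM-DD HH:MM:SS" format
ad_created = pd.Series(["2016-03-26 00:00:00", "2016-04-01 00:00:00"])

# Parse to datetime, then keep only the calendar date
dates = pd.to_datetime(ad_created).dt.date
print([str(d) for d in dates])  # → ['2016-03-26', '2016-04-01']
```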
top_brands = autos["brand"].value_counts(ascending=False).head(20)
top_brands.index
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'skoda', 'seat', 'mazda', 'fiat', 'toyota', 'nissan', 'mini', 'citroen', 'hyundai', 'volvo', 'mitsubishi', 'alfa_romeo'], dtype='object')
I've chosen the top 20 most popular brands in the dataset.
top_brands_price = {}
for b in top_brands.index:
    brand = autos[autos["brand"] == b]
    mean_price = brand["price_usd"].mean()
    top_brands_price[b] = mean_price
sorted(top_brands_price.items(), key=lambda x: x[1])
[('renault', 3225.7273489932886),
 ('alfa_romeo', 3330.1487603305786),
 ('mitsubishi', 3425.072463768116),
 ('fiat', 3496.433106575964),
 ('opel', 3824.672874209417),
 ('peugeot', 3835.035591274397),
 ('ford', 4072.5909580193756),
 ('mazda', 4303.145228215768),
 ('citroen', 4445.994285714286),
 ('volvo', 5227.584192439863),
 ('seat', 5372.139959432048),
 ('toyota', 5821.07397260274),
 ('nissan', 5823.0329670329675),
 ('volkswagen', 6128.094465476835),
 ('mercedes_benz', 6228.172503725782),
 ('hyundai', 6251.53144654088),
 ('skoda', 6543.923076923077),
 ('bmw', 6808.131986106726),
 ('audi', 7024.9432565789475),
 ('mini', 9659.374647887324)]
Here we can see the mean price for the top 20 brands on the eBay car website, sorted in ascending order. Renault is the least expensive overall of the top 20 brands, while Mini, Audi, and BMW are the three most expensive.
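The same per-brand means can be computed without an explicit loop using groupby. A sketch with toy data (hypothetical values; the column names match the cleaned dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford"],
    "price_usd": [8000, 6000, 4500, 3500],
})

# Mean price per brand, sorted ascending — equivalent to the dict-building loop
mean_price = df.groupby("brand")["price_usd"].mean().sort_values()
print(mean_price.to_dict())  # → {'ford': 4000.0, 'audi': 7000.0}
```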
top6 = autos["brand"].value_counts(ascending=False).head(6)
top6_mean_price = {}
top6_mean_mile = {}
for b in top6.index:
    brand = autos[autos["brand"] == b]
    mean_price = brand["price_usd"].mean()
    mean_mile = brand["odometer_km"].mean()
    top6_mean_price[b] = mean_price
    top6_mean_mile[b] = mean_mile
top6mp_series = pd.Series(top6_mean_price)
top6mm_series = pd.Series(top6_mean_mile)
#top6mm_series
#top6mp_series
top6_df = pd.DataFrame(top6mp_series, columns=["mean_price"])
#top6_df
top6_df.insert(1,"mean_mileage",top6mm_series)
top6_df
|  | mean_price | mean_mileage |
| --- | --- | --- |
| audi | 7024.943257 | 135326.891447 |
| bmw | 6808.131986 | 136071.992422 |
| ford | 4072.590958 | 127516.146394 |
| mercedes_benz | 6228.172504 | 136132.637854 |
| opel | 3824.672874 | 128411.806044 |
| volkswagen | 6128.094465 | 129640.346133 |
top6 = autos["brand"].value_counts(ascending=False).head(6)
top6_dict = {brand : {'price': [], 'mile': []} for brand in top6.index }
for b in top6.index:
    brand = autos[autos["brand"] == b]
    mean_price = brand["price_usd"].mean()
    mean_mile = brand["odometer_km"].mean()
    top6_dict[b]['price'] = mean_price
    top6_dict[b]['mile'] = mean_mile
#top6_dict
x = pd.DataFrame.from_dict(top6_dict).T
x
|  | mile | price |
| --- | --- | --- |
| audi | 135326.891447 | 7024.943257 |
| bmw | 136071.992422 | 6808.131986 |
| ford | 127516.146394 | 4072.590958 |
| mercedes_benz | 136132.637854 | 6228.172504 |
| opel | 128411.806044 | 3824.672874 |
| volkswagen | 129640.346133 | 6128.094465 |
The hypothesis going into this was that the lower the mileage on the car, the higher the price. However, we see that the top 3 brands in mean price are also the top 3 brands in mean mileage. One might argue the confound here is that these are all luxury car brands. Yet among the other three (Volkswagen, Opel, and Ford), VW has the highest mileage but also the highest price. Only between Ford and Opel do we see the correlation we initially expected, with more miles meaning a lower price. Granted, these variables can be heavily influenced by other factors such as the age of the car, its state of repair, and its horsepower.
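One way to quantify the price-mileage relationship discussed above is a correlation coefficient. A sketch with toy values (hypothetical, not from the dataset; the real column names are price_usd and odometer_km):

```python
import pandas as pd

df = pd.DataFrame({
    "odometer_km": [150000, 125000, 90000, 60000, 30000],
    "price_usd": [2000, 3500, 5000, 8000, 12000],
})

# Pearson correlation: a negative value supports "more miles, lower price"
corr = df["price_usd"].corr(df["odometer_km"])
print(round(corr, 2))  # → -0.97
```

Running the same one-liner on the cleaned autos frame, either overall or per brand via groupby, would show how strongly mileage and price actually move together in this data.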