#!/usr/bin/env python
# coding: utf-8

# # Exploring eBay Car Sales Data: Understanding the Average Price of Cars for Each Popular Brand

# In this project I work with a dataset containing information on used car sales on eBay in Germany. I will explore the dataset, find the most popular car brands, and look for patterns relating the average price of each popular brand to the brand's other characteristics. The patterns I am interested in are the distance covered, the presence of unrepaired damage, and the average age of the cars of each popular brand. In other words, I am going to analyse whether the average price of a brand moves together with the brand's other average characteristics. This will also help to establish whether some natural expectations that buyers and/or sellers hold about used car prices can be trusted.
#
# You can download the dataset [here](https://www.kaggle.com/orgesleka/used-cars-database/data).

# ## Reading & Exploring data
#
# ### Reading data

# Pandas is the main module I use for the analysis. Besides it, I imported NumPy and created the print_2n function; the latter prints 2 empty lines before each printed object. Numbers in tables will be printed with 3 digits after the decimal point.

# In[31]:

import pandas as pd
import numpy as np

pd.set_option('display.float_format', lambda x: '%.3f' % x)

def print_2n(*args):
    for arg in args:
        print("\n"*2, arg)

autos = pd.read_csv('autos.csv', encoding="Latin-1")
autos

# As we can see from the output of the above cell, there are 50,000 rows and 20 columns in the dataset.
#
# ### Exploring data
#
# In order to see all columns, their types, and the number of non-null values in each column I use the info() method. In addition, printing the first 5 rows helps to see what kind of data we are dealing with.

# In[32]:

autos.info()
autos.head()

# Only 5 of the 20 columns are stored as integer types; most of the columns are stored as strings.
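# As a quick check (an illustrative sketch, not part of the original run), we can also compute the share of null values per column directly, which backs up the observation below:

# In[ ]:

# share of null values per column, in percent (illustrative sketch)
null_share = autos.isnull().sum() / autos.shape[0] * 100
print(null_share.sort_values(ascending=False))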
# The majority of columns have no null values, and in the minority of columns where nulls exist, their share is below 20% per column.

# ## Cleaning dataset

# ### Changing column labels

# Below I print the column names. As you can see, the column labels are in camelcase and some of them are unnecessarily long. I renamed them to snakecase and shortened some labels. Besides that, I added 'km' to the 'odometer' column label and 'usd' to the 'price' column label. This information ('km' & 'usd') is present in all values of both columns if you look closely at the data printed above, but I will delete it from the values later when I convert those columns to a numeric type. The column labels will therefore preserve the fact that prices are in US dollars and distances covered are in kilometers.

# In[33]:

print(autos.columns)
col_dict = {'dateCrawled': 'date_crawled',
            'name': 'name',
            'seller': 'seller',
            'offerType': 'offer_type',
            'price': 'price_usd',
            'abtest': 'ab_test',
            'vehicleType': 'vehicle_type',
            'yearOfRegistration': 'registration_year',
            'gearbox': 'gearbox',
            'powerPS': 'power_ps',
            'model': 'model',
            'odometer': 'odometer_km',
            'monthOfRegistration': 'registration_month',
            'fuelType': 'fuel_type',
            'brand': 'brand',
            'notRepairedDamage': 'unrepaired_damage',
            'dateCreated': 'ad_created',
            'nrOfPictures': 'nr_of_pictures',
            'postalCode': 'postal_code',
            'lastSeen': 'last_seen'}
autos.rename(columns=col_dict, inplace=True)
autos.columns

# ### Cleaning 'odometer_km', 'price_usd' & 'unrepaired_damage' columns

# As mentioned above, I delete the special characters from the 'odometer_km' and 'price_usd' columns and then convert them to numeric types. The main purpose is to use the converted kilometer and price values to calculate averages for the most frequently represented brands in the dataset.

# In[34]:

autos['odometer_km'] = autos['odometer_km'].str.replace('km', '')\
    .str.replace(',', '').astype(float)
# regex=False so that '$' is treated as a literal character, not a regex anchor
autos['price_usd'] = autos['price_usd'].str.replace('$', '', regex=False)\
    .str.replace(',', '').astype(float)
autos.head(10)

# There are 2 columns with German words as values which I need to change:
# * unrepaired_damage - whether the car has damage which is not yet repaired.
# * gearbox - the transmission type.
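# Before mapping, it is worth a quick look at the raw values (an illustrative check, not in the original run): Series.map() returns NaN for any key missing from the mapping dict, so unexpected spellings would silently disappear.

# In[ ]:

# raw German values before mapping; anything besides 'nein'/'ja' and
# 'manuell'/'automatik' (plus NaN) would be silently dropped by map()
print_2n(autos['unrepaired_damage'].value_counts(dropna=False),
         autos['gearbox'].value_counts(dropna=False))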
# I will map the German words for 'yes' and 'no' in the 'unrepaired_damage' column to 1's and 0's, and the German words for the transmission type to their English counterparts.

# In[35]:

# NaN values are left in place: map() returns NaN for keys not in the dict
damage_dict = {'nein': 0, 'ja': 1}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(damage_dict)
gear_dict = {'automatik': 'automatic', 'manuell': 'manual'}
autos['gearbox'] = autos['gearbox'].map(gear_dict)
print_2n(autos['unrepaired_damage'].value_counts(dropna=False),\
         autos['gearbox'].value_counts(dropna=False))

# I have already cleaned the price column of special characters, moved the information stored in the special character into the column label, and converted the column to a numeric type. There are also no null values, as shown in the [exploring data](#exploring_data) section. But before we can use this column we should check for outliers and unrealistic data. Since we are analysing the used car market, very expensive cars seem unrelated to the topic. Very high prices can come from rare old models, mistakes in the dataset, almost new cars and so on; in any case these prices cannot be corrected on an objective basis at this step. There are also suspiciously cheap cars in the dataset. I will check the descriptive statistics and the top 10 and bottom 10 values to better understand the extreme prices.

# In[36]:

desc_stat_price = autos['price_usd'].describe()
print(desc_stat_price)
top_10_price = autos['price_usd'].value_counts().sort_index(ascending=False).head(10)
bottom_10_price = autos['price_usd'].value_counts().sort_index(ascending=False).tail(10)
print_2n(top_10_price, bottom_10_price)

# The descriptive statistics in the output above show that the maximum and minimum prices are unexpectedly high (almost 10 million USD) and low (0 USD), respectively. The bottom 10 prices are below 15 dollars and the top 10 are between 100k and 10 million dollars. Therefore I should set boundaries, and a range between 100 USD and 100,000 USD (exclusive) seems sensible. Below I count the cars that fall outside this range.

# In[37]:

unreal_prices = autos[(autos['price_usd'] >= 100000) |\
                      (autos['price_usd'] <= 100)].shape[0]
print('\n'*2, '# of rows with unrealistic prices (100 or less, 100k or more):'\
      , unreal_prices)

# In the cell below I delete those 1949 rows in order to eliminate the effect of price outliers, then check the descriptive statistics again.

# In[38]:

# keep only prices strictly between 100 and 100,000 USD, matching the
# count of unrealistic rows above
autos = autos[autos['price_usd'].between(100, 100000, inclusive='neither')]
desc_stat_price = autos['price_usd'].describe()
print_2n(desc_stat_price)

# I use the same methods as above to check for outliers in the 'odometer_km' column, which holds the distance covered by each car. There are also no null values in this column (see [exploring data](#exploring_data)).

# In[39]:

# checking descriptive statistics and unrealistically high and low
# covered distances
desc_stat_odo = autos['odometer_km'].describe()
print_2n(desc_stat_odo)
top_10_odo = autos['odometer_km'].value_counts().sort_index(ascending=False).head(10)
bottom_10_odo = autos['odometer_km'].value_counts().sort_index(ascending=False).tail(10)
print_2n(top_10_odo, bottom_10_odo)

# The minimum (5,000 km) and maximum (150,000 km) values in 'odometer_km' make sense. The top 10 and bottom 10 values also look realistic and usable for the later aggregation of averages.
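# The odometer readings look like they are binned into a small set of rounded values rather than exact readings; a quick sketch (not part of the original run) makes this visible:

# In[ ]:

# number of distinct odometer values and the values themselves (illustrative)
print(autos['odometer_km'].nunique())
print(sorted(autos['odometer_km'].unique()))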
# ### Understanding and cleaning date columns

# There are 5 columns containing date and/or time data:
# * registration_month - the month in which the car was first registered.
# * registration_year - the year in which the car was first registered.
# * date_crawled - when this ad was first crawled.
# * ad_created - the date on which the eBay listing was created.
# * last_seen - when the crawler last saw this ad online.
#
# For a better understanding, the output of the cell below shows the data types and descriptive statistics of all 5 columns.

# In[40]:

date_cols = ['registration_month', 'registration_year', 'date_crawled',\
             'ad_created', 'last_seen']
print(autos[date_cols].dtypes)
autos[date_cols].describe(include='all')

# From the output above it is visible that the 'registration_month' and 'registration_year' columns are stored as numeric types, while the other 3 columns are strings. To see the format of the values in all 5 date columns I separate the column labels into 2 lists by data type, then print the top 5 rows of both to see how the date-related information is stored.

# In[41]:

str_date_cols = ['date_crawled', 'ad_created', 'last_seen']
int_date_cols = ['registration_year', 'registration_month']
print_2n(autos[str_date_cols].head(), autos[int_date_cols].head())

# As we can see, the 3 string columns contain both dates and times; later, dates or times can be extracted from them for analysis. The other 2 date columns represent the registration year and month as numbers and can therefore be used for further analysis without any change of data type.

# Now that we have seen what the date columns store, we can pay closer attention to the 'last_seen', 'registration_year' & 'registration_month' columns. There are 2 reasons to look closer at them:
# * the presence of very old cars, which can be outliers that distort patterns in the analysis;
# * the presence of "impossibly new" cars: rows where the registration year is later than the corresponding year in the 'last_seen' column, which is impossible.
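# Before choosing limits, a rough count of how many rows each issue affects helps to gauge the potential data loss; a sketch (counts depend on the data, and 2016 is only a provisional upper bound that is justified below):

# In[ ]:

# rough sanity counts for the two issues above (illustrative sketch)
print('registered after 2016:', (autos['registration_year'] > 2016).sum())
print('registered before 1950:', (autos['registration_year'] < 1950).sum())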
# To clean up these 2 categories, first I check the distribution of the 'last_seen' column. This helps to establish an approximate upper limit for a realistic registration year. I also check the distributions of the other 2 string date columns ('date_crawled', 'ad_created').

# In[42]:

for col in str_date_cols:
    date = autos[col].str.split().str[0]
    month_year = date.str.split('.').str[1] + "-" +\
                 date.str.split('.').str[2]
    autos[f'{col}_month_year'] = month_year
    month_year_dist = month_year.value_counts(normalize=True, dropna=False)\
        .sort_values(ascending=False)*100
    print_2n(f"Distribution for '{col}_month_year' column:", month_year_dist)

# The output presents the dates in a 'month number-2 digit year' format (e.g. 4-16 is April 2016). The distributions of the 'last_seen' and 'date_crawled' columns contain only 2 values: 3-16 (March 2016) and 4-16 (April 2016). The distribution of the 'ad_created' column likewise shows no date later than 4-16. These facts tell us that the dataset reflects the used car market on eBay in Germany as of March - April 2016.

# The observations above make it clear that a registration year later than 2016 is an error and cannot be realistic. Moreover, rows where the registration year is 2016 and the registration month is not earlier than the month in 'last_seen' contain an impossible date and should be deleted.

# In[43]:

autos['last_seen_month'] = autos['last_seen'].str.split().str[0]\
    .str.split(".").str[1].astype(int)
autos = autos[~((autos['registration_year'] == 2016) &\
                (autos['registration_month'] >= autos['last_seen_month']))]
autos

# For the lower limit, let's first consider 1950. Now I need to see how many cars were registered before that year and decide whether the number is insignificant enough to delete the corresponding rows.

# In[44]:

too_old_cars = autos[autos['registration_year'] < 1950].shape[0]
print(f'# of too old cars = {too_old_cars}')

# As only 22 cars were registered before 1950, removing them will not cause a significant loss of data.

# In[45]:

# keeping cars registered from 1950 through 2016 and saving the cleaned data
autos = autos[autos['registration_year'].between(1950, 2016)]
autos.to_csv('autos_cleaned.csv', index=False)

# As a result of all the operations above I obtained cleaned data and saved it to a new csv file ('autos_cleaned.csv'). I will use the cleaned data to analyse the used car market on German eBay based on the most common brands and their average mileage, percentage of unrepaired damage, gearbox type and average age. As mentioned before, I will look for similar patterns between the average price and other characteristics of the popular brands.

# ## Analysis

# ### Creating new dataset with popular brands only

# First of all, in order to analyse the most frequent brands in our dataset we need detailed information on brands. Then I will create a new dataset holding only the popular brands and aggregated values for them.

# In the cell below I check the unique brands, the descriptive statistics of the 'brand' column, and the distribution of brands in percent, in descending order.

# In[46]:

unique_brands = autos['brand'].unique()
stat_brands = autos['brand'].describe()
brands_dist = autos['brand'].value_counts(normalize=True, dropna=False)*100
print_2n(unique_brands, stat_brands, brands_dist)

# There are 40 unique brands in the dataset, with Volkswagen being the most frequent. All of them appear in the list of unique brands, but I plan to analyse patterns in popular brands only.
# Therefore, brands with a very low number of cars in the dataset will not be included in the new dataset I am about to create. I set the inclusion threshold at 1 percent: if a brand's share in the distribution above is below 1%, it is left out. From the distribution we can see that the last brand that qualifies is 'hyundai' and the first one that does not is 'sonstige_autos' (German for "other cars", i.e. a catch-all category rather than a real brand).
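# As a quick illustrative check (not in the original run), we can count how many of the 40 brands clear the 1% threshold:

# In[ ]:

# number of brands at or above the 1% share threshold (illustrative)
print('brands at or above 1%:', (brands_dist >= 1).sum())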
# The next step is to create the new dataset with popular brands only. Each brand will be represented by one row, with the brand names as index values. I also create the first column, containing the average price for each brand.

# In[47]:

# selecting brands which cover 1 percent or more of the total share
cmmn_brands = brands_dist[brands_dist >= 1]

# aggregating the avg price for each common brand:
# - adding brand as key & avg price as value to a dict
# - creating a series from the dict: keys become the index,
#   values the column, sorted in descending order
# - creating a df from the series
brands_price_dict = {}
for brand in cmmn_brands.index:
    avg_price = autos.loc[autos['brand'] == brand, 'price_usd'].mean()
    brands_price_dict[brand] = avg_price
cmmn_brands_avg_price = pd.Series(brands_price_dict).sort_values(ascending=False)
cmmn_brands_df = pd.DataFrame(cmmn_brands_avg_price, columns=['avg_price'])
cmmn_brands_df.index.name = 'brand'
cmmn_brands_df

# Now we have a new dataset called 'cmmn_brands_df'. In the next 6 cells I add the remaining columns to it.

# ### Adding new columns to the 'cmmn_brands_df' dataset

# Adding the 'avg_km_covered' column:
# Values in this column represent the average distance in kilometers covered by the cars of each brand.

# In[48]:

# aggregating the avg covered distance for each brand in the
# common brands df and adding it as a series to the same df
brands_avg_km = {}
for brand in cmmn_brands_df.index:
    avg_mlg = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
    brands_avg_km[brand] = avg_mlg
cmmn_brands_df['avg_km_covered'] = pd.Series(brands_avg_km)
cmmn_brands_df

# Adding the 'damage_info_%' column:
# Values in this column represent the percentage of cars of each brand for which there is information on whether the car has unrepaired damage or not.
# In the next cells I will also add the percentage of cars with unrepaired damage and the percentage of healthy cars. But as we learned [here](#exploring_data), there are null values in the corresponding column, so it is crucial to check that the majority of cars carry information about the presence of unrepaired damage.

# In[49]:

damage_info_dict = {}
for brand in cmmn_brands_df.index:
    count_brands = autos[autos['brand'] == brand].shape[0]
    count_na = autos.loc[(autos['brand'] == brand) &\
                         (autos['unrepaired_damage'].isnull())].shape[0]
    share_dmg = 100 - (count_na / count_brands) * 100
    damage_info_dict[brand] = share_dmg
cmmn_brands_df['damage_info_%'] = pd.Series(damage_info_dict)
cmmn_brands_df

# The output of the above cell shows that for every brand more than 75% of the cars carry information about the presence of unrepaired damage. As a significant majority of cars have this information, we can rely on an analysis based on the percentage of healthy cars or cars with unrepaired damage.
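# As an aside (a sketch, not used in this notebook): the loop-based per-brand aggregations above and below could also be produced in one pass with groupby. For the mean-based columns the result should match; the percentage columns built from value_counts would need extra care over NaN handling.

# In[ ]:

# compact groupby equivalent for the mean-based columns (illustrative)
popular = autos[autos['brand'].isin(cmmn_brands_df.index)]
print(popular.groupby('brand')[['price_usd', 'odometer_km']].mean())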
# Adding the 'unrep_damage_%' column:

# Values in this column represent the percentage of cars of each brand that have unrepaired damage.

# Adding the 'healthy_car_%' column:

# Values in this column represent the percentage of cars with no unrepaired damage for each brand.

# In[50]:

unrep_damage_dict = {}
for brand in cmmn_brands_df.index:
    damaged_share = autos.loc[(autos['brand'] == brand) &\
                              (autos['unrepaired_damage'].notnull()),
                              'unrepaired_damage'].mean() * 100
    unrep_damage_dict[brand] = damaged_share
cmmn_brands_df['unrep_damage_%'] = pd.Series(unrep_damage_dict)
cmmn_brands_df['healthy_car_%'] = cmmn_brands_df['damage_info_%']\
    - cmmn_brands_df['unrep_damage_%']
cmmn_brands_df

# Adding the 'automatic_%' column:
# Values in this column represent the percentage of cars with an automatic gearbox for each brand.

# Adding the 'manual_%' column:

# Values in this column represent the percentage of cars with a manual gearbox for each brand.

# In[51]:

automatic_dict = {}
manual_dict = {}
for brand in cmmn_brands_df.index:
    gearbox_dist = autos.loc[autos['brand'] == brand, 'gearbox']\
        .value_counts(normalize=True, dropna=False) * 100
    # .get() avoids a KeyError for a brand with no cars of a given gearbox type
    automatic_dict[brand] = gearbox_dist.get('automatic', 0)
    manual_dict[brand] = gearbox_dist.get('manual', 0)
cmmn_brands_df['automatic_%'] = pd.Series(automatic_dict)
cmmn_brands_df['manual_%'] = pd.Series(manual_dict)
cmmn_brands_df

# Adding the 'vehicle_count' column:
# Values in this column represent the number of cars of each brand.

# In[52]:

count_dict = {}
for brand in cmmn_brands_df.index:
    vehicle_count = autos[autos['brand'] == brand].shape[0]
    count_dict[brand] = vehicle_count
cmmn_brands_df.insert(0, 'vehicle_count', pd.Series(count_dict))
cmmn_brands_df

# Adding the 'avg_age' column:
# Values in this column represent the average age in years of the cars of each brand.
# In[53]:

# as there are '0' values in the registration month column, I replace them with 1
reg_month_0to1 = autos['registration_month'].replace(0, 1).astype(str)
# combining values from the registration month and registration year columns
m_y_reg = reg_month_0to1 + '-' + autos['registration_year'].astype(str)
# converting the combination to datetime values
date_m_y_reg = pd.to_datetime(m_y_reg, format='%m-%Y')
# finding the latest date in the last_seen column
latest_date = pd.to_datetime(autos['last_seen'].str.split().str[0],
                             format='%d.%m.%y').max()
# finding the age of each car in days
diff = latest_date - date_m_y_reg
# converting the age from days to years and adding it to the autos dataset
autos['vehicle_age'] = diff.dt.days / 365
# finding the average age of cars by brand and adding it to cmmn_brands_df
age_dict = {}
for brand in cmmn_brands_df.index:
    avg = autos.loc[autos['brand'] == brand, 'vehicle_age'].mean()
    age_dict[brand] = avg
cmmn_brands_df.insert(1, 'avg_age', pd.Series(age_dict))
#cmmn_brands_df.to_csv('autos_brands.csv')
cmmn_brands_df

# Now the 'cmmn_brands_df' dataset is ready for analysis. I will use it to check whether there are similar patterns and trends between the average price and the data in the other columns.

# ### Trends & Patterns

# At this step of my training I will not use statistical methods, as they have not been covered in this course yet, although I have some knowledge from university. The techniques used below are simple, but the logic behind them needs some explanation. First, I create a function named 'compare_around_mean' to check for a pattern between 2 columns. Then I use it several times to look for patterns between the 'avg_price' column and 4 other columns.

# In[54]:

def compare_around_mean(a_df, x_col, y_col):
    print(f"'{x_col}' column values in descending order:\n"
          f"{a_df[x_col].sort_values(ascending=False)}")
    x_mean = a_df[x_col].mean()
    bottomx_y_avg = a_df.loc[a_df[x_col] < x_mean, y_col].mean()
    topx_y_avg = a_df.loc[a_df[x_col] >= x_mean, y_col].mean()
    print_2n(f"Mean of '{y_col}' for below-average values in '{x_col}': {bottomx_y_avg}")
    print_2n(f"Mean of '{y_col}' for above-average values in '{x_col}': {topx_y_avg}")
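# As a forward-looking aside (a sketch that goes beyond the simple comparisons used here): pandas can also quantify these relationships directly with a Pearson correlation, which may be worth revisiting once the statistics material is covered.

# In[ ]:

# correlation of every aggregated column with avg_price (illustrative only;
# not used in the analysis below)
print(cmmn_brands_df.corr()['avg_price'].sort_values())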
# I will explain what the function does, and the logic behind it, using the 'avg_price' and 'manual_%' columns as an example. The 'manual_%' column represents the share of cars with a manual gearbox for each brand:
#
# 1. The function takes 3 arguments: 'a_df' (the dataset), 'x_col' (the column which we think has an impact on 'y_col'), and 'y_col' (the column which we think is affected by 'x_col').
# 2. Once called, the function prints the values of 'x_col' ('manual_%' in this example) in descending order so that we can spot any anomaly in the data.
# 3. Then it finds the average value of the 'manual_%' column.
# 4. Now we can imagine the values of 'manual_%' divided into 2 groups: values below average and values above average.
# 5. Next, the function calculates the average of 'y_col' ('avg_price' in this example) separately for the below-average and above-average groups of 'manual_%'.
# 6. Finally, the function prints the result.
# 7. If there is a significant difference between the average price for the below-average and above-average groups of 'manual_%', we can say that there is a pattern between the 2 columns, although this does not imply correlation.

# In the cell below we can see that the average price for brands with a below-average share of manual cars (7478.61 USD) is much higher than for brands with an above-average share (4245.74 USD). This pattern confirms the generally known fact that a car with a manual gearbox is usually cheaper. There is one concern with this conclusion in our case: the share (and number) of cars with an automatic gearbox is very low relative to cars with a manual gearbox, so the difference between the average prices could change significantly once more cars with automatic gearboxes are added to the data.

# In[55]:

compare_around_mean(cmmn_brands_df, 'manual_%', 'avg_price')

# Before calling the function on 'cmmn_brands_df' below, I added a new column to the dataset: 'damage/age'. This column represents the ratio of the share of cars with unrepaired damage to the average age of the cars of each brand, i.e. roughly which share of a brand's cars picks up unrepaired damage each year. Ideally, the higher the ratio, the worse the condition of the cars should be, and the lower the price compared to brands with a lower ratio. As we can see from the output of the cell below, the average price for brands with a below-average ratio is higher by approximately 60 percent.

# In[56]:

cmmn_brands_df['damage/age'] = cmmn_brands_df['unrep_damage_%'] / cmmn_brands_df['avg_age']
compare_around_mean(cmmn_brands_df, 'damage/age', 'avg_price')

# Another column to check for a relation between price and damage is 'healthy_car_%'. The cell below confirms the finding of the cell above: the higher the share of healthy (no unrepaired damage) cars, the higher the average price.

# In[57]:

compare_around_mean(cmmn_brands_df, 'healthy_car_%', 'avg_price')

# The last column I check against 'avg_price' is 'avg_km_covered'. The natural expectation is: the higher the distance covered, the lower the price. The output shows that the average price for brands with an above-average covered distance is lower by ~790 USD. This confirms the natural expectation, but the difference was larger in the previous cells, so it indicates a weaker association between covered distance and price.

# In[58]:

compare_around_mean(cmmn_brands_df, 'avg_km_covered', 'avg_price')

# ## Conclusion
#
# Even before looking at any data, anybody planning to set a price for a used car has some expectation of what affects the price and how. Likewise, if you want to buy one, you probably have a price in mind depending on the condition of the car. In this project I analysed the average prices of popular brands.
# I found supporting indications for 4 of our natural expectations about the prices of cars belonging to a popular brand in the [trends & patterns](#trend_pattern) section.
#
# Basically, the average price of a car belonging to a certain brand is higher if the cars of that brand:
# * have a below-average share of manual-gearbox cars
# * pick up less unrepaired damage every year
# * have a high share of cars with no unrepaired damage at the point of sale
# * covered less distance on average relative to other cars
#
# #### Notes
# * The method I used here is simple and does not involve proper statistical techniques. The reason is that at this stage of my training I tried to use only concepts already covered in the course. Therefore the method cannot be sufficient on its own to draw definitive conclusions, but it can serve as supporting evidence.

# * There are some parts of the code in this project that could be deleted, for example new columns added to the original dataset that I did not use later. I decided to leave those parts in. The reason is that I am in my training phase: I want to type as much code as possible to memorize the concepts and syntax better, and leaving it in place makes it easy to find what I learned in case I need to come back and refresh my knowledge.