#!/usr/bin/env python
# coding: utf-8

# # Investigating Ebay Used Car Sales
# 
# EBay is an American e-commerce websites where users can place their wares on sale. EBay car sales include sales from individual sellers and car dealers. Thus, investigating the car sales that take place on eBay will help shed light on consumer behaviour and demand.
# 
# The original dataset was uploaded on Kaggle but can now be found on [data.world](https://data.world/data-society/used-cars-data). The data comprises of 50,000 points and the following columns:
# 
# |Column Name|Details|
# | :-------: | :---: |
# |dateCrawled|When this ad was first crawled. All field-values are taken from this date|
# |name|Name of the car|
# |seller|Whether the seller is private or a dealer|
# |offerType|The type of listing|
# |price|The price on the ad to sell the car|
# |abtest|Whether the listing is included in an A/B test|
# |vehicleType|The vehicle Type|
# |yearOfRegistration|The year in which the car was first registered|
# |gearbox|The transmission type|
# |powerPS|The power of the car in PS|
# |model|The car model name|
# |kilometer|How many kilometers the car has driven|
# |monthOfRegistration|The month in which the car was first registered|
# |fuelType|What type of fuel the car uses|
# |brand|The brand of the car|
# |notRepairedDamage|If the car has a damage which is not yet repaired|
# |dateCreated|The date on which the eBay listing was created|
# |nrOfPictures|The number of pictures in the ad|
# |postalCode|The postal code for the location of the vehicle|
# |lastSeenOnline|When the crawler saw this ad last online|

# In[1]:


#import numpy and pandas
import numpy as np
import pandas as pd

#read csv file
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')


# In[2]:


#initial data exploration
autos.info()


# In[3]:


autos.head()


# In[4]:


autos.describe(include = 'all')


# In[5]:


autos.dtypes


# ## Preliminary Observations
# 
# At a first glance, it can be observed that the column names are in camelcase instead of snakecase. This makes analysis and readability a bit more difficult. Thus, all column names will be transformed to snake case. Some of the names, (like *yearOfRegistration*) are too long and will be shortened (to *registration_year*) for easy readability.
# 
# Additionally, there are some columns like *price* and *odometer* which have extra characters that render the dtype of the column as object. However, after those extra characters are removed, the column dtypes can be changed to numeric ones. TThe date columns *ad_created*, *last_seen* and *date_crawled* are identified as strings and will need to be converted to numeric representation for easy analysis.
# 
# Finally, there are null values in the *vehicleType*, *gearbox*, *model*, *fuelType*, *notRepairedDamage* columns which will need to be appropriately dealt with.
# 
# 

# In[6]:


old_column_names = autos.columns

# create a list for new column names
new_column_names = []
changed_names = {
    'yearOfRegistration' : 'registration_year',
    'monthOfRegistration' : 'registration_month',
    'notRepairedDamage' : 'unrepaired_damage',
    'dateCreated' : 'ad_created',
    'powerPS' : 'power_ps',
    'nrOfPictures': 'pictures_number'
}

# transform old column names add them to new list
for column in old_column_names:
    if column in changed_names:
        column = changed_names[column]
    column_l = list(column)
    column_l2 = column_l[:]
    for i in range(len(column_l2)):
        if column_l2[i].isupper():
            column_l.insert (i, '_')
    column = ''.join(column_l)
    column = column.lower()
    new_column_names.append(column)

# assign new names to columns
autos.columns = new_column_names
    
autos.head()


# ## Inital Exploration and Cleaning
# 
# As mentioned previously, the *odometer* and *price* columns need to be converted to numeric data types. However, to ensure that the right operations are being conducted the unique values of each will be analysed.

# In[7]:


print(autos['price'].unique())
print(autos['odometer'].unique())


# It can be observed that the non-numeric characters for the *price* column are '$' and ',' and that of the *odometer column are ',' and 'km'. The odometer column can be converted to integer values. However, the price column will be converted into a float, since price is sometimes quoted with decimals.

# In[8]:


#remove all non-numeric characters for odometer and price columns

autos['price'] = autos['price'].str.replace('$', '')
autos['price'] = autos['price'].str.replace(',', '')
autos['price'] = autos['price'].astype(float)

autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)

autos.head()


# ### Further Exploration of the *odometer_km* and *price* Columns
# 
# These columns will be further analysed for any unrealistic minimum/maxiumum values and those outliers will be removed.

# In[9]:


# explore odometer_km column values
print(autos['odometer_km'].describe())
print('\n')
print(autos['odometer_km'].value_counts())


# In[10]:


# explore price column values
print("Summary of values in price column")
print(autos['price'].describe())
print('\n')
print('Most common values')
print(autos['price'].value_counts().head())
print('\n')
print('Least common values')
print(autos['price'].value_counts().tail())
print('\n')
print('Cars more expensive than $4 million')
print(autos.loc[(autos['price'] > 4000000), 'price'])
print('\n')


# Looking at the column summary for the price column, both the minimum and maximum values are a bit unrealistic. While there are some cars that are sold for extremely low prices, it is highly unlikely that a car was sold for nothing. On the other hand, the [most expensive vehicle](https://www.firstvehicleleasing.co.uk/blog/1974/top-10-most-expensive-cars-ever-sold-on-ebay/) sold in ebay up until 2012 was \$3.3million and [in 2016](https://www.ebay.com/motors/blog/most-expensive-cars-ebay-2016/)  no car was sold on ebay over \$1 million. Thus, although there is little to no information for the years 2013-2015, it can perhaps be estimated that no car was sold above \$4million dollars.
# 
# Thus, cars that are less than \$50, and those more than \$4 million will be removed.

# In[11]:


# remove outliers in price column
outlier_bool = autos[~(autos['price'].between(50,4000000))]
outlier_indices = outlier_bool.index
autos = autos.drop(outlier_indices)
print(autos['price'].describe())


# After removing the outliers the mean reduced by 37.9%. Additionally the max price reduced from \$100 million to \$3.6 million which is more in line with the data we have concerning expensive car sales on eBay.

# ## Exploring the Date Columns
# 
# There are 5 columns that have date values: *date crawled*, *last seen*, *ad_created*, *registration_month* and *registratin_year*. Of the 5, *date_crawled* and *last_seen* and *ad_created* are stored as strings. However, in order to analyse them, they need to be converted to numeric representation.
# 

# ### *date_crawled*  Column Exploration

# In[12]:


# create percentage distribution for dates in date_crawled
date_crawled_percents = autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False)
date_crawled_percents.sort_index(inplace = True)
print(date_crawled_percents)


# From the data above it can be seen that the web scraper was active for one month from March to April 2016. With the maxiumum amount of scraping done on 3rd April 2016.

# ### *last_seen* Column Exploration

# In[13]:


# create percentage distribution for dates in last_seen
last_seen_percents = autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False)
last_seen_percents.sort_index(inplace = True)
print(last_seen_percents)


# The last seen dates mirror those of the web scraper activity which makes sense, since the web scraper could only judge the last seen dates of ads when it was active. 

# ### *ad_created* Column Exploration

# In[14]:


# create percentage distribution for dates in ad_created
ad_created_percents = autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False)
ad_created_percents.sort_index(inplace = True)
print(ad_created_percents)


# The earliest car ad that was picked up by the scraper was in 2015. This is in line with the scraper's active times, since most cars posted before June 2015 would probably have been sold by the March-April 2016.

# ### *registration_year* Column Exploration

# In[15]:


autos['registration_year'].describe()
autos['registration_year'].unique()


# The minimum and maximum years are wildly unrealistic. They year 9999 is far into the future and by 1000 cars had not even been invented. The first car to be registered in the United States was in [1901](https://www.thoughtco.com/1903-the-first-license-plates-us-1779187), thus this will be taken as the minimum year, and since the year the web scraper was active was 2016, any years after that will be discarded.

# In[16]:


# remove outliers in registration_year column
autos = autos[(autos['registration_year'].between(1901, 2016))]
autos['registration_year'].unique()


# In[17]:


#calculate percentage distribution of values in registration_year
autos['registration_year'].value_counts(normalize = True)


# The most popular registration year, for car sold on eBay was 2000. Cars registered in 2000 comprised about 6.7% while those in 2005 closely followed with 6.2%. Clearly 10-16 year old cars sell the best on eBay. This is reinforced by the fact that the top 5 years ranged from 1999 to 2005. On the other hand, vintage cars (50+ years) sold the least, with the last ten spots being occupied by cars registered before 1953.

# ## Analysing Price by Brand
# 
# It would be interesting to figure out which brands sell the most expensive cars and if that aligns with the traditional views held on the prestige of those brands. 

# In[26]:


import operator

#calculate 6 most common brands
common_brands = list(autos['brand'].value_counts()[:6].index)

#function to calculate mean for each brand
def mean_calculator(brand_list, column):
    '''This function takes in a list of brands and a relevant common name(as a string)
    and calculates by brand, the mean of the values pertaining to it in the specified
    column. It returns a dictionary which maps each brand to the relevant mean'''
    
    brand_mean = {}

    for br in brand_list:
        brand_mean[br] = autos.loc[autos['brand'] == br, column].mean()
        
    return brand_mean

#calculate mean price for each brand
brand_mean_price = mean_calculator(common_brands, 'price')
print(brand_mean_price)


# As expected, Audi, Mercedes Benz & BMW have the most expesive cars. This suggests that e-commerce sales mirror real-life prestige enjoyed by those brands. However, in the top 6 most common brands there's a lost of variation in mean price. The mean price of an opel is 68% lower than that of an audi (most expensive). However, even compared to ford which is the the 2nd least expensive common car, the price of an opel is 38% lower. In order to investigate the variance in price, the average mileage of each of these brands will be investigated in order to see if this is the main factor affecting the price variability. 

# In[33]:


#calculate mean mileage for each brand
brand_mean_mileage = mean_calculator(common_brands, 'odometer_km')

#convert both dictionaries to series objects
bmp_series = pd.Series(brand_mean_price)
bmm_series = pd.Series(brand_mean_mileage)

price_mileage = pd.DataFrame(bmp_series, columns=['mean_price'])
price_mileage['mean_mileage'] = bmm_series

print(price_mileage)


# There seems to be no clear pattern between mean price and mean mileage this indicates that the most expensive cars sell for more money due to some other factors, such as perceived quality or durability.