#!/usr/bin/env python
# coding: utf-8

# # eBay Kleinanzeigen
#
# Guided project: cleaning and analysing a dataset of used-car listings
# scraped from eBay Kleinanzeigen (the classifieds section of the German
# eBay site), originally published on Kaggle:
# https://www.kaggle.com/piumiu/exploring-ebay-car-sales-data/comments
#
# Data dictionary:
# * dateCrawled         - when this ad was first crawled (all field values are from this date)
# * name                - name of the car
# * seller              - whether the seller is private or a dealer
# * offerType           - the type of listing
# * price               - the price on the ad to sell the car
# * abtest              - whether the listing is included in an A/B test
# * vehicleType         - the vehicle type
# * yearOfRegistration  - the year in which the car was first registered
# * gearbox             - the transmission type
# * powerPS             - the power of the car in PS
# * model               - the car model name
# * kilometer           - how many kilometers the car has driven
# * monthOfRegistration - the month in which the car was first registered
# * fuelType            - what type of fuel the car uses
# * brand               - the brand of the car
# * notRepairedDamage   - whether the car has damage which is not yet repaired
# * dateCreated         - the date on which the eBay listing was created
# * nrOfPictures        - the number of pictures in the ad
# * postalCode          - the postal code for the location of the vehicle
# * lastSeenOnline      - when the crawler last saw this ad online
#
# The aim of this project is to clean the data and analyze the used-car
# listings.  Start by importing and reading the data.

# In[81]:

import pandas as pd
import numpy as np

# Display floats in fixed-point (non-scientific) notation.
# NOTE(review): '{:f}' prints 6 decimal places, not 4 as originally claimed.
pd.set_option('float_format', '{:f}'.format)

autos = pd.read_csv('autos.csv', encoding='Latin-1')

# In[82]:

autos.info()
autos.head()

# In[83]:

print(autos.columns)
autos_test = autos.copy()

# The column titles of the autos DataFrame use camelCase.
# We will change the camelCase column titles to snake_case to make them
# more descriptive: copy the list, edit the names, and assign the result
# back to autos.columns.

# In[84]:

new_columns = [
    'date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
    'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
    'odometer', 'registration_month', 'fuel_type', 'brand',
    'unrepaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
    'last_seen',
]
autos.columns = new_columns
autos.head()

# Inspect every column.  String columns where almost all values are the
# same ('seller', 'offer_type', 'unrepaired_damage') carry little to
# analyse and may be dropped; numeric data stored as text
# ('price', 'odometer') must be converted to numbers.

# In[85]:

autos.describe(include='all')

# The count row shows several columns with fewer than 50,000 values:
# 'vehicle_type', 'gearbox', 'model', 'fuel_type' and 'unrepaired_damage'
# (to be investigated later), and a few columns with only 2 unique values.
# Check the columns that hold numeric data, possibly stored as text:
# 'price', 'registration_year', 'odometer', 'power_ps', 'registration_month'.

# In[86]:

autos['price'].head()

# 'price' is a text object; it will be converted to numeric later.

# In[87]:

autos['odometer'].head()

# 'odometer' is also a text object.

# In[88]:

autos['power_ps'].head()

# 'power_ps' is already an integer column.

# In[89]:

autos['registration_year'].head()

# # Exploring the Odometer and Price Columns
#
# Convert 'price' and 'odometer' to numeric by first removing the '$'
# sign and the 'km' suffix.
# In[90]:

# Strip the '$' sign and thousands separators, then convert to int.
# regex=False makes the intent explicit: '$' is a literal character here,
# not the end-of-string regex anchor.  (The default value of `regex` in
# Series.str.replace changed across pandas versions; with regex=True the
# literal dollar sign would never be removed and astype(int) would fail.)
autos['price'] = (
    autos['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

# In[91]:

autos['odometer'] = (
    autos['odometer']
    .str.replace('km', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

# In[92]:

# Rename 'odometer' to 'odometer_km' so the unit is part of the name.
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)

# In[93]:

autos.head()

# In[94]:

# Further exploration of the 'price' column.
autos['price'].unique().shape
autos['price'].describe()

# The minimum of $0 is unlikely, as is the maximum of $100,000,000.

# In[95]:

# Inspect the highest prices to decide where to cut off the outliers.
autos['price'].value_counts().sort_index(ascending=False).head(20)

# In[96]:

# The minimum price of $0 and the maximum of $99,999,999 are both
# implausible for an auction site, so keep only prices between $1,000
# and $350,000 inclusive.  (The original used between(999, ...), which
# kept $999 rows despite the stated intent of dropping everything below
# $1,000 -- fixed here.)
autos = autos[autos['price'].between(1000, 350000)]
autos['price'].describe()

# In[97]:

# Do the same inspection for the 'odometer_km' column.
autos['odometer_km'].unique().shape
autos['odometer_km'].describe()

# In[98]:

# Minimum is 5,000 km and maximum 150,000 km, but max, 75% and 50% are
# all 150,000 km -- check the value distribution.
autos['odometer_km'].value_counts().sort_index(ascending=False)

# In[99]:

# 150,000 km accounts for 23,678 listings -- more than half -- which is
# plausible, so no further cleaning is needed for 'odometer_km'.
# # Exploring the date columns
#
# Columns related to dates:
# * date_crawled       : added by the crawler
# * last_seen          : added by the crawler
# * date_created       : from the website
# * registration_month : from the website
# * registration_year  : from the website
#
# We analyse the distribution of 'date_crawled', 'date_created' and
# 'last_seen' (all string columns) as percentages, to see the movement of
# car selling on this eBay car-sales site.

# In[100]:

autos[['date_crawled', 'date_created', 'last_seen']][0:5]

# Each value is a date plus an hour; the first 10 characters isolate the
# date and ignore the hour.

# In[101]:

# value_counts(normalize=True) gives relative frequencies of unique values.
print(autos['date_crawled'].str[:10].unique().shape)
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

# There were 34 days of crawling, distributed almost evenly.

# In[102]:

autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

# 'last_seen' is roughly uniform except for 5-7 April 2016, where the
# frequency is 6-10x higher; the data alone cannot explain the difference.

# In[103]:

print(autos['date_created'].str[:10].unique().shape)
# normalize=True for consistency with the other two date columns -- the
# stated objective is to compare the distributions as percentages, but
# the original passed normalize=False here.
autos['date_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

# Adverts were created mainly in March and early April 2016, with a few
# as far back as June 2015.

# In[104]:

autos['registration_year'].describe()

# The registration year is when the car was first registered before use.
# Years 1000 and 9999 are clearly invalid.

# # Dealing with Incorrect Registration Year
#
# Cars were invented in the early 1900s, and a car must have been
# registered before it was advertised, so any registration year outside
# [1900, 2016] is invalid.  First check what share of rows falls outside
# that realistic range.

# In[105]:

(~autos['registration_year'].between(1900, 2016)).sum() / autos.shape[0]

# Less than 4% of rows, so we simply remove them.

# In[106]:

autos = autos[autos['registration_year'].between(1900, 2016)]
autos['registration_year'].value_counts(normalize=True).head(10)

# Most registrations happened within the 16 years before 2016.

# In[107]:

# Exploring price by brand.
autos['brand'].value_counts(normalize=True)

# The top of the list is dominated by German brands: Volkswagen (21.3%),
# BMW (12.3%), Mercedes-Benz (11.0%) and Audi (9.6%) -- together more
# than 50% of all listings.  Analyse price by brand, restricted to brands
# with at least 2% of all adverts.

# In[108]:

major_brands = autos['brand'].value_counts(normalize=True)

price_by_brand = {}
for brand, share in major_brands.items():
    if share >= 0.02:
        # int(...) truncates the mean, matching the original
        # .astype(int) behaviour.
        price_by_brand[brand] = int(autos.loc[autos['brand'] == brand, 'price'].mean())
print(price_by_brand)

# As commonly perceived, Audi (~10,134), Mercedes-Benz (~9,167) and BMW
# (~9,021) cost roughly three times as much as Renault and Peugeot (the
# cheapest brands), with Ford and Opel next cheapest and Volkswagen
# mid-priced.

# In[109]:

# Convert the dict to a Series (bmp_series), then to a DataFrame.
bmp_series = pd.Series(price_by_brand).sort_values(ascending=False)
pd.DataFrame(bmp_series, columns=['mean_price'])

# In[110]:

# Does mileage have any bearing on price?
mileage_by_brand = {}
for brand, share in major_brands.items():
    if share >= 0.02:
        mileage_by_brand[brand] = int(autos.loc[autos['brand'] == brand, 'odometer_km'].mean())
print(mileage_by_brand)

# In[111]:

mileage_by_brand_series = pd.Series(mileage_by_brand)
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df['mean_mileage'] = mileage_by_brand_series
df

# Average mileage does not differ significantly between brands, so it
# does not explain the price differences.
# # Further data cleaning
#
# ## Translating German words to English

# In[112]:

autos.head(10)

# In[113]:

# German values appear in 'vehicle_type', 'gearbox', 'model', 'fuel_type'
# and 'unrepaired_damage'; translate them to English.  Also drop
# 'seller', 'offer_type' and 'nr_of_pictures', which carry almost no
# information to analyse.
columns = [
    'vehicle_type', 'gearbox', 'fuel_type',
    'model', 'unrepaired_damage', 'brand',
]
change_words = {
    'kleinwagen': 'super mini',
    'kombi': 'station wagon',
    'cabrio': 'convertible',
    'limousine': 'sedan',
    'andere': 'other',
    'manuell': 'manual',
    'automatik': 'automatic',
    'benzin': 'gas',
    'elektro': 'electric',
    'sonstige_autos': 'other',
    'nein': 'no',
    'ja': 'yes',
}
autos[columns] = autos[columns].replace(change_words)
autos = autos.drop(['seller', 'offer_type', 'nr_of_pictures'], axis=1)
autos.head(20)

# # Further Analysis
#
# ## Finding the age of the vehicle
#
# Does the age of a vehicle affect its price?  The data was crawled in
# 2016, so 2016 minus the registration year gives the vehicle's age when
# it was put up for sale.

# In[114]:

# Store the age of each vehicle in a new 'age' column.
autos['age'] = 2016 - autos['registration_year']
autos.head(5)

# # Time lapse to sell
#
# With the age column in place, we can also measure how many days it took
# to sell each car: the difference between 'date_created' and
# 'last_seen'.  To work with dates we must convert the strings to
# datetime objects.
# In[115]:

# The first 10 characters are yyyy-mm-dd; convert them to datetime objects.
crawling = pd.to_datetime(autos['date_crawled'].str[:10])
posting = pd.to_datetime(autos['date_created'].str[:10])
sold = pd.to_datetime(autos['last_seen'].str[:10])

time_lapse = sold - posting  # how many days it took to sell
autos['time_lapse'] = time_lapse  # append as a new column
autos['time_lapse'].value_counts(bins=10).sort_index()

# More than 98% of cars sold within the first 30 days of posting.  We
# will combine time_lapse with brand, price, odometer and age; for easier
# comparison, store the day count as a plain integer in its own column.

# In[116]:

# Series.dt.days extracts the day count from the timedeltas directly --
# more robust than the original round-trip through str(...).split()[0]
# and astype(int), and it produces the same int64 values.
autos['time_lapse_days'] = autos['time_lapse'].dt.days
autos.head(5)

# Three new columns now exist: 'age', 'time_lapse' and its integer form
# 'time_lapse_days', which the following analysis uses intensively.

# # Focusing on mean values to analyse sales performance
#
# Dig deeper into how the mean brand price, odometer and age affect the
# time lapse to sell -- i.e. whether the turnaround between posting a
# listing and selling relates to pricing, mileage and vehicle age.
# In[117]:

# Mean price, odometer, age and selling time per brand.  A single
# groupby computes all four aggregates in one pass; the original loop
# re-scanned the entire frame four times per brand.
brand_groups = autos.groupby('brand')

# .astype(int) truncates toward zero, matching the per-brand int(mean)
# of the original code.
brand_mean_price_series = brand_groups['price'].mean().astype(int)
brand_mean_odometer_series = brand_groups['odometer_km'].mean().astype(int)
brand_mean_age_series = brand_groups['age'].mean().astype(int)
brand_mean_sold_time_series = brand_groups['time_lapse_days'].mean().astype(int)

# Keep the dict forms for backward compatibility with the original cell.
brand_price_dict = brand_mean_price_series.to_dict()
brand_odometer_dict = brand_mean_odometer_series.to_dict()
brand_age_dict = brand_mean_age_series.to_dict()
brand_sold_time_dict = brand_mean_sold_time_series.to_dict()

# Create a DataFrame containing all the mean values.
mean_values = pd.DataFrame(brand_mean_price_series.rename('mean_price'))
mean_values['mean_odometer'] = brand_mean_odometer_series
mean_values['mean_age'] = brand_mean_age_series
mean_values['mean_sold_time'] = brand_mean_sold_time_series
mean_values['total_count'] = autos['brand'].value_counts(normalize=True) * 100
mean_values.sort_values('total_count', ascending=False)

# mean_price, mean_odometer, mean_age and mean_sold_time are all roughly
# similar across brands, yet Volkswagen dominates the listing count
# (21%).  This is consistent with published new-car market share in
# Germany, e.g.:
# https://www.best-selling-cars.com/germany/2019-full-year-germany-best-selling-car-brands/
# # Prolonged selling time
#
# Some cars took very long to sell -- 90 days, some up to 150 days.
# Take a deeper look to find out why.

# In[118]:

autos[autos['time_lapse_days'] > 60].sort_values(
    'time_lapse_days', ascending=False
)[['brand', 'age', 'price', 'odometer_km', 'time_lapse']]

# A 2-year-old Mercedes-Benz listed at 47,900 (more than 5x the brand
# mean price) took 300 days to sell (price extreme), while a 19-year-old
# Opel listed at 1,099 (about a quarter of the mean price, almost double
# the mean age) took 68 days (age extreme).  List price and age relative
# to the brand's mean price and mean age appear to drive the time lapse
# to sell.

# # Fastest selling cars

# In[119]:

autos.sort_values('time_lapse_days')[[
    'brand', 'age', 'price', 'odometer_km', 'time_lapse',
    'date_created', 'last_seen', 'date_crawled',
]].head(20)

# These listings were gone within 0 days.  Possible reasons:
# * the listing was deleted for whatever reason
# * the price was really good and the car sold immediately
#
# We cannot detect deletions, so examine whether the price was the
# reason by comparing each list price with the brand's mean price.

# In[120]:

# Sort once outside the loop; the original re-sorted the entire frame
# twice on every iteration (40 full sorts for 20 rows).
fastest = autos.sort_values('time_lapse_days').head(20)
for _, row in fastest.iterrows():
    name_of_brand = row['brand']
    difference = mean_values.loc[name_of_brand]['mean_price'] - row['price']
    print(difference, ':', name_of_brand)

# About 70% of these rows show a list price more than 1,700 below the
# brand mean -- a strong indication that pricing is the main cause of a
# quick time lapse to sell.
# # Conclusions
#
# German eBay used-car listings have been analyzed using a dataset crawled
# in 2016. After cleaning several anomalies and inconsistencies, the
# following conclusions can be drawn:
# * The top 5 selling brands are Volkswagen, BMW, Opel, Mercedes and Audi.
# * Volkswagen tops the list, reflecting its market share for new cars in Germany.
# * Mean price can be used as a benchmark when setting a list price to
#   achieve a faster (or slower) time lapse to sell.
# * Pricing around $1,700 below the brand's mean price is associated with a
#   faster selling time.
# * Besides price, age is a major determinant of the time lapse to sell:
#   each brand has a mean age, and to sell quickly a car should not be much
#   older than its brand's mean age.