#!/usr/bin/env python
# coding: utf-8

# # Exploring eBay Car Sale Datasets
# 
# In this project, we will work on a Car sale Dataset scraped by Kaggle user [orgesleka](https://www.kaggle.com/orgesleka). The aim of this project is to clean the dataset and analyze the car sale listing to extract useful information for car buyers.

# ## Data Introduction and Imports
# 
# This dataset is 50,000 data points sampled from the original datasets. The structure (data fields) of the dataset should be as followed:
# 
# - dateCrawled - When this ad was first crawled. All field-values are taken from this date.
# - name - Name of the car.
# - seller - Whether the seller is private or a dealer.
# - offerType - The type of listing
# - price - The price on the ad to sell the car.
# - abtest - Whether the listing is included in an A/B test.
# - vehicleType - The vehicle Type.
# - yearOfRegistration - The year in which the car was first registered.
# - gearbox - The transmission type.
# - powerPS - The power of the car in PS.
# - model - The car model name.
# - kilometer - How many kilometers the car has driven.
# - monthOfRegistration - The month in which the car was first registered.
# - fuelType - What type of fuel the car uses.
# - brand - The brand of the car.
# - notRepairedDamage - If the car has a damage which is not yet repaired.
# - dateCreated - The date on which the eBay listing was created.
# - nrOfPictures - The number of pictures in the ad.
# - postalCode - The postal code for the location of the vehicle.
# - lastSeenOnline - When the crawler saw this ad last online.
# 
# First, we import the dataset to the notebook, looking at the info and the first few data points from the datasets

# In[1]:


import pandas as pd
import numpy as np

df = pd.read_csv('autos.csv', encoding='Latin-1')
df.head(10)


# In[2]:


df.info()


# There is a few observations from the information of the first few dataset. These can be helpful guidance for further exploration and cleaning of the datasets
# 
# - The column name is written in snake-case, not head-case as python convention
# - Most of the datasets is string object, but can be converted into other data type to be easier to work with. For example, (dateCrawled, dateCreated, lastSeen) to datetime
# - The datasets contains 371528 data entries, but there are some columns (e.g model, powerPS, ...) contains missing values.
# - The name column can be split and explored further to yield meaningful data towards our analysis.

# ## Data Cleaning
# 
# ### Examine dataset
# 
# First, we need to address the issue with the column names. We need to change the column names to snake-case for easier data exploration with Python

# In[3]:


df = df.rename({'dateCrawled': 'date_crawled', 
                        'offerType': 'offer_type',
                        'vehicleType': 'vehicle_type',
                        'yearOfRegistration': 'registration_year',
                        'monthOfRegistration': 'registration_month', 
                        'powerPS': 'power_ps',
                        'fuelType': 'fuel_type', 
                        'notRepairedDamage': 'unrepaired_damage',
                        'dateCreated': 'ad_created',
                        'nrOfPictures': 'num_pics', 
                        'postalCode': 'postal_code', 
                        'lastSeen': 'last_seen_date'}, axis=1)
df.columns


# Next, we will look into the description of each columns to see any new information that we can extract

# In[4]:


df.describe()


# There are some observation that we can see from this:
# 
# - The `num_pics` in every post are all 0, so this information is basically useless for our analysis and should be dropped
# - **registration_year** columns contains year that reach 9999, which also needs further examination
# - There are some pretty unusual data value such as 20000 horsepower and price over 1 billion dollar, which may requires eliminate the outlier data
# 
# Next, we look at all the data, including string and object data

# In[5]:


df.describe(include=['O'])


# There are some notable things to consider in this description:
# 
# - `odometer` and `price` should be consider numeric type, not object type. We need to transform these two into different type
# - `seller` and `offer_type` consists of only one value, so they should be dropped since they do not contribute much to the data analysis

# ### Drop Unnecessary Columns
# 
# The data in `num_pics`, `seller` and `offer_type` contain only one value, so we can safely remove this column without eliminating any useful information

# In[6]:


df = df.drop(['num_pics', 'seller', 'offer_type'], axis=1)
df.columns


# ### Convert Columns data into desirable type
# 
# The type of `price` and `odometer` data is string/object, which is not desirable. What we should do is transform it into integer type, and add additional information into the 

# In[7]:


print(df['price'].str[0].value_counts())
print(df['odometer'].str[-2:].value_counts())


# Here we can see that all the data in these two are expressed in US Dollar and Kilometer, so we can effectively remove the units in the data

# In[8]:


df['price']    = df['price'].str[1:].str.replace(',', '').astype(int)
df['odometer'] = df['odometer'].str[:-2].str.replace(',', '').astype(int)

df = df.rename({'price': 'price_dollar', 'odometer': 'odometer_km'}, axis=1)
df.describe()


# ### Remove Outliers
# 
# Since there are a lot of outliers for the project, we needs to eliminate them so that we do not lead to incorrect conclusion and analysis. 
# 
# We start with the column `price_dollar` first, as we can see from the data description, the maximum price of a car can be up to more than 1 billion dollars, which is unrealistic

# In[9]:


print(df['price_dollar'].value_counts().sort_index().tail(20))


# We can see that data is pretty consistently from 200,000 dollar to 350,000 dollar, then we see a sudden jumped from 999,990 to over 999,999,999. On the other side, 0 is the most common 'exact' price, and there are many people giving their car for free (I supposed), so we do not have to worry about this end. 
# 
# In this column, we can treat any data over 350,000 as outliers and replace it with `NaN`

# In[10]:


df.loc[~(df['price_dollar'].between(0, 350000)), 'price_dollar'] = np.nan
df['price_dollar'].describe()


# The next column we needs to examine its outliers is `registration_year`. This columns contains year from 1000 (when the car not even invented yet) to 9999 (way into the future)! We should examine all the data lying outside 1900 (when car first registered) to 2020 (present) to see what we should do with it

# In[11]:


df.loc[~(df['registration_year'].between(1900, 2020)), :]


# These entries, when we look at other attribute, is mostly inaccurate data. Most of them have registration month of 0, some of them have car power that is way over 1800ps (some of the most powerful car these day cannot reach 1500ps). Therefore, I think we can safely remove these rows from the dataset. After remove the registration year, we look back at the dataset

# In[12]:


df = df.loc[df['registration_year'].between(1900, 2020), :]
df.describe()


# ## Data Analysis
# 
# Since we finish cleaning our dataset, now we can use our cleaned data to extract some useful infomation for eBay's car buyers
# 
# ### Car Analysis by Brand
# 
# First we retrieve the top 20 most posted brands to calculate the mean price for each brands. We then calculate their mean price, mileage, and power(for those who wants a little bit more push!) data of each brand to get some insights on which car brands that is suitable for the customer

# In[30]:


# Get the top 20 brands on eBay
brands = df['brand'].value_counts().sort_values(ascending=False)[:20].index

# Calculate and display average price and mileage for each car
price_by_brand = []
mileage_by_brand = []
power_by_brand = []
for brand in brands:
    price_by_brand.append(df.loc[df['brand'] == brand, 'price_dollar'].mean())
    mileage_by_brand.append(df.loc[df['brand'] == brand, 'odometer_km'].mean())
    power_by_brand.append(df.loc[df['brand'] == brand, 'power_ps'].mean())
    
# Convert into dataframe for display
brand_info = pd.DataFrame(index=brands)
brand_info['mean_price_dollar'] = price_by_brand
brand_info['mean_odometer_km'] = mileage_by_brand
brand_info['mean_power_ps'] = power_by_brand
brand_info


# As we can see in the table, one of the most expensive brands will be Mercedes Benz, Audi, BMW. The Mid-tier cars is populated by brands like Volkswagen, Ford, Toyota, Huyndai. Cheap options include Opel, Ford, Peugeot, Renault.
# 
# Most of the popular brands will have the average mileage around 130,000km, and this number drops gradually with their popularity. Also, car prices also corresponds towards average mileage, as high-end brands like BMW and Mercedes have on average more mileage compared to mid-tier and low-tier ones.
# 
# For those who desire a little bit more power, expensive car is the way to go. However, there are some mid-tier car brands that provide a little bit more horsepower for their money, such as Volvo and Skoda. Otherwise, most of the car power does depends on their price tag.

# ### Popular Brand/Model Combination
# 
# Car buyers usually interested in the model of the car, not just the brand itself. We should investigate on some of the popular car models out there for the customer, so the customer can see what is their most popular options.

# In[27]:


(df['brand'] + ' ' + df['model']).value_counts().sort_values(ascending=False).head(20)


# The most popular option on eBay, by far, is the Volkswagen Golf, with over 4022 listings. The Volkswagen family also have the Polo and the Passat model in the top 10. BMW 3er and BMW 5er is also very high on the list as a popular high-end model. For low-end brand, Opel Corsa and Opel Astra are the most popular choice that customers can find.

# ### Price Difference in 

# In[41]:


df['power_ps'].value_counts().sort_index().tail(50)


# In[ ]: