#!/usr/bin/env python
# coding: utf-8

# # `GUIDED PROJECT: ANALYZING eBay KLEINANZEIGEN CAR SALE DATA`
# ---
# ---
#
# In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
#
# The dataset was originally scraped and uploaded to Kaggle, but in this project we're going to use a sample of 50,000 data points that was prepared by Dataquest.
#
# The data dictionary provided with the data is as follows:
#
# |**Column**|**Description**
# |-|-|
# |dateCrawled| When this ad was first crawled. All field values are taken from this date.
# |name | Name of the car.
# |seller | Whether the seller is private or a dealer.
# |offerType | The type of listing.
# |price | The price on the ad to sell the car.
# |abtest | Whether the listing is included in an A/B test.
# |vehicleType | The vehicle type.
# |yearOfRegistration | The year in which the car was first registered.
# |gearbox | The transmission type.
# |powerPS | The power of the car in PS.
# |model | The car model name.
# |kilometer | How many kilometers the car has driven.
# |monthOfRegistration | The month in which the car was first registered.
# |fuelType | What type of fuel the car uses.
# |brand | The brand of the car.
# |notRepairedDamage | Whether the car has damage that is not yet repaired.
# |dateCreated | The date on which the eBay listing was created.
# |nrOfPictures | The number of pictures in the ad.
# |postalCode | The postal code for the location of the vehicle.
# |lastSeenOnline | When the crawler last saw this ad online.
#
# **The aim of this project is to clean the data and analyze the included used car listings. Along the way, we want to explore some of the unique benefits Jupyter Notebook provides for pandas. And of course, we want to have fun doing it!**
#
# So let's get started. As usual, we'll start by importing libraries, reading the dataset, and doing some exploration.

# # `Importing Libraries and Reading Data`
# ---

# In[1]:

# Import the pandas, NumPy, matplotlib.pyplot, and seaborn libraries
get_ipython().run_line_magic('matplotlib', 'inline')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(font_scale=1.3)
sns.set_palette("pastel")
sns.set_style("white")

# In[2]:

# Read autos.csv into pandas.
# The default encoding='utf-8' raises a UnicodeDecodeError, so we had to try
# another encoding. In this case 'Latin-1' works.
autos = pd.read_csv('autos.csv', encoding='Latin-1')
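# *If you're curious how to find a working encoding without guessing by hand, here is a minimal added sketch. It just tries a few common encodings on the raw bytes until one decodes without error; the candidate list is an assumption, and a library like `chardet` would be a more thorough alternative.*

# In[ ]:

# Hypothetical helper: try a few common encodings on the raw file bytes
# and return the first one that decodes cleanly.
# Note: Latin-1 can decode any byte sequence, so it acts as a natural fallback.
def find_working_encoding(path, candidates=('utf-8', 'Latin-1', 'Windows-1252')):
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(find_working_encoding('autos.csv'))  # expected to print a usable encoding, e.g. 'Latin-1'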
# # `Data Exploration: Overview`
# ---

# In[3]:

# A neat feature of Jupyter Notebook is its ability
# to render the first few and last few values of any pandas object.
autos

# In[4]:

autos.info()  # print information about the autos dataframe
autos.head()  # print the first few rows

# ## Hmmm, interesting dataset. We notice several things here:
#
# - Column names use camelCase instead of Python's preferred snake_case. We're going to rename the columns so all of them are in snake_case.
# - A lot of German words (well, it is from the German eBay website). We're going to translate the words for better understanding.
# - There are 20 columns, most of which are strings. We're going to convert some of the string columns to numeric where necessary.
# - Some columns have null values (based on **autos.info()**), but all 20 columns have at least 80% non-null values. This is fine; we don't need to drop those rows or columns outright.
#
# * *note: we know this by looking at **autos.info()**: the RangeIndex has 50,000 entries, and the column with the lowest non-null count is notRepairedDamage, with 40,171.*
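# *Here's a small added sketch to back up that 80% claim programmatically (the threshold check is the only assumption; everything else comes straight from autos.info()):*

# In[ ]:

# Fraction of non-null values per column, sorted from most complete to least.
# notRepairedDamage should come out lowest, at roughly 40171/50000 ≈ 0.80.
non_null_share = autos.notnull().mean().sort_values(ascending=False)
print(non_null_share)
print((non_null_share >= 0.80).all())  # True if every column is at least 80% non-null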
#
# **So what are we waiting for? Let's clean the dataset!**

# # `Data Cleaning`
# ---

# ## Clean column names: camelCase --> snake_case
#
# Let's convert the column names from camelCase to snake_case and reword some of them based on the data dictionary to be more descriptive.

# In[5]:

# Honestly, this is totally unnecessary.
# However, I just learned about wordcloud and it looks cool!
# So let's have a little fun and use it ;D

# Displaying the column names using wordcloud.
# As you can see below, they are still in camelCase.

# Libraries
from wordcloud import WordCloud

# Create a string of words
text = str(autos.columns)

# Create the wordcloud object
wordcloud = WordCloud(width=4128, height=4128, margin=0, background_color="white").generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

# In[6]:

# Print an array of the existing column names.
autos.columns

# In[7]:

# Rename the columns
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
                 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
                 'odometer', 'registration_month', 'fuel_type', 'brand',
                 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
                 'last_seen']

# In[8]:

# Verify the change
autos.columns

# In[9]:

# Displaying the column names in snake_case

# Libraries
from wordcloud import WordCloud

# Create a string of words
text = str(autos.columns)

# Create the wordcloud object
wordcloud = WordCloud(width=4128, height=4128, margin=0, background_color="white").generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

# **As we can see, the column names are now in snake_case format. Woohoo! Next!**
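# *By the way, the renaming above is hard-coded on purpose, since we also reworded a few names (e.g. notRepairedDamage --> unrepaired_damage). For a purely mechanical camelCase --> snake_case conversion, a small regex sketch like this would work (note it won't reproduce our reworded names):*

# In[ ]:

import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter, then lowercase everything:
    # 'yearOfRegistration' -> 'year_of_registration'
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('yearOfRegistration'))  # year_of_registration
print(camel_to_snake('notRepairedDamage'))   # not_repaired_damage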
# ## Cleaning columns with German words
#
# It is a shame that I don't speak German. I'd need to take some classes to understand some of the words in this dataset. :(
# But luckily, there is Google Translate! So with its help, I am going to translate all the German words in this dataset to English.

# In[10]:

# Print the unique values of the relevant columns
# to figure out what needs to be changed
columns = ['seller', 'offer_type', 'vehicle_type', 'gearbox', 'unrepaired_damage', 'fuel_type']

for column in columns:
    print(column)
    print(autos[column].unique())
    print('\n')

# In[11]:

# TRANSLATE: replace the values (German: English) with a mapping dict
columns = ['seller', 'offer_type', 'vehicle_type', 'gearbox', 'unrepaired_damage', 'fuel_type']

translation = {'privat': 'private', 'gewerblich': 'commercial',  # seller
               'Angebot': 'offer', 'Gesuch': 'request',          # offer_type
               'kleinwagen': 'small car', 'kombi': 'wagon',
               'cabrio': 'convertible', 'andere': 'other',       # vehicle_type
               'manuell': 'manual', 'automatik': 'automatic',    # gearbox
               'nein': 'no', 'ja': 'yes',                        # unrepaired_damage
               'benzin': 'gas', 'elektro': 'electric'}           # fuel_type

autos[columns] = autos[columns].replace(translation)

# In[12]:

# Verify the change
# by printing the unique values again
for column in columns:
    print(column)
    print(autos[column].unique())
    print('\n')

# **Alright! So we have translated all the German words!**
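# *As an extra sanity check, here's a small added sketch that scans the translated columns for any leftover German terms from our mapping dict (assuming the dict above covers every German value we saw in the unique() output):*

# In[ ]:

# Collect every unique value remaining in the translated columns
# and intersect it with the German keys of our translation dict.
remaining = set()
for column in columns:
    remaining.update(autos[column].dropna().unique())

leftover_german = remaining & set(translation.keys())
print(leftover_german)  # an empty set means the translation caught everything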
# Now we have a better understanding of the dataset, which will help us when we explore it further and do our analysis.
#
# Onto the next one!

# ## Cleaning odometer and price columns (part 1): Removing non-numeric characters and converting string columns to int

# **By looking at the table above, we can see that the odometer and price columns are numeric values stored as text (in the 'top' row above, price has '$' and odometer has 'km').**
#
# We don't want the odometer and price columns to be strings, because then we can't sort them numerically. So you know what? Let's clean those columns! Remove the non-numeric characters and convert the data types of those two columns to int.

# In[13]:

print(autos.dtypes)  # verify data types

# In[14]:

# Print the unique values of both columns
# to figure out what needs to be changed
columns = ['price', 'odometer']

for column in columns:
    print('-' + column + '-')
    print(autos[column].unique())
    print('\n')

# In[15]:

# Replace the non-numeric characters, then print unique values to verify the change.
# We pass regex=False so '$' is treated as a literal character, not a regex anchor.
columns = ['price', 'odometer']

for column in columns:
    autos[column] = (autos[column].str.replace('$', '', regex=False)
                                  .str.replace('km', '', regex=False)
                                  .str.replace(',', '', regex=False)
                                  .astype(int)
                    )
    print('-' + column + '-')
    print(autos[column].unique())
    print('\n')

# In[16]:

# Rename the columns so the units are explicit
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos.rename({"price": "price_usd"}, axis=1, inplace=True)

# In[17]:

autos.describe(include='all')  # verify the changes

# ## Buckle up, take a deep breath! Before we move on, let's do a little reflection on what we have done.
#
# **Up until this point, we've done the following:**
#
# - Cleaned column names: camelCase --> snake_case
# - Cleaned columns with German words
# - Cleaned odometer and price columns
#     - Removed non-numeric characters
#     - Converted columns to int
#     - Renamed columns
#
# Let's continue exploring and cleaning the dataset, specifically looking for data that doesn't look right. For now, we'll keep working on the odometer_km and price_usd columns.
#
# **We're going to analyze the columns using minimum and maximum values and look for outliers that we might want to remove.**

# ## Cleaning odometer_km and price_usd columns (part 2): Removing outliers

# **`autos["price_usd"]`**

# In[18]:

print('-nunique-')
print(autos["price_usd"].nunique())  # to see how many unique values there are
print('\n')
print('-describe-')
print(autos["price_usd"].describe())  # to view min/max/median/mean etc.

# In[19]:

# Count unique values of the 50 highest prices
autos["price_usd"].value_counts().sort_index(ascending=False).head(50)
# note: Series.value_counts() returns a Series

# In[20]:

# Count unique values of the 50 lowest prices
autos["price_usd"].value_counts().sort_index(ascending=False).tail(50)

# **If we look at the two series above, it appears that the outliers are listings priced at \\$0 and everything above \\$350,000.**
#
# We want to remove the \\$0 listings, because giving a car away to strangers for free doesn't seem realistic in this day and age, right? :P
# We also want to remove rows with prices above \\$350,000, because beyond that point the prices jump significantly higher, which doesn't seem natural either.
#
# *note: cars with really low prices (e.g. below \\$10) make sense, because eBay is an auction site, so the really low prices are most likely just starting bids. We're going to keep them.*

# In[21]:

# Removing outliers
autos = autos[autos["price_usd"].between(1, 350000)]

# In[22]:

# Verify the change
print('-head-')
print(autos["price_usd"].value_counts().sort_index(ascending=False).head())
print('\n')
print('-tail-')
print(autos["price_usd"].value_counts().sort_index(ascending=False).tail())
print('\n')
print('-nunique-')
print(autos["price_usd"].nunique())
print('\n')
print('-describe-')
print(autos["price_usd"].describe())

# In[23]:

# Understanding the distribution of price_usd values using a kdeplot
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('price_usd')
sns.kdeplot(autos["price_usd"], shade=True)
sns.despine(left=True, bottom=True)

# **The outliers in the price_usd column have been removed. Yay! Now the column only contains cars priced between \\$1 and \\$350,000.**
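# *If you'd rather not eyeball the cutoffs from value_counts alone, a quantile check is a useful cross-check. This cell is an addition, and the quantile levels are an arbitrary choice:*

# In[ ]:

# Look at the tails of the cleaned price column.
# The 99th percentile should sit far below the $350,000 cap,
# suggesting the cap only trimmed a thin tail of extreme listings.
print(autos["price_usd"].quantile([0.01, 0.05, 0.5, 0.95, 0.99, 1.0]))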
# Let's move on!
#
# **`autos["odometer_km"]`**

# In[24]:

print('-nunique-')
print(autos["odometer_km"].nunique())
print('\n')
print('-describe-')
print(autos["odometer_km"].describe())

# In[25]:

print(autos["odometer_km"].value_counts().sort_index(ascending=False))

# In[26]:

# Display the value counts of odometer_km using a barplot.
# We sort the counts by index once and pass matching x (index) and y (values)
# so each bar lines up with the right mileage.
odometer_km = autos["odometer_km"].value_counts().sort_index()

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('value_counts vs odometer_km')
plt.xlabel('odometer_km')
plt.ylabel('value_counts')
sns.barplot(x=odometer_km.index, y=odometer_km.values,
            ci=None).set_xticklabels(labels=odometer_km.index, rotation=75)
sns.despine(left=True, bottom=True)

# **There don't seem to be any outliers in the odometer_km column. Everything is good, so we'll just leave it at that. Next!**

# ## Cleaning & exploring date columns
#
# Let's now move on to the date columns and understand the date range the data covers.
#
# There are 5 columns that should represent date values. Some of these columns were created by the crawler, and some came from the website itself. We can differentiate them by referring to the data dictionary:
#
# - `date_crawled`: added by the crawler
# - `last_seen`: added by the crawler
# - `ad_created`: from the website
# - `registration_month`: from the website
# - `registration_year`: from the website
#
# Right now, the **date_crawled, last_seen, and ad_created columns are all identified as string values** by pandas. Because these three columns are represented as strings, we need to **convert the data into a numerical representation so we can understand it quantitatively**. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand their distributions without any extra data processing.
#
# **Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values:**

# In[27]:

autos[['date_crawled', 'ad_created', 'last_seen']][0:5]

# **We notice that the first 10 characters represent the date (e.g. 2016-03-12).**
# **To understand the date range, we can extract just the date values, use Series.value_counts() to generate a distribution, and then sort by the index.**
#
# **Let's extract the date values!**
# To select the first 10 characters in each column, we can use Series.str[:10]:

# **`dateof_date_crawled`**

# In[28]:

from pandas import to_datetime  # importing to_datetime so we don't have to call it as pd.to_datetime

dateof_date_crawled = to_datetime(autos['date_crawled'].str[:10])
print('-dateof_date_crawled-')
print(dateof_date_crawled)
# We're using to_datetime so that we can access datetime properties later

# In[29]:

# Understanding the distribution of dateof_date_crawled using a kdeplot
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('dateof_date_crawled')
sns.kdeplot(dateof_date_crawled, shade=True)
sns.despine(left=True, bottom=True)

# **`dateof_ad_created`**

# In[30]:

dateof_ad_created = to_datetime(autos['ad_created'].str[:10])
print('-dateof_ad_created-')
print(dateof_ad_created)

# In[31]:

# Understanding the distribution of dateof_ad_created using a kdeplot
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('dateof_ad_created')
sns.kdeplot(dateof_ad_created, shade=True)
sns.despine(left=True, bottom=True)

# **`dateof_last_seen`**

# In[32]:

dateof_last_seen = to_datetime(autos['last_seen'].str[:10])
print('-dateof_last_seen-')
print(dateof_last_seen)

# In[33]:

# Understanding the distribution of dateof_last_seen using a kdeplot
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('dateof_last_seen')
sns.kdeplot(dateof_last_seen, shade=True)
sns.despine(left=True, bottom=True)
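# *Since the goal of this section is the date range the data covers, here's a small added sketch that states it directly (the printed bounds are whatever min/max the three series actually contain; no specific dates are assumed):*

# In[ ]:

# Overall date range covered by the crawler and the listings
for name, series in [('date_crawled', dateof_date_crawled),
                     ('ad_created', dateof_ad_created),
                     ('last_seen', dateof_last_seen)]:
    print(name, ':', series.min().date(), '->', series.max().date())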
# **Almost done with data cleaning, almost! But for now, let's explore the date columns and the registration_year column further to understand the data distribution.**

# ## Exploring date_crawled, ad_created, and last_seen columns: calculating the distribution of values

# **`dateof_date_crawled`**

# In[34]:

# Using sort_index() we put the distribution in chronological order (as percentages)
dateof_date_crawled.value_counts(normalize=True, dropna=False).sort_index()
# Series.value_counts(normalize=True, dropna=False) is used
# to include missing values in the distribution
# and to use percentages instead of counts

# **2016-04-03** makes up the largest share of the date_crawled column, with 3.8%.

# **`dateof_ad_created`**

# In[35]:

dateof_ad_created.value_counts(normalize=True, dropna=False).sort_index()

# **2016-04-03** makes up the largest share of the ad_created column, with 3.8%.
#
# One interesting thing: the most common dates for `dateof_date_crawled` and `dateof_ad_created` are both **2016-04-03**, which seems to suggest that the dates in dateof_ad_created are NOT always the actual dates the ads were created. It is more than likely that many ads were created before the dates listed in dateof_ad_created.

# **`dateof_last_seen`**

# In[36]:

dateof_last_seen.value_counts(normalize=True, dropna=False).sort_index()

# **2016-04-06** makes up the largest share of the last_seen column, with 22.2%.

# ## Creating a new column:
# **`turnaround_time` --> `autos['turnaround_time_days']`**

# In[37]:

# Using the assumption that all values of dateof_ad_created are the actual dates the ads were created,
# and that all values of dateof_last_seen reflect the dates the cars were sold,
# we calculate the turnaround time by taking the difference of the two.
turnaround_time = dateof_last_seen - dateof_ad_created
turnaround_time.value_counts(normalize=True, dropna=False).sort_index()
# Note that this is why we needed to_datetime; otherwise we couldn't use the subtraction operator (-)

# Just a quick note here: we can see that the top three most common turnaround times are 0 days, 2 days, and 4 days. This is interesting, but we want more insights! So let's create a new column that will be useful for later analysis.
#
# **`turnaround_time_days`**

# In[79]:

# Create a new column in the DataFrame.
# turnaround_time is a timedelta series, so .dt.days extracts the whole-day count as an integer.
autos['turnaround_time_days'] = turnaround_time.dt.days

# Understanding the distribution of turnaround_time_days using a kdeplot
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('turnaround_time_days')
sns.kdeplot(autos['turnaround_time_days'], shade=True)
sns.despine(left=True, bottom=True)

# In[39]:

autos.head(5)  # verify the change
# We converted turnaround_time to a number of days and added it as a new column called turnaround_time_days

# **All is good on the date columns. Next!**
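# *One more added sketch before we move on: summary statistics put numbers on "how fast do cars sell" beyond just the top three values (the exact figures depend on the data, so none are quoted here):*

# In[ ]:

# Median and mean turnaround, plus the share of listings gone within a week
print(autos['turnaround_time_days'].describe())
print((autos['turnaround_time_days'] <= 7).mean())  # fraction of ads last seen within 7 days of creation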
# ## Cleaning & exploring registration_year column: understanding the data distribution

# In[40]:

autos['registration_year'].describe()

# In[41]:

# Understanding the distribution of registration_year
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('registration_year')
sns.kdeplot(autos['registration_year'], shade=True)
sns.despine(left=True, bottom=True)

# ## Huh?! What...? If you're confused looking at the distribution of registration_year, you're not alone...
#
# **The minimum value is 1000?!** Which means the car was produced and registered in the year 1000?! Really? Cars didn't even exist yet.
# *Correct me if I'm wrong, but I don't think there was even a single road built for cars at that time. So this definitely doesn't make sense.*
#
# **The maximum value is 9999 -___-'**, which makes even less sense.
# *I mean, this dataset is from 2016. How could a used car come from the future? But you know what, let's not dwell on this any further.*
#
# **It's time to clean the data!**
# **Since there are odd values in the dataset, we will remove a few rows:**
# * Since this dataset is from 2016, we will only include cars registered up until 2016.
# * I did a little googling for "the first car ever", and it turns out the birth certificate of the automobile dates to 1886. That car, the *Benz Patent Motor Car, model no. 1*, was invented by Carl Benz. It was not sold to the general public at the time, but maybe someone owns this particular car and wants to sell it on eBay. Who knows? So without overthinking, let's set the earliest year to 1886.
#
# * **TL;DR: we'll only include cars with a registration year between 1886 and 2016**
#
# Before we remove the rows, let's check the relative frequency of cars with a registration year that falls outside the 1886 - 2016 interval, and see if it's safe to remove those rows entirely or if we need more custom logic.

# In[42]:

# Check relative frequency using boolean filtering
(~autos["registration_year"].between(1886, 2016)).value_counts(normalize=True, dropna=False)

# **True = 0.039, which means that cars with a registration year outside the 1886 - 2016 interval make up only 3.9% of the total rows. This is a small percentage, so we can remove them.**

# In[43]:

print('-Before-')
print(autos["registration_year"].describe())  # before the change
print('\n')

autos = autos[autos["registration_year"].between(1886, 2016)]  # removing the odd rows

print('-After-')
print(autos["registration_year"].describe())  # verify the change

# In[44]:

# Calculate the distribution of the remaining values
print('-head-')
print(autos["registration_year"].value_counts(normalize=True).head(10).sort_index())
print('\n')
print('-tail-')
print(autos["registration_year"].value_counts(normalize=True).tail(10).sort_index())

# In[45]:

# New distribution of registration_year
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('registration_year')
sns.kdeplot(autos['registration_year'], shade=True)
sns.despine(left=True, bottom=True)

# Based on the data distribution, we can see that in this dataset:
# * The **most common** cars were registered **around 1998 - 2007**
# * The **least common** cars were registered between **1927 - 1953**
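# *To make those two bullet points easier to check, here's an added sketch that buckets registration years by decade (the floor-division trick maps e.g. 1998 -> 1990):*

# In[ ]:

# Share of listings per registration decade
decades = autos['registration_year'] // 10 * 10
print(decades.value_counts(normalize=True).sort_index())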
# ## As we all know, data scientists spend 80% of their time cleaning data, and at this point we're at 79.99%
#
# **We have cleaned the dataset, and we can start doing more analysis on it. However, before we move on, let's take a small break and reflect again on what we have done in this project. So far we have done the following:**
#
# - Cleaned column names: camelCase --> snake_case
# - Cleaned columns with German words
# - Cleaned odometer and price columns
#     - Removed non-numeric characters
#     - Converted columns to int
#     - Renamed columns
#     - Removed outliers
# - Cleaned date columns (date_crawled, ad_created, and last_seen)
#     - Explored date columns & understood the distribution of values
# - Cleaned & explored the registration_year column
#
# **We've done most of our chores (data cleaning); now the super fun part begins! We'll start analyzing the data and creating insights! Oh, I'm excited :D**

# # `Analyzing Data & Creating Insights`
# ---
#
# One of the analysis techniques we learned in this course is **aggregation**.
# When working with data on cars, it's natural to explore variations across different car brands. **We can use aggregation to understand the brand column.**
#
# **How do we do aggregation? Here are the steps:**
#
# - Identify the unique values we want to aggregate by
# - Create an empty dictionary to store our aggregate data
# - Loop over the unique values, and for each:
#     - Subset the dataframe by that unique value
#     - Calculate the mean of whichever column we're interested in
#     - Assign the value/mean to the dictionary as a key/value pair

# ## Analyzing brand column: Finding the correlation between price and odometer_km (mileage) among top brands

# **`brands_freq`**

# In[46]:

# Identify the unique values we want to aggregate by.
# We explore the unique values in the brand column and calculate their relative frequencies.
brands_freq = autos["brand"].value_counts(normalize=True, dropna=False)
brands_freq

# **Display brands_freq in descending order of relative frequency**

# In[47]:

# Understanding the distribution of the brand column using a barplot.
# We pass matching x (index) and y (values) from the same sorted Series
# so each bar lines up with the right brand.
sorted_brands = brands_freq.sort_values(ascending=False)

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('sorted_brands')
plt.xlabel('brands')
plt.ylabel('relative frequency')
sns.barplot(x=sorted_brands.index, y=sorted_brands.values,
            ci=None).set_xticklabels(labels=sorted_brands.index, rotation=75)
sns.despine(left=True, bottom=True)
# **`top_brands`**
#
# **Brands that have a relative frequency > 5%**

# In[48]:

top_brands = brands_freq[brands_freq > .05]
print(top_brands)

# **Display top_brands in descending order of relative frequency**

# In[49]:

# Understanding the distribution of top_brands using a barplot
sorted_top_brands = top_brands.sort_values(ascending=False)

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('top_brands')
plt.xlabel('brands')
plt.ylabel('relative frequency')
sns.barplot(x=sorted_top_brands.index, y=sorted_top_brands.values, ci=None)
sns.despine(left=True, bottom=True)

# **`top_brands_price`**

# In[50]:

top_brands_price = {}                               # Create an empty dictionary to store our aggregate data

for b in top_brands.index:                          # Loop over the unique values, and for each:
    selected_rows = autos[autos["brand"] == b]      # Subset the dataframe by the unique value using a boolean comparison
    mean_price = selected_rows["price_usd"].mean()  # Calculate the mean of the column we're interested in
    top_brands_price[b] = int(mean_price)           # Assign the value/mean to the dict as a key/value pair

# Print in descending order
sorted_top_brands_price = sorted(top_brands_price.items(), reverse=True, key=lambda x: x[1])
for element in sorted_top_brands_price:
    print(element[0], " :", element[1])

# **Display top_brands_price in descending order of price_usd**

# In[51]:

# Display top_brands_price in descending order using a barplot
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('top_brands_price')
plt.xlabel('brands')
plt.ylabel('price_usd')

keys = list(sorted(top_brands_price, key=top_brands_price.get, reverse=True))  # keys sorted by value
vals = [float(top_brands_price[k]) for k in keys]                              # values in descending order

sns.barplot(x=keys, y=vals)
sns.despine(left=True, bottom=True)

# **As we can see, among the top brands the most expensive is Audi and the cheapest is Opel**

# **`top_brands_mileage`**

# In[52]:

# Using the same aggregation method as above
top_brands_mileage = {}

for b in top_brands.index:
    selected_rows = autos[autos["brand"] == b]
    mean_mileage = selected_rows["odometer_km"].mean()
    top_brands_mileage[b] = int(mean_mileage)

sorted_top_brands_mileage = sorted(top_brands_mileage.items(), reverse=True, key=lambda x: x[1])
for element in sorted_top_brands_mileage:
    print(element[0], " :", element[1])

# **Display top_brands_mileage in descending order of odometer_km**

# In[53]:

# Display top_brands_mileage in descending order using a barplot
fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('top_brands_mileage')
plt.xlabel('brands')
plt.ylabel('mileage')

keys = list(sorted(top_brands_mileage, key=top_brands_mileage.get, reverse=True))  # keys sorted by value
vals = [float(top_brands_mileage[k]) for k in keys]                                # values in descending order

plt.ylim(120000, 132572)  # limit the y axis so we can clearly see the differences in mean odometer_km between brands
sns.barplot(x=keys, y=vals)
sns.despine(left=True, bottom=True)

# **Ah, OK. So among the top brands, BMW has the most mileage on average and Ford has the least.**
# **Alright! So we have figured out the average price and mileage for the top brands using aggregation. Next we want to analyze the aggregated data by converting the dictionaries to Series objects using the pandas Series constructor.**

# **`tbp_series`** (top brands price)

# In[54]:

tbp_series = pd.Series(top_brands_price).sort_values(ascending=False)
print(tbp_series)

# **`tbm_series`** (top brands mileage)

# In[55]:

tbm_series = pd.Series(top_brands_mileage).sort_values(ascending=False)
print(tbm_series)

# **We have converted both dictionaries to Series objects using the Series constructor. The next step is to create a dataframe from both Series objects using the DataFrame constructor, so we can easily compare the top brands' mean price & mean mileage.**

# In[56]:

brand_info = pd.DataFrame(tbp_series, columns=['mean_price'])  # create a dataframe from tbp_series
brand_info['mean_mileage'] = tbm_series                        # assign tbm_series to a new column
brand_info                                                     # pretty-print

# If we want to buy a car considering only price and mileage, then I'd say **Ford has the best value, because it is the second least expensive brand and has the least mileage on average.**
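# *As a side note, the loop-plus-dictionary aggregation above is great for practice, but pandas can produce the same table in one step with groupby. This added sketch should match brand_info up to rounding (the loop cast its means to int):*

# In[ ]:

# One-step equivalent of the loop aggregation: mean price and mileage per top brand
brand_info_gb = (autos[autos['brand'].isin(top_brands.index)]
                 .groupby('brand')[['price_usd', 'odometer_km']]
                 .mean()
                 .rename(columns={'price_usd': 'mean_price', 'odometer_km': 'mean_mileage'})
                 .sort_values('mean_price', ascending=False))
print(brand_info_gb)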
# When looking at the table above, it seems like there is no direct correlation between price and mileage for the top brands. For example, Audi is the most expensive brand, and yet it is ranked fourth in mileage; Mercedes-Benz is the second most expensive brand, and it has the second highest mileage.
#
# **But really? No correlation at all? Maybe, since only six brands cleared the 5% frequency bar, analyzing only the top brands doesn't do the whole dataset justice. I have an idea: let's analyze the whole dataset. We can split odometer_km (mileage) into groups and use aggregation to see if average prices follow any pattern based on mileage.**
#
# Let's do this! I am curious.

# ## Analyzing odometer_km (mileage) column (part 1): Find the relationship between odometer_km (mileage) and price

# Call me stubborn, but I really don't believe that there is no correlation at all between mileage and price.
# **Let's use groupby() to analyze the mileage and price columns for the whole dataset.** See if we get a different result. I wonder...

# In[57]:

# Mean price per odometer_km value. For the barplot we sort by index
# so that x (mileage) and y (mean price) stay aligned.
mileage_price = autos.groupby(["odometer_km"])["price_usd"].mean()
print(mileage_price.sort_values(ascending=False))

mileage_price_by_km = mileage_price.sort_index()

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('price_usd vs odometer_km')
plt.xlabel('odometer_km')
sns.barplot(x=mileage_price_by_km.index, y=mileage_price_by_km.values,
            ci=None).set_xticklabels(labels=mileage_price_by_km.index, rotation=75)
sns.despine(left=True, bottom=True)

# **Tadaa! See?! It appears that the higher the price, the lower the mileage, and vice versa *(odometer_km = 5,000 is an outlier)*. Let's try another method just to make our analysis credible. This time, we are going to split odometer_km into groups and use aggregation to see if average prices follow any pattern based on mileage.**

# ## Analyzing odometer_km (mileage) column (part 2): Split odometer_km into groups, and use aggregation to verify the correlation between price and mileage

# In[58]:

# Splitting odometer_km into 3 equal-width bins
mileage_groups = autos["odometer_km"].value_counts(bins=3).sort_index(ascending=False)
mileage_groups

# In[59]:

# Understanding the distribution of the mileage groups using a barplot.
# The interval index is cast to str so seaborn can use it as categorical labels.
mileage_groups_sorted = mileage_groups.sort_index()

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('value_counts vs mileage group')
plt.xlabel('mileage group')
plt.ylabel('value_counts')
sns.barplot(x=mileage_groups_sorted.index.astype(str), y=mileage_groups_sorted.values, ci=None)
sns.despine(left=True, bottom=True)

# In[60]:

# Mean price per mileage group, displayed with a barplot
bins = [4854.999, 53333.333, 101666.667, 150000.0]
mileage_price = autos.groupby(pd.cut(autos["odometer_km"], bins))["price_usd"].mean()
print(mileage_price.sort_values(ascending=False))

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('price_usd vs mileage group')
plt.xlabel('mileage group')
plt.ylabel('price_usd')
sns.barplot(x=mileage_price.index.astype(str), y=mileage_price.values, ci=None)
sns.despine(left=True, bottom=True)

# In[61]:

# Calculate the mean price of each mileage group
# using boolean comparisons
print('-(101666.667, 150000.0]-')
print('Price = $' + str(round(autos[autos["odometer_km"] >= 101666.667]["price_usd"].mean())))                   # group 1
print('\n')
print('-(53333.333, 101666.667]-')
print('Price = $' + str(round(autos[autos["odometer_km"].between(53333.333, 101666.667)]["price_usd"].mean())))  # group 2
print('\n')
print('-(4854.999, 53333.333]-')
print('Price = $' + str(round(autos[autos["odometer_km"] <= 53333.333]["price_usd"].mean())))                    # group 3

# Based on the results above, we get the following:
#
# |mileage_groups| price |
# |-|-|
# |(101666.667, 150000.0] |\\$4,108 |
# |(53333.333, 101666.667]| \\$9,595|
# |(4854.999, 53333.333] | \\$14,891|
#
# Wooh! Good thing we did this. It turns out that if we analyze the whole dataset with odometer_km grouped into bins, **there is a correlation between price and mileage. As we can all see from the table, the group of cars with higher mileage has a lower average price, and vice versa.**
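# *To put a single number on that relationship, here's an added sketch computing the correlation between mileage and price (Pearson by default; the exact value depends on the data, so none is quoted here):*

# In[ ]:

# Correlation between mileage and price across all listings.
# A negative value would back up the "higher mileage, lower price" pattern.
print(autos[['odometer_km', 'price_usd']].corr())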
# Alright, alright. So what else can we analyze? Let's check the dataset again.

# In[62]:

autos.head(5)  # print the first 5 rows to check what to do next

# **So in our analysis we have done the following:**
#
# * Analyzed the top brands, and found the correlation between price and odometer_km (mileage) among top brands.
# * Analyzed the correlation between the odometer_km (mileage) column and the price column in this dataset.
#
# **Let's do a few more analyses to get some more insights!**

# ## Analyzing brand & model columns (part 1): Find the most common brand/model combinations in this dataset

# **Let's find out the most common brand/model combinations by using groupby() on the brand and model columns**

# In[63]:

# Using groupby:
# A groupby operation involves some combination of splitting the object,
# applying a function, and combining the results.
# This can be used to group large amounts of data and compute operations on these groups.
common_brands_model = autos.groupby(["brand", "model"]).size().sort_values(ascending=False).head()
common_brands_model

# **Oh wow! The most common brand & model combination is the Volkswagen Golf, with 3,707 entries. Interesting.**

# ## Analyzing brand & model columns (part 2): Find the most common model among top brands

# Earlier, when analyzing the brand column, we extracted top_brands, which consists of the car brands with a relative frequency > 5% in this dataset. **Let's find the most common model of each of those brands! In this case we're going to use aggregation (since top_brands is not a column in the DataFrame).**

# In[64]:

top_brands.index  # just to remind us which brands are included in top_brands

# In[65]:

# Aggregation: for each top brand, show its most common model
top_brands_model = {}

for b in top_brands.index:
    selected_rows = autos[autos["brand"] == b]  # boolean comparison
    model = selected_rows["model"]
    top_brands_model[b] = model
    print(str(b) + ': ' + str(top_brands_model[b].value_counts().head(1)))
    print('\n')

# **The Ford Focus is the most popular Ford model, but compared to the most popular models of the other top brands it has the lowest number of entries.** This is not surprising, since Ford is the only American brand among the top brands, while the others are from Germany. Note that this dataset comes from the German eBay site, so the result makes sense.

# ## Analyzing brand & model columns (part 3): Top 3 most expensive used cars
#
# **Let's find out the most expensive cars in this dataset using groupby() on the model column**

# In[66]:

model_price = autos.groupby(["model"])["price_usd"].max().sort_values(ascending=False).head()
print(model_price)

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('max price_usd vs model')
plt.xlabel('model')
sns.barplot(x=model_price.index, y=model_price.values, ci=None)
sns.despine(left=True, bottom=True)

# **OK, so the most expensive car in this dataset is a Porsche 911, at $350,000. One day I'll buy one! ;D**
#
# Sooo, we are almost done with our analysis. We have obtained a lot of insights by analyzing the brand column and then the odometer_km (mileage) column. We're almost done... just a few more analyses...
#
# **But before we continue, let's do a little recap again. Here is what we have done in the data analysis section so far:**
#
# - Brand column
#     - Correlation between price and odometer_km (mileage) among top brands
# - Odometer_km (mileage) column
#     - Relationship between odometer_km (mileage) and price
#     - Split odometer_km into groups, and used aggregation to verify the correlation between price and mileage
# - Brand & model columns
#     - Most common brand/model combinations in this dataset
#     - Most common model among top brands
#     - Most expensive used cars
# **Alright, so next let's check out the average price per brand across the whole dataset.**
#
# ## Analyzing brand & price_usd columns: Find the priciest & cheapest car brands on average

# **`brand_model_price`**

# In[67]:

brand_model_price = autos.groupby(["brand"])["price_usd"].mean().sort_values(ascending=False)

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('price_usd vs brand')
sns.barplot(x=brand_model_price.index, y=brand_model_price.values,
            ci=None).set_xticklabels(labels=brand_model_price.index, rotation=75)
sns.despine(left=True, bottom=True)

# **Yup, this makes sense. The priciest is Porsche, and the cheapest is Daewoo.**
#
# **Next! Let's check out brand & turnaround_time_days**

# ## Analyzing brand & turnaround_time_days columns: Top 3 easiest & hardest cars to sell (fastest & slowest turnaround time on average)

# **`brand_model_turnaround_short`**

# In[68]:

# Top 3 brands with the fastest average turnaround
brand_model_turnaround_short = autos.groupby(["brand"])["turnaround_time_days"].mean().sort_values().head(3)
brand_model_turnaround_short

# This is really interesting, because I live in the US, and as far as I know, Lada, Daewoo, and Daihatsu are not considered popular brands there.
# * *Correct me if I'm wrong; I am pretty oblivious about cars, honestly :P*
#
# **If I had to guess... I'd say this has to be related to their prices.**

# **`brand_model_turnaround_long`**

# In[69]:

# Top 3 brands with the slowest average turnaround
brand_model_turnaround_long = autos.groupby(["brand"])["turnaround_time_days"].mean().sort_values(ascending=False).head(3)
brand_model_turnaround_long

# **Mini is the brand that takes the longest time to sell on average? I didn't expect this,** since Mini is a British brand that has been owned by a German brand, BMW, since 2000. Because this data comes from the German eBay site, I didn't think the slowest-selling brand would be one owned by a German company.
#
# As for Chevrolet and Subaru, I'm not really surprised, because Chevrolet comes from the US and Subaru comes from Japan.
# * *I'm not saying that Germans only buy German cars, btw. This is only my assumption (I have limited knowledge about cars :P)*

# **`brand_model_turnaround`**
#
# **Display brand_model_turnaround in descending order**

# In[70]:

brand_model_turnaround = autos.groupby(["brand"])["turnaround_time_days"].mean().sort_values(ascending=False)

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('turnaround_time_days vs brand')
sns.barplot(x=brand_model_turnaround.index, y=brand_model_turnaround.values,
            ci=None).set_xticklabels(labels=brand_model_turnaround.index, rotation=75)
sns.despine(left=True, bottom=True)

# **Alright, so next let's check out the relationship between turnaround_time_days and price_usd.**

# ## Analyzing turnaround_time_days & price_usd columns: Find the correlation

# In[71]:

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
sns.scatterplot(data=autos, x='price_usd', y='turnaround_time_days', hue='turnaround_time_days')
sns.despine(left=True, bottom=True)
plt.show()

# **Based on the plot, we can see that turnaround_time_days tends to decrease as price_usd increases. This indicates a negative correlation between turnaround_time_days and price_usd.**
#
# Woohoo! Our assumption is correct! Price definitely has an effect on the turnaround time.
#
# Almost done... HAHA, I know I've been saying this, but it's hard to stop since we keep getting cool insights. We still have a few unanswered questions, so let's get to them. I swear this is the last time I say 'almost done'.
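# *As with mileage and price earlier, here's an added sketch that quantifies the relationship with a correlation coefficient instead of just eyeballing the scatterplot (again, no specific value is assumed):*

# In[ ]:

# A negative coefficient here would back up the pattern we read off the scatterplot
print(autos[['price_usd', 'turnaround_time_days']].corr())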
# ## Analyzing unrepaired_damage: How much cheaper are cars with damage than their non-damaged counterparts?

# **Let's find out the price difference between damaged and non-damaged cars**

# In[72]:

autos.groupby(["unrepaired_damage"])["price_usd"].mean()

# **Makes total sense. The average price of cars without unrepaired damage is higher, at \\$7,164.
# Next, let's calculate the difference.**

# In[73]:

price_damage = autos.groupby(["unrepaired_damage"])["price_usd"].mean()
print(price_damage)

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('price_usd vs unrepaired_damage')
sns.barplot(x='unrepaired_damage', y='price_usd', data=autos, ci=None)
sns.despine(left=True, bottom=True)

# In[74]:

# Calculating the difference: no - yes
# ("no" has the larger mean, so this order gives a positive difference)
autos.groupby(["unrepaired_damage"])["price_usd"].mean()["no"] - autos.groupby(["unrepaired_damage"])["price_usd"].mean()["yes"]

# **Alright, so without using a calculator we know that the difference is $4,922... neat.**

# ## Analyzing vehicle_type: What are the most expensive vehicle types on average?

# **Let's use groupby to find the most expensive vehicle types on average**

# In[75]:

price_vtype = autos.groupby(["vehicle_type"])["price_usd"].mean().sort_values(ascending=False)
print(price_vtype)

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('price_usd vs vehicle_type')
sns.barplot(x=price_vtype.index, y=price_vtype.values,
            ci=None).set_xticklabels(labels=price_vtype.index, rotation=75)
sns.despine(left=True, bottom=True)

# **SUV? This is actually unexpected, but OK!**

# ## Analyzing gearbox: Which one is more expensive, automatic or manual?

# **Of course, using common sense, we'd all guess that automatic should be more expensive, but let's check...**

# In[76]:

# Calculate the average prices
print(autos.groupby(["gearbox"])["price_usd"].mean())

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('price_usd vs gearbox')
sns.barplot(x='gearbox', y='price_usd', data=autos, ci=None)
sns.despine(left=True, bottom=True)

# In[77]:

# Calculate the difference: automatic - manual
autos.groupby(["gearbox"])["price_usd"].mean()["automatic"] - autos.groupby(["gearbox"])["price_usd"].mean()["manual"]

# **Our guess is right: automatics are $6,256 more expensive on average!**

# ## Analyzing fuel_type: Which fuel types are the most expensive on average?

# This is going to be our last analysis in this project! Let's find the average price for each fuel type and sort in descending order.

# In[78]:

price_ftype = autos.groupby(["fuel_type"])["price_usd"].mean().sort_values(ascending=False)
print(price_ftype)

fig, ax = plt.subplots(1, 1, figsize=(15, 6))
plt.title('price_usd vs fuel_type')
sns.barplot(x=price_ftype.index, y=price_ftype.values,
            ci=None).set_xticklabels(labels=price_ftype.index, rotation=75)
sns.despine(left=True, bottom=True)

# **Electric cars are the most expensive, which is not surprising at all.**

# **DONE!!! Alright, this is it. Before we jump to the conclusion, let's recap what we have done in our data analysis.**
#
# **Data Analysis:**
# - Brand column
#     - Correlation between price and odometer_km (mileage) among top brands
# - Odometer_km (mileage) column
#     - Relationship between odometer_km (mileage) and price
#     - Split odometer_km into groups, and used aggregation to verify the correlation between price and mileage
# - Brand & model columns
#     - Most common brand/model combinations in this dataset
#     - Most common model among top brands
#     - Most expensive used cars
# - Brand & price_usd columns
#     - The priciest & cheapest car brands on average
# - Brand & turnaround_time_days columns
#     - Top 3 easiest & hardest cars to sell per brand (fastest & slowest turnaround time on average)
# - Turnaround_time_days & price_usd columns
#     - The correlation
# - unrepaired_damage column
#     - Price difference between damaged and non-damaged cars
# - vehicle_type column
#     - Average price of each vehicle type
# - gearbox column
#     - Average price of each gearbox category
# - fuel_type column
#     - Average price of each fuel type

# # `Conclusion & Insights`
# ---
#
# Well, that was fun, wasn't it? We have explored, cleaned, and analyzed the dataset, and now it's time to present our insights! **;D**
#
# - On average, cars are usually sold within 0 days, 2 days, or 4 days (in order from most common to least common).
# - If we want to buy a car considering only price and mileage, Ford has the best value: it is the second least expensive top brand and has the least mileage on average.
# - There is a correlation between price and mileage when we analyze the whole dataset: cars with higher mileage have lower prices, and vice versa.
# - The most common brand & model combination is the Volkswagen Golf, with 3,707 entries.
# - The top brands in this dataset are Volkswagen, BMW, Opel, Mercedes-Benz, Audi, and Ford.
# - Among the top brands, the most expensive is Audi and the cheapest is Opel.
# - Among the top brands, BMW has the most mileage on average and Ford has the least.
# - The most common models among the top brands are:
#     - Volkswagen Golf
#     - BMW 3er (3 Series)
#     - Opel Corsa
#     - Mercedes-Benz C-Klasse (C-Class)
#     - Audi A4
#     - Ford Focus
# - Top 3 most expensive cars:
#     - Porsche 911
#     - BMW Z Reihe (Z Series)
#     - BMW X Reihe (X Series)
# - Non-damaged cars are \\$4,922 more expensive on average than damaged cars.
# - The most expensive vehicle type on average is the SUV.
# - Cars with an automatic gearbox are \\$6,256 more expensive on average than cars with a manual gearbox.
# - On average, electric cars are the most expensive compared to cars with other fuel types.