#!/usr/bin/env python
# coding: utf-8

# # Exploring Used Car Sales Data from German eBay

# # Introduction
#
# This is a Guided Project from one of the DataQuest.io courses. It is a data cleaning project in which we work with a web-scraped dataset from the German eBay website, more specifically the eBay Kleinanzeigen section.
#
# The dataset provided by DataQuest was altered to make the job a little more difficult. The original, cleaner version was uploaded to Kaggle by the user orgesleka but is no longer available there; it can still be found [here](https://data.world/data-society/used-cars-data).
#
# We are going to clean the data and answer some questions that come up along the way.
#
# ## Column Summary
# The content of the data is in German, so it has to be translated first if one cannot speak German. The following fields are included:
#
# * dateCrawled : when this ad was first crawled; all field values are taken from this date
# * name : "name" of the car
# * seller : whether the seller is private or a dealer
# * offerType : the type of listing
# * price : the price on the ad to sell the car
# * abtest : whether the listing is included in an A/B test
# * vehicleType : the type of car
# * yearOfRegistration : the year in which the car was first registered
# * gearbox : transmission type
# * powerPS : power of the car in PS
# * model : the car model name
# * kilometer : how many kilometers the car has driven
# * monthOfRegistration : the month in which the car was first registered
# * fuelType : the type of fuel the car uses
# * brand : brand of the car
# * notRepairedDamage : whether the car has damage that has not been repaired yet
# * dateCreated : the date on which the eBay ad was created
# * nrOfPictures : number of pictures in the ad
# * postalCode : postal code for the location of the vehicle
# * lastSeenOnline : when the crawler last saw this ad online
#
# The fields lastSeen and dateCrawled can be used to estimate how long a car stays online before it is sold (a quick sketch of this appears at the end of the date-cleaning section below).
#
# ## Objective
# The aim of this project is to clean the data and analyze the included used car listings.

# ## 1 Importing libraries and reading the data

# In[1]:


# Import the necessary libraries
import numpy as np
import pandas as pd


# In[2]:


# Read the file with the csv reader, specifying the encoding
autos = pd.read_csv("autos.csv", encoding="Latin-1")


# In[3]:


# View the first 5 rows of the dataset
autos.head()


# In[4]:


# View the last 5 rows of the dataset
autos.tail()


# ## 2 Exploring the data for insights
# Since everything has worked fine so far, we are going to take a closer look at the entries, looking for possible mistakes that could lead to a wrong analysis.

# In[5]:


# Show summary information about the dataframe
autos.info()


# The dataset contains 50,000 rows and 20 columns; a few columns have null values, 5 columns are of integer dtype, 15 are of object dtype, and the column names use camel case.
#
# We need to convert the column names from camel case to snake case and rewrite some of them to be more easily readable.
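# As a side note (not part of the original guided project), the camel-case-to-snake-case conversion could also be done programmatically with a small regex helper instead of typing the new names by hand. The `camel_to_snake` function below is only an illustrative sketch; names with consecutive capitals (e.g. "powerPS") would still need a manual touch-up, which is why we rename by hand in the next section.

# In[ ]:


import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter (except at the start), then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Example: camel_to_snake("yearOfRegistration") -> "year_of_registration",
# but camel_to_snake("powerPS") -> "power_p_s", so manual renaming is still useful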
# ## 3 Rename columns
#
# With the help of the dataset dictionary, we are going to change them to more descriptive snake case names.

# In[6]:


# Print the autos dataframe columns
autos.columns


# In[7]:


# Rename a few columns using a dictionary, as the instructions suggest
autos.rename(columns={"yearOfRegistration": "registration_year",
                      "monthOfRegistration": "registration_month",
                      "notRepairedDamage": "unrepaired_damage",
                      "dateCreated": "ad_created"}, inplace=True)


# In[8]:


# Check the changed columns
autos.columns


# In[9]:


# Change the names to more meaningful snake case ones
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
                 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
                 'odometer', 'registration_month', 'fuel_type', 'brand',
                 'unrepaired_damage', 'ad_created', 'n_of_pictures', 'postal_code',
                 'last_seen']


# In[10]:


# Check the result
autos.columns


# In this section we changed the column names from camel case to snake case (lowercase with underscores) and gave some columns more meaningful names.

# ## 4 Checking for duplicates
#
# Now that we have fixed the column names, we are going to check the data one more time to see if there are any duplicates (or other things that deserve our attention).

# In[11]:


# Print descriptive statistics for all columns
autos.describe(include="all")


# From the statistical description of the dataset it is clear that the price and odometer columns are stored as text but should be numeric, so we have to convert them.
#
# The n_of_pictures column contains only zeros, so it has no importance for further analysis.
#
# We also have to look into the registration_year, registration_month and postal_code columns.
#
# The power_ps column shows a big difference between its minimum and maximum values.
#
# On top of that, the seller and offer_type columns are almost constant, with only 2 unique values each.

# In[12]:


# Let's see how many unique values the seller column has
autos["seller"].unique().shape


# In[13]:


# Show the unique values
autos["seller"].unique()


# In[14]:


# Let's see the number of unique values in the offer_type column
autos["offer_type"].unique().shape


# In[15]:


# Show the unique values
autos["offer_type"].unique()


# Translating these words, we see that a "gewerblich" seller is a commercial seller and a "gesuch" offer type is a wanted/requested listing. This seller type is not relevant (it is just one entry), and the same goes for the offer type: just one entry, and not what we are looking for.
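# To back up the "just one entry" observation with actual counts, a quick frequency check can be added (an extra cell, not part of the original guided project):

# In[ ]:


print(autos["seller"].value_counts())
print()
print(autos["offer_type"].value_counts())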
autos["price"].value_counts().sort_index(ascending = False).head() # In[20]: # Show the highest and lowest value from price column highest_value = autos["price"].value_counts().sort_index(ascending = False).head() lowest_value = autos["price"].value_counts().sort_index(ascending = True).head() print("higher price frequency:\n{} \n\nlower price frquency:\n{}" .format(highest_value, lowest_value)) # In[21]: # Convert odometer column and rename column as odometer_km autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","").astype(float) autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True) # In[22]: # show the variations in values for odometer column autos["odometer_km"].value_counts() # though value_counts show all values so we can use head to show few values autos["odometer_km"].value_counts().head() # though value_counts return series, #we can use series.sort_index with ascending=True/False to view the highest # or lowest values with their counts, can use head() as well. print("Higher values ", autos["odometer_km"].value_counts().sort_index(ascending = False).head()) print("\n") print("Lower values ", autos["odometer_km"].value_counts().sort_index(ascending = True).head()) # In[23]: # Show the statistical details of odometer_km column autos["odometer_km"].describe() # # 6 Missing values & Outlier detection and removal # In[24]: autos.isnull().sum() # In[25]: missing_value_columns = ["vehicle_type", "gearbox", "model", "fuel_type", "unrepaired_damage"] # In[26]: for col in missing_value_columns: pct_missing = np.mean(autos[col].isnull()) print('{} - {}%'.format(col, round(pct_missing*100))) # The missing value percengtage is very low, less than 20%, we can neglect that # but only unrepaired_damage column can impute with mode. # In[27]: # fill the categorical column null values with mode autos['unrepaired_damage'] = autos['unrepaired_damage'].fillna(autos['unrepaired_damage'].mode()[0]) # Fill rest of the columns null values with zero autos.fillna(0, inplace=True) # Show the datasets null view amount autos.isnull().sum() # In[28]: # Outlier detection Q1=autos["price"].quantile(0.25) Q3=autos["price"].quantile(0.75) IQR=Q3-Q1 low_bound=Q1 - 1.5 * IQR upper_bound=Q3 + 1.5 * IQR print(low_bound,upper_bound) # From outlier detection we determined lower bound and upper bound of dataset, above or below that range data is called outlier. We will remove that oulier and keep the data in between lowerbound and upperbound. Lets see interquartile price without outlier. # In[29]: IQR_price=autos.price[(autos.price > low_bound) & (autos.price < upper_bound)] IQR_price.shape # In[30]: IQR_price.describe() # In[31]: # removing the price value zeros and count from 1 to 16333 autos = autos[autos['price'].between(1,16333)] autos['price'].describe() # # 7 Date columns cleaning # # There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary: # # - `date_crawled`: added by the crawler # - `last_seen`: added by the crawler # - `ad_created`: from the website # - `registration_month`: from the website # - `registration_year`: from the website # # Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. 
# # 7 Date columns cleaning
#
# There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
#
# - `date_crawled`: added by the crawler
# - `last_seen`: added by the crawler
# - `ad_created`: from the website
# - `registration_month`: from the website
# - `registration_year`: from the website
#
# Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.
#
# Now we are going to see how the date columns are formatted, how the data behaves, and whether we need to drop any rows or correct any wrong entries.
# Let's start by printing our date columns:

# In[32]:


# Print the date columns for the first rows
autos.loc[:5, ['date_crawled', 'last_seen', 'ad_created', 'registration_month', 'registration_year']]


# As we can see above, the 'registration_month' and 'registration_year' columns are already numeric, so let's analyze their content first.

# In[33]:


# Check whether there are months greater than 12
autos['registration_month'].unique()


# Everything seems normal, so let's look at the year column.

# In[34]:


# Check the unique values of the year column
autos['registration_year'].unique()


# We can clearly see that there are years that make no sense at all, like 9000 or 4800, or any year greater than 2018, since 2016 is the year the data was crawled (maybe we can use some of this data; let's see below what percentage it is). It also makes no sense to keep values below 1900, so they will be removed too. With that in mind, let's drop these rows.

# In[35]:


autos = autos.loc[autos['registration_year'].between(1900, 2018)]
autos.head()


# Next we will work on the still unformatted date columns. To include the missing values in the distribution and to use percentages instead of counts, we chain the Series.value_counts(normalize=True, dropna=False) method.

# In[36]:


# The following 3 columns are stored as unstructured strings; we will convert them to a structured format
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]


# The first 10 characters represent the day (e.g. 2016-03-12). To understand the date range, we can extract just the date values, use Series.value_counts() to generate a distribution, and then sort by the index.
#
# To select the first 10 characters in each column, we can use Series.str[:10]:

# In[37]:


print(autos['date_crawled'].str[:10])


# In[38]:


# Convert the string columns to datetime
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'])  # the time of day is parsed as well; only the date is used later
autos['ad_created'] = pd.to_datetime(autos['ad_created'])
autos['last_seen'] = pd.to_datetime(autos['last_seen'])


# ### Exploring the 'date_crawled' column:

# In[39]:


print("Counts of unique values for 'date_crawled' column, in percentages:", '\n',
      autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index().head(10) * 100)


# No anomalies in the 'date_crawled' column (something one would expect from a column created by the crawler).

# ### Exploring the 'ad_created' column:

# In[40]:


print("Counts of unique values for 'ad_created' column, in percentages:", '\n',
      autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index().head(10) * 100)


# There are only 5 ads created in 2015, which can hardly be representative. It is probably an artifact of how the data was crawled; to confirm that, the documentation of the original dataset would have to be reviewed.

# ### Exploring the 'last_seen' column

# In[41]:


print("Counts of unique values for 'last_seen' column, in percentages:", '\n',
      autos['last_seen'].dt.date.value_counts(normalize=True, dropna=False).sort_index() * 100)


# The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it is unlikely that there was a massive spike in sales; more likely these values are related to the crawling period ending and do not indicate car sales.
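# As mentioned in the introduction, date_crawled and last_seen can be used to estimate how long an ad stayed online before disappearing (usually because the car was sold). Below is a minimal sketch of that idea; the `ad_online_days` name is ours, and ads that were still online when crawling ended are capped at the last crawl date:

# In[ ]:


ad_online_days = (autos["last_seen"] - autos["date_crawled"]).dt.days
ad_online_days.describe()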
# ### What does the registration year tell us?

# In[42]:


autos["registration_year"].describe()


# Since we already restricted registration_year to the 1900-2018 range earlier, there is no longer any doubtful data in this column.

# ### Dropping irrelevant data

# In[43]:


autos.head(2)


# In[44]:


autos.loc[:, ["seller", "offer_type", "n_of_pictures"]].head()


# As these columns contain only zeros or a single repeated value, we can remove them from the main dataframe; they are not important for further analysis.

# In[45]:


autos.drop(["seller", "offer_type", "n_of_pictures"], axis=1, inplace=True)
autos.head()


# In[46]:


autos["registration_year"].value_counts(normalize=True)


# After restricting registration_year to 1900-2018, the value counts above show how the registration years are distributed.

# ## 8 Aggregation
# Now that we have done most of the cleaning, we are going to aggregate the listings by car brand. This will bring more clarity to the dataset.

# In[47]:


# Identify the unique values, as we want to aggregate by car brand
autos["brand"].unique()


# In[48]:


# Check the frequencies of the car brands
brands = autos["brand"].value_counts()
brands.head()


# In[49]:


# Compute the brand frequencies as percentages
brands_percentage = (brands / autos["brand"].shape[0]) * 100
# Print the top 10
brands_percentage.head(10)


# Looking at the data above we selected the top 10 brands, which means the brands we are going to aggregate are: volkswagen (21.62%), opel (11.61%), bmw (10.22%), mercedes_benz (8.76%), audi (7.71%), ford (7.27%), renault (5.14%), peugeot (3.16%), fiat (2.80%) and, last, seat (1.97%).
#
# Now we'll make dictionaries to store the mean price and mean mileage of each brand.

# In[50]:


# Choose our top 10 brands in a list
auto_brands = ['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi',
               'ford', 'renault', 'peugeot', 'fiat', 'seat']


# In[51]:


# Create empty dictionaries to hold the aggregate data:
# mean price and mean mileage per brand
brand_mean_price = {}
brand_mean_mileage = {}


# In[52]:


# Populate the dictionaries
for brand in auto_brands:
    mean_price = autos.loc[autos['brand'] == brand, 'price'].mean()
    mean_mileage = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
    brand_mean_price[brand] = mean_price
    brand_mean_mileage[brand] = mean_mileage

print("brand_mean_price:\n\n", brand_mean_price)
print("\nbrand_mean_mileage:\n\n", brand_mean_mileage)


# As we can see above, the most expensive car brands are audi, bmw and mercedes_benz, with mean prices over 5,000 dollars, followed by volkswagen and seat with prices between 3,500 and 4,000 dollars. The least expensive are ford, fiat, peugeot and renault.
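# The same per-brand aggregation can also be done in one step with groupby instead of looping over a dictionary. This is only an alternative sketch, not part of the original guided project; it is restricted to the top-10 list chosen above:

# In[ ]:


autos[autos["brand"].isin(auto_brands)].groupby("brand")[["price", "odometer_km"]].mean()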
# Below we are going to combine the price and mileage values in a single dataframe so we can compare both pieces of information. Also, brand_mean_price will be abbreviated to bmp and brand_mean_mileage to bmm.

# In[53]:


# Turn the dictionaries into Series
bmp_series = pd.Series(brand_mean_price)
bmm_series = pd.Series(brand_mean_mileage)


# In[54]:


# Turn both series into a dataframe
brand_df = pd.DataFrame(bmp_series, columns=['mean_price'])
brand_df['mean_mileage'] = bmm_series
brand_df


# As we can see above, there is only a loose relationship between how much mileage a car has (at least in the mean values) and its price: the most expensive brands are also the ones with the highest mean mileage, except opel, which has a high mean mileage but not such a high price. Renault also has a high mileage compared to its mean price.
# This data could lead to many more insights with more hours of work.

# ## 9 More data cleaning
# ### Identify categorical data that uses German words, translate them and map the values to their English counterparts
# Four columns contain German words: vehicle_type, gearbox, fuel_type and unrepaired_damage. Let's map all of them to English.

# In[55]:


# Print all the column labels
autos.columns


# In[56]:


# Find the German words in the vehicle_type column
autos["vehicle_type"].unique()


# In[57]:


# Translate the German values to English by mapping
vehicle_type = autos["vehicle_type"]
map_vehicle = {
    "limousine": "limousine",
    "kleinwagen": "small car",
    "kombi": "combi",
    "bus": "bus",
    "cabrio": "convertible",
    "coupe": "coupe",
    "suv": "suv",
    "andere": "other"
}
autos["vehicle_type"] = pd.Series(vehicle_type).map(map_vehicle)


# In[58]:


# Unique values of the gearbox column
autos["gearbox"].unique()


# In[59]:


# Translate the German values to English by mapping
gear_box = autos["gearbox"]
map_gearbox = {
    "manuell": "manual",
    "automatik": "automatic",
}
autos["gearbox"] = pd.Series(gear_box).map(map_gearbox)


# In[60]:


# Unique values of the fuel_type column
autos["fuel_type"].unique()


# In[61]:


# Translate the German values to English by mapping
fuel_type = autos["fuel_type"]
map_fuel_type = {
    "benzin": "petrol",
    "diesel": "diesel",
    "lpg": "lpg",
    "cng": "cng",  # compressed natural gas
    "hybrid": "hybrid",
    "andere": "other",
    "elektro": "electric",
}
autos["fuel_type"] = pd.Series(fuel_type).map(map_fuel_type)


# In[62]:


# Unique values of the unrepaired_damage column
autos["unrepaired_damage"].unique()


# In[63]:


# Translate the German values to English by mapping
damage_type = autos["unrepaired_damage"]
map_damage_type = {
    "nein": "no",
    "ja": "yes",
}
autos["unrepaired_damage"] = pd.Series(damage_type).map(map_damage_type)


# ### Convert the dates to uniform numeric data, so "2016-03-21" becomes the integer 20160321

# In[64]:


# Convert the date_crawled column to a numeric type
autos["date_crawled"] = autos["date_crawled"].astype(str).str[:10].str.replace("-", "").astype(int)
print(autos["date_crawled"].head())


# In[65]:


# Convert the ad_created column to a numeric type
autos["ad_created"] = autos["ad_created"].astype(str).str[:10].str.replace("-", "").astype(int)
print(autos["ad_created"].head())


# In[66]:


# Convert the last_seen column to a numeric type
autos["last_seen"] = autos["last_seen"].astype(str).str[:10].str.replace("-", "").astype(int)
print(autos["last_seen"].head())


# ### See if there are particular keywords in the name column that can be extracted as new columns

# In[67]:


autos["name"].unique()


# In[68]:


autos["name"].value_counts().head()


# In[69]:


autos["name"] = autos["name"].str.replace("_", ",")


# In[70]:


print(autos["name"].head().str.split())


# ##### Thus, the name column could be split into a model-name column and an engine or other specification column using extract and split commands (a sketch follows below).
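# A minimal sketch of how such a split could look (the new column names `name_brand` and `name_spec` are only illustrative, not part of the guided project): after the underscores were replaced with commas above, the first comma-separated token is usually the brand and the rest is the model / specification text.

# In[ ]:


name_parts = autos["name"].str.split(",", n=1, expand=True)
autos["name_brand"] = name_parts[0]
autos["name_spec"] = name_parts[1]  # rows without a comma get None here
autos[["name", "name_brand", "name_spec"]].head()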
# ### Find the most common brand/model combinations

# In[71]:


# Group the dataset by brand and model
brand_mod_group = autos.groupby(["brand", "model"])
# Find the counts and sort them in descending order
brand_model = brand_mod_group["date_crawled"].count().sort_values(ascending=False)
print(brand_model[:10])


# ### Split odometer_km into groups, and use aggregation to see if average prices follow any pattern based on the mileage

# In[72]:


# Split the mileage into groups
df_by_km = autos.groupby(['odometer_km'])
df_by_km


# In[75]:


# Sort the groups by mean price
df_by_km['price'].mean().sort_values(ascending=False)


# We can clearly observe a pattern: the higher the mileage, the lower the mean price, with the exception of the outlier mileage of 5000. Cars with a mileage between 10,000 and 40,000 km have the highest mean prices, whereas cars with a mileage above 125,000 km have the lowest.

# ### How much cheaper are cars with damage than their non-damaged counterparts?

# In[76]:


autos['unrepaired_damage'].unique()


# In[77]:


df_by_damage = autos.groupby(['unrepaired_damage'])
df_by_damage


# In[78]:


df_by_damage['price'].mean().sort_values(ascending=False)


# In the code above we calculated the mean prices for cars with and without unrepaired damage. As we can see, cars with no damage sell for roughly twice the price of damaged cars.

# # Conclusion
#
# The top brands by listings:
# * Volkswagen (21.21%)
# * BMW (11.00%)
# * Opel (10.82%)
# * Mercedes Benz (9.54%)
# * Audi (8.64%)
# * Ford (6.98%)
#
# Germans clearly love local automakers, with German brands taking the top 5 spots and making up about 61% of all listings.
#
# We can see that Mercedes Benz vehicles are by far the most expensive among the top brands, on average costing three times as much as the second most expensive brand, Audi. However, Mercedes Benz still makes up nearly 10% of all vehicles in the dataset, which indicates that many people are trying to sell their Mercedes Benz vehicles.
#
# More research would be needed to state whether the demand for Mercedes Benz vehicles is greater or less than the supply.
#
# We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate that there is not a strong correlation between price and popularity. However, more research would be required to determine whether or not this is true.
#
# There is no apparent relationship between the mean price and mean mileage among vehicles of the top brands.

# In[ ]: