#!/usr/bin/env python
# coding: utf-8

# # Cleaning and Exploring eBay Car Sales Data

# In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset can be found here: [data](https://data.world/data-society/used-cars-data).
#
# Let's start by importing the libraries we need and reading the dataset into pandas.

# In[1]:

import numpy as np
import pandas as pd

# In[2]:

autos = pd.read_csv('autos_dq.csv', encoding='Windows-1252')

# In[3]:

autos.info()

# In[4]:

autos.head()

# ### Initial Observations

# The initial observations are the following:
#
# - The dataset contains 20 columns, most of which are strings.
# - Some columns have null values, but none have more than ~20% null values.
# - The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
#
# We will now convert the column names from camelcase to snakecase and reword some of them to be more descriptive.

# ### Column Renaming

# In[5]:

autos.columns

# In[6]:

autos.rename({
    'dateCrawled': 'date_crawled',
    'offerType': 'offer_type',
    'vehicleType': 'vehicle_type',
    'yearOfRegistration': 'registration_year',
    'powerPS': 'power_ps',
    'monthOfRegistration': 'registration_month',
    'fuelType': 'fuel_type',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'nrOfPictures': 'nr_of_pictures',
    'postalCode': 'postal_code',
    'lastSeen': 'last_seen'
}, axis=1, inplace=True)

# In[7]:

autos.columns

# In[8]:

autos.head()

# In[9]:

autos.describe(include='all')

# After this initial review of the data, it seems that a few columns hold mostly a single value and can probably be dropped from further analysis: seller and offer_type.

# ### Deleting Irrelevant Columns

# In[10]:

autos = autos.drop(columns=['seller', 'offer_type'])

# We also investigate a few more columns.
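# As an aside on the renaming step above: the notebook maps each column by hand, but the camelcase-to-snakecase conversion could also be generated programmatically with a small regex helper. This is an illustrative sketch, not part of the original approach; note that hand-tuned names such as 'powerPS' would still need a manual mapping.

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter (except at the
    # start of the string), then lowercase the whole name
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# e.g. camel_to_snake('yearOfRegistration') -> 'year_of_registration'
```

# Caveat: this naive rule splits every capital, so 'powerPS' would become 'power_p_s' rather than the desired 'power_ps' — hence the explicit mapping used above.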
# In[11]:

autos.model.value_counts()

# In[12]:

autos.brand.value_counts()

# The data stored in the brand and model columns seems to be clean. Now let's explore the price column.

# In[13]:

autos.price.describe()

# In[14]:

autos.price.value_counts()

# ### Conversion of Text Data into Numeric

# The price column is stored as text. Let's remove the non-numeric characters ($ and the thousands separator), convert the values to numeric, and then rename the column to reflect the currency.

# In[15]:

autos["price"] = autos["price"].str.replace('$', '', regex=False)
autos["price"] = autos["price"].str.replace(',', '', regex=False).astype(int)

# In[16]:

autos.rename({"price": "price_usd"}, axis=1, inplace=True)

# The same applies to the odometer column. Let's quickly perform the same operation for this column and also change its name.

# In[17]:

autos.odometer.value_counts()

# In[18]:

autos["odometer"] = autos["odometer"].str.replace('km', '', regex=False)
autos["odometer"] = autos["odometer"].str.replace(',', '', regex=False).astype(int)

# In[19]:

autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)

# ### Removing the Outliers

# Let's continue exploring the data, specifically looking for values that don't look right. We'll analyze the odometer_km and price_usd columns and look for outliers.

# In[20]:

autos.price_usd.value_counts().sort_index(ascending=False).head(20)

# In[21]:

autos.price_usd.value_counts().sort_index(ascending=True).head(30)

# In[22]:

autos.loc[autos['price_usd'] == 350000]

# It seems that the price column includes many outliers that should be removed. It's fair to say that every entry priced above 350,000 USD can be removed (we checked the entries at 350,000 and 345,000, which are a Porsche and a Rolls-Royce, so their prices still seem plausible).
#
# The same applies to any car sold for a price equal to or lower than 50 USD. Let's remove them!
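# The chained str.replace calls above assume the price strings always look like '$1,234'. A slightly more defensive alternative (an illustrative sketch on a hypothetical sample series, not the real column) strips every non-digit character in one regex pass and coerces unparseable values to NaN instead of raising:

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'price' column (made-up values)
prices = pd.Series(['$5,000', '$1,300', '$350,000'])

# Remove everything that is not a digit, then convert;
# errors='coerce' turns anything unparseable into NaN instead of raising
cleaned = pd.to_numeric(prices.str.replace(r'[^\d]', '', regex=True),
                        errors='coerce')
```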
# In[23]:

autos = autos[(autos["price_usd"] > 50) & (autos["price_usd"] < 350000)]

# Now let's do the same exercise with the odometer_km column.

# In[24]:

autos.odometer_km.value_counts().sort_index(ascending=False).head(20)

# In[25]:

autos.odometer_km.value_counts().sort_index(ascending=True).head(30)

# It seems that the odometer_km column does not need further cleaning, as the data looks clean and does not include any outliers.

# ### Working with Date & Time

# In[26]:

autos.info()

# There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
#
# - `date_crawled`: added by the crawler
# - `last_seen`: added by the crawler
# - `ad_created`: from the website
# - `registration_month`: from the website
# - `registration_year`: from the website
#
# Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas, so we need to convert them into a representation we can analyze quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.

# In[27]:

autos[['date_crawled', 'ad_created', 'last_seen']][0:5]

# In[28]:

autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

# The most popular day for the ad crawl was 2016-04-03, with a 3.86% share.

# In[29]:

autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

# The most popular day for ad creation was 2016-04-03, with a 3.88% share.

# In[30]:

autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

# The most popular day in the last_seen column was 2016-04-06, with a 2.21% share.
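# Slicing the first 10 characters treats the dates as plain strings. An alternative sketch (on hypothetical timestamps in the same format, not the real columns) parses them into proper datetime values first, which also unlocks date arithmetic and resampling:

```python
import pandas as pd

# Hypothetical timestamps in the same format as date_crawled / last_seen
dates = pd.Series(['2016-04-03 11:25:17', '2016-03-12 16:06:22'])

parsed = pd.to_datetime(dates)  # real datetime64 values instead of strings

# Per-day distribution, analogous to the .str[:10].value_counts() approach
day_share = parsed.dt.date.value_counts(normalize=True)
```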
# After exploring the ad crawl, ad creation, and last seen columns, let's explore the registration_year column.

# In[31]:

autos.registration_year.describe()

# The registration_year column includes some odd values, with a minimum of 1000 (before cars were invented) and a maximum of 9999 (many years into the future).
#
# Because a car can't be first registered after its listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult; realistically, it could be somewhere in the first few decades of the 1900s.

# In[32]:

autos = autos[autos['registration_year'].between(1900, 2016)]

# In[33]:

autos.registration_year.describe()

# In[34]:

autos.registration_year.value_counts(normalize=True)

# After this further cleaning, removing any cars claimed to be registered before 1900 or after 2016, we have 46423 remaining cars.
#
# The single most common registration year is 2000, accounting for 6.7% of the cars.

# ### Car Brands Analysis with Aggregation

# Now let's explore variations across different car brands using aggregation.

# In[35]:

autos['brand'].describe()

# In[36]:

autos['brand'].value_counts()

# We will select the top 10 most common brands for further analysis.

# In[37]:

top_brands = autos['brand'].value_counts().index[:10]
top_brands

# Now we will aggregate the data using a loop and a dictionary. The process looks as follows:
#
# - Create an empty dictionary to store our aggregate data
# - Loop over the unique values, and for each:
#   - Subset the dataframe by the unique value
#   - Calculate the mean of whichever column we're interested in
#   - Assign the value/mean to the dictionary as key/value.
#
# We will use the process described above to calculate the mean price for the top 10 brands in the dataset.
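# The loop-and-dictionary steps above can be sketched on a toy dataframe (made-up values, not the real dataset); the same aggregation can also be written as a single pandas groupby call, shown for comparison:

```python
import pandas as pd

# Toy frame standing in for `autos` (made-up values)
toy = pd.DataFrame({'brand': ['audi', 'bmw', 'audi'],
                    'price_usd': [9000, 8400, 9800]})

# Dictionary-and-loop version, mirroring the steps listed above
mean_by_brand = {}
for b in toy['brand'].unique():
    subset = toy[toy['brand'] == b]                # subset by the unique value
    mean_by_brand[b] = subset['price_usd'].mean()  # assign brand -> mean

# Equivalent one-liner with groupby
mean_by_brand_gb = toy.groupby('brand')['price_usd'].mean().to_dict()
```

# The explicit loop makes each step visible, which suits a learning notebook; groupby is the idiomatic shortcut once the pattern is familiar.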
# In[38]:

mean_price_brand = {}

for b in top_brands:
    brand_rows = autos[autos['brand'] == b]
    mean = brand_rows['price_usd'].mean()
    mean_price_brand[b] = mean

# In[39]:

mean_price_brand

# In[40]:

# Sort the dictionary by value
dict(sorted(mean_price_brand.items(), key=lambda item: item[1]))

# We aggregated across brands to understand the mean price of the cars. From among the top 10 (most frequent) brands, audi has the highest average price (9376 USD), followed by mercedes_benz (8664 USD) and bmw (8378 USD).
#
# Ford and Seat were less expensive, with Volkswagen in between.

# ### Data Aggregation - Average Mileage of the Car and the Mean Price

# For the top 10 brands, let's use aggregation to understand the average mileage for those cars and whether there's any visible link with the mean price. To do so, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly, using the pandas Series constructor.

# In[41]:

autos.info()

# In[42]:

mean_mileage_brand = {}

for b in top_brands:
    brand_rows = autos[autos['brand'] == b]
    mean = brand_rows['odometer_km'].mean()
    mean_mileage_brand[b] = mean

# In[43]:

mean_mileage_brand

# We calculated the mean mileage for the top 10 brands and stored the results in a dictionary. Now we will convert both dictionaries into Series objects, using the Series constructor.

# In[44]:

mpb_series = pd.Series(mean_price_brand)
print(mpb_series)

# In[45]:

mmb_series = pd.Series(mean_mileage_brand)
print(mmb_series)

# Now we will create a dataframe from the first series object using the DataFrame constructor.
# Then we will assign the other series as a new column in this dataframe.

# In[46]:

price_miles_df = pd.DataFrame(mpb_series, columns=['mean_price'])

# In[47]:

price_miles_df['mean_mileage'] = mmb_series

# In[48]:

price_miles_df

# In[49]:

price_miles_df.describe()

# In[50]:

price_miles_df['mean_price'].corr(price_miles_df['mean_mileage'])

# There are no huge differences in average mileage within the top 10 brands in the dataset, with the standard deviation being below 5000 km. There is also no meaningful correlation between the two values.

# ### Conclusions

# Within this project, we mainly performed several data cleaning steps on the eBay cars dataset. As for the analysis itself, among the top 10 brands, three stood out with the highest average prices (audi, mercedes_benz, and bmw). Also, the mean mileage did not appear to influence the average price of a car.

# In[ ]: