We will be working on a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle. The version we are working with is a 50,000-row sample prepared by Dataquest, which also dirtied the data slightly to simulate a less-cleaned version.
A data dictionary describing each column is provided with the data.
The aim of this project is to clean the data and analyze the included used car listings. So let us begin by importing and reading the data.
import numpy as np #Importing Numpy
import pandas as pd #Importing Pandas
# Reading the CSV file using Pandas
auto = pd.read_csv(
    r'\Users\Surface GO\Downloads\Guided Project_ Exploring Ebay Car Sales Data\autos.csv',
    encoding='Latin-1')  # read_csv accepts the path directly; the Kaggle version of this file is commonly Latin-1 encoded
auto.head(2) # Checking the top 2 rows of the data set
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
From the initial observation, we can see that the format of the column names can be improved, so we will standardize them and make them more readable.
The first thing we have to do is read the existing column names of the dataframe.
# Reading the column names of the dataframe 'auto'
auto.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
We use the dataframe.rename() method to make the column names more readable.
The general format is:
dataframe.rename({'oldname':'new_name'},axis=1, inplace=True)
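As an aside, instead of spelling out every rename by hand, the camelCase headers could also be converted programmatically. The sketch below is hypothetical and not used in this project; note that it would produce fully spelled-out names such as year_of_registration rather than the shorter reg_year chosen below.
import re
# Hypothetical: insert an underscore before each interior capital letter, then lowercase,
# e.g. 'yearOfRegistration' -> 'year_of_registration'. Not applied to 'auto' here.
snake_cols = [re.sub(r'(?<!^)(?=[A-Z])', '_', col).lower() for col in auto.columns]
print(snake_cols)
In this project we instead map each name explicitly: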
auto.rename({'dateCrawled':'date_crawled',
'offerType':'offer_type',
'abtest':'ab_test',
'vehicleType':'vehicle_type',
'yearOfRegistration':'reg_year',
'powerPS':'power_ps',
'monthOfRegistration':'reg_month',
'fuelType':'fuel_type',
'notRepairedDamage':'not_repaired_damage',
'dateCreated':'date_created',
'nrOfPictures':'no_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'}, axis=1, inplace=True)
Now that we have changed the column names, we can check the overall information of the dataframe. To print a concise summary of a DataFrame we use dataframe.info().
auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   date_crawled         50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offer_type           50000 non-null  object
 4   price                50000 non-null  object
 5   ab_test              50000 non-null  object
 6   vehicle_type         44905 non-null  object
 7   reg_year             50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   power_ps             50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  reg_month            50000 non-null  int64
 13  fuel_type            45518 non-null  object
 14  brand                50000 non-null  object
 15  not_repaired_damage  40171 non-null  object
 16  date_created         50000 non-null  object
 17  no_of_pictures       50000 non-null  int64
 18  postal_code          50000 non-null  int64
 19  last_seen            50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
We can see that the column names have been updated. There are 50,000 entries and 20 columns in total. Five columns are of int64 dtype, while the rest are objects.
Now that we have updated the column names, we can look at the summary statistics using the describe() function with the include='all' parameter so that non-numeric columns are included as well; otherwise only the numeric (int64) columns would be summarized.
The format of the describe function is:
dataframe.describe(include='all')
auto.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer | reg_month | fuel_type | brand | not_repaired_damage | date_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-19 17:36:18 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the stats given by the describe() function we can make a few observations about missing data, columns we can drop, and mistakes in the values. Let us look into those observations now.
The total number of entries in each column should be 50,000, since that is how many rows the dataframe has, but several columns have missing values (a quick check after these observations quantifies them).
The seller column has 2 unique values and the top one, privat, appears 49,999 times, which means there is only one other value in that column. For most practical purposes we can drop this column from our analysis, but before doing that we can have a look at what the other value is and the row it belongs to.
Similarly, the offer_type column has 2 unique values, with Angebot being the top at 49,999. We will do the same inspection for this column as well.
The stats for the number-of-pictures column are all 0, which suggests every value in that column is zero. We will verify this, and if so, drop the column.
The reg_year column has some interesting values: the minimum registration year is 1000 and the maximum is 9999!
Unless cars existed in 1000 AD and sellers are listing cars from 9999 AD, these are obviously data-entry errors.
An internet search shows that even the most powerful production cars top out around 2,000 PS: the Bugatti Veyron Super Sport has about 1,200 PS and the Lotus Evija about 1,973 hp, roughly 2,000 PS. The maximum value of power_ps is 17,700, so there are outliers in that column as well that we need to check.
The price and odometer columns have units embedded in the values: price carries a $ sign and odometer values end in km. We need to remove the units and rename the columns so these values can be used numerically.
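To put numbers on the missing values mentioned in the first observation, a quick check (not part of the original notebook, but using only the renamed auto dataframe) is:
# Count missing values per column, most affected columns first
auto.isnull().sum().sort_values(ascending=False)
The counts should mirror the non-null figures from info() above, e.g. not_repaired_damage with 50000 - 40171 = 9829 missing values.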
So let us begin cleaning the data.
First we will remove the units from the price and odometer columns. To do this we will use the dataframe.str.replace() method.
Once we remove the non-numeric characters from the entries, we can convert the values to float type using astype(float).
# Removing '$' and ',' (replacing them with an empty string) and converting the column to float
auto['price'] = auto['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
# Checking the first 2 entries in 'Price' column
auto['price'].head(2)
0 5000.0 1 8500.0 Name: price, dtype: float64
We can see that the values in the price column are now of float type with all non-numeric characters removed. Now we can check the odometer column.
We do the same operations on the odometer column to remove the non-numeric parts, which are 'km' and ','.
# Removing 'km' and ',' (replacing them with an empty string) and converting the column to float
auto['odometer'] = auto['odometer'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(float)
# Checking the first 2 entries in 'odometer' column
auto['odometer'].head(2)
0 150000.0 1 150000.0 Name: odometer, dtype: float64
So we have successfully converted the odometer values into floats without the units.
Now we need to record the units in the column names instead. We use the dataframe.rename() method to do that.
# Renaming the price and odometer columns
auto.rename({'price':'price_$', 'odometer':'odometer_km'},axis=1,inplace=True)
auto.columns #Checking the updated column names
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_$', 'ab_test', 'vehicle_type', 'reg_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'reg_month', 'fuel_type', 'brand', 'not_repaired_damage', 'date_created', 'no_of_pictures', 'postal_code', 'last_seen'], dtype='object')
Now that we have cleaner data, we can look at the columns that don't add much value to the analysis.
During our initial observations we saw that the seller and offer_type columns have only 2 unique values each, and more than 99% of the entries share the same value. So let us check those columns.
#Checking for number of unique values in 'seller' column
auto['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
# Finding the data associated with the only different value in seller column.
auto.loc[auto['seller']=='gewerblich']
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7738 | 2016-03-15 18:06:22 | Verkaufe_mehrere_Fahrzeuge_zum_Verschrotten | gewerblich | Angebot | 100.0 | control | kombi | 2000 | manuell | 0 | megane | 150000.0 | 8 | benzin | renault | NaN | 2016-03-15 00:00:00 | 0 | 65232 | 2016-04-06 17:15:37 |
So we have found that there are two types of sellers, privat and gewerblich, which translate to private and commercial.
Since everything else in the single commercial listing looks like any other row, dropping that row, or the whole seller column, would make no difference to our analysis. So we can drop the seller column, along with any other columns like it.
Now let us look at the offer_type column and its values.
#Checking for number of unique values in 'offer_type' column
auto['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
In English, Angebot means offer and Gesuch means request.
So there is only one Gesuch, i.e. one request. Let's check that request.
# Finding the data associated with the only different value in 'offer_type' 'Gesuch'
auto.loc[auto['offer_type']=='Gesuch']
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17541 | 2016-04-03 15:48:33 | Suche_VW_T5_Multivan | privat | Gesuch | 0.0 | test | bus | 2005 | NaN | 0 | transporter | 150000.0 | 0 | NaN | volkswagen | NaN | 2016-04-03 00:00:00 | 0 | 29690 | 2016-04-05 15:16:06 |
Looks like the request is for a Volkswagen bus. Let us quickly check whether such a combination is available in the dataframe.
# Finding the number of Volkswagen buses in the given data set.
# We will count the number of buses that are Volkswagen.
auto.loc[(auto['vehicle_type']=='bus') & (auto['brand']=='volkswagen'),'vehicle_type'].value_counts()
bus 1400 Name: vehicle_type, dtype: int64
Since the single Gesuch row is no different from the other rows, we can keep it. At the same time, the offer_type column itself doesn't influence the analysis, so we can drop the column without affecting the results.
Now we can look into the mysterious zeros in the number-of-pictures column.
# Finding different unique values in no_of_pictures column
auto['no_of_pictures'].value_counts()
0 50000 Name: no_of_pictures, dtype: int64
So the result says all the values are 0 in the column and hence we can safely drop this column from analysis.
Now we have bigger mysteries to solve.
From our initial observation we saw some interesting registration years: the minimum was 1000 and the maximum was 9999. Now let's dig deeper into this mystery.
Let us first find out more about the different years present in the data.
# Finding the distribution of unique year values and segregating them using 'bins'
print(auto['reg_year'].value_counts(bins=5).sort_index())
# Findind sorted array of unique values
auto['reg_year'].sort_values().unique()
(991.0, 2799.8] 49982 (2799.8, 4599.6] 3 (4599.6, 6399.4] 7 (6399.4, 8199.2] 0 (8199.2, 9999.0] 8 Name: reg_year, dtype: int64
array([1000, 1001, 1111, 1500, 1800, 1910, 1927, 1929, 1931, 1934, 1937, 1938, 1939, 1941, 1943, 1948, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2800, 4100, 4500, 4800, 5000, 5911, 6200, 8888, 9000, 9996, 9999], dtype=int64)
Here we can clearly see that almost all of the years fall in the lowest bin. Since the first automobiles were built around 1885 and this data was crawled in 2016, any registration year far outside that window must be wrong.
So let's go ahead and count how many listings fall into various year ranges.
print("Number of Cars registered as between 1000 and 1885:", auto[auto['reg_year'].between(1000,1885)].shape[0])
print("Number of Cars registered as between 1885 and 1950: ", auto[auto['reg_year'].between(1885,1950)].shape[0])
print("Number of Cars registered as between 1950 and 1970: ", auto[auto['reg_year'].between(1950,1970)].shape[0])
print("Number of Cars registered as between 1970 and 1990: ", auto[auto['reg_year'].between(1970,1990)].shape[0])
print("Number of Cars registered as between 1990 and 2000: ", auto[auto['reg_year'].between(1990,2000)].shape[0])
print("Number of Cars registered as between 2000 and 2016: ", auto[auto['reg_year'].between(2000,2016)].shape[0])
print("Number of Cars registered as between 2016 and 2020: ", auto[auto['reg_year'].between(2016,2020)].shape[0])
print("Number of Cars registered as between 2020 and 9999: ", auto[auto['reg_year'].between(2020,9999)].shape[0])
# Percentage = (no. of rows of dataframe containing values between 1910 and 2016) / (no. of rows of whole dataframe)
print("Values within the range", auto[auto['reg_year'].between(1910,2016)].shape[0]/auto.shape[0]*100, "%")
Number of Cars registered as between 1000 and 1885: 6
Number of Cars registered as between 1885 and 1950:  27
Number of Cars registered as between 1950 and 1970:  250
Number of Cars registered as between 1970 and 1990:  1558
Number of Cars registered as between 1990 and 2000:  15839
Number of Cars registered as between 2000 and 2016:  34151
Number of Cars registered as between 2016 and 2020:  3264
Number of Cars registered as between 2020 and 9999:  18
Values within the range 96.056 %
Since 96% of the data lies within the reg_year range of 1910 to 2016, we can drop all the rows outside that range. Since reg_month is also given in the data, we can have a quick look at it as well.
Let's see if there are any errors in the reg_month column. In the initial observation we found that the minimum value is 0 and the maximum is 12. By convention 12 represents December, but what does a value of 0 mean? We will have to check that now.
# Finding the number of unique values in reg_month column
auto['reg_month'].value_counts().sort_index()
0     5075
1     3282
2     3008
3     5071
4     4102
5     4107
6     4368
7     3949
8     3191
9     3389
10    3651
11    3360
12    3447
Name: reg_month, dtype: int64
Interestingly, month 0 has the most entries of all. Let's have a look at it.
# Finding the first five entries with reg_month=0
auto.loc[auto['reg_month']==0].head()
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | 350.0 | control | NaN | 2000 | NaN | 0 | NaN | 150000.0 | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | 300.0 | control | bus | 1995 | manuell | 150 | 3er | 150000.0 | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | 0.0 | control | NaN | 2005 | NaN | 0 | NaN | 150000.0 | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
46 | 2016-03-31 10:53:28 | BMW_mit__Lpg | privat | Angebot | 9000.0 | control | NaN | 2005 | automatik | 0 | NaN | 150000.0 | 0 | NaN | bmw | NaN | 2016-03-31 00:00:00 | 0 | 12351 | 2016-04-06 03:44:41 |
52 | 2016-03-25 18:50:03 | Senator_A_3.0E_Karosserie_restauriert_m._viele... | privat | Angebot | 3500.0 | test | limousine | 1985 | NaN | 0 | andere | 5000.0 | 0 | benzin | opel | nein | 2016-03-25 00:00:00 | 0 | 63500 | 2016-04-07 00:46:00 |
After going through the rows with reg_month=0, we can see that most of the other fields in those rows look fine.
Since zero is the most common entry in this column, we can't simply drop those rows. A plausible explanation is that zero is the default option when selecting the month and most users left it unchanged. If we ever needed to treat month 0 as unknown, a minimal sketch follows.
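The sketch below is hypothetical and is not applied to auto in this project; it only shows how month 0 could be treated as missing (np is the numpy import from the top of the notebook).
import numpy as np
# Hypothetical: a copy of reg_month with 0 treated as missing; 'auto' itself is left unchanged
reg_month_clean = auto['reg_month'].replace(0, np.nan)
reg_month_clean.isnull().sum()  # would count the 5075 zero-month listings seen above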
During our initial observation we saw some extreme values for power, such as 17,700 PS. If these values were correct, the prices of those listings should also be well above the average for their brand. If we can find such a correlation between extreme power and price, the values might be genuine; otherwise they are simply bad data.
Now let us look at the power_ps column and inspect it further. First let's check the unique values and group them.
# Finding unique values of power_ps column and grouping it with bins parameter
auto['power_ps'].value_counts(bins=10).sort_index()
(-17.701, 1770.0]     49969
(1770.0, 3540.0]         16
(3540.0, 5310.0]          2
(5310.0, 7080.0]          4
(7080.0, 8850.0]          2
(8850.0, 10620.0]         1
(10620.0, 12390.0]        0
(12390.0, 14160.0]        1
(14160.0, 15930.0]        2
(15930.0, 17700.0]        3
Name: power_ps, dtype: int64
We can see that almost all of the values are below 1,770. The highest value, 17,700, could simply be a typing error that ended up with an extra zero. Let's look a bit more closely at the data corresponding to power above 2,500 PS.
print(auto[auto['power_ps']>2500].shape[0]) # Find the number of entries with power more than 2500
auto[auto['power_ps']>2500].head() # Find the first 5 rows with power more than 2500
17
date_crawled | name | seller | offer_type | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1699 | 2016-04-04 19:49:19 | Opel_Corsa_1.0_Motor_ecotek | privat | Angebot | 1200.0 | test | limousine | 2001 | manuell | 6512 | corsa | 150000.0 | 12 | benzin | opel | NaN | 2016-04-04 00:00:00 | 0 | 47198 | 2016-04-06 22:16:46 |
3753 | 2016-04-03 18:47:14 | VW_Polo_9n | privat | Angebot | 4700.0 | control | kleinwagen | 2009 | manuell | 6045 | polo | 125000.0 | 12 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 48565 | 2016-04-05 19:17:39 |
10659 | 2016-03-18 11:47:40 | VW_POLO_BASTLERFAHRZEUG | privat | Angebot | 100.0 | control | kleinwagen | 1996 | manuell | 3750 | polo | 125000.0 | 4 | benzin | volkswagen | ja | 2016-03-18 00:00:00 | 0 | 12489 | 2016-04-03 21:17:20 |
11009 | 2016-03-15 21:55:11 | Smart_fortwo_Silver_**Brabus_Style** | privat | Angebot | 2550.0 | control | kleinwagen | 2004 | automatik | 6226 | fortwo | 125000.0 | 5 | benzin | smart | nein | 2016-03-15 00:00:00 | 0 | 12359 | 2016-04-07 07:45:21 |
11311 | 2016-04-01 02:36:43 | Vw__zu__verkaufen__Caravelle_Confort_kurz_ | privat | Angebot | 8600.0 | control | andere | 2007 | automatik | 3500 | transporter | 150000.0 | 12 | NaN | volkswagen | NaN | 2016-04-01 00:00:00 | 0 | 30855 | 2016-04-07 05:17:11 |
We can see that there are only 17 entries with a power value above 2,500 PS.
Since this number is negligible and the other values in these rows look fine, we can keep them for the time being. If a later analysis focuses specifically on vehicle power, we will come back to these extremely high values.
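If a later, power-focused analysis does need to neutralize these outliers, one option (a sketch only, not applied here) is to mark implausible power values as missing instead of dropping whole rows:
import numpy as np
# Hypothetical: a copy of power_ps with values above 2500 PS treated as missing
power_ps_clean = auto['power_ps'].mask(auto['power_ps'] > 2500, np.nan)
power_ps_clean.describe()  # summary statistics without the 17 extreme entries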
We have figured out that we can drop the seller, offer_type and no_of_pictures columns without affecting the analysis, so let's go ahead and drop them.
Once that is done, we can drop the rows with implausible registration year values.
# Dropping 'seller','offer_type','no_of_pictures' columns
auto.drop(columns=['seller','offer_type','no_of_pictures'], inplace=True)
# Dropping the rows by including only the entries with reg_year between and including 1910 and 2016.
auto = auto.loc[(auto['reg_year']>=1910)&(auto['reg_year']<=2016)]
# Checking the updated auto dataframe
auto.describe(include='all')
date_crawled | name | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 48028 | 48028 | 4.802800e+04 | 48028 | 44903 | 48028.00000 | 45604 | 48028.000000 | 45560 | 48028.000000 | 48028.000000 | 44301 | 48028 | 39040 | 48028 | 48028.000000 | 48028 |
unique | 46375 | 37019 | NaN | 2 | 8 | NaN | 2 | NaN | 244 | NaN | NaN | 7 | 40 | 2 | 74 | NaN | 38102 |
top | 2016-03-22 09:51:06 | Ford_Fiesta | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | 2016-04-07 06:17:27 |
freq | 3 | 76 | NaN | 24756 | 12859 | NaN | 35560 | NaN | 3815 | NaN | NaN | 29248 | 10188 | 34255 | 1878 | NaN | 8 |
mean | NaN | NaN | 9.585252e+03 | NaN | NaN | 2002.80351 | NaN | 117.070417 | NaN | 125544.161739 | 5.767760 | NaN | NaN | NaN | NaN | 50935.867327 | NaN |
std | NaN | NaN | 4.843817e+05 | NaN | NaN | 7.31085 | NaN | 195.151278 | NaN | 40106.751417 | 3.696802 | NaN | NaN | NaN | NaN | 25792.079828 | NaN |
min | NaN | NaN | 0.000000e+00 | NaN | NaN | 1910.00000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 1067.000000 | NaN |
25% | NaN | NaN | 1.150000e+03 | NaN | NaN | 1999.00000 | NaN | 71.000000 | NaN | 100000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 30459.000000 | NaN |
50% | NaN | NaN | 2.990000e+03 | NaN | NaN | 2003.00000 | NaN | 107.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 49696.000000 | NaN |
75% | NaN | NaN | 7.400000e+03 | NaN | NaN | 2008.00000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 71665.000000 | NaN |
max | NaN | NaN | 1.000000e+08 | NaN | NaN | 2016.00000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 99998.000000 | NaN |
Now that we have fairly clean data, it is time to look more closely at the price column and learn more about it.
From the price statistics we know that the minimum price is 0 and the maximum is 100,000,000. That obviously raises the question: is someone really giving a car away for free, and does someone really have a car worth a hundred million dollars to sell? We need to figure that out, so let's get started.
#Grouping the price column with unique value of prices.
auto['price_$'].value_counts(bins=5000).sort_index()
(-100000.0, 20000.0]            45671
(20000.0, 40000.0]               2003
(40000.0, 59999.999]              219
(59999.999, 79999.999]             63
(79999.999, 99999.999]             21
                                ...
(99899999.001, 99919999.001]        0
(99919999.001, 99939999.001]        0
(99939999.001, 99959999.0]          0
(99959999.0, 99979999.0]            0
(99979999.0, 99999999.0]            1
Name: price_$, Length: 5000, dtype: int64
We can clearly see that most of the values lie between 0 and 20,000, and the second largest group falls in the 20,000-40,000 range. Beyond that there are a few extreme values that distort the summary statistics.
Now let us examine the top most expensive listings.
# Getting the listing based on the value of price in descending order
auto.sort_values('price_$',ascending=False).head()
date_crawled | name | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39705 | 2016-03-22 14:58:27 | Tausch_gegen_gleichwertiges | 99999999.0 | control | limousine | 1999 | automatik | 224 | s_klasse | 150000.0 | 9 | benzin | mercedes_benz | NaN | 2016-03-22 00:00:00 | 73525 | 2016-04-06 05:15:30 |
42221 | 2016-03-08 20:39:05 | Leasinguebernahme | 27322222.0 | control | limousine | 2014 | manuell | 163 | c4 | 40000.0 | 2 | diesel | citroen | NaN | 2016-03-08 00:00:00 | 76532 | 2016-03-08 20:39:05 |
47598 | 2016-03-31 18:56:54 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | 12345678.0 | control | limousine | 2001 | manuell | 101 | vectra | 150000.0 | 3 | benzin | opel | nein | 2016-03-31 00:00:00 | 4356 | 2016-03-31 18:56:54 |
24384 | 2016-03-21 13:57:51 | Schlachte_Golf_3_gt_tdi | 11111111.0 | test | NaN | 1995 | NaN | 0 | NaN | 150000.0 | 0 | NaN | volkswagen | NaN | 2016-03-21 00:00:00 | 18519 | 2016-03-21 14:40:18 |
2897 | 2016-03-12 21:50:57 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | 11111111.0 | test | limousine | 1973 | manuell | 48 | escort | 50000.0 | 3 | benzin | ford | nein | 2016-03-12 00:00:00 | 94469 | 2016-03-12 22:45:27 |
So it looks like the most expensive listings are mostly of the limousine type. A quick internet search shows that even the most expensive limousines sell for only a small fraction of the tens of millions of dollars listed here, even when brand new. The last car in the list above, the 1973 model, could command a premium because of its age and brand name, but nothing close to these figures.
As these high-value entries are only a few exceptions, we can discard them from the analysis.
From the binning we did, something around 400,000 is still an acceptable price for a genuinely expensive car, and the number of listings drops off sharply after the 20,000-40,000 range. So let us have a look at the entries just below that ceiling and confirm.
auto[auto['price_$'].between(250000,400000)].sort_values('price_$', ascending=False)
date_crawled | name | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36818 | 2016-03-27 18:37:37 | Porsche_991 | 350000.0 | control | coupe | 2016 | manuell | 500 | 911 | 5000.0 | 3 | benzin | porsche | nein | 2016-03-27 00:00:00 | 70499 | 2016-03-27 18:37:37 |
14715 | 2016-03-30 08:37:24 | Rolls_Royce_Phantom_Drophead_Coupe | 345000.0 | control | cabrio | 2012 | automatik | 460 | NaN | 20000.0 | 8 | benzin | sonstige_autos | nein | 2016-03-30 00:00:00 | 73525 | 2016-04-07 00:16:26 |
34723 | 2016-03-23 16:37:29 | Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au... | 299000.0 | test | coupe | 1977 | manuell | 260 | 911 | 100000.0 | 7 | benzin | porsche | nein | 2016-03-23 00:00:00 | 61462 | 2016-04-06 16:44:50 |
35923 | 2016-04-03 07:56:23 | Porsche_911_Targa_Exclusive_Edition__1_von_15_... | 295000.0 | test | cabrio | 2015 | automatik | 400 | 911 | 5000.0 | 6 | benzin | porsche | nein | 2016-04-03 00:00:00 | 74078 | 2016-04-03 08:56:20 |
12682 | 2016-03-28 22:48:01 | Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_... | 265000.0 | control | coupe | 2016 | automatik | 500 | 911 | 5000.0 | 3 | benzin | porsche | nein | 2016-03-28 00:00:00 | 70193 | 2016-04-05 03:44:51 |
47337 | 2016-04-05 10:25:38 | BMW_Z8_roadster | 259000.0 | test | cabrio | 2001 | manuell | 400 | z_reihe | 20000.0 | 6 | benzin | bmw | nein | 2016-04-05 00:00:00 | 61462 | 2016-04-05 12:07:32 |
38299 | 2016-03-28 22:25:25 | Glas_BMW_mit_Wasser | 250000.0 | test | NaN | 2015 | NaN | 0 | x_reihe | 5000.0 | 0 | NaN | bmw | NaN | 2016-03-28 00:00:00 | 60489 | 2016-03-28 22:25:25 |
The price increase here is gradual, and these are mostly recent Porsche, Rolls-Royce and BMW models, so these entries being very expensive is justified.
So we can conclude that we can discard the cars priced above $350,000 without affecting the data analysis.
Now let us look at the lower values including the one that says 0.
# Sorting the unique price values in descending order and displaying the top 20
auto['price_$'].value_counts().sort_index().head(20)
0.0     1335
1.0      150
2.0        2
3.0        1
5.0        2
8.0        1
9.0        1
10.0       6
11.0       2
12.0       3
13.0       2
14.0       1
15.0       2
17.0       3
18.0       1
20.0       4
25.0       5
29.0       1
30.0       6
35.0       1
Name: price_$, dtype: int64
# Displaying the entries in ascending order of price, starting from 0
auto.sort_values('price_$').head()
date_crawled | name | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
46729 | 2016-03-15 21:47:24 | Opel_Astra_Caravan_1_6_Njoy_zum_ausschlachten | 0.0 | control | kombi | 2003 | manuell | 84 | astra | 150000.0 | 7 | benzin | opel | ja | 2016-03-15 00:00:00 | 47226 | 2016-03-16 20:20:50 |
43824 | 2016-03-17 16:45:04 | Golf_4_____V5_1.8_Motor_150_ps | 0.0 | control | kleinwagen | 1999 | automatik | 150 | golf | 150000.0 | 0 | benzin | volkswagen | NaN | 2016-03-17 00:00:00 | 60528 | 2016-04-06 05:16:55 |
29947 | 2016-03-11 10:48:57 | Familie_sucht_einen_Kombi | 0.0 | test | kombi | 1999 | NaN | 0 | NaN | 150000.0 | 0 | NaN | sonstige_autos | NaN | 2016-03-11 00:00:00 | 37619 | 2016-03-16 17:47:08 |
29949 | 2016-03-31 08:43:49 | VW_Golf_IV_1.4_16v_75Ps_zum_herrichten_oder_sc... | 0.0 | control | NaN | 2016 | manuell | 75 | golf | 150000.0 | 0 | benzin | volkswagen | ja | 2016-03-30 00:00:00 | 31029 | 2016-03-31 08:43:49 |
16729 | 2016-04-04 15:52:41 | 1946_Chevrolet_Fleetmaster_/_Tausch_moeglich | 0.0 | test | NaN | 1960 | NaN | 0 | NaN | 100000.0 | 0 | NaN | chevrolet | NaN | 2016-04-04 00:00:00 | 16727 | 2016-04-06 17:44:46 |
There are 1,335 rows with a price of 0. It doesn't make much sense to analyze used car sales without prices, and 1,335 out of 50,000 is less than 3%, so we can omit them from our analysis.
Since eBay is an auction website, having $1 as the starting price can make sense, so we can keep those low values.
After going through the "free" listings and the "super expensive" listings, we have concluded to discard the zero prices and any price above 350,000.
So we will keep only the rows with prices between 1 and 350,000.
# Selecting and reassigning the auto dataframe with entries having price between 1 and 350,000
auto=auto[auto['price_$'].between(1,350000)]
auto.describe()
price_$ | reg_year | power_ps | odometer_km | reg_month | postal_code | |
---|---|---|---|---|---|---|
count | 46681.000000 | 46681.000000 | 46681.000000 | 46681.000000 | 46681.000000 | 46681.000000 |
mean | 5977.716801 | 2002.910756 | 117.892933 | 125586.855466 | 5.827125 | 51097.434181 |
std | 9177.909479 | 7.185103 | 184.922911 | 39852.528628 | 3.670300 | 25755.387192 |
min | 1.000000 | 1910.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1250.000000 | 1999.000000 | 75.000000 | 100000.000000 | 3.000000 | 30827.000000 |
50% | 3100.000000 | 2003.000000 | 109.000000 | 150000.000000 | 6.000000 | 49828.000000 |
75% | 7500.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71732.000000 |
max | 350000.000000 | 2016.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
We can now see that the minimum price is 1 and the maximum price is 350,000, so that looks fine now.
Let's now check the odometer_km column and find out whether it has anything to do with car price or sales. Let us start by getting its summary statistics.
# Finding the stats value and unique values in odometer column
print(auto['odometer_km'].describe())
auto['odometer_km'].value_counts()
count     46681.000000
mean     125586.855466
std       39852.528628
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

150000.0    30085
125000.0     4857
100000.0     2058
90000.0      1673
80000.0      1375
70000.0      1187
60000.0      1128
50000.0       993
40000.0       797
5000.0        785
30000.0       760
20000.0       742
10000.0       241
Name: odometer_km, dtype: int64
The values here look fine. They are rounded, probably because sellers choose from preset odometer options. If a later analysis shows a brand-new vehicle with a very high odometer reading, we will have to remove those rows; otherwise, for now, this data looks good.
Now the columns left to check are the date time columns.
# To list down all date time column given, together
auto[['date_crawled','date_created','last_seen']].head()
date_crawled | date_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
# To find the percentage values of number of entries under a particular date crawled in chronological order
auto['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
2016-03-05 0.025192 2016-03-06 0.014160 2016-03-07 0.036246 2016-03-08 0.033547 2016-03-09 0.033247 2016-03-10 0.032240 2016-03-11 0.032454 2016-03-12 0.036824 2016-03-13 0.015874 2016-03-14 0.036332 2016-03-15 0.034361 2016-03-16 0.029498 2016-03-17 0.031790 2016-03-18 0.012810 2016-03-19 0.034661 2016-03-20 0.038024 2016-03-21 0.037317 2016-03-22 0.032840 2016-03-23 0.032197 2016-03-24 0.029477 2016-03-25 0.031512 2016-03-26 0.032069 2016-03-27 0.030783 2016-03-28 0.034597 2016-03-29 0.034104 2016-03-30 0.033804 2016-03-31 0.031790 2016-04-01 0.033804 2016-04-02 0.035561 2016-04-03 0.038774 2016-04-04 0.036610 2016-04-05 0.013003 2016-04-06 0.003085 2016-04-07 0.001414 Name: date_crawled, dtype: float64
# To find the percentage values of number of entries on which the post is created in chronological order
auto['date_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.039009 2016-04-04 0.036953 2016-04-05 0.011782 2016-04-06 0.003170 2016-04-07 0.001264 Name: date_created, Length: 74, dtype: float64
# To find the percentage values of number of entries under last seen in chronological order
auto['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004113 2016-03-07 0.005377 2016-03-08 0.007476 2016-03-09 0.009768 2016-03-10 0.010690 2016-03-11 0.012382 2016-03-12 0.023757 2016-03-13 0.008654 2016-03-14 0.012660 2016-03-15 0.016002 2016-03-16 0.016281 2016-03-17 0.028084 2016-03-18 0.007219 2016-03-19 0.015617 2016-03-20 0.020629 2016-03-21 0.020587 2016-03-22 0.020844 2016-03-23 0.018359 2016-03-24 0.019687 2016-03-25 0.018937 2016-03-26 0.016795 2016-03-27 0.015638 2016-03-28 0.020694 2016-03-29 0.022086 2016-03-30 0.024614 2016-03-31 0.023628 2016-04-01 0.022943 2016-04-02 0.024657 2016-04-03 0.025149 2016-04-04 0.024121 2016-04-05 0.125404 2016-04-06 0.223324 2016-04-07 0.132752 Name: last_seen, dtype: float64
Nothing in the date distributions looks out of the ordinary. The crawl ran for just over a month. In the last couple of days the number of listings crawled drops while the 'last seen' counts rise, which probably reflects the end of the crawl rather than a surge in sales. The creation dates range from June 2015 to April 2016, close to 10 months.
With this information, let us go ahead and analyse the data.
Now that we have a data set with less ambiguity, let us see if we can create something useful to help with our analysis.
We want to find out whether there are any correlations between brand, price, distance travelled, vehicle age and so on. So let us first work out how old these vehicles are.
We know the data was crawled in 2016 and we have each vehicle's registration year, which gives us the age of the vehicle as of 2016.
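As a quick, optional sanity check (a sketch, not part of the original project), pandas' corr() gives a first look at linear relationships among the numeric columns that already exist at this point:
# Pairwise Pearson correlations between the existing numeric columns
auto[['price_$', 'odometer_km', 'reg_year', 'power_ps']].corr()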
# Finding the age of vehicles by subtracting it from 2016
# And adding it into a new column called 'age' in 'auto' dataframe.
auto['age']=2016-auto['reg_year']
auto.head(2)
date_crawled | name | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | postal_code | last_seen | age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 | 12 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 | 19 |
Now that we have the age of the vehicle, we can go ahead and find how many days it took for a vehicle to get sold. For this we can use the dates on which the post was created and last seen; the number of days between them is the time taken to sell.
In order to work with dates and times, we first need to convert the strings into the datetime datatype. Let's get to it.
# Selecting first 10 string elements which denotes the yyyy-mm-dd and converting it into datetime
crawled = pd.to_datetime(auto['date_crawled'].str[:10])
posted = pd.to_datetime(auto['date_created'].str[:10])
sold = pd.to_datetime(auto['last_seen'].str[:10])
time_to_sell = sold - posted # Finding how many days it took to sell
auto['time_to_sell'] = time_to_sell # Creating a new column in auto dataframe
auto['time_to_sell'].value_counts(bins=10).sort_index()
(-1 days +16:47:59.999999, 30 days 00:00:00]    45972
(30 days 00:00:00, 60 days 00:00:00]              697
(60 days 00:00:00, 90 days 00:00:00]                5
(90 days 00:00:00, 120 days 00:00:00]               3
(120 days 00:00:00, 150 days 00:00:00]              1
(150 days 00:00:00, 180 days 00:00:00]              0
(180 days 00:00:00, 210 days 00:00:00]              1
(210 days 00:00:00, 240 days 00:00:00]              1
(240 days 00:00:00, 270 days 00:00:00]              0
(270 days 00:00:00, 300 days 00:00:00]              1
Name: time_to_sell, dtype: int64
We can see that most of the listings were gone within the first 30 days, though a few items took far longer. Maybe those vehicles are too old or too expensive; we will take a quick look at those reasons later.
To make the durations easier to select and compare, let us convert them into plain integer values in a new column, so that, for example, 32 days is represented simply as 32.
To do that, we will convert the timedelta back to a string, split it, and convert the first part to int.
# Creating a new column to add int value of days
# First converting it into str type, then splits it and takes the first part of the string
# Converting this string into int data type
auto['time_to_sell_int']=auto['time_to_sell'].astype(str).str.split().str[0].astype(int)
auto.head(2)
date_crawled | name | price_$ | ab_test | vehicle_type | reg_year | gearbox | power_ps | model | odometer_km | reg_month | fuel_type | brand | not_repaired_damage | date_created | postal_code | last_seen | age | time_to_sell | time_to_sell_int | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 | 12 | 11 days | 11 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 | 19 | 2 days | 2 |
We can see that the auto dataframe now has three extra columns: the age, the time to sell as a timedelta, and its corresponding integer value.
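As a side note, the string round-trip is not strictly necessary: pandas timedeltas expose the day count directly via the .dt accessor, so an equivalent (and arguably more idiomatic) version of the previous step would be:
# Equivalent to the string-split approach above: extract the whole-day count from the timedelta
auto['time_to_sell_int'] = auto['time_to_sell'].dt.days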
Now let's move on to the most common car brands in the listings and run a few analyses on them to see how they perform and whether there are correlations between the given values and their sales.
# Listing down all unique brands along with their percentage in the list
auto['brand'].value_counts(normalize=True)
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
We can see that volkswagen, bmw, opel, mercedes_benz and audi are the top 5 brands from the list. So now we will create mean values from various columns and check against these brands to find any correlations.
We will check each brand against its mean price, mean odometer distance, mean age, and mean number of days to sell.
To find these means, we loop over the brand names, which we can get with the unique() function on the brand column.
Inside the for loop we select the rows belonging to each brand and compute the mean of each column of interest.
# Creating empty dictionaries to store brand based mean values
brand_price_dict={}
brand_distance_dict={}
brand_age_dict={}
brand_sell_time_dict={}
#Selecting all unique values of brand names
brand_names= auto['brand'].unique()
# for loop to calculate the mean value of selected columns.
for brand_name in brand_names:
#finding mean value
mean_price= auto.loc[auto['brand']==brand_name,'price_$'].mean() # mean value of price column
mean_distance=auto.loc[auto['brand']==brand_name,'odometer_km'].mean() # mean value of odometer column
mean_age=auto.loc[auto['brand']==brand_name,'age'].mean() # mean value of age column
mean_sell_time =auto.loc[auto['brand']==brand_name,'time_to_sell_int'].mean() # mean of time to sell int column
#adding to the dictionary
brand_price_dict[brand_name] = int(mean_price)
brand_distance_dict[brand_name] = int(mean_distance)
brand_age_dict[brand_name] = int(mean_age)
brand_sell_time_dict[brand_name] = int(mean_sell_time)
# Converting dictionary into pandas series for easier analysis
brand_mean_price_series = pd.Series(brand_price_dict)
brand_mean_distance_series = pd.Series(brand_distance_dict)
brand_mean_age_series = pd.Series(brand_age_dict)
brand_mean_sell_time_series=pd.Series(brand_sell_time_dict)
# Creating a new dataframe with brand and its corresponding mean values
brand_data = pd.DataFrame(brand_mean_price_series,columns=['mean_price'])
# adding more columns to the newly created dataframe
brand_data['mean_distance'] = brand_mean_distance_series
brand_data['mean_age'] = brand_mean_age_series
brand_data['mean_sell_time']=brand_mean_sell_time_series
brand_data['total_percentage']=auto['brand'].value_counts(normalize=True)*100
brand_data.head(10).sort_values('total_percentage', ascending=False)
mean_price | mean_distance | mean_age | mean_sell_time | total_percentage | |
---|---|---|---|---|---|
volkswagen | 5402 | 128707 | 13 | 8 | 21.126368 |
bmw | 8332 | 132572 | 12 | 9 | 11.004477 |
mercedes_benz | 8628 | 130788 | 13 | 9 | 9.646323 |
audi | 9336 | 129157 | 11 | 9 | 8.656627 |
ford | 3749 | 124266 | 13 | 8 | 6.989996 |
renault | 2474 | 128071 | 13 | 8 | 4.714980 |
peugeot | 3094 | 127153 | 12 | 8 | 2.984083 |
seat | 4397 | 121131 | 11 | 8 | 1.827296 |
smart | 3580 | 99326 | 10 | 9 | 1.415994 |
chrysler | 3465 | 132378 | 15 | 9 | 0.351321 |
Now we have a table of mean values sorted by the share each brand has in the listings. (Note that head(10) is applied before sorting, so this shows the first ten brands encountered rather than the ten most common; sorting first and then taking head(10) would give the true top ten.)
The mean values are broadly similar across brands, but Volkswagen appears almost twice as often as the next brand. Its mean price is lower than those of BMW, Mercedes-Benz and Audi and only higher than the fifth brand, Ford. So moderate pricing combined with otherwise similar characteristics could be the main reason Volkswagen is the most listed brand.
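The same per-brand averages can also be computed without an explicit loop. The sketch below uses groupby and should reproduce brand_data up to the int() truncation applied in the loop above:
# Mean of the selected columns for each brand, plus each brand's share of the listings
brand_summary = auto.groupby('brand')[['price_$', 'odometer_km', 'age', 'time_to_sell_int']].mean()
brand_summary['total_percentage'] = auto['brand'].value_counts(normalize=True) * 100
brand_summary.sort_values('total_percentage', ascending=False).head(10)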
When we computed the number of days it took to sell a vehicle, we saw some values as high as 300 days. Let's have a look at those hard-to-sell cars and find the possible reasons behind them.
print(brand_data['mean_price'][['mercedes_benz','peugeot','opel','subaru','smart']])
auto[auto['time_to_sell_int']>60].sort_values(
'time_to_sell_int',ascending=False)[['brand','age','price_$','odometer_km','time_to_sell']]
mercedes_benz    8628
peugeot          3094
opel             2975
subaru           3953
smart            3580
Name: mean_price, dtype: int64
brand | age | price_$ | odometer_km | time_to_sell | |
---|---|---|---|---|---|
22781 | mercedes_benz | 2 | 47900.0 | 20000.0 | 300 days |
20649 | peugeot | 10 | 5200.0 | 80000.0 | 239 days |
34883 | opel | 3 | 9500.0 | 50000.0 | 209 days |
2243 | peugeot | 7 | 5499.0 | 70000.0 | 149 days |
27986 | subaru | 19 | 900.0 | 150000.0 | 109 days |
36993 | smart | 12 | 2400.0 | 90000.0 | 98 days |
2232 | volkswagen | 10 | 14499.0 | 90000.0 | 95 days |
24851 | mercedes_benz | 8 | 15490.0 | 150000.0 | 88 days |
39270 | opel | 19 | 1099.0 | 150000.0 | 68 days |
15081 | audi | 11 | 5800.0 | 150000.0 | 63 days |
49743 | bmw | 9 | 7399.0 | 100000.0 | 62 days |
36429 | mercedes_benz | 11 | 9900.0 | 150000.0 | 61 days |
A 2-year-old Mercedes-Benz tops the list at 300 days to get sold! It is definitely not the age or the brand of the vehicle that stood in the way of its sale; it must be the high price. The mean price for mercedes_benz is 8,628, while the listed price here is 47,900, more than five times higher.
If we compare the list prices with the mean prices, most of these slow sellers are priced well above the mean for their brand. Being expensive to buy is probably why these cars took so long to sell.
The Subaru, with a list price of 900 against a brand mean of 3,953, is the opposite case; but it is 19 years old, so the low price and the slow sale are both explained by its age.
So when cars don't sell, they are either too expensive compared to the brand mean or too old to be attractive. Buyers have cheaper and newer alternatives available, and when the price is too high or the model too old, our data suggests it takes at least two months to make a sale.
In a similar way, let us look at the fastest-selling cars.
# Sorting the dataframe based on time_to_sell_int in ascending order
# And listing down a few selected columns for analysis.
print(auto['time_to_sell'].value_counts(bins=10).sort_index())
# Printing the mean value of selected brands
print(brand_data['mean_price'][['jeep','ford','volkswagen','audi','bmw','smart']])
# sorting the dataframe based on time_to_sell and listing selected columns
auto.sort_values('time_to_sell_int')[
['brand','age','price_$','odometer_km','time_to_sell','date_created','last_seen','date_crawled']].head(20)
(-1 days +16:47:59.999999, 30 days 00:00:00]    45972
(30 days 00:00:00, 60 days 00:00:00]              697
(60 days 00:00:00, 90 days 00:00:00]                5
(90 days 00:00:00, 120 days 00:00:00]               3
(120 days 00:00:00, 150 days 00:00:00]              1
(150 days 00:00:00, 180 days 00:00:00]              0
(180 days 00:00:00, 210 days 00:00:00]              1
(210 days 00:00:00, 240 days 00:00:00]              1
(240 days 00:00:00, 270 days 00:00:00]              0
(270 days 00:00:00, 300 days 00:00:00]              1
Name: time_to_sell, dtype: int64

jeep          11650
ford           3749
volkswagen     5402
audi           9336
bmw            8332
smart          3580
Name: mean_price, dtype: int64
brand | age | price_$ | odometer_km | time_to_sell | date_created | last_seen | date_crawled | |
---|---|---|---|---|---|---|---|---|
47828 | jeep | 10 | 7450.0 | 150000.0 | 0 days | 2016-03-11 00:00:00 | 2016-03-11 10:51:00 | 2016-03-11 10:51:00 |
47283 | ford | 10 | 10299.0 | 125000.0 | 0 days | 2016-03-12 00:00:00 | 2016-03-12 19:52:59 | 2016-03-12 19:52:59 |
17561 | volkswagen | 26 | 2999.0 | 150000.0 | 0 days | 2016-03-17 00:00:00 | 2016-03-17 19:42:57 | 2016-03-17 19:42:57 |
17562 | audi | 22 | 1400.0 | 150000.0 | 0 days | 2016-03-17 00:00:00 | 2016-03-17 11:38:40 | 2016-03-17 11:38:40 |
17569 | bmw | 19 | 1500.0 | 150000.0 | 0 days | 2016-03-29 00:00:00 | 2016-03-29 10:56:52 | 2016-03-29 10:56:52 |
17570 | smart | 0 | 1799.0 | 125000.0 | 0 days | 2016-03-12 00:00:00 | 2016-03-12 17:47:59 | 2016-03-12 17:47:59 |
17572 | volkswagen | 16 | 1450.0 | 150000.0 | 0 days | 2016-03-31 00:00:00 | 2016-03-31 11:40:14 | 2016-03-31 10:37:53 |
41300 | volkswagen | 11 | 2900.0 | 150000.0 | 0 days | 2016-04-01 00:00:00 | 2016-04-01 18:40:41 | 2016-04-01 17:59:18 |
47286 | citroen | 12 | 2700.0 | 90000.0 | 0 days | 2016-03-20 00:00:00 | 2016-03-20 22:39:51 | 2016-03-20 22:39:51 |
41299 | bmw | 21 | 1590.0 | 150000.0 | 0 days | 2016-04-06 00:00:00 | 2016-04-06 15:36:32 | 2016-04-06 15:36:32 |
17581 | nissan | 13 | 4200.0 | 150000.0 | 0 days | 2016-03-29 00:00:00 | 2016-03-29 17:53:11 | 2016-03-29 17:53:11 |
41293 | volkswagen | 11 | 2990.0 | 150000.0 | 0 days | 2016-04-02 00:00:00 | 2016-04-02 14:44:05 | 2016-04-02 13:56:42 |
5610 | mercedes_benz | 17 | 5900.0 | 150000.0 | 0 days | 2016-04-01 00:00:00 | 2016-04-01 09:48:37 | 2016-04-01 08:53:31 |
41288 | mazda | 10 | 9750.0 | 40000.0 | 0 days | 2016-04-05 00:00:00 | 2016-04-05 11:40:52 | 2016-04-05 10:53:05 |
17602 | volkswagen | 36 | 5900.0 | 90000.0 | 0 days | 2016-03-30 00:00:00 | 2016-03-30 19:38:33 | 2016-03-30 19:38:33 |
5593 | opel | 19 | 450.0 | 150000.0 | 0 days | 2016-03-30 00:00:00 | 2016-03-30 20:46:03 | 2016-03-30 20:46:03 |
41282 | hyundai | 4 | 8000.0 | 60000.0 | 0 days | 2016-03-25 00:00:00 | 2016-03-25 15:45:39 | 2016-03-25 15:45:39 |
5590 | peugeot | 8 | 7900.0 | 60000.0 | 0 days | 2016-03-19 00:00:00 | 2016-03-19 09:49:41 | 2016-03-19 08:56:56 |
47295 | volkswagen | 9 | 3200.0 | 150000.0 | 0 days | 2016-03-28 00:00:00 | 2016-03-28 19:53:23 | 2016-03-28 19:53:23 |
17611 | skoda | 15 | 1550.0 | 150000.0 | 0 days | 2016-03-24 00:00:00 | 2016-03-24 14:39:15 | 2016-03-24 13:58:58 |
Here we can see that many listings disappeared within a very short span of time, some on the same day they were created. There are two possible reasons: either the listing was removed by the seller, or the offer was good enough to sell almost immediately.
From the given data we have no way of knowing whether a listing was simply deleted, so let us check whether the offers were unusually good.
To do that we can compute the difference between the listed price and the mean price for the same brand.
# Printing the top 10 best time to sell
print(auto['time_to_sell_int'].value_counts().head(10).sort_index(), '\n')
# Creating a for loop to check the first top 20 entries with fastest time_to_sell
# Vehicles sold on DAY 0
for i in range(20):
name_of_brand=auto.sort_values('time_to_sell_int').iloc[i]['brand'] # Finding the name of the brand in each row
list_price = auto.sort_values('time_to_sell_int').iloc[i]['price_$'] # Finding the list price in each row
mean_price = brand_data.loc[name_of_brand]['mean_price'] # Finding the mean price of any given brand
difference = mean_price-list_price # Finding the diff between mean and list price
print(difference, ":", name_of_brand)
0    6473
1    2671
2    4846
3    2341
4    3481
5    1579
6    2597
7    1589
8    2227
9    1717
Name: time_to_sell_int, dtype: int64

4200.0 : jeep
-6550.0 : ford
2403.0 : volkswagen
7936.0 : audi
6832.0 : bmw
1781.0 : smart
3952.0 : volkswagen
2502.0 : volkswagen
1079.0 : citroen
6742.0 : bmw
543.0 : nissan
2412.0 : volkswagen
2728.0 : mercedes_benz
-5638.0 : mazda
-498.0 : volkswagen
2525.0 : opel
-2635.0 : hyundai
-4806.0 : peugeot
2202.0 : volkswagen
4818.0 : skoda
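As an aside, the same mean-minus-list-price difference can be computed for every listing at once instead of row by row; a sketch (the price_diff column is hypothetical and not used elsewhere in this project):
# Map each listing's brand to the brand's mean price, then subtract the listed price
auto['price_diff'] = auto['brand'].map(brand_data['mean_price']) - auto['price_$']
# The 20 fastest sellers with how far below (positive) or above (negative) the brand mean they were priced
auto.sort_values('time_to_sell_int')[['brand', 'price_$', 'price_diff', 'time_to_sell_int']].head(20)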
# Vehicles sold on DAY 1
auto1=auto[auto['time_to_sell_int']==1].sort_values('time_to_sell')
# Creating a for loop to check the first top 20 entries with fastest time_to_sell
# Vehicles sold on DAY 1
for i in range(20):
name_of_brand=auto1.sort_values('time_to_sell_int').iloc[i]['brand'] # Finding the name of the brand in each row
list_price = auto1.sort_values('time_to_sell_int').iloc[i]['price_$'] # Finding the list price in each row
mean_price = brand_data.loc[name_of_brand]['mean_price'] # Finding the mean price of any given brand
difference = mean_price-list_price # Finding the diff between mean and list price
print(difference, ":", name_of_brand)
auto1.sort_values('time_to_sell')[
['brand','age','price_$','odometer_km','time_to_sell','date_created','last_seen','date_crawled']].head(20)
-6562.0 : sonstige_autos
5212.0 : volkswagen
2513.0 : fiat
4952.0 : volkswagen
1324.0 : renault
2267.0 : toyota
1465.0 : chrysler
1975.0 : opel
1928.0 : mercedes_benz
-1975.0 : opel
3428.0 : mercedes_benz
2385.0 : opel
3299.0 : ford
2695.0 : mitsubishi
-4848.0 : volkswagen
3112.0 : volkswagen
2485.0 : opel
7378.0 : mercedes_benz
2529.0 : citroen
2035.0 : opel
brand | age | price_$ | odometer_km | time_to_sell | date_created | last_seen | date_crawled | |
---|---|---|---|---|---|---|---|---|
15 | sonstige_autos | 34 | 18900.0 | 80000.0 | 1 days | 2016-04-01 00:00:00 | 2016-04-02 21:10:48 | 2016-04-01 12:06:20 |
127 | volkswagen | 21 | 190.0 | 30000.0 | 1 days | 2016-03-21 00:00:00 | 2016-03-22 20:18:58 | 2016-03-21 15:49:37 |
174 | fiat | 16 | 300.0 | 150000.0 | 1 days | 2016-03-19 00:00:00 | 2016-03-20 15:47:12 | 2016-03-19 13:53:19 |
176 | volkswagen | 19 | 450.0 | 150000.0 | 1 days | 2016-03-13 00:00:00 | 2016-03-14 13:28:22 | 2016-03-13 16:22:15 |
199 | renault | 17 | 1150.0 | 150000.0 | 1 days | 2016-03-12 00:00:00 | 2016-03-13 00:45:34 | 2016-03-12 08:54:25 |
225 | toyota | 19 | 2900.0 | 150000.0 | 1 days | 2016-03-15 00:00:00 | 2016-03-16 05:18:46 | 2016-03-15 09:57:37 |
275 | chrysler | 15 | 2000.0 | 100000.0 | 1 days | 2016-03-07 00:00:00 | 2016-03-08 10:44:31 | 2016-03-07 18:41:56 |
295 | opel | 0 | 1000.0 | 150000.0 | 1 days | 2016-03-28 00:00:00 | 2016-03-29 23:18:11 | 2016-03-28 03:36:22 |
123 | mercedes_benz | 13 | 6700.0 | 150000.0 | 1 days | 2016-03-21 00:00:00 | 2016-03-22 20:16:50 | 2016-03-21 15:46:24 |
333 | opel | 11 | 4950.0 | 125000.0 | 1 days | 2016-03-23 00:00:00 | 2016-03-24 07:40:54 | 2016-03-23 23:37:23 |
363 | mercedes_benz | 10 | 5200.0 | 125000.0 | 1 days | 2016-04-04 00:00:00 | 2016-04-05 12:22:58 | 2016-04-04 18:52:13 |
385 | opel | 18 | 590.0 | 150000.0 | 1 days | 2016-03-22 00:00:00 | 2016-03-23 22:45:32 | 2016-03-22 14:56:32 |
402 | ford | 11 | 450.0 | 150000.0 | 1 days | 2016-03-05 00:00:00 | 2016-03-06 03:45:35 | 2016-03-05 21:49:41 |
408 | mitsubishi | 16 | 699.0 | 150000.0 | 1 days | 2016-03-15 00:00:00 | 2016-03-16 20:21:44 | 2016-03-15 21:49:23 |
411 | volkswagen | 7 | 10250.0 | 100000.0 | 1 days | 2016-03-17 00:00:00 | 2016-03-18 00:41:13 | 2016-03-17 22:38:14 |
413 | volkswagen | 11 | 2290.0 | 150000.0 | 1 days | 2016-03-21 00:00:00 | 2016-03-22 07:42:14 | 2016-03-22 00:51:35 |
498 | opel | 19 | 490.0 | 150000.0 | 1 days | 2016-03-16 00:00:00 | 2016-03-17 20:30:29 | 2016-03-16 08:51:04 |
334 | mercedes_benz | 16 | 1250.0 | 150000.0 | 1 days | 2016-03-14 00:00:00 | 2016-03-15 04:45:49 | 2016-03-14 09:51:24 |
507 | citroen | 0 | 1250.0 | 150000.0 | 1 days | 2016-03-08 00:00:00 | 2016-03-09 12:15:17 | 2016-03-08 13:45:53 |
562 | opel | 19 | 940.0 | 125000.0 | 1 days | 2016-03-05 00:00:00 | 2016-03-06 04:45:46 | 2016-03-05 20:36:42 |
# Vehicles sold on DAY 9
auto9=auto[auto['time_to_sell_int']==9].sort_values('time_to_sell')
# Creating a for loop to check the first 20 entries with the fastest time_to_sell
for i in range(20):
    name_of_brand = auto9.iloc[i]['brand']                    # Finding the name of the brand in each row
    list_price = auto9.iloc[i]['price_$']                     # Finding the list price in each row
    mean_price = brand_data.loc[name_of_brand]['mean_price']  # Finding the mean price of the brand
    difference = mean_price - list_price                      # Finding the diff between mean and list price
    print(difference, ":", name_of_brand)
auto9.sort_values('time_to_sell')[
['brand','age','price_$','odometer_km','time_to_sell','date_created','last_seen','date_crawled']].head(20)
Compared with Day 0 and Day 1, Day 9 has more negative differences. A couple of these negative values correspond to fairly new vehicles, so the higher list price is justified there.
If the list price is below the brand's mean price by at least $2,000, the chances of selling the vehicle quickly are higher.
The age of the vehicle also plays a crucial role: if the vehicle was registered recently, the list price will be higher and it takes longer to sell.
If the vehicle is old and its price is not noticeably below the mean price, it is difficult to sell. But if the price is at least $2,000 below the mean, there is a high chance of a quick sale.
If the age is more than 40 years or so, the antiquity of the vehicle combined with its brand value can push the price up while keeping the sale time short. In such cases the data can behave differently.
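To make the $2,000 rule of thumb above more concrete, here is a minimal sketch (not part of the original cells) that adds each listing's discount relative to its brand's mean price and compares the average time to sell. It assumes, as in the loops above, that brand_data is indexed by brand name with a mean_price column and that time_to_sell_int holds the number of days a listing stayed online.
# Sketch: does listing at least $2,000 below the brand's mean price shorten the time to sell?
# Brands missing from brand_data become NaN and simply drop out of the comparison.
auto['discount_vs_brand_mean'] = auto['brand'].map(brand_data['mean_price']) - auto['price_$']
discounted = auto[auto['discount_vs_brand_mean'] >= 2000]  # Listed well below the brand mean
the_rest = auto[auto['discount_vs_brand_mean'] < 2000]     # Listed near or above the brand mean
print(discounted['time_to_sell_int'].mean(), ": mean days to sell when discounted by $2,000 or more")
print(the_rest['time_to_sell_int'].mean(), ": mean days to sell otherwise")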
Now let us have a look at age alone and its influence.
# Segregating the data based on age of the car
auto['age'].value_counts(bins=10).sort_index()
We can see that the majority of the vehicles are between 10 and 20 years old, followed by vehicles below 10 years. But there are some interesting values: the vehicles above 85 years old. Let us have a look at them.
# Finding the cars that are more than 84 years old, sorting them in the ascending order of price
# And listing various columns in auto dataframe for better analysis.
auto[auto['age']>84].sort_values('price_$')[['brand','age','price_$','odometer_km','time_to_sell','date_created','last_seen']]
From the data we have, we can't really tell whether the registration years of these vehicles are genuine.
They could be, since these manufacturers started over 100 years ago. But if these vehicles really were around 100 years old and in sellable condition, the listed prices wouldn't make any sense, because genuine vintage cars can cost a lot of money. So those 106-year claims are most probably wrong, and we can remove them from the analysis.
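If we do exclude them, a simple filter on the age column is enough; this is only a sketch, and the 84-year cut-off mirrors the filter used above.
# Keeping only listings whose claimed age is at most 84 years (hypothetical clean-up step)
auto_no_antiques = auto[auto['age'] <= 84]
auto_no_antiques['age'].max() # Confirming the oldest remaining vehicle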
Now let us look at the new vehicles in town!
First we will try to find the ones that are least expensive.
# Finding the cars with age less than 5 years, sorting in ascending order of the price
# And listing a few columns of the auto dataframe
auto[auto['age']<5].sort_values('price_$').head(10)[['brand','age','price_$','odometer_km','time_to_sell']]
The cars registered in 2016, driven for less than a year, are listed for just $1. They would also have had to cover roughly 410 km every day for 365 days to reach the 150,000 km mark. Since these values don't add up, the information provided in these cases is probably wrong.
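As a rough sanity check (a sketch, not one of the original cells), we can count how many listings combine a 2016 registration with the maximum odometer reading:
# Flagging near-new cars that already show 150,000 km on the odometer (hypothetical check)
suspect = auto[(auto['age'] < 1) & (auto['odometer_km'] >= 150000)]
print(len(suspect), "listings registered in 2016 already claim 150,000 km")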
Now let us look at the new and expensive collection.
# Finding the cars with age less than 5 years, sorting in descending order of the price, most expensive first
# And listing a few columns of the auto dataframe
auto[auto['age']<5].sort_values('price_$', ascending=False).head(10)[['brand','age','price_$','odometer_km','time_to_sell']]
This collection makes more sense. A nearly new Porsche with 5,000 km on it, priced at $350,000, was sold on the same day it was listed. Now let us check whether such a high price can be justified.
# Gathering the mean values of some of the brands listed above from brand_data array
brand_data[['mean_price', 'mean_age','mean_distance']].loc[['porsche','bmw','mercedes_benz']]
The list price is much higher than the mean price. But when we look at the mean age and mean distance travelled for these brands, the price of the new cars can be justified.
So if a car is new, carries brand value and has covered a reasonable number of kilometres, it is still in high demand and will sell quickly even if its price is above the brand's mean.
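One rough way to probe this claim (a sketch; the choice of premium brands below is an assumption, not something established earlier) is to compare the average time to sell of new cars from those brands against new cars from all other brands:
# Comparing sale speed of new premium-brand cars against other new cars (hypothetical brand list)
premium = ['porsche', 'bmw', 'mercedes_benz', 'audi']
new_cars = auto[auto['age'] < 5]
print(new_cars[new_cars['brand'].isin(premium)]['time_to_sell_int'].mean(), ": premium brands")
print(new_cars[~new_cars['brand'].isin(premium)]['time_to_sell_int'].mean(), ": other brands")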
Now let us look at vehicles that are between 5 and 10 years old. First we will go through the cheapest listings.
# Finding the cars with age between 5 and 10 years, sorting in ascending order of the price, cheapest first
# And listing a few columns of the auto dataframe
auto[auto['age'].between(5,10)].sort_values('price_$').head(10)[['brand','age','price_$','odometer_km','time_to_sell']]
In these rows, the details do match except for the price. Since it is an auction site, let us assume that some lucky buyer got an 8-year-old Opel with 150,000 km on it for $1 because no one else showed interest in it for about 5 days. We would need more data to clarify such cases, though.
With that conclusion, let's now check the 5-10 year old cars with more expensive price tags.
# Finding the cars with age between 5 and 10 years, sorting in Descending order of the price, higher ones first
# And listing a few columns of the auto dataframe
auto[auto['age'].between(5,10)].sort_values('price_$', ascending=False).head(10)[
['brand','age','price_$','odometer_km','time_to_sell']]
We can clearly see that vehicles with some years on them cannot be expected to sell quickly if they are priced well above their brand's mean. In this list, 80% of the cars took more than 10 days to sell. As long as buyers have cheaper options with similar parameters, they will choose those instead.
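That 80% figure can be checked directly with a short sketch over the same ten listings:
# Share of the 10 most expensive 5-10 year old cars that took more than 10 days to sell
top10 = auto[auto['age'].between(5,10)].sort_values('price_$', ascending=False).head(10)
(top10['time_to_sell_int'] > 10).mean()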
The dataset scraped from German eBay for used cars has been cleaned and analysed. There are some anomalies in the data, and in certain cases the data we have is insufficient to reach a firm conclusion. Excluding those exceptions, we have come to a few conclusions from the given data.
They are: