In this guided project, we will clean and explore car sales data from eBay Kleinanzeigen, a classifieds section of the German eBay website. The goal of this project is to demonstrate our ability to use and understand the functionality of Pandas by working with Series and DataFrame objects, as well as our ability to recognize issues and anomalies in our dataset.
The original data can be found here. It has been trimmed to a smaller size (50,000 data points) and dirtied to better suit the purpose of this project.
The data dictionary provided with the data is as follows:

dateCrawled - When this ad was first crawled. All field values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing.
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeen - When the crawler saw this ad last online.
#importing the pandas and numpy libraries
import pandas as pd
import numpy as np
#reading the .csv file as a dataframe
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The number of non-null entries listed under each column tells us which parts of the data set we'll likely need to focus on cleaning. As we know from before, this data has 50,000 entries, so any column with fewer than 50,000 non-null entries contains some null values, which can prove difficult when we get to analysis.
The columns where we see this pattern are:
vehicleType
gearbox
model
fuelType
notRepairedDamage
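The same pattern can be surfaced programmatically rather than read off the info() output by hand. A minimal sketch on a toy frame (the column names stand in for the real data set's columns):

```python
import pandas as pd

# isnull().sum() counts missing values per column; any column with a
# nonzero count is one we'll need to handle before analysis.
df = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine", None],
    "price": ["$5,000", "$8,500", "$8,990", "$4,350"],
})
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])  # only the columns with missing values
```

On the full data set, the same two lines would list exactly the five columns above.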
Additionally, there are a few columns that may benefit from being a different data type. For example, the price column is currently an object type, but it would likely be more useful as an int64 data type. That way we can do things like sort the rows by price.
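To see why the dtype matters, here is a small sketch with made-up prices: strings sort lexicographically, character by character, so object-typed prices don't sort by numeric value.

```python
import pandas as pd

# String prices sort by characters, not by value.
prices = pd.Series(["$5,000", "$850", "$8,990"])
print(prices.sort_values().tolist())  # lexicographic, not numeric

# After stripping '$' and ',' and casting to int, sorting is numeric.
nums = (prices.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
              .astype(int))
print(nums.sort_values().tolist())  # [850, 5000, 8990]
```

Note that in the string sort, "$850" ends up after "$8,990" simply because ',' compares lower than '5'.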
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Another problem with our data is that our column names are in camelcase (likeThis) instead of snakecase (like_this). Below, we'll convert the column names to snakecase.
autos.columns= ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
'last_seen']
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code', 'last_seen'], dtype='object')
Although this is merely a stylistic change, a little bit of readability can go a long way!
Let's use the dataframe.describe() method to learn more about the columns and values stored in our data set. This will give us insight into which parts of our data we'll need to focus on cleaning. Note that we use the include='all' parameter so that we get summary statistics for both numeric and object data types.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-19 17:36:18 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Looking at the summary statistics for the data, we can see there are a few columns that don't seem to be useful for our analysis. For instance, the seller, offer_type, and nr_pictures columns consist almost entirely of a single value.
autos['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos['nr_pictures'].value_counts()
0 50000 Name: nr_pictures, dtype: int64
As we can see by using the series.value_counts() method on the seller column, nearly all of our data is from private sellers; there is only one listing under 'gewerblich' (in English, commercial). Similarly, nearly all of the listings have an offer type of 'Angebot' (in English, an offer, i.e. someone looking to sell their car), and only one is 'Gesuch' (in English, a request, i.e. someone looking to buy a car). All of the listings contain a 0 value in nr_pictures.
Since these columns contain mostly one value, it would make sense to drop them from the data set.
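The drop itself is one call to DataFrame.drop with the columns= argument. A sketch on a toy frame standing in for autos (on the real data the same call would be autos.drop(columns=['seller', 'offer_type', 'nr_pictures'])):

```python
import pandas as pd

# Toy frame with the three near-constant columns plus one we want to keep.
df = pd.DataFrame({
    "seller": ["privat", "privat", "gewerblich"],
    "offer_type": ["Angebot", "Angebot", "Angebot"],
    "nr_pictures": [0, 0, 0],
    "price": [5000, 8500, 1350],
})

# Remove the uninformative columns in one call.
df = df.drop(columns=["seller", "offer_type", "nr_pictures"])
print(df.columns.tolist())  # ['price']
```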
As mentioned before, there are a few columns where numeric data is stored as text rather than as an integer. Both the price and odometer columns would be better suited as integers. Let's begin by looking at the price column.
autos['price'].head()
0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 Name: price, dtype: object
It looks like the values in the price column contain the non-numeric characters '$' and ','. We need to remove them before converting the strings into integers.
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: object
Note that the dtype is still object. So, we'll use the series.astype() method to change the dtype from object to integer.
autos['price']=autos['price'].astype(int)
autos['price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
Now that the price column holds numeric values, let's use series.describe() to look at the summary statistics.
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
Let's use this same process for the odometer column.
autos['odometer'].head()
0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km Name: odometer, dtype: object
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False)
autos['odometer'] = autos['odometer'].astype(int)
autos['odometer'].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: int64
autos['odometer'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer, dtype: float64
Before moving on, let's rename the odometer column to odometer_km so that we can remember that the values are in kilometers.
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 5.000000e+04 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000.000000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-19 17:36:18 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | NaN | 25756 | 12859 | NaN | 36993 | NaN | 4024 | NaN | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | 9.840044e+03 | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | 125732.700000 | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | 4.811044e+05 | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | 40042.211706 | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 1.100000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | 1.000000e+08 | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Let's take another look at the price column.
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
It appears that we have some outliers that may be skewing the mean. Note how the minimum price is 0 dollars and the maximum price is 100 million dollars. Below, we use the value_counts() and sort_index() methods to look at some of the most extreme prices.
autos['price'].value_counts().sort_index(ascending=False)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 ... 66 1 65 5 60 9 59 1 55 2 50 49 49 4 47 1 45 4 40 6 35 1 30 7 29 1 25 5 20 4 18 1 17 3 15 2 14 1 13 2 12 3 11 2 10 7 9 1 8 1 5 2 3 1 2 3 1 156 0 1421 Name: price, Length: 2357, dtype: int64
It looks like there are a handful of listings priced either above 350,000 dollars or below 200 dollars. Let's drop these outliers, storing the result in a new dataframe, clean_autos.
clean_autos=autos[autos['price'].between(200,350000)]
clean_autos['price'].describe()
count 47645.000000 mean 6000.707273 std 9110.783444 min 200.000000 25% 1300.000000 50% 3190.000000 75% 7500.000000 max 350000.000000 Name: price, dtype: float64
Next, let's examine the odometer_km column in our new clean_autos data.
clean_autos['odometer_km'].describe()
count 47645.000000 mean 125887.501312 std 39482.911790 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
clean_autos['odometer_km'].value_counts()
150000 30781 125000 4998 100000 2083 90000 1720 80000 1407 70000 1209 60000 1145 50000 1007 40000 813 30000 773 20000 747 5000 720 10000 242 Name: odometer_km, dtype: int64
It looks like the car listings are distributed in roughly 10,000 km intervals, which indicates that the odometer readings aren't exact. The website likely offers a km range for sellers to list their cars. Additionally, the majority of the listings are at the highest interval, 150,000 km, indicating that most of the cars have 150,000 or more km on them.
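A quick sanity check supports this reading. Using the distinct values from the value_counts() output above, every recorded reading sits on a multiple of 5,000 km, which is consistent with the site offering preset mileage ranges rather than free-form odometer entry:

```python
import pandas as pd

# The 13 distinct odometer values observed in the data.
readings = pd.Series([150000, 125000, 100000, 90000, 80000, 70000,
                      60000, 50000, 40000, 30000, 20000, 10000, 5000])

# Every value is an exact multiple of 5,000 km.
print((readings % 5000 == 0).all())  # True
```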
Another column we should examine is registration_year.
clean_autos['registration_year'].describe()
count 47645.000000 mean 2004.800084 std 88.423872 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Some of the listings have registration years of 1000 or 9999, which cannot be accurate. Let's remove these outliers from our dataset.
clean_autos['registration_year'].value_counts().sort_index(ascending=False)
9999 3 9000 1 8888 1 6200 1 5911 1 5000 3 4800 1 4500 1 4100 1 2800 1 2019 1 2018 468 2017 1371 2016 1161 2015 376 2014 656 2013 797 2012 1307 2011 1617 2010 1587 2009 2079 2008 2206 2007 2272 2006 2667 2005 2906 2004 2694 2003 2692 2002 2470 2001 2611 2000 3019 ... 1964 12 1963 8 1962 4 1961 6 1960 21 1959 6 1958 4 1957 2 1956 4 1955 2 1954 2 1953 1 1952 1 1951 2 1950 1 1948 1 1943 1 1941 2 1939 1 1938 1 1937 4 1934 2 1931 1 1929 1 1927 1 1910 2 1800 2 1111 1 1001 1 1000 1 Name: registration_year, Length: 95, dtype: int64
We'll keep the listings with registration years in the range 1900-2016 (the data was crawled in 2016, so later years are impossible), and drop the others.
clean_autos=clean_autos[clean_autos['registration_year'].between(1900,2016)]
clean_autos['registration_year'].describe()
count 45786.000000 mean 2002.993098 std 7.113188 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
clean_autos['registration_year'].value_counts(normalize=True)
2000 0.065937 2005 0.063469 1999 0.062137 2004 0.058839 2003 0.058795 2006 0.058249 2001 0.057026 2002 0.053947 1998 0.049906 2007 0.049622 2008 0.048181 2009 0.045407 1997 0.040449 2011 0.035316 2010 0.034661 2012 0.028546 1996 0.028437 2016 0.025357 1995 0.024440 2013 0.017407 2014 0.014328 1994 0.012952 1993 0.008867 2015 0.008212 1992 0.007753 1991 0.007207 1990 0.006989 1989 0.003691 1988 0.002905 1985 0.001922 ... 1976 0.000459 1960 0.000459 1969 0.000415 1975 0.000393 1965 0.000371 1964 0.000262 1963 0.000175 1961 0.000131 1959 0.000131 1962 0.000087 1956 0.000087 1937 0.000087 1958 0.000087 1955 0.000044 1954 0.000044 1957 0.000044 1951 0.000044 1910 0.000044 1934 0.000044 1941 0.000044 1927 0.000022 1929 0.000022 1950 0.000022 1931 0.000022 1948 0.000022 1938 0.000022 1939 0.000022 1953 0.000022 1943 0.000022 1952 0.000022 Name: registration_year, Length: 78, dtype: float64
Most of the listings are for cars registered in the 20 years preceding the time this data was collected. This is to be expected, as any car older than that generally loses utility and exits the market. The remaining older listings are likely aimed at collectors or buyers specifically looking for vintage vehicles.
clean_autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 45786 | 45786 | 45786 | 45786 | 45786.000000 | 45786 | 43394 | 45786.000000 | 43911 | 45786.000000 | 43717 | 45786.000000 | 45786.000000 | 42756 | 45786 | 37966 | 45786 | 45786.0 | 45786.000000 | 45786 |
unique | 44289 | 35043 | 1 | 1 | NaN | 2 | 8 | NaN | 2 | NaN | 244 | NaN | NaN | 7 | 40 | 2 | 74 | NaN | NaN | 36495 |
top | 2016-03-23 18:39:34 | Volkswagen_Golf_1.4 | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 75 | 45786 | 45786 | NaN | 23582 | 12489 | NaN | 34105 | NaN | 3639 | NaN | NaN | 27978 | 9672 | 33670 | 1793 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | 6092.594112 | NaN | NaN | 2002.993098 | NaN | 119.168654 | NaN | 125699.340410 | 5.868759 | NaN | NaN | NaN | NaN | 0.0 | 51160.666776 | NaN |
std | NaN | NaN | NaN | NaN | 9229.962879 | NaN | NaN | 7.113188 | NaN | 185.462987 | NaN | 39546.768047 | 3.650627 | NaN | NaN | NaN | NaN | 0.0 | 25733.241750 | NaN |
min | NaN | NaN | NaN | NaN | 200.000000 | NaN | NaN | 1910.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 1300.000000 | NaN | NaN | 1999.000000 | NaN | 75.000000 | NaN | 100000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30880.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 3200.000000 | NaN | NaN | 2003.000000 | NaN | 109.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 50127.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 7700.000000 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 72100.000000 | NaN |
max | NaN | NaN | NaN | NaN | 350000.000000 | NaN | NaN | 2016.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
There are 5 columns that represent date values, some from the website itself and some gathered by the crawler:

date_crawled: added by the crawler
last_seen: added by the crawler
ad_created: from the website
registration_month: from the website
registration_year: from the website

The values in date_crawled, ad_created, and last_seen are formatted as strings. Since we're only interested in the month, day, and year, we'll look at the first 10 characters of each value and use the Series.value_counts() method to see the distribution of the dates. Additionally, we'll use the parameter normalize=True to see the percentage rather than the count.
clean_autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025226 2016-03-06 0.014175 2016-03-07 0.036256 2016-03-08 0.033351 2016-03-09 0.033242 2016-03-10 0.032390 2016-03-11 0.032565 2016-03-12 0.036758 2016-03-13 0.015900 2016-03-14 0.036321 2016-03-15 0.034290 2016-03-16 0.029310 2016-03-17 0.031691 2016-03-18 0.012711 2016-03-19 0.034487 2016-03-20 0.037937 2016-03-21 0.037326 2016-03-22 0.032543 2016-03-23 0.032346 2016-03-24 0.029441 2016-03-25 0.031341 2016-03-26 0.032106 2016-03-27 0.030905 2016-03-28 0.034749 2016-03-29 0.034028 2016-03-30 0.034006 2016-03-31 0.031800 2016-04-01 0.033941 2016-04-02 0.035775 2016-04-03 0.038920 2016-04-04 0.036671 2016-04-05 0.013039 2016-04-06 0.003080 2016-04-07 0.001376 Name: date_crawled, dtype: float64
It looks like the data was collected by the crawler over a month-long period, from March 5, 2016 to April 7, 2016.
clean_autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
2016-04-03 0.039160 2016-03-20 0.038003 2016-03-21 0.037544 2016-04-04 0.036998 2016-03-12 0.036561 2016-04-02 0.035426 2016-03-07 0.034989 2016-03-14 0.034967 2016-03-28 0.034836 2016-03-15 0.034093 2016-03-29 0.034050 2016-04-01 0.033941 2016-03-30 0.033809 2016-03-19 0.033416 2016-03-08 0.033351 2016-03-09 0.033329 2016-03-11 0.032848 2016-03-22 0.032302 2016-03-23 0.032237 2016-03-26 0.032106 2016-03-10 0.032106 2016-03-31 0.031887 2016-03-25 0.031494 2016-03-17 0.031385 2016-03-27 0.030774 2016-03-16 0.029725 2016-03-24 0.029398 2016-03-05 0.022845 2016-03-13 0.017276 2016-03-06 0.015398 ... 2016-02-25 0.000066 2016-02-02 0.000044 2016-01-27 0.000044 2016-02-24 0.000044 2016-02-12 0.000044 2016-02-18 0.000044 2016-02-20 0.000044 2016-02-14 0.000044 2016-02-26 0.000044 2016-02-05 0.000044 2016-01-10 0.000044 2016-02-07 0.000022 2015-11-10 0.000022 2016-01-13 0.000022 2015-06-11 0.000022 2016-01-16 0.000022 2016-02-17 0.000022 2015-09-09 0.000022 2016-02-09 0.000022 2016-01-14 0.000022 2016-02-11 0.000022 2016-01-07 0.000022 2015-12-30 0.000022 2016-02-22 0.000022 2015-08-10 0.000022 2016-02-01 0.000022 2016-02-16 0.000022 2015-12-05 0.000022 2016-02-08 0.000022 2016-01-03 0.000022 Name: ad_created, Length: 74, dtype: float64
Most of the listings were created during the same period in which the data was crawled.
clean_autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000022 2015-08-10 0.000022 2015-09-09 0.000022 2015-11-10 0.000022 2015-12-05 0.000022 2015-12-30 0.000022 2016-01-03 0.000022 2016-01-07 0.000022 2016-01-10 0.000044 2016-01-13 0.000022 2016-01-14 0.000022 2016-01-16 0.000022 2016-01-27 0.000044 2016-02-01 0.000022 2016-02-02 0.000044 2016-02-05 0.000044 2016-02-07 0.000022 2016-02-08 0.000022 2016-02-09 0.000022 2016-02-11 0.000022 2016-02-12 0.000044 2016-02-14 0.000044 2016-02-16 0.000022 2016-02-17 0.000022 2016-02-18 0.000044 2016-02-19 0.000066 2016-02-20 0.000044 2016-02-21 0.000066 2016-02-22 0.000022 2016-02-23 0.000087 ... 2016-03-09 0.033329 2016-03-10 0.032106 2016-03-11 0.032848 2016-03-12 0.036561 2016-03-13 0.017276 2016-03-14 0.034967 2016-03-15 0.034093 2016-03-16 0.029725 2016-03-17 0.031385 2016-03-18 0.013367 2016-03-19 0.033416 2016-03-20 0.038003 2016-03-21 0.037544 2016-03-22 0.032302 2016-03-23 0.032237 2016-03-24 0.029398 2016-03-25 0.031494 2016-03-26 0.032106 2016-03-27 0.030774 2016-03-28 0.034836 2016-03-29 0.034050 2016-03-30 0.033809 2016-03-31 0.031887 2016-04-01 0.033941 2016-04-02 0.035426 2016-04-03 0.039160 2016-04-04 0.036998 2016-04-05 0.011816 2016-04-06 0.003167 2016-04-07 0.001223 Name: ad_created, Length: 74, dtype: float64
The listings in our data set were created between June 11, 2015 and April 7, 2016.
clean_autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001092 2016-03-06 0.004106 2016-03-07 0.005351 2016-03-08 0.007273 2016-03-09 0.009785 2016-03-10 0.010484 2016-03-11 0.012318 2016-03-12 0.023872 2016-03-13 0.008649 2016-03-14 0.012580 2016-03-15 0.015835 2016-03-16 0.016097 2016-03-17 0.028087 2016-03-18 0.007164 2016-03-19 0.015420 2016-03-20 0.020618 2016-03-21 0.020443 2016-03-22 0.020858 2016-03-23 0.018324 2016-03-24 0.019526 2016-03-25 0.018805 2016-03-26 0.016686 2016-03-27 0.015485 2016-03-28 0.020596 2016-03-29 0.021950 2016-03-30 0.024396 2016-03-31 0.023719 2016-04-01 0.023042 2016-04-02 0.024527 2016-04-03 0.025073 2016-04-04 0.024243 2016-04-05 0.126130 2016-04-06 0.224020 2016-04-07 0.133447 Name: last_seen, dtype: float64
It's notable that nearly 50% of the listings were last seen on April 5, 6, or 7. This is likely linked to the fact that the crawler didn't gather any data after April 7th; many of these listings were probably still open on those dates.
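String slicing works here because every timestamp shares a fixed YYYY-MM-DD prefix. As an alternative sketch (not the approach used in this notebook), parsing the strings with pd.to_datetime would enable real date arithmetic, such as computing the span of the crawl:

```python
import pandas as pd

# Two sample timestamps in the same format as the last_seen column.
last_seen = pd.Series(["2016-03-05 14:06:30", "2016-04-07 06:17:27"])

# Parse to datetime64 so we can subtract dates directly.
dates = pd.to_datetime(last_seen)
span = dates.max() - dates.min()
print(span.days)  # whole days between the first and last timestamp
```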
For this next section, let's look at the brand column and ask: which car brands are more expensive on average?
clean_autos['brand'].value_counts(normalize=True)
volkswagen 0.211244 bmw 0.111257 opel 0.105906 mercedes_benz 0.097628 audi 0.087603 ford 0.068733 renault 0.046543 peugeot 0.029944 fiat 0.025095 seat 0.018171 skoda 0.016512 nissan 0.015289 mazda 0.015136 smart 0.014371 citroen 0.014131 toyota 0.012930 hyundai 0.010112 sonstige_autos 0.009544 volvo 0.009195 mini 0.008889 mitsubishi 0.008168 honda 0.007928 kia 0.007142 alfa_romeo 0.006683 porsche 0.006094 suzuki 0.005810 chevrolet 0.005722 chrysler 0.003560 dacia 0.002686 daihatsu 0.002490 jeep 0.002315 land_rover 0.002140 subaru 0.002097 saab 0.001660 jaguar 0.001529 daewoo 0.001485 rover 0.001332 trabant 0.001289 lancia 0.001048 lada 0.000590 Name: brand, dtype: float64
Let's look at the top 20 most commonly listed car brands and find the mean price for each brand.
#selecting the 20 most commonly listed brands
brands_list=clean_autos['brand'].value_counts().index[:20]
print(brands_list)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart', 'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'], dtype='object')
In the cell below, we create an empty dictionary and loop through the brands_list referenced above. For each brand, we calculate the mean price from the clean_autos dataframe, then add the brand and its mean price to the brand_price dictionary.
brand_price = {}
for brand in brands_list:
    #mean price across all listings of this brand
    mean = int(clean_autos.loc[clean_autos['brand'] == brand, 'price'].mean())
    brand_price[brand] = mean
brand_price
For reference, the mean price across all of the listings in clean_autos is 6092. The most commonly listed brand, Volkswagen, is cheaper on average than the other cars in our data. However, the second most commonly listed brand, BMW, is more expensive on average.
Next, let's use the same process to calculate the average mileage (from the odometer_km column) across the selected brands.
brand_km = {}
for brand in brands_list:
    #mean odometer reading across all listings of this brand
    mean = int(clean_autos.loc[clean_autos['brand'] == brand, 'odometer_km'].mean())
    brand_km[brand] = mean
brand_km
In the cell below, we'll convert the two dictionaries brand_price and brand_km to Series objects, and then combine them into a single dataframe that shares a common index (the brand).
bmp_series=pd.Series(brand_price)
bkm_series=pd.Series(brand_km)
avg_df=pd.DataFrame(bmp_series, columns=['mean_price'])
avg_df['mean_km']=bkm_series
avg_df
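As a design note, the same table can be built without the dictionary loops. A vectorized sketch (an alternative, not the approach used above): a single groupby computes both aggregates in one pass.

```python
import pandas as pd

# Toy frame with made-up listings standing in for clean_autos.
df = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price": [8000, 6000, 3000, 1000],
    "odometer_km": [150000, 125000, 150000, 100000],
})

# One groupby yields the per-brand means for both columns at once.
avg = (df.groupby("brand")[["price", "odometer_km"]]
         .mean()
         .rename(columns={"price": "mean_price", "odometer_km": "mean_km"}))
print(avg)
```

This scales better than looping when the number of groups grows, and keeps the two aggregates aligned on the brand index automatically.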
The assumption is that the higher the mileage on a car, the cheaper the car becomes. However, when looking at the dataframe, there is no clear relationship between mileage and price across brands. This may indicate that the brand is more of a factor affecting price than the mileage of the car. We should note, however, that as we saw before, the data in odometer_km is not exact, and the listed mileage on the eBay ads is likely limited to a preselected range.
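One way to quantify "no clear relationship" would be a correlation between the two aggregate columns. A sketch with hypothetical per-brand numbers standing in for avg_df (not the real aggregates); a coefficient near zero would support the conclusion that mileage alone doesn't explain price differences across brands:

```python
import pandas as pd

# Hypothetical per-brand means for illustration only.
avg_df = pd.DataFrame(
    {"mean_price": [8332, 9336, 2975, 3749],
     "mean_km": [132434, 129157, 129310, 124266]},
    index=["bmw", "mercedes_benz", "opel", "ford"],
)

# Pearson correlation between mean price and mean mileage across brands.
corr = avg_df["mean_price"].corr(avg_df["mean_km"])
print(round(corr, 3))
```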
The goal of this guided project was to demonstrate our ability to clean data by recognizing anomalies, organize DataFrame and Series objects, and alter data types using the functionality of Pandas and NumPy. It is important not to declare the presence of a particular trend without verifying it through statistical analysis.