This dataset is 50,000 data points sampled from the original datasets. The structure (data fields) of the dataset should be as followed:
First, we import the dataset to the notebook, looking at the info and the first few data points from the datasets
import pandas as pd
import numpy as np
df = pd.read_csv('autos.csv', encoding='Latin-1')
df.head(10)
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
There is a few observations from the information of the first few dataset. These can be helpful guidance for further exploration and cleaning of the datasets
df = df.rename({'dateCrawled': 'date_crawled',
'offerType': 'offer_type',
'vehicleType': 'vehicle_type',
'yearOfRegistration': 'registration_year',
'monthOfRegistration': 'registration_month',
'powerPS': 'power_ps',
'fuelType': 'fuel_type',
'notRepairedDamage': 'unrepaired_damage',
'dateCreated': 'ad_created',
'nrOfPictures': 'num_pics',
'postalCode': 'postal_code',
'lastSeen': 'last_seen_date'}, axis=1)
df.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'num_pics', 'postal_code', 'last_seen_date'], dtype='object')
Next, we will look into the description of each columns to see any new information that we can extract
df.describe()
registration_year | power_ps | registration_month | num_pics | postal_code | |
---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 2005.073280 | 116.355920 | 5.723360 | 0.0 | 50813.627300 |
std | 105.712813 | 209.216627 | 3.711984 | 0.0 | 25779.747957 |
min | 1000.000000 | 0.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1999.000000 | 70.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2003.000000 | 105.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 2008.000000 | 150.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 9999.000000 | 17700.000000 | 12.000000 | 0.0 | 99998.000000 |
There are some observation that we can see from this:
num_pics
in every post are all 0, so this information is basically useless for our analysis and should be droppedNext, we look at all the data, including string and object data
df.describe(include=['O'])
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | gearbox | model | odometer | fuel_type | brand | unrepaired_damage | ad_created | last_seen_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 47320 | 47242 | 50000 | 45518 | 50000 | 40171 | 50000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | 2 | 245 | 13 | 7 | 40 | 2 | 76 | 39481 |
top | 2016-04-02 11:37:04 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | manuell | golf | 150,000km | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | 36993 | 4024 | 32424 | 30107 | 10687 | 35232 | 1946 | 8 |
There are some notable things to consider in this description:
odometer
and price
should be consider numeric type, not object type. We need to transform these two into different typeseller
and offer_type
consists of only one value, so they should be dropped since they do not contribute much to the data analysisThe data in num_pics
, seller
and offer_type
contain only one value, so we can safely remove this column without eliminating any useful information
df = df.drop(['num_pics', 'seller', 'offer_type'], axis=1)
df.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen_date'], dtype='object')
The type of price
and odometer
data is string/object, which is not desirable. What we should do is transform it into integer type, and add additional information into the
print(df['price'].str[0].value_counts())
print(df['odometer'].str[-2:].value_counts())
$ 50000 Name: price, dtype: int64 km 50000 Name: odometer, dtype: int64
Here we can see that all the data in these two are expressed in US Dollar and Kilometer, so we can effectively remove the units in the data
df['price'] = df['price'].str[1:].str.replace(',', '').astype(int)
df['odometer'] = df['odometer'].str[:-2].str.replace(',', '').astype(int)
df = df.rename({'price': 'price_dollar', 'odometer': 'odometer_km'}, axis=1)
df.describe()
price_dollar | registration_year | power_ps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 |
mean | 9.840044e+03 | 2005.073280 | 116.355920 | 125732.700000 | 5.723360 | 50813.627300 |
std | 4.811044e+05 | 105.712813 | 209.216627 | 40042.211706 | 3.711984 | 25779.747957 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1.100000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 30451.000000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 49577.000000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71540.000000 |
max | 1.000000e+08 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
Since there are a lot of outliers for the project, we needs to eliminate them so that we do not lead to incorrect conclusion and analysis.
We start with the column price_dollar
first, as we can see from the data description, the maximum price of a car can be up to more than 1 billion dollars, which is unrealistic
print(df['price_dollar'].value_counts().sort_index().tail(20))
197000 1 198000 1 220000 1 250000 1 259000 1 265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price_dollar, dtype: int64
We can see that data is pretty consistently from 200,000 dollar to 350,000 dollar, then we see a sudden jumped from 999,990 to over 999,999,999. On the other side, 0 is the most common 'exact' price, and there are many people giving their car for free (I supposed), so we do not have to worry about this end.
In this column, we can treat any data over 350,000 as outliers and replace it with NaN
df.loc[~(df['price_dollar'].between(0, 350000)), 'price_dollar'] = np.nan
df['price_dollar'].describe()
count 49986.000000 mean 5721.525167 std 8983.617820 min 0.000000 25% 1100.000000 50% 2950.000000 75% 7200.000000 max 350000.000000 Name: price_dollar, dtype: float64
The next column we needs to examine its outliers is registration_year
. This columns contains year from 1000 (when the car not even invented yet) to 9999 (way into the future)! We should examine all the data lying outside 1900 (when car first registered) to 2020 (present) to see what we should do with it
df.loc[~(df['registration_year'].between(1900, 2020)), :]
date_crawled | name | price_dollar | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
453 | 2016-03-28 13:51:12 | Armee_Jeep | 9800.0 | test | NaN | 4500 | manuell | 0 | andere | 5000 | 0 | NaN | jeep | NaN | 2016-03-28 00:00:00 | 7545 | 2016-04-06 17:45:49 |
4164 | 2016-03-29 18:39:40 | Verkaufe_DESIGN_Streifen_/_Aufkleber_VW__Opel_... | 49.0 | control | NaN | 5000 | NaN | 0 | golf | 5000 | 12 | NaN | volkswagen | NaN | 2016-03-29 00:00:00 | 74523 | 2016-04-06 04:16:14 |
4549 | 2016-04-01 21:57:05 | Kompressor | 1600.0 | test | NaN | 4100 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-04-01 00:00:00 | 67686 | 2016-04-05 20:19:27 |
6308 | 2016-03-12 17:38:17 | Kaufe_Autos_jeglicher | 0.0 | test | NaN | 9996 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 2016-03-12 00:00:00 | 21244 | 2016-03-12 17:38:17 |
8012 | 2016-03-23 16:43:29 | Opel_GT_Karosserie_mit_Brief! | 700.0 | test | NaN | 9999 | NaN | 0 | andere | 10000 | 0 | NaN | opel | NaN | 2016-03-23 00:00:00 | 21769 | 2016-04-05 20:16:15 |
8360 | 2016-03-11 22:56:30 | Vito_touret_119_Blue_Tec | 42800.0 | control | NaN | 6200 | automatik | 0 | vito | 10000 | 7 | diesel | mercedes_benz | nein | 2016-03-11 00:00:00 | 63739 | 2016-03-19 20:16:56 |
10556 | 2016-04-01 06:02:10 | UNFAL_Auto | 450.0 | control | NaN | 1800 | NaN | 1800 | NaN | 5000 | 2 | NaN | mitsubishi | nein | 2016-04-01 00:00:00 | 63322 | 2016-04-01 09:42:30 |
13559 | 2016-03-19 15:57:44 | Saab_9000_CSE_Automatik_2_3_ltr._mit_EGSD | 0.0 | control | NaN | 9000 | automatik | 170 | 9000 | 150000 | 2 | benzin | saab | nein | 2016-03-19 00:00:00 | 32457 | 2016-03-21 21:18:11 |
14341 | 2016-03-23 01:36:20 | Hole_kostenlos_ab | 0.0 | test | NaN | 9999 | NaN | 0 | NaN | 10000 | 0 | NaN | bmw | NaN | 2016-03-23 00:00:00 | 32689 | 2016-03-23 08:47:00 |
22316 | 2016-03-29 16:56:41 | VW_Kaefer.__Zwei_zum_Preis_von_einem. | 1500.0 | control | NaN | 1000 | manuell | 0 | kaefer | 5000 | 0 | benzin | volkswagen | NaN | 2016-03-29 00:00:00 | 48324 | 2016-03-31 10:15:28 |
22799 | 2016-03-20 18:56:44 | Subaru_Impreza_GT | 9000.0 | test | NaN | 5000 | manuell | 420 | impreza | 5000 | 6 | benzin | subaru | nein | 2016-03-20 00:00:00 | 34253 | 2016-04-07 02:45:30 |
24511 | 2016-03-17 19:45:11 | Trabant__wartburg__Ostalgie | 490.0 | control | NaN | 1111 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 2016-03-17 00:00:00 | 16818 | 2016-04-07 07:17:29 |
24519 | 2016-03-05 17:53:37 | 4x_Winterreifen_auf_Alufelge_der_naechste_Wint... | 250.0 | test | NaN | 5000 | NaN | 0 | andere | 5000 | 0 | NaN | seat | NaN | 2016-03-05 00:00:00 | 49124 | 2016-04-05 13:46:51 |
25003 | 2016-03-06 16:06:20 | Reo_Vorkriegs_Oldtimer_Rennwagen_1928 | 6500.0 | control | NaN | 8888 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 2016-03-06 00:00:00 | 55262 | 2016-03-30 20:46:55 |
27578 | 2016-03-31 22:49:04 | VW_GOLF_2_SYNCRO | 1700.0 | test | NaN | 2800 | manuell | 98 | NaN | 5000 | 7 | benzin | volkswagen | NaN | 2016-03-31 00:00:00 | 83022 | 2016-03-31 23:44:01 |
27618 | 2016-03-29 22:39:32 | Golf_1_75ps_5911km_Original_Automatik_einer_de... | 7999.0 | test | NaN | 5911 | automatik | 75 | golf | 10000 | 0 | NaN | volkswagen | NaN | 2016-03-29 00:00:00 | 41462 | 2016-04-06 09:17:09 |
32585 | 2016-04-02 16:56:39 | UNFAL_Auto | 450.0 | control | NaN | 1800 | NaN | 1800 | NaN | 5000 | 2 | NaN | mitsubishi | nein | 2016-04-02 00:00:00 | 63322 | 2016-04-04 14:46:21 |
33950 | 2016-03-23 21:52:25 | 58er_karmann_ghia_lowlight_Kaefer__zum_restaur... | 7999.0 | test | NaN | 9999 | NaN | 0 | kaefer | 10000 | 0 | NaN | volkswagen | NaN | 2016-03-23 00:00:00 | 47638 | 2016-04-06 03:46:40 |
35238 | 2016-03-26 13:45:20 | Suche_Skoda_Fabia____Skoda_Fabia_Combi_mit_Klima | 0.0 | control | NaN | 1500 | NaN | 0 | NaN | 5000 | 0 | benzin | skoda | NaN | 2016-03-26 00:00:00 | 15517 | 2016-04-04 00:16:54 |
38076 | 2016-04-04 22:54:47 | Mercedes_Benz_A180 | 18000.0 | test | NaN | 9999 | NaN | 0 | a_klasse | 10000 | 0 | benzin | mercedes_benz | NaN | 2016-04-04 00:00:00 | 51379 | 2016-04-07 02:44:52 |
42079 | 2016-03-17 19:55:29 | APE_50___deutsche_Papiere_!!! | 500.0 | control | NaN | 4800 | manuell | 3 | NaN | 5000 | 11 | NaN | sonstige_autos | NaN | 2016-03-17 00:00:00 | 48653 | 2016-03-17 19:55:29 |
49153 | 2016-03-12 01:36:59 | Corsa_c20xe | 2500.0 | test | NaN | 5000 | NaN | 0 | corsa | 5000 | 0 | NaN | opel | NaN | 2016-03-12 00:00:00 | 88214 | 2016-03-12 22:15:17 |
49283 | 2016-03-15 18:38:53 | Citroen_HY | 7750.0 | control | NaN | 1001 | NaN | 0 | andere | 5000 | 0 | NaN | citroen | NaN | 2016-03-15 00:00:00 | 66706 | 2016-04-06 18:47:20 |
49910 | 2016-04-03 21:39:15 | Schoener_fast_neuer_Opel_Mokka_in_Zell_Mosel_m... | 22200.0 | test | NaN | 9000 | automatik | 140 | andere | 10000 | 3 | benzin | opel | NaN | 2016-04-03 00:00:00 | 56856 | 2016-04-05 22:18:26 |
These entries, when we look at other attribute, is mostly inaccurate data. Most of them have registration month of 0, some of them have car power that is way over 1800ps (some of the most powerful car these day cannot reach 1500ps). Therefore, I think we can safely remove these rows from the dataset. After remove the registration year, we look back at the dataset
df = df.loc[df['registration_year'].between(1900, 2020), :]
df.describe()
price_dollar | registration_year | power_ps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 49962.000000 | 49976.000000 | 49976.000000 | 49976.000000 | 49976.000000 | 49976.000000 |
mean | 5721.426684 | 2003.366836 | 116.321634 | 125786.877701 | 5.725068 | 50814.988014 |
std | 8983.291959 | 7.693943 | 208.979969 | 39970.469598 | 3.711228 | 25781.624480 |
min | 0.000000 | 1910.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1100.000000 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 30451.000000 |
50% | 2950.000000 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 49577.000000 |
75% | 7200.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71543.000000 |
max | 350000.000000 | 2019.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
Since we finish cleaning our dataset, now we can use our cleaned data to extract some useful infomation for eBay's car buyers
First we retrieve the top 20 most posted brands to calculate the mean price for each brands. We then calculate their mean price, mileage, and power(for those who wants a little bit more push!) data of each brand to get some insights on which car brands that is suitable for the customer
# Get the top 20 brands on eBay
brands = df['brand'].value_counts().sort_values(ascending=False)[:20].index
# Calculate and display average price and mileage for each car
price_by_brand = []
mileage_by_brand = []
power_by_brand = []
for brand in brands:
price_by_brand.append(df.loc[df['brand'] == brand, 'price_dollar'].mean())
mileage_by_brand.append(df.loc[df['brand'] == brand, 'odometer_km'].mean())
power_by_brand.append(df.loc[df['brand'] == brand, 'power_ps'].mean())
# Convert into dataframe for display
brand_info = pd.DataFrame(index=brands)
brand_info['mean_price_dollar'] = price_by_brand
brand_info['mean_odometer_km'] = mileage_by_brand
brand_info['mean_power_ps'] = power_by_brand
brand_info
mean_price_dollar | mean_odometer_km | mean_power_ps | |
---|---|---|---|
volkswagen | 5159.401629 | 129012.357236 | 98.318105 |
opel | 2842.824629 | 129365.152070 | 93.930194 |
bmw | 8028.474479 | 132544.215181 | 165.454679 |
mercedes_benz | 8380.637920 | 130937.235841 | 149.788462 |
audi | 8965.560355 | 129643.941163 | 158.651413 |
ford | 3626.542997 | 124131.934464 | 99.318482 |
renault | 2351.301997 | 128223.793677 | 78.779534 |
peugeot | 3010.868819 | 127352.335165 | 90.689560 |
fiat | 2697.677123 | 117037.461774 | 68.596330 |
seat | 4223.654255 | 122186.170213 | 98.055319 |
skoda | 6313.076433 | 111082.802548 | 120.951592 |
mazda | 3962.542933 | 125132.100396 | 107.642008 |
nissan | 4588.879310 | 118978.779841 | 94.278515 |
smart | 3482.971469 | 100756.062767 | 64.298146 |
citroen | 3680.844063 | 119928.571429 | 87.135714 |
toyota | 5097.941653 | 115988.654781 | 101.980551 |
sonstige_autos | 10600.480519 | 87776.752768 | 116.387454 |
hyundai | 5316.754098 | 106782.786885 | 100.651639 |
volvo | 4685.548246 | 138632.385120 | 139.820569 |
mini | 10392.393868 | 89375.000000 | 127.280660 |
As we can see in the table, one of the most expensive brands will be Mercedes Benz, Audi, BMW. The Mid-tier cars is populated by brands like Volkswagen, Ford, Toyota, Huyndai. Cheap options include Opel, Ford, Peugeot, Renault.
Most of the popular brands will have the average mileage around 130,000km, and this number drops gradually with their popularity. Also, car prices also corresponds towards average mileage, as high-end brands like BMW and Mercedes have on average more mileage compared to mid-tier and low-tier ones.
For those who desire a little bit more power, expensive car is the way to go. However, there are some mid-tier car brands that provide a little bit more horsepower for their money, such as Volvo and Skoda. Otherwise, most of the car power does depends on their price tag.
Car buyers usually interested in the model of the car, not just the brand itself. We should investigate on some of the popular car models out there for the customer, so the customer can see what is their most popular options.
(df['brand'] + ' ' + df['model']).value_counts().sort_values(ascending=False).head(20)
volkswagen golf 4022 bmw 3er 2761 volkswagen polo 1757 opel corsa 1734 opel astra 1454 volkswagen passat 1425 audi a4 1291 bmw 5er 1183 mercedes_benz c_klasse 1172 mercedes_benz e_klasse 1001 audi a3 882 audi a6 835 ford focus 811 ford fiesta 769 volkswagen transporter 704 renault twingo 677 peugeot 2_reihe 621 mercedes_benz a_klasse 587 smart fortwo 574 opel vectra 574 dtype: int64
The most popular option on eBay, by far, is the Volkswagen Golf, with over 4022 listings. The Volkswagen family also have the Polo and the Passat model in the top 10. BMW 3er and BMW 5er is also very high on the list as a popular high-end model. For low-end brand, Opel Corsa and Opel Astra are the most popular choice that customers can find.
df['power_ps'].value_counts().sort_index().tail(50)
950 1 952 1 999 1 1000 1 1001 3 1003 1 1011 1 1016 1 1055 1 1056 1 1082 1 1090 1 1103 1 1202 1 1300 1 1367 1 1398 1 1400 3 1401 1 1405 1 1704 1 1753 1 1771 1 1779 1 1780 1 1781 1 1793 1 1796 1 1800 1 1986 1 1988 1 1998 2 2018 1 2729 1 3500 1 3750 1 4400 1 5867 1 6045 1 6226 1 6512 1 7511 1 8404 1 9011 1 14009 1 15001 1 15016 1 16011 1 16312 1 17700 1 Name: power_ps, dtype: int64