In [1]:

```
import vaex
from vaex.ui.colormaps import cm_plusmin
import numpy as np
import pylab as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
```

Set `matplotlib` parameters

In [2]:

```
SMALL_SIZE = 12
MEDIUM_SIZE = 14
BIGGER_SIZE = 16
plt.rc('font', size=SMALL_SIZE) # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE) # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE) # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE) # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE) # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE) # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE) # fontsize of the figure title
```

The original data is courtesy of the New York City Taxi and Limousine Commission, and can be downloaded from this website.

The data was then converted to the memory-mappable HDF5 file format. For an example on how to do this, you may want to look at this notebook.

`Vaex` can "read" the memory-mapped file that we have on disk in no time. It can also read data stored on S3, in which case the data is streamed on a need-to-have basis and cached locally.

In [3]:

```
# Check the file size on disk
!ls -l -h ./data/yellow_taxi_2009_2015_f32.hdf5
```

In [4]:

```
# Read in the data from disk
df = vaex.open('./data/yellow_taxi_2009_2015_f32.hdf5')
```

In [5]:

```
# # Read in the data from S3
# df = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')
```

In [6]:

```
# A view into the data
df
```

Out[6]:

This is done with a single pass over the data.

In [7]:

```
# Get a high level overview of the DataFrame
df.describe()
```

Out[7]:

In this section we will use the output of describe to get rid of outliers, and other erroneous data. Let's start with the City of New York itself.

Let's visualise the pickup locations.

In [8]:

```
# Interactively visualise the pickup locations of all taxi trips in our dataset.
df.plot_widget(df.pickup_longitude,
               df.pickup_latitude,
               shape=512,
               limits='minmax',
               f='log1p',
               colormap='plasma')
```

With Vaex we can interactively explore heatmaps like the one above, even when the data contains over 1 billion samples. This lets us choose the spatial extent over which the taxi company operates in New York City. In fact, it is mostly Manhattan.
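The `f='log1p'` argument applies a log(1+x) stretch to the bin counts before colour-mapping. A minimal numpy sketch (on synthetic counts, not the taxi data) of why this transform suits count heatmaps:

```python
import numpy as np

# Synthetic bin counts: most bins empty, a few very dense
counts = np.array([0, 1, 10, 100_000])

# log1p maps 0 -> 0 (empty bins stay at the bottom of the colour scale)
# and compresses the huge dynamic range of the dense bins
stretched = np.log1p(counts)

print(stretched)  # [0.0, ~0.69, ~2.40, ~11.51]
```

A plain `log` would produce `-inf` for empty bins, which is why `log1p` is the usual choice here.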

In [9]:

```
# Define the boundaries by interactively choosing the area of interest!
long_min = -74.05
long_max = -73.75
lat_min = 40.58
lat_max = 40.90
# Make a selection based on the boundaries
df_filtered = df[(df.pickup_longitude > long_min) & (df.pickup_longitude < long_max) &
                 (df.pickup_latitude > lat_min) & (df.pickup_latitude < lat_max) &
                 (df.dropoff_longitude > long_min) & (df.dropoff_longitude < long_max) &
                 (df.dropoff_latitude > lat_min) & (df.dropoff_latitude < lat_max)]
```

From the output of the `describe` method we see that the maximum number of passengers is 255! Let's make a bar plot showing the common number of passengers in a taxi trip.

In [10]:

```
# Get number of unique trips with certain number of passengers
num_passengers = df_filtered.passenger_count.value_counts(progress=True)
# Plot the result
plt.figure(figsize=(16, 4))
sns.barplot(x=num_passengers.index, y=np.log10(num_passengers.values))
plt.xlabel('Number of passengers')
plt.ylabel('Number of trips [dex]')
plt.xticks(rotation='45')
plt.show()
```

First impressions: the typical number of passengers per ride is between 1 and 6. There is a very large number of taxi trips with 0 passengers. Are these perhaps deliveries, or were the passengers simply not recorded? There are a few hundred taxi trips with 7-9 passengers, and beyond that the numbers look erroneous.

In this analysis we will focus only on the trips with a typical number of passengers, that is, between 1 and 6. So let's add that to the filter.

In [11]:

```
# Filter based on the number of passengers
df_filtered = df_filtered[(df_filtered.passenger_count>0) & (df_filtered.passenger_count<7)]
```

Next up, we turn to the distance column. Here we see that the minimum value is negative, so something has surely gone wrong, while the maximum value is... well, very large! In fact, to put this in perspective:

In [12]:

```
# What is the largest distance?
max_trip_distance = df_filtered.trip_distance.max().astype('int')
print(f'The largest distance in the data is {max_trip_distance} miles!')
print(f'This is {max_trip_distance/238_900:.1f} times larger than the distance between the Earth and the Moon!')
```

Let's plot the distribution of distances, but in a more sensible range, relative to the scale of the part of New York City we selected above.

In [13]:

```
# Plot the distribution of distances.
plt.figure(figsize=(8, 4))
df_filtered.plot1d('trip_distance', limits=[0, 250], f='log10', shape=128, lw=3, progress=True)
plt.xlabel('Trip distance [miles]')
plt.ylabel('Number of trips [dex]')
plt.show()
```

We observe that at ~100 miles the number of taxi trips drops considerably and becomes more sporadic. Thus we decide to only consider taxi trips that are at most 100 miles in total.

In [14]:

```
# Select taxi trips that travelled at most 100 miles (but also with non-zero distance).
df_filtered = df_filtered[(df_filtered.trip_distance > 0) & (df_filtered.trip_distance < 100)]
```

In the next step of our data cleaning process, let's look at the distributions of trip times and speeds, and make sure they are sensible. These quantities are not readily available in the dataset, but are trivial to compute. We will do this in a rather standard way, but here is the kicker: these additional columns do not cost any memory whatsoever. This is what we call *virtual columns*.

In [15]:

```
# Time in transit (minutes)
df_filtered['trip_duration_min'] = (df_filtered.dropoff_datetime - df_filtered.pickup_datetime) / \
                                   np.timedelta64(1, 'm')
# Speed (miles per hour)
df_filtered['trip_speed_mph'] = df_filtered.trip_distance / \
                                ((df_filtered.dropoff_datetime - df_filtered.pickup_datetime) /
                                 np.timedelta64(1, 'h'))
```
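The trick in the cell above is that dividing a `timedelta64` by a unit `timedelta64` yields a plain float in that unit. A small standalone check on plain numpy arrays (synthetic timestamps, not the taxi columns):

```python
import numpy as np

pickup = np.array(['2015-01-01T12:00'], dtype='datetime64[s]')
dropoff = np.array(['2015-01-01T12:30'], dtype='datetime64[s]')

# timedelta64 / unit timedelta64 -> float in that unit
duration_min = (dropoff - pickup) / np.timedelta64(1, 'm')
duration_h = (dropoff - pickup) / np.timedelta64(1, 'h')

print(duration_min[0], duration_h[0])  # 30.0 0.5
```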

In [16]:

```
# Plot the distribution of trip durations
plt.figure(figsize=(8, 4))
df_filtered.plot1d('trip_duration_min', limits=[0, 600], f='log10', shape=64, lw=3, progress=True)
plt.xlabel('Trip duration [minutes]')
plt.ylabel('Number of trips [dex]')
plt.show()
```

We see that the majority of taxi trips, 95% to be exact, last less than 30 minutes. From the plot above we see that the distribution falls off and becomes almost flat after 200 minutes. Can you imagine spending over 3 hours in a taxi in New York City? Perhaps it happens...
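A claim like "95% of trips last under 30 minutes" corresponds to the 95th percentile of the duration distribution. A sketch of such a check with plain numpy, on hypothetical synthetic durations rather than the taxi data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical trip durations in minutes, roughly exponentially distributed
durations = rng.exponential(scale=10.0, size=100_000)

# For an exponential with mean 10, the 95th percentile is 10*ln(20), close to 30
p95 = np.percentile(durations, 95)
print(f'95% of these synthetic trips last under {p95:.1f} minutes')
```

On the real DataFrame one would compute the same quantile on the `trip_duration_min` column instead.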

So let's be... open-minded for now, and consider all trips that last less than 3 hours in total.

In [17]:

```
# Filter out taxi trips that have unreasonably long durations
df_filtered = df_filtered[(df_filtered.trip_duration_min > 0) & (df_filtered.trip_duration_min < 180)]
```

Now let's look at the mean speed of a trip. Let us first look at the extremes:

In [18]:

```
# Minimum and maximum average speed of a taxi trip
print('Minimal mean speed: %.3f miles/hour.' % (df_filtered.trip_speed_mph.min()))
print('Maximal mean speed: %.3f miles/hour.' % (df_filtered.trip_speed_mph.max()))
```

From the extremes of this column we notice some serious outliers. On the lower end of the spectrum, the slowest speeds are considerably slower than walking speeds. On the high end of the spectrum, those cars are flying so fast they could be used as spaceships!

Let us plot the distribution of mean speeds, for a more sensible, or at least physically viable range.

In [19]:

```
# Plot the distribution of mean trip speeds
plt.figure(figsize=(8, 4))
df_filtered.plot1d('trip_speed_mph', limits=[0, 120], f='log10', shape=64, lw=3, progress=True)
plt.xlabel('Mean speed during a trip [miles/hour]')
plt.ylabel('Number of trips [dex]')
plt.show()
```

Based on this plot we can make a sensible choice for a typical trip speed: somewhere in the range of 1-60 miles per hour.

In [20]:

```
# Filter out erroneous average trip speeds.
df_filtered = df_filtered[(df_filtered.trip_speed_mph > 1) & (df_filtered.trip_speed_mph < 60)]
```

Finally, let's look at the cost of the taxi trips. From the output of the `describe()` method, we can see that there are some crazy outliers in the *fare_amount*, *total_amount*, and *tip_amount* columns. For starters, no value in these columns should be negative. Also, their upper limits are ridiculously high. Let's look at their distributions, but in a more sensible range.

In [21]:

```
plt.figure(figsize=(18, 5))
plt.subplot(131)
df_filtered.plot1d('fare_amount', shape=64, lw=3, limits=[0, 1000], f='log10', progress=True)
plt.xlabel('Fare amount [$]')
plt.ylabel('Number of trips [dex]')
plt.tick_params(labelsize=14)
plt.subplot(132)
df_filtered.plot1d('total_amount', shape=64, lw=3, limits=[0, 1000], f='log10', progress=True)
plt.xlabel('Total amount [$]')
plt.ylabel('')
plt.tick_params(labelsize=14)
plt.subplot(133)
df_filtered.plot1d('tip_amount', shape=64, lw=3, limits=[0, 1000], f='log10', progress=True)
plt.xlabel('Tip amount [$]')
plt.ylabel('')
plt.tick_params(labelsize=14)
plt.tight_layout()
plt.show()
```

We see that in all three cases these distributions have very long tails. Perhaps a few of the large fares are legit; most are probably, or hopefully, erroneous data, or maybe some funny business is going on from time to time. In any case, we would like to focus on the regular "vanilla" rides, so we will select all trips that have a total and fare amount of less than \$200 (the elbow of the distributions). The same goes for the tips. Note that the tips are not included in the total amount column. We also require that the fare and total amount be larger than \$0. This condition is not imposed on the tips, although none of them can be negative.

In [22]:

```
df_filtered = df_filtered[(df_filtered.total_amount > 0) & (df_filtered.total_amount < 200) &
                          (df_filtered.fare_amount > 0) & (df_filtered.fare_amount < 200) &
                          (df_filtered.tip_amount >= 0) & (df_filtered.tip_amount < 200)]
```

Finally, after this initial cleaning of the data is done, let's see how many taxi trips we have left.

In [23]:

```
N_samples = len(df_filtered)
print(f'Number of trips in the filtered dataset: {N_samples}')
```

We have over 1.1 billion taxi trips for our upcoming analysis. Let's get to it!

Let's assume we are a prospective taxi driver, or even a manager of a taxi company, and are interested in finding out where are, on average, the best hotspots to pick up passengers from, which will lead to large taxi fees.

Naively, we can just plot a map of the pickup locations, color-coded by the average fare amount in each bin, i.e. each part of town.

However, as a taxi driver we have our own expenses as well. We need to pay for fuel, and taking a passenger somewhere remote might mean that we spend a lot of time and fuel just getting back to the city centre, where it is perhaps not so easy to find a passenger for the trip back. Taking this into consideration, we decide to instead color-code the map of NYC by the mean of the trip fare divided by the trip distance. In this simple way we can introduce a normalization that takes some of our costs into account.

These two cases are plotted below.

In [24]:

```
# Define new columns that might prove useful:
df_filtered['tip_percentage'] = df_filtered.tip_amount / df_filtered.total_amount * 100.
df_filtered['fare_over_distance'] = df_filtered.fare_amount / df_filtered.trip_distance
```

In [25]:

```
plt.figure(figsize=(15, 5))
plt.subplot(121)
df_filtered.plot('pickup_longitude', 'pickup_latitude', what='mean(fare_amount)',
                 colormap='plasma', f='log1p', shape=512, colorbar=True,
                 colorbar_label='mean fare amount [$]', vmin=1, vmax=4.5)
plt.xlabel('pickup longitude')
plt.ylabel('pickup latitude')
plt.subplot(122)
df_filtered.plot('pickup_longitude', 'pickup_latitude', what='mean(fare_over_distance)',
                 colormap='plasma', f='log1p', shape=512, colorbar=True,
                 colorbar_label='mean fare/distance [$/mile]', vmin=0.75, vmax=2.5)
plt.xlabel('pickup longitude')
plt.ylabel('')
plt.gca().axes.get_yaxis().set_visible(False)
plt.tight_layout()
plt.show()
```

We see that in the first case, if we just care about getting the maximum fare for the service provided, it is best to pick up passengers around the NYC airports and along the main roads, such as the Van Wyck Expressway and the Long Island Expressway.

However, when we divide by the distance travelled, we get a slightly different picture. The Van Wyck Expressway and the Long Island Expressway are still relevant, but much less prominent, and the airports are not as popular. Some other hotspots appear on the other side of the Hudson River, which seem like quite profitable locations to pick up passengers from.

Next, we want to figure out when most taxi trips take place, so we can schedule our working hours accordingly.

In [26]:

```
# Extract some date/time features
df_filtered['pickup_hour'] = df_filtered.pickup_datetime.dt.hour
```
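Extracting the hour of day from a timestamp column is the same operation one can perform on plain numpy `datetime64` arrays; a minimal sketch on synthetic timestamps (not the taxi data), assuming the epoch-based integer representation numpy uses:

```python
import numpy as np

timestamps = np.array(['2015-01-01T08:15', '2015-01-01T23:45'],
                      dtype='datetime64[m]')

# Truncate to hour resolution; the int64 view counts hours since the epoch
# (midnight 1970-01-01), so taking it modulo 24 yields the hour of day
hours = timestamps.astype('datetime64[h]').astype(np.int64) % 24
print(hours)  # [ 8 23]
```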