This is an excerpt from Learning IPython for Interactive Computing and Data Visualization, second edition.Previously in this Chapter 3, we analyzed trip and fare data of nearly one million NYC taxi journeys. Here, we make a density plot of evening trips in Manhattan. In order to run this notebook locally, you'll need to download the

`nyc_taxi`

dataset from the data repo, and extract it to`../chapter2/data/nyc_data.csv`

(or just change the path in the second code cell below).

Let's first import NumPy, pandas, matplotlib, and seaborn:

In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib; matplotlib.rcParams['savefig.dpi'] = 144
```

We load the NYC taxi dataset with pandas:

In [2]:

```
data = pd.read_csv('../chapter2/data/nyc_data.csv',
parse_dates=['pickup_datetime', 'dropoff_datetime'])
```

`.values`

attribute of pandas DataFrames:

In [3]:

```
pickup = data[['pickup_longitude', 'pickup_latitude']].values
dropoff = data[['dropoff_longitude', 'dropoff_latitude']].values
pickup
```

Out[3]:

`pickup[i, j]`

, where `i`

is the 0-indexed row number, and `j`

is the 0-indexed column number:

In [4]:

```
print(pickup[3, 1])
```

In [5]:

```
pickup[1:7:2, 1:]
```

Out[5]:

`[1, 1]`

, `[3, 1]`

, and `[5, 1]`

. The slicing syntax in Python is `start:end:step`

where `start`

is included and `end`

is excluded. If `start`

or `end`

are omitted, they default to `0`

or the length of the dimension, respectively, whereas `step`

defaults to 1. For example, `1:`

is equivalent to `1:n:1`

where `n`

is the size of the axis.

Let's select the longitudes of all pickup locations, in other words, the first column:

In [6]:

```
lon = pickup[:, 0]
lon
```

Out[6]:

The result is a 1D ndarray.

We also get the second column of `pickup`

:

In [7]:

```
lat = pickup[:, 1]
lat
```

Out[7]:

In [8]:

```
lon_min, lon_max = (-73.98330, -73.98025)
lat_min, lat_max = ( 40.76724, 40.76871)
```

`lon_min`

and `lon_max`

:

In [9]:

```
in_lon = (lon_min <= lon) & (lon <= lon_max)
in_lon
```

Out[9]:

The symbol `&`

represents the AND boolean operator, while `|`

represents the OR.

Here, the result is a Boolean vector containing as many elements as there are in the `lon`

vector.

`True`

elements are there in this array? NumPy arrays provide a `sum()`

method that returns the sum of all elements in the array. When the array contains boolean values, `False`

elements are converted to 0 and `True`

elements are converted to 1. Therefore, the sum corresponds to the number of `True`

elements:

In [10]:

```
in_lon.sum()
```

Out[10]:

We can process the latitudes similarly:

In [11]:

```
in_lat = (lat_min <= lat) & (lat <= lat_max)
```

Then, we get all trips where both the longitude and latitude belong to our rectangle:

In [12]:

```
in_lonlat = in_lon & in_lat
in_lonlat.sum()
```

Out[12]:

`np.nonzero()`

function returns the indices corresponding to `True`

in a boolean array, as shown here:

In [13]:

```
np.nonzero(in_lonlat)[0]
```

Out[13]:

Finally, we'll need the dropoff coordinates in the following:

In [14]:

```
lon1, lat1 = dropoff.T
```

`lon1 = dropoff[:, 0]; lat1 = dropoff[:, 1]`

. The `T`

attribute corresponds to the transpose of a matrix, which simply means that a matrix's columns become the corresponding rows of a new matrix, just as the new columns are the original matrix's rows. Here, `dropoff.T`

is a `(2, N)`

array where the first row contains the longitude and the second row contains the latitude. In NumPy, an ndarray is iterable along the first dimension, in other words, along the rows of the matrix. Therefore, the syntax unpacking feature of Python allows us to concisely assign `lon1`

to the first row and `lat1`

to the second row.

The following function implements this formula.

In [15]:

```
EARTH_R = 6372.8
def geo_distance(lon0, lat0, lon1, lat1):
"""Return the distance (in km) between two points in
geographical coordinates."""
# from: http://en.wikipedia.org/wiki/Great-circle_distance
# and: http://stackoverflow.com/a/8859667/1595060
lat0 = np.radians(lat0)
lon0 = np.radians(lon0)
lat1 = np.radians(lat1)
lon1 = np.radians(lon1)
dlon = lon0 - lon1
y = np.sqrt(
(np.cos(lat1) * np.sin(dlon)) ** 2
+ (np.cos(lat0) * np.sin(lat1)
- np.sin(lat0) * np.cos(lat1) * np.cos(dlon)) ** 2)
x = np.sin(lat0) * np.sin(lat1) + \
np.cos(lat0) * np.cos(lat1) * np.cos(dlon)
c = np.arctan2(y, x)
return EARTH_R * c
```

`np.radians()`

(converting numbers from degrees into radians), `np.cos()`

, `np.sin()`

, `np.arctan2(x, y)`

(returning the arctangent of `x/y`

), and so on. These mathematical functions are defined on real numbers, but NumPy provides *vectorized* versions of them. These vectorized functions not only work on numbers but also on arbitrary numerical ndarrays. As we have explained earlier, these functions are orders of magnitude faster than Python loops. You will find the list of mathematical functions in NumPy at http://docs.scipy.org/doc/numpy/reference/routines.math.html.

Now, let's compute the straight line distances of all taxi trips:

In [16]:

```
distances = geo_distance(lon, lat, lon1, lat1)
```

`in_lonlat`

boolean array obtained earlier in this section.

In [18]:

```
plt.hist(distances[in_lonlat], np.linspace(0., 10., 50))
plt.xlabel('Trip distance (km)')
plt.ylabel('Number of trips')
```

Out[18]:

`plt.hist()`

function computes a histogram and plots it. It is a convenient wrapper around NumPy's `np.histogram()`

function that just computes a histogram. You will find more statistical functions in NumPy at http://docs.scipy.org/doc/numpy/reference/routines.statistics.html.

`.values`

attribute:

In [19]:

```
evening = (data.pickup_datetime.dt.hour >= 19).values
```

In [20]:

```
n = np.sum(evening)
```

In [21]:

```
n
```

Out[21]:

The `n`

variable contains the number of evening trips in our dataset.

`n`

evening trips. There are `2n`

of such points. Every point is associated with a weight of `-1`

for pickup locations and `+1`

for dropoff locations. The algebraic density of points at a given location, taking into account the weights, reflects whether people tend to leave or to arrive at this location.

`weights`

vector for our `2n`

points, we first create a vector containing only zeros. Then, we set the first half of the array to `-1`

(pickup) and the last half to `+1`

(dropoff):

In [22]:

```
weights = np.zeros(2 * n)
```

In [23]:

```
weights[:n] = -1
weights[n:] = +1
```

`weights`

is made of`weights[0]`

,`weights[1]`

, up to`weights[n-1]`

. There are`n`

of such elements. The slice`weights[:n]`

is equivalent to`weights[0:n]`

: it starts at`weights[0]`

, and ends at`weights[n]`

excluded, so the last element is effectively`weights[n-1]`

.

`np.tile()`

to concatenate copies of an array along several dimensions, or `np.repeat()`

to make copies of every element along several dimensions. You will find the list of manipulation functions at http://docs.scipy.org/doc/numpy/reference/routines.array-manipulation.html.

`(2n, 2)`

array defined by the vertical concatenation of the pickup and dropoff locations for the evening trips:

In [24]:

```
points = np.r_[pickup[evening],
dropoff[evening]]
```

In [25]:

```
points.shape
```

Out[25]:

`np.r_[]`

syntax allows us to concatenate arrays along the first (vertical) dimension. We could also have used more explicit manipulation functions such as `np.vstack()`

or `np.concatenate()`

.

In [26]:

```
def lat_lon_to_pixels(lat, lon):
lat_rad = lat * np.pi / 180.0
lat_rad = np.log(np.tan((lat_rad + np.pi / 2.0) / 2.0))
x = 100 * (lon + 180.0) / 360.0
y = 100 * (lat_rad - np.pi) / (2.0 * np.pi)
return (x, y)
```

In [27]:

```
lon, lat = points.T
x, y = lat_lon_to_pixels(lat, lon)
```

In [28]:

```
lon_min, lat_min = -74.0214, 40.6978
lon_max, lat_max = -73.9524, 40.7982
```

In [29]:

```
x_min, y_min = lat_lon_to_pixels(lat_min, lon_min)
x_max, y_max = lat_lon_to_pixels(lat_max, lon_max)
```

In [30]:

```
bin = .00003
bins_x = np.arange(x_min, x_max, bin)
bins_y = np.arange(y_min, y_max, bin)
```

These two arrays contain the horizontal and vertical bins.

`np.histogram2d()`

function. We pass as arguments the `y, x`

coordinates of the points (reversed because we want the grid's first axis to represent the `y`

coordinate), the weights, and the bins. This function computes a weighted sum of the points, in every bin. It returns several objects, the first of which is the density map we are interested in:

In [31]:

```
grid, _, _ = np.histogram2d(y, x, weights=weights,
bins=(bins_y, bins_x))
```

Before displaying the density map, we will apply a logistic function to it in order to smooth it:

In [32]:

```
density = 1. / (1. + np.exp(-.5 * grid))
```

**expit function**. It can also be found in the SciPy package at `scipy.special.expit()`

. `scipy.special`

provides many other special functions such as Bessel functions, Gamma functions, hypergeometric functions, and so on.

Finally, we display the density map with `plt.imshow()`

:

In [33]:

```
plt.figure(figsize=(8, 8))
plt.imshow(density,
origin='lower',
interpolation='bicubic'
)
plt.axis('off')
plt.tight_layout()
```

`plt.imshow()`

function displays a matrix as an image. It supports several interpolation methods. Here, we used a bicubic interpolation. The `origin`

argument is necessary because in our `density`

matrix, the top-left corner corresponds to the smallest latitude, so it should correspond to the bottom-left corner in the image.

- Search and sort in arrays
- Set operations
- Linear algebra
- Special mathematical functions
- Fourier transforms and signal processing
- Generation of pseudo-random numbers
- Statistics
- Numerical integration and numerical ODE solvers
- Function interpolation
- Basic image processing
- Numerical optimization

The IPython Cookbook covers many of these topics.

Here are a few references:

- NumPy reference at http://docs.scipy.org/doc/numpy/reference/
- SciPy reference at http://docs.scipy.org/doc/scipy/reference/
- IPython Cookbook at http://ipython-books.github.io/cookbook/