7 reasons why I love Vaex for data science

In [1]:
import vaex

import numpy as np
import pylab as plt

import warnings
warnings.filterwarnings('ignore')

Introduction

Vaex is an open-source DataFrame library for Python with an API that closely resembles that of Pandas. I have been using Vaex for several years in both academic and industry environments, and it is my go-to library for several of the data science projects I am working on. In this article I would like to share some of my favourite Vaex features. Some may be obvious by now, but some may surprise you.

The following code examples are run on a MacBook Pro (15", 2018, 2.6GHz Intel Core i7, 32GB RAM).

1. Easy to work with very large datasets

Nowadays it is increasingly common to encounter datasets that are larger than the available RAM on a typical laptop or desktop workstation. Vaex solves this problem rather elegantly through memory mapping and lazy evaluation. As long as your data is stored in a memory-mappable file format such as Apache Arrow or HDF5, Vaex will open it instantly, no matter how large it is or how much RAM your machine has. In fact, the size of the files Vaex can read is limited only by the amount of free hard-disk space you have. If your data is not in a memory-mappable format (e.g. CSV, JSON), you can easily convert it by using the rich Pandas I/O in combination with Vaex. See this guide on how to do so.

In [2]:
# Check the file size on disk
!ls -l -h ../vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5
-rw-r--r--  1 jovan  staff   107G Jul  3  2019 ../vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5
In [3]:
# # Read data from S3
# df = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')

# Read data from local disk
df = vaex.open('../vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5')
df
Out[3]:
# vendor_id pickup_datetime dropoff_datetime passenger_count payment_type trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
0  VTS  2009-01-04 02:52:00  2009-01-04 03:02:00  1  CASH  2.630000114440918  -73.99195861816406  40.72156524658203  nan  nan  -73.99380493164062  40.6959228515625  8.899999618530273  0.5  nan  0.0  0.0  9.399999618530273
1  VTS  2009-01-04 03:31:00  2009-01-04 03:38:00  3  Credit  4.550000190734863  -73.98210144042969  40.736289978027344  nan  nan  -73.95584869384766  40.768028259277344  12.100000381469727  0.5  nan  2.0  0.0  14.600000381469727
2  VTS  2009-01-03 15:43:00  2009-01-03 15:57:00  5  Credit  10.350000381469727  -74.0025863647461  40.73974609375  nan  nan  -73.86997985839844  40.770225524902344  23.700000762939453  0.0  nan  4.739999771118164  0.0  28.440000534057617
3  DDS  2009-01-01 20:52:58  2009-01-01 21:14:00  1  CREDIT  5.0  -73.9742660522461  40.79095458984375  nan  nan  -73.9965591430664  40.731849670410156  14.899999618530273  0.5  nan  3.049999952316284  0.0  18.450000762939453
4  DDS  2009-01-24 16:18:23  2009-01-24 16:24:56  1  CASH  0.4000000059604645  -74.00157928466797  40.719383239746094  nan  nan  -74.00837707519531  40.7203483581543  3.700000047683716  0.0  nan  0.0  0.0  3.700000047683716
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
1,173,057,922  VTS  2015-12-31 23:59:56  2016-01-01 00:08:18  5  1  1.2000000476837158  -73.99381256103516  40.72087097167969  1.0  0.0  -73.98621368408203  40.722469329833984  7.5  0.5  0.5  1.7599999904632568  0.0  10.5600004196167
1,173,057,923  CMT  2015-12-31 23:59:58  2016-01-01 00:05:19  2  2  2.0  -73.96527099609375  40.76028060913086  1.0  0.0  -73.93951416015625  40.75238800048828  7.5  0.5  0.5  0.0  0.0  8.800000190734863
1,173,057,924  CMT  2015-12-31 23:59:59  2016-01-01 00:12:55  2  2  3.799999952316284  -73.98729705810547  40.739078521728516  1.0  0.0  -73.9886703491211  40.69329833984375  13.5  0.5  0.5  0.0  0.0  14.800000190734863
1,173,057,925  VTS  2015-12-31 23:59:59  2016-01-01 00:10:26  1  2  1.9600000381469727  -73.99755859375  40.72569274902344  1.0  0.0  -74.01712036132812  40.705322265625  8.5  0.5  0.5  0.0  0.0  9.800000190734863
1,173,057,926  VTS  2015-12-31 23:59:59  2016-01-01 00:21:30  1  1  1.059999942779541  -73.9843978881836  40.76725769042969  1.0  0.0  -73.99098205566406  40.76057052612305  13.5  0.5  0.5  2.9600000381469727  0.0  17.760000228881836

Caption: Opening and previewing a 100GB file with Vaex is instant.

2. No memory copies

Vaex has a zero memory copy policy. This means that filtering a DataFrame costs very little memory and does not copy the data. Consider the following example.

In [5]:
df_filtered = df[(df.passenger_count>0) & (df.passenger_count<5)]
df_filtered
Out[5]:
# vendor_id pickup_datetime dropoff_datetime passenger_count payment_type trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
0  VTS  2009-01-04 02:52:00  2009-01-04 03:02:00  1  CASH  2.630000114440918  -73.99195861816406  40.72156524658203  nan  nan  -73.99380493164062  40.6959228515625  8.899999618530273  0.5  nan  0.0  0.0  9.399999618530273
1  VTS  2009-01-04 03:31:00  2009-01-04 03:38:00  3  Credit  4.550000190734863  -73.98210144042969  40.736289978027344  nan  nan  -73.95584869384766  40.768028259277344  12.100000381469727  0.5  nan  2.0  0.0  14.600000381469727
2  DDS  2009-01-01 20:52:58  2009-01-01 21:14:00  1  CREDIT  5.0  -73.9742660522461  40.79095458984375  nan  nan  -73.9965591430664  40.731849670410156  14.899999618530273  0.5  nan  3.049999952316284  0.0  18.450000762939453
3  DDS  2009-01-24 16:18:23  2009-01-24 16:24:56  1  CASH  0.4000000059604645  -74.00157928466797  40.719383239746094  nan  nan  -74.00837707519531  40.7203483581543  3.700000047683716  0.0  nan  0.0  0.0  3.700000047683716
4  DDS  2009-01-16 22:35:59  2009-01-16 22:43:35  2  CASH  1.2000000476837158  -73.98980712890625  40.73500442504883  nan  nan  -73.98502349853516  40.72449493408203  6.099999904632568  0.5  nan  0.0  0.0  6.599999904632568
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
1,061,605,165  CMT  2015-12-31 23:59:56  2016-01-01 00:09:25  1  1  1.0  -73.9738998413086  40.74289321899414  1.0  0.0  -73.98957061767578  40.75054931640625  8.0  0.5  0.5  1.850000023841858  0.0  11.149999618530273
1,061,605,166  CMT  2015-12-31 23:59:58  2016-01-01 00:05:19  2  2  2.0  -73.96527099609375  40.76028060913086  1.0  0.0  -73.93951416015625  40.75238800048828  7.5  0.5  0.5  0.0  0.0  8.800000190734863
1,061,605,167  CMT  2015-12-31 23:59:59  2016-01-01 00:12:55  2  2  3.799999952316284  -73.98729705810547  40.739078521728516  1.0  0.0  -73.9886703491211  40.69329833984375  13.5  0.5  0.5  0.0  0.0  14.800000190734863
1,061,605,168  VTS  2015-12-31 23:59:59  2016-01-01 00:10:26  1  2  1.9600000381469727  -73.99755859375  40.72569274902344  1.0  0.0  -74.01712036132812  40.705322265625  8.5  0.5  0.5  0.0  0.0  9.800000190734863
1,061,605,169  VTS  2015-12-31 23:59:59  2016-01-01 00:21:30  1  1  1.059999942779541  -73.9843978881836  40.76725769042969  1.0  0.0  -73.99098205566406  40.76057052612305  13.5  0.5  0.5  2.9600000381469727  0.0  17.760000228881836

Caption: Filtering a Vaex DataFrame does not copy the data and takes a negligible amount of memory.

The creation of the df_filtered DataFrame takes no extra memory! This is because df_filtered is a shallow copy of df. When creating filtered DataFrames, Vaex creates a binary mask which is applied to the original data, without the need to make copies. The memory cost of these kinds of filters is low: one needs only ~1.2 GB of RAM to filter a 1-billion-row DataFrame. This is negligible compared to other "classical" tools, where one would need ~100 GB just to read in the data, and another ~100 GB for the filtered DataFrame.
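The ~1.2 GB figure is easy to sanity-check. Assuming the filter mask costs on the order of one byte per row (the exact bookkeeping may differ slightly), for the 1,173,057,927 rows in this dataset:

```python
# Rough memory estimate for a per-row boolean filter mask
n_rows = 1_173_057_927
mask_gib = n_rows * 1 / 1024**3  # one byte per row
print(f'{mask_gib:.2f} GiB')  # → 1.09 GiB
```

That back-of-the-envelope number lines up with the ~1.2 GB observed in practice.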

3. Virtual columns

Transforming existing columns of a Vaex DataFrame into new ones results in the creation of virtual columns. Virtual columns behave just like normal ones, but they take up no memory whatsoever. This is because Vaex only remembers the expression that defines them and does not calculate the values up front. These columns are lazily evaluated only when necessary, keeping memory usage low.

In [6]:
df['tip_percentage'] = df.tip_amount / df.total_amount * 100
df[['fare_amount', 'total_amount', 'tip_amount', 'tip_percentage']].head(5)
Out[6]:
# fare_amount total_amount tip_amount tip_percentage
0 8.9 9.4 0 0
1 12.1 14.6 2 13.6986
2 23.7 28.44 4.74 16.6667
3 14.9 18.45 3.05 16.5312
4 3.7 3.7 0 0

Caption: The "tip_percentage" column is a virtual column: it takes no extra memory and is lazily evaluated on the fly when needed.

4. Performance

Vaex is fast. I mean seriously fast. The evaluation of virtual columns is fully parallelized and done in a single pass over the data. Column methods such as "value_counts", "groupby", "unique" and the various string operations use fast and efficient algorithms implemented in C++ under the hood. All of them work in an out-of-core fashion, meaning you can process much more data than fits into RAM, while using all available cores of your processor. For example, a "value_counts" operation takes only about a second for over 1 billion rows!

In [9]:
df.passenger_count.value_counts(progress='widget')
Out[9]:
1     812321234
2     172864560
5      81923912
3      51435890
6      25614703
        ...    
69            1
66            1
61            1
53            1
70            1
Length: 62, dtype: int64

Caption: Using Vaex, the "value_counts" operation takes ~1s for over 1.1 billion rows!

5. Just-In-Time Compilation

As long as a virtual column is defined using only NumPy or pure Python operations, Vaex can accelerate its evaluation by jitting, i.e. Just-In-Time compilation via Numba or Pythran. Vaex also supports acceleration via CUDA if your machine has a CUDA-enabled NVIDIA graphics card. This can be quite useful for speeding up the evaluation of computationally expensive virtual columns.

Consider the example below. I have defined the arc distance between two geographical locations, a calculation that involves a fair amount of algebra and trigonometry. Calculating the mean value will force the execution of this rather computationally expensive virtual column. When the evaluation is done purely with NumPy, it takes only 30 seconds, which I find impressive given that it is done for over 1.1 billion rows. When we do the same with the Numba pre-compiled expression, execution is ~2.5 times faster, at least on my laptop. Unfortunately, I do not have an NVIDIA graphics card, so I cannot try the same with CUDA at this time. If you do, I would be very happy if you could try it out and share the results.

A small but important bonus: notice that you do not need to call .compute or any such method; Vaex automatically knows when to be lazy and when to execute a computation.

In [10]:
def arc_distance(theta_1, phi_1, theta_2, phi_2):
    # Haversine-style great-circle distance; 3958.8 is the Earth's radius in miles
    temp = (np.sin((theta_2-theta_1)/2*np.pi/180)**2
           + np.cos(theta_1*np.pi/180)*np.cos(theta_2*np.pi/180) * np.sin((phi_2-phi_1)/2*np.pi/180)**2)
    distance = 2 * np.arctan2(np.sqrt(temp), np.sqrt(1-temp))
    return distance * 3958.8

# Expression to be evaluated with numpy as usual
df['arc_distance_numpy'] = arc_distance(df.pickup_longitude, 
                                        df.pickup_latitude, 
                                        df.dropoff_longitude, 
                                        df.dropoff_latitude)

# Expression to be pre-compiled with numba, and then executed
df['arc_distance_numba'] = arc_distance(df.pickup_longitude, 
                                        df.pickup_latitude, 
                                        df.dropoff_longitude, 
                                        df.dropoff_latitude).jit_numba()

# Expression to be pre-compiled with CUDA, and then executed on your GPU,
# provided you have a CUDA-compatible NVIDIA GPU.
# df['arc_distance_cuda'] = arc_distance(df.pickup_longitude, 
#                                        df.pickup_latitude, 
#                                        df.dropoff_longitude, 
#                                        df.dropoff_latitude).jit_cuda()

# Calculate the mean 
mean_numpy = df.arc_distance_numpy.mean(progress='widget')
mean_numba = df.arc_distance_numba.mean(progress='widget')
print(f'Mean arc distance computed with numpy: {mean_numpy:.5f}')
print(f'Mean arc distance computed with numba: {mean_numba:.5f}')
Mean arc distance computed with numpy: 12.70720
Mean arc distance computed with numba: 12.70720

Caption: Jitting can lead to ~2.5 times faster execution for a computationally expensive virtual column.

6. Selections

Vaex implements a concept called selections, which are used to, well, select the data. Selections are useful when you want to explore the data by, for example, calculating statistics on different portions of it without creating a new reference DataFrame each time. The true power of selections is that we can calculate a statistic for multiple selections with just one pass over the data.

In [12]:
select_n_passengers_lt3 = df.passenger_count < 3
select_n_passengers_ge3 = df.passenger_count >= 3

df.fare_amount.mean(selection=[None, select_n_passengers_lt3, select_n_passengers_ge3], progress='widget')
Out[12]:
array([11.21730816, 11.2078196 , 11.26832503])

Caption: You can calculate statistics for multiple selections with one pass over the data.

This can also be useful for making various visualisations. For example, we can use the .count method to create a couple of histograms on different selections with just one pass over the data. Quite efficient!

In [13]:
binned_dist_npass_lt3, binned_dist_npass_ge3 = df.count(binby=['total_amount'],
                                                        limits=[5, 50],
                                                        shape=64,
                                                        selection=[select_n_passengers_lt3, 
                                                                   select_n_passengers_ge3], 
                                                        progress='widget')

xvalues = np.linspace(5, 50, 64)
plt.figure(figsize=(8, 4))
plt.plot(xvalues, binned_dist_npass_lt3, drawstyle="steps-pre", label='num passengers < 3', lw=3)
plt.plot(xvalues, binned_dist_npass_ge3, drawstyle="steps-pre", label='num passengers >=3', lw=3)
plt.legend(fontsize=14)
plt.xlabel('Total amount [$]', fontsize=14)
plt.ylabel('Number of trips', fontsize=14)
plt.show()

Caption: One can create multiple histograms on different selections with just one pass over the data.

7. Groupby aggregations with selections

One of my favourite features of Vaex is the ability to use selections inside aggregation functions. I often find myself wanting to do a group-by operation in which the aggregations follow some additional rule or filter. The SQL-esque way of doing this would be to run several separate queries, first filtering the data and then doing the group-by aggregation, and finally joining the outputs of those aggregations into a single table. With Vaex, one can do this in a single operation, with just one pass over the data! The following group-by example, run on over 1.1 billion rows, takes only about 30 seconds to execute on my laptop!

In [15]:
df.groupby(by='vendor_id', agg={'count': vaex.agg.count(),
                                'count_fare_n_pass_lt3': vaex.agg.count(selection='passenger_count<3'),
                                'count_fare_n_pass_ge3': vaex.agg.count(selection='passenger_count>=3'),
                                'mean_fare_n_pass_lt3': vaex.agg.mean('fare_amount', selection='passenger_count<3'),
                                'mean_fare_n_pass_ge3': vaex.agg.mean('fare_amount', selection='passenger_count>=3'),
                               })
Out[15]:
# vendor_id count count_fare_n_pass_lt3 count_fare_n_pass_ge3 mean_fare_n_pass_lt3 mean_fare_n_pass_ge3
0  CMT  5.71882e+08  5.37345e+08  3.4537e+07   11.0698  11.917
1  DDS  1.41691e+07  1.30247e+07  1.14446e+06  10.3088  10.8373
2  VTS  5.87007e+08  4.38728e+08  1.48279e+08  11.4036  11.1206

Caption: Selections can be used inside groupby aggregation functions.

Bonus: Progress bars

Looking at the examples above, you may have noticed many of the methods implemented in Vaex are accompanied by a progress bar. I absolutely love this feature! Since I often use Vaex with larger datasets, knowing how long an operation is likely to take is very handy. Besides, it looks pretty and it makes me feel good as a user :).

Conclusion

Vaex is a rather powerful DataFrame library, especially when it comes to larger datasets. It has a variety of useful methods for data manipulation and visualization. These are just my top 7 reasons why I like working with this library. What are yours?