df026_AsNumpyArrays

This tutorial shows how to read data from an RDataFrame into NumPy arrays.

Author: Stefan Wunsch
This notebook tutorial was automatically generated with ROOTBOOK-izer from the macro found in the ROOT repository on Sunday, January 19, 2020 at 01:02 AM.

In [1]:
import ROOT
from sys import exit
Welcome to JupyROOT 6.19/01

Let's create a simple dataframe with ten rows and two columns, x and y.

In [2]:
df = ROOT.RDataFrame(10) \
         .Define("x", "(int)rdfentry_") \
         .Define("y", "1.f/(1.f+rdfentry_)")
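
To make the two Define expressions concrete, the sketch below reproduces the same values in plain Python, without ROOT. It assumes that rdfentry_ runs from 0 to 9 for a ten-row dataframe; the names entries, x and y are illustrative only. Python floats are double precision, so the values of y can differ from ROOT's single-precision (1.f) results in the last digits.

```python
# Plain-Python stand-in for the two Define calls (no ROOT required).
entries = range(10)                      # assumed values of rdfentry_
x = [int(e) for e in entries]            # mirrors "(int)rdfentry_"
y = [1.0 / (1.0 + e) for e in entries]   # mirrors "1.f/(1.f+rdfentry_)", in double precision

print(x)
print([round(v, 4) for v in y])
```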

Next, we want to access the data from Python as NumPy arrays. To do so, the content of the dataframe is converted with the AsNumpy method. The returned object is a dictionary with the column names as keys and 1D NumPy arrays holding the column content as values.

In [3]:
npy = df.AsNumpy()
print("Read-out of the full RDataFrame:\n{}\n".format(npy))
Read-out of the full RDataFrame:
{'y': ndarray([1.        , 0.5       , 0.33333334, 0.25      , 0.2       ,
         0.16666667, 0.14285715, 0.125     , 0.11111111, 0.1       ],
        dtype=float32), 'x': ndarray([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)}

/mnt/build/workspace/root-makedoc/rootspi/rdoc/src/master.build/lib/ROOT.py:421: FutureWarning: Instantiating a function template with parentheses ( f(type1, ..., typeN) ) is deprecated and will not be supported in a future version of ROOT. Instead, use square brackets: f[type1, ..., typeN]
  result_ptrs[column] = _root.ROOT.Internal.RDF.RDataFrameTake(column_type)(df_rnode, column)
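
Because the result is an ordinary dictionary of equal-length columns, it can be walked row by row with standard Python tools. The sketch below is an illustration rather than part of the tutorial: iter_rows is a hypothetical helper, and plain lists stand in for the NumPy arrays so that it runs without ROOT. The function only relies on iteration, so it should accept any dictionary of equal-length sequences.

```python
def iter_rows(columns):
    """Yield one {column name: value} dict per row of a column-oriented dict."""
    names = list(columns)
    # zip walks all columns in lockstep, producing one tuple of values per row.
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


# Stand-in for the dictionary returned by AsNumpy, shortened to three rows.
npy_like = {"x": [0, 1, 2], "y": [1.0, 0.5, 1.0 / 3.0]}

for row in iter_rows(npy_like):
    print(row)
```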

Since reading out data to memory is expensive, always try to read out only what your analysis needs. You can use all RDataFrame features to reduce your dataset, e.g., the Filter transformation. Furthermore, you can pass the AsNumpy method a whitelist of column names with the option columns, or a blacklist of column names with the option exclude.

In [4]:
df2 = df.Filter("x>5")
npy2 = df2.AsNumpy()
print("Read-out of the filtered RDataFrame:\n{}\n".format(npy2))

npy3 = df2.AsNumpy(columns=["x"])
print("Read-out of the filtered RDataFrame with the columns option:\n{}\n".format(npy3))

npy4 = df2.AsNumpy(exclude=["x"])
print("Read-out of the filtered RDataFrame with the exclude option:\n{}\n".format(npy4))
Read-out of the filtered RDataFrame:
{'y': ndarray([0.14285715, 0.125     , 0.11111111, 0.1       ], dtype=float32), 'x': ndarray([6, 7, 8, 9], dtype=int32)}

Read-out of the filtered RDataFrame with the columns option:
{'x': ndarray([6, 7, 8, 9], dtype=int32)}

Read-out of the filtered RDataFrame with the exclude option:
{'y': ndarray([0.14285715, 0.125     , 0.11111111, 0.1       ], dtype=float32)}
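
The effect of the two options can be summarized as: start from columns if it is given, otherwise from all available columns, then drop anything listed in exclude. The helper below, select_columns, is hypothetical and only mirrors the behaviour visible in the output above; it is not ROOT's implementation, and how it treats both options given together is an assumption.

```python
def select_columns(available, columns=None, exclude=None):
    """Return the column names that survive a whitelist and a blacklist."""
    # Whitelist: fall back to every available column when none is requested.
    selected = list(available) if columns is None else list(columns)
    # Blacklist: remove excluded names while keeping the original order.
    if exclude is not None:
        selected = [name for name in selected if name not in exclude]
    return selected


available = ["x", "y"]
print(select_columns(available))                  # no options
print(select_columns(available, columns=["x"]))   # whitelist only
print(select_columns(available, exclude=["x"]))   # blacklist only
```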

You can read out any object from ROOT files, since PyROOT wraps these objects for use in Python. Be aware, however, that anything other than fundamental types such as int or float, for example complex C++ objects, is costly to read out.

In [5]:
ROOT.gInterpreter.Declare("""
// Inject the C++ class CustomObject in the C++ runtime.
class CustomObject {
public:
    int x = 42;
};
// Create a function that returns such an object. This is called to fill the dataframe.
CustomObject fill_object() { return CustomObject(); }
""")

df3 = df.Define("custom_object", "fill_object()")
npy5 = df3.AsNumpy()
print("Read-out of C++ objects:\n{}\n".format(npy5["custom_object"]))
print("Access to all methods and data members of the C++ object:\nObject: {}\nAccess data member: custom_object.x = {}\n".format(
    repr(npy5["custom_object"][0]), npy5["custom_object"][0].x))
Read-out of C++ objects:
[<ROOT.CustomObject object at 0x56192bfb9980>
 <ROOT.CustomObject object at 0x56192bfb9984>
 <ROOT.CustomObject object at 0x56192bfb9988>
 <ROOT.CustomObject object at 0x56192bfb998c>
 <ROOT.CustomObject object at 0x56192bfb9990>
 <ROOT.CustomObject object at 0x56192bfb9994>
 <ROOT.CustomObject object at 0x56192bfb9998>
 <ROOT.CustomObject object at 0x56192bfb999c>
 <ROOT.CustomObject object at 0x56192bfb99a0>
 <ROOT.CustomObject object at 0x56192bfb99a4>]

Access to all methods and data members of the C++ object:
Object: <ROOT.CustomObject object at 0x56192bfb9980>
Access data member: custom_object.x = 42
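
When only one data member is needed, a possible way to avoid carrying whole objects around is to copy that member into a column of a fundamental type. The sketch below illustrates the idea in plain Python, without ROOT: the CustomObject class here is a Python stand-in for the C++ class, and the standard-library array module plays the role of a typed column. All names are assumptions made for illustration.

```python
from array import array


class CustomObject:
    """Python stand-in for the C++ CustomObject with one int data member."""

    def __init__(self):
        self.x = 42


# A column of objects: every entry is a separate wrapped object.
objects = [CustomObject() for _ in range(10)]

# The same information as fundamental values in one contiguous typed buffer.
xs = array("i", (obj.x for obj in objects))

print(list(xs))
```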

Note that you can pass the object returned by AsNumpy directly to pandas.DataFrame, including any complex C++ objects that were read out.

In [6]:
try:
    import pandas
except ImportError:
    print("Failed to import pandas.")
    exit()

df = pandas.DataFrame(npy5)
print("Content of the ROOT.RDataFrame as pandas.DataFrame:\n{}\n".format(df))
Content of the ROOT.RDataFrame as pandas.DataFrame:
                                  custom_object  x         y
0  <ROOT.CustomObject object at 0x56192bfb9980>  0  1.000000
1  <ROOT.CustomObject object at 0x56192bfb9984>  1  0.500000
2  <ROOT.CustomObject object at 0x56192bfb9988>  2  0.333333
3  <ROOT.CustomObject object at 0x56192bfb998c>  3  0.250000
4  <ROOT.CustomObject object at 0x56192bfb9990>  4  0.200000
5  <ROOT.CustomObject object at 0x56192bfb9994>  5  0.166667
6  <ROOT.CustomObject object at 0x56192bfb9998>  6  0.142857
7  <ROOT.CustomObject object at 0x56192bfb999c>  7  0.125000
8  <ROOT.CustomObject object at 0x56192bfb99a0>  8  0.111111
9  <ROOT.CustomObject object at 0x56192bfb99a4>  9  0.100000

Draw all canvases

In [7]:
from ROOT import gROOT