Welcome one, welcome all, to the biggest fight of the gigayear: The Great Emu Fight!
No, we don't mean this emu fight:
from IPython.display import Image Image('images/emu_war.png', width=400)
from IPython.display import HTML HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/d9OBqYbZ99c" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
/Users/ksf/miniconda3/lib/python3.7/site-packages/IPython/core/display.py:717: UserWarning: Consider using IPython.display.IFrame instead warnings.warn("Consider using IPython.display.IFrame instead")
We're talking about emulators, or "emus" for short. Like the giant Australian birds, these emus are powerful and fast. Unlike the birds, our emus are regression models that are important for astrophysics, and they're a lot less fluffy.
In this tutorial, we'll teach you what an emulator is, and show you how to build emulators with various methods and frameworks. We will then pit our emus against each other and see which emu wins!
Emulators imitate more complex models or simulations; they take input parameters and quickly spit out output values, without performing a physical calculation or a full simulation.
To do this, we train them on the inputs and outputs of simulations. They then learn this relationship, and for any new set of inputs, they can give an ouput. They are essentially fancy interpolators.
Let's take a simple example. Say a computational emu biologist wants to know how fast a particular emu can run. The biologist knows the emu's height, age, and fluffiness. They build a super intense simulator that takes in these demographics, and outputs the expected top speed of the bird.
The issue is that the biologist's simulator is so complicated that it takes an hour to run for each set of input parameters. This is where emulators come in. The biologist just runs a small number of simulations, sparsely spanning a reasonable range of parameter space (heights, ages, fluffiness levels). Then we train an emulator on this set of data. Now, for any other emu demographics, the emulator can quickly predict a given emu's top speed.
Why do we want an emulator of our simulation? They come in really handy for inference. Let's say we see an emu running at top speed down the street, breaking the city speed limit (yes, some can run this fast). But it got away, so to ID it we want to figure out its height, age, and fluffiness. If we only had the biologist's simulator, we would have to run a simulation at every possible demographic combination, and see which top speed most closely matched that of our runaway emu.
This would be untenable with the hour-long simulator. With our emulator, we can explore this parameter space much more quickly, and use techniques such as Markov Chain Monte Carlo to infer the distribution of likely bird characteristics.
Alas, we are not computational emu biologists but astrophysicists. But it turns out emulators are super handy for us too! Some areas emulators have been used:
Here we will demonstrate our emulators on the first use case. A common problem in cosmology is inferring the cosmological parameters that govern our universe - such as the overall density of matter and baryons. We can do this by running giant simulations of fake universes where we know the input cosmological parameters. We can then measure galaxy clustering statistics (e.g. a power spectrum) on both the real and fake universes, and compare them. However, we have the same problem here as we did with the emu speed simulator: it's very slow so we can't run a simulation and get statistics at all possible parameter sets.
We can instead run a small set of simulations spanning cosmological parameter space, and compute clustering statistics on these. Then we can train an emulator on these input parameters and output statistics, and use it to find a set of cosmological parameters that generates statistics most like those we measure in our real universe.
We generated a set of clustering statistics to demonstrate our emulators on. Instead of running full-blown cosmological simulations, which would take much longer than the span of this Hack Week, we used the
nbodykit package. This can compute a model of a clustering statistic for a given cosmology in about a second - which is actually quite slow for robust inference, so our emulator is still useful!
We choose to emulate the 2-point correlation function, the Fourier transform of the power spectrum, which measures the strength of galaxy clustering at a given spatial scale.
We vary the cosmological parameters $\Omega_m$ (the matter density), $\sigma_8$ (the amplitude of clustering), and $\Omega_b$ (the baryon density), fixing the other parameters to reasonable values. We generate a training set of 8000 correlation functions, on a uniform 3D grid of these parameters (20 points per side). We also generate a test set of 100 correlation functions, to see how our emulator did.
We need some specific packages so if they don't exist already, they will be downloaded.
try: import tensorflow except ModuleNotFoundError: !pip install tensorflow
import timeit import numpy as np import matplotlib.pyplot as plt
import emulator %load_ext autoreload %autoreload 2 %config InlineBackend.figure_format = 'retina'
emu = emulator.emulator() path_train = 'data/cosmology_train.pickle' path_test = 'data/cosmology_test.pickle' emu.read_data(path_train, path_test, scale=True, normalize_y=True)
The pattern is because these are in order along the grid, so parameter sets near each other have similar shapes and are plotted in that order with corresponding colors.
We pit a few common emulator methods against each other:
All of these are implemented in the
emulator.py class; check it out for details on the implementation and hyperparameters. This class also performs some basic preprocessing, including normalizing/scaling the inputs and outputs for improved training.
For standalone notebooks constructing each of these emulators, see the linked jupyter notebooks.
We compare each on accuracy and speed. It would also be interesting to compare scaleability with respect to input parameter dimensionality; we leave this as an exercise to the reader. ;)
run_timeit = True from collections import defaultdict time_dict = defaultdict(dict)
regressor_names = emu.regressors_implemented x = np.arange(len(regressor_names)) colors = np.array([plt.cm.Dark2(xx) for xx in x])
An Artificial Neural Network (ANN) is a class of machine learning methods loosely modeled after the neural connections in biological systems. It is also known as a Multi-Layer Perceptron (MLP), with layers that give various weights to input features. It uses an activation function to convert inputs to outputs and learn more complex functions. ANNs can be used for both classification and regression.
ANNs are useful for emulation as they can model complicated function spaces, with high-dimensional features. Here we use the scikitlearn implementation of an MLP regressor.
For a stand-alone walkthrough of constructing this emulator, see
if run_timeit: t = %timeit -o emu.train("ANN") time_dict["ANN"]["train_mean"] = t.average time_dict["ANN"]["train_std"] = t.stdev else: emu.train("ANN")
1.96 s ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
if run_timeit: t = %timeit -o emu.predict_test_set("ANN") time_dict["ANN"]["test_mean"] = t.average time_dict["ANN"]["test_std"] = t.stdev else: emu.predict_test_set("ANN")
1.09 ms ± 56.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
if run_timeit: t = %timeit -o emu.train("DTree") time_dict["DTree"]["train_mean"] = t.average time_dict["DTree"]["train_std"] = t.stdev else: emu.train("DTree")
138 ms ± 7.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
if run_timeit: t = %timeit -o emu.predict_test_set("DTree") time_dict["DTree"]["test_mean"] = t.average time_dict["DTree"]["test_std"] = t.stdev else: emu.predict_test_set("DTree")
700 µs ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)