# Hierarchical Bayesian Modeling with STAN: Exoplanet Mass Distribution from Radial Velocities¶

Our goal in this lab is to infer the mass distribution of extrasolar planets from their measured radial velocity semi-amplitudes, their orbital periods, and their host stars' masses, using a hierarchical Bayesian model implemented in STAN, a high-level language for probabilistic modeling.

## Background¶

We can measure the mass of a planet by taking radial velocity (RV) measurements of a star over a period of time. The planetary mass, however, is not a direct observable; we infer it from the semi-amplitude $K$ of the periodic RV signal. Furthermore, there are several other quantities that we need to know in order to infer the planet mass ($M_{\rm pl}$) from the RV semi-amplitude: the mass of the star ($M_{\rm star}$), the period of the planet's orbit ($P$), and the planet's orbital inclination ($i$; a transiting planet would have a known, nearly edge-on inclination, but none of the planets in the current dataset transit):

$$K = \left(\frac{2 \pi G}{P}\right)^{1/3} \frac{M_{\rm pl} \ \sin{i}}{\left(M_{\rm star}+M_{\rm pl}\right)^{2/3}}$$

There is uncertainty in each of these quantities; if we want to get an honest sense of our uncertainty in the planet mass distribution, then we would propagate the uncertainties in each of these quantities, for each individual planet, into the population inference. Hierarchical Bayesian modeling offers a self-consistent probabilistic framework to do just that.
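To make the relation above concrete, here is a small Python helper that evaluates it (a sketch; the function name and the unit conventions, Earth masses for the planet, solar masses for the star, and days for the period, are choices made to mirror the STAN code later in this lab):

```python
import numpy as np

G = 6.674e-11        # gravitational constant [m^3 kg^-1 s^-2]
M_EARTH = 5.972e24   # Earth mass [kg]
M_SUN = 1.989e30     # solar mass [kg]

def rv_semi_amplitude(M_pl, M_star, P_days, sin_i=1.0):
    """RV semi-amplitude K in m/s for a circular orbit.

    M_pl in Earth masses, M_star in solar masses, P_days in days.
    """
    P = P_days * 24.0 * 3600.0          # period in seconds
    m_pl = M_pl * M_EARTH               # planet mass in kg
    m_tot = M_star * M_SUN + m_pl       # total mass in kg
    return (2.0 * np.pi * G / P) ** (1.0 / 3.0) * m_pl * sin_i / m_tot ** (2.0 / 3.0)

# Sanity check: a Jupiter analog (~318 M_Earth, ~12-yr period) around a
# Sun-like star should give the familiar K of roughly 12-13 m/s.
```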

## Part 1: Estimating the mass distribution with known stellar masses and periods¶

Estimate the mean and variance of the exoplanet mass distribution (modeled as a normal distribution), with the host star mass and planetary orbital period as constant, known values, and assuming all edge-on orbits (so that $i=90^\circ$) . . . for now.

In [ ]:
import numpy as np
import matplotlib as mpl
mpl.rcParams['lines.linewidth'] = 0.3
import matplotlib.pyplot as plt
import pystan

In [ ]:
# First we read in the data.  (The filename and the name of the first,
# integer-valued column were lost from this copy of the notebook; the
# placeholders below should be replaced with your own.)
data_RVmass = np.loadtxt('RV_mass_data.txt',  # placeholder filename
                         dtype={'names': ('ID', 'K_obs', 'K_err', 'Mstar', 'Mstar_err', 'Per', 'Per_err'),
                                'formats': ('i2', 'f4', 'f4', 'f4', 'f4', 'f4', 'f4')})
# And get it into a form STAN can interpret
dat4STAN = {'K_obs': data_RVmass['K_obs'],
            'Mstar': data_RVmass['Mstar'],
            'Per': data_RVmass['Per'],
            'K_err': data_RVmass['K_err'],
            'Mstar_err': data_RVmass['Mstar_err'],
            'Per_err': data_RVmass['Per_err'],
            'N': len(data_RVmass['K_obs'])}


Now that we have the data set up so that STAN can interpret it, we need to define the hierarchical model. This is the one we are going to start with:

\begin{align} K_{obs,j}\ \Big|\ M_{pl,j},M_{star,j},P_j,\mu,\sigma & \sim \text{Normal}\Big(K_j,\sigma_{Kj}\Big) \\ K_j & = f(M_{pl,j},M_{star,j},P_j) \\ M_{pl,j}\ \Big|\ \mu, \sigma & \sim \text{Normal}\Big(\mu, \sigma \Big) \\ \mu & \sim \text{Normal}(3 , 1) \\ \sigma & \sim \text{Uniform}( 0.1 , 1000 ) \\ \end{align}

In this notation, the '~' means that the quantity on the left is being sampled, or drawn, from the distribution on the right. Specifically, the quantity being sampled is the one to the left of the '|'. I've listed the parameters it depends on, either explicitly or implicitly, to the right of the '|', to make it easier to see the hierarchical structure of the problem. The probability distribution being sampled from is named on the right side of the '~'. If it's a normal distribution, the first parameter inside the parentheses is the mean, and the second is the standard deviation. If it's a uniform distribution, then the first parameter is the minimum value that that parameter can take, and the second is the maximum.

To understand what's going on here, think about what you would do if you wanted to generate a population of exoplanet radial velocity measurements. First, you'd define what that population distribution is: here, it's a normal distribution with mean $\mu$ and standard deviation $\sigma$. You'd draw $N$ planet masses $M_{pl,j}$ from this distribution (here $N=100$, and $j=1,2,....,N$). This "drawing" from the population distribution is what is represented in the third line of the model.

Second, you'd use these drawn masses to calculate the radial velocity semi-amplitudes $K_j$ with your knowledge of the host star's mass $M_{star,j}$ and the planetary orbital period $P_j$. This step is deterministic, not probabilistic - we know the equation (shown above in the background section), and each unique set of $M_{pl,j}$, $M_{star,j}$, $P_j$ values gives one $K_j$ value that is always the same. This deterministic step is represented in the second line of the model. Note that, for now, we're assuming the planetary orbit is edge-on, so that $\sin i = 1$.

Finally, you'd take these "true" $K_j$ values and "add noise" to simulate taking a measurement with some uncertainty. This step of adding noise is probabilistic, and is represented by the first line of the model - you sample one single "observed" K value given the "true" K value you calculated and a guess for the measurement uncertainty. Those noisy RV semi-amplitudes are $K_{obs,j}$, and they are what we have read in as data.
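The three steps just described can be written out as a short forward simulation (a sketch; the population values, the uniform period range, and the noise level are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100

# Step 1: draw N planet masses [M_Earth] from the population distribution.
mu_true, sigma_true = 3.0, 1.0
M_pl = rng.normal(mu_true, sigma_true, N)

# Known (for now, fixed) stellar masses [M_sun] and orbital periods [days].
M_star = np.full(N, 1.0)
P_days = rng.uniform(1.0, 100.0, N)

# Step 2: deterministic - compute the true semi-amplitudes [m/s], sin(i) = 1.
G, M_EARTH, M_SUN = 6.674e-11, 5.972e24, 1.989e30
P = P_days * 24 * 3600
K_true = (2 * np.pi * G / P) ** (1.0 / 3.0) * M_pl * M_EARTH \
         / (M_star * M_SUN + M_pl * M_EARTH) ** (2.0 / 3.0)

# Step 3: probabilistic - add measurement noise to get "observed" values.
K_err = np.full(N, 0.1)
K_obs = rng.normal(K_true, K_err)
```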

What you would have done in generating this population is exactly what happens in each iteration of a hierarchical MCMC chain: we have defined a forward model for the MCMC algorithm to evaluate. This forward model is summarized in graphical form here:

[Graphical model of the forward model; figure not included in this copy of the notebook]

Of course, the algorithm does a lot more than that, too: it intelligently probes parameter space, proposing steps and accepting/rejecting them in a way that reflects their probability given the data.

In order to have the MCMC algorithm perform this parameter estimation, we need to define priors on our top-level parameters. Remember the first step of our forward model, where we had to define the population distribution? We would have had to choose values for $\mu$ and $\sigma$ at the very beginning, to get this all started. For our MCMC algorithm, we don't choose values for these top-level parameters; instead, we choose their prior distributions. These priors are the last two lines of the above hierarchical model. (In general, hierarchical models turn out to be fairly insensitive to the choice of these priors . . . but don't take my word for it, change things around in the model below! For example, a Log-Normal prior distribution for planet masses might be an even better choice, since it only assigns non-zero probability to positive values for the mass. You can find a full list of probability distributions coded in the STAN language via the "Language Manual" at http://mc-stan.org/users/documentation/, starting on page 508.)
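For instance, a quick NumPy check of why a Log-Normal population distribution might be attractive (the location/scale numbers here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A Normal population distribution puts some probability on negative masses...
normal_draws = rng.normal(3.0, 2.0, 10000)
print((normal_draws < 0).mean())   # a small but non-zero fraction

# ...whereas a Log-Normal distribution is supported only on positive values:
lognormal_draws = rng.lognormal(mean=1.0, sigma=0.5, size=10000)
print(lognormal_draws.min() > 0)   # True
```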

In [ ]:
# Here we define the hierarchical model in the STAN modeling language.  This
# definition includes the data (they are observed, unchanging quantities),
# parameters (the variables that you do NOT read in as data), and the
# probability distributions that define the probabilistic model.
# Note that STAN makes a distinction between parameters and "transformed parameters".
# Transformed parameters are simply variables that are deterministically
# calculated from others, like K_j in the second line of the above model.
model_RVmass = """
data {
    int<lower=0> N;                   // number of planets
    real K_obs[N];                    // planet RV semi-amplitudes
    real<lower=0> K_err[N];           // errors in RV semi-amplitude
    real<lower=0> Per[N];             // planet orbital periods
    real<lower=0> Mstar[N];           // host star masses
}
parameters {
    real mu;                          // mean of the RV mass distribution
    real<lower=0.1,upper=1000> sigma; // standard deviation of the RV mass distribution
    real<lower=0> Mpl[N];             // individual planet masses
}
transformed parameters {
    real K_calc[N];  // RV semi-amplitude calculated from planet mass, host star mass, and period

    for(i in 1:N){
        K_calc[i] = (2*3.14*6.674e-11/(Per[i]*24*3600))^(1/3.)*Mpl[i]*5.972e24/(((Mstar[i]+Mpl[i]*3e-6)*1.989e30)^(2/3.));
    }
}
model {
    for(i in 1:N){
        K_obs[i] ~ normal(K_calc[i], K_err[i]);
        Mpl[i] ~ normal(mu, sigma);
    }

    // PRIORS on the hyperparameters
    mu ~ normal(3, 1);
    sigma ~ uniform(0.1, 1000);
}
"""

In [ ]:
# Now, let's run the model!  The compiling step takes a few minutes,
# but the results should come pretty soon after that.  Note that for
# science, you'll probably want to run your chains for longer than this.
fit = pystan.stan(model_code=model_RVmass, data=dat4STAN, iter=20000, chains=8, thin=50)
print(fit)


The print statement will print out a one-line summary of the posterior distribution for each parameter, showing the mean, standard deviation, and quantiles of that posterior, along with an estimate of the effective sample size and the Gelman-Rubin statistic, to assess convergence. The top-level parameters, i.e. hyperparameters, are listed first in the above output, then the parameters for each individual planet. Every quantity that is not defined as data has a posterior distribution. Note that planet masses are in units of the mass of Earth.

Once the fit is finished and the output has been printed, the first thing you want to check is the convergence of the chains. You can do this qualitatively by looking for a smooth, unimodal posterior distribution and for little to no autocorrelation in the trace plot.
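To make that check slightly more quantitative, here is a minimal NumPy sketch that estimates the autocorrelation of a chain (the example call in the comment assumes the `fit` object from the sampling cell above):

```python
import numpy as np

def autocorr(chain, max_lag=20):
    """Normalized autocorrelation of a 1-D MCMC chain up to max_lag."""
    x = np.asarray(chain) - np.mean(chain)
    acf = np.correlate(x, x, mode='full')[len(x) - 1:]
    return acf[:max_lag + 1] / acf[0]

# e.g.: rho = autocorr(fit.extract(permuted=True)['mu'])
# rho[0] is 1 by construction; for a well-mixed, well-thinned chain the
# remaining lags should be close to zero.
```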

In [ ]:
# Let's look at the posterior (left) and trace plot (right) for
# one of the hyperparameters, the mean of the distribution:
fig = fit.plot(['mu'])

In [ ]:
# Let's look at the other:
fig = fit.plot(['sigma'])


The marginal posteriors may be a bit bumpy, and the trace plots may show some signs of autocorrelation. The posteriors would probably benefit from the chains being run for longer.

In [ ]:
# You can run chains for longer without recompiling the model by calling
# the prior fit output. Executing this cell should take ~ 5 minutes.
# If you're trying to move along rapidly, skip executing this cell.
fit2 = pystan.stan(fit=fit, data=dat4STAN, iter=100000, chains=8, thin=100)
fig = fit2.plot(['mu'])
fig = fit2.plot(['sigma'])


We can also extract the values of the chains, to perform additional checks and make more plots beyond the above pySTAN fit.plot() function.

In [ ]:
# First we extract the posterior samples from the fit
samples = fit.extract(permuted=True)
mu_samp = samples['mu']
sig_samp = samples['sigma']
# Let's use these to make a contour plot of the
# joint mu, sigma hyperparameter posterior:
hist = plt.hist2d(mu_samp, sig_samp, bins=40)
plt.xlabel("mu")
plt.ylabel("sigma")


We can even look at the posteriors for individual planet masses!

In [ ]:
# samples['Mpl'] has shape (n_samples, N); transpose so each row is one planet
mass_pl = np.array(samples['Mpl']).T
hist = plt.hist(mass_pl[0], bins=30)
plt.xlabel("Posterior mass of first planet in sample (M_Earth)")


## Part 2: Adding in uncertainties on the stellar mass¶

We'd normally want to do more to investigate convergence of the posteriors, but for now let's practice changing the model to get a sense for the wide range of things one can do with hierarchical modeling. Recall that we treated the stellar mass as data: it only had one value that we read into the "data" portion of the model. However, we know that the stellar mass also has an uncertainty, and that varying the stellar mass under that uncertainty will affect the inferred planet mass based on the observed $K$. So, let's change the stellar mass aspect of the model:

In [ ]:
model_RVmass_addMstar = """
data {
    int<lower=0> N;                   // number of planets
    real K_obs[N];                    // planet RV semi-amplitudes
    real<lower=0> K_err[N];           // errors in RV semi-amplitude
    real<lower=0> Per[N];             // planet orbital periods
    real<lower=0> Mstar[N];           // host star masses
    real<lower=0> Mstar_err[N];       // errors on the host star masses
}
parameters {
    real mu;                          // mean of the RV mass distribution
    real<lower=0.1,upper=1000> sigma; // standard deviation of the RV mass distribution
    real<lower=0> Mpl[N];             // individual planet masses
    real<lower=0,upper=10> Mstar_true[N]; // individual "true" host star masses
}
transformed parameters {
    real K_calc[N];  // RV semi-amplitude calculated from planet mass, host star mass, and period

    for(i in 1:N){
        K_calc[i] = (2*3.14*6.674e-11/(Per[i]*24*3600))^(1/3.)*Mpl[i]*5.972e24/(((Mstar_true[i]+Mpl[i]*3e-6)*1.989e30)^(2/3.));
    }
}
model {
    for(i in 1:N){
        K_obs[i] ~ normal(K_calc[i], K_err[i]);
        Mpl[i] ~ normal(mu, sigma);
        Mstar[i] ~ normal(Mstar_true[i], Mstar_err[i]);
        Mstar_true[i] ~ uniform(0, 10);
    }

    // PRIORS on the hyperparameters
    mu ~ normal(3, 1);
    sigma ~ uniform(0.1, 1000);
}
"""


Fit this model the same way as in Part 1 (e.g. `fit_addMstar = pystan.stan(model_code=model_RVmass_addMstar, data=dat4STAN, iter=20000, chains=8, thin=50)`). While it's running, compare this model with the model from Part 1. What changed? Why? What would you change in the written model below to accurately reflect what changed in the STAN code above?

\begin{align} K_{obs,j}\ \Big|\ M_{pl,j},M_{star,true,j},P_j,\mu,\sigma & \sim \text{Normal}\Big(K_j,\sigma_{Kj}\Big) \\ M_{star,j} \Big| .... & \sim .... \\ K_j & = f(M_{pl,j},M_{star,true,j},P_j) \\ M_{pl,j}\ \Big|\ \mu, \sigma & \sim \text{Normal}\Big(\mu, \sigma \Big) \\ M_{star,true,j} \Big| .... & \sim .... \\ \mu & \sim \text{Normal}(3 , 1) \\ \sigma & \sim \text{Uniform}( 0.1 , 1000 ) \\ \end{align}

Note that I have already changed $M_{star}$ to $M_{star,true}$ in a couple key places in this model. How would you redraw the graphical model shown above? Once the output has printed, how do the results compare? Feel free to make as many plots as you'd like.

## Part 3: Adding inclination as a latent parameter¶

Not that much changed between Parts 1 and 2. This is because we know the stellar masses pretty well for this sample: their uncertainties are ~10%. Therefore, there isn't much uncertainty to propagate to the hyperparameters' posterior distribution. But what if we change the model to accurately reflect our uncertainty about something we really don't know ... such as the planet's orbital inclination? For planets that don't transit, we don't measure the orbital inclination at all. These kinds of parameters are called latent parameters in hierarchical models: they affect the inference, but are not observed.

Recall that to get things started, we assumed edge-on orbits ($i=90^\circ$). What happens when we no longer assume this? After all, it's a bad assumption for planets discovered via radial velocities that don't transit their host stars. Let's instead assume that exoplanet orbital inclinations are spherically uniform. Recall that drawing angles so that points are located uniformly on a sphere is different from simply drawing the angles from a uniform distribution. Therefore, inclination will be another transformed parameter, calculated from a random number uniformly distributed on the interval [0,1).
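A quick NumPy check of this transformation (a sketch; for isotropic orientations $\cos i$ is uniform on $[-1,1]$, so the sample mean of $\sin i$ should approach $\pi/4 \approx 0.785$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw uniform deviates on [0, 1) and apply the transformation used in the
# Stan model below:
u = rng.uniform(0.0, 1.0, 200000)
incl = np.arccos(1.0 - 2.0 * u)    # radians, spanning [0, pi]

# For isotropic orientations cos(i) is uniform on [-1, 1], so the sample
# mean of sin(i) should be close to pi/4 ~ 0.785:
print(np.mean(np.sin(incl)))
```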

In [ ]:
model_RVmass_addincl = """
data {
    int<lower=0> N;                   // number of planets
    real K_obs[N];                    // planet RV semi-amplitudes
    real<lower=0> K_err[N];           // errors in RV semi-amplitude
    real<lower=0> Per[N];             // planet orbital periods
    real<lower=0> Mstar[N];           // host star masses
    real<lower=0> Mstar_err[N];       // errors on the host star masses
}
parameters {
    real mu;                          // mean of the RV mass distribution
    real<lower=0.1,upper=1000> sigma; // standard deviation of the RV mass distribution
    real<lower=0> Mpl[N];             // individual planet masses
    real<lower=0,upper=10> Mstar_true[N]; // individual "true" host star masses
    real<lower=0,upper=1> tocalc_incl[N]; // the uniform random deviate needed to generate spherically uniform inclinations
}
transformed parameters {
    real K_calc[N];  // RV semi-amplitude calculated from planet mass, host star mass, and period
    real incl[N];    // inclination of the planet's orbit

    for(i in 1:N){
        incl[i] = acos(1-2*tocalc_incl[i]);  // This transformation gives us spherically uniform inclinations.
        K_calc[i] = (2*3.14*6.674e-11/(Per[i]*24*3600))^(1/3.)*Mpl[i]*sin(incl[i])*5.972e24/(((Mstar_true[i]+Mpl[i]*3e-6)*1.989e30)^(2/3.));
    }
}
model {
    for(i in 1:N){
        K_obs[i] ~ normal(K_calc[i], K_err[i]);
        Mpl[i] ~ normal(mu, sigma);
        Mstar[i] ~ normal(Mstar_true[i], Mstar_err[i]);
        Mstar_true[i] ~ uniform(0, 10);
        tocalc_incl[i] ~ uniform(0, 1);
    }

    // PRIORS on the hyperparameters
    mu ~ normal(3, 1);
    sigma ~ uniform(0.1, 1000);
}
"""


How did the inferred hyperparameters change? Did they change in a way that makes sense?

In [ ]:
# Fit the inclination model (same sampler settings as Part 1), then compare
# the hyperparameter posteriors with and without the edge-on assumption:
fit_addincl = pystan.stan(model_code=model_RVmass_addincl, data=dat4STAN, iter=20000, chains=8, thin=50)
samples_addincl = fit_addincl.extract(permuted=True)

bins = np.linspace(1, 6, 40)
plt.hist(mu_samp, bins, alpha=0.5, label='incl = 90')
plt.hist(samples_addincl['mu'], bins, alpha=0.5, label='incl marginalized')
plt.xlabel("Mean of the population mass distribution")
plt.legend(loc='upper right')
plt.show()

plt.hist(sig_samp, bins, alpha=0.5, label='incl = 90')
plt.hist(samples_addincl['sigma'], bins, alpha=0.5, label='incl marginalized')
plt.xlabel("Standard deviation of the population mass distribution")
plt.legend(loc='upper right')
plt.show()


## Bonus Challenges¶

There are many ways in which this model could be further developed. Consider how you might:

• incorporate uncertainties and a population distribution for the orbital periods
• incorporate an eccentricity distribution (requires a more general formula for $K$)
• change the lowest level of the model to be $K\cos w$ and $K \sin w$ instead of $K$; this would ensure that the true masses never go negative even when uncertainty in $K$ spans 0.
• again change the inclination distribution, so that it is related to a $v \sin i$ measurement from the star. After all, it is very difficult to detect planets that have a face-on orbit, so the inclination of the observed sample is likely not spherically uniform.

And finally, how would you incorporate the fact that only planets with measured $K > 1$ m/s are considered detected?! Take another look at the data - there are many planets with very small $K_{obs}$; most of these would never make it into an actual dataset of published exoplanet masses. But you could imagine that restricting the sample to only the planets with higher $K_{obs}$ would bias our inferred mass distribution!
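As a toy illustration of that bias (all numbers here are invented for this sketch, not taken from the lab's dataset):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy population with a true mean mass of 3 M_Earth:
M_pl = rng.normal(3.0, 1.5, 100000)

# Stand-in detection statistic: "K" proportional to mass plus noise
# (the coefficient and noise level are made up for this illustration):
K_obs_toy = 0.5 * M_pl + rng.normal(0.0, 0.3, M_pl.size)

# Keep only the "detections" with K > 1 m/s:
detected = M_pl[K_obs_toy > 1.0]

# The mean mass of the detected sample is biased high relative to the truth:
print(M_pl.mean(), detected.mean())
```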

Correctly accounting for detection bias involves an integral in the forward model that is annoying/difficult to compute. One can do this numerically within a hierarchical Bayesian model like those here, but there are also other selection effects to consider, if one wanted to publish an exoplanet mass distribution as a scientific result. The following make writing down a likelihood like those we have evaluated in our STAN models much more difficult:

• A cut on the brightness of the star (stars that are too dim will not achieve the spectral signal-to-noise ratio needed to extract extremely precise radial velocities of the planet)
• A cut on $v \sin i$ of the star (quickly rotating stars have broad lines which make it difficult to get very precise RVs of any companions)
• A cut on $\log(R'_{HK})$ of the star as a measurement of the activity of the star (the more active the star, the noisier the RVs, and the harder it is to find very low-mass planets)
• And the thing that makes this really intractable in real life: prioritizing which targets to observe given previous RV measurements (many people increase their observing cadence for "promising" targets - i.e. those that exhibit some sort of low-amplitude RV variability early on in the observing campaign)

What to do?! All of these effects make writing down a likelihood very difficult! Stay tuned for the next lecture on Approximate Bayesian Computation.