Bayesian inference

Nate Pope

Oct 28, 2016

Goals

  1. How does Bayesian inference work?

  2. Computation using Stan and rstan

  3. Diagnostics/summary using shinystan

Statistical inference in a nutshell

  • Data = process + noise

  • Statistical inference requires:

    1. A model for the process (deterministic model)

    2. A model for the noise (stochastic model)

“All models are wrong, but some are useful” - George Box

Maximum likelihood in a nutshell

Notation:

\(Y\) is data, \(\Theta\) are parameters of model

Likelihood:

Probability of data given parameters, symbolically \(p(Y|\Theta)\)

Maximum likelihood:
  1. Based on model, write down equation for \(p(Y|\Theta)\)

  2. Find parameter values that maximize \(p(Y|\Theta)\)

  3. These values are the “maximum likelihood estimate”

Maximum likelihood (examp.)

A tiny simulated dataset (line 31):

# simulate some data
set.seed(101)
fake_data <- rnorm(n = 3, mean = 2, sd = 2)
fake_data # look at it
## [1] 1.3479270 3.1049237 0.6501123

Model:

\(Y\) is normally distributed, with mean \(\mu\) and sd \(\sigma\)

Notation:

\[p(Y|\Theta = \{\mu, \sigma\}) = \prod_i (2\pi \sigma^{2})^{-\frac{1}{2}} e^{-0.5 \frac{(y_i-\mu)^2}{\sigma^2}}\]

Maximum likelihood (examp.)

A function to implement the likelihood (line 40):

# a function that returns the likelihood 
# for given values of mu and sigma
likelihood <- function(mu, sigma, data_values){
    prod( dnorm(data_values, mean = mu, sd = sigma) )
}

# for example ...
likelihood(mu = 0.5, sigma = 0.3, data_values = fake_data)
## [1] 1.622847e-18

Maximum likelihood (examp.)

Visualized across a grid of values of \(\mu\) and \(\sigma\):
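
A minimal sketch (not from the original slides) of how this grid could be computed and plotted in R, assuming the likelihood() function and fake_data defined above:

# evaluate the likelihood over a grid of candidate parameter values
mu_grid    <- seq(-3, 5, length.out = 100)
sigma_grid <- seq(0.1, 5, length.out = 100)

# matrix of likelihood values: rows index mu, columns index sigma
like_surface <- outer(mu_grid, sigma_grid,
                      Vectorize(function(m, s) likelihood(m, s, fake_data)))

# draw the surface; its peak is the maximum likelihood estimate
image(mu_grid, sigma_grid, like_surface, xlab = "mu", ylab = "sigma")
contour(mu_grid, sigma_grid, like_surface, add = TRUE)

The same pattern works for the prior() and posterior() functions defined later in these slides.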

Bayesian inference in a nutshell

Goal:

Probabilities of possible parameter values (instead of “single best estimate”)

Posterior distribution:

Probability of the parameters given the data, \(p(\Theta|Y)\)

Bayesian inference:
  1. Equation for likelihood

  2. Equation for prior probability of parameter values

  3. Use Bayes’ rule to calculate \(p(\Theta|Y)\)

Bayesian inference

Bayes’ rule:

A tool to calculate the posterior:

\[p(\Theta|Y) = \dfrac{p(Y|\Theta)p(\Theta)}{p(Y)}\]

  • \(p(Y|\Theta)\) is called the likelihood

  • \(p(\Theta)\) is called the prior

  • \(p(Y)\) is called the marginal likelihood

Bayesian inference

Bayes’ rule:

A tool to calculate the posterior:

\[p(\Theta|Y) \propto p(Y|\Theta)p(\Theta)\]

  • Only the numerator is a function of the parameters!

  • The posterior is proportional to the likelihood weighted by the prior

Bayesian inference (examp.)

In the “normal sample” likelihood, the parameters are \(\mu\) and \(\sigma\).

Priors:
  • Prior on \(\mu\) is normal (mean = -1, sd = 1)

  • Prior on \(\sigma\) is half-normal (sd = 1)

Notation:

\[p(\Theta = \{\mu, \sigma\}) \propto \exp\{ -\frac{(\mu + 1)^2}{2} \} \times \exp\{ -\frac{\sigma^2}{2} \}, \quad \sigma \ge 0\]

Bayesian inference

Function to calculate these priors (line 75):

# a function that returns the prior for given values of mu and sigma
prior <- function(mu, sigma){
    # normal prior on mu, half-normal prior on sigma
    dnorm(mu, mean = -1, sd = 1) * 
        2*dnorm(abs(sigma), mean = 0, sd = 1) 
}

# for example ...
prior(mu = 0.5, sigma = 0.3)
## [1] 0.09879287

Bayesian inference

Across the same grid of values of \(\mu\) and \(\sigma\):

Bayesian inference

Numerator of Bayes’ rule:
  • Proportional to posterior distribution

  • Has the same “shape” as posterior

  • Also called: “target”, “unnormalized posterior”

Notation (model from before):

\[p(\Theta) p(Y|\Theta) \propto \exp\{ -\frac{(\mu + 1)^2}{2} \} \times \exp\{ -\frac{\sigma^2}{2} \} \times\\ \prod_i (2\pi \sigma^{2})^{-\frac{1}{2}} e^{-0.5 \frac{(y_i-\mu)^2}{\sigma^2}}\]

Bayesian inference

Function to calculate (unnormalized) posterior (line 105):

# a function that computes the unnormalized posterior
# given the functions likelihood(), prior() defined above
posterior <- function(mu, sigma, data_values){
    likelihood(mu, sigma, data_values) * prior(mu, sigma)
}

# for example ...
posterior(0, 0.5, fake_data)
## [1] 4.15249e-12

Example 1

Prior, unnormalized posterior, likelihood:

Exercise 1

Bayesian inference

Denominator of Bayes’ rule:
  • The ‘sum’ of the unnormalized posterior across all possible parameter values

  • Makes the posterior sum (integrate) to 1

  • Called the marginal likelihood

Notation:

\[p(Y) = \int p(Y|\Theta) p(\Theta) d\Theta\]
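
For this two-parameter model the integral can be approximated numerically; a rough sketch (not from the original slides), reusing the posterior() function, fake_data, and the same kind of grid as before:

# approximate p(Y) with a Riemann sum over a grid of (mu, sigma) values
# (only feasible here because there are just two parameters)
mu_grid    <- seq(-3, 5, length.out = 100)
sigma_grid <- seq(0.1, 5, length.out = 100)
cell_area  <- diff(mu_grid[1:2]) * diff(sigma_grid[1:2])

unnorm_post <- outer(mu_grid, sigma_grid,
                     Vectorize(function(m, s) posterior(m, s, fake_data)))
marginal_likelihood <- sum(unnorm_post) * cell_area

# dividing by this constant makes the surface integrate (sum) to 1
norm_post <- unnorm_post / marginal_likelihood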

Bayesian inference

Conclusions:
  1. Bayes’ rule is used to calculate posterior from likelihood and prior

  2. The likelihood and prior control the shape of the posterior

  3. The marginal likelihood forces posterior to sum (integrate) to 1

  4. As amount of data increases, likelihood dominates posterior

Why use Bayesian inference?

Three stereotypes:

(1) Bayesians:
  • Believe that priors are useful, principled

  • Like the simplicity, rigor of Bayesian interpretation

(2) “Get-’er-done”-ers:
  • Like Bayesian computational methods

  • Bayes generates more robust predictions, etc.

(3) Babesians:
  • Think that “Bayesian” sounds sexy and mysterious

Spectrum of software

Problem:

Marginal likelihood (thus posterior) hard to calculate

Markov chain Monte Carlo (MCMC):

Approximate posterior with simulation. Implementations:

  • Specific models: easy to use, efficiency high, flexibility low (e.g. MCMCglmm, BayesLogit)
  • Modelling language: moderate difficulty, efficiency usually high, flexibility high (e.g. Stan, JAGS/BUGS, Nimble)
  • Do-it-yourself: hard to do, efficiency depends, flexibility limitless
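
For intuition about the “do-it-yourself” option, here is a bare-bones random-walk Metropolis sampler, one of many possible MCMC algorithms (a sketch, not part of the original slides), reusing the posterior() function and fake_data from earlier:

# a minimal random-walk Metropolis sampler (illustrative, not efficient;
# real implementations work on the log scale to avoid numerical underflow)
set.seed(1)
n_iter  <- 5000
draws   <- matrix(NA, nrow = n_iter, ncol = 2,
                  dimnames = list(NULL, c("mu", "sigma")))
current <- c(mu = 0, sigma = 1) # starting values

for (i in 1:n_iter) {
    # propose a small random step away from the current values
    proposal <- current + rnorm(2, mean = 0, sd = 0.3)
    # accept with probability given by the ratio of unnormalized posteriors;
    # proposals with sigma <= 0 have zero posterior and are always rejected
    if (proposal["sigma"] > 0) {
        ratio <- posterior(proposal["mu"], proposal["sigma"], fake_data) /
                 posterior(current["mu"], current["sigma"], fake_data)
        if (runif(1) < ratio) current <- proposal
    }
    draws[i, ] <- current
}

# the retained draws approximate the posterior distribution
hist(draws[, "mu"], main = "posterior draws of mu", xlab = "mu")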

Stan

Possible Stan workflow:
  1. Write model in Stan’s language (can do within R)

  2. Assemble data/inputs into list (in R)

  3. Feed to Stan with function stan()

  4. Look at output with launch_shinystan()

Stan

Syntax:

Language is mash-up of R and C++ syntax:

  • Lines must end with a semi-colon ; (C++)

  • All variables/parameters must have defined types/dimensions (C++)

  • Indexing starts at 1 (R)

Stan

Syntax:

Language is mash-up of R and C++ syntax:

  • The “for-loop” syntax is for(iterator in range) { do stuff } (R)

  • Supports “incrementing operators”: for example, the operator left_side += right_side adds right_side to left_side (C++)

  • Lines are commented out by // (C++)

Stan

A Stan model is defined in strict blocks.

block_name {
    // do stuff here
}
Essential blocks:
  1. data {}

  2. parameters {}

  3. model {}

data block

Purpose:

For every input, declare:

  1. type

  2. dimension (if needed)

  3. boundaries (optional)

  4. name

Type declaration syntax:
type<lower=..., upper=...>[type dimension] name[array dimension];

data block

Type declaration examples:
int N; // integer called N

int N[10]; // ten integers

int<lower=0> N[10]; // ten non-negative integers

vector[10] Y; // a vector of reals of length 10

vector[10] Y[5]; // five vectors of reals of length 10

parameters block

Purpose:

For every unknown parameter, declare:

  1. type

  2. dimension (if needed)

  3. boundaries (optional)

  4. name

Example:

An unknown real called sigma with a lower bound of 0:

real<lower=0> sigma;

model block

Purpose:

Calculate the log “target”: the log numerator of Bayes’ rule.

Structure:
  • The “programmy” part of the model code

  • Defines how the data and parameters are related, with:

    1. deterministic functions

    2. probability distributions

  • Must sum the logs of every term in the likelihood and prior

model block

Important syntax:
  • target is a name reserved for the log numerator of Bayes’ rule

  • target equals 0 at the top of the model block

  • Add to target with the syntax: target += something

  • Calculate log probability densities with the syntax distribution_lpdf( variable | parameters ), for example normal_lpdf(y | mu, sigma)

  • (This will be much more clear with an example.)
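
A rough sketch (not from the original slides) of how the “normal sample” model from earlier could be written with this syntax, stored as an R string so it can later be passed to stan():

# the "normal sample" model in Stan's language, kept as an R string (a sketch)
normal_sample_code <- "
data {
    int<lower=1> N;       // number of observations
    vector[N] y;          // the data
}
parameters {
    real mu;              // unknown mean
    real<lower=0> sigma;  // unknown sd, constrained to be positive
}
model {
    target += normal_lpdf(mu | -1, 1);     // prior on mu
    target += normal_lpdf(sigma | 0, 1);   // half-normal prior (sigma is bounded at 0)
    target += normal_lpdf(y | mu, sigma);  // likelihood, summed over all observations
}
"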

Stan (example model)

A regression problem with data mtcars

# we'll use the mtcars data
library(magrittr) # provides the %>% pipe used below
data(mtcars)
mtcars %>% head
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Stan (example model)

# regression of miles per gallon against weight
library(ggplot2) # provides ggplot() and friends
mtcars %>% ggplot(aes(x=wt,y=mpg)) + 
    geom_point(size=4) + geom_smooth(method="lm", se=FALSE) +
    theme_minimal()

Stan (example model)

Definitions:

Let \(y\) be mtcars$mpg. Let \(x\) be mtcars$wt. Let \(\alpha, \beta\) be the intercept and slope. Let \(\sigma\) be the residual sd. Let \(N\) be the number of data points.

Model:

For the \(i\)th data point, \(y_i\) is normally distributed with mean \(\alpha + \beta x_i\) and sd \(\sigma\).

Notation:

\[ y_i \sim \mathrm{Normal}( \alpha + \beta x_i, \sigma )\]

Stan (example model)

Let’s move to R …
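
A rough sketch of what that R session might look like, following the workflow from earlier (not the original live-coded script; names such as mpg_model, stan_data, and fit are illustrative):

library(rstan)

# 1. write the model in Stan's language (here, as an R string)
mpg_model <- "
data {
    int<lower=1> N;  // number of data points
    vector[N] y;     // mpg
    vector[N] x;     // wt
}
parameters {
    real alpha;           // intercept
    real beta;            // slope
    real<lower=0> sigma;  // residual sd
}
model {
    // likelihood; with no explicit priors, Stan uses flat priors
    target += normal_lpdf(y | alpha + beta * x, sigma);
}
"

# 2. assemble the data/inputs into a list
stan_data <- list(N = nrow(mtcars), y = mtcars$mpg, x = mtcars$wt)

# 3. feed everything to Stan
fit <- stan(model_code = mpg_model, data = stan_data, chains = 4, iter = 2000)

# 4. look at the output
# library(shinystan); launch_shinystan(fit)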