How does Bayesian inference work?
Computation using Stan and rstan
Diagnostics/summary using shinystan
Data = process + noise
Statistical inference requires:
A model for the process (deterministic model)
A model for the noise (stochastic model)
“All models are wrong, but some are useful” - George Box
\(Y\) is data, \(\Theta\) are parameters of model
Probability of data given parameters, symbolically \(p(Y|\Theta)\)
Based on model, write down equation for \(p(Y|\Theta)\)
Find parameter values that maximize \(p(Y|\Theta)\)
These values are the “maximum likelihood estimate”
A tiny simulated dataset:
# simulate some data
set.seed(101)
fake_data <- rnorm(n = 3, mean = 2, sd = 2)
fake_data # look at it
## [1] 1.3479270 3.1049237 0.6501123
\(Y\) is normally distributed, with mean \(\mu\) and sd \(\sigma\)
\[p(Y|\Theta = \{\mu, \sigma\}) = \prod_i (2\pi \sigma^{2})^{-\frac{1}{2}} e^{-0.5 \frac{(y_i-\mu)^2}{\sigma^2}}\]
A function to implement the likelihood:
# a function that returns the likelihood
# for given values of mu and sigma
likelihood <- function(mu, sigma, data_values){
prod( dnorm(data_values, mean = mu, sd = sigma) )
}
# for example ...
likelihood(mu = 0.5, sigma = 0.3, data_values = fake_data)
## [1] 1.622847e-18
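To make “find parameter values that maximize \(p(Y|\Theta)\)” concrete, here is a minimal sketch that locates the maximum likelihood estimate numerically (neg_log_lik, the starting values, and the bounds are illustrative choices, not part of the original example):
# sketch: minimize the negative log-likelihood to find the MLE of mu and sigma
neg_log_lik <- function(par, data_values){
  -sum(dnorm(data_values, mean = par[1], sd = par[2], log = TRUE))
}
mle <- optim(par = c(0, 1), fn = neg_log_lik, data_values = fake_data,
             method = "L-BFGS-B", lower = c(-Inf, 1e-6))
mle$par # numerical estimates of mu and sigma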
Visualized across a grid of values of \(\mu\) and \(\sigma\):
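One way such a grid could be computed and plotted (ranges and resolution are arbitrary illustration choices):
# sketch: evaluate the likelihood over a grid of mu and sigma values
mu_grid    <- seq(-2, 4, length.out = 100)
sigma_grid <- seq(0.1, 4, length.out = 100)
lik_grid <- outer(mu_grid, sigma_grid,
                  Vectorize(function(m, s) likelihood(m, s, fake_data)))
image(mu_grid, sigma_grid, lik_grid, xlab = "mu", ylab = "sigma")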
Probabilities of possible parameter values (instead of “single best estimate”)
Probability of the parameters given the data, \(p(\Theta|Y)\)
Equation for likelihood
Equation for prior probability of parameter values
Use Bayes’ rule to calculate \(p(\Theta|Y)\)
A tool to calculate the posterior:
\[p(\Theta|Y) = \dfrac{p(Y|\Theta)p(\Theta)}{p(Y)}\]
\(p(Y|\Theta)\) is called the likelihood
\(p(\Theta)\) is called the prior
\(p(Y)\) is called the marginal likelihood
A tool to calculate the posterior:
\[p(\Theta|Y) \propto p(Y|\Theta)p(\Theta)\]
Only the numerator is a function of the parameters!
The posterior is proportional to the likelihood weighted by the prior
In “normal sample” likelihood, parameters are \(\mu\) and \(\sigma\).
Prior on \(\mu\) is normal (mean = -1, sd = 1)
Prior on \(\sigma\) is half-normal (sd = 1)
\[p(\Theta = \{\mu, \sigma\}) \propto \exp\{ -\frac{(\mu + 1)^2}{2} \} \times \exp\{ -\frac{\sigma^2}{2} \}, \quad \sigma \ge 0\]
Function to calculate these priors:
# a function that returns the prior for given values of mu and sigma
prior <- function(mu, sigma){
# normal prior on mu, half-normal prior on sigma
dnorm(mu, mean = -1, sd = 1) *
2*dnorm(abs(sigma), mean = 0, sd = 1)
}
# for example ...
prior(mu = 0.5, sigma = 0.3)
## [1] 0.09879287
Across (same) grid of values of \(\mu\) and \(\sigma\):
Proportional to posterior distribution
Has the same “shape” as posterior
Also called: “target”, “unnormalized posterior”
\[p(\Theta) p(Y|\Theta) \propto \exp\{ -\frac{(\mu + 1)^2}{2} \} \times \exp\{ -\frac{\sigma^2}{2} \} \times\\ \prod_i (2\pi \sigma^{2})^{-\frac{1}{2}} e^{-0.5 \frac{(y_i-\mu)^2}{\sigma^2}}\]
Function to calculate the (unnormalized) posterior:
# a function that computes the unnormalized posterior
# given the functions likelihood(), prior() defined above
posterior <- function(mu, sigma, data_values){
likelihood(mu, sigma, data_values) * prior(mu, sigma)
}
# for example ...
posterior(0, 0.5, fake_data)
## [1] 4.15249e-12
Prior, unnormalized posterior, likelihood:
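A sketch of how this comparison could be computed, reusing mu_grid, sigma_grid, and lik_grid from the earlier grid sketch (plotting choices are illustrative):
# sketch: prior and unnormalized posterior on the same grid as the likelihood
prior_grid <- outer(mu_grid, sigma_grid, Vectorize(prior))
post_grid  <- outer(mu_grid, sigma_grid,
                    Vectorize(function(m, s) posterior(m, s, fake_data)))
par(mfrow = c(1, 3))
image(mu_grid, sigma_grid, prior_grid, main = "prior", xlab = "mu", ylab = "sigma")
image(mu_grid, sigma_grid, post_grid, main = "unnormalized posterior", xlab = "mu", ylab = "sigma")
image(mu_grid, sigma_grid, lik_grid, main = "likelihood", xlab = "mu", ylab = "sigma")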
The ‘sum’ of unnormalized posterior across possible parameter values
Makes the posterior sum (integrate) to 1
Called the marginal likelihood
\[p(Y) = \int p(Y|\Theta) p(\Theta) d\Theta\]
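In this two-parameter example the integral can be approximated directly on the grid; a crude Riemann-sum sketch (assuming post_grid from the grid sketch above):
# sketch: approximate p(Y) by summing the unnormalized posterior over the grid
d_mu    <- diff(mu_grid)[1]
d_sigma <- diff(sigma_grid)[1]
p_Y <- sum(post_grid) * d_mu * d_sigma
# dividing by p_Y rescales the grid values so they integrate to (roughly) 1
normalized_post_grid <- post_grid / p_Y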
Bayes’ rule is used to calculate posterior from likelihood and prior
The likelihood and prior control the shape of the posterior
The marginal likelihood forces posterior to sum (integrate) to 1
As amount of data increases, likelihood dominates posterior
Three stereotypes:
Believe that priors are useful, principled
Like the simplicity, rigor of Bayesian interpretation
Like Bayesian computational methods (Bayes generates more robust predictions, etc.)
Marginal likelihood (and thus the posterior) is hard to calculate
Approximate the posterior with simulation instead. Implementation used here: Stan (via rstan)
Write model in Stan’s language (can do within R)
Assemble data/inputs into list (in R)
Feed to Stan with function stan()
Look at output with launch_shinystan()
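As a preview, a minimal sketch of these four steps for the simple normal model from earlier (the Stan syntax inside model_code is explained on the next slides; priors match those used above, and all sampler settings are left at rstan defaults):
library(rstan)
library(shinystan)

# 1. write the model in Stan's language (here, as a string inside R)
model_code <- "
data {
  int<lower=0> N;
  vector[N] Y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  target += normal_lpdf(mu | -1, 1);    // normal prior on mu
  target += normal_lpdf(sigma | 0, 1);  // half-normal prior on sigma (sigma > 0)
  target += normal_lpdf(Y | mu, sigma); // likelihood
}
"

# 2. assemble data/inputs into a list
stan_data <- list(N = length(fake_data), Y = fake_data)

# 3. feed to Stan
fit <- stan(model_code = model_code, data = stan_data)

# 4. look at the output
launch_shinystan(fit)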
Language is mash-up of R and C++ syntax:
Lines must end with a semi-colon ; (C++)
All variables/parameters must have defined types/dimensions (C++)
Indexing starts at 1 (R)
Language is mash-up of R and C++ syntax:
The “for-loop” syntax is for(iterator in range) { do stuff } (R)
Supports “incrementing operators”: for example, the operator left_side += right_side adds right_side to left_side (C++)
Lines are commented out by // (C++)
A Stan model is defined in strict blocks.
block_name {
// do stuff here
}
data {}
parameters {}
model {}
For every input, declare:
type
dimension (if needed)
boundaries (optional)
name
type<lower, upper>[type dimension] name[dimension];
data block
int N; // integer called N
int N[10]; // ten integers
int<lower=0> N[10]; // ten non-negative integers
vector[10] Y; // a vector of reals of length 10
vector[10] Y[5]; // five vectors of reals of length 10
For every unknown parameter, declare:
type
dimension (if needed)
boundaries (optional)
name
An unknown real called sigma with a lower bound of 0:
real<lower=0> sigma;
Calculate the log “target”: the log numerator of Bayes’ rule.
The “programmy” part of the model code
Defines how the data and parameters are related, with:
deterministic functions
probability distributions
Must sum the logs of every term in the likelihood and prior
target is a name reserved for the log numerator of Bayes’ rule
target equals 0 at the top of the model block
Add to target with the syntax: target += something
Calculate log probability densities with syntax distribution_lpdf( variable | parameters ), e.g. normal_lpdf(y | mu, sigma)
(This will be much more clear with an example.)
A regression problem with data mtcars
# we'll use the mtcars data
data(mtcars)
mtcars %>% head
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# regression of miles per gallon against weight
mtcars %>% ggplot(aes(x=wt,y=mpg)) +
geom_point(size=4) + geom_smooth(method="lm", se=F) +
theme_minimal()
Let \(y\) be mtcars$mpg. Let \(x\) be mtcars$wt. Let \(\alpha, \beta\) be the intercept and slope. Let \(\sigma\) be the residual sd. Let \(N\) be the number of data points.
For the \(i\)th data point, \(y_i\) is normally distributed with mean \(\alpha + \beta x_i\) and sd \(\sigma\).
\[ y_i \sim \mathrm{Normal}( \alpha + \beta x_i, \sigma )\]
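Before moving over, a minimal sketch of what this model could look like in Stan, plus the matching rstan call (reg_code, reg_data, and reg_fit are illustrative names; no priors on \(\alpha\), \(\beta\), \(\sigma\) were specified above, so none are written here and Stan falls back on flat priors over the declared ranges):
# sketch: the regression model in Stan and the matching R call
reg_code <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  target += normal_lpdf(y | alpha + beta * x, sigma); // y_i ~ Normal(alpha + beta*x_i, sigma)
}
"
reg_data <- list(N = nrow(mtcars), x = mtcars$wt, y = mtcars$mpg)
reg_fit  <- stan(model_code = reg_code, data = reg_data)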
Let’s move to R …