How does Bayesian inference work?
Computation using Stan and rstan
Diagnostics/summary using shinystan
Data = process + noise
Statistical inference requires:
A model for the process (deterministic model)
A model for the noise (stochastic model)
“All models are wrong, but some are useful” - George Box
\(Y\) is data, \(\Theta\) are parameters of model
Probability of data given parameters, symbolically \(p(Y|\Theta)\)
Based on model, write down equation for \(p(Y|\Theta)\)
Find parameter values that maximize \(p(Y|\Theta)\)
These values are the “maximum likelihood estimate”
A tiny simulated dataset:
# simulate some data
set.seed(101)
fake_data <- rnorm(n = 3, mean = 2, sd = 2)
fake_data # look at it
## [1] 1.3479270 3.1049237 0.6501123
\(Y\) is normally distributed, with mean \(\mu\) and sd \(\sigma\)
\[p(Y|\Theta = \{\mu, \sigma\}) = \prod_i (2\pi \sigma^{2})^{-\frac{1}{2}} e^{-0.5 \frac{(y_i-\mu)^2}{\sigma^2}}\]
A function to implement the likelihood:
# a function that returns the likelihood
# for given values of mu and sigma
likelihood <- function(mu, sigma, data_values){
prod( dnorm(data_values, mean = mu, sd = sigma) )
}
# for example ...
likelihood(mu = 0.5, sigma = 0.3, data_values = fake_data)
## [1] 1.622847e-18
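To make “find parameter values that maximize \(p(Y|\Theta)\)” concrete, here is a minimal sketch that locates the maximum likelihood estimate numerically (neg_log_lik, the starting values, and the bounds are illustrative choices, not part of the original example):
# sketch: minimize the negative log-likelihood to find the MLE of mu and sigma
neg_log_lik <- function(par, data_values){
  -sum(dnorm(data_values, mean = par[1], sd = par[2], log = TRUE))
}
mle <- optim(par = c(0, 1), fn = neg_log_lik, data_values = fake_data,
             method = "L-BFGS-B", lower = c(-Inf, 1e-6))
mle$par # numerical estimates of mu and sigma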
Visualized across a grid of values of \(\mu\) and \(\sigma\):
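One way such a grid could be computed and plotted (ranges and resolution are arbitrary illustration choices):
# sketch: evaluate the likelihood over a grid of mu and sigma values
mu_grid    <- seq(-2, 4, length.out = 100)
sigma_grid <- seq(0.1, 4, length.out = 100)
lik_grid <- outer(mu_grid, sigma_grid,
                  Vectorize(function(m, s) likelihood(m, s, fake_data)))
image(mu_grid, sigma_grid, lik_grid, xlab = "mu", ylab = "sigma")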
Probabilities of possible parameter values (instead of “single best estimate”)
Probability of the parameters given the data, \(p(\Theta|Y)\)
Equation for likelihood
Equation for prior probability of parameter values
Use Bayes’ rule to calculate \(p(\Theta|Y)\)
A tool to calculate the posterior:
\[p(\Theta|Y) = \dfrac{p(Y|\Theta)p(\Theta)}{p(Y)}\]
\(p(Y|\Theta)\) is called the likelihood
\(p(\Theta)\) is called the prior
\(p(Y)\) is called the marginal likelihood
A tool to calculate the posterior:
\[p(\Theta|Y) \propto p(Y|\Theta)p(\Theta)\]
Only the numerator is a function of the parameters!
The posterior is proportional to the likelihood weighted by the prior
In “normal sample” likelihood, parameters are \(\mu\) and \(\sigma\).
Prior on \(\mu\) is normal (mean = -1, sd = 1)
Prior on \(\sigma\) is half-normal (sd = 1)
\[p(\Theta = \{\mu, \sigma\}) \propto \exp\{ -\frac{(\mu + 1)^2}{2} \} \times \exp\{ -\frac{\sigma^2}{2} \}, \quad \sigma \ge 0\]
Function to calculate these priors:
# a function that returns the prior for given values of mu and sigma
prior <- function(mu, sigma){
# normal prior on mu, half-normal prior on sigma
dnorm(mu, mean = -1, sd = 1) *
2*dnorm(abs(sigma), mean = 0, sd = 1)
}
# for example ...
prior(mu = 0.5, sigma = 0.3)
## [1] 0.09879287
Across (same) grid of values of \(\mu\) and \(\sigma\):
Proportional to posterior distribution
Has the same “shape” as posterior
Also called: “target”, “unnormalized posterior”
\[p(\Theta) p(Y|\Theta) \propto \exp\{ -\frac{(\mu + 1)^2}{2} \} \times \exp\{ -\frac{\sigma^2}{2} \} \times\\ \prod_i (2\pi \sigma^{2})^{-\frac{1}{2}} e^{-0.5 \frac{(y_i-\mu)^2}{\sigma^2}}\]
Function to calculate the (unnormalized) posterior:
# a function that computes the unnormalized posterior
# given the functions likelihood(), prior() defined above
posterior <- function(mu, sigma, data_values){
likelihood(mu, sigma, data_values) * prior(mu, sigma)
}
# for example ...
posterior(0, 0.5, fake_data)
## [1] 4.15249e-12
Prior, unnormalized posterior, likelihood:
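A sketch of how this comparison could be computed, reusing mu_grid, sigma_grid, and lik_grid from the earlier grid sketch (plotting choices are illustrative):
# sketch: prior and unnormalized posterior on the same grid as the likelihood
prior_grid <- outer(mu_grid, sigma_grid, Vectorize(prior))
post_grid  <- outer(mu_grid, sigma_grid,
                    Vectorize(function(m, s) posterior(m, s, fake_data)))
par(mfrow = c(1, 3))
image(mu_grid, sigma_grid, prior_grid, main = "prior", xlab = "mu", ylab = "sigma")
image(mu_grid, sigma_grid, post_grid, main = "unnormalized posterior", xlab = "mu", ylab = "sigma")
image(mu_grid, sigma_grid, lik_grid, main = "likelihood", xlab = "mu", ylab = "sigma")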
The ‘sum’ of unnormalized posterior across possible parameter values
Makes the posterior sum (integrate) to 1
Called the marginal likelihood
\[p(Y) = \int p(Y|\Theta) p(\Theta) d\Theta\]
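In this two-parameter example the integral can be approximated directly on the grid; a crude Riemann-sum sketch (assuming post_grid from the grid sketch above):
# sketch: approximate p(Y) by summing the unnormalized posterior over the grid
d_mu    <- diff(mu_grid)[1]
d_sigma <- diff(sigma_grid)[1]
p_Y <- sum(post_grid) * d_mu * d_sigma
# dividing by p_Y rescales the grid values so they integrate to (roughly) 1
normalized_post_grid <- post_grid / p_Y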
Bayes’ rule is used to calculate posterior from likelihood and prior
The likelihood and prior control the shape of the posterior
The marginal likelihood forces posterior to sum (integrate) to 1
As amount of data increases, likelihood dominates posterior
Three stereotypes:
Believe that priors are useful, principled
Like the simplicity, rigor of Bayesian interpretation
Like Bayesian computational methods (Bayes generates more robust predictions, etc.)
Marginal likelihood (and thus the posterior) is hard to calculate
Approximate the posterior with simulation instead. Implementation used here: Stan (via rstan)
Write model in Stan’s language (can do within R)
Assemble data/inputs into list (in R)
Feed to Stan with function stan()
Look at output with launch_shinystan()
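As a preview, a minimal sketch of these four steps for the simple normal model from earlier (the Stan syntax inside model_code is explained on the next slides; priors match those used above, and all sampler settings are left at rstan defaults):
library(rstan)
library(shinystan)

# 1. write the model in Stan's language (here, as a string inside R)
model_code <- "
data {
  int<lower=0> N;
  vector[N] Y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  target += normal_lpdf(mu | -1, 1);    // normal prior on mu
  target += normal_lpdf(sigma | 0, 1);  // half-normal prior on sigma (sigma > 0)
  target += normal_lpdf(Y | mu, sigma); // likelihood
}
"

# 2. assemble data/inputs into a list
stan_data <- list(N = length(fake_data), Y = fake_data)

# 3. feed to Stan
fit <- stan(model_code = model_code, data = stan_data)

# 4. look at the output
launch_shinystan(fit)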
Language is mash-up of R and C++ syntax:
Lines must end with a semi-colon ; (C++)
All variables/parameters must have defined types/dimensions (C++)
Indexing starts at 1 (R)
Language is mash-up of R and C++ syntax:
The “for-loop” syntax is for(iterator in range) { do stuff } (R)
Supports “incrementing operators”: for example, the operator left_side += right_side adds right_side to left_side (C++)
Lines are commented out by // (C++)
A Stan model is defined in strict blocks.
block_name {
// do stuff here
}
data {}
parameters {}
model {}
For every input, declare:
type
dimension (if needed)
boundaries (optional)
name
type<lower, upper>[type dimension] name[dimension];
data block
int N; // integer called N
int N[10]; // ten integers
int<lower=0> N[10]; // ten non-negative integers
vector[10] Y; // a vector of reals of length 10
vector[10] Y[5]; // five vectors of reals of length 10
For every unknown parameter, declare:
type
dimension (if needed)
boundaries (optional)
name
An unknown real called sigma with a lower bound of 0:
real<lower=0> sigma;
Calculate the log “target”: the log numerator of Bayes’ rule.
The “programmy” part of the model code
Defines how the data and parameters are related, with:
deterministic functions
probability distributions
Must sum the logs of every term in the likelihood and prior
target is a name reserved for the log numerator of Bayes’ rule
target equals 0 at the top of the model block
Add to target with the syntax: target += something
Calculate log probability densities with syntax distribution_lpdf( variable | parameters ), e.g. normal_lpdf(y | mu, sigma)
(This will be much more clear with an example.)
A regression problem with data mtcars
# we'll use the mtcars data
data(mtcars)
mtcars %>% head
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# regression of miles per gallon against weight
mtcars %>% ggplot(aes(x=wt,y=mpg)) +
geom_point(size=4) + geom_smooth(method="lm", se=F) +
theme_minimal()
Let \(y\) be mtcars$mpg. Let \(x\) be mtcars$wt. Let \(\alpha, \beta\) be the intercept and slope. Let \(\sigma\) be the residual sd. Let \(N\) be the number of data points.
For the \(i\)th data point, \(y_i\) is normally distributed with mean \(\alpha + \beta x_i\) and sd \(\sigma\).
\[ y_i \sim \mathrm{Normal}( \alpha + \beta x_i, \sigma )\]
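Before moving over, a minimal sketch of what this model could look like in Stan, plus the matching rstan call (reg_code, reg_data, and reg_fit are illustrative names; no priors on \(\alpha\), \(\beta\), \(\sigma\) were specified above, so none are written here and Stan falls back on flat priors over the declared ranges):
# sketch: the regression model in Stan and the matching R call
reg_code <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  target += normal_lpdf(y | alpha + beta * x, sigma); // y_i ~ Normal(alpha + beta*x_i, sigma)
}
"
reg_data <- list(N = nrow(mtcars), x = mtcars$wt, y = mtcars$mpg)
reg_fit  <- stan(model_code = reg_code, data = reg_data)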
Let’s move to R …