R makes it easy to work directly with most (if not all) commonly used probability distributions. Accessing these distributions follows the same consistent syntax: `xdist()`, where `x` is the probability function abbreviation and `dist` is the distribution abbreviation (see tables below).
Distribution | Abbreviation |
---|---|
Beta | beta |
Binomial | binom |
Chi-square | chisq |
Exponential | exp |
Gamma | gamma |
Negative Binomial | nbinom |
Normal | norm |
Poisson | pois |
Uniform | unif |
Student's t | t |
Abbreviation | Description |
---|---|
r | draw random values from a distribution |
d | density (likelihood) of a distribution at a certain value |
p | cumulative probability up to a certain value (returns the probability) |
q | inverse cumulative probability (returns the value that has a given cumulative probability) |
In this section you’ll get a feel for how to work with the probability distribution commands in R. We will primarily focus on the `r` and `d` probability commands, but see the R documentation of any command if you’re interested in learning more (e.g. `?pnorm`).
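Although we focus on `r` and `d` here, a minimal sketch of the `p` and `q` commands for the standard normal looks like this (the cutoff values are just illustrative):
# Cumulative probability of observing a value <= 1.96 under N(0, 1)
pnorm(q = 1.96, mean = 0, sd = 1)   # ~0.975
# Inverse: which value has a cumulative probability of 0.975?
qnorm(p = 0.975, mean = 0, sd = 1)  # ~1.96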
library(tidyverse)
library(cowplot)
# We set a random number generating seed,
# so that results are reproducible
set.seed(808)
# Draw 10,000 samples from a N(0, 1) distribution and plot a histogram
rnorm(n = 10000, mean = 0, sd = 1) %>%
  tibble(value = .) %>%
  ggplot(aes(value)) + geom_histogram()
# Under the N(0, 1) distribution, what is the likelihood of obtaining x = 0.25?
dnorm(x = 0.25, mean = 0, sd = 1)
## [1] 0.3866681
# You can also obtain the likelihood for multiple samples at once
samps <- c(-0.5, -1.5, 0.01, 0.8, 2.5)
dnorm(samps, mean = 0, sd = 1)
## [1] 0.3520653 0.1295176 0.3989223 0.2896916 0.0175283
You’ve figured out that on average you receive 15 emails every day. Assuming your daily email count follows a Poisson distribution, what does your daily email total distribution look like? How likely is it that you get 25 emails in a given day?
# R Code here
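One possible sketch, assuming a Poisson rate of 15 (the sample size and binwidth are arbitrary choices):
# One possible sketch: simulate many days at lambda = 15 and plot the totals
lambda <- 15
rpois(n = 10000, lambda = lambda) %>%
  tibble(value = .) %>%
  ggplot(aes(value)) + geom_histogram(binwidth = 1)
# Likelihood of exactly 25 emails in a day
dpois(x = 25, lambda = lambda)  # ~0.005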
Simulate 5 days of mail under the assumed Poisson distribution. What is the likelihood of getting those daily email totals over 5 consecutive days?
# R Code here
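A sketch of one approach, reusing `rpois()` and `dpois()`; your totals (and their likelihood) will differ from run to run:
# One possible sketch: simulate 5 daily email totals
days <- rpois(n = 5, lambda = 15)
days
# The joint likelihood is the product of the daily likelihoods
prod(dpois(x = days, lambda = 15))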
In this section we are going to calculate the likelihood for data from a linear model. We will use the `modelr` package primarily for the simulated dataset `sim1` that it contains, but the package also has nice functionality useful in various modeling scenarios.
library(modelr)
# We will use the sim1 dataset to understand likelihood
head(sim1)
## # A tibble: 6 × 2
##       x         y
##   <int>     <dbl>
## 1     1  4.199913
## 2     1  7.510634
## 3     1  2.125473
## 4     2  8.988857
## 5     2 10.243105
## 6     2 11.296823
sim1 %>% ggplot(aes(x, y)) + geom_point()
# We can calculate the likelihood of a proposed model this way
slope <- 4  # Proposed slope
yint <- 10  # Proposed y-intercept
prod(dnorm(x = sim1$y, mean = slope * sim1$x + yint, sd = 2))
## [1] 0
# The product of many small densities underflows to 0;
# this is why we usually sum the log-likelihood instead
sum(dnorm(x = sim1$y, mean = slope * sim1$x + yint, sd = 2, log = TRUE))
## [1] -1203.213
You can see we will be calculating the predicted ys from our model many times, so let’s turn that calculation into a function, `lm_preds()`:
lm_preds <- function(slope, yint, x){
  slope * x + yint
}
# Now our log-likelihood calculation looks like this
sum(dnorm(x = sim1$y, mean = lm_preds(slope = slope, yint = yint, x = sim1$x), sd = 2, log = TRUE))
## [1] -1203.213
Great, so now you’ve seen how to actually write a function and how to calculate a likelihood. Now try to make a function that takes the data (xs and ys) and a proposed model (slope and y-intercept), and returns the log-likelihood. Test your function with different proposed models to try to find the maximum likelihood model.
# R Code Here
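If you want a hint, here is one possible sketch; the function name `lm_loglik()` and the default `sd = 2` are my own choices:
# One possible sketch: a log-likelihood function for a proposed linear model
lm_loglik <- function(x, y, slope, yint, sd = 2){
  sum(dnorm(x = y, mean = lm_preds(slope = slope, yint = yint, x = x), sd = sd, log = TRUE))
}
# Compare a few proposed models; higher (less negative) means more likely
lm_loglik(x = sim1$x, y = sim1$y, slope = 4, yint = 10)
lm_loglik(x = sim1$x, y = sim1$y, slope = 2, yint = 4)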
In the previous example, we calculated the likelihood of simulated data provided by the `modelr` package. Sometimes it can be slightly difficult to understand the limits of likelihood without knowing the true data-generating process, so in this section we will learn how to randomly generate data.
# Take a look at the output of lm() on our data
lm(y ~ x, data = sim1) %>% summary()
##
## Call:
## lm(formula = y ~ x, data = sim1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.1469 -1.5197  0.1331  1.4670  4.6516
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.2208     0.8688   4.858 4.09e-05 ***
## x             2.0515     0.1400  14.651 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8805
## F-statistic: 214.7 on 1 and 28 DF,  p-value: 1.173e-14
# Generate some random data from a known linear model
n <- 100
slope <- 5
yint <- 0
sample <- tibble(x = runif(n, min = 0, max = 100)) %>%
  mutate(y = rnorm(n, mean = x * slope + yint, sd = 50))
ggplot(sample, aes(x, y)) + geom_point()
fit <- lm(y ~ x, data = sample)
fit %>% summary()
##
## Call:
## lm(formula = y ~ x, data = sample)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -164.00  -31.52   -3.72   24.79  124.09
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.8799     9.9826  -1.491    0.139
## x             5.3425     0.1777  30.061   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.35 on 98 degrees of freedom
## Multiple R-squared:  0.9022, Adjusted R-squared:  0.9012
## F-statistic: 903.7 on 1 and 98 DF,  p-value: < 2.2e-16
confint(fit, level = 0.95) # 95% confidence interval
##                  2.5 %   97.5 %
## (Intercept) -34.69000 4.930216
## x             4.98985 5.695216
c(slope, yint) # Does the confidence interval capture the true values?
## [1] 5 0
The `lm()` function is able to estimate the slope and intercept of the linear model, but its ability to do so depends on the amount of data and the noisiness of the data. Try copying the code from the example above and change the noisiness and the number of samples in your data to understand the limits of how accurately the MLE recovers the true values.
# R Code Here
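One possible sketch, wrapping the simulation in a function so that the sample size and noise level are easy to vary; the name `sim_fit()` and the settings are my own:
# One possible sketch: simulate and fit in one step
sim_fit <- function(n, noise, slope = 5, yint = 0){
  dat <- tibble(x = runif(n, min = 0, max = 100)) %>%
    mutate(y = rnorm(n, mean = x * slope + yint, sd = noise))
  lm(y ~ x, data = dat)
}
# A small, noisy sample vs. a large, clean one
sim_fit(n = 10, noise = 200) %>% confint()
sim_fit(n = 1000, noise = 10) %>% confint()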
There are two extended exercises; feel free to do either one of them.
Use your log-likelihood function to create likelihood profiles for the slope and y-intercept parameters, as shown in the presentation. Try to use the tidyverse, and especially the `map()` function, to do so.
Create a function that can generate random samples from linear models. Then use the tidyverse, and especially the `map()` function, to fit linear models to samples with varying numbers of data points and noise levels. Plot the 95% CI for the slope estimate as a function of the number of samples, and facet by noise level.
# R Code here
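As a starting hint for the first exercise, here is one possible sketch of a likelihood profile over the slope, reusing the hypothetical `lm_loglik()` sketch from above and holding the y-intercept fixed (the grid and the fixed intercept are arbitrary):
# One possible sketch: profile the log-likelihood over a grid of slopes
profile <- tibble(slope = seq(0, 4, by = 0.05)) %>%
  mutate(loglik = map_dbl(slope, ~ lm_loglik(x = sim1$x, y = sim1$y,
                                             slope = .x, yint = 4)))
profile %>% ggplot(aes(slope, loglik)) + geom_line()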