Probability Functions in R

R makes it easy to work directly with most (if not all) commonly used probability distributions. Accessing these distributions follows one consistent syntax: xdist(), where x is the abbreviation for the probability command (what you want to compute) and dist is the abbreviation for the distribution (see the tables below; a short demonstration follows them).

Common probability distributions

Distribution         Abbreviation
Beta                 beta
Binomial             binom
Chi-square           chisq
Exponential          exp
Gamma                gamma
Negative Binomial    nbinom
Normal               norm
Poisson              pois
Uniform              unif
Student's t          t

Available distribution commands

Abbreviation   Description
r              draw random values from the distribution
d              density (likelihood) of the distribution at a given value
p              cumulative probability up to a given value (returns P(X <= x))
q              inverse cumulative probability (returns the value whose cumulative probability equals a given probability)
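
Combining the two tables: for the normal distribution, the four available commands are rnorm(), dnorm(), pnorm(), and qnorm(). A quick demonstration (the values in the comments are approximate):

rnorm(n = 3, mean = 0, sd = 1)     # r: three random draws from N(0, 1)
dnorm(x = 0, mean = 0, sd = 1)     # d: density at x = 0 (about 0.399)
pnorm(q = 1.96, mean = 0, sd = 1)  # p: P(X <= 1.96) (about 0.975)
qnorm(p = 0.975, mean = 0, sd = 1) # q: value with cumulative probability 0.975 (about 1.96)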

Working with probability commands

In this section you’ll get a feel for how to work with the probability distribution commands in R. We will primarily focus on the r and d probability commands, but see the R documentation of any command if you’re interested in learning more (e.g. ?pnorm).

Example 1

library(tidyverse)
library(cowplot)

# We set the random number generator's seed,
# so that results are reproducible
set.seed(808)

# Draw 10,000 samples from the N(0, 1) distribution and plot a histogram
tibble(value = rnorm(n = 10000, mean = 0, sd = 1)) %>% 
  ggplot(aes(value)) + geom_histogram() 

# Under the N(0, 1) distribution, what is the likelihood (density) of x = 0.25?
dnorm(x = 0.25, mean = 0, sd = 1)
## [1] 0.3866681
# You can also obtain the likelihood for multiple samples at once
samps <- c(-0.5, -1.5, 0.01, 0.8, 2.5)
dnorm(samps, mean = 0, sd = 1)
## [1] 0.3520653 0.1295176 0.3989223 0.2896916 0.0175283

Exercise 1

You’ve figured out that on average you receive 15 emails every day. Assuming your daily email count follows a Poisson distribution, what does the distribution of your daily email totals look like? How likely is it that you receive exactly 25 emails in a given day?

# R Code here
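
One possible approach (a sketch; the column name emails and the use of geom_bar() for discrete counts are illustrative choices):

# Simulate many days of email counts under a Poisson(15) distribution
tibble(emails = rpois(n = 10000, lambda = 15)) %>%
  ggplot(aes(emails)) + geom_bar()

# Probability of receiving exactly 25 emails in a day
dpois(x = 25, lambda = 15)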

Exercise 1 (extended)

Simulate 5 days of mail under the assumed Poisson distribution. What is the likelihood of getting those daily email totals over 5 consecutive days?

# R Code here
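
A sketch of one way to do this, reusing the same Poisson assumption:

# Simulate 5 consecutive days of email totals
five_days <- rpois(n = 5, lambda = 15)
five_days

# Joint likelihood of observing exactly those five totals,
# assuming the days are independent
prod(dpois(x = five_days, lambda = 15))

# The same quantity on the log scale
sum(dpois(x = five_days, lambda = 15, log = TRUE))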

Calculating a Likelihood

In this section we are going to calculate the likelihood of data under a linear model. We will use the modelr package primarily for the simulated dataset sim1 that it contains, though the package also provides functionality that is useful in various modeling scenarios.

Example 2

library(modelr)

# We will use the sim1 dataset to understand likelihood
head(sim1)
## # A tibble: 6 × 2
##       x         y
##   <int>     <dbl>
## 1     1  4.199913
## 2     1  7.510634
## 3     1  2.125473
## 4     2  8.988857
## 5     2 10.243105
## 6     2 11.296823
sim1 %>% ggplot(aes(x, y)) + geom_point()

# We can calculate the likelihood of a proposed model this way
slope <- 4 # Proposed slope
yint <- 10 # Proposed y-intercept
prod(dnorm(x = sim1$y, mean = slope * sim1$x + yint, sd = 2))
## [1] 0
# The product of many small densities underflows to zero,
# which is why we usually sum the log-likelihood instead
sum(dnorm(x = sim1$y, mean = slope * sim1$x + yint, sd = 2, log = TRUE))
## [1] -1203.213
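
The product is not truly zero; it underflows. A log-likelihood of about -1203 corresponds to a product near 10^-522, far below the smallest positive double R can represent:

# The smallest positive normalized double (about 2.2e-308);
# products that fall below this threshold underflow to 0
.Machine$double.xmin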

You can see that we will be calculating the predicted ys from our model many times, so let’s turn that calculation into a function, lm_preds():

# Predicted ys from a linear model with a given slope and y-intercept
lm_preds <- function(slope, yint, x){
  slope * x + yint
}

# Now our log-likelihood calculation looks like this
sum(dnorm(x = sim1$y, mean = lm_preds(slope = slope, yint = yint, x = sim1$x), sd = 2, log = TRUE))
## [1] -1203.213

Exercise 2

Great, so now you’ve seen how to write a function and how to calculate a likelihood. Now try to write a function that takes the data (xs and ys) and a proposed model (slope and y-intercept) and returns the log-likelihood. Test your function with different proposed models to try to find the one with the maximum likelihood.

# R Code Here  
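
One possible solution sketch (the function name lm_loglik and the fixed sd = 2 are illustrative assumptions, not requirements of the exercise):

# Log-likelihood of a proposed linear model, given data and a noise sd
lm_loglik <- function(x, y, slope, yint, sd = 2){
  sum(dnorm(x = y, mean = lm_preds(slope = slope, yint = yint, x = x), sd = sd, log = TRUE))
}

# Compare a few proposed models; larger (less negative) values
# indicate a more likely model
lm_loglik(sim1$x, sim1$y, slope = 4, yint = 10)
lm_loglik(sim1$x, sim1$y, slope = 2, yint = 4)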

Interpreting MLE Results

In the previous example, we calculated the likelihood of simulated data provided by the modelr package. Sometimes it can be slightly difficult to understand the limits of likelihood without knowing the true data-generating process, so in this section we will learn how to generate data at random.

Example 3

# Take a look at the output of lm on our data
lm(y ~ x, data = sim1) %>% summary()
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1469 -1.5197  0.1331  1.4670  4.6516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2208     0.8688   4.858 4.09e-05 ***
## x             2.0515     0.1400  14.651 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8805 
## F-statistic: 214.7 on 1 and 28 DF,  p-value: 1.173e-14
# Generate some random data from a known linear model
n <- 100
slope <- 5
yint <- 0
sample <- tibble(x = runif(n, min = 0, max = 100)) %>%
          mutate(y = rnorm(n, mean = x * slope + yint, sd = 50))

ggplot(sample, aes(x,y)) + geom_point()

fit <- lm(y ~ x, data = sample)
fit %>% summary()
## 
## Call:
## lm(formula = y ~ x, data = sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -164.00  -31.52   -3.72   24.79  124.09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.8799     9.9826  -1.491    0.139    
## x             5.3425     0.1777  30.061   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.35 on 98 degrees of freedom
## Multiple R-squared:  0.9022, Adjusted R-squared:  0.9012 
## F-statistic: 903.7 on 1 and 98 DF,  p-value: < 2.2e-16
confint(fit, level = 0.95) # 95% confidence interval
##                 2.5 %   97.5 %
## (Intercept) -34.69000 4.930216
## x             4.98985 5.695216
c(slope, yint) # Does the confidence interval capture the true values?
## [1] 5 0

Exercise 3

The lm() function can estimate the slope and intercept of the linear model, but its ability to do so depends on the amount of data and on how noisy the data are. Copy the code from the example above and vary the noise level and the number of samples to get a sense of the limits for correctly estimating the MLE.

# R Code Here
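
A starting point (a sketch; n and noise_sd are the knobs to vary, and the names are illustrative):

# Refit with a smaller, noisier sample and check whether the
# 95% CI still captures the true slope and intercept
n <- 20
noise_sd <- 100
noisy_sample <- tibble(x = runif(n, min = 0, max = 100)) %>%
  mutate(y = rnorm(n, mean = x * slope + yint, sd = noise_sd))

noisy_fit <- lm(y ~ x, data = noisy_sample)
confint(noisy_fit, level = 0.95)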

Exercise 3 (Extended)

There are two extended exercises; feel free to do either one.

  1. Use your log-likelihood function to create likelihood profiles for the slope and y-intercept parameters, as shown in the presentation. Try to use the tidyverse, especially the map() function, to do so (a starting sketch follows the code stub below).

  2. Create a function that can generate random samples from linear models. Then use the tidyverse, especially the map() function, to fit linear models to samples with varying sample sizes and noise levels. Plot the 95% CI for the slope estimate as a function of the number of samples, faceting by noise level.

# R Code here
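
For the first option, a sketch of a likelihood profile over the slope (this assumes a log-likelihood function like the lm_loglik() sketched under Exercise 2; map_dbl() comes from purrr, which loads with the tidyverse):

# Evaluate the log-likelihood over a grid of slopes,
# holding the y-intercept fixed
tibble(slope = seq(1, 3, by = 0.05)) %>%
  mutate(loglik = map_dbl(slope, ~ lm_loglik(sim1$x, sim1$y, slope = .x, yint = 4))) %>%
  ggplot(aes(slope, loglik)) + geom_line()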