MTH 412/512 - Ch 2: probability distributions

R syntax for probability distributions and functions:

Functions are named prefix + ‘name’, e.g. pnorm = ‘p’ + ‘norm’.
‘name’ is the name of the probability distribution; see the list at
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/Distributions
The prefix letter selects the function:
- ‘d’ = density function (pdf)
- ‘p’ = cumulative distribution function (cdf)
- ‘q’ = quantile function (inverse cdf)
- ‘r’ = random variate generator
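
A minimal sketch of the naming rule, using ‘norm’ (normal) and, as a second distribution chosen here only for illustration, ‘exp’ (exponential). The variable names are placeholders, not part of the notes.

```r
# prefix + name gives the function: d/p/q/r + "norm" or "exp"
d0 <- dnorm(0)            # pdf of the standard normal at 0, equals 1/sqrt(2*pi)
p0 <- pnorm(0)            # cdf of the standard normal at 0, equals 0.5 by symmetry
q0 <- qnorm(0.5)          # quantile for probability 0.5, equals 0 (the median)
r5 <- rnorm(5)            # five random variates (different on each run)

# same pattern for the exponential distribution with rate 2 (illustrative choice)
pe <- pexp(1, rate = 2)   # cdf at 1, equals 1 - exp(-2)
qe <- qexp(pe, rate = 2)  # quantile function undoes the cdf, gives back 1
```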

example A

x <- 2
p <- pnorm(x) # standard normal

The cdf of the standard normal distribution at x = 2 is 0.9772499

example B

mu <- 1
sigma <- 3
p2 <- pnorm(x,mean=mu,sd=sigma) # normal dist parameters "mean","sd" for standard deviation

The cdf of the normal distribution with mu = 1 and sigma = 3 at x = 2 is 0.6305587
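
A short check, assuming the same x, mu and sigma as in example B: the value from pnorm with mean and sd should agree with the standard normal cdf at the standardized value z = (x - mu)/sigma. The names z, p_direct and p_standard are introduced here for illustration.

```r
x <- 2
mu <- 1
sigma <- 3
z <- (x - mu) / sigma                        # standardized value, here 1/3
p_direct <- pnorm(x, mean = mu, sd = sigma)  # cdf of N(mu, sigma^2) at x
p_standard <- pnorm(z)                       # standard normal cdf at z
same <- isTRUE(all.equal(p_direct, p_standard)) # TRUE if equal up to floating-point tolerance
```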

quantile function is the inverse of the cdf

p <- 0.9
qp <- qnorm(p,mean=mu,sd=sigma) # quantile function of p
pcheck <- pnorm(qp,mean=mu,sd=sigma) # cdf of qp

p = 0.9, quantile of p = qp = 4.8446547, pcheck = cdf(qp) = 0.9
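
One possible use of the quantile function, sketched under the same mu and sigma: qnorm accepts a vector of probabilities, so the bounds of a central 95% interval can be computed in one call and checked with pnorm. The names probs, bounds and coverage are illustrative.

```r
mu <- 1
sigma <- 3
probs <- c(0.025, 0.975)                       # lower and upper cumulative probabilities
bounds <- qnorm(probs, mean = mu, sd = sigma)  # quantiles, computed element-wise
coverage <- diff(pnorm(bounds, mean = mu, sd = sigma)) # probability between the bounds, 0.95 up to rounding
```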

plot of pdf of standard normal distribution

options(repr.plot.width=5, repr.plot.height=3.5) # resize plot
xplot <- seq(-3,3,length=101) # make a sequence from -3 to +3 of 101 points
plot(xplot,dnorm(xplot), # x and y values for plot
     type="l",col="red") # type is line "l" (plots points by default), color is "red"
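
The pdf plotted above and the cdf from example A are linked: the cdf at x is the area under the pdf up to x. A rough numerical check, assuming the base R function integrate is accurate enough for this purpose (area and p_exact are names chosen here):

```r
area <- integrate(dnorm, lower = -Inf, upper = 2)$value  # area under the standard normal pdf up to 2
p_exact <- pnorm(2)                                      # cdf at 2 from example A
gap <- abs(area - p_exact)                               # should be close to zero
```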

compare random variates to theoretical probability distributions

x <- rnorm(1000,mean=mu,sd=sigma) # generate 1000 random variates of normal distribution

sample mean = 0.9663802, sample standard deviation = 3.0170068
Caution: even with 1000 samples from a normal distribution, the sample mean and std dev match the true mean and std dev only approximately.
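
To put a scale on the deviation, a sketch that compares the sample mean with the theoretical standard error of the mean, sigma/sqrt(n). The seed is arbitrary and only makes the run repeatable; xs, xbar, s and se_mean are names chosen here, so the numbers will differ from those reported above.

```r
set.seed(412)                          # arbitrary seed, only for reproducibility
mu <- 1
sigma <- 3
n <- 1000
xs <- rnorm(n, mean = mu, sd = sigma)  # fresh sample, kept separate from x
xbar <- mean(xs)                       # sample mean
s <- sd(xs)                            # sample standard deviation
se_mean <- sigma / sqrt(n)             # standard error of the mean, about 0.095
```

A deviation of xbar from mu of one or two times se_mean is ordinary sampling noise.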

a histogram is a binned (discrete) version of the pdf, so it is hard to compare with the theoretical pdf

hist(x)
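
One way to make the comparison a little easier, as a sketch: draw the histogram on the density scale (freq = FALSE) so that the bar areas sum to 1, then overlay the theoretical pdf. The sample xs and the seed are assumptions made here for a self-contained example.

```r
set.seed(412)                               # arbitrary seed, only for reproducibility
mu <- 1
sigma <- 3
xs <- rnorm(1000, mean = mu, sd = sigma)
h <- hist(xs, freq = FALSE)                 # density scale: bar areas sum to 1
curve(dnorm(x, mean = mu, sd = sigma),      # theoretical pdf on the same axes
      add = TRUE, col = "red")
total_area <- sum(h$density * diff(h$breaks)) # total area of the bars
```

The result still depends on the choice of bins, which is why the ecdf and the qqplot are used next.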

empirical cumulative distribution function (ecdf)

plot(ecdf(x),col="purple") # plot of ecdf (right-continuous)
xplot <- seq(min(x),max(x),length=101) # make xplot span the range of x
lines(xplot,pnorm(xplot,mu,sigma),col='green') # theoretical cdf for N(mu,sigma^2)

Note: For n=100 samples, the agreement of ecdf and cdf is not good, but for n=1000 samples it looks ok.
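
The visual impression can be turned into a number. A sketch, assuming a grid-based approximation of the largest vertical gap between ecdf and cdf (this is not an exact Kolmogorov-Smirnov statistic; max_gap is a helper defined here, not a base R function):

```r
set.seed(412)                                    # arbitrary seed, only for reproducibility
mu <- 1
sigma <- 3
max_gap <- function(n) {
  xs <- rnorm(n, mean = mu, sd = sigma)          # sample of size n
  grid <- seq(min(xs), max(xs), length.out = 1001) # evaluation points over the data range
  max(abs(ecdf(xs)(grid) - pnorm(grid, mean = mu, sd = sigma)))
}
gap100 <- max_gap(100)                           # approximate largest gap for n = 100
gap1000 <- max_gap(1000)                         # approximate largest gap for n = 1000
```

The gap tends to shrink as n grows, although a single pair of samples need not show it.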

qqplot is the best way to determine fit with a theoretical distribution
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/qqnorm

qqnorm(x) # makes a plot of the data quantiles vs the theoretical quantiles
qqline(x,col='blue') # draws line through 1st and 3rd quartiles, use to check linearity
grid(nx = NULL, ny = NULL,           # optionally add gridlines
     lty = 2, col = "gray", lwd = 2) # line type, line color, line width

Notes:
(1) If the distribution is correct, the data fall on a straight line; we do not need to know the parameters.
(2) For a normal qqplot against standard normal quantiles, the parameters can be found from the fit: mean = intercept, sd = slope.
(3) For a larger sample, the fit with the theoretical distribution gets better.
(4) The qqplot is most sensitive to behavior in the “tails” of the distribution, where the pdf is very small and only a few samples fall.
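
Note (2) can be tried out numerically. A sketch, assuming a least-squares line through the qqplot points (this is not the line that qqline draws, which goes through the quartiles); xs, qq, fit and est are names chosen here.

```r
set.seed(412)                          # arbitrary seed, only for reproducibility
mu <- 1
sigma <- 3
xs <- rnorm(1000, mean = mu, sd = sigma)
qq <- qqnorm(xs, plot.it = FALSE)      # list: qq$x theoretical quantiles, qq$y data
fit <- lm(qq$y ~ qq$x)                 # least-squares line through the points
est <- coef(fit)                       # est[1] = intercept (about mu), est[2] = slope (about sigma)
```

Because the tails carry few points, the slope from such a fit is only a rough estimate of sigma.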