R syntax for probability distributions and functions:
example A
x <- 2
p <- pnorm(x) # standard normal
The cdf of the standard normal distribution at x = 2 is 0.9772499
example B
mu <- 1
sigma <- 3
p2 <- pnorm(x,mean=mu,sd=sigma) # normal dist parameters "mean","sd" for standard deviation
The cdf of the normal distribution with mu = 1 and sigma = 3 at x = 2 is 0.6305587
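A small check that may help here: passing `mean` and `sd` to `pnorm` should give the same value as standardizing x by hand, and `dnorm` (the pdf) is a different quantity from `pnorm` (the cdf). This is only a sketch that re-uses the values from examples A and B; the names `p_direct`, `p_standard` and `d` are made up for illustration.

```r
# Values from examples A and B
x <- 2
mu <- 1
sigma <- 3

# cdf of N(mu, sigma^2) at x, computed two ways
p_direct   <- pnorm(x, mean = mu, sd = sigma)  # pass the parameters to pnorm
p_standard <- pnorm((x - mu) / sigma)          # standardize, then use N(0, 1)

# dnorm returns the pdf (density), not the cdf
d <- dnorm(x, mean = mu, sd = sigma)

print(c(cdf_direct = p_direct, cdf_standardized = p_standard, pdf = d))
```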
quantile function is the inverse of the cdf
p <- 0.9
qp <- qnorm(p,mean=mu,sd=sigma) # quantile function of p
pcheck <- pnorm(qp,mean=mu,sd=sigma) # cdf of qp
p = 0.9, quantile of p = qp = 4.8446547, pcheck = cdf(qp) = 0.9
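The round trip above can be repeated for several probabilities at once, since `qnorm` and `pnorm` are vectorized. A minimal sketch, assuming the same mu and sigma; the vector `probs` and the names `q` and `back` are illustrative choices, not part of the original example.

```r
mu <- 1
sigma <- 3

# A few probabilities chosen for illustration
probs <- c(0.025, 0.5, 0.9, 0.975)

q    <- qnorm(probs, mean = mu, sd = sigma)  # quantiles of N(mu, sigma^2)
back <- pnorm(q, mean = mu, sd = sigma)      # cdf at those quantiles

# The last column should reproduce the first
print(data.frame(p = probs, quantile = q, cdf_of_quantile = back))
```

The median (p = 0.5) equals mu, because the normal distribution is symmetric about its mean.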
plot of pdf of standard normal distribution
options(repr.plot.width=5, repr.plot.height=3.5) # resize plot
xplot <- seq(-3,3,length=101) # make a sequence from -3 to +3 of 101 points
plot(xplot,dnorm(xplot), # x and y values for plot
type="l",col="red") # type is line "l" (plots points by default), color is "red"
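The curve plotted above is the density whose area gives the cdf. As a rough numerical illustration (a sketch, not part of the original notebook), the area under `dnorm` up to x = 2 can be compared with `pnorm(2)` from example A using base R's `integrate`; the names `x_a`, `area` and `total` are illustrative.

```r
x_a <- 2  # same point as in example A

# Area under the standard normal pdf from -Inf to x_a
area <- integrate(dnorm, lower = -Inf, upper = x_a)$value

# Total area under the pdf
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

print(c(area_up_to_x = area, pnorm_x = pnorm(x_a), total_area = total))
```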
compare random variates to theoretical probability distributions
x <- rnorm(1000,mean=mu,sd=sigma) # generate 1000 random variates of normal distribution
sample mean = 0.9663802, sample standard deviation = 3.0170068
Caution: even with 1000 samples from a normal distribution, the
sample mean and std dev agree with the true mean and std dev only to
within sampling error.
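To put a number on this caution, the standard error of the sample mean is sigma/sqrt(n). The sketch below compares sample means with that scale for a few sample sizes. The seed and the sizes in `ns` are arbitrary choices for illustration, so the printed values will differ from the output shown above.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable
mu <- 1
sigma <- 3

ns <- c(100, 1000, 10000)   # sample sizes chosen for illustration
se <- sigma / sqrt(ns)      # standard error of the sample mean

# One sample mean per sample size
means <- sapply(ns, function(n) mean(rnorm(n, mean = mu, sd = sigma)))

print(data.frame(n = ns, sample_mean = means, std_error = se))
```

For n = 1000 the standard error is about 0.095, so a sample mean that differs from mu in the second decimal place is expected.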
a histogram is a binned (discrete) estimate of the pdf, which makes it hard to compare with the theoretical pdf
hist(x)
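One way to make the comparison somewhat easier is to draw the histogram on the density scale (`freq = FALSE`) and overlay the theoretical pdf. This is a sketch under assumed settings: the seed, the number of breaks, and the name `x_demo` are illustrative, and the result still depends on the bin width.

```r
set.seed(1)  # arbitrary seed for repeatability
mu <- 1
sigma <- 3
x_demo <- rnorm(1000, mean = mu, sd = sigma)

# freq = FALSE scales the bars so that their total area is 1
h <- hist(x_demo, freq = FALSE, breaks = 30,
          main = "Histogram on density scale", xlab = "x")

# Overlay the theoretical pdf of N(mu, sigma^2)
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, col = "red")
```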
empirical cumulative distribution function (ecdf)
plot(ecdf(x),col="purple") # plot of ecdf (a right-continuous step function)
xplot <-seq(min(x),max(x),length=101) # make xplot span the range of x
lines(xplot,pnorm(xplot,mu,sigma),col='green') # theoretical cdf for N(mu,sigma^2)
Note: For n=100 samples, the agreement of ecdf and cdf is not good, but for n=1000 samples it looks ok.
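The visual impression in the note can be backed by a number: the largest gap between the ecdf and the theoretical cdf on a grid of points. The sketch below computes it for n = 100 and n = 1000. The helper `max_gap`, the seed, and the grid size are assumptions made for illustration; the gap typically shrinks roughly like 1/sqrt(n), though any single run can deviate.

```r
set.seed(1)  # arbitrary seed for repeatability
mu <- 1
sigma <- 3

# Hypothetical helper: largest |ecdf - cdf| on a grid spanning the sample
max_gap <- function(n) {
  x_demo   <- rnorm(n, mean = mu, sd = sigma)
  grid_pts <- seq(min(x_demo), max(x_demo), length.out = 101)
  max(abs(ecdf(x_demo)(grid_pts) - pnorm(grid_pts, mean = mu, sd = sigma)))
}

gaps <- sapply(c(100, 1000), max_gap)
print(data.frame(n = c(100, 1000), max_abs_gap = gaps))
```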
qqplot is the best way to determine fit with a theoretical
distribution
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/qqnorm
qqnorm(x) # makes a plot of the data quantiles vs the theoretical quantiles
qqline(x,col='blue') # draws a line through the 1st and 3rd quartiles, used to check the fit
grid(nx = NULL, ny = NULL, # optionally add gridlines
lty = 2, col = "gray", lwd = 2) # line type, line color, line width
Notes:
(1) If the distribution is correct, data falls on a line; we do not need
to know the parameters
(2) parameters can be found from the fit: mean = intercept, sd =
slope.
(3) for a larger sample, the fit with the theoretical distribution gets
better.
(4) the qqplot is most sensitive to behavior in the “tails” of the
distribution (very small values for the pdf) for which the number of
samples is very small.
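Note (2) can be illustrated by computing the reference line by hand. By default `qqline` passes through the first and third quartiles, so its slope and intercept can be reproduced with `quantile` and `qnorm`. This is a sketch with an arbitrary seed and illustrative names (`x_demo`, `slope`, `intercept`); the estimates are only approximate because they use two sample quartiles.

```r
set.seed(1)  # arbitrary seed for repeatability
mu <- 1
sigma <- 3
x_demo <- rnorm(1000, mean = mu, sd = sigma)

probs <- c(0.25, 0.75)                         # quartiles used by qqline by default
y_q <- quantile(x_demo, probs, names = FALSE)  # sample quartiles
x_q <- qnorm(probs)                            # standard normal quartiles

slope     <- diff(y_q) / diff(x_q)    # estimate of sd
intercept <- y_q[1] - slope * x_q[1]  # estimate of mean

print(c(intercept = intercept, slope = slope))
```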