Investigate some of the data sets introduced in Chapter 1.
All data sets in the textbook can be accessed by installing the R
package ‘resampledata3’ and loading it with library(resampledata3).
An index of all the data sets and their contents is here (also posted on
UBLearns): https://cran.r-project.org/web/packages/resampledata3/resampledata3.pdf
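A minimal sketch of the install-and-load step, under the assumption that the package is installed from CRAN with the standard install.packages() call. The variable name pkg_available is introduced here for illustration only.

```r
# Run once per machine (left commented out so the notes can be re-run offline):
# install.packages("resampledata3")

# Attach the package if it is installed; otherwise report that it is missing.
pkg_available <- requireNamespace("resampledata3", quietly = TRUE)
if (pkg_available) {
  library(resampledata3)   # makes FlightDelays, NCBirths2004, ... available
} else {
  message("resampledata3 is not installed; run install.packages(\"resampledata3\") first")
}
```

Once the package is attached, data frames such as FlightDelays can be used directly by name.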
head(FlightDelays,3) # look at the first few rows of the data frame
## ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 1 1 UA 403 DEN 4-8am Fri May 281 -1
## 2 2 UA 405 DEN 8-Noon Fri May 277 102
## 3 3 UA 409 DEN 4-8pm Fri May 279 4
## Delayed30
## 1 No
## 2 Yes
## 3 No
Does the length of delay depend on the Carrier?
Compare Delay vs Carrier: compare means using the “tapply” command.
tapply(x,y,function_name) groups the data x by the categories in y, then
applies function_name to the values of x in each category.
tapply(FlightDelays$Delay,FlightDelays$Carrier,mean)
## AA UA
## 10.09738 15.98308
Based on the above comparison, UA has longer average flight delays than AA.
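To see what tapply does step by step, here is a small sketch on made-up numbers (the vectors delay and carrier below are toy values, not taken from FlightDelays). It groups six delays by carrier and averages each group.

```r
# Toy data: six made-up delays (minutes) and the carrier of each flight
delay   <- c(5, 15, -2, 30, 10, 0)
carrier <- c("AA", "UA", "AA", "UA", "UA", "AA")

# Group delay by carrier, then apply mean() to each group
group_means <- tapply(delay, carrier, mean)
group_means                # AA averages 5, -2, 0; UA averages 15, 30, 10

# The same result computed by hand for one group
mean(delay[carrier == "AA"])
```

The result is a named vector with one entry per category, so an individual value can be read off with group_means["UA"].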
Another comparison: look at the empirical cdf (ecdf) for each carrier. First extract data by Carrier:
UAdata <- FlightDelays[which(FlightDelays$Carrier=='UA'),] # select all the data for UA
AAdata <- FlightDelays[which(FlightDelays$Carrier=='AA'),] # select all the data for AA
head(AAdata,3)
## ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 545 545 AA 301 ORD 4-8am Fri May 140 -3
## 546 546 AA 303 ORD 4-8am Fri May 145 -5
## 547 547 AA 305 ORD 4-8am Fri May 145 0
## Delayed30
## 545 No
## 546 No
## 547 No
Then make an ecdf of the Delay data for each Carrier
options(repr.plot.width=5, repr.plot.height=3.5) # resize plot
FAA <- ecdf(AAdata$Delay) # ecdf of delay for AA
FUA <- ecdf(UAdata$Delay) # ecdf of delay for UA
plot(FAA,col='red',main="",xlim=c(0,500)) # "main" is title, "xlim" sets x limits
lines(FUA,col='blue')
The figure above shows the ecdf for AA (red) and UA (blue).
The UA ecdf lies to the right of the AA ecdf. So, for a given value of the
cdf (say p=0.8), 80% of the AA flights have delays of less than ~20
min, while 80% of the UA flights have delays of less than ~40 min.
Comparing the ecdfs therefore also shows that UA has worse flight delays.
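Values like these can be read off numerically rather than by eye. The sketch below uses simulated delays (delay_a and delay_b are hypothetical stand-ins, not the AA and UA data) and quantile() with type = 1, which inverts the ecdf.

```r
set.seed(1)
# Simulated, right-skewed delays in minutes (NOT the textbook data)
delay_a <- rexp(500, rate = 1/10) - 5
delay_b <- rexp(500, rate = 1/16) - 5

F_a <- ecdf(delay_a)
F_b <- ecdf(delay_b)

# type = 1 returns the smallest data value x with ecdf(x) >= p
q80_a <- unname(quantile(delay_a, 0.8, type = 1))
q80_b <- unname(quantile(delay_b, 0.8, type = 1))

c(q80_a = q80_a, q80_b = q80_b)   # delay below which 80% of each sample falls
c(F_a(q80_a), F_b(q80_b))         # check: ecdf at those values is at least 0.8
```

With the real data, the same calls on AAdata$Delay and UAdata$Delay would give the 80% delay values that the plot shows only approximately.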
Is the distribution of flight delays a normal distribution? Look at a qqplot.
qqnorm(FlightDelays$Delay)
qqline(FlightDelays$Delay,col='blue')
The above qqplot is not close to linear, so the data are not well described
by a normal distribution.
We can also see this from the histogram, which is very skewed in shape (not
shown in handout).
#hist(FlightDelays$Delay)
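One informal way to put a number on "close to linear" in a qqplot is the correlation between the theoretical and sample quantiles. This is a sketch on simulated samples (x_normal and x_skewed are made up for illustration), not a formal test and not part of the textbook analysis.

```r
set.seed(2)
x_normal <- rnorm(1000)   # simulated normal sample
x_skewed <- rexp(1000)    # simulated right-skewed sample

# qqnorm(..., plot.it = FALSE) returns the plotted points without drawing:
# $x holds the theoretical quantiles, $y the sample quantiles
qq_normal <- qqnorm(x_normal, plot.it = FALSE)
qq_skewed <- qqnorm(x_skewed, plot.it = FALSE)

r_normal <- cor(qq_normal$x, qq_normal$y)
r_skewed <- cor(qq_skewed$x, qq_skewed$y)
c(normal = r_normal, skewed = r_skewed)   # closer to 1 means closer to a straight line
```

A correlation noticeably below 1 goes with the curved pattern seen for skewed data.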
df <- NCBirths2004 # store data with a simpler generic name "df"
head(df,3) #look at the first few rows of the data frame
## ID MothersAge Smoker Alcohol Gender Weight Gestation
## 1 1 30-34 No No Male 3827 40
## 2 2 30-34 No No Male 3629 38
## 3 3 35-39 No No Female 3062 37
What are the mean and standard deviation of baby weights?
mean_wt = mean(df$Weight)
sd_wt = sd(df$Weight)
mean = 3448.259663, std dev = 487.7360397
Are the baby weights normally distributed? Look at the qqplot.
qqnorm(df$Weight)
qqline(df$Weight,col='blue') # line through Q1 and Q3 of data
qt <- seq(-3,3,length=101) # grid of equal spacing for theoretical quantiles from -3 to +3
theory_line <- mean_wt + sd_wt*qt # intercept = mean of data, slope = std dev of data
lines(qt,theory_line,col='red') # graph of theory line
In the qqplot above, the red line has intercept = mean and slope = sd of
the data, and it follows the points closely.
This visually supports describing the data by the normal distribution
N(mean_wt, sd_wt^2).
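The blue qqline and the red theory line are built differently: qqline passes through the first and third quartiles, while the red line uses the mean and sd. For normal data the two nearly coincide. The sketch below checks this on simulated weights (w is a hypothetical sample with roughly the mean and sd found above, not the NCBirths2004 data).

```r
set.seed(42)
w <- rnorm(1000, mean = 3448, sd = 488)   # simulated weights (NOT the textbook data)

# Line drawn by qqline(): through the quartiles of the data and of N(0,1)
y_q <- quantile(w, c(0.25, 0.75))
x_q <- qnorm(c(0.25, 0.75))
slope_q     <- unname(diff(y_q) / diff(x_q))
intercept_q <- unname(y_q[1] - slope_q * x_q[1])

# Compare with the mean/sd line used for the red curve
c(qqline_intercept = intercept_q, mean = mean(w))
c(qqline_slope = slope_q, sd = sd(w))
```

If the two intercepts or the two slopes differed substantially, that would be a sign the data depart from a normal distribution.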