Investigate some of the data sets introduced in Chapter 1.

All data sets in the textbook are available after installing the R package ‘resampledata3’ (see code chunk below).
An index of all the data sets and their contents is here (also posted on UBLearns): https://cran.r-project.org/web/packages/resampledata3/resampledata3.pdf
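
A minimal sketch of the install-and-load step, assuming the package is installed from CRAN. The requireNamespace guard is an addition here so that the install is skipped when the package is already present.

```r
# One-time install from CRAN; skipped if the package is already installed.
if (!requireNamespace("resampledata3", quietly = TRUE)) {
  install.packages("resampledata3")
}

# Load the package so that data sets such as FlightDelays and
# NCBirths2004 can be referred to by name.
library(resampledata3)
```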

Flight Delays (data set FlightDelays.csv)

head(FlightDelays,3) # look at the first few rows of the data frame
##   ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 1  1      UA      403         DEN      4-8am Fri   May          281    -1
## 2  2      UA      405         DEN     8-Noon Fri   May          277   102
## 3  3      UA      409         DEN      4-8pm Fri   May          279     4
##   Delayed30
## 1        No
## 2       Yes
## 3        No

Does the length of delay depend on the Carrier?

Compare Delay vs Carrier: compare means using the “tapply” command.
tapply(x,y,function_name) splits the data x into groups according to the categories in y, then applies function_name to the values of x in each group.

tapply(FlightDelays$Delay,FlightDelays$Carrier,mean)
##       AA       UA 
## 10.09738 15.98308

Based on the above comparison, UA has longer average flight delays than AA.
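
To see what tapply does on a small scale, here is a sketch with made-up numbers (the vectors delay and carrier below are hypothetical and are not taken from FlightDelays). The group means from tapply should agree with the means computed by hand for each group.

```r
# Toy data, made up for illustration only
delay   <- c(10, 20, 30, 40, 50)
carrier <- c("AA", "UA", "AA", "UA", "UA")

# tapply splits delay by carrier, then applies mean to each group
group_means <- tapply(delay, carrier, mean)
print(group_means)

# The same means computed by hand with logical subsetting
mean_AA <- mean(delay[carrier == "AA"])  # mean of 10 and 30
mean_UA <- mean(delay[carrier == "UA"])  # mean of 20, 40 and 50
print(c(AA = mean_AA, UA = mean_UA))
```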

Another comparison: look at the empirical cumulative distribution function (ecdf) for each carrier. First extract data by Carrier:

UAdata <- FlightDelays[which(FlightDelays$Carrier=='UA'),] # select all the data for UA
AAdata <- FlightDelays[which(FlightDelays$Carrier=='AA'),] # select all the data for AA
head(AAdata,3)
##      ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 545 545      AA      301         ORD      4-8am Fri   May          140    -3
## 546 546      AA      303         ORD      4-8am Fri   May          145    -5
## 547 547      AA      305         ORD      4-8am Fri   May          145     0
##     Delayed30
## 545        No
## 546        No
## 547        No

Then make an ecdf of the Delay data for each Carrier:

options(repr.plot.width=5, repr.plot.height=3.5) # resize plot
FAA <- ecdf(AAdata$Delay)  # ecdf of delay for AA
FUA <- ecdf(UAdata$Delay)  # ecdf of delay for UA
plot(FAA,col='red',main="",xlim=c(0,500)) # "main" is title, "xlim" sets x limits 
lines(FUA,col='blue')

The figure above shows the ecdf for AA (red) and UA (blue).
The UA ecdf lies to the right of the AA ecdf. So, for a given value of the cdf (say p=0.8), 80% of the AA flights have delays less than ~20 min, while 80% of the UA flights have delays less than ~40 min. By this comparison too, UA has worse flight delays.
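
The two ways of reading an ecdf (vertically, as a fraction of flights below a given delay; horizontally, as the delay at a given fraction) can be sketched numerically. The delay vectors below are made-up toy numbers for two hypothetical carriers, chosen only for illustration. They are not the FlightDelays data.

```r
# Made-up delays in minutes for two hypothetical carriers (illustration only)
delays_a <- c(-5, -3, 0, 2, 5, 8, 12, 20, 45, 90)
delays_b <- c(-2, 0, 3, 6, 10, 15, 25, 40, 80, 150)

F_a <- ecdf(delays_a)  # ecdf() returns a function
F_b <- ecdf(delays_b)

# Vertical reading: fraction of flights with delay <= 20 minutes
frac_a <- F_a(20)
frac_b <- F_b(20)

# Horizontal reading: smallest delay d with F(d) >= 0.8.
# type = 1 makes quantile() the inverse of the ecdf.
q80_a <- quantile(delays_a, probs = 0.8, type = 1, names = FALSE)
q80_b <- quantile(delays_b, probs = 0.8, type = 1, names = FALSE)

print(c(frac_a = frac_a, frac_b = frac_b, q80_a = q80_a, q80_b = q80_b))
```

The carrier whose ecdf lies further to the right has the smaller fraction at a fixed delay and the larger delay at a fixed fraction. The same calls can be applied to AAdata$Delay and UAdata$Delay.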

Are the flight delays normally distributed? Look at a qqplot.

qqnorm(FlightDelays$Delay)
qqline(FlightDelays$Delay,col='blue')

The above qqplot is not close to linear, so the data are not well described by a normal distribution.
We can also see this from the histogram, which is strongly skewed (not shown in handout).

#hist(FlightDelays$Delay)
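
As a rough numeric companion to the visual check, one can compare how straight the qqplot points are for a normal sample and for a skewed sample. The sketch below uses simulated data (not FlightDelays), and the correlation of the qqplot points is only an informal summary assumed here for illustration, not a formal test of normality.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 1000
normal_sample <- rnorm(n, mean = 10, sd = 5)  # simulated normal data
skewed_sample <- rexp(n, rate = 1 / 10)       # simulated right-skewed data

# qqnorm(..., plot.it = FALSE) returns the qqplot points without drawing
qq_normal <- qqnorm(normal_sample, plot.it = FALSE)
qq_skewed <- qqnorm(skewed_sample, plot.it = FALSE)

# Correlation between theoretical and sample quantiles:
# close to 1 when the points lie near a straight line
r_normal <- cor(qq_normal$x, qq_normal$y)
r_skewed <- cor(qq_skewed$x, qq_skewed$y)
print(c(normal = r_normal, skewed = r_skewed))
```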

Baby Weight

df <- NCBirths2004  # store data with a simpler generic name "df"
head(df,3) #look at the first few rows of the data frame
##   ID MothersAge Smoker Alcohol Gender Weight Gestation
## 1  1      30-34     No      No   Male   3827        40
## 2  2      30-34     No      No   Male   3629        38
## 3  3      35-39     No      No Female   3062        37

What are the mean and standard deviation of baby weights?

mean_wt = mean(df$Weight)
sd_wt = sd(df$Weight)

mean = 3448.259663, std dev = 487.7360397

Are the baby weights normally distributed? Look at the qqplot.

qqnorm(df$Weight)
qqline(df$Weight,col='blue') # line through Q1 and Q3 of data
qt <- seq(-3,3,length=101)   # grid of equal spacing for theoretical quantiles from -3 to +3
theory_line <- mean_wt + sd_wt*qt  # intercept = mean of data, slope = std dev of data
lines(qt,theory_line,col='red') # graph of theory line

The points in the qqplot above are fit well by the red line with intercept = mean and slope = sd of the data.
This visually confirms that the data are well described by the normal distribution N(mean_wt, sd_wt^2).
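
Why intercept = mean and slope = sd: if X is N(mu, sigma^2), its p-th quantile equals mu + sigma*z_p, where z_p is the p-th quantile of the standard normal. So plotting normal data against standard normal quantiles gives a straight line with intercept mu and slope sigma. A small sketch that checks this identity with qnorm, using the mean and standard deviation reported above rounded to two decimals (the probabilities in p are arbitrary choices):

```r
mu    <- 3448.26  # rounded mean of df$Weight reported above
sigma <- 487.74   # rounded standard deviation reported above

p <- c(0.025, 0.25, 0.5, 0.75, 0.975)  # a few probabilities to check
z <- qnorm(p)                          # standard normal quantiles z_p

direct   <- qnorm(p, mean = mu, sd = sigma)  # quantiles of N(mu, sigma^2)
on_line  <- mu + sigma * z                   # line with intercept mu, slope sigma

print(cbind(p, z, direct, on_line))
```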