Merge pull request #40 from Ocalak/master
Update
mca91 authored Nov 8, 2023
2 parents c61546b + 6347c17 commit 9ee3292
Showing 13 changed files with 102 additions and 83 deletions.
26 changes: 15 additions & 11 deletions 02-ch2.Rmd
@@ -3,9 +3,9 @@
This chapter reviews some basic concepts of probability theory and demonstrates how they can be
applied in `r ttcode("R")`.

Most of the statistical functionalities in base `r ttcode("R")` are collected in the `r ttcode("stats")` package. It provides simple functions which compute descriptive measures and facilitate computations involving a variety of probability distributions. It also contains more sophisticated routines that, enable the user to estimate a large number of models based on the same data or help to conduct extensive simulation studies. `r ttcode("stats")` is part of the base distribution of `r ttcode("R")`, meaning that it is installed by default so there is no need to run `install.packages("stats")` or `library("stats")`. Simply execute `library(help = "stats")` in the console to view the documentation and a complete list of all functions gathered in `r ttcode("stats")`. For most packages a documentation that can be viewed within *RStudio* is available. Documentations can be invoked using the `r ttcode("?")` operator, e.g., upon execution of `?stats` the documentation of the `r ttcode("stats")` package is shown in the help tab of the bottom-right pane.
Most of the statistical functionalities in base `r ttcode("R")` are collected in the `r ttcode("stats")` package. It provides simple functions that compute descriptive measures and facilitate computations involving a variety of probability distributions. It also contains more sophisticated routines that enable the user to estimate a large number of models based on the same data or to conduct extensive simulation studies. `r ttcode("stats")` is part of the base distribution of `r ttcode("R")`, meaning that it is installed by default, so there is no need to run `install.packages("stats")` or `library("stats")`. Simply execute `library(help = "stats")` in the console to view the documentation and a complete list of all functions gathered in `r ttcode("stats")`. For most packages, documentation that can be viewed within *RStudio* is available. Documentation can be invoked using the `r ttcode("?")` operator; for example, upon execution of `?stats` the documentation of the `r ttcode("stats")` package is shown in the help tab of the bottom-right pane.

In what follows, our focus is on (some of) the probability distributions that are handled by `r ttcode("R")` and show how to use the relevant functions to solve simple problems. Thereby, we refresh some core concepts of probability theory. Among other things, you will learn how to draw random numbers, how to compute densities, probabilities, quantiles and alike. As we shall see, it is very convenient to rely on these routines.
In what follows, we focus on (some of) the probability distributions handled by `r ttcode("R")` and show how to use the relevant functions to solve simple problems. Afterwards, we will review some core concepts of probability theory. Among other things, you will learn how to draw random numbers and how to compute densities, probabilities, quantiles and the like. As we shall see, it is very convenient to do these computations in `r ttcode("R")`.
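As a brief preview (a minimal sketch, not part of the chapter text): the distribution functions collected in `r ttcode("stats")` follow a common naming scheme in which one of the prefixes `d`, `p`, `q` and `r` is combined with the name of the distribution.

```{r, eval = FALSE}
# the four prefixes, illustrated for the standard normal distribution
dnorm(0)      # d: density at 0
pnorm(1.96)   # p: cumulative probability P(Z <= 1.96)
qnorm(0.975)  # q: quantile function, the inverse of pnorm()
rnorm(5)      # r: draw 5 random numbers
```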

## Random Variables and Probability Distributions

@@ -32,7 +32,7 @@ events, e.g., 'the observed outcome lies between $2$ and $5$'.
A basic function to draw random samples from a specified set of elements is the function `r ttcode("sample()")`, see `?sample`. We can use it to simulate the random outcome of a dice roll. Let's roll the dice!

```{r, echo = T, eval = T, message = F, warning = F}
sample(1:6, 1)
sample(1:6, size=1)
```
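A small extension of this call (a sketch; the sample size of $10$ is an arbitrary choice): to simulate more than one roll, the draws must be made with replacement via the argument `r ttcode("replace")`.

```{r, eval = FALSE}
# roll the dice 10 times; replace = TRUE allows outcomes to repeat
sample(1:6, size = 10, replace = TRUE)
```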

The probability distribution of a discrete random variable is the list of all possible values of the variable and their probabilities which sum to $1$. The cumulative probability distribution function gives the probability that the random variable is less than or equal to a particular value.
@@ -54,7 +54,8 @@ probability <- rep(1/6, 6)
plot(probability,
xlab = "Outcomes",
ylab="Probability",
main = "Probability Distribution")
main = "Probability Distribution",
pch=20)
```
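As a quick numerical check (a sketch that reuses the vector `r ttcode("probability")` defined above): the probabilities sum to $1$, and the probability of rolling at most a $3$ is the sum of the first three entries.

```{r, eval = FALSE}
# the probabilities sum to one
sum(probability)
# P(outcome <= 3)
sum(probability[1:3])
```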

For the cumulative probability distribution we need the cumulative probabilities, i.e., we need the cumulative sums of the vector `r ttcode("probability")`. These sums can be computed using `r ttcode("cumsum()")`.
@@ -67,7 +68,8 @@ cum_probability <- cumsum(probability)
plot(cum_probability,
xlab = "Outcomes",
ylab="Cumulative Probability",
main = "Cumulative Probability Distribution")
main = "Cumulative Probability Distribution",
pch=20)
```

### Bernoulli Trials {-}
@@ -143,7 +145,8 @@ probability <- dbinom(x = k,
plot(x = k,
y = probability,
ylab="Probability",
main = "Probability Distribution Function")
main = "Probability Distribution Function",
pch=20)
```

In a similar fashion we may plot the cumulative distribution function of $k$ by
@@ -159,7 +162,8 @@ prob <- pbinom(q = k,
plot(x = k,
y = prob,
ylab="Probability",
main = "Cumulative Distribution Function")
main = "Cumulative Distribution Function",
pch=20)
```
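Differences of `r ttcode("pbinom()")` values yield probabilities of intervals. The following sketch assumes the parameters `size = 10` and `prob = 0.5`; these values are illustrative and not taken from the chunk above.

```{r, eval = FALSE}
# P(4 <= k <= 7) for k ~ B(10, 0.5)
pbinom(7, size = 10, prob = 0.5) - pbinom(3, size = 10, prob = 0.5)
```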

### Expected Value, Mean and Variance {-}
@@ -399,7 +403,7 @@ g <- function(x) x * f(x)
h <- function(x) x^2 * f(x)
```

Next, we use `r ttcode("integrate()")` and set lower and upper limits of integration to $1$ and $\infty$ using arguments `r ttcode("lower")` and `r ttcode("upper")`. By default, `r ttcode("integrate()")` prints the result along with an estimate of the approximation error to the console. However, the outcome is not a numeric value one can readily do further calculation with. In order to get only a numeric value of the integral, we need to use the `r ttcode("\\$")` operator in conjunction with `r ttcode("value")`. The `r ttcode("\\$")` operator is used to extract elements by name from an object of type `r ttcode("list")`.
Next, we use `r ttcode("integrate()")` and set lower and upper limits of integration to $1$ and $\infty$ using arguments `r ttcode("lower")` and `r ttcode("upper")`. By default, `r ttcode("integrate()")` prints the result along with an estimate of the approximation error to the console. However, the outcome is not a numeric value one can readily do further calculation with. In order to get only a numeric value of the integral, we need to use the `r ttcode("$")` operator in conjunction with `r ttcode("value")`. The `r ttcode("$")` operator is used to extract elements by name from an object of type `r ttcode("list")`.

```{r, echo = T, eval = T, message = F, warning = F}
# compute area under the density curve
Expand Down Expand Up @@ -438,7 +442,7 @@ Thus, for the normal distribution we have the `r ttcode("R")` functions `r ttcod

### The Normal Distribution {-}

The probably most important probability distribution considered here is the normal
Perhaps the most important probability distribution considered here is the normal
distribution. This is not least due to the special role of the standard normal distribution and the Central Limit Theorem which is to be treated shortly. Normal distributions are symmetric and bell-shaped. A normal distribution is characterized by its mean $\mu$ and its standard deviation $\sigma$, concisely expressed by
$\mathcal{N}(\mu,\sigma^2)$. The normal distribution has the PDF

@@ -807,7 +811,7 @@ To clarify the basic idea of random sampling, let us jump back to the dice rolli

Suppose we are rolling the dice $n$ times. This means we are interested in the outcomes of the random draws $Y_i, \ i=1,...,n$, which are characterized by the same distribution. Since these outcomes are selected randomly, they are *random variables* themselves and their realizations will differ each time we draw a sample, i.e., each time we roll the dice $n$ times. Furthermore, each observation is randomly drawn from the same population, that is, the numbers from $1$ to $6$, and their individual distribution is the same. Hence $Y_1,\dots,Y_n$ are identically distributed.

Moreover, we know that the value of any of the $Y_i$ does not provide any information on the remainder of the outcomes In our example, rolling a six as the first observation in our sample does not alter the distributions of $Y_2,\dots,Y_n$: all numbers are equally likely to occur. This means that all $Y_i$ are also independently distributed. Thus $Y_1,\dots,Y_n$ are independently and identically distributed (*i.i.d.*).
Moreover, we know that the value of any of the $Y_i$ does not provide any information on the remainder of the outcomes. In our example, rolling a six as the first observation in our sample does not alter the distributions of $Y_2,\dots,Y_n$: all numbers are equally likely to occur. This means that all $Y_i$ are also independently distributed. Thus $Y_1,\dots,Y_n$ are independently and identically distributed (*i.i.d.*).
The dice example uses this most simple sampling scheme. That is why it is called *simple random sampling*. This concept is summarized in Key Concept 2.5.
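A sketch illustrating the idea (the sample size of $10000$ is an arbitrary choice): the relative frequencies in a large i.i.d. sample are close to the common distribution of the $Y_i$, namely $1/6$ for each outcome.

```{r, eval = FALSE}
# relative frequencies of 10000 i.i.d. dice rolls
table(sample(1:6, size = 10000, replace = TRUE)) / 10000
```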

```{r, eval = my_output == "html", results='asis', echo=FALSE, purl=FALSE}
@@ -1204,7 +1208,7 @@ In `r ttcode("R")`, realize this as follows:

3. Next, we combine two `r ttcode("for()")` loops to simulate the data and plot the distributions. The inner loop generates $10000$ random samples, each consisting of `r ttcode("n")` observations that are drawn from the Bernoulli distribution, and computes the standardized averages. The outer loop executes the inner loop for the different sample sizes `r ttcode("n")` and produces a plot for each iteration.

```{r, echo = T, eval = T, message = F, warning = F, cache=T, fig.align='center'}
```{r, echo = T, eval = T, message = F, warning = F, cache=T, fig.align='center',fig.width=10, fig.height=10}
# subdivide the plot panel into a 2-by-2 array
par(mfrow = c(2, 2))
12 changes: 6 additions & 6 deletions 03-ch3.Rmd
@@ -234,7 +234,7 @@ First, *all* sampling distributions (represented by the solid lines) are centere

Next, have a look at the spread of the sampling distributions. Several things are noteworthy:

- The sampling distribution of $Y_1$ (green curve) tracks the density of the $\mathcal{N}(10,1)$ distribution (black dashed line) pretty closely. In fact, the sampling distribution of $Y_1$ is the $\mathcal{N}(10,1)$ distribution. This is less surprising if you keep in mind that the $Y_1$ estimator does nothing but reporting an observation that is randomly selected from a population with $\mathcal{N}(10,1)$ distribution. Hence, $Y_1 \sim \mathcal{N}(10,1)$. Note that this result does not depend on the sample size $n$: the sampling distribution of $Y_1$ *is always* the population distribution, no matter how large the sample is. $Y_1$ is a good a estimate of $\mu_Y$, but we can do better.
- The sampling distribution of $Y_1$ (green curve) tracks the density of the $\mathcal{N}(10,1)$ distribution (black dashed line) pretty closely. In fact, the sampling distribution of $Y_1$ is the $\mathcal{N}(10,1)$ distribution. This is less surprising if you keep in mind that the $Y_1$ estimator does nothing but reporting an observation that is randomly selected from a population with $\mathcal{N}(10,1)$ distribution. Hence, $Y_1 \sim \mathcal{N}(10,1)$. Note that this result does not depend on the sample size $n$: the sampling distribution of $Y_1$ *is always* the population distribution, no matter how large the sample is. $Y_1$ is a good estimate of $\mu_Y$, but we can do better.

- Both sampling distributions of $\overline{Y}$ show less dispersion than the sampling distribution of $Y_1$. This means that $\overline{Y}$ has a lower variance than $Y_1$. In view of Key Concepts 3.2 and 3.3, we find that $\overline{Y}$ is a more efficient estimator than $Y_1$. In fact, this holds for all $n>1$.

Expand Down Expand Up @@ -427,9 +427,9 @@ curve(dnorm(x),
axis(1,
at = c(-1.5, 0, 1.5),
padj = 0.75,
labels = c(expression(-frac(bar(Y)^"act"~-~bar(mu)[Y,0], sigma[bar(Y)])),
labels = c(expression(-frac(bar(Y)^"act"~-~bar(mu)["Y,0"], sigma[bar(Y)])),
0,
expression(frac(bar(Y)^"act"~-~bar(mu)[Y,0], sigma[bar(Y)]))))
expression(frac(bar(Y)^"act"~-~bar(mu)["Y,0"], sigma[bar(Y)]))))
# shade p-value/2 region in left tail
polygon(x = c(-6, seq(-6, -1.5, 0.01), -1.5),
@@ -612,7 +612,7 @@ tstatistic <- (samplemean_act - mean_h0) / SE_samplemean
tstatistic
```

Using `r ttcode("R")` we can illustrate that if $\mu_{Y,0}$ equals the true value, that is, if the null hypothesis is true, \@ref(eq:tstat) is approximately $\mathcal{N}(0,1)$ distributed when $n$ is large.
Using `r ttcode("R")` we can illustrate that if $\mu_{Y,0}$ equal the true value, that is, if the null hypothesis is true, \@ref(eq:tstat) is approximately $\mathcal{N}(0,1)$ distributed when $n$ is large.

```{r}
# prepare empty vector for t-statistics
@@ -929,7 +929,7 @@ $$ p\text{-value} = 2.2\cdot 10^{-16} \ll 0.05. $$

## Comparing Means from Different Populations {#cmfdp}

Suppose you are interested in the means of two different populations, denote them $\mu_1$ and $\mu_2$. More specifically, you are interested whether these population means are different from each other and plan to use a hypothesis test to verify this on the basis of independent sample data from both populations. A suitable pair of hypotheses is
Suppose you are interested in the means of two different populations, denoted $\mu_1$ and $\mu_2$. More specifically, you are interested in whether these population means are different from each other and plan to use a hypothesis test to verify this on the basis of independent sample data from both populations. A suitable pair of hypotheses is

\begin{equation}
H_0: \mu_1 - \mu_2 = d_0 \ \ \text{vs.} \ \ H_1: \mu_1 - \mu_2 \neq d_0 (\#eq:hypmeans)
@@ -1107,7 +1107,7 @@ The estimates indicate that $X$ and $Y$ are moderately correlated.

The next code chunk uses the function `r ttcode("mvrnorm()")` from the package `r ttcode("MASS")` [@R-MASS] to generate bivariate sample data with different degrees of correlation.
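For readers unfamiliar with `r ttcode("mvrnorm()")`, here is a minimal sketch of a single call; the sample size, means and correlation are illustrative assumptions.

```{r, eval = FALSE}
library(MASS)
# 100 draws from a bivariate normal with unit variances and correlation 0.75
Sigma <- matrix(c(1, 0.75, 0.75, 1), ncol = 2)
draws <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma)
cor(draws[, 1], draws[, 2])  # sample correlation, roughly 0.75
```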

```{r, fig.align='center'}
```{r, fig.align='center',fig.width=8, fig.height=8}
library(MASS)
# set random seed
44 changes: 20 additions & 24 deletions 04-ch4.Rmd
@@ -70,7 +70,7 @@ The following code reproduces Figure 4.1 from the textbook.

```{r , echo=TRUE, fig.align='center', cache=TRUE}
# create a scatterplot of the data
plot(TestScore ~ STR,ylab="Test Score")
plot(TestScore ~ STR,ylab="Test Score",pch=20)
# add the systematic relationship to the plot
abline(a = 713, b = -3)
@@ -744,7 +744,7 @@ curve(dnorm(x,
-2,
sqrt(var_b0)),
add = T,
col = "darkred")
col = "darkred",lwd=2)
# plot histograms of beta_hat_1
hist(fit[, 2],
@@ -758,7 +758,7 @@ curve(dnorm(x,
3.5,
sqrt(var_b1)),
add = T,
col = "darkred")
col = "darkred",lwd=2)
```


Expand All @@ -770,7 +770,7 @@ A further result implied by Key Concept 4.4 is that both estimators are consiste
Let us look at the distributions of $\beta_1$. The idea here is to add an additional call of `r ttcode("for()")` to the code. This is done in order to loop over the vector of sample sizes `r ttcode("n")`. For each of the sample sizes we carry out the same simulation as before but plot a density estimate for the outcomes of each iteration over `r ttcode("n")`. Notice that we have to change `r ttcode("n")` to `r ttcode("n[j]")` in the inner loop to ensure that the `r ttcode("j")`$^{th}$ element of `r ttcode("n")` is used. In the simulation, we use sample sizes of $100, 250, 1000$ and $3000$. Consequently we have a total of four distinct simulations using different sample sizes.


```{r, fig.align='center', cache=T}
```{r, fig.align='center', cache=T,fig.width=8, fig.height=8}
# set seed for reproducibility
set.seed(1)
@@ -820,24 +820,7 @@ and
$$Cov(X,Y)=4.$$

Formally, this is written down as

\begin{equation}
\begin{pmatrix}
X \\
Y \\
\end{pmatrix}
\overset{i.i.d.}{\sim} \ \mathcal{N}
\left[
\begin{pmatrix}
5 \\
5 \\
\end{pmatrix}, \
\begin{pmatrix}
5 & 4 \\
4 & 5 \\
\end{pmatrix}
\right]. \tag{4.3}
\end{equation}
$$\begin{pmatrix} X \\ Y \end{pmatrix}\overset{i.i.d.}{\sim} \ \mathcal{N}\left[\begin{pmatrix} 5 \\ 5 \end{pmatrix}, \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix} \right].\tag{4.3} $$

To carry out the random sampling, we make use of the function `r ttcode("mvrnorm()")` from the package `r ttcode("MASS")` [@R-MASS], which allows us to draw random samples from multivariate normal distributions, see `?mvrnorm`. Next, we use `r ttcode("subset()")` to split the sample into two subsets such that the first set, `r ttcode("set1")`, consists of observations that fulfill the condition $\lvert X - \overline{X} \rvert > 1$ and the second set, `r ttcode("set2")`, includes the remainder of the sample. We then plot both sets and use different colors to distinguish the observations.
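A minimal sketch of such a draw using the moments stated in (4.3); the sample size of $100$ is an assumption for illustration.

```{r, eval = FALSE}
library(MASS)
# draw from the bivariate normal distribution stated above
bvndata <- mvrnorm(n = 100,
                   mu = c(5, 5),
                   Sigma = matrix(c(5, 4, 4, 5), ncol = 2))
head(bvndata)
```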

@@ -871,6 +854,12 @@ plot(set1,
points(set2,
col = "steelblue",
pch = 19)
legend("topleft",
legend = c("Set1",
"Set2"),
cex = 1,
pch = 19,
col = c("black","steelblue"))
```


@@ -887,8 +876,15 @@ plot(set1, xlab = "X", ylab = "Y", pch = 19)
points(set2, col = "steelblue", pch = 19)
# add both lines to the plot
abline(lm.set1, col = "green")
abline(lm.set2, col = "red")
abline(lm.set1, col = "black",lwd=2)
abline(lm.set2, col = "steelblue",lwd=2)
legend("bottomright",
legend = c("Set1",
"Set2"),
cex = 1,
lwd=2,
col = c("black","steelblue"))
```


11 changes: 5 additions & 6 deletions 05-ch5.Rmd
@@ -1,4 +1,4 @@
# Hypothesis Tests and Confidence Intervals in the Simple Linear Regression Model {#htaciitslrm}
# Hypothesis Tests and Confidence Intervals in the SLR Model {#htaciitslrm}

This chapter continues our treatment of the simple linear regression model. The following subsections discuss how we may use our knowledge about the sampling distribution of the OLS estimator in order to make statements regarding its uncertainty.

@@ -1312,17 +1312,17 @@ if (my_output=="html") {
In the simple regression model, the covariance matrix of the coefficient estimators is denoted
\\begin{equation}
\\text{Var}
$$\\text{Var}
\\begin{pmatrix}
\\hat\\beta_0 \\
\\hat\\beta_1
\\end{pmatrix} =
\\begin{pmatrix}
\\text{Var}(\\hat\\beta_0) & \\text{Cov}(\\hat\\beta_0,\\hat\\beta_1) \\\\
\\text{Cov}(\\hat\\beta_0,\\hat\\beta_1) & \\text{Var}(\\hat\\beta_1)
\\end{pmatrix}
\\end{equation}
\\end{pmatrix}$$
The function <tt>vcovHC</tt> can be used to obtain estimates of this matrix for a model object of interest.
@@ -1448,4 +1448,3 @@ The function <tt>DGP_OLS()</tt> and the estimated variance <tt>est_var_OLS</tt>
</div>')}
```
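To illustrate the remark on <tt>vcovHC</tt> above, a minimal sketch; the model object `linear_model` and the use of the `r ttcode("sandwich")` package are assumptions for illustration.

```{r, eval = FALSE}
library(sandwich)
# robust estimate of the covariance matrix of the coefficient estimators
vcovHC(linear_model, type = "HC1")
```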
