Merge pull request #40 from Ocalak/master
Update
mca91 authored Nov 8, 2023
2 parents c61546b + 6347c17 commit 9ee3292
Showing 13 changed files with 102 additions and 83 deletions.
26 changes: 15 additions & 11 deletions 02-ch2.Rmd
@@ -3,9 +3,9 @@
This chapter reviews some basic concepts of probability theory and demonstrates how they can be
applied in `r ttcode("R")`.

Most of the statistical functionalities in base `r ttcode("R")` are collected in the `r ttcode("stats")` package. It provides simple functions which compute descriptive measures and facilitate computations involving a variety of probability distributions. It also contains more sophisticated routines that, enable the user to estimate a large number of models based on the same data or help to conduct extensive simulation studies. `r ttcode("stats")` is part of the base distribution of `r ttcode("R")`, meaning that it is installed by default so there is no need to run `install.packages("stats")` or `library("stats")`. Simply execute `library(help = "stats")` in the console to view the documentation and a complete list of all functions gathered in `r ttcode("stats")`. For most packages a documentation that can be viewed within *RStudio* is available. Documentations can be invoked using the `r ttcode("?")` operator, e.g., upon execution of `?stats` the documentation of the `r ttcode("stats")` package is shown in the help tab of the bottom-right pane.
Most of the statistical functionalities in base `r ttcode("R")` are collected in the `r ttcode("stats")` package. It provides simple functions that compute descriptive measures and facilitate computations involving a variety of probability distributions. It also contains more sophisticated routines that enable the user to estimate a large number of models based on the same data or to conduct extensive simulation studies. `r ttcode("stats")` is part of the base distribution of `r ttcode("R")`, meaning that it is installed by default, so there is no need to run `install.packages("stats")` or `library("stats")`. Simply execute `library(help = "stats")` in the console to view the documentation and a complete list of all functions gathered in `r ttcode("stats")`. For most packages, documentation that can be viewed within *RStudio* is available. Documentation can be invoked using the `r ttcode("?")` operator; for example, upon execution of `?stats` the documentation of the `r ttcode("stats")` package is shown in the help tab of the bottom-right pane.

In what follows, our focus is on (some of) the probability distributions that are handled by `r ttcode("R")` and show how to use the relevant functions to solve simple problems. Thereby, we refresh some core concepts of probability theory. Among other things, you will learn how to draw random numbers, how to compute densities, probabilities, quantiles and alike. As we shall see, it is very convenient to rely on these routines.
In what follows, we focus on (some of) the probability distributions handled by `r ttcode("R")` and show how to use the relevant functions to solve simple problems. Afterwards, we will review some core concepts of probability theory. Among other things, you will learn how to draw random numbers and how to compute densities, probabilities, quantiles and the like. As we shall see, it is very convenient to do these computations in `r ttcode("R")`.
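As a brief preview (a minimal sketch, not part of the chapter text): the distribution functions collected in `r ttcode("stats")` follow a common naming scheme in which one of the prefixes `d`, `p`, `q` and `r` is combined with the name of the distribution.

```{r, eval = FALSE}
# the four prefixes, illustrated for the standard normal distribution
dnorm(0)      # d: density at 0
pnorm(1.96)   # p: cumulative probability P(Z <= 1.96)
qnorm(0.975)  # q: quantile function, the inverse of pnorm()
rnorm(5)      # r: draw 5 random numbers
```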

## Random Variables and Probability Distributions

@@ -32,7 +32,7 @@ events, e.g., 'the observed outcome lies between $2$ and $5$'.
A basic function to draw random samples from a specified set of elements is the function `r ttcode("sample()")`, see `?sample`. We can use it to simulate the random outcome of a dice roll. Let's roll the dice!

```{r, echo = T, eval = T, message = F, warning = F}
sample(1:6, 1)
sample(1:6, size=1)
```
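A small extension of this call (a sketch; the sample size of $10$ is an arbitrary choice): to simulate more than one roll, the draws must be made with replacement via the argument `r ttcode("replace")`.

```{r, eval = FALSE}
# roll the dice 10 times; replace = TRUE allows outcomes to repeat
sample(1:6, size = 10, replace = TRUE)
```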

The probability distribution of a discrete random variable is the list of all possible values of the variable and their probabilities which sum to $1$. The cumulative probability distribution function gives the probability that the random variable is less than or equal to a particular value.
@@ -54,7 +54,8 @@ probability <- rep(1/6, 6)
plot(probability,
xlab = "Outcomes",
ylab="Probability",
main = "Probability Distribution")
main = "Probability Distribution",
pch=20)
```
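As a quick numerical check (a sketch that reuses the vector `r ttcode("probability")` defined above): the probabilities sum to $1$, and the probability of rolling at most a $3$ is the sum of the first three entries.

```{r, eval = FALSE}
# the probabilities sum to one
sum(probability)
# P(outcome <= 3)
sum(probability[1:3])
```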

For the cumulative probability distribution we need the cumulative probabilities, i.e., we need the cumulative sums of the vector `r ttcode("probability")`. These sums can be computed using `r ttcode("cumsum()")`.
@@ -67,7 +68,8 @@ cum_probability <- cumsum(probability)
plot(cum_probability,
xlab = "Outcomes",
ylab="Cumulative Probability",
main = "Cumulative Probability Distribution")
main = "Cumulative Probability Distribution",
pch=20)
```

### Bernoulli Trials {-}
@@ -143,7 +145,8 @@ probability <- dbinom(x = k,
plot(x = k,
y = probability,
ylab="Probability",
main = "Probability Distribution Function")
main = "Probability Distribution Function",
pch=20)
```

In a similar fashion we may plot the cumulative distribution function of $k$ by
@@ -159,7 +162,8 @@ prob <- pbinom(q = k,
plot(x = k,
y = prob,
ylab="Probability",
main = "Cumulative Distribution Function")
main = "Cumulative Distribution Function",
pch=20)
```
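Differences of `r ttcode("pbinom()")` values yield probabilities of intervals. The following sketch assumes the parameters `size = 10` and `prob = 0.5`; these values are illustrative and not taken from the chunk above.

```{r, eval = FALSE}
# P(4 <= k <= 7) for k ~ B(10, 0.5)
pbinom(7, size = 10, prob = 0.5) - pbinom(3, size = 10, prob = 0.5)
```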

### Expected Value, Mean and Variance {-}
@@ -399,7 +403,7 @@ g <- function(x) x * f(x)
h <- function(x) x^2 * f(x)
```

Next, we use `r ttcode("integrate()")` and set lower and upper limits of integration to $1$ and $\infty$ using arguments `r ttcode("lower")` and `r ttcode("upper")`. By default, `r ttcode("integrate()")` prints the result along with an estimate of the approximation error to the console. However, the outcome is not a numeric value one can readily do further calculation with. In order to get only a numeric value of the integral, we need to use the `r ttcode("\\$")` operator in conjunction with `r ttcode("value")`. The `r ttcode("\\$")` operator is used to extract elements by name from an object of type `r ttcode("list")`.
Next, we use `r ttcode("integrate()")` and set lower and upper limits of integration to $1$ and $\infty$ using arguments `r ttcode("lower")` and `r ttcode("upper")`. By default, `r ttcode("integrate()")` prints the result along with an estimate of the approximation error to the console. However, the outcome is not a numeric value one can readily do further calculation with. In order to get only a numeric value of the integral, we need to use the `r ttcode("$")` operator in conjunction with `r ttcode("value")`. The `r ttcode("$")` operator is used to extract elements by name from an object of type `r ttcode("list")`.

```{r, echo = T, eval = T, message = F, warning = F}
# compute area under the density curve
Expand Down Expand Up @@ -438,7 +442,7 @@ Thus, for the normal distribution we have the `r ttcode("R")` functions `r ttcod

### The Normal Distribution {-}

The probably most important probability distribution considered here is the normal
Perhaps the most important probability distribution considered here is the normal
distribution. This is not least due to the special role of the standard normal distribution and the Central Limit Theorem which is to be treated shortly. Normal distributions are symmetric and bell-shaped. A normal distribution is characterized by its mean $\mu$ and its standard deviation $\sigma$, concisely expressed by
$\mathcal{N}(\mu,\sigma^2)$. The normal distribution has the PDF

@@ -807,7 +811,7 @@ To clarify the basic idea of random sampling, let us jump back to the dice rolli

Suppose we are rolling the dice $n$ times. This means we are interested in the outcomes of the random draws $Y_i, \ i=1,...,n$, which are characterized by the same distribution. Since these outcomes are selected randomly, they are *random variables* themselves and their realizations will differ each time we draw a sample, i.e., each time we roll the dice $n$ times. Furthermore, each observation is randomly drawn from the same population, that is, the numbers from $1$ to $6$, and their individual distribution is the same. Hence $Y_1,\dots,Y_n$ are identically distributed.

Moreover, we know that the value of any of the $Y_i$ does not provide any information on the remainder of the outcomes In our example, rolling a six as the first observation in our sample does not alter the distributions of $Y_2,\dots,Y_n$: all numbers are equally likely to occur. This means that all $Y_i$ are also independently distributed. Thus $Y_1,\dots,Y_n$ are independently and identically distributed (*i.i.d.*).
Moreover, we know that the value of any of the $Y_i$ does not provide any information on the remainder of the outcomes. In our example, rolling a six as the first observation in our sample does not alter the distributions of $Y_2,\dots,Y_n$: all numbers are equally likely to occur. This means that all $Y_i$ are also independently distributed. Thus $Y_1,\dots,Y_n$ are independently and identically distributed (*i.i.d.*).
The dice example uses this most simple sampling scheme. That is why it is called *simple random sampling*. This concept is summarized in Key Concept 2.5.
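A sketch illustrating the idea (the sample size of $10000$ is an arbitrary choice): the relative frequencies in a large i.i.d. sample are close to the common distribution of the $Y_i$, namely $1/6$ for each outcome.

```{r, eval = FALSE}
# relative frequencies of 10000 i.i.d. dice rolls
table(sample(1:6, size = 10000, replace = TRUE)) / 10000
```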

```{r, eval = my_output == "html", results='asis', echo=FALSE, purl=FALSE}
@@ -1204,7 +1208,7 @@ In `r ttcode("R")`, realize this as follows:

3. Next, we combine two `r ttcode("for()")` loops to simulate the data and plot the distributions. The inner loop generates $10000$ random samples, each consisting of `r ttcode("n")` observations that are drawn from the Bernoulli distribution, and computes the standardized averages. The outer loop executes the inner loop for the different sample sizes `r ttcode("n")` and produces a plot for each iteration.

```{r, echo = T, eval = T, message = F, warning = F, cache=T, fig.align='center'}
```{r, echo = T, eval = T, message = F, warning = F, cache=T, fig.align='center',fig.width=10, fig.height=10}
# subdivide the plot panel into a 2-by-2 array
par(mfrow = c(2, 2))
12 changes: 6 additions & 6 deletions 03-ch3.Rmd
@@ -234,7 +234,7 @@ First, *all* sampling distributions (represented by the solid lines) are centere

Next, have a look at the spread of the sampling distributions. Several things are noteworthy:

- The sampling distribution of $Y_1$ (green curve) tracks the density of the $\mathcal{N}(10,1)$ distribution (black dashed line) pretty closely. In fact, the sampling distribution of $Y_1$ is the $\mathcal{N}(10,1)$ distribution. This is less surprising if you keep in mind that the $Y_1$ estimator does nothing but reporting an observation that is randomly selected from a population with $\mathcal{N}(10,1)$ distribution. Hence, $Y_1 \sim \mathcal{N}(10,1)$. Note that this result does not depend on the sample size $n$: the sampling distribution of $Y_1$ *is always* the population distribution, no matter how large the sample is. $Y_1$ is a good a estimate of $\mu_Y$, but we can do better.
- The sampling distribution of $Y_1$ (green curve) tracks the density of the $\mathcal{N}(10,1)$ distribution (black dashed line) pretty closely. In fact, the sampling distribution of $Y_1$ is the $\mathcal{N}(10,1)$ distribution. This is less surprising if you keep in mind that the $Y_1$ estimator does nothing but reporting an observation that is randomly selected from a population with $\mathcal{N}(10,1)$ distribution. Hence, $Y_1 \sim \mathcal{N}(10,1)$. Note that this result does not depend on the sample size $n$: the sampling distribution of $Y_1$ *is always* the population distribution, no matter how large the sample is. $Y_1$ is a good estimate of $\mu_Y$, but we can do better.

- Both sampling distributions of $\overline{Y}$ show less dispersion than the sampling distribution of $Y_1$. This means that $\overline{Y}$ has a lower variance than $Y_1$. In view of Key Concepts 3.2 and 3.3, we find that $\overline{Y}$ is a more efficient estimator than $Y_1$. In fact, this holds for all $n>1$.

Expand Down Expand Up @@ -427,9 +427,9 @@ curve(dnorm(x),
axis(1,
at = c(-1.5, 0, 1.5),
padj = 0.75,
labels = c(expression(-frac(bar(Y)^"act"~-~bar(mu)[Y,0], sigma[bar(Y)])),
labels = c(expression(-frac(bar(Y)^"act"~-~bar(mu)["Y,0"], sigma[bar(Y)])),
0,
expression(frac(bar(Y)^"act"~-~bar(mu)[Y,0], sigma[bar(Y)]))))
expression(frac(bar(Y)^"act"~-~bar(mu)["Y,0"], sigma[bar(Y)]))))
# shade p-value/2 region in left tail
polygon(x = c(-6, seq(-6, -1.5, 0.01), -1.5),
@@ -612,7 +612,7 @@ tstatistic <- (samplemean_act - mean_h0) / SE_samplemean
tstatistic
```

Using `r ttcode("R")` we can illustrate that if $\mu_{Y,0}$ equals the true value, that is, if the null hypothesis is true, \@ref(eq:tstat) is approximately $\mathcal{N}(0,1)$ distributed when $n$ is large.
Using `r ttcode("R")` we can illustrate that if $\mu_{Y,0}$ equal the true value, that is, if the null hypothesis is true, \@ref(eq:tstat) is approximately $\mathcal{N}(0,1)$ distributed when $n$ is large.

```{r}
# prepare empty vector for t-statistics
@@ -929,7 +929,7 @@ $$ p\text{-value} = 2.2\cdot 10^{-16} \ll 0.05. $$

## Comparing Means from Different Populations {#cmfdp}

Suppose you are interested in the means of two different populations, denote them $\mu_1$ and $\mu_2$. More specifically, you are interested whether these population means are different from each other and plan to use a hypothesis test to verify this on the basis of independent sample data from both populations. A suitable pair of hypotheses is
Suppose you are interested in the means of two different populations, denoted $\mu_1$ and $\mu_2$. More specifically, you are interested in whether these population means are different from each other and plan to use a hypothesis test to verify this on the basis of independent sample data from both populations. A suitable pair of hypotheses is

\begin{equation}
H_0: \mu_1 - \mu_2 = d_0 \ \ \text{vs.} \ \ H_1: \mu_1 - \mu_2 \neq d_0 (\#eq:hypmeans)
@@ -1107,7 +1107,7 @@ The estimates indicate that $X$ and $Y$ are moderately correlated.

The next code chunk uses the function `r ttcode("mvrnorm()")` from the package `r ttcode("MASS")` [@R-MASS] to generate bivariate sample data with different degrees of correlation.
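For readers unfamiliar with `r ttcode("mvrnorm()")`, here is a minimal sketch of a single call; the sample size, means and correlation are illustrative assumptions.

```{r, eval = FALSE}
library(MASS)
# 100 draws from a bivariate normal with unit variances and correlation 0.75
Sigma <- matrix(c(1, 0.75, 0.75, 1), ncol = 2)
draws <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma)
cor(draws[, 1], draws[, 2])  # sample correlation, roughly 0.75
```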

```{r, fig.align='center'}
```{r, fig.align='center',fig.width=8, fig.height=8}
library(MASS)
# set random seed
44 changes: 20 additions & 24 deletions 04-ch4.Rmd
@@ -70,7 +70,7 @@ The following code reproduces Figure 4.1 from the textbook.

```{r , echo=TRUE, fig.align='center', cache=TRUE}
# create a scatterplot of the data
plot(TestScore ~ STR,ylab="Test Score")
plot(TestScore ~ STR,ylab="Test Score",pch=20)
# add the systematic relationship to the plot
abline(a = 713, b = -3)
@@ -744,7 +744,7 @@ curve(dnorm(x,
-2,
sqrt(var_b0)),
add = T,
col = "darkred")
col = "darkred",lwd=2)
# plot histograms of beta_hat_1
hist(fit[, 2],
@@ -758,7 +758,7 @@ curve(dnorm(x,
3.5,
sqrt(var_b1)),
add = T,
col = "darkred")
col = "darkred",lwd=2)
```


Expand All @@ -770,7 +770,7 @@ A further result implied by Key Concept 4.4 is that both estimators are consiste
Let us look at the distributions of $\beta_1$. The idea here is to add an additional call of `r ttcode("for()")` to the code. This is done in order to loop over the vector of sample sizes `r ttcode("n")`. For each of the sample sizes we carry out the same simulation as before but plot a density estimate for the outcomes of each iteration over `r ttcode("n")`. Notice that we have to change `r ttcode("n")` to `r ttcode("n[j]")` in the inner loop to ensure that the `r ttcode("j")`$^{th}$ element of `r ttcode("n")` is used. In the simulation, we use sample sizes of $100, 250, 1000$ and $3000$. Consequently we have a total of four distinct simulations using different sample sizes.


```{r, fig.align='center', cache=T}
```{r, fig.align='center', cache=T,fig.width=8, fig.height=8}
# set seed for reproducibility
set.seed(1)
@@ -820,24 +820,7 @@ and
$$Cov(X,Y)=4.$$

Formally, this is written down as

\begin{equation}
\begin{pmatrix}
X \\
Y \\
\end{pmatrix}
\overset{i.i.d.}{\sim} \ \mathcal{N}
\left[
\begin{pmatrix}
5 \\
5 \\
\end{pmatrix}, \
\begin{pmatrix}
5 & 4 \\
4 & 5 \\
\end{pmatrix}
\right]. \tag{4.3}
\end{equation}
$$\begin{pmatrix} X \\ Y \end{pmatrix}\overset{i.i.d.}{\sim} \ \mathcal{N}\left[\begin{pmatrix} 5 \\ 5 \end{pmatrix}, \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix} \right].\tag{4.3} $$

To carry out the random sampling, we make use of the function `r ttcode("mvrnorm()")` from the package `r ttcode("MASS")` [@R-MASS], which allows us to draw random samples from multivariate normal distributions, see `?mvrnorm`. Next, we use `r ttcode("subset()")` to split the sample into two subsets such that the first set, `r ttcode("set1")`, consists of observations that fulfill the condition $\lvert X - \overline{X} \rvert > 1$ and the second set, `r ttcode("set2")`, includes the remainder of the sample. We then plot both sets and use different colors to distinguish the observations.
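A minimal sketch of such a draw using the moments stated in (4.3); the sample size of $100$ is an assumption for illustration.

```{r, eval = FALSE}
library(MASS)
# draw from the bivariate normal distribution stated above
bvndata <- mvrnorm(n = 100,
                   mu = c(5, 5),
                   Sigma = matrix(c(5, 4, 4, 5), ncol = 2))
head(bvndata)
```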

@@ -871,6 +854,12 @@ plot(set1,
points(set2,
col = "steelblue",
pch = 19)
legend("topleft",
legend = c("Set1",
"Set2"),
cex = 1,
pch = 19,
col = c("black","steelblue"))
```


@@ -887,8 +876,15 @@ plot(set1, xlab = "X", ylab = "Y", pch = 19)
points(set2, col = "steelblue", pch = 19)
# add both lines to the plot
abline(lm.set1, col = "green")
abline(lm.set2, col = "red")
abline(lm.set1, col = "black",lwd=2)
abline(lm.set2, col = "steelblue",lwd=2)
legend("bottomright",
legend = c("Set1",
"Set2"),
cex = 1,
lwd=2,
col = c("black","steelblue"))
```


11 changes: 5 additions & 6 deletions 05-ch5.Rmd
@@ -1,4 +1,4 @@
# Hypothesis Tests and Confidence Intervals in the Simple Linear Regression Model {#htaciitslrm}
# Hypothesis Tests and Confidence Intervals in the SLR Model {#htaciitslrm}

This chapter continues our treatment of the simple linear regression model. The following subsections discuss how we may use our knowledge about the sampling distribution of the OLS estimator in order to make statements regarding its uncertainty.

@@ -1312,17 +1312,17 @@ if (my_output=="html") {
In the simple regression model, the covariance matrix of the coefficient estimators is denoted
\\begin{equation}
\\text{Var}
$$\\text{Var}
\\begin{pmatrix}
\\hat\\beta_0 \\
\\hat\\beta_1
\\end{pmatrix} =
\\begin{pmatrix}
\\text{Var}(\\hat\\beta_0) & \\text{Cov}(\\hat\\beta_0,\\hat\\beta_1) \\\\
\\text{Cov}(\\hat\\beta_0,\\hat\\beta_1) & \\text{Var}(\\hat\\beta_1)
\\end{pmatrix}
\\end{equation}
\\end{pmatrix}$$
The function <tt>vcovHC</tt> can be used to obtain estimates of this matrix for a model object of interest.
@@ -1448,4 +1448,3 @@ The function <tt>DGP_OLS()</tt> and the estimated variance <tt>est_var_OLS</tt>
</div>')}
```
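To illustrate the remark on <tt>vcovHC</tt> above, a minimal sketch; the model object `linear_model` and the use of the `r ttcode("sandwich")` package are assumptions for illustration.

```{r, eval = FALSE}
library(sandwich)
# robust estimate of the covariance matrix of the coefficient estimators
vcovHC(linear_model, type = "HC1")
```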
