generated from rstudio/bookdown-demo
# (PART) Programming Styles {-}
# An Introduction to Object-Oriented Programming
**Object-Oriented Programming (OOP)**\index{object-oriented programming} is a way of thinking about how to organize programs. This way of thinking focuses on objects. In the next chapter, we focus on organizing programs by functions, but for now we stick to objects. We already know about objects from the last chapter, so what's new here?
The difference is that we're creating our own *types* now. In the last chapter we learned about built-in types: floating point numbers, lists, arrays, functions, etc. Now we will discuss broadly how you can create your own types in both R and Python. These user-defined types can be used as cookie cutters. Once we have the cookie cutter, we can make as many cookies as we want!
We will not go into this too deeply, but it is important to know how code works so that we can use it more effectively. For instance, in Python, we frequently write code like `my_data_frame.doSomething()`. The material in this chapter will go a long way to describe how we can make our own types with custom behavior.
------------------------
Here are a few abstract concepts that will help thinking about OOP. They are not mutually exclusive, and they aren't unique to OOP, but understanding these words will help you understand the purpose of OOP. Later on, when we start looking at code examples, I will alert you to when these concepts are coming into play.
- **Composition**\index{object-oriented programming!composition} refers to the situation in which one type of object *contains* an object of another type. For example, a linear model object could hold on to estimated regression coefficients, residuals, etc.
- **Inheritance**\index{object-oriented programming!inheritance} takes place when an object can be considered to be of another type(s). For example, an analysis of variance linear regression model might be a special case of a general linear model.
- **Polymorphism**\index{object-oriented programming!polymorphism} is the idea that the programmer can use the same code on objects of different types. For example, built-in functions in both R and Python can work on arguments of a wide variety of different types.
- **Encapsulation**\index{object-oriented programming!encapsulation} is another word for complexity hiding. Do you have to understand every line of code in a package you're using? No, because a lot of details are purposefully hidden from you.
- **Modularity**\index{object-oriented programming!modularity} is an idea related to encapsulation--it means splitting something into independent pieces. How you split code into different files, different functions, different classes--all of that has to do with modularity. It promotes encapsulation, and it allows you to think about only a few lines of code at a time.
- The **interface**\index{object-oriented programming!interface}, between you and the code you're using, describes *what* can happen, but not *how* it happens. In other words, it describes some functionality so that you can decide whether you want to use it, but there are not enough details for you to make it work yourself. For example, all you have to do to be able to estimate a complicated statistical model is to look up some documentation.^[Just because you can do this, doesn't mean you *should*, though!] In other words, you only need to be familiar with the interface, not the implementation.
- The **implementation**\index{object-oriented programming!implementation} of some code you're using describes *how* it works in detail. If you are a package author, you can change your code's implementation "behind the scenes" and ideally, your end-users would never notice.
## OOP In Python
### Overview
In Python, [classes](https://docs.python.org/3/tutorial/classes.html) are user-defined types\index{object-oriented programming!classes in Python}. When you define your own class, you describe what kind of information it holds onto, and how it behaves.
To define your own type, use the [`class` keyword](https://docs.python.org/3/tutorial/classes.html#class-definition-syntax)\index{object-oriented programming!class keyword in Python}. Objects created with a user-defined class are sometimes called **instances**\index{object-oriented programming!instance}. They behave according to the rules written in the class definition--they always have data and/or functions bundled together in the same way, but these instances do not all have the same data.
To be more clear, classes may have the following two things in their definition.
- **Attributes (aka data members)**\index{object-oriented programming!attributes in Python|see{data member}} are pieces of data "owned" by an instance created by the class.
- **(Instance) methods**\index{object-oriented programming!methods in Python} are functions "owned" by an instance created by the class. They can use and/or modify data belonging to the class.
### A First Example
Here's a simple example. Say we are interested in calculating, from numerical data $x_1, \ldots, x_n$, a sample mean:
\begin{equation}
\bar{x}_n = \frac{\sum_{i=1}^n x_i}{n}.
\end{equation}
In Python, we can usually calculate this one number very easily using `np.average`. However, this function requires that we pass into it all of the data at once. What if we don't have all the data at any given time? In other words, suppose that the data arrive intermittently.
We might consider taking advantage of a recursive formula for the sample means.
\begin{equation}
\bar{x}_n = \frac{(n-1) \bar{x}_{n-1} + x_n}{n}
\end{equation}
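As a quick sanity check (a sketch, not from the text, using hypothetical data), the recursion reproduces the ordinary sample mean:

```{python, collapse = TRUE}
import numpy as np

data = [1.0, 3.0, 2.0, 5.0]
xbar = 0.0
for n, x in enumerate(data, start=1):
    xbar = ((n - 1) * xbar + x) / n  # the recursive update
print(np.isclose(xbar, np.average(data)))  # prints True
```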
How would we program this in Python? A first option: we might create a variable `my_running_ave`, and after every data point arrives, we could
```{python, collapse = TRUE}
my_running_ave = 1.0
my_running_ave
my_running_ave = ((2-1)*my_running_ave + 3.0)/2
my_running_ave
my_running_ave = ((3-1)*my_running_ave + 2.0)/3
my_running_ave
```
There are a few problems with this. Every time we add a data point, the formula changes slightly. Every time we update the average, we have to write a different line of code. This opens up the possibility for more bugs, and it makes your code less likely to be used by other people and more difficult to understand. And what if we were trying to code up something more complicated than a running average? That would make matters even worse.
A second option: write a class that holds onto the running average, and that has
1. an `update` method that updates the running average every time a new data point is received, and
2. a `get_current_xbar` method that gets the most up-to-date information for us.
Using our code would look like this:
```{python, collapse = TRUE, echo=FALSE}
class RunningMean:
    """Updates a running average"""
    def __init__(self):
        self.current_xbar = 0.0
        self.n = 0
    def update(self, new_x):
        self.n += 1
        self.current_xbar *= (self.n-1)
        self.current_xbar += new_x
        self.current_xbar /= self.n
    def get_current_xbar(self):
        if self.n == 0:
            return None
        else:
            return self.current_xbar
```
```{python, collapse = TRUE}
my_ave = RunningMean() # create running average object
my_ave.get_current_xbar() # no data yet!
my_ave.update(1.) # first data point
my_ave.get_current_xbar() # xbar_1
my_ave.update(3.) # second data point
my_ave.get_current_xbar() #xbar_2
my_ave.n # my_ave.n instead of self.n
```
::: {.rmd-details data-latex=""}
There is a Python convention stipulating that class names should be written in `UpperCamelCase` (e.g. `RunningMean`).
:::
That's much better! Notice the *encapsulation*--while looking at this code, we do not need to think about the mathematical formula that is used to process the data. We only need to pass in the correct data. In other words, the *implementation* is separated from the *interface*. The interface, in this case, is just the names of the class methods and the arguments they expect. That's all we need to know about to use this code.
::: {.rmd-caution data-latex=""}
After seeing these new words that are unfamiliar and long, it's tempting to dismiss these new ideas as superfluous. After all, if you are confident that you can get your program working, why stress about all these new concepts? If it ain't broke, don't fix it, right?
I urge you to try to keep an open mind, particularly if you are already confident that you understand the basics of programming in R and Python. The topics in this chapter are more centered around design choices. This material won't help you write a first draft of a script even faster, but it will make your code much better. Even though you will have to slow down a bit before you start typing, thinking about your program more deeply will prevent bugs and allow more people to use your code.
:::
Classes (obviously) need to be defined before they are used, so here is the definition of our class.
```{python, collapse = TRUE}
class RunningMean:
    """Updates a running average"""
    def __init__(self):
        self.current_xbar = 0.0
        self.n = 0
    def update(self, new_x):
        self.n += 1
        self.current_xbar *= (self.n-1)
        self.current_xbar += new_x
        self.current_xbar /= self.n
    def get_current_xbar(self):
        if self.n == 0:
            return None
        else:
            return self.current_xbar
```
::: {.rmd-details data-latex=""}
Methods that look like `__init__`, or that possess names that begin and end with two underscores, are called **dunder (double underscore) methods**, **special methods** or **magic methods**. There are many that you can take advantage of! For more information see [this](https://docs.python.org/3/reference/datamodel.html#special-method-names).
:::
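Dunder methods come in handy in user-defined classes, too. For example, we could give `RunningMean` a friendlier printed representation by adding a `__repr__` method (a sketch; this `__repr__` is my addition, not part of the original class):

```{python, collapse = TRUE}
class RunningMean:
    """Updates a running average"""
    def __init__(self):
        self.current_xbar = 0.0
        self.n = 0
    def update(self, new_x):
        self.n += 1
        self.current_xbar *= (self.n - 1)
        self.current_xbar += new_x
        self.current_xbar /= self.n
    def __repr__(self):
        # called automatically by print() and the REPL
        return f"RunningMean(n={self.n}, xbar={self.current_xbar})"

rm = RunningMean()
rm.update(2.0)
print(rm)  # RunningMean(n=1, xbar=2.0)
```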
Here are the details of the class definition:
1. Defining class methods looks exactly like defining functions! The primary difference is that the first argument must be `self`. If the definition of a method refers to `self`, then this allows the class instance to refer to its own (heretofore undefined) data attributes. Also, these method definitions are indented inside the definition of the class.
2. This class owns two data attributes. One to represent the number of data points seen up to now (`n`), and another to represent the current running average (`current_xbar`).
3. Referring to data members requires dot notation. `self.n` refers to the `n` belonging to any instance. This data attribute is free to vary between all the objects instantiated by this class.
4. The `__init__` method performs the setup operations that are performed every time any object is instantiated.
5. The `update` method provides the core functionality using the recursive formula displayed above.
6. `get_current_xbar` simply returns the current average. In the case that this function is called before any data has been seen, it returns `None`.
A few things you might find interesting:
i. Computationally, there is never any requirement that we must hold *all* of the data points in memory. Our data set could be infinitely large, and our class will hold onto only one floating point number, and one integer.
ii. This example is generalizable to other statistical methods. In a mathematical statistics course, you will learn about a large class of models having *sufficient statistics*. Most sufficient statistics have recursive formulas like the one above. Second, many algorithms in *time series analysis* have recursive formulas and are often needed to analyze large streams of data. They can all be wrapped into a class in a way that is similar to the above example.
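To illustrate point (ii) with an example that is not from the text, a running *variance* can be maintained with the same one-pass pattern, using Welford's recursive updates:

```{python, collapse = TRUE}
import numpy as np

class RunningVar:
    """Maintains a running mean and variance (Welford's algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    def get_var(self):
        # (n-1)-denominator sample variance; None until two points arrive
        return self.m2 / (self.n - 1) if self.n > 1 else None

rv = RunningVar()
for x in [1.0, 3.0, 2.0, 5.0]:
    rv.update(x)
print(np.isclose(rv.get_var(), np.var([1.0, 3.0, 2.0, 5.0], ddof=1)))  # True
```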
### Adding Inheritance
How can we use inheritance in statistical programming? A primary benefit of inheritance is code re-use, so one example of inheritance is writing a generic algorithm as a base class, and a specific algorithm as a class that inherits from the base class. For example, we could re-use the code in the `RunningMean` class in a variety of other classes.
Let's make some assumptions about a *parametric model* that is generating our data. Suppose I assume that the data points $x_1, \ldots, x_n$ are a "random sample"^[Otherwise known as an independent and identically distributed sample.] from a normal distribution with mean $\mu$ and variance $\sigma^2=1$. $\mu$ is assumed to be unknown (this is, after all, an interval for $\mu$), and $\sigma^2$ is assumed to be known, for simplicity.
A $95\%$ confidence interval for the true unknown population mean $\mu$ is
\begin{equation}
\left( \bar{x} - 1.96 \sqrt{\frac{\sigma^2}{n}}, \bar{x} + 1.96 \sqrt{\frac{\sigma^2}{n}} \right).
\end{equation}
The width of the interval shrinks as we get more data (as $n \to \infty$). We can write another class that, not only calculates the center of this interval, $\bar{x}$, but also returns the interval endpoints.
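As a quick numerical illustration (with hypothetical data, not from the text), the interval formula can be evaluated directly:

```{python, collapse = TRUE}
import numpy as np

x = np.array([1.2, 0.7, 1.9, 0.4, 1.1])  # hypothetical data
n = x.size
xbar = np.average(x)
half_width = 1.96 * np.sqrt(1.0 / n)  # known sigma^2 = 1
interval = (xbar - half_width, xbar + half_width)
print(interval)  # an interval centered at xbar
```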
If we wrote another class from scratch, then we would need to rewrite a lot of the code that we already have in the definition of `RunningMean`. Instead, we'll use the idea of [*inheritance*](https://docs.python.org/3/tutorial/classes.html#inheritance).
```{python, collapse = TRUE}
import numpy as np

class RunningCI(RunningMean):  # <- notice what's inside the parentheses
    """Updates a running average and
    gives you a known-variance confidence interval"""
    def __init__(self, known_var):
        super().__init__()
        self.known_var = known_var
    def get_current_interval(self):
        if self.n == 0:
            return None
        else:
            half_width = 1.96 * np.sqrt(self.known_var / self.n)
            left_num = self.current_xbar - half_width
            right_num = self.current_xbar + half_width
            return np.array([left_num, right_num])
```
The parentheses in the first line of the class definition signal that this new class definition is inheriting from `RunningMean`. Inside the definition of this new class, when I refer to `self.current_xbar` Python knows what I'm referring to because it is defined in the base class. Last, I am using `super()` to access the base class's methods, such as `__init__`.
```{python, collapse = TRUE}
my_ci = RunningCI(1) # create running average object
my_ci.get_current_xbar() # no data yet!
my_ci.update(1.)
my_ci.get_current_interval()
my_ci.update(3.)
my_ci.get_current_interval()
```
This example also demonstrates **polymorphism.** Polymorphism comes from the Greek for "many forms." "Forms" means "type" or "class" in this case. If the same code (usually a function or method) works on objects of different types, that's polymorphic. Here, the `update` method worked on an object of class `RunningCI`, as well as an object of `RunningMean`.
Why is this useful? Consider this example.
```{python, collapse = TRUE, eval = FALSE}
for datum in time_series:
    for thing in obj_list:
        thing.update(datum)
```
Inside the inner `for` loop, there is no need to include conditional logic that tests what type each `thing` is. Compare that succinct loop with the following, which needs a separate branch for every class.
```{python, collapse = TRUE, eval = FALSE}
for datum in time_series:
    for thing in obj_list:
        if isinstance(thing, class1):
            thing.updatec1(datum)
        elif isinstance(thing, class2):
            thing.updatec2(datum)
        elif isinstance(thing, class3):
            thing.updatec3(datum)
        elif isinstance(thing, class4):
            thing.updatec4(datum)
        elif isinstance(thing, class5):
            thing.updatec5(datum)
        elif isinstance(thing, class6):
            thing.updatec6(datum)
```
If, in the future, you add a new class called `class7`, then you need to change this inner `for` loop, as well as provide new code for the class.
### Adding in Composition
*Composition* also enables code re-use. Inheritance ensures an "is a" relationship between base and derived classes, and composition promotes a "has a" relationship. Sometimes it can be tricky to decide which technique to use, especially when it comes to statistical programming.
Regarding the example above, you might argue that a confidence interval isn't a particular type of a sample mean. Rather, it only *has a* sample mean. If you believe this, then you might opt for a composition-based model instead. With composition, the derived class (the confidence interval class) will be decoupled from the base class (the sample mean class). This decoupling will have a few implications. In general, composition is more flexible, but can lead to longer, uglier code.
1. You will lose polymorphism.
2. Your code might become less re-usable.
- You have to write any derived class methods you want, because you don't inherit any from the base class. For example, you won't automatically get the `.update()` or the `.get_current_xbar()` method for free. This can be tedious if there are a lot of methods that should work the same exact way in both classes--you would have to re-write a bunch of method definitions.
- On the other hand, this could be good if you have methods that behave completely differently. Each method you write can have totally different behavior in the derived class, even if the method names are the same in both classes. For instance, `.update()` could mean something totally different in these two classes. Also, in the derived class, you can still call the base class's `.update()` method.
3. Many-to-one relationships are easier. It's generally easier to "own" many base class instances rather than inherit from many base classes at once. This is especially true if this is the only book on programming you plan on reading--I completely avoid the topic of multiple inheritance!
::: {.rmd-caution data-latex=""}
Sometimes it is very difficult to choose between using composition or using inheritance. However, this choice should be made very carefully. If you make the wrong one, and realize too late, *refactoring* your code might be very time consuming!
:::
Here is an example implementation of a confidence interval using composition. Notice that this class "owns" a `RunningMean` instance called `self.mean`. This is in contrast with *inheriting* from the `RunningMean` class.
```{python, collapse = TRUE}
class RunningCI2:
    """Updates a running average and
    gives you a known-variance confidence interval"""
    def __init__(self, known_var):
        self.mean = RunningMean()
        self.known_var = known_var
    def update(self, new_x):
        self.mean.update(new_x)
    def get_current_interval(self):
        if self.mean.n == 0:
            return None
        else:
            half_width = 1.96 * np.sqrt(self.known_var / self.mean.n)
            left = self.mean.get_current_xbar() - half_width
            right = self.mean.get_current_xbar() + half_width
            return np.array([left, right])
```
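From the outside, using the composition-based class looks the same as before. Here is a usage sketch (the two class definitions are repeated, in minimal form, so the block is self-contained; known variance 1 is assumed):

```{python, collapse = TRUE}
import numpy as np

class RunningMean:
    def __init__(self):
        self.current_xbar = 0.0
        self.n = 0
    def update(self, new_x):
        self.n += 1
        self.current_xbar *= (self.n - 1)
        self.current_xbar += new_x
        self.current_xbar /= self.n
    def get_current_xbar(self):
        return self.current_xbar if self.n > 0 else None

class RunningCI2:
    def __init__(self, known_var):
        self.mean = RunningMean()   # "has a" RunningMean
        self.known_var = known_var
    def update(self, new_x):
        self.mean.update(new_x)     # delegate to the owned object
    def get_current_interval(self):
        if self.mean.n == 0:
            return None
        half_width = 1.96 * np.sqrt(self.known_var / self.mean.n)
        xbar = self.mean.get_current_xbar()
        return np.array([xbar - half_width, xbar + half_width])

ci = RunningCI2(1.0)
ci.update(1.0)
ci.update(3.0)
print(ci.get_current_interval())  # interval centered at 2.0
```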
## OOP In R
R, unlike Python, has many different kinds of classes--there is not just one way to make a class. I will discuss
- S3 classes,
- S4 classes,
- Reference classes, and
- R6 classes.
If you like how Python does OOP, you will like reference classes and R6 classes, while S3 and S4 classes will feel strange to you.
It's best to learn about them chronologically, in my opinion. S3 classes came first, S4 classes sought to improve upon those. Reference classes rely on S4 classes, and R6 classes are an improved version of Reference classes [@wickham2014advanced].
### S3 objects: The Big Picture
With S3 (and S4) objects, calling a method\index{object-oriented programming!methods in R} such as `print()` will not look like this.\index{object-oriented programming!S3 classes in R}
```
my_obj.print()
```
Instead, it will look like this:
```
print(my_obj)
```
The primary goal of S3 is *polymorphism* [@grolemund2014hands]. We want functions like `print()`, `summary()` and `plot()` to behave differently when objects of different types are passed into them. Printing a linear model should look a lot different than printing a data frame, right? Because we can write code like the following, we only have to remember a few functions as end-users, and the "right" thing will always happen. If you're writing a package, it's also nice for your users that they're able to use the regular functions that they're familiar with. For instance, I allow users of my package [`cPseudoMaRg`](https://cran.r-project.org/web/packages/cPseudoMaRg/index.html) [@cpm] to call `print()` on objects of type `cpmResults`. In section \@ref(plotting-with-ggplot2), `ggplot2` instances, which are much more complicated than plain `numeric` `vector`s, are `+`ed together.
```{r, eval=FALSE}
# print works on pretty much everything
print(myObj)
print(myObjOfADifferentClass)
print(aThirdClassObject)
```
This works because these "high-level" functions (such as `print()`) look at their input and choose the most appropriate function to call, based on the type of that input. `print()` is the high-level function. When you run some of the above code, it might not be obvious which specific function `print()` chooses for each input--you can't see that happening, yet.
Last, recall that this discussion only applies to S3 objects. Not all objects are S3 objects, though. To find out if an object `x` is an S3 object, use `is.object(x)`.
### Using S3 objects
Using S3 objects is so easy that you probably don't even know that you're actually using them. You can just try to call functions on objects, look at the output, and if you're happy with it, you're good. However, if you've ever asked yourself: "why does `print()` (or another function) do different things all the time?" then this section will be useful to you.
`print()` is a [**generic function**](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Method-dispatching), which is a function that looks at the type of its first argument and then calls another, more specialized function depending on what type that argument is\index{object-oriented programming!generic functions in R}. Not all functions in R are generic, but some are. In addition to `print()`, `summary()` and `plot()` are the most ubiquitous generic functions. Generic functions are an *interface*, because the user does not need to concern themselves with the details going on behind the scenes.
In R, a **method**\index{object-oriented programming!methods in R} is a specialized function that gets chosen by the generic function for a particular type of input. The method is the *implementation*. When the generic function chooses a particular method, this is called **method dispatch**.\index{object-oriented programming!method dispatch in R}
If you look at the definition of a generic function, let's take `plot()` for instance, it has a single line that calls `UseMethod()`.
```{r}
plot
```
`UseMethod()` performs method dispatch. Which methods can be dispatched to? To see that, use the `methods()` function.
```{r}
length(methods(plot))
```
All of these S3 class methods share the same naming convention. Their name has the generic function's name as a prefix, then a dot (`.`), then the name of the class that they are specifically written to be used with.
::: {.rmd-caution data-latex=""}
R's dot-notation is nothing like Python's dot-notation! In R, functions do not *belong* to types/classes like they do in Python!
:::
Method dispatch works by looking at the `class` attribute of an S3 object argument. An object in R may or may not have a set of [**attributes**](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Attributes)\index{object-oriented programming!attributes in R}, which are a collection of name/value pairs that give a particular object extra functionality. Regular `vector`s in R don't have attributes (e.g. try running `attributes(1:3)`), but objects that are "embellished" versions of a `vector` might (e.g. run `attributes(factor(1:3))`). Attributes in R are similar to attributes in Python, but they are usually only used as "tags" to elicit some sort of behavior when the object is passed into a function.
::: {.rmd-caution data-latex=""}
`class()` will return misleading results if it's called on an object that isn't an S3 object. Make sure to check with `is.object()` first.
:::
Also, these methods are not *encapsulated* inside a class definition like they are in Python, either. They just look like loose functions--the method definition for a particular class is not defined inside the class. These class methods can be defined just as ordinary functions, out on their own, in whatever file you think is appropriate to define functions in.
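If it helps to connect this back to Python: the standard library offers a rough analogue of generic functions in `functools.singledispatch`, where specialized implementations are likewise registered outside any class definition (a sketch, not from the text; `summarize` is a hypothetical name):

```{python, collapse = TRUE}
from functools import singledispatch

@singledispatch
def summarize(x):
    # fallback, similar in spirit to summary.default
    return "no summary available"

@summarize.register(list)
def _(x):
    # registered outside any class, like an R method summary.list
    return f"a list of length {len(x)}"

@summarize.register(int)
def _(x):
    return f"the integer {x}"

print(summarize([1, 2, 3]))  # a list of length 3
print(summarize(7))          # the integer 7
print(summarize(2.5))        # no summary available
```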
As an example, let's try to `plot()` some specific objects.
```{r, collapse = TRUE, out.width="80%", fig.cap = "A Scatterplot Matrix", fig.align='center'}
aDF <- data.frame(matrix(rnorm(100), nrow = 10))
is.object(aDF) # is this s3?
class(aDF)
plot(aDF)
```
Because `aDF` has its `class` set to `data.frame`, this causes `plot()` to try to find a `plot.data.frame()` method. If this method was not found, R would attempt to find/use a `plot.default()` method. If no default method existed, an error would be thrown.
As another example, we can play around with objects created with the `ecdf()` function. This function computes an *empirical cumulative distribution function*, which takes a real number as an input, and outputs the proportion of observations that are less than or equal to that input^[It's defined as $\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \le x)$.].
```{r, collapse = TRUE, out.width="80%", fig.cap = "Plotting an Empirical Cumulative Distribution Function", fig.align='center'}
myECDF <- ecdf(rnorm(100))
is.object(myECDF)
class(myECDF)
plot(myECDF)
```
This is how *inheritance* works in S3. The `ecdf` class inherits from the `stepfun` class, which in turn inherits from the `function` class. When you call `plot(myECDF)`, ultimately `plot.ecdf()` is used on this object. However, if `plot.ecdf()` did not exist, `plot.stepfun()` would be tried. S3 inheritance in R is much simpler than Python's inheritance!
### Creating S3 objects
Creating an S3 object is very easy, and is a nice way to spruce up some bundled up object that you're returning from a function, say. All you have to do is tag an object by changing its class attribute. Just assign a character `vector` to it!
Here is an example of creating an object of `CoolClass`.
```{r, collapse = TRUE}
myThing <- 1:3
attributes(myThing)
class(myThing) <- "CoolClass"
attributes(myThing) # also try class(myThing)
```
`myThing` is now an instance of `CoolClass`, even though I never defined what a `CoolClass` was ahead of time! If you're used to Python, this should seem very strange. Compared to Python, this approach is very flexible, but also, kind of dangerous, because you can change the `class`es of existing objects. You shouldn't do that, but you could if you wanted to.
After you start creating your own S3 objects, you can write your own methods associated with these objects. That way, when a user of your code uses typical generic functions, such as `summary()`, on your S3 object, you can control what interesting things will happen to the user of your package. Here's an example.
```{r, collapse = TRUE, echo=FALSE}
# the user does not see this function
summary.CoolClass <- function(object, ...){
  print("No summary available!")
  print("Cool Classes are too cool for summaries!")
  print(":)")
}
```
```{r, collapse = TRUE}
summary(myThing)
```
The `summary()` method I wrote for this class is the following.
```{r, collapse = TRUE, eval=F}
summary.CoolClass <- function(object, ...){
  print("No summary available!")
  print("Cool Classes are too cool for summaries!")
  print(":)")
}
```
When writing this, I kept the signature the same as `summary()`'s.
### S4 objects: The Big Picture
S4 was developed after S3. If you look at your search path (type `search()`), you will see `package:methods`. That's where all the code you need to do S4 is. Here are the similarities and differences between S3 and S4.
- They both use generic functions, and method dispatch works in the same way.
- Unlike in S3, S4 classes allow you to use multiple dispatch, which means the generic function can dispatch on multiple arguments, instead of just the first argument.
- S4 class definitions are strict--they aren't just name tags like in S3.
- S4 inheritance feels more like Python's. You can think of a base class object living inside a child class object.
- S4 classes can have default data members via `prototype`s.
Much more information about S4 classes can be obtained by reading [Chapter 15 in Hadley Wickham's book.](https://adv-r.hadley.nz/s4.html)\index{object-oriented programming!S4 classes in R}
### Using S4 objects
One CRAN package that uses S4 is the [Matrix package](https://cran.r-project.org/web/packages/Matrix/vignettes/Intro2Matrix.pdf). Here is a short and simple code example.
::: {.rmd-details data-latex=""}
S4 objects are also extremely popular in packages hosted on [Bioconductor](https://www.bioconductor.org/). Bioconductor is similar to CRAN, but its software has a much more specific focus on bioinformatics. To download packages from Bioconductor, you can check out the installation instructions provided [here](https://bioconductor.org/help/course-materials/2017/Zurich/S4-classes-and-methods.html).
:::
```{r, collapse = TRUE}
library(Matrix)
M <- Matrix(10 + 1:28, 4, 7)
isS4(M)
M
M@Dim
```
Inside an S4 object, data members are called **slots**, and they are accessed with the `@` operator (instead of the `$` operator). You can test whether an object is S4 with the function `isS4()`. Otherwise, S4 objects look and feel just like S3 objects.
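If you prefer not to use `@` directly, the `methods` package also provides `slotNames()` to list an object's slots and `slot()` to extract one by name; a quick sketch using the same `Matrix` object:

```{r, collapse = TRUE}
library(Matrix)
M <- Matrix(10 + 1:28, 4, 7)
slotNames(M)                      # names of this object's slots
identical(slot(M, "Dim"), M@Dim)  # slot() is equivalent to @
```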
### Creating S4 objects
Here are the key takeaways:
- create a new S4 class with `setClass()`,
- create a new S4 object with `new()`,
- S4 classes have a fixed set of slots, a name, and a fixed inheritance structure.
Let's do an example that resembles the example we did in Python, where we defined a `RunningMean` class and a `RunningCI` class.
```{r, collapse = TRUE}
setClass("RunningMean",
slots = list(n = "integer",
currentXbar = "numeric"))
setClass("RunningCI",
slots = list(knownVar = "numeric"),
contains = "RunningMean")
```
Here, unlike with S3 classes, we actually have to define a class with `setClass()`. In the parlance of S4, `slots=` are a class's data members, and `contains=` signals that one class inherits from another (even though "contains" kind of sounds like it's suggesting *composition*).
New objects can be created with the `new()` function after this is accomplished.
```{r, eval=F}
new("RunningMean", n = 0L, currentXbar = 0)
new("RunningCI", n = 0L, currentXbar = 0, knownVar = 1.0)
```
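Because `RunningCI` `contains` `RunningMean`, S4 regards every `RunningCI` object as also being a `RunningMean`, which you can check with `is()`. A quick sketch (the class definitions are repeated so the chunk stands alone):

```{r, collapse = TRUE}
# Class definitions repeated so this chunk is self-contained.
setClass("RunningMean",
         slots = list(n = "integer",
                      currentXbar = "numeric"))
setClass("RunningCI",
         slots = list(knownVar = "numeric"),
         contains = "RunningMean")
ci <- new("RunningCI", n = 0L, currentXbar = 0, knownVar = 1.0)
is(ci, "RunningMean")  # TRUE, thanks to contains =
is(ci, "RunningCI")    # TRUE
```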
Next we want to define an `update()` generic function that will work on objects of both types. This is what gives us *polymorphism*. The generic `update()` will call specialized methods for objects of class `RunningMean` and `RunningCI`.
Recall that in the Python example, each class had its own update method. Here, we still have a specialized method for each class, but S4 methods don't have to be defined inside the class definition, as we can see below.
```{r, collapse=TRUE, echo=F}
setGeneric("update", function(oldMean, newNum) {
standardGeneric("update")
})
setMethod("update",
c(oldMean = "RunningMean", newNum = "numeric"),
function(oldMean, newNum) {
oldN <- oldMean@n
oldAve <- oldMean@currentXbar
newAve <- (oldAve*oldN + newNum)/(oldN + 1)
newN <- oldN + 1L
return(new("RunningMean", n = newN, currentXbar = newAve))
}
)
setMethod("update",
c(oldMean = "RunningCI", newNum = "numeric"),
function(oldMean, newNum) {
oldN <- oldMean@n
oldAve <- oldMean@currentXbar
newAve <- (oldAve*oldN + newNum)/(oldN + 1)
newN <- oldN + 1L
return(new("RunningCI", n = newN, currentXbar = newAve,
knownVar = oldMean@knownVar))
}
)
```
```{r, collapse=TRUE, eval = FALSE}
setGeneric("update", function(oldMean, newNum) {
standardGeneric("update")
})
setMethod("update",
c(oldMean = "RunningMean", newNum = "numeric"),
function(oldMean, newNum) {
oldN <- oldMean@n
oldAve <- oldMean@currentXbar
newAve <- (oldAve*oldN + newNum)/(oldN + 1)
newN <- oldN + 1L
return(new("RunningMean", n = newN, currentXbar = newAve))
}
)
setMethod("update",
c(oldMean = "RunningCI", newNum = "numeric"),
function(oldMean, newNum) {
oldN <- oldMean@n
oldAve <- oldMean@currentXbar
newAve <- (oldAve*oldN + newNum)/(oldN + 1)
newN <- oldN + 1L
return(new("RunningCI", n = newN, currentXbar = newAve,
knownVar = oldMean@knownVar))
}
)
```
Here's a demonstration of using these two classes that mirrors the example in subsection \@ref(adding-inheritance).
```{r, collapse=TRUE}
myAve <- new("RunningMean", n = 0L, currentXbar = 0)
myAve <- update(myAve, 3)
myAve
myAve <- update(myAve, 1)
myAve
myCI <- new("RunningCI", n = 0L, currentXbar = 0, knownVar = 1.0)
myCI <- update(myCI, 3)
myCI
myCI <- update(myCI, 1)
myCI
```
This looks more *functional* (more information on functional programming is available in chapter \@ref(an-introduction-to-functional-programming)) than the Python example because the `update()` function does not change an object with a side effect. Instead, it takes the old object, computes an updated copy, returns that new object, and uses assignment to overwrite the old name. The benefit of this approach is that the assignment operator (`<-`) \index{assignment operator!assignment operator in R} signals to us that something is changing. However, there is more data copying involved, so the program is presumably slower than it needs to be.
The big takeaway here is that S3 and S4 don't feel like Python's encapsulated OOP. If we wanted to write stateful programs, we might consider using Reference Classes, or R6 classes.
### Reference Classes: The Big Picture
[**Reference Classes**](https://www.rdocumentation.org/packages/methods/versions/3.6.2/topics/ReferenceClasses) are built on top of S4 classes, and were released in late 2010. They feel very different from S3 and S4 classes, and they more closely resemble Python classes, because
1. their method definitions are *encapsulated* inside class definitions like in Python, and
2. the objects they construct are *mutable*.\index{object-oriented programming!Reference Classes in R}
So it will feel much more like Python's class system. Some might say that using reference classes will lead to code that is not very R-ish, but they can be useful for certain types of programs (e.g. long-running code, code that performs many/high-dimensional/complicated simulations, or code that circumvents storing large data sets in your computer's memory all at once).
### Creating Reference Classes
Creating reference classes is done with the function `setRefClass()`. I create a class called `RunningMeanRC` that produces the same behavior as that in the previous example.
```{r, collapse=T}
RunningMeanRC <- setRefClass("RunningMeanRC",
fields = list(current_xbar = "numeric",
n = "integer"),
methods = list(
update = function(new_x){
n <<- n + 1L
new_sum <- current_xbar*(n-1) + new_x
current_xbar <<- new_sum/n
}))
```
This tells us a few things. First, data members are called *fields* now. Second, changing fields inside a method is done with the `<<-` operator. We can use it just as before.
```{r, collapse = TRUE}
my_ave <- RunningMeanRC$new(current_xbar=0, n=0L)
my_ave
my_ave$update(1.)
my_ave$current_xbar
my_ave$n
my_ave$update(3.)
my_ave$current_xbar
my_ave$n
```
Compare how similar this code looks to the code in \@ref(a-first-example)! Note the paucity of assignment operators and the abundance of side effects.
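One consequence of mutability worth knowing about: reference class objects have *reference semantics*, so `<-` copies a reference to the object, not the object itself. The sketch below (the class definition is repeated so the chunk stands alone) shows two names pointing at one object, and the `$copy()` method that makes a genuinely independent copy.

```{r, collapse = TRUE}
RunningMeanRC <- setRefClass("RunningMeanRC",
                             fields = list(current_xbar = "numeric",
                                           n = "integer"),
                             methods = list(
                               update = function(new_x){
                                 n <<- n + 1L
                                 new_sum <- current_xbar*(n-1) + new_x
                                 current_xbar <<- new_sum/n
                               }))
a <- RunningMeanRC$new(current_xbar = 0, n = 0L)
b <- a              # b is another reference to the same object, not a copy
b$update(10)
a$current_xbar      # a sees the change made through b
c_copy <- a$copy()  # an explicit, independent copy
c_copy$update(0)
a$n                 # unchanged by c_copy's update
```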
### Creating R6 Classes
I quickly implement the above example as an R6 class. A more detailed introduction to R6 classes is provided in [the vignette from the package authors.](https://r6.r-lib.org/articles/Introduction.html)
You'll notice the reappearance of the `self` keyword, just like in Python. R6 classes are similar to reference classes, but there are a few differences:
1. they have [better performance than reference classes](https://r6.r-lib.org/articles/Performance.html), and
2. they don't make use of the `<<-` operator.\index{object-oriented programming!R6 classes in R}
```{r, collapse = TRUE}
library(R6)
RunningMeanR6 <- R6Class("RunningMeanR6",
public = list(
current_xbar = NULL,
n = NULL,
initialize = function(current_xbar = NA, n = NA) {
self$current_xbar <- current_xbar
self$n <- n
},
update = function(new_x) {
newSum <- self$current_xbar*self$n + new_x
self$n <- self$n + 1L
self$current_xbar <- newSum / self$n
}
)
)
my_r6_ave <- RunningMeanR6$new(current_xbar=0, n=0L)
my_r6_ave
my_r6_ave$update(1.)
my_r6_ave$current_xbar
my_r6_ave$n
my_r6_ave$update(3.)
my_r6_ave$current_xbar
my_r6_ave$n
```
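R6 also supports inheritance, via the `inherit =` argument and the `super` object. Here is a sketch of a confidence-interval child class in the spirit of the earlier `RunningCI` examples; the class name `RunningCIR6`, the method names, and the 1.96 critical value are my choices, not from the package.

```{r, collapse = TRUE}
library(R6)
# Parent class repeated so this chunk is self-contained.
RunningMeanR6 <- R6Class("RunningMeanR6",
  public = list(
    current_xbar = NULL,
    n = NULL,
    initialize = function(current_xbar = NA, n = NA) {
      self$current_xbar <- current_xbar
      self$n <- n
    },
    update = function(new_x) {
      newSum <- self$current_xbar*self$n + new_x
      self$n <- self$n + 1L
      self$current_xbar <- newSum / self$n
    }))
RunningCIR6 <- R6Class("RunningCIR6",
  inherit = RunningMeanR6,           # inheritance in R6
  public = list(
    known_var = NULL,
    initialize = function(known_var, current_xbar = NA, n = NA) {
      super$initialize(current_xbar, n)  # call the parent's initialize
      self$known_var <- known_var
    },
    get_interval = function() {
      half <- 1.96 * sqrt(self$known_var / self$n)
      c(self$current_xbar - half, self$current_xbar + half)
    }))
my_ci <- RunningCIR6$new(known_var = 1, current_xbar = 0, n = 0L)
my_ci$update(3)       # inherited from the parent class
my_ci$get_interval()
```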
## Exercises
### Python Questions
1.
If you are interested in estimating a linear regression model, there is a [`LinearRegression` class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) that you might consider using in the `sklearn.linear_model` submodule. In this lab, we will create something similar, but a little more simplified.
A simple linear regression model will take in $n$ independent variables $x_1, \ldots, x_n$, and $n$ dependent variables $y_1, \ldots, y_n$, and try to describe their relationship with a function:
\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \epsilon_i.
\end{equation}
The coefficients $\beta_0$ and $\beta_1$ are unknown, and so must be estimated with the data. Estimating the variance of the noise terms $\epsilon_i$ may also be of interest, but we do not concern ourselves with that here.
The formulas for the estimated slope (i.e. $\hat{\beta}_1$) and the estimated intercept (i.e. $\hat{\beta}_0$) are as follows:
\begin{equation}
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i-\bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2}
\end{equation}
\begin{equation}
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.
\end{equation}
Create a class that performs simple linear regression.
+ Name it `SimpleLinReg`.
+ Its `.__init__()` method should not take any additional parameters. It should set two data attributes/members `est_intercept` and `est_slope`, to `np.nan`.
+ Give it a `.fit()` method that takes in two 1-dimensional Numpy arrays. The first should be the array of independent values, and the second should be the set of dependent values.
+ `.fit(x,y)` should not `return` anything, but it should store two data attributes/members: `est_intercept` and `est_slope`. Every time `.fit()` is called, it will re-calculate the coefficient/parameter estimates.
+ Give it a `.get_coeffs()` method. It should not make any changes to the data attributes/members of the class. It should simply `return` a Numpy array with the parameter estimates inside. Make the first element the estimated intercept, and the second element the estimated slope. If no such coefficients have been estimated at the time of its calling, it should return the same size array but with the initial `np.nan`s inside.
After you've finished writing your first class, you can bask in the glory and run the following test code:
```{python, eval = FALSE}
mod = SimpleLinReg()
mod.get_coeffs()
x = np.arange(10)
y = 1 + .2 * x + np.random.normal(size=10)
mod.fit(x,y)
mod.get_coeffs()
```
2.
Reconsider the above question that asked you to write a class called `SimpleLinReg`.
+ Write a new class called `LinearRegression2` that preserves all of the existing functionality of `SimpleLinReg`. Do this in a way that does not excessively copy and paste code from `SimpleLinReg`.
+ Give your new class a method called `.visualize()` that takes no arguments and plots the most recent data, the data most recently provided to `.fit()`, in a scatterplot with the estimated regression line superimposed.
+ Unfortunately, `SimpleLinReg().fit(x,y).get_coeffs()` will not return estimated regression coefficients. Give your new class this functionality. In other words, make `LinearRegression2().fit(x,y).get_coeffs()` spit out regression coefficients. Hint: the solution should only require one extra line of code, and it should involve the `self` keyword.
3.
Consider the following time series model [@West1989BayesianFA]
\begin{equation}
y_t = \beta_t + \epsilon_t \\
\beta_t = \beta_{t-1} + w_t \\
\beta_1 = w_1
\end{equation}
Here $y_t$ is observed time series data, each $\epsilon_t$ is measurement noise with variance $V$ and each $w_t$ is also noise but with variance $W$. Think of $\beta_t$ as a time-varying regression coefficient.
Imagine our data are arriving sequentially. The **Kalman Filter** [@kalman_filt] provides an "optimal" estimate of each $\beta_t$ given all of the information we have up to time $t$. What's better is that the algorithm is recursive. Future estimates of $\beta_t$ will be easy to calculate given our estimates of $\beta_{t-1}$.
Let's call the mean of $\beta_{t-1}$ (given all the information up to time $t-1$) $M_{t-1}$, and the variance of $\beta_{t-1}$ (given all the information up to time $t-1$) $P_{t-1}$. Then the Kalman recursions for this particular model are
\begin{equation}
M_t = M_{t-1} + \left(\frac{P_{t-1} + W}{P_{t-1} + W+V} \right)(y_t - M_{t-1})
\end{equation}
\begin{equation}
P_t = \left(1 - \frac{P_{t-1} + W}{P_{t-1} + W+V} \right)(P_{t-1} + W)
\end{equation}
for $t \ge 1$.
+ Write a class called `TVIKalman` (TVI stands for time-varying intercept).
+ Have `TVIKalman` take two floating points into its `.__init__()` method in this order: `V` and `W`. These two numbers are positive, and are the variance parameters of the model. Store these two numbers. Also, store $0$ and $W+V$ as members/attributes called `filter_mean` and `filter_variance`, respectively.
+ Give `TVIKalman` another method: `.update(yt)`. This function should not return anything, but it should update the filtering distribution's mean and variance numbers, `filter_mean` and `filter_variance`, given a new data point.
+ Give `TVIKalman` another method: `.get_confidence_interval()`. It should not take any arguments, and it should return a length two Numpy array. The ordered elements of that array should be $M_t$ plus and minus two standard deviations--a standard deviation at time $t$ is $\sqrt{P_t}$.
+ Create a `DataFrame` called `results` with the three columns called `yt`, `lower`, and `upper`. The last two columns should be a sequence of confidence intervals given to you by the method you wrote. The first column should contain the following data: `[-1.7037539, -0.5966818, -0.7061919, -0.1226606, -0.5431923]`. Plot all three columns in a single line plot. Initialize your Kalman Filter object with both `V` and `W` set equal to $.5$.
### R Questions
1.
Which of the following classes in R produce objects that are mutable? Select all that apply: S3, S4, reference classes, and R6.
2.
Which of the following classes in R produce objects that can have methods? Select all that apply: S3, S4, reference classes, and R6.
3.
Which of the following classes in R produce objects that can store data? Select all that apply: S3, S4, reference classes, and R6.
4.
Which of the following classes in R have encapsulated definitions? Select all that apply: S3, S4, reference classes, and R6.
5.
Which of the following classes in R have "slots"? Select all that apply: S3, S4, reference classes, and R6.
6.
Which of the following class systems in R is the newest? S3, S4, reference classes, or R6?
7.
Which of the following class systems in R is the oldest? S3, S4, reference classes, or R6?
8.
Which of the following classes in R requires you to `library()` in something? Select all that apply: S3, S4, reference classes, and R6.
9.
Suppose you have the following data set: $X_1, \ldots, X_n$. You assume it is a random sample from a Normal distribution with unknown mean and variance parameters, denoted by $\mu$ and $\sigma^2$, respectively. Consider testing the null hypothesis that $\mu = 0$ at a significance level of $\alpha$. To carry out this test, you calculate
\begin{equation}
t = \frac{\bar{X}}{S/\sqrt{n}}
\end{equation}
and you reject the null hypothesis if $|t| > t_{n-1,\alpha/2}$. This is **Student's T-Test** [@student1908probable]. Here $S^2 = \sum_i(X_i - \bar{X})^2 / (n-1)$ is the sample variance, and $t_{n-1,\alpha/2}$ is the $1-\alpha/2$ quantile of a t-distribution with $n-1$ degrees of freedom.
+ Write a function called `doTTest()` that performs the above hypothesis test. It should accept two parameters: `dataVec` (a `vector` of data) and `significanceLevel` (which is $\alpha$). Have the second parameter default to `.05`.
+ Have it return an S3 object created from a `list`. The `class` of this `list` should be `"TwoSidedTTest"`. The elements in the `list` should be named `decision` and `testStat`. The `decision` object should be either `"reject"` or `"fail to reject"`. The test stat should be equal to the calculation you made above for $t$.
+ Create a `summary` method for this new class you created: `TwoSidedTTest`.
10.
Suppose you have a target density $p(x)$ that you are only able to evaluate up to a normalizing constant. In other words, suppose that for some $c > 0$, $p(x) = f(x) / c$, and you are only able to evaluate $f(\cdot)$. You would like to approximate the expected value of $p(x)$ (i.e. $\int x p(x) dx$) using some proposal distribution $q(x)$. $q(x)$ is flexible in that you can sample from it, and you can evaluate it. We will use **importance sampling** [@impsamping1] [@impsamping2] to achieve this.^[Note that this is a similar setup to the accept-reject sampling problem we had earlier, and this algorithm is closely related to the Monte Carlo algorithm we used in the exercises of chapter 3.]
**Algorithm 1**: Importance Sampling
i. Sample $x^1, \ldots, x^n$ from $q(x)$.
ii. For each sample $x^i$, calculate an unnormalized weight $\tilde{w}^i := \frac{f(x^i)}{q(x^i)}$.
iii. Calculate the normalized weights $w^i = \tilde{w}^i \bigg/ \sum_j \tilde{w}^j$.
iv. Calculate the weighted average $\sum_{i=1}^n w^i x^i$.
In practice it is beneficial to use log-densities, because it will avoid underflow issues. After you evaluate each $\log \tilde{w}^i$, before you exponentiate them, subtract a number $m$ from all the values. A good choice for $m$ is $\max_i (\log \tilde{w}^i)$. These new values will produce the same normalized weights because
\begin{equation}
w^i = \exp[ \log \tilde{w}^i - m] \bigg/ \sum_j \exp[\log \tilde{w}^j - m].
\end{equation}
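The numerical claim above is easy to check: subtracting $m$ before exponentiating leaves the normalized weights unchanged while keeping the exponentials in a representable range. A quick sketch with made-up log-weights (this illustrates only the stability trick, not a solution to the exercise):

```{r, collapse = TRUE}
log_w_tilde <- c(-1001, -1000, -1002)  # exp() of these underflows to 0
# naive normalization: 0/0 produces NaN
naive <- exp(log_w_tilde) / sum(exp(log_w_tilde))
naive
# stable normalization: subtract m = max of the log-weights first
m <- max(log_w_tilde)
stable <- exp(log_w_tilde - m) / sum(exp(log_w_tilde - m))
stable  # well-defined weights that sum to 1
```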
+ Create an R6 class called `ImpSamp` that performs importance sampling.
+ It should store three functions as data members: `qSamp`, `logQEval`, and `logFEval`.
+ Give it an `initialize()` method that takes in three arguments: `logFEvalFunc`, `qSampFunc` and `logQEvalFunc`.
+ `initialize()` will set the stored function values equal to the function objects passed in.
+ The functions performing random sampling should only take a single argument referring to the number of samples desired.
+ The evaluation functions should take a `vector` as an input and return a `vector` as an output.
+ Write a method called `computeApproxExpec` that computes and returns the importance sampling estimate of $\int x p(x) dx$. Have this method take an integer argument that represents the number of desired samples to use for the computation.
+ Is it better to make this code object-oriented, or would you prefer a simple function that spits out the answer? Why?