
# Using Third-Party Code

Before using third-party code, it must first be installed. After it is installed, it must be "loaded in" to your session. I will describe both of these steps in R and Python.


## Installing Packages In R 

In R, there are thousands of free, user-created **packages** [@r_for_everyone].\index{packages!packages in R} You can download most of these from the [*Comprehensive R Archive Network*](https://cran.r-project.org/) (CRAN). You can also download packages from other publishing platforms like [Bioconductor](https://www.bioconductor.org/) or [GitHub](https://github.com/). Installing from CRAN is the most common route, and it is extremely easy to do: just use the `install.packages()` function. This can be run inside your R console, so there is no need to type anything into the command line.

```{r, eval = FALSE}
install.packages("thePackage")
```

## Installing Packages In Python 

In Python, installing packages is more complicated. Commands must be written in the command line, and there are multiple package managers. This isn't surprising, because Python is used more extensively than R in fields other than data science.\index{packages!packages in Python}

If you followed the suggestions provided earlier in the text, then you installed Anaconda. This means you will usually be using the [`conda` command](https://docs.anaconda.com/anaconda/user-guide/tasks/install-packages/). Point-and-click interfaces are available as well. 

```{bash, eval=FALSE}
conda install the_package
```

There are some packages that will not be available using this method. For more information on that situation, see [here.](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-pkgs.html#install-non-conda-packages)
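
If you are unsure whether an installation succeeded, one option is to ask Python whether it can locate the package. The following is a minimal sketch that assumes you only care about top-level names; `is_installed()` is a hypothetical helper written for this illustration, and `the_package` is the same placeholder name used above.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed("math"))         # a module that ships with Python
print(is_installed("the_package"))  # the placeholder name used above
```

This only checks that the package can be found; it does not load the package into your session.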

## Loading Packages In R 

After they are installed on your machine, third-party code will need to be "loaded" into your R or Python session. 

Loading in a package is relatively simple in R; however, complications can arise when different variables share the same name. This happens relatively often because

- it's easy to create a variable in the global environment that has the same name as another object you don't know about, and
- different packages you load in sometimes share names accidentally.

Starting off with the basics, here's how to load in a package of third-party code. Just type the following into your R console.

```{r, eval = FALSE}
library(thePackage)
```

You can also use the `require()` function, which behaves slightly differently when the requested package is not found: `library()` stops with an error, whereas `require()` issues a warning and returns `FALSE`. 

To understand this more deeply, we need to talk about **environments** again. We discussed these before in  \@ref(more-details-on-rs-user-defined-functions), but only in the context of user-defined functions. When we load in a package with `library()`, we make its contents available by putting it all in an environment for that package. 

An [environment](https://cran.r-project.org/doc/manuals/R-lang.html#Environment-objects) holds the names of objects.\index{environments in R} There are usually several environments, and each holds a different set of functions and variables. All the variables you define are in an environment, every package you load in gets its own environment, and all the functions that come in R pre-loaded have their own environment. 

::: {.rmd-details data-latex=""}
Formally, each environment is a pair of two things: a **frame** and an **enclosure**. The frame is the set of symbol-value pairs, and the enclosure is a pointer to the parent environment. If you've heard of a *linked list* in a computer science class, a chain of environments is structured the same way. 
::: 

Moreover, all of these environments are connected in a chain-like structure.  To see what environments are loaded on your machine, and what order they were loaded in, use the `search()` function. This displays the [**search path**](https://cran.r-project.org/doc/manuals/R-lang.html#Search-path), or the ordered sequence of all of your environments.\index{search path!search path in R}

Alternatively, if you're using RStudio, the search path, and the contents of each of its environments, are displayed in the "Environment" window. You can choose which environment you'd like to look at by selecting it from the drop-down menu. This allows you to see all of the variables in that particular environment. The **global environment** (i.e. `".GlobalEnv"`) is displayed by default, because that is where you store all the objects you are creating in the console.

```{r rstudiodisp, fig.cap='The Environment Window in RStudio', out.width='80%', fig.align='center', echo=F}
knitr::include_graphics("pics/environments_display_rstudio.png")
```

When you call `library(thePackage)`, the package has an environment created for it, and it is *inserted between the global environment and the most recently loaded package.* When you want to access an object by name, R will first search the global environment, and then it will traverse the environments in the search path in order. This has a few important implications.

 - First, **don't define variables in the global environment that are already named in another environment.** There are many variables that come pre-loaded in the `base` package (to see them, type `ls("package:base")`), and if you like using a lot of packages, you're increasing the number of names you should avoid using. 

 - Second, **don't `library()` in a package unless you need it, and if you do, be aware of all the names it will mask in packages you loaded in before**. The good news is that `library()` will often print warnings letting you know which names have been masked. The bad news is that it's somewhat out of your control: if you need two packages, then they might have a shared name, and the only thing you can do about it is watch the order you load them in.
 
 - Third, don't use `library()` inside code that is `source()`'d in other files. For example, if you attach a package to the search path from within a function you defined, anybody that uses your function loses control over the order of packages that get attached. 

All is not lost if there is a name conflict. The variables haven't disappeared. It's just slightly more difficult to refer to them. For instance, if I load in `Hmisc` [@hmisc], I get the warning that `format.pval` and `units` are now masked because they were names that were in `"package:base"`. I can still refer to these masked variables with the double colon operator (`::`).

```{r, collapse = TRUE, eval = FALSE}
library(Hmisc)
# this now refers to Hmisc's format.pval 
# because it was loaded more recently
format.pval 
Hmisc::format.pval # in this case is the same as above
# the below code is the only way 
# you can get base's format.pval now
base::format.pval  
```


## Loading Packages In Python 

In Python, you use the `import` statement to access objects defined in another file. It is slightly more complicated than R's `library()` function, but it is also more flexible. To make the contents of a package called, say, `the_package` available, type *one of the following* inside a Python session. 

```{python, eval = FALSE}
import the_package
import the_package as tp 
from the_package import *
```

To describe the difference between these three approaches, as well as to highlight the important takeaways and compare them with the important takeaways in the last section, we need to discuss what a Python module is, what a package is, and what a Python namespace is.^[I am avoiding any mention of *R's* namespaces and modules. These are things that exist, but they are different from Python's namespaces and modules, and are not within the scope of this text.] 

 - A Python [**`module`**](https://docs.python.org/3/tutorial/modules.html) is a `.py` file,\index{modules in Python} separate from the one you are currently editing, with function and/or object definitions in it.^[The scripts you write are modules. They usually come with the intention of being run from start to finish. Other non-script modules are just a bag of definitions to be used in other places.]
 
 - A [package](https://docs.python.org/3/tutorial/modules.html#packages) is a group of modules.^[Sometimes a package is called a *library* but I will avoid this terminology.]\index{packages!packages in Python} 
 
 - A [**namespace**](https://docs.python.org/3/tutorial/classes.html#python-scopes-and-namespaces) is "a mapping from names to objects."\index{namespace in Python}
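
To make the "mapping from names to objects" idea concrete, here is a small sketch using the built-in `vars()` function on the standard library's `math` module. I chose `math` only because it ships with every Python installation.

```python
import math

# vars() returns a module's namespace as an ordinary dictionary
math_namespace = vars(math)
print(type(math_namespace))

# looking a name up in the dictionary gives the same object
# as using the dot operator on the module
print(math_namespace["pi"] == math.pi)
print(math_namespace["sqrt"] is math.sqrt)
```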
 
With these definitions, we can define `import`ing. According to the [Python documentation](https://docs.python.org/3/reference/import.html#the-import-system), "[t]he import statement combines two operations; it searches for the named module, then it binds the results of that search to a name in the local scope." 

The sequence of places Python looks for a module is called the search path. This is not the same as R's search path, though. In Python, the search path is a list of places to look for *modules*, not a list of places to look for variables. To see it, `import sys`, then type `sys.path`.\index{search path!search path in Python}
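
As a short illustration, the snippet below prints each entry of the search path on its own line. The entries you see will depend on your own machine and on how Python was installed.

```python
import sys

# sys.path is an ordinary list, and it is searched in order
for location in sys.path:
    print(location)
```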

After a module is found, the variable names inside it become available to the `import`ing module. These variables are available in the global scope, but the names you use to access them will depend on what kind of `import` statement you used. From there, you are using the same scoping rules that we described in \@ref(function-scope-in-python), which means the LEGB acronym still applies. 
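
Here is a sketch of how the kind of `import` statement determines the names you end up with. It uses the standard library's `math` module as a stand-in for `the_package`, and it names `sqrt` and `pi` explicitly instead of using `*` so that it is obvious which names get bound.

```python
import math                  # binds the name "math"
import math as m             # binds the name "m" to the same module
from math import sqrt, pi    # binds "sqrt" and "pi" directly

print(math.sqrt(16.0))  # qualified with the module name
print(m.sqrt(16.0))     # qualified with the alias
print(sqrt(16.0))       # unqualified

# all three routes lead to the very same objects
print(m is math)
print(sqrt is math.sqrt)
```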

::: {.rmd-details data-latex=""}
In both languages, an (unqualified) variable name can only refer to one object at any time. This does not necessarily have anything to do with using third-party code--you can redefine objects, but don't expect to be able to access the old object after you do it. 
 
The same thing can happen when you use third-party code.

  - In R, you have to worry about the order of `library()` and `require()` calls, because there is potential *masking* going on. 
  - If you don't want to worry about masking, don't use `library()` or `require()`, and just refer to variables using the `::` operator (e.g. `coolPackage::specialFunc()`).
  - In Python, loading packages using either the `import package` format or the `import package as p` format means you do not need to worry about the order of imports because you will be forced to qualify variable names (e.g. `package.func()` or `p.func()`).
  - In Python, if you load third-party code using either `from package import foo` or `from package import *`, you won't have to qualify variable names, but imported objects will overwrite any variables that happen to have the same name as something you're importing. 

The way variable names are stored is only slightly different between R and Python.\index{namespace in Python} 

 - Python namespaces are similar to R environments in that they hold name-value pairs; however
 - Python namespaces are unlike R environments in that they are not arranged into a sorted list.
 - Also, Python modules may be organized into a nested or tree-like structure, whereas R packages will always have a flat structure.
::: 
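
The overwriting behavior of `from package import foo` can be seen in a small sketch. It again uses the standard library's `math` module as a stand-in for third-party code, and the variable `pi` is just an illustrative name.

```python
pi = "my own variable"   # a name we defined ourselves

from math import pi      # silently rebinds the name pi
print(pi)                # our string is no longer reachable by this name

pi = "my own variable"   # rebinding again hides the imported value...
import math
print(math.pi)           # ...but the qualified name is unaffected
```

Notice that Python prints no warning in either direction, which is one reason qualified names are the safer habit.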


### `import`ing Examples

In the example below, we import the entire `numpy` package in a way that lets us refer to it as `np`. This reduces the amount of typing that is required of us, and it also protects against variable name clashing. We then use the `normal()` function to simulate normal random variables. This function is in the [`random` sub-module](https://numpy.org/doc/stable/reference/random/index.html?highlight=random#module-numpy.random), which is a sub-module in `numpy` that collects all of the pseudorandom number generation functionality together.   

```{python, collapse = TRUE}
import numpy as np # import all of numpy
np.random.normal(size=4)
```

This is one use of the dot operator (`.`). It is also used to access attributes and methods of objects (more information on that will come later in chapter \@ref(an-introduction-to-object-oriented-programming)). `normal` is *inside of* `random`, which is itself inside of `np`. 

As a second example, suppose we were interested in the [`stats` sub-module](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html) found inside the `scipy` package. We could import all of `scipy`, but just like the above example, that would mean we would need to consistently refer to a variable's module, the sub-module, and the variable name. In long programs, typing `scipy.stats.norm` over and over again becomes tedious. Instead, let's import the sub-module (or sub-package) and ignore the rest of `scipy`.


```{python, collapse = TRUE}
from scipy import stats
stats.norm().rvs(size=4)
```

So we don't have to type `scipy` every time we use something in `scipy.stats`.

Finally, we can import the function directly, and refer to it with only one letter. This is highly discouraged, though. We are much more likely to accidentally use the name `n` twice. Further, `n` is not a very descriptive name, which means it could be difficult to understand what your program is doing later. 

```{python, collapse = TRUE}
from numpy.random import normal as n
n(size=4)
```

Keep in mind, you're always at risk of accidentally re-using names, even if you aren't `import`ing anything. For example, consider the following code. 

```{python, eval = FALSE}
n = 3 # now you can't use n as a function 
n()   
```

This is very bad, because now you cannot use the `n()` function that was imported from the `numpy.random` sub-module earlier. In other words, it is no longer *callable*. The error message from the above code will be something like `TypeError: 'int' object is not callable`.
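
The function itself has not been destroyed, though; only the short name has been reassigned. The sketch below shows two ways of getting it back. So that it runs without `numpy` installed, it uses the standard library's `random.gauss` as a stand-in for `numpy.random.normal`.

```python
import random
from random import gauss as n   # stand-in for the import above

n = 3                            # n now refers to an integer
print(callable(n))

# the function still exists under its qualified name
print(callable(random.gauss))

# re-running the import statement rebinds n to the function
from random import gauss as n
print(callable(n))
```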

Use the `dir()` function to see what is available inside a module. Here are a few examples. Type them into your own machine to see what they output.

```{python, eval = FALSE}
dir(np) # numpy stuff
dir(__builtins__) #built-in stuff
```
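
The output of `dir()` can be long. One possible way to shorten it, sketched below with the standard library's `math` module, is to drop the names that begin with an underscore, which by convention are not meant for everyday use.

```python
import math

# keep only the names that do not start with an underscore
public_names = [name for name in dir(math) if not name.startswith("_")]
print(len(public_names))
print("sqrt" in public_names)
```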


## Exercises


1. 

What are important differences in the package installation procedures of R and Python? Select all that apply. 

  a) Installing R packages can be done from within R, while installing packages in Python can be done in the command line.
  b) Installing R packages can usually be done with the same function `install.packages()`, while installing packages in Python can be done with a variety of package installers such as `pip install` and `conda install`.
  c) There is only one package repository for R, but many for Python.
  d) There is only one package repository for Python, but many for R.

2. 

What are important similarities and differences in the package loading procedures of R and Python? Select all that apply. 

  a) R and Python both have a search path.
  b) R's `::` operator is very similar to Python's `.` operator because they can both help access variable names inside packages.
  c) Python namespaces are unlike R environments in that they are not arranged into a sorted list.
  d) `library(package)` in R is similar to `from package import *` in Python because it will allow you to refer to all variables in `package` without qualification.
  e) Python packages might have `sub-modules` whereas R's packages do not.

3.

In Python, which of the following is, generally speaking, the best way to `import`?

  + `import the_package`
  + `from the_package import *`
  + `import the_package as tp`

4.

In Python, which of the following is, generally speaking, the worst way to `import`?

  + `import the_package`
  + `from the_package import *`
  + `import the_package as tp`

5.

In R, if you want to use a function `func()` from `package`, do you always have to use `library(package)` or `require(package)` first?

  + Yes, otherwise `func()` won't be available.
  + No, you can just use `package::func()` without calling any function that performs pre-loading.