<!DOCTYPE html>
<html lang="" xml:lang="">
  <head>
    <title>MAT381E-Week 8: Introduction to Web Scraping</title>
    <meta charset="utf-8" />
    <meta name="author" content="Gül İnan" />
    <meta name="date" content="2021-11-28" />
    <script src="intro_webscraping_files/header-attrs-2.11/header-attrs.js"></script>
    <link href="intro_webscraping_files/remark-css-0.0.1/default.css" rel="stylesheet" />
    <script src="intro_webscraping_files/fabric-4.3.1/fabric.min.js"></script>
    <link href="intro_webscraping_files/xaringanExtra-scribble-0.0.1/scribble.css" rel="stylesheet" />
    <script src="intro_webscraping_files/xaringanExtra-scribble-0.0.1/scribble.js"></script>
    <script>document.addEventListener('DOMContentLoaded', function() { window.xeScribble = new Scribble({"pen_color":["#FF0000"],"pen_size":3,"eraser_size":30,"palette":[]}) })</script>
    <link href="intro_webscraping_files/panelset-0.2.6/panelset.css" rel="stylesheet" />
    <script src="intro_webscraping_files/panelset-0.2.6/panelset.js"></script>
    <script src="intro_webscraping_files/kePrint-0.0.1/kePrint.js"></script>
    <link href="intro_webscraping_files/lightable-0.0.1/lightable.css" rel="stylesheet" />
    <link rel="stylesheet" href="xaringan-themer.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">
class: left, middle, my-title, title-slide

# MAT381E-Week 8: Introduction to Web Scraping
### Gül İnan
### Department of Mathematics<br/>Istanbul Technical University
### November 28, 2021

---


class: left

# Homework I review

- Turn off warnings and messages in code chunks; they clutter the rendered document.
- Do not print an entire large data set; show only a piece of it.
- Do not use the `View()` function in homework/reports, since it forces another window to open.
- `library(tidyverse)` already loads `ggplot2` and the other core packages. Loading them again one by one suggests that
you do not know the tidyverse ecosystem well.
- Please add short comments as needed. The reader should not have to guess what you are doing; you need to guide the reader.
- Present a well-organized homework/report. This is a sign of how much you respect your readers.
- Please use functions from data-science packages for mathematical operations.
- Please prefer piping where appropriate; it makes the code more readable.
- Please pay attention to your project folder design: keep data files under a data folder, image
files under an image folder, and so on.
- As in everything, how you present something matters as much as what you have done.
---
class: left

&lt;!-- First code block is setting options for theme of the slides  --&gt;  


# Outline

* Motivation.
* What is `Web Scraping`?
* `HTML` basics.
* Web scraping with `rvest` package.
* Ethical issues.
* 01-web_scraping.Rmd.

---

# Motivation

&lt;style type="text/css"&gt;
.pull-left {
  float: left;
  width: 50%;
}
.pull-right {
  float: right;
  width: 50%;
}
&lt;/style&gt;

.pull-left[
&lt;img src="images/hatem_crime_stat.jpeg" width="90%" height="100%" /&gt;
[Source](https://www.statista.com/chart/24442/anti-asian-hate-crime/)
]

.pull-right[
* "A survey of police reports by the [Center for the Study of Hate and Extremism at California State University](https://www.csusb.edu/sites/default/files/FACT%20SHEET-%20Anti-Asian%20Hate%202020%203.2.21.pdf) confirmed that racially motivated crimes against those of Asian descent in the U.S. have risen in the pandemic year of 2020. **While hate crimes against Asians still make up a smaller fraction of all hate crimes reported in America’s 15 largest cities, their number rose from 49 in 2019 to 122 in 2020.**"
* "Separate reports released by the [Stop AAPI Hate](https://stopaapihate.org/) reporting center confirm that attacks on Asians were highest in the early days of the pandemic, but also show that they have been rising again lately."
* "[Stop AAPI Hate](https://stopaapihate.org/) said yesterday that verbal harassment was the most common incident recorded by them at 68 percent of all cases, followed by deliberate shunning (20 percent of cases) and physical attacks (11 percent of cases)."
]

---
# What is a hate crime?
- According to the [US Department of Justice](https://www.justice.gov/hatecrimes/learn-about-hate-crimes/chart): A hate crime is a crime committed on the basis of the victim’s perceived or actual race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- The [US Department of Justice](https://www.justice.gov/hatecrimes/learn-about-hate-crimes/chart) adds: "Hate crimes have a broader effect than most other kinds of crime. Hate crime victims include not only the crime’s immediate target but also others like them. Hate crimes affect families, communities, and at times, the entire nation, as **others fear that they too could be threatened, attacked, or forced from their homes, because of what they look like, who they are, where they worship, whom they love, or whether they have a disability.**"

[Source](https://www.usatoday.com/story/news/politics/2021/03/18/hate-crime-attacks-georgia-raise-motive-bias-questions/4739328001/)

---
# Why report hate crimes?

- According to the [US Department of Justice](https://www.justice.gov/hatecrimes/learn-about-hate-crimes/chart): "The Hate Crimes Reporting Gap is the **significant disparity** between hate crimes that actually occur and those reported to law enforcement. It is **critical to report hate crimes** not only to show support and get help for victims, but also to send a clear message that the community will not tolerate these kinds of crimes. Reporting hate crimes allows communities and law enforcement to fully understand the scope of the problem in a community and put resources toward preventing and addressing attacks based on bias and hate."

---
# Lacking Hate Crime Data


.pull-left[
&lt;img src="images/atlanta-hate.png" width="90%" height="100%" /&gt;
[Source](https://www.theguardian.com/us-news/datablog/2021/mar/20/asian-american-hate-crime-data-mona-chalabi?utm_source=dlvr.it&amp;utm_medium=twitter)
]

.pull-right[
* "This, of course, ignores the possibility that someone might be motivated by racial hatred and sexism."
* "Unfortunately, most statistics make the same assumption. Hate crime data that is gathered by the FBI is often categorized according to **a single motivation** (such as religion, sexual orientation, race/ethnicity, gender identity). Less than 3% of the hate crimes that were reported in 2019 recorded **multiple biases.**"
* "**Reality is obviously much more complex than these numbers capture.** Things get even more complicated when you consider reporting rates. A person’s race and gender identity will affect the likelihood that they will report a hate crime to the police."
]


---
# Motivating Data
- The data we need to answer a question may not always come in a spreadsheet and be ready for us to read. Sometimes, data can be available on the web.
- For example, the following [Wikipedia page](https://en.wikipedia.org/wiki/Hate_crime_laws_in_the_United_States) presents **Hate crime statistics by bias motivation in the US** in an `HTML` table:

&lt;img src="images/wiki_hate1.png" width="100%" /&gt;

---
# Web Scraping 
- **Web scraping** and **web harvesting** are terms used to describe the process of extracting data from a website. 
- **Web pages** are written as **text** using **HyperText Markup Language** (HTML) code.
- **Web browsers** then render this code so that the page can be viewed.
- To see the `HTML` source code of a web page, we can visit the page in a _browser_ and use the _View Page Source_ tool.
- Because `HTML` code is accessible, we can download the `HTML` file, import it into `R`, and then write `R` code to extract the information we need from the page. 


---

- To get an idea of how `HTML` code works, here are a few lines of code from the [Wikipedia page](https://en.wikipedia.org/wiki/Hate_crime_laws_in_the_United_States) that provides US hate crime statistics:

--
&lt;img src="images/left.png" width="100%" height="100%" /&gt;

--

&lt;img src="images/right.png" width="100%" height="100%" /&gt;


---

- Once we look at the full `HTML` source code, we can see the text and data alongside the `HTML` tags. 
- We can also see **a pattern** in how the data is stored. If we know `HTML`, we can write programs that use these patterns to extract what we want. 
- We also take advantage of Cascading Style Sheets (CSS), a language widely used to make web pages look "pretty".

---
# HTML basics

- All `HTML` documents must start with a document type declaration: `&lt;!DOCTYPE html&gt;`.
- Every `HTML` page must be wrapped in an `&lt;html&gt;` element, which must have **two children**: `&lt;head&gt;`, which contains document metadata such as the page title and author, and `&lt;body&gt;`, which contains the content you see in the browser. 

.pull-left[
```html
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
  &lt;title&gt;Page title&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
  &lt;h1&gt; Welcome to İTÜ! &lt;/h1&gt;
  &lt;p&gt;Some text &amp;amp; &lt;b&gt;some bold text.&lt;/b&gt;
  &lt;i&gt; Some italic text &lt;/i&gt; &lt;/p&gt;
  &lt;a href="http://kutuphane.itu.edu.tr/"&gt;Visit İTÜ Library&lt;/a&gt; for: 
  &lt;ol&gt;
  &lt;li&gt;Calculus Books&lt;/li&gt;
  &lt;li&gt;Engineering Books&lt;/li&gt;
  &lt;li&gt;Statistics Books&lt;/li&gt;
  &lt;/ol&gt;
&lt;/body&gt;
&lt;/html&gt;
```
]

--
.pull-right[
* `HTML` elements are nested hierarchically. Each element consists of a start tag (e.g. `&lt;tag&gt;`), optional attributes (`id='first'`), an end tag (like `&lt;/tag&gt;`), and contents (everything in between the start and end tag).
* Block tags like `&lt;h1&gt;` (top-level heading), `&lt;p&gt;` (paragraph), `&lt;ol&gt;` (ordered list), and `&lt;li&gt;` (list item) form the overall structure of the page.
* Inline tags like `&lt;b&gt;` (bold), `&lt;i&gt;` (italics), and `&lt;a&gt;` (links) format text inside block tags.
* On the left: The `&lt;a&gt;` tag defines a hyperlink. The `href` **attribute specifies the URL of the page the link goes to**.
]

---

- Note: Since `&lt;` and `&gt;` delimit start and end tags, we cannot write them directly as text. 
- Instead, we have to use the `HTML` escapes `&amp;gt;` (greater than) and `&amp;lt;` (less than). 
- And of course, since those escapes use `&amp;`, if we want a literal ampersand we have to escape it as `&amp;amp;`.
- If you encounter a tag that you have never seen before, you can find out what it does at [W3Schools](https://www.w3schools.com/tags/).

---

- Let's try out our `HTML` code at [W3Schools](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_default):

&lt;img src="images/www3.png" width="100%" height="100%" /&gt;

- More on [HTML](https://www.w3schools.com/html/html_headings.asp).

---

- Some elements, like `&lt;img&gt;`, cannot have children. These elements depend solely on **attributes for their behavior**.

```html
&lt;img src='logo/rvest.jpg' width="400" height="400"&gt;
```

- Here, the `src` attribute specifies the path (URL) to the image; the `width` and `height` attributes define the width and height of the image in **pixels**.


--

&lt;img src='logo/rvest.jpg' width="400" height="400"&gt;

---
# Named attributes
- Sometimes, the start tags of `HTML` elements can have **named attributes** which look like `&lt;tag name1='value1'&gt; Content &lt;/tag&gt;`. 
- Two of the most important named attributes are `id` and `class`, which are used in conjunction with `CSS` to **control the visual appearance** of the page. These are often useful when scraping data off a page.
- Note that attributes are always specified in the start tag.

---

#### id attribute

- The `id` attribute is used to point to a specific style declaration in a **style element within head** and the value of the `id` attribute must be **unique** within the `HTML` document.
- In `CSS`, the syntax for selecting an `id` is: write a hash character (`#`), followed by the `id name`. Then, define the CSS properties within curly braces `{}`.

&lt;img src='images/idattribute.png' height="400"&gt;

[Source1](https://www.w3schools.com/html/html_id.asp) and [Source2](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_id_css)

---

#### class attribute

- The `class` attribute is often used to point to a class name in a style sheet. Multiple `HTML` elements can share the same class.
- In `CSS`, the syntax for selecting a `class` is: write a period character (`.`), followed by the `class name`. Then, define the CSS properties within curly braces `{}`.

&lt;img src='images/classattribute.png' height="400"&gt;

[Source1](https://www.w3schools.com/html/html_classes.asp) and [Source2](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_classes_capitals) 
---

- Note that the main difference between the `id` and `class` attributes is that an `id` is unique within a page and can apply to **at most one HTML element**, while a `class` can be applied to **multiple HTML elements**.
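
---

- The same `#id` and `.class` notation is what makes these attributes useful for scraping. As a minimal sketch, looking ahead to the `rvest` package introduced in the next section: the page below is a made-up string (not a real site), and the selectors `"#intro"` and `".city"` are assumptions chosen for this illustration.

```r
library(rvest)

# A tiny hypothetical page: one element with an id, two sharing a class
page &lt;- read_html('
  &lt;p id="intro"&gt;Only one element may carry this id.&lt;/p&gt;
  &lt;p class="city"&gt;London&lt;/p&gt;
  &lt;p class="city"&gt;Paris&lt;/p&gt;
')

# "#intro" selects the element whose id is "intro"
intro &lt;- page %&gt;% html_elements("#intro") %&gt;% html_text()

# ".city" selects every element whose class is "city"
cities &lt;- page %&gt;% html_elements(".city") %&gt;% html_text()
```

- Here `intro` holds a single string, while `cities` holds one string per element that shares the class.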

---
class: center, middle

# Rvest
&lt;!-- Import image with HTML code, dimensions are in terms of pixel --&gt;
 &lt;img src='logo/rvest.jpg' height="400"&gt;

---
# The rvest package
- The [rvest package](https://rvest.tidyverse.org/articles/rvest.html) provides web harvesting tools within the [tidyverse](https://www.tidyverse.org/packages/) ecosystem.


```r
# rvest is not within the core tidyverse ecosystem
# library(tidyverse) will not load rvest package
# load rvest package by library(rvest) call specifically 
library(rvest)
```

- The [rvest manual](https://cran.r-project.org/web/packages/rvest/rvest.pdf) tells us that `rvest` builds on a few other packages, including `xml2`. `rvest` re-exports `read_html()` from `xml2`; to call other `xml2` functions, such as `xml_child()`, load the package with `library(xml2)`.

|Function        |Description                                   | 
|----------------|----------------------------------------------|
| `read_html()`  |takes a string that is either a path, a URL, or literal HTML, and returns an HTML document.|

---
- Here are basic `rvest` functions:

|Function           |Description                                   | 
|-------------------|----------------------------------------------|
| `html_elements()` |select the elements that match a CSS selector (for example, a tag name) from the HTML document.|
| `html_table()`    |extract table, to be used after `html_elements()`.       | 
| `html_text()`     |extract text within tags, to be used after `html_elements()`.|  
| `html_attr()`     |extract the value of attribute, to be used after `html_elements()`.| 
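
---

- These functions can be tried on a small page before touching a real site. The sketch below is an illustration only: the page is a made-up string modelled on the earlier `HTML` example, and `read_html()` parses it directly because the string contains literal `HTML`.

```r
library(rvest)

# Hypothetical page held in a string (nothing is fetched from the web)
page &lt;- read_html('
  &lt;h1&gt;Welcome!&lt;/h1&gt;
  &lt;a href="http://kutuphane.itu.edu.tr/"&gt;Visit the library&lt;/a&gt;
  &lt;ol&gt;
    &lt;li&gt;Calculus Books&lt;/li&gt;
    &lt;li&gt;Engineering Books&lt;/li&gt;
    &lt;li&gt;Statistics Books&lt;/li&gt;
  &lt;/ol&gt;
')

# text inside the first h1 element
heading &lt;- page %&gt;% html_element("h1") %&gt;% html_text()
# text of every list item
books   &lt;- page %&gt;% html_elements("li") %&gt;% html_text()
# value of the href attribute of the first link
link    &lt;- page %&gt;% html_element("a") %&gt;% html_attr("href")
```

- `html_element()` returns only the first match, which suits the single heading and link, while `html_elements()` returns every `&lt;li&gt;`.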

---

- The first step in using this package is to import the web page you are interested in into `R`. 


```r
# Use `read_html()`: to read HTML data from a url or character string into R.
url &lt;- "https://en.wikipedia.org/wiki/Hate_crime_laws_in_the_United_States"
h   &lt;- read_html(url)
h
```

```
#&gt; {html_document}
#&gt; &lt;html class="client-nojs" lang="en" dir="ltr"&gt;
#&gt; [1] &lt;head&gt;\n&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#&gt; [2] &lt;body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
```

---

- Note that the entire Wikipedia webpage is now contained in the `h` object:


```r
h
```

```
#&gt; {html_document}
#&gt; &lt;html class="client-nojs" lang="en" dir="ltr"&gt;
#&gt; [1] &lt;head&gt;\n&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#&gt; [2] &lt;body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
```

- The `h` object is an `xml_document` (stored internally as an `R` _list_), and its child nodes correspond to the basic structure of an `HTML` document. 
- Displaying the `h` object shows that the first child is `head` and the second child is `body`.  
- Note that these children hold the basic components of the `HTML` document, in other words, the _text, links_, and other HTML "stuff" that were scraped from the web page.  
- Specifically, the page content is found in the _body_ element of `h`.
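
---

- The head/body structure can be checked on any parsed document. A minimal sketch, assuming a made-up one-paragraph page rather than the Wikipedia page, using `xml_children()` and `xml_name()` from `xml2`:

```r
library(xml2)

# Hypothetical document parsed from a literal string
doc &lt;- read_html(
  "&lt;html&gt;&lt;head&gt;&lt;title&gt;Demo&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;p&gt;Hello&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;"
)

# class of the parsed object
doc_class &lt;- class(doc)

# names of the top-level children of the html element
child_names &lt;- xml_name(xml_children(doc))
```

- `child_names` lists the children in document order, so position 1 is the head and position 2 is the body, which is what `xml_child(h, 1)` and `xml_child(h, 2)` rely on.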

---


```r
library(xml2)
xml_child(h, 1)
```

```
#&gt; {html_node}
#&gt; &lt;head&gt;
#&gt;  [1] &lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;\n
#&gt;  [2] &lt;meta charset="UTF-8"&gt;\n
#&gt;  [3] &lt;title&gt;Hate crime laws in the United States - Wikipedia&lt;/title&gt;\n
#&gt;  [4] &lt;script&gt;document.documentElement.className="client-js";RLCONF={"wgBreakF ...
#&gt;  [5] &lt;script&gt;(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.o ...
#&gt;  [6] &lt;link rel="stylesheet" href="/w/load.php?lang=en&amp;amp;modules=ext.cite.st ...
#&gt;  [7] &lt;script async="" src="/w/load.php?lang=en&amp;amp;modules=startup&amp;amp;only=s ...
#&gt;  [8] &lt;meta name="ResourceLoaderDynamicStyles" content=""&gt;\n
#&gt;  [9] &lt;link rel="stylesheet" href="/w/load.php?lang=en&amp;amp;modules=site.styles ...
#&gt; [10] &lt;meta name="generator" content="MediaWiki 1.38.0-wmf.9"&gt;\n
#&gt; [11] &lt;meta name="referrer" content="origin"&gt;\n
#&gt; [12] &lt;meta name="referrer" content="origin-when-crossorigin"&gt;\n
#&gt; [13] &lt;meta name="referrer" content="origin-when-cross-origin"&gt;\n
#&gt; [14] &lt;meta name="format-detection" content="telephone=no"&gt;\n
#&gt; [15] &lt;meta property="og:title" content="Hate crime laws in the United States  ...
#&gt; [16] &lt;meta property="og:type" content="website"&gt;\n
#&gt; [17] &lt;link rel="preconnect" href="//upload.wikimedia.org"&gt;\n
#&gt; [18] &lt;link rel="alternate" media="only screen and (max-width: 720px)" href="/ ...
#&gt; [19] &lt;link rel="alternate" type="application/x-wiki" title="Edit this page" h ...
#&gt; [20] &lt;link rel="apple-touch-icon" href="/static/apple-touch/wikipedia.png"&gt;\n
#&gt; ...
```

---

```r
library(xml2)
xml_child(h, 2)
```

```
#&gt; {html_node}
#&gt; &lt;body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Hate_crime_laws_in_the_United_States rootpage-Hate_crime_laws_in_the_United_States skin-vector action-view skin-vector-legacy"&gt;
#&gt; [1] &lt;div id="mw-page-base" class="noprint"&gt;&lt;/div&gt;
#&gt; [2] &lt;div id="mw-head-base" class="noprint"&gt;&lt;/div&gt;
#&gt; [3] &lt;div id="content" class="mw-body" role="main"&gt;\n\t&lt;a id="top"&gt;&lt;/a&gt;\n\t&lt;di ...
#&gt; [4] &lt;div id="mw-data-after-content"&gt;\n\t&lt;div class="read-more-container"&gt;&lt;/di ...
#&gt; [5] &lt;div id="mw-navigation"&gt;\n\t&lt;h2&gt;Navigation menu&lt;/h2&gt;\n\t&lt;div id="mw-head" ...
#&gt; [6] &lt;footer id="footer" class="mw-footer" role="contentinfo"&gt;&lt;ul id="footer-i ...
#&gt; [7] &lt;script&gt;(RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgPageParseR ...
#&gt; [8] &lt;script type="application/ld+json"&gt;{"@context":"https:\\/\\/schema.org"," ...
#&gt; [9] &lt;script&gt;(RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgBackendRes ...
```
---
#### Extract a table
- Now, the question is "**how do we extract the table from the object `h`?**" 
- Remember that `HTML` code has a hierarchical tree structure. The different parts of an `HTML` document, usually marked by a name between `&lt;` and `&gt;`, are referred to as **nodes** (loosely, **tags**).
- When the information is stored in an `HTML table`, we can see this in the `HTML code` as `&lt;table&gt;` tags. 
- To extract a table from `h`, we need to gather all the `HTML` code within the `&lt;table&gt;` tags. 
- You can learn more about the `&lt;table&gt;` tag structure from [HTML documentation](https://www.w3schools.com/TAGS/tag_table.asp).

---
- The `rvest` package includes functions to extract nodes of an `HTML` document: the function `html_elements()` extracts all nodes that match a given selector, and `html_element()` extracts only the first match. To extract all tables we use:


```r
wiki_tables &lt;- h %&gt;% 
               html_elements("table")
```


```r
# note that the HTML source code currently contains 4 tables
# web pages are subject to change, so this number may differ later
wiki_tables
```

```
#&gt; {xml_nodeset (4)}
#&gt; [1] &lt;table class="box-Cite_check plainlinks metadata ambox ambox-content" rol ...
#&gt; [2] &lt;table class="wikitable"&gt;\n&lt;caption&gt;\n&lt;/caption&gt;\n&lt;tbody&gt;\n&lt;tr&gt;\n&lt;th&gt;Stat ...
#&gt; [3] &lt;table class="wikitable" style="margin: 1em auto 1em auto"&gt;\n&lt;caption&gt;\n&lt; ...
#&gt; [4] &lt;table class="wikitable" style="margin: 1em auto 1em auto"&gt;\n&lt;caption&gt;\n&lt; ...
```


- Now, instead of the entire web page, we have the `HTML` code for the **tables only**.

---
- But we want only the table titled "Victims per Year by Bias Motivation" on the page. 
- Looking at the output above, this table appears to be at **index** 3. To extract just the third table, the one with the data we are interested in, we can type the following:


```r
victim_table &lt;- wiki_tables %&gt;% .[3]
# subsetting with square brackets while piping: .[]
victim_table
```

```
#&gt; {xml_nodeset (1)}
#&gt; [1] &lt;table class="wikitable" style="margin: 1em auto 1em auto"&gt;\n&lt;caption&gt;\n&lt; ...
```
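
---

- Single and double square brackets behave differently on the result of `html_elements()`. The sketch below uses a made-up page with two tiny tables (an assumption for illustration, not the Wikipedia page) to show what each form returns and what `html_table()` does with it.

```r
library(rvest)

# Hypothetical page with two one-cell tables
page &lt;- read_html('
  &lt;table&gt;&lt;tr&gt;&lt;th&gt;x&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
  &lt;table&gt;&lt;tr&gt;&lt;th&gt;y&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
')
tables &lt;- page %&gt;% html_elements("table")

one_nodeset &lt;- tables[2]    # still a node set, of length 1
one_node    &lt;- tables[[2]]  # a single node

df_list &lt;- html_table(one_nodeset)  # a list holding one data frame
df      &lt;- html_table(one_node)     # the data frame itself
```

- With `[ ]` the table stays wrapped in a node set, so `html_table()` returns a list and one more extraction step is needed; with `[[ ]]` the data frame comes back directly.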

---

- We are not quite there yet because this is **not a data frame**. 
- In fact, `rvest` includes a function just for converting `HTML` tables into data frames:


```r
# html_table() returns a list of data frames; take its first component
victim_table_df &lt;- victim_table %&gt;% 
                      html_table()  %&gt;% .[[1]] 
```


```r
View(victim_table_df)
class(victim_table_df) # a tibble, which is also a data frame
```

---
- We are still not done because this is clearly not a **tidy data set**.


```r
str(victim_table_df)
```

- Set proper column names, replace "unknown" and
empty strings with NA, then remove the commas and convert the character columns to numeric.
 

```r
library(dplyr)

# Clean one column: "unknown" and empty strings become NA, then the
# thousands separators are removed and the values are converted to numeric.
clean_count &lt;- function(x) {
  x &lt;- as.character(x)  # html_table() may already have converted some columns
  x[x %in% c("unknown", "")] &lt;- NA
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

table_tidy &lt;- victim_table_df %&gt;%
  # the first column is the bias motive, the remaining 24 are the years
  setNames(c("Bias Motive", as.character(1995:2018))) %&gt;%
  # across() needs dplyr &gt;= 1.0.0; it replaces mutate_at() and the deprecated funs()
  mutate(across(-`Bias Motive`, clean_count))
# References:
# https://github.com/tidyverse/readxl/issues/572
# https://stackoverflow.com/questions/46787515/remove-commas-from-character-vectors-based-on-specific-column-names-in-r/46788523
```


```r
# not the desired format yet (some cells should be empty rather than NA), but let's continue
View(table_tidy)
```
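
---

- The cleaning steps are easier to follow on a handful of values. A minimal sketch in base `R`, assuming made-up strings of the kind an `HTML` table might contain (the vector `raw` is hypothetical):

```r
# Hypothetical raw values, as they might come out of an HTML table
raw &lt;- c("6,438", "956", "unknown", "")

# Treat "unknown" and empty strings as missing
raw[raw %in% c("unknown", "")] &lt;- NA

# Drop the thousands separators, then convert to numeric
clean &lt;- as.numeric(gsub(",", "", raw, fixed = TRUE))
```

- Marking the missing values first means `as.numeric()` has nothing left that it cannot parse, so the conversion runs without coercion warnings.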

---
- Finally, let's render the final version of the table:


```r
#More on HTML tables: https://haozhu233.github.io/kableExtra/awesome_table_in_html.html
library(kableExtra)
table_tidy %&gt;% 
  kbl() %&gt;%
  kable_paper() %&gt;%
  scroll_box(width = "1000px", height = "400px") #add a scroll-box
```

&lt;div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:400px; overflow-x: scroll; width:1000px; "&gt;&lt;table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'&gt;
 &lt;thead&gt;
  &lt;tr&gt;
   &lt;th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"&gt; Bias Motive &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 1995 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 1996 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 1997 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 1998 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 1999 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2000 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2001 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2002 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2003 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2004 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2005 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2006 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2007 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2008 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2009 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2010 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2011 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2012 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2013 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2014 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2015 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2016 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2017 &lt;/th&gt;
   &lt;th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"&gt; 2018 &lt;/th&gt;
  &lt;/tr&gt;
 &lt;/thead&gt;
&lt;tbody&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Race &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 6438 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 6994 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 6084 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5514 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5485 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5397 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5545 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4580 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4754 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5119 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4895 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5020 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4956 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4934 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4057 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 3949 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 3645 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 3467 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 3563 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 3227 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Race/Ethnicity/Ancestry &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4216 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4426 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5060 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5155 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Religion &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1617 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1535 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1586 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1720 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1686 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1699 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 2118 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1659 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1489 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1586 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1405 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1750 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1628 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1732 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1575 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1552 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1480 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1340 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1223 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1140 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1402 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1584 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1749 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1617 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Sexual Orientation &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1347 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1281 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1401 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1488 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1558 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1558 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1664 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1513 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1479 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1482 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1213 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1472 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1512 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1706 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1482 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1528 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1572 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1376 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1461 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1248 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1263 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1255 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1338 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1445 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Ethnicity/National Origin &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1044 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1207 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1132 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 956 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1040 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1216 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 2634 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1409 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1326 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1254 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1228 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1305 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1347 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1226 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1109 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1122 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 939 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 866 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 821 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 821 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Disability &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 12 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 27 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 23 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 36 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 37 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 50 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 43 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 73 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 54 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 95 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 84 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 85 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 99 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 48 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 61 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 102 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 99 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 96 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 88 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 77 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 160 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 179 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Gender &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 30 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 40 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 30 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 36 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 54 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 61 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Gender Identity &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; NA &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 33 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 109 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 122 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 131 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 132 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 189 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Single-Bias &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 10446 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 11017 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 10215 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9705 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9792 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9906 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 11998 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9211 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9091 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9514 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8795 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9642 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9527 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9683 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8322 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8199 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7697 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7151 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7230 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 6681 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7121 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7509 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8493 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8646 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Multiple-Bias &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 23 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 22 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 40 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 17 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 10 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 18 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 22 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 11 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 14 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 10 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 14 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 16 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 13 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 12 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 46 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 52 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 106 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 335 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 173 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Total &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 10469 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 11039 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 10255 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9722 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9802 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9924 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 12020 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9222 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9100 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9528 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8804 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9652 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9535 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 9691 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8336 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8208 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7713 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7164 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7242 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 6727 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7173 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 7615 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8828 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 8819 &lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

---
#### Extract Text

.panelset[

.panel[.panel-name[Data]
- Let's assume that you want to extract the following unordered list from the [US Department of Justice](https://www.justice.gov/hatecrimes/hate-crime-statistics) page:

```r
knitr::include_graphics('images/offense.png')
```

&lt;img src="images/offense.png" width="20%" height="100%" /&gt;
]

.panel[.panel-name[Code]

```r
results &lt;- read_html("https://www.justice.gov/hatecrimes/hate-crime-statistics")
names &lt;- results %&gt;% 
         html_elements("ul") %&gt;% .[10]  # ul is the unordered-list tag; keep the 10th list on the page
```


```r
names %&gt;% 
         html_text()
```

```
#&gt; [1] "Crimes against persons: 69.6%\n\t\t\t\tCrimes against property: 28.2%\n\t\t\t\tCrimes against society: 2.2%\n\t\t\t"
```

```r
# More work to do: use the stringr package or regular expressions to tidy up this text.
```
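
A minimal sketch of that tidy-up, assuming the scraped text has the same shape as the output above. The object names `raw_text`, `items`, and `offense` are illustrative only, and `=` is used for assignment here:

```r
library(stringr)

# Text copied from the html_text() output above
raw_text = "Crimes against persons: 69.6%\n\t\t\t\tCrimes against property: 28.2%\n\t\t\t\tCrimes against society: 2.2%\n\t\t\t"

# Split on line breaks, trim tabs and spaces, and drop empty pieces
items = str_trim(str_split(raw_text, "\n")[[1]])
items = items[items != ""]

# Separate each label from its percentage
offense = data.frame(
  category = str_remove(items, ":.*$"),
  percent  = as.numeric(str_extract(items, "[0-9.]+(?=%)"))
)
offense
```

The result is a small data frame with one row per list item, which is easier to plot or summarise than a single character string.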
]
]

---

#### Extract image URL
- Let's say we would like to import the image of "ortanca" (hydrangea) at https://www.bitkivt.itu.edu.tr/vt/report.php?sor=665 into `R`.
- This requires obtaining the image URL: http://www.bitkivt.itu.edu.tr/foto/Hydrangea_macrophylla_c%C4%B1cek.sem%C4%B1ha.jpg


```r
image  &lt;- read_html("https://www.bitkivt.itu.edu.tr/vt/report.php?sor=665")
```


```r
image_url &lt;- image %&gt;% 
              html_elements("img") %&gt;% .[3] %&gt;%  #we need third image
              html_attr("src") #get image url
```


```r
#library magick is for image editing (reading, writing, and joining).
library(magick)
magick::image_read(image_url)
```
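
The `src` attribute is not guaranteed to hold a full URL; some pages store a path relative to the page address. A minimal sketch, assuming a hypothetical relative path (`relative_src` is made up for illustration), of resolving it with `xml2::url_absolute()` before passing it to `magick::image_read()`:

```r
library(xml2)

# Address of the page the image was found on
page_url = "https://www.bitkivt.itu.edu.tr/vt/report.php?sor=665"

# Hypothetical relative path, used only for illustration
relative_src = "../foto/example.jpg"

# Resolve the relative path against the page address
full_url = url_absolute(relative_src, page_url)
full_url
```

A value that is already a full URL is returned unchanged, so the same call can be applied to every `src` value without checking its form first.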

---
# News

&lt;style type="text/css"&gt;
.pull-left {
  float: left;
  width: 50%;
}
.pull-right {
  float: right;
  width: 50%;
}
&lt;/style&gt;

.pull-left[
&lt;img src="images/wiki_ref.png" width="100%" height="100%" /&gt;
]

.pull-right[
&lt;img src="images/scribe_api.png" width="100%" height="100%" /&gt;

]

---
* [Scribe](https://misinfocon.com/scribes-reference-api-enables-users-to-access-wikipedia-references-b8f749bf60d1) says that: 

   * "We, therefore, started the Scribe credibility API. The goal was to make the Wikipedia references not only accessible to anyone but also queryable. We implemented this in two steps: (1) extracting Wikipedia references, and (2) setting up an API to query the references."

   * "We extract Wikipedia references from the Wikipedia dump and enrich it with Wikidata information, such as the entity ID in Wikidata. This data is saved as structured data in the database. We focus on online references, i.e., references that include a URL."

* YOUR TURN?
            
---

# Ethical considerations

- Legal Concerns:
  - If internet data is publicly available (e.g., tweets from a public Twitter account), it is **generally considered legal** to collect this data.
  - Research that involves human participants (e.g., surveys, interviews, blood draws) needs to be approved by the Institutional Ethics Committee.  

---

- "The İTÜ human research ethics committees consist of two separate boards: Social Sciences and Humanities Human Research (SB-INAREK) and Health and Engineering Sciences Human Research (SM-INAREK)."


.pull-left[
&lt;img src="images/etik1.png" width="90%" height="100%" /&gt;
[Source](http://sbinarek.itu.edu.tr/)
]

.pull-right[
&lt;img src="images/etik2.png" width="90%" height="100%" /&gt;

[Source](https://sminarek.itu.edu.tr/)
] 

---

- However, it is still not certain whether research on publicly available internet data requires Institutional Ethics Committee approval.
  
- User Ethics:
  - According to [this source](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/User-Ethics-Legal-Concerns.html):  
"Just because something is legal does not mean it is ethical. Collecting, sharing, and publishing internet data created by or about individuals can lead to unwanted public scrutiny, harm, and other negative consequences for those individuals. There is no single, simple answer to the many difficult questions raised by internet data collection. It is important to develop an ethical framework that responds to the specifics of your particular research project or use case (e.g., the platform, the people involved, the context, the potential consequences, etc.)."  


---

- **Hands-on example:** See the `01-web_scraping.Rmd` file for an example of harvesting data from Craigslist.

&lt;img src="images/craiglist.png" width="90%" /&gt;

---

- More on web scraping:
  - https://www.r-bloggers.com/2020/01/web-scraping-with-rvest-astro-throwback/
  - https://www.storybench.org/scraping-html-tables-and-downloading-files-with-r/
  - https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html
  

---
# Attributions

- [rvest](https://rvest.tidyverse.org/articles/rvest.html).
- [Data Science Labs](https://raw.githubusercontent.com/datasciencelabs/2020/master/03_wrangling/06_web-scraping.Rmd).
- [Ethics](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/User-Ethics-Legal-Concerns.html).
- [CSS Selectors](https://raw.githubusercontent.com/gulinan/lectures/master/06-web-css/06-web-css.Rmd).


    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false,
"ratio": "16:9",
"navigation": {
"scroll": false
}
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();
(function() {
  "use strict"
  // Replace <script> tags in slides area to make them executable
  var scripts = document.querySelectorAll(
    '.remark-slides-area .remark-slide-container script'
  );
  if (!scripts.length) return;
  for (var i = 0; i < scripts.length; i++) {
    var s = document.createElement('script');
    var code = document.createTextNode(scripts[i].textContent);
    s.appendChild(code);
    var scriptAttrs = scripts[i].attributes;
    for (var j = 0; j < scriptAttrs.length; j++) {
      s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
    }
    scripts[i].parentElement.replaceChild(s, scripts[i]);
  }
})();
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
  const hlines = d.querySelectorAll('.remark-code-line-highlighted');
  const preParents = [];
  const findPreParent = function(line, p = 0) {
    if (p > 1) return null; // traverse up no further than grandparent
    const el = line.parentElement;
    return el.tagName === "PRE" ? el : findPreParent(el, ++p);
  };

  for (let line of hlines) {
    let pre = findPreParent(line);
    if (pre && !preParents.includes(pre)) preParents.push(pre);
  }
  preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>