<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>MAT381E-Week 8: Introduction to Web Scraping</title>
<meta charset="utf-8" />
<meta name="author" content="Gül İnan" />
<meta name="date" content="2021-11-28" />
<script src="intro_webscraping_files/header-attrs-2.11/header-attrs.js"></script>
<link href="intro_webscraping_files/remark-css-0.0.1/default.css" rel="stylesheet" />
<script src="intro_webscraping_files/fabric-4.3.1/fabric.min.js"></script>
<link href="intro_webscraping_files/xaringanExtra-scribble-0.0.1/scribble.css" rel="stylesheet" />
<script src="intro_webscraping_files/xaringanExtra-scribble-0.0.1/scribble.js"></script>
<script>document.addEventListener('DOMContentLoaded', function() { window.xeScribble = new Scribble({"pen_color":["#FF0000"],"pen_size":3,"eraser_size":30,"palette":[]}) })</script>
<link href="intro_webscraping_files/panelset-0.2.6/panelset.css" rel="stylesheet" />
<script src="intro_webscraping_files/panelset-0.2.6/panelset.js"></script>
<script src="intro_webscraping_files/kePrint-0.0.1/kePrint.js"></script>
<link href="intro_webscraping_files/lightable-0.0.1/lightable.css" rel="stylesheet" />
<link rel="stylesheet" href="xaringan-themer.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: left, middle, my-title, title-slide
# MAT381E-Week 8: Introduction to Web Scraping
### Gül İnan
### Department of Mathematics<br/>Istanbul Technical University
### November 28, 2021
---
class: left
# Homework I review
- Turn off warnings and messages in code chunks; they clutter the rendered document.
- Do not print an entire large data set; show only a piece of it.
- Do not use the `View()` function in homework/reports, since it forces another window to open.
- `library(tidyverse)` already loads `ggplot2` and the other core packages. Loading them again one by one suggests that you do not know the tidyverse ecosystem well.
- Please add short comments as needed. The reader should not have to guess what you are doing; you need to guide the reader.
- Present a well-organized homework/report. This is a sign of how much you respect your readers.
- Please use the functions of data science related packages for mathematical operations.
- Please prefer piping where appropriate; it increases the code's readability.
- Please pay attention to your project folder design. Keep data files under a data folder, image files under an image folder, etc.
- As in everything, how you present something matters as much as what you have done.
---
class: left
<!-- First code block is setting options for theme of the slides -->
# Outline
* Motivation.
* What is `Web Scraping`?
* `HTML` basics.
* Web scraping with `rvest` package.
* Ethical issues.
* 01-web_scraping.Rmd.
---
# Motivation
<style type="text/css">
.pull-left {
float: left;
width: 50%;
}
.pull-right {
float: right;
width: 50%;
}
</style>
.pull-left[
<img src="images/hatem_crime_stat.jpeg" width="90%" height="100%" />
[Source](https://www.statista.com/chart/24442/anti-asian-hate-crime/)
]
.pull-right[
* "A survey of police reports by the [Center for the Study of Hate and Extremism at California State University](https://www.csusb.edu/sites/default/files/FACT%20SHEET-%20Anti-Asian%20Hate%202020%203.2.21.pdf) confirmed that racially motivated crimes against those of Asian descent in the U.S. have risen in the pandemic year of 2020. **While hate crimes against Asians still make up a smaller fraction of all hate crimes reported in America’s 15 largest cities, their number rose from 49 in 2019 to 122 in 2020.**"
* "Separate reports released by the [Stop AAPI Hate](https://stopaapihate.org/) reporting center confirm that attacks on Asians were highest in the early days of the pandemic, but also show that they have been rising again lately."
* "[Stop AAPI Hate](https://stopaapihate.org/) said yesterday that verbal harassment was the most common incident recorded by them at 68 percent of all cases, followed by deliberate shunning (20 percent of cases) and physical attacks (11 percent of cases)."
]
---
# What is a hate crime?
- According to the [US Department of Justice](https://www.justice.gov/hatecrimes/learn-about-hate-crimes/chart): A hate crime is a crime committed on the basis of the victim’s perceived or actual race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- The [US Department of Justice](https://www.justice.gov/hatecrimes/learn-about-hate-crimes/chart) adds: "Hate crimes have a broader effect than most other kinds of crime. Hate crime victims include not only the crime’s immediate target but also others like them. Hate crimes affect families, communities, and at times, the entire nation, as **others fear that they too could be threatened, attacked, or forced from their homes, because of what they look like, who they are, where they worship, whom they love, or whether they have a disability.**"
[Source](https://www.usatoday.com/story/news/politics/2021/03/18/hate-crime-attacks-georgia-raise-motive-bias-questions/4739328001/)
---
# Why report hate crimes?
- According to the [US Department of Justice](https://www.justice.gov/hatecrimes/learn-about-hate-crimes/chart): "The Hate Crimes Reporting Gap is the **significant disparity** between hate crimes that actually occur and those reported to law enforcement. It is **critical to report hate crimes** not only to show support and get help for victims, but also to send a clear message that the community will not tolerate these kinds of crimes. Reporting hate crimes allows communities and law enforcement to fully understand the scope of the problem in a community and put resources toward preventing and addressing attacks based on bias and hate."
---
# Lacking Hate Crime Data
<style type="text/css">
.pull-left {
float: left;
width: 50%;
}
.pull-right {
float: right;
width: 50%;
}
</style>
.pull-left[
<img src="images/atlanta-hate.png" width="90%" height="100%" />
[Source](https://www.theguardian.com/us-news/datablog/2021/mar/20/asian-american-hate-crime-data-mona-chalabi?utm_source=dlvr.it&utm_medium=twitter)
]
.pull-right[
* "This, of course, ignores the possibility that someone might be motivated by racial hatred and sexism."
* "Unfortunately, most statistics make the same assumption. Hate crime data that is gathered by the FBI is often categorized according to **a single motivation** (such as religion, sexual orientation, race/ethnicity, gender identity). Less than 3% of the hate crimes that were reported in 2019 recorded **multiple biases.**"
* "**Reality is obviously much more complex than these numbers capture.** Things get even more complicated when you consider reporting rates. A person’s race and gender identity will affect the likelihood that they will report a hate crime to the police."
]
---
# Motivating Data
- The data we need to answer a question may not always come in a spreadsheet and be ready for us to read. Sometimes, data can be available on the web.
- For example, the following [Wikipedia page](https://en.wikipedia.org/wiki/Hate_crime_laws_in_the_United_States) presents **Hate crime statistics by bias motivation in the US** in an `HTML` table:
<img src="images/wiki_hate1.png" width="100%" />
---
# Web Scraping
- **Web scraping** or **web harvesting** is the process of extracting data from a website.
- **Web pages** are written in a **text** format using **HyperText Markup Language** (HTML) code.
- **Web browsers** then render this code so that the page can be viewed.
- To see the `HTML` source code of a web page, visit the page in a _browser_ and use the _View Page Source_ tool.
- Because `HTML` code is accessible, we can download the `HTML` file, import it into `R`, and then write `R` code to extract the information we need from the page.
---
- To get an idea of how `HTML` code works, here we show a few lines of code from the [Wikipedia page](https://en.wikipedia.org/wiki/Hate_crime_laws_in_the_United_States) that provides information on US hate statistics:
--
<img src="images/left.png" width="100%" height="100%" />
--
<img src="images/right.png" width="100%" height="100%" />
---
- Once we look at the full `HTML` source code, we can see the text and data alongside the `HTML` tags.
- We can also see **a pattern** in how the data is stored. If we know `HTML`, we can write programs that use these patterns to extract what we want.
- We can also take advantage of Cascading Style Sheets (CSS), a language widely used to make web pages look "pretty".
---
# HTML basics
- All `HTML` documents must start with a document type declaration: `<!DOCTYPE html>`.
- Every `HTML` page must be contained in an `<html>` element, which must have **two children**: `<head>`, which contains document metadata such as the page title and author, and `<body>`, which contains the content you see in the browser.
.pull-left[
```html
<!DOCTYPE html>
<html>
<head>
<title>Page title</title>
</head>
<body>
<h1> Welcome to İTÜ! </h1>
<p>Some text &amp; <b>some bold text.</b>
<i> Some italic text </i> </p>
<a href="http://kutuphane.itu.edu.tr/">Visit İTÜ Library</a> for:
<ol>
<li>Calculus Books</li>
<li>Engineering Books</li>
<li>Statistics Books</li>
</ol>
</body>
</html>
```
]
--
.pull-right[
* Each `HTML` element consists of a start tag (e.g. `<tag>`), optional attributes (`id='first'`), an end tag (like `</tag>`), and contents (everything in between the start and end tag). Elements nest inside one another, which gives the page its hierarchical structure.
* Block tags like `<h1>` (top-level heading), `<p>` (paragraph), `<ol>` (ordered list), and `<li>` (list item) form the overall structure of the page.
* Inline tags like `<b>` (bold), `<i>` (italics), and `<a>` (links) format text inside block tags.
* On the left: The `<a>` tag defines a hyperlink. The `href` **attribute specifies the URL of the page the link goes to**.
]
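---
- The nesting of block and inline tags can be checked from `R`. The following is a minimal sketch, assuming the `rvest` and `xml2` packages are installed; the page below is a trimmed-down, hypothetical copy of the example, passed to `read_html()` as a literal string.

```r
library(rvest)

# A shortened copy of the example page (illustrative content only)
page <- read_html(
  "<!DOCTYPE html>
   <html>
     <head><title>Page title</title></head>
     <body>
       <h1>Welcome</h1>
       <p>Some text &amp; <b>some bold text.</b></p>
       <ol><li>Calculus Books</li><li>Engineering Books</li></ol>
     </body>
   </html>"
)

# Block-level children of <body>, in document order
body_tags <- page %>% html_element("body") %>% html_children() %>% html_name()

# The inline <b> element sits inside the block-level <p> element
bold_parent <- page %>% html_element("b") %>% xml2::xml_parent() %>% html_name()

# The contents of every list item
items <- page %>% html_elements("li") %>% html_text()

body_tags
bold_parent
items
```

- `html_children()` returns only element nodes, so the whitespace between tags does not show up in `body_tags`.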
---
- Note: Since `<` and `>` are used for start and end tags, we cannot write them directly in the text of a page.
- Instead we have to use the `HTML` escapes `&gt;` (greater than) and `&lt;` (less than).
- And of course, since those escapes use `&`, if we want a literal ampersand we have to escape it as `&amp;`.
- If you encounter a tag that you have never seen before, you can find out what it does at [W3Schools](https://www.w3schools.com/tags/).
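---
- When a page is parsed, these escapes are converted back to the characters they stand for. A small sketch, assuming `rvest` (version 1.0.0 or later, for `minimal_html()`) is installed; the paragraph text is made up for illustration.

```r
library(rvest)

# The source contains escapes; the extracted text contains the literal characters
snippet <- minimal_html("<p>1 &lt; 2 &amp; 3 &gt; 2</p>")

decoded <- snippet %>%
  html_element("p") %>%
  html_text()

decoded
```

- In practice this means scraped text rarely needs manual un-escaping.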
---
- Let's try out our `HTML` code at [W3Schools](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_default):
<img src="images/www3.png" width="100%" height="100%" />
- More on [HTML](https://www.w3schools.com/html/html_headings.asp).
---
- Some elements, like `<img>` cannot have children. These elements depend solely on **attributes for their behavior**.
```html
<img src='logo/rvest.jpg' width="400" height="400">
```
- Here, the `src` attribute specifies the path (URL) to the image; the `width` and `height` attributes define the width and height of the image in **pixels**.
--
<img src='logo/rvest.jpg' width="400" height="400">
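---
- Since an `<img>` element carries all of its information in attributes, reading it from `R` means reading attribute values. The sketch below assumes `rvest` (1.0.0 or later) is installed and re-uses the `<img>` tag from the previous slide inside a throwaway document.

```r
library(rvest)

img <- minimal_html("<img src='logo/rvest.jpg' width='400' height='400'>") %>%
  html_element("img")

# A single attribute, returned as a character string
src <- html_attr(img, "src")

# Attribute values are always strings, so convert when a number is needed
width_px <- as.numeric(html_attr(img, "width"))

# An attribute that is not present comes back as NA
alt <- html_attr(img, "alt")

# <img> has no children: everything lives in the attributes
n_children <- length(html_children(img))

html_attrs(img)  # all attributes at once, as a named character vector
```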
---
# Named attributes
- Sometimes, the start tags of `HTML` elements can have **named attributes** which look like `<tag name1='value1'> Content </tag>`.
- Two of the most important named attributes are `id` and `class`, which are used in conjunction with `CSS` to **control the visual appearance** of the page. These are often useful when scraping data off a page.
- Note that attributes are always specified in the start tag.
---
#### id attribute
- The `id` attribute gives an element a unique identifier that a style declaration (for example, in a **style element within head**) can point to. The value of the `id` attribute must be **unique** within the `HTML` document.
- In `CSS`, the syntax for selecting an `id` is: write a hash character (`#`), followed by the `id` name. Then, define the CSS properties within curly braces `{}`.
<img src='images/idattribute.png' height="400">
[Source1](https://www.w3schools.com/html/html_id.asp) and [Source2](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_id_css)
---
#### class attribute
- The `class` attribute is often used to point to a class name in a style sheet. Multiple `HTML` elements can share the same class.
- In `CSS`, the syntax for selecting a `class` is: write a period character (`.`), followed by the class name. Then, define the CSS properties within curly braces `{}`.
<img src='images/classattribute.png' height="400">
[Source1](https://www.w3schools.com/html/html_classes.asp) and [Source2](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_classes_capitals)
---
- Note that the main difference between the `id` and `class` attributes is that an `id` is unique within a page and can apply to **at most one HTML element**, while a `class` can be applied to **multiple HTML elements**.
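---
- The same `#id` and `.class` notation can be used as a selector when scraping. A minimal sketch, assuming `rvest` (1.0.0 or later) is installed; the page content and the names `myHeader` and `city` are hypothetical.

```r
library(rvest)

page <- minimal_html('
  <h1 id="myHeader">My Cities</h1>
  <p class="city">London</p>
  <p class="city">Paris</p>
  <p>Not a city</p>
')

# "#name" matches the single element with that id
by_id <- page %>% html_elements("#myHeader") %>% html_text()

# ".name" matches every element that carries that class
by_class <- page %>% html_elements(".city") %>% html_text()

# A tag name and a class can be combined: <p> elements of class "city"
by_both <- page %>% html_elements("p.city") %>% html_text()

by_id
by_class
```

- The paragraph without a class is skipped by `.city`, which is why classes are often the most convenient handle on repeated items.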
---
class: center, middle
# Rvest
<!-- Import image with HTML code, dimensions are in terms of pixel -->
<img src='logo/rvest.jpg' height="400">
---
# The rvest package
- The [rvest package](https://rvest.tidyverse.org/articles/rvest.html) provides web harvesting tools within the [tidyverse](https://www.tidyverse.org/packages/) ecosystem.
```r
# rvest is not within the core tidyverse ecosystem
# library(tidyverse) will not load rvest package
# load rvest package by library(rvest) call specifically
library(rvest)
```
- The [rvest manual](https://cran.r-project.org/web/packages/rvest/rvest.pdf) tells us that it depends on a few other packages including `xml2`. This enables us to use functions available in these packages as well.
|Function |Description |
|----------------|----------------------------------------------|
| `read_html()` |takes a string that can be a path, a URL, or literal HTML, and creates an HTML document from it.|
---
- Here are basic `rvest` functions:
|Function |Description |
|-------------------|----------------------------------------------|
| `html_elements()` |select all elements matching a CSS selector (for example, a tag name) from the HTML document.|
| `html_table()` |extract table, to be used after `html_elements()`. |
| `html_text()` |extract text within tags, to be used after `html_elements()`.|
| `html_attr()` |extract the value of attribute, to be used after `html_elements()`.|
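---
- The functions in the table are typically chained together. The following sketch, assuming `rvest` (1.0.0 or later) is installed, applies each of them to a small made-up page rather than a live website.

```r
library(rvest)

page <- minimal_html('
  <p>Visit the <a href="https://rvest.tidyverse.org/">rvest site</a>.</p>
  <table>
    <tr><th>package</th><th>role</th></tr>
    <tr><td>xml2</td><td>parsing</td></tr>
    <tr><td>rvest</td><td>scraping</td></tr>
  </table>
')

# html_elements(): select nodes with a CSS selector
links <- page %>% html_elements("a")

# html_text(): the text between the start and end tags
link_text <- html_text(links)

# html_attr(): the value of one attribute
link_url <- html_attr(links, "href")

# html_table(): one data frame per selected <table>, returned as a list
tables <- page %>% html_elements("table") %>% html_table()

link_text
link_url
tables[[1]]
```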
---
- The first step in using this package is to import the web page you are interested in into `R`.
```r
# Use `read_html()`: to read HTML data from a url or character string into R.
url <- "https://en.wikipedia.org/wiki/Hate_crime_laws_in_the_United_States"
h <- read_html(url)
h
```
```
#> {html_document}
#> <html class="client-nojs" lang="en" dir="ltr">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
```
---
- Note that the entire Wikipedia webpage is now contained in the `h` object:
```r
h
```
```
#> {html_document}
#> <html class="client-nojs" lang="en" dir="ltr">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
```
- The `h` object has class `xml_document`, which `R` stores internally as a _list_, and the items shown when printing `h` correspond to the basic structure of an `HTML` document.
- Displaying the `h` object shows that the first item is `head` and the second item is `body`.
- Note that these items include the basic components of the `HTML` document, in other words, the _text_, _links_, and other HTML "stuff" that was scraped from the web page.
- Specifically, the content we are after is found in the _body_ element of `h`.
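---
- The same structure can be explored without an internet connection, because `read_html()` also accepts literal HTML. A minimal sketch, assuming `rvest` and `xml2` are installed; the document below is a hypothetical stand-in for the Wikipedia page.

```r
library(rvest)
library(xml2)

doc <- read_html(
  "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
)

# The parsed page is an xml_document
doc_class <- class(doc)

# Its top-level children are <head> and <body>
child_names <- xml_name(xml_children(doc))

# xml_child() picks one child by position; the second one is <body>
second_child <- xml_name(xml_child(doc, 2))

# Metadata such as the title lives in <head>
page_title <- doc %>% html_element("title") %>% html_text()

doc_class
child_names
```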
---
```r
library(xml2)
xml_child(h, 1)
```
```
#> {html_node}
#> <head>
#> [1] <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n
#> [2] <meta charset="UTF-8">\n
#> [3] <title>Hate crime laws in the United States - Wikipedia</title>\n
#> [4] <script>document.documentElement.className="client-js";RLCONF={"wgBreakF ...
#> [5] <script>(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.o ...
#> [6] <link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=ext.cite.st ...
#> [7] <script async="" src="/w/load.php?lang=en&amp;modules=startup&amp;only=s ...
#> [8] <meta name="ResourceLoaderDynamicStyles" content="">\n
#> [9] <link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=site.styles ...
#> [10] <meta name="generator" content="MediaWiki 1.38.0-wmf.9">\n
#> [11] <meta name="referrer" content="origin">\n
#> [12] <meta name="referrer" content="origin-when-crossorigin">\n
#> [13] <meta name="referrer" content="origin-when-cross-origin">\n
#> [14] <meta name="format-detection" content="telephone=no">\n
#> [15] <meta property="og:title" content="Hate crime laws in the United States ...
#> [16] <meta property="og:type" content="website">\n
#> [17] <link rel="preconnect" href="//upload.wikimedia.org">\n
#> [18] <link rel="alternate" media="only screen and (max-width: 720px)" href="/ ...
#> [19] <link rel="alternate" type="application/x-wiki" title="Edit this page" h ...
#> [20] <link rel="apple-touch-icon" href="/static/apple-touch/wikipedia.png">\n
#> ...
```
---
```r
library(xml2)
xml_child(h, 2)
```
```
#> {html_node}
#> <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Hate_crime_laws_in_the_United_States rootpage-Hate_crime_laws_in_the_United_States skin-vector action-view skin-vector-legacy">
#> [1] <div id="mw-page-base" class="noprint"></div>
#> [2] <div id="mw-head-base" class="noprint"></div>
#> [3] <div id="content" class="mw-body" role="main">\n\t<a id="top"></a>\n\t<di ...
#> [4] <div id="mw-data-after-content">\n\t<div class="read-more-container"></di ...
#> [5] <div id="mw-navigation">\n\t<h2>Navigation menu</h2>\n\t<div id="mw-head" ...
#> [6] <footer id="footer" class="mw-footer" role="contentinfo"><ul id="footer-i ...
#> [7] <script>(RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgPageParseR ...
#> [8] <script type="application/ld+json">{"@context":"https:\\/\\/schema.org"," ...
#> [9] <script>(RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgBackendRes ...
```
---
#### Extract a table
- Now, the question is "**how do we extract the table from the object `h`?**"
- Remember that `HTML` code has a hierarchical tree structure. The different parts of an `HTML` document, each delimited by tags written between `<` and `>`, are referred to as **nodes** (often loosely called **tags**).
- When the information is stored in an `HTML` table, we can recognize it in the `HTML` code by its `<table>` tags.
- To extract a table from `h`, we need to gather all the `HTML` code within the `<table>` tags in `h`.
- You can learn more about the `<table>` tag structure from [HTML documentation](https://www.w3schools.com/TAGS/tag_table.asp).
---
- The `rvest` package includes functions to extract nodes of an `HTML` document: `html_elements()` extracts all nodes that match a selector, while `html_element()` extracts only the first match. To extract all tables we use:
```r
wiki_tables <- h %>%
html_elements("table")
```
```r
# note that the HTML source code currently contains 4 tables
# pages are subject to change, so this number may differ later
wiki_tables
```
```
#> {xml_nodeset (4)}
#> [1] <table class="box-Cite_check plainlinks metadata ambox ambox-content" rol ...
#> [2] <table class="wikitable">\n<caption>\n</caption>\n<tbody>\n<tr>\n<th>Stat ...
#> [3] <table class="wikitable" style="margin: 1em auto 1em auto">\n<caption>\n< ...
#> [4] <table class="wikitable" style="margin: 1em auto 1em auto">\n<caption>\n< ...
```
- Now, instead of the entire web page, we just have the `HTML` code for the **tables only**.
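---
- The difference between `html_elements()` and `html_element()` is easiest to see side by side. The sketch below assumes `rvest` (1.0.0 or later) is installed and uses a made-up page with two tiny tables.

```r
library(rvest)

page <- minimal_html('
  <table class="wikitable"><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table class="wikitable"><tr><th>b</th></tr><tr><td>2</td></tr></table>
')

# All matches: a node set with one entry per <table>
all_tables <- page %>% html_elements("table")

# First match only: a single node
first_table <- page %>% html_element("table")

# When nothing matches, the two functions behave differently
none_many <- page %>% html_elements("video")  # empty node set
none_one  <- page %>% html_element("video")   # a "missing" node

length(all_tables)
class(first_table)
length(none_many)
class(none_one)
```

- Checking `length()` after `html_elements()` is a cheap way to notice that a selector no longer matches anything on a page that has changed.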
---
- But we want the table titled "Victims per Year by Bias Motivation" on the page.
- Looking at the output above it looks like the **table index** is [3]. To extract just the third table - the table with the data we are interested in - we can type the following:
```r
victim_table <- wiki_tables %>% .[3]
# subsetting with square brackets while piping: .[]
victim_table
```
```
#> {xml_nodeset (1)}
#> [1] <table class="wikitable" style="margin: 1em auto 1em auto">\n<caption>\n< ...
```
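---
- Because pages change, a position such as `[3]` can silently start pointing at a different table. One possible alternative, sketched here under the assumption that every table has a `<caption>`, is to select the table by its caption text. The page below is a made-up stand-in, and the first caption is hypothetical.

```r
library(rvest)

page <- minimal_html('
  <table><caption>Some other table</caption>
    <tr><th>State</th></tr><tr><td>Example</td></tr></table>
  <table><caption>Victims per Year by Bias Motivation</caption>
    <tr><th>Bias Motive</th><th>1995</th></tr>
    <tr><td>Religion</td><td>1,617</td></tr></table>
')

tables <- page %>% html_elements("table")

# One caption per table, in the same order as the tables
captions <- tables %>%
  html_element("caption") %>%
  html_text(trim = TRUE)

# Keep only the table whose caption contains the title we are after
victim_table <- tables[grepl("Victims per Year", captions, fixed = TRUE)]

captions
length(victim_table)
```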
---
- We are not quite there yet because this is **not a data frame**.
- In fact, `rvest` includes a function just for converting `HTML` tables into data frames:
```r
# html_table() returns a list of tables; take its first component
victim_table_df <- victim_table %>%
html_table() %>% .[[1]]
```
```r
View(victim_table_df)
class(victim_table_df) #returns a data frame
```
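---
- What `html_table()` returns can be checked on a small table first. A minimal sketch, assuming `rvest` (1.0.0 or later) is installed; the table re-uses a few values from the Wikipedia table, written with thousands separators as on the page.

```r
library(rvest)

nodes <- minimal_html('
  <table>
    <tr><th>Bias Motive</th><th>1995</th><th>1996</th></tr>
    <tr><td>Religion</td><td>1,617</td><td>1,535</td></tr>
    <tr><td>Sexual Orientation</td><td>1,347</td><td>1,281</td></tr>
  </table>
') %>% html_elements("table")

# On a node set, html_table() returns a list with one data frame per table
as_list <- html_table(nodes)
df <- as_list[[1]]

# On a single node, it returns the data frame directly
single <- html_table(nodes[[1]])

dim(df)
str(df)
```

- The year columns come back as character because of the commas, which is why a cleaning step is still needed.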
---
- We are still not done because this is clearly not a **tidy data set**.
```r
str(victim_table_df)
```
- Rename the columns properly, turn "unknown" and empty strings into `NA`, then remove the commas and convert the character variables to numeric.
```r
library(dplyr)
table_tidy <- victim_table_df %>%
  # rename the columns: the bias motive first, then one column per year
  setNames(c("Bias Motive", as.character(1995:2018))) %>%
  # remove the thousands separators, then convert the year columns to numeric;
  # entries that are not numbers (e.g. "unknown" or "") become NA with a warning
  mutate(across(-`Bias Motive`, ~ as.numeric(gsub(",", "", .x))))
# removing commas from selected columns:
# https://stackoverflow.com/questions/46787515/remove-commas-from-character-vectors-based-on-specific-column-names-in-r/46788523
```r
# not yet the desired format (some cells should be empty rather than NA), but let's continue
View(table_tidy)
```
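---
- The cleaning step can be tried on a toy table before running it on the scraped one. The sketch below assumes `dplyr` (1.0.0 or later) is installed; `to_number()` is a hypothetical helper written for this slide, and the "unknown" and empty cells are made-up placeholders for missing values.

```r
library(dplyr)

# Toy input that mimics the raw scraped columns (all character)
raw <- tibble(
  motive = c("Religion", "Disability"),
  `1995` = c("1,617", "unknown"),
  `1996` = c("1,535", "")
)

# Hypothetical helper: mark missing values explicitly, drop commas, convert
to_number <- function(x) {
  x <- as.character(x)
  x[x %in% c("unknown", "")] <- NA
  as.numeric(gsub(",", "", x))
}

tidy <- raw %>%
  mutate(across(-motive, to_number))

tidy
```

- Setting the missing markers to `NA` before calling `as.numeric()` avoids the coercion warning and makes the intent explicit.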
---
- Finally, let's take a look at the final table.
```r
#More on HTML tables: https://haozhu233.github.io/kableExtra/awesome_table_in_html.html
library(kableExtra)
table_tidy %>%
kbl() %>%
kable_paper() %>%
scroll_box(width = "1000px", height = "400px") #add a scroll-box
```
<div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:400px; overflow-x: scroll; width:1000px; "><table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
<thead>
<tr>
<th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Bias Motive </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 1995 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 1996 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 1997 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 1998 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 1999 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2000 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2001 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2002 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2003 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2004 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2005 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2006 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2007 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2008 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2009 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2010 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2011 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2012 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2013 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2014 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2015 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2016 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2017 </th>
<th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> 2018 </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> Race </td>
<td style="text-align:right;"> 6438 </td>
<td style="text-align:right;"> 6994 </td>
<td style="text-align:right;"> 6084 </td>
<td style="text-align:right;"> 5514 </td>
<td style="text-align:right;"> 5485 </td>
<td style="text-align:right;"> 5397 </td>
<td style="text-align:right;"> 5545 </td>
<td style="text-align:right;"> 4580 </td>
<td style="text-align:right;"> 4754 </td>
<td style="text-align:right;"> 5119 </td>
<td style="text-align:right;"> 4895 </td>
<td style="text-align:right;"> 5020 </td>
<td style="text-align:right;"> 4956 </td>
<td style="text-align:right;"> 4934 </td>
<td style="text-align:right;"> 4057 </td>
<td style="text-align:right;"> 3949 </td>
<td style="text-align:right;"> 3645 </td>
<td style="text-align:right;"> 3467 </td>
<td style="text-align:right;"> 3563 </td>
<td style="text-align:right;"> 3227 </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
</tr>
<tr>
<td style="text-align:left;"> Race/Ethnicity/Ancestry </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> 4216 </td>
<td style="text-align:right;"> 4426 </td>
<td style="text-align:right;"> 5060 </td>
<td style="text-align:right;"> 5155 </td>
</tr>
<tr>
<td style="text-align:left;"> Religion </td>
<td style="text-align:right;"> 1617 </td>
<td style="text-align:right;"> 1535 </td>
<td style="text-align:right;"> 1586 </td>
<td style="text-align:right;"> 1720 </td>
<td style="text-align:right;"> 1686 </td>
<td style="text-align:right;"> 1699 </td>
<td style="text-align:right;"> 2118 </td>
<td style="text-align:right;"> 1659 </td>
<td style="text-align:right;"> 1489 </td>
<td style="text-align:right;"> 1586 </td>
<td style="text-align:right;"> 1405 </td>
<td style="text-align:right;"> 1750 </td>
<td style="text-align:right;"> 1628 </td>
<td style="text-align:right;"> 1732 </td>
<td style="text-align:right;"> 1575 </td>
<td style="text-align:right;"> 1552 </td>
<td style="text-align:right;"> 1480 </td>
<td style="text-align:right;"> 1340 </td>
<td style="text-align:right;"> 1223 </td>
<td style="text-align:right;"> 1140 </td>
<td style="text-align:right;"> 1402 </td>
<td style="text-align:right;"> 1584 </td>
<td style="text-align:right;"> 1749 </td>
<td style="text-align:right;"> 1617 </td>
</tr>
<tr>
<td style="text-align:left;"> Sexual Orientation </td>
<td style="text-align:right;"> 1347 </td>
<td style="text-align:right;"> 1281 </td>
<td style="text-align:right;"> 1401 </td>
<td style="text-align:right;"> 1488 </td>
<td style="text-align:right;"> 1558 </td>
<td style="text-align:right;"> 1558 </td>
<td style="text-align:right;"> 1664 </td>
<td style="text-align:right;"> 1513 </td>
<td style="text-align:right;"> 1479 </td>
<td style="text-align:right;"> 1482 </td>
<td style="text-align:right;"> 1213 </td>
<td style="text-align:right;"> 1472 </td>
<td style="text-align:right;"> 1512 </td>
<td style="text-align:right;"> 1706 </td>
<td style="text-align:right;"> 1482 </td>
<td style="text-align:right;"> 1528 </td>
<td style="text-align:right;"> 1572 </td>
<td style="text-align:right;"> 1376 </td>
<td style="text-align:right;"> 1461 </td>
<td style="text-align:right;"> 1248 </td>
<td style="text-align:right;"> 1263 </td>
<td style="text-align:right;"> 1255 </td>
<td style="text-align:right;"> 1338 </td>
<td style="text-align:right;"> 1445 </td>
</tr>
<tr>
<td style="text-align:left;"> Ethnicity/National Origin </td>
<td style="text-align:right;"> 1044 </td>
<td style="text-align:right;"> 1207 </td>
<td style="text-align:right;"> 1132 </td>
<td style="text-align:right;"> 956 </td>
<td style="text-align:right;"> 1040 </td>
<td style="text-align:right;"> 1216 </td>
<td style="text-align:right;"> 2634 </td>
<td style="text-align:right;"> 1409 </td>
<td style="text-align:right;"> 1326 </td>
<td style="text-align:right;"> 1254 </td>
<td style="text-align:right;"> 1228 </td>
<td style="text-align:right;"> 1305 </td>
<td style="text-align:right;"> 1347 </td>
<td style="text-align:right;"> 1226 </td>
<td style="text-align:right;"> 1109 </td>
<td style="text-align:right;"> 1122 </td>
<td style="text-align:right;"> 939 </td>
<td style="text-align:right;"> 866 </td>
<td style="text-align:right;"> 821 </td>
<td style="text-align:right;"> 821 </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
</tr>
<tr>
<td style="text-align:left;"> Disability </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> 12 </td>
<td style="text-align:right;"> 27 </td>
<td style="text-align:right;"> 23 </td>
<td style="text-align:right;"> 36 </td>
<td style="text-align:right;"> 37 </td>
<td style="text-align:right;"> 50 </td>
<td style="text-align:right;"> 43 </td>
<td style="text-align:right;"> 73 </td>
<td style="text-align:right;"> 54 </td>
<td style="text-align:right;"> 95 </td>
<td style="text-align:right;"> 84 </td>
<td style="text-align:right;"> 85 </td>
<td style="text-align:right;"> 99 </td>
<td style="text-align:right;"> 48 </td>
<td style="text-align:right;"> 61 </td>
<td style="text-align:right;"> 102 </td>
<td style="text-align:right;"> 99 </td>
<td style="text-align:right;"> 96 </td>
<td style="text-align:right;"> 88 </td>
<td style="text-align:right;"> 77 </td>
<td style="text-align:right;"> 160 </td>
<td style="text-align:right;"> 179 </td>
</tr>
<tr>
<td style="text-align:left;"> Gender </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> 30 </td>
<td style="text-align:right;"> 40 </td>
<td style="text-align:right;"> 30 </td>
<td style="text-align:right;"> 36 </td>
<td style="text-align:right;"> 54 </td>
<td style="text-align:right;"> 61 </td>
</tr>
<tr>
<td style="text-align:left;"> Gender Identity </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> NA </td>
<td style="text-align:right;"> 33 </td>
<td style="text-align:right;"> 109 </td>
<td style="text-align:right;"> 122 </td>
<td style="text-align:right;"> 131 </td>
<td style="text-align:right;"> 132 </td>
<td style="text-align:right;"> 189 </td>
</tr>
<tr>
<td style="text-align:left;"> Single-Bias </td>
<td style="text-align:right;"> 10446 </td>
<td style="text-align:right;"> 11017 </td>
<td style="text-align:right;"> 10215 </td>
<td style="text-align:right;"> 9705 </td>
<td style="text-align:right;"> 9792 </td>
<td style="text-align:right;"> 9906 </td>
<td style="text-align:right;"> 11998 </td>
<td style="text-align:right;"> 9211 </td>
<td style="text-align:right;"> 9091 </td>
<td style="text-align:right;"> 9514 </td>
<td style="text-align:right;"> 8795 </td>
<td style="text-align:right;"> 9642 </td>
<td style="text-align:right;"> 9527 </td>
<td style="text-align:right;"> 9683 </td>
<td style="text-align:right;"> 8322 </td>
<td style="text-align:right;"> 8199 </td>
<td style="text-align:right;"> 7697 </td>
<td style="text-align:right;"> 7151 </td>
<td style="text-align:right;"> 7230 </td>
<td style="text-align:right;"> 6681 </td>
<td style="text-align:right;"> 7121 </td>
<td style="text-align:right;"> 7509 </td>
<td style="text-align:right;"> 8493 </td>
<td style="text-align:right;"> 8646 </td>
</tr>
<tr>
<td style="text-align:left;"> Multiple-Bias </td>
<td style="text-align:right;"> 23 </td>
<td style="text-align:right;"> 22 </td>
<td style="text-align:right;"> 40 </td>
<td style="text-align:right;"> 17 </td>
<td style="text-align:right;"> 10 </td>
<td style="text-align:right;"> 18 </td>
<td style="text-align:right;"> 22 </td>
<td style="text-align:right;"> 11 </td>
<td style="text-align:right;"> 9 </td>
<td style="text-align:right;"> 14 </td>
<td style="text-align:right;"> 9 </td>
<td style="text-align:right;"> 10 </td>
<td style="text-align:right;"> 8 </td>
<td style="text-align:right;"> 8 </td>
<td style="text-align:right;"> 14 </td>
<td style="text-align:right;"> 9 </td>
<td style="text-align:right;"> 16 </td>
<td style="text-align:right;"> 13 </td>
<td style="text-align:right;"> 12 </td>
<td style="text-align:right;"> 46 </td>
<td style="text-align:right;"> 52 </td>
<td style="text-align:right;"> 106 </td>
<td style="text-align:right;"> 335 </td>
<td style="text-align:right;"> 173 </td>
</tr>
<tr>
<td style="text-align:left;"> Total </td>
<td style="text-align:right;"> 10469 </td>
<td style="text-align:right;"> 11039 </td>
<td style="text-align:right;"> 10255 </td>
<td style="text-align:right;"> 9722 </td>
<td style="text-align:right;"> 9802 </td>
<td style="text-align:right;"> 9924 </td>
<td style="text-align:right;"> 12020 </td>
<td style="text-align:right;"> 9222 </td>
<td style="text-align:right;"> 9100 </td>
<td style="text-align:right;"> 9528 </td>
<td style="text-align:right;"> 8804 </td>
<td style="text-align:right;"> 9652 </td>
<td style="text-align:right;"> 9535 </td>
<td style="text-align:right;"> 9691 </td>
<td style="text-align:right;"> 8336 </td>
<td style="text-align:right;"> 8208 </td>
<td style="text-align:right;"> 7713 </td>
<td style="text-align:right;"> 7164 </td>
<td style="text-align:right;"> 7242 </td>
<td style="text-align:right;"> 6727 </td>
<td style="text-align:right;"> 7173 </td>
<td style="text-align:right;"> 7615 </td>
<td style="text-align:right;"> 8828 </td>
<td style="text-align:right;"> 8819 </td>
</tr>
</tbody>
</table></div>
---
#### Extract Text
.panelset[
.panel[.panel-name[Data]
- Let's assume that you want to extract the following unordered list from the [US Department of Justice](https://www.justice.gov/hatecrimes/hate-crime-statistics) page:
```r
knitr::include_graphics('images/offense.png')
```
<img src="images/offense.png" width="20%" height="100%" />
]
.panel[.panel-name[Code]
```r
results <- read_html("https://www.justice.gov/hatecrimes/hate-crime-statistics")
names <- results %>%
html_elements("ul") %>% .[10] # <ul> is the unordered-list tag; keep the 10th list on the page
```
```r
names %>%
html_text()
```
```
#> [1] "Crimes against persons: 69.6%\n\t\t\t\tCrimes against property: 28.2%\n\t\t\t\tCrimes against society: 2.2%\n\t\t\t"
```
```r
# Still more to do: use the stringr package or regular expressions to tidy up this text.
```
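One possible way to do that tidying is sketched below. It assumes the `stringr` package is installed; the string is copied from the output shown above, and the names `raw_text`, `items`, and `offenses` are illustrative rather than part of the original example.

```r
library(stringr)

# Text as returned by html_text() above (copied from the printed output)
raw_text <- "Crimes against persons: 69.6%\n\t\t\t\tCrimes against property: 28.2%\n\t\t\t\tCrimes against society: 2.2%\n\t\t\t"

# Split on newlines, trim surrounding whitespace, and drop empty pieces
items <- str_squish(str_split(raw_text, "\n")[[1]])
items <- items[items != ""]

# Capture the label before the colon and the number before the percent sign
parts <- str_match(items, "^(.*):\\s*([0-9.]+)%$")
offenses <- data.frame(
  category  = parts[, 2],
  share_pct = as.numeric(parts[, 3])
)
offenses
```

The regular expression assumes every list item has the form `label: number%`; items in another format would come back as `NA` and need separate handling.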
]
]
---
#### Extract image URL
- Let's say we would like to import the image of "ortanca" (hydrangea) at https://www.bitkivt.itu.edu.tr/vt/report.php?sor=665 into `R`.
- This requires obtaining the image URL: http://www.bitkivt.itu.edu.tr/foto/Hydrangea_macrophylla_c%C4%B1cek.sem%C4%B1ha.jpg
```r
image <- read_html("https://www.bitkivt.itu.edu.tr/vt/report.php?sor=665")
```
```r
image_url <- image %>%
html_elements("img") %>% .[3] %>% # we need the third image on the page
html_attr("src") # get the image URL from the src attribute
```
```r
# The magick package is for image processing (reading, writing, and joining).
library(magick)
image_read(image_url)
```
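The `src` attribute does not always hold a complete URL. Pages often use relative paths, which would not work as a download address on their own. The sketch below shows one way to handle that case with `url_absolute()` from the `xml2` package. The HTML snippet, the `example.org` base URL, and the variable names are made up for illustration and are not taken from the page above.

```r
library(rvest)
library(xml2)

# Hypothetical page with three images; the third uses a relative path
page <- minimal_html('
  <img src="logo.png">
  <img src="banner.png">
  <img src="../foto/flower.jpg" alt="ortanca">
')

# Collect every src attribute, as in the example above
srcs <- page %>%
  html_elements("img") %>%
  html_attr("src")

# Resolve the third src against the (assumed) address of the page
base_url <- "https://www.example.org/vt/report.php?sor=665"
image_url_abs <- url_absolute(srcs[3], base_url)
image_url_abs
```

If `src` is already an absolute URL, `url_absolute()` returns it unchanged, so the same step can be applied in both cases.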
---
# News
<style type="text/css">
.pull-left {
float: left;
width: 50%;
}
.pull-right {
float: right;
width: 50%;
}
</style>
.pull-left[
<img src="images/wiki_ref.png" width="100%" height="100%" />
]
.pull-right[
<img src="images/scribe_api.png" width="100%" height="100%" />
]
---
* [Scribe](https://misinfocon.com/scribes-reference-api-enables-users-to-access-wikipedia-references-b8f749bf60d1) says that:
* "We, therefore, started the Scribe credibility API. The goal was to make the Wikipedia references not only accessible to anyone but also queryable. We implemented this in two steps: (1) extracting Wikipedia references, and (2) setting up an API to query the references."
* "We extract Wikipedia references from the Wikipedia dump and enrich it with Wikidata information, such as the entity ID in Wikidata. This data is saved as structured data in the database. We focus on online references, i.e., references that include a URL."
* YOUR TURN?
---
# Ethical considerations
- Legal Concerns:
- If internet data is publicly available (e.g., tweets from a public Twitter account), it is **generally considered legal** to collect this data.
- Research that involves human participants (e.g., surveys, interviews, blood draws) needs to be approved by the Institutional Ethics Committee.
---
- "The İTÜ human research ethics committees consist of two separate boards: the Social Sciences and Humanities Human Research Ethics Committee (SB-INAREK) and the Health and Engineering Sciences Human Research Ethics Committee (SM-INAREK)."
.pull-left[
<img src="images/etik1.png" width="90%" height="100%" />
[Source](http://sbinarek.itu.edu.tr/)
]
.pull-right[
<img src="images/etik2.png" width="90%" height="100%" />
[Source](https://sminarek.itu.edu.tr/)
]
---
- However, it is still unclear whether research on publicly available internet data requires Institutional Ethics Committee approval.
- User Ethics:
- [According to this information](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/User-Ethics-Legal-Concerns.html):
"Just because something is legal does not mean it is ethical. Collecting, sharing, and publishing internet data created by or about individuals can lead to unwanted public scrutiny, harm, and other negative consequences for those individuals. There is no single, simple answer to the many difficult questions raised by internet data collection. It is important to develop an ethical framework that responds to the specifics of your particular research project or use case (e.g., the platform, the people involved, the context, the potential consequences, etc.)."
---
- **Hands-on example:** See the `01-web_scraping.Rmd` file for an example of harvesting data from Craigslist.
<img src="images/craiglist.png" width="90%" />
---