forked from tidyverse/rvest
---
output:
  md_document:
    variant: markdown_github
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
# rvest
[![Build Status](https://travis-ci.org/hadley/rvest.svg?branch=master)](https://travis-ci.org/hadley/rvest)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/rvest)](http://cran.r-project.org/package=rvest)
[![Coverage Status](https://img.shields.io/codecov/c/github/hadley/rvest/master.svg)](https://codecov.io/github/hadley/rvest?branch=master)
rvest helps you scrape information from web pages. It is designed to work with [magrittr](https://github.com/smbache/magrittr) to make it easy to express common web scraping tasks, inspired by libraries like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).
```{r, message = FALSE}
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

# Rating: text of the matched node, converted to a number
rating <- lego_movie %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating

# Cast: alt text of each cast member's photo
cast <- lego_movie %>%
  html_nodes("#titleCast .primary_photo img") %>%
  html_attr("alt")
cast

# Poster: source URL of the poster image
poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
```
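
The example above needs network access and depends on IMDb's markup, so its selectors may stop matching over time. As a minimal offline sketch of the same pattern, the chunk below parses an inline HTML string instead. The markup, the class names and the values in it are made up for illustration.

```{r}
library(rvest)

# Hypothetical page fragment standing in for a real movie page
page <- read_html('
  <div class="title">
    <strong><span>7.8</span></strong>
    <div class="poster"><img src="poster.jpg" alt="Poster"></div>
  </div>
')

# Text inside the matched node, converted to a number
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating

# Value of a single attribute on the matched node
poster <- page %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
```

Because the input is a string, the chunk gives the same result on every run, which makes it a convenient place to try out a selector before pointing it at a live page.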
## Overview
The most important functions in rvest are:
* Create an HTML document from a URL, a file on disk or a string containing
  HTML with `read_html()`.
* Select parts of a document using CSS selectors: `html_nodes(doc, "table td")`
  (or if you're a glutton for punishment, use XPath selectors with
  `html_nodes(doc, xpath = "//table//td")`). If you haven't heard of
  [selectorgadget](http://selectorgadget.com/), make sure to read
  `vignette("selectorgadget")` to learn about it.
* Extract components with `html_tag()` (the name of the tag), `html_text()`
  (all text inside the tag), `html_attr()` (contents of a single attribute) and
  `html_attrs()` (all attributes).
* (You can also use rvest with XML files: parse with `xml()`, then extract
  components using `xml_node()`, `xml_attr()`, `xml_attrs()`, `xml_text()`
  and `xml_tag()`.)
* Parse tables into data frames with `html_table()`.
* Extract, modify and submit forms with `html_form()`, `set_values()` and
  `submit_form()`.
* Detect and repair encoding problems with `guess_encoding()` and
  `repair_encoding()`.
* Navigate around a website as if you're in a browser with `html_session()`,
  `jump_to()`, `follow_link()`, `back()`, `forward()`, `submit_form()` and
  so on. (This is still a work in progress, so I'd love your feedback.)
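
As a rough sketch of how the selection, extraction and table functions listed above fit together, the chunk below works on a small inline document. The HTML, the `id` and the column names are invented for illustration.

```{r}
library(rvest)

# Hypothetical document containing one small table
doc <- read_html('
  <table id="sets">
    <tr><th>name</th><th>pieces</th></tr>
    <tr><td>Cottage</td><td>130</td></tr>
    <tr><td>Castle</td><td>620</td></tr>
  </table>
')

# The same cells selected with a CSS selector and with XPath
cells_css <- html_nodes(doc, "table td")
cells_xpath <- html_nodes(doc, xpath = "//table//td")
html_text(cells_css)
html_text(cells_xpath)

# All attributes of the table node, as a named character vector
doc %>%
  html_node("table") %>%
  html_attrs()

# The whole table as a data frame; the <th> row supplies the column names
sets <- doc %>%
  html_node("table") %>%
  html_table()
sets
```

Note that `html_node()` (singular) returns the first match, so `html_table()` gives back one data frame here rather than a list of them.
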
To see examples of these functions in use, check out the demos.
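
The form helpers can also be tried without a live site. The following is a minimal sketch, assuming a made-up search form: the `action` URL and the field names `q` and `go` are hypothetical. Nothing is submitted, because `submit_form()` needs a session created with `html_session()`.

```{r}
library(rvest)

# Hypothetical search form; the action URL and field names are made up
page <- read_html('
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="submit" name="go" value="Go">
  </form>
')

# html_form() on a document returns a list with one entry per <form>
search <- html_form(page)[[1]]

# set_values() returns a modified copy; the original form is unchanged
filled <- set_values(search, q = "lego")
filled$fields$q$value
```
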
## Installation
Install the release version from CRAN:
```{r, eval = FALSE}
install.packages("rvest")
```
Or the development version from GitHub:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("hadley/rvest")
```
## Inspirations
* Python: [RoboBrowser](http://robobrowser.readthedocs.org/en/latest/readme.html),
  [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).