forked from coolbutuseless/flexo
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
261 lines (176 loc) · 7.81 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include=FALSE}
suppressPackageStartupMessages({
library(flexo)
})
knitr::opts_chunk$set(echo = TRUE)
```
```{r echo = FALSE, eval = FALSE}
# Quick logo generation. Borrowed heavily from Nick Tierney's Syn logo process
library(magick)
library(showtext)
font_add_google("Alfa Slab One", "gf")
if (FALSE) {
pkgdown::build_site(override = list(destination = "../coolbutuseless.github.io/package/flexo"))
}
```
```{r echo = FALSE, eval = FALSE}
img <- image_read("man/figures/flexo.png")
hexSticker::sticker(subplot = img,
s_x = 1,
s_y = 1.2,
s_width = 1.5,
s_height = 1.2,
package = "flexo",
p_x = 1,
p_y = 0.45,
p_color = "#223344",
p_family = "gf",
p_size = 8,
h_size = 1.2,
h_fill = "#ffffff",
h_color = "#223344",
filename = "man/figures/logo.png")
image_read("man/figures/logo.png")
```
# flexo: Simple Lex/Parse Tools in R <img src="man/figures/logo.png" align="right" height=300 />
<!-- badges: start -->
![](https://img.shields.io/badge/cool-useless-green.svg)
![](https://img.shields.io/badge/test coverage-100%25-blue.svg)
[![R build status](https://github.com/coolbutuseless/flexo/workflows/R-CMD-check/badge.svg)](https://github.com/coolbutuseless/flexo/actions)
<!-- badges: end -->
`flexo` provides tools for simple tokenising/lexing/parsing of text files.
`flexo` aims to be useful in getting otherwise unsupported text data formats into R.
## What's in the box
* `lex(text, regexes)` break a text string into tokens using the supplied
regular expressions
* `TokenStream` is an R6 class for manipulating a stream of tokens - a
first step for parsing the data into a more useful format
* `create_stream()` is a base R version of `TokenStream` which uses environments
directly
Installation
-----------------------------------------------------------------------------
You can install `flexo` from [github coolbutuseless/flexo](https://github.com/coolbutuseless/flexo) with
```{r eval=FALSE}
# install.packages('remotes')
remotes::install_github('coolbutuseless/flexo', ref='main')
```
Usage Overview
-----------------------------------------------------------------------------
* Define a set of regular expressions (`regexes`) that define the tokens in the data
* Call `lex()` to use these `regexes` to split data into tokens i.e. `lex(text, regexes)`
* `lex()` returns a named character vector of tokens. The names of the tokens
correspond to the respective regex which captured it.
* Optionally use the `TokenStream` [R6](https://cran.r-project.org/package=R6)
class to aid in the manipulation of the raw tokens into more structured data.
* New in v0.2.5 use a base R environment-based token stream iniated with
`create_stream(tokens)`
I often do not import the `flexo` package for projects, but instead copy
the `lex.R` and `stream.R` files into the new package. This avoids having
a non-CRAN dependency on an otherwise simple project.
Limitations
-----------------------------------------------------------------------------
For complicated parsing (e.g. programming languages) you'll want to use the more
formally correct lexing/parsing provided by the [`rly` package](https://cran.r-project.org/package=rly)
or the [`dparser` package](https://cran.r-project.org/package=dparser).
Vignettes
-----------------------------------------------------------------------------
Vignettes are available to read [online](https://coolbutuseless.github.io/package/flexo)
* [Parsing Chess games in PGN format](https://coolbutuseless.github.io/package/flexo/articles/chess.html)
* [Parsing 3d models in OBJ format](https://coolbutuseless.github.io/package/flexo/articles/parse_obj.html)
* [Parsing Scrabble games in GCG format](https://coolbutuseless.github.io/package/flexo/articles/Scrabble.html)
* [Parsing PBRT scene description format](https://coolbutuseless.github.io/package/flexo/articles/PBRT.html)
Example: Using `lex()` to split sentence into tokens
-----------------------------------------------------------------------------
```{r}
sentence_regexes <- c(
word = "\\w+",
whitespace = "\\s+",
fullstop = "\\.",
comma = ","
)
sentence = "Hello there, Rstats."
flexo::lex(sentence, sentence_regexes)
```
Example: Using `lex()` to split some simplified R code into tokens
-----------------------------------------------------------------------------
```{r}
R_regexes <- c(
number = "-?\\d*\\.?\\d+",
name = "\\w+",
equals = "==",
assign = "<-|=",
plus = "\\+",
lbracket = "\\(",
rbracket = "\\)",
newline = "\n",
whitespace = "\\s+"
)
R_code <- "x <- 3 + 4.2 + rnorm(1)"
R_tokens <- flexo::lex(R_code, R_regexes)
R_tokens
```
Example: Using `lex()` with `TokenStream`
-----------------------------------------------------------------------------
Once `lex()` is used to create the separate tokens, the next step is to intepret
the token sequence into something much more structured e.g. a data.frame or
matrix.
The example below shows `flexo` being used to to parse a hypothetical tic-tac-toe
game format into a matrix.
```{r}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# A tough game between myself and Kasparov (my pet rock)
# The comment line denotes who the game was between
# Each square is marked with an 'X' or 'O'
# After each X and O is a number indicating the order in which the mark
# appeared on the board.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
game <- "
# Kasparov(X) vs Coolbutuseless(O)
X2 | O1 | O5
O3 | X4 | X6
X7 | O8 | X9
"
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define all the regexes to split the game into tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
game_regexes <- c(
comment = "(#.*?)\n",
whitespace = "\\s+",
sep = "\\|",
mark = "X|O",
order = flexo::re$number
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Use flexo::lex() to break game into tokens with these regexes
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokens <- flexo::lex(game, game_regexes)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Remove some tokens that don't contain actual information
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokens <- tokens[!names(tokens) %in% c('whitespace', 'comment', 'sep')]
tokens
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create a TokenStream object for help in manipulating the tokens
# Obviously there are easier ways to do this on such a simple example, but
# the below example is hopefully illustrative of the technique
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream <- TokenStream$new(tokens)
mark <- c()
order <- c()
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Keep processing all the tokens until done.
# The tokens should exist in pairs of 'mark' and 'order', so assert that pairing
# Consume and store the values from the stream into 'mark' and 'order' vectors
# Bind all the information into a matrix
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
while (!stream$end_of_stream()) {
stream$assert_name_seq(c('mark', 'order'))
mark <- c(mark , stream$consume(1))
order <- c(order, stream$consume(1))
}
cbind(mark, order)
```