-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathnotes.txt
255 lines (211 loc) · 10.7 KB
/
notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
Red split: 2 lines ; splits only at every delimiter
R3 split: 67 lines ; split into N pieces, sized pieces, uneven pieces, by delimiter, or by parse rule
New split: ~350 lines ; All R3 does plus [once/count before/after/last value/index filter/partition 2-level-split]
Notes: {
This work is based on the Rebol SPLIT function, which I designed
along with Gabriele Santilli, if my memory serves me.
The original goal was to have a single function, like ROUND, that
covered many common cases. The benefit is that users have to learn
only one function, and help is in one place for all of them. In
ROUND's case, it subsumes [ceil floor trunc], half-rounding variants
and rounding to a given scale, for many datatypes. For SPLIT, we
also have a number of cases to cover. For example:
- Split at a delimiter or size (into 2 parts)
- Split into N parts at one or more delimiters
- Split before or after delimiters
- Split up to a certain number of times
- Split at the last occurrence of a delimiter or size
- Split into N pieces
- Splint into pieces of size N
- Split into pieces of varying sizes
- Split based on predicate tests
It's important to note that while we want this to be reasonably
efficient, it's target use cases are scripting and analysis
scenarios. i.e., making users more efficient for moderate amounts
of data, and exploration before optimizing. If you need the fastest
splitting, based on size or delimiter, use R/S. Those sub-funcs
could be written as routines as well, but we don't have any feedback
yet on what's wort the effort.
For non-string values, like blocks, you can split into groups based
on a custom function, this is sometimes called GROUP or PARTITION,
but also acts like FILTER if your predicate returns a logic result.
Then you may want to keep one or both partitions.
That's a lot of cases and a lot of flexibility. Even down to whether
you want to keep the delimiter when splitting/breaking a series, and
which side the delimiter falls to.
We can't cover every case while also keeping the code managable and
the interface not overwhelming or ambiguous.
Another goal has a cost, that you can express splits in more than
one way. e.g. splitting a YYYY-MM-DD/HH:MM:SS string could be done
by multiple delimiter splits, or a single uneven size split. The
point of options is that it lets the user choose what expresses
their intent most clearly for a given case.
Where Rebol's SPLIT was small enough to keep all in one func, we
have to decide if we want to stay within those feature limits, or
if we want to grow the capabilities. Either way, dispatching to
sub-funcs, with SPLIT as the dialected interface to most or all of
them makes sense. Each can have a clear name, be used directly if
we decide to expose them in a named context, and make us thinks in
terms of a consistent interface for splitting and its results. When
people need to write new functions for their own needs, using the
standard model will benefit users from a consistency standpoint,
along with other funcs designed to consume split results.
The Rebol version has an `/into` refinement, but that is now semi-
standardized in Red to mean "specify an existing output buffer
rather than returning a new one." That lets advanced users reduce
memory and GC pressure but uses a nice word previously used to
mean "into N pieces".
}
comment {
split break divide separate partition
join delimit combine append union ; opposites of split
segment section part piece portion slice chunk item
Notice how everyone else just copied the [delim opt limit] arg model.
VBA: Split(expression, [ delimiter, [ limit, [ compare ]]]) ; limit=max items returned
Python: string.split(separator, maxsplit) ; maxsplit=limit
Ruby: split(pattern=nil, [limit])
Java: string.split(String regex, int limit)
PHP: str_split ( string $string , int $length = 1 ) ; into chunks of size $length
JS: string.split(separator, limit)
Swift: split(separator: Character, maxSplits: Int = Int.max, omittingEmptySubsequences: Bool = true) -> [Substring]
Go: strings.Split(str, sep_string)
Rust: https://doc.rust-lang.org/std/string/struct.String.html#method.split
An iterator over substrings of this string slice, separated by characters
matched by a pattern.
The pattern can be a &str, char, a slice of chars, or a function or closure
that determines if a character matches.
split_inclusive keeps the delimiter as part of the preceding element.
split_at int break at index N
split_once ch split at first dlm
split_terminator Equivalent to split, except that the trailing substring is skipped if empty.
.NET https://docs.microsoft.com/en-us/dotnet/api/system.string.split?view=net-5.0
so great to see how they support multiple chars and removing empty elements.
}
comment {
What if SPLIT was fully dialected? That is, 'dlm becomes 'rules and supports
extra words to clarify intent.
split series [into 5 parts/pieces]
split series [into parts of size 5]
split series [piece 5]
split series [sizes [1 2 3]]
split [at 5]
split [at last]
split [at /last]
split [every 5]
split [into 5]
split [at [5 4 2]] ; relative offsets from previous number
; Standard arg
split <integer!> ; into N parts, last may be longer
split <char! string! bitset!> ; into N parts, at each dlm (int can also be a non-index dlm)
split <block of integer!> ; into parts at relative offsets
split <function!> ; into 2 groups, partition by func
split <block of function!> ; into N+1 groups, partition by func; last group = default
split <block of char!|str!> ; Split by each dlm successively.
; Dialected arg
split [at <n>] ; into 2 parts, at absolute position
split [at <dlm>] ; into 2 parts, at first dlm
split [at last <dlm>] ; into 2 parts, at last dlm
split [after <n>] ; into 2 parts, after absolute position
split [after <dlm>] ; into 2 parts, after first dlm
split [after last <dlm>] ; into 2 parts, after last dlm
; Use `each` to indicate full splitting? e.g.
split [at each <dlm>] ; into N parts, at each dlm
split [after each <dlm>] ; into N parts, after each dlm
split [at lit <dlm>] ; quote int and paren values to use as non-computed delims?
; EACH or EVERY?
split [each <n>] ; into ? parts, all of size N, last may be shorter
; Is INTO ambgiuous, with series funcs that support `/into`?
split [into <n>] ; into N parts, last may be longer
; Is SKIP better than AT, to match standard func's behavior?
split [skip [<i> <j> <k>]] ; relative offsets from previous number
; Wordy but clear
split [first by <char!|str!> then by <char!|str!>] ; Split by each dlm successively.
split [
at third dash ; break YYYY-MM-DD-HH-MM-SS into date and time
then at every dash ; break date and time into fields
]
split comma
split [comma]
split [at comma]
split [after comma]
split [once at comma]
split [twice at comma]
split [3 times at comma]
split [integer! times at comma] ; int > 0
; Maybe EVERY is implied, and is overridden by ordinal specifiers.
split [before last comma]
split [at last comma]
split [at third dash] ; break YYYY-MM-DD-HH-MM-SS into date and time
split [after last comma]
split [before every comma] ; keep delim with next value
split [at every comma] ; don't keep delim
split [after every comma] ; keep delim with prior value
split [after each comma] ; each = every
split [
opt [
'once ; n: 1
| 'twice ; n: 2
| integer! 'times ; n: int > 0
]
opt ['at | 'after] ; keep delim with next or previous part
opt 'each
<dlm>
]
'before = at+remove
BREAK as name for func that eats delims versus keeping them?
Can `before|after` mean "keep the delim" on one side or the other? Then "at"
says "don't keep the delim"? How do you say keep the delim as a separate item,
or is that such rare use case that it doesn't matter. For a single value,
like a char or string, it is lossless to throw it away, but for a charset it's
lossy. Before/After keep it as part of the value, but that requires another
processing step for the user.
Delims aren't keys, but grouping by key is another facet of splitting by
function. It needs to keep the keys though, under which all matching items
are grouped, making the resulting structure key-value based and a special
case.
More advanced than I want to think about this late evening, but what about
a lazy approach? In thinking about SPLIT, you have to allocate all those
new pieces maybe just to operate on them. But in an HOF/callback model,
you could stream them out for aggregation or other processing, just accessing
them where they lie in the original series. Without copying, or Boris' PART
func and true slices, there's no protection of the original, which is dangerous
of course. But it's a declarative, or functional if you prefer, approach, where
you pass in the splitting criteria and an action to apply to each result. We
may be able to reserve parens for that, making it consistent with PARSE.
Splitting at a datatype could be useful. e.g. log files where each starts with
a date!, or config files where entries start with a set-word! (though *we*
wouldn't do that ;^), using issue! as an ID for entries that may vary in length.
But it raises a question, because you might want to get word values that
refer to them, so you don't have to compose. But that conflicts with splitting
at literal word values.
;-------------------------------------------------------------------------------
; "As verbs the difference between split and break is that split is (ergative) of
; something solid, to divide fully or partly along a more or less straight line
; while break is (intransitive) to separate into two or more pieces, to fracture
; or crack, by a process that cannot easily be reversed for reassembly." wikidiff
; 'at should implied, because it applies to both 'once and 'every.
; - at every
; - once at opt [first | last]
; In a dialected block, it may help readability to include it, but it's a noop.
; split-once-every-delimiter /before /after
; split-once-at-first-delimiter delim
; split-once-at-last-delimiter delim
; split-once-at-Nth-delimiter delim
; split-every-N-items part-size split-into-equal-parts-of-size-N
; split-into-N-parts num-parts
; split-into-sized-parts part-sizes
; split-once-at-index index
; split-once-at-index-from-tail index
; split-by-predicate test always every
; integer!: [once | every | into | Nth] [lit opt keep]
; function!: [every]
; string!: [once | every] [before | after | keep] No INTO because that implies a known number
; char!: [once | every] [before | after | keep]
; bitset!: [once | every] [before | after | keep]
; block!: [
; all ints [into]
; all funcs [every] partition
; delim [every]
;
; no-fill ?
}