Is there support for capturing groups? #8

greatvovan · 2023-02-22T09:46:28Z

How can I extract the second group from aa-bb-c-dddd-eeeee defined as (\w+)-(\w+)-(\w+)-(\w+)-(\w+) (nothing, if no match)?

The text was updated successfully, but these errors were encountered:

greatvovan · 2023-02-22T10:02:38Z

Tried regex_find_all() but got:

sqlite> select * from regex_find_all('\w+', 'aaa-bb-c-dddd-eeeee') where rowid = 2;
thread '<unnamed>' panicked at 'not yet implemented', src/find_all.rs:79:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
Abort trap: 6

asg017 · 2023-02-24T00:25:10Z

My apologies for the panic you got with WHERE rowid = 2, it has now been fixed in v0.2.2.

sqlite> select * from regex_find_all('\w+', 'aaa-bb-c-dddd-eeeee') where rowid = 2;
┌───────┬─────┬───────┐
│ start │ end │ match │
├───────┼─────┼───────┤
│ 7     │ 8   │ c     │
└───────┴─────┴───────┘

But for capturing groups specifically, there isn't a great way. I want to add a new select * from regex_captures table function where you'll get all the capture groups and their names/positions, which is being tracked in #1

greatvovan · 2023-03-16T20:30:55Z

Great. Any ETA on this by chance?
Also curious why isn't it a great way – is it due to poor performance, poor readability or other?

asg017 · 2023-03-16T21:22:46Z

I'll give it a shot in the next week or so!

I say "it's not a great way" mostly because it's awkward and not really readable. For single values like 'aaa-bb-c-dddd-eeeee' it's fine, but it gets weirder on multiple values, like so:

with strings as (
  select value
  from json_each('[
    "aaa-bb-c-dddd-eeeee",
    "jjj-k-lll-mmmm-nnnn",
    "only-two"
  ]')
)
select * 
from strings
join regex_find_all(regex('\w+'), value) as parts 
where parts.rowid = 2;
/*
┌─────────────────────┬───────┬─────┬───────┐
│        value        │ start │ end │ match │
├─────────────────────┼───────┼─────┼───────┤
│ aaa-bb-c-dddd-eeeee │ 7     │ 8   │ c     │
│ jjj-k-lll-mmmm-nnnn │ 6     │ 9   │ lll   │
└─────────────────────┴───────┴─────┴───────┘
*/

First issue: There's a rowid = 2 in the WHERE clause, so you would think that this query only returns one row. But it doesn't, since the regex_find_all table function is called multiple times and returns multiple rows with the same rowid. Again, it works, but just awkward.

The second issue: For the "only-two" value, no row is returned. This may be what you want, but I'd prefer to see a NULL rather than a missing row for these types of cases.

All this to say, when capture group support is added, you'd be able to do something like:

with strings as (
  select value
  from json_each('[
    "aaa-bb-c-dddd-eeeee",
    "jjj-k-lll-mmmm-nnnn",
    "only-two"
  ]')
)
select 
  value,
  regex_capture(
    '\w+-\w+-(?P<third_word>[^-]+).*', 
    value, 
    'third_word'
  ) as third_word
from strings;

Which, in my opinion, is much cleaner and easier to reason about.

Re performance: They should be both equal, but one important note when using regex table function like regex_find_all is the wrap the regex() function around patterns. This enables caching, which is tricky for table functions, but automatically done for scalar functions like regexp() or regex_find().

asg017 · 2023-03-23T17:42:01Z

Hey @greatvovan , I just pushed capture group support with the new regex_capture() and regex_captures() functions. It's available as of v0.2.3-alpha.3.

Using your ID example, here's how it would work:

create table items as 
  select value as code
  from json_each('["ID123Y2023ABC", "ID456Y2022ABC", "ID789Y1984ABC"]')
;


select 
  items.code,
  regex_capture(
    regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
    items.code,
    'id'
  ) as id,
  regex_capture(
    regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
    items.code,
    'year'
  ) as year
from items;

There's also the regex_captures() table function, which is only really good if you want to extract multiple matches from the same thing, but might be useful:

select
  rowid,
  regex_capture(captures, 'id') as id,
  regex_capture(captures, 'year') as year
from regex_captures(
  regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
  "ID123Y2023ABC, ID456Y2022ABC, and ID789Y1984ABC"
);
/*
┌───────┬─────┬──────┐
│ rowid │ id  │ year │
├───────┼─────┼──────┤
│ 0     │ 123 │ 2023 │
│ 1     │ 456 │ 2022 │
│ 2     │ 789 │ 1984 │
└───────┴─────┴──────┘
*/

Due to limitations in the SQLite virtual table interface, we can't return custom column names with a table function, so we can't do something like select id, year from regex_captures("...", "..."). But I'm tracking a possible solution to that in #9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there support for capturing groups? #8

Is there support for capturing groups? #8

greatvovan commented Feb 22, 2023

greatvovan commented Feb 22, 2023

asg017 commented Feb 24, 2023

greatvovan commented Mar 16, 2023

asg017 commented Mar 16, 2023

asg017 commented Mar 23, 2023

Is there support for capturing groups? #8

Is there support for capturing groups? #8

Comments

greatvovan commented Feb 22, 2023

greatvovan commented Feb 22, 2023

asg017 commented Feb 24, 2023

greatvovan commented Mar 16, 2023

asg017 commented Mar 16, 2023

asg017 commented Mar 23, 2023