Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there support for capturing groups? #8

Open
greatvovan opened this issue Feb 22, 2023 · 5 comments
Open

Is there support for capturing groups? #8

greatvovan opened this issue Feb 22, 2023 · 5 comments

Comments

@greatvovan
Copy link

How can I extract the second group from aa-bb-c-dddd-eeeee defined as (\w+)-(\w+)-(\w+)-(\w+)-(\w+) (nothing, if no match)?

@greatvovan
Copy link
Author

Tried regex_find_all() but got:

sqlite> select * from regex_find_all('\w+', 'aaa-bb-c-dddd-eeeee') where rowid = 2;
thread '<unnamed>' panicked at 'not yet implemented', src/find_all.rs:79:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
Abort trap: 6

@asg017
Copy link
Owner

asg017 commented Feb 24, 2023

My apologies for the panic you got with WHERE rowid = 2, it has now been fixed in v0.2.2.

sqlite> select * from regex_find_all('\w+', 'aaa-bb-c-dddd-eeeee') where rowid = 2;
┌───────┬─────┬───────┐
│ start │ end │ match │
├───────┼─────┼───────┤
│ 78   │ c     │
└───────┴─────┴───────┘

But for capturing groups specifically, there isn't a great way. I want to add a new select * from regex_captures table function where you'll get all the capture groups and their names/positions, which is being tracked in #1

@greatvovan
Copy link
Author

Great. Any ETA on this by chance?
Also curious why isn't it a great way – is it due to poor performance, poor readability or other?

@asg017
Copy link
Owner

asg017 commented Mar 16, 2023

I'll give it a shot in the next week or so!

I say "it's not a great way" mostly because it's awkward and not really readable. For single values like 'aaa-bb-c-dddd-eeeee' it's fine, but it gets weirder on multiple values, like so:

with strings as (
  select value
  from json_each('[
    "aaa-bb-c-dddd-eeeee",
    "jjj-k-lll-mmmm-nnnn",
    "only-two"
  ]')
)
select * 
from strings
join regex_find_all(regex('\w+'), value) as parts 
where parts.rowid = 2;
/*
┌─────────────────────┬───────┬─────┬───────┐
│        value        │ start │ end │ match │
├─────────────────────┼───────┼─────┼───────┤
│ aaa-bb-c-dddd-eeeee │ 7     │ 8   │ c     │
│ jjj-k-lll-mmmm-nnnn │ 6     │ 9   │ lll   │
└─────────────────────┴───────┴─────┴───────┘
*/

First issue: There's a rowid = 2 in the WHERE clause, so you would think that this query only returns one row. But it doesn't, since the regex_find_all table function is called multiple times and returns multiple rows with the same rowid. Again, it works, but just awkward.

The second issue: For the "only-two" value, no row is returned. This may be what you want, but I'd prefer to see a NULL rather than a missing row for these types of cases.

All this to say, when capture group support is added, you'd be able to do something like:

with strings as (
  select value
  from json_each('[
    "aaa-bb-c-dddd-eeeee",
    "jjj-k-lll-mmmm-nnnn",
    "only-two"
  ]')
)
select 
  value,
  regex_capture(
    '\w+-\w+-(?P<third_word>[^-]+).*', 
    value, 
    'third_word'
  ) as third_word
from strings;

Which, in my opinion, is much cleaner and easier to reason about.

Re performance: They should be both equal, but one important note when using regex table function like regex_find_all is the wrap the regex() function around patterns. This enables caching, which is tricky for table functions, but automatically done for scalar functions like regexp() or regex_find().

@asg017
Copy link
Owner

asg017 commented Mar 23, 2023

Hey @greatvovan , I just pushed capture group support with the new regex_capture() and regex_captures() functions. It's available as of v0.2.3-alpha.3.

Using your ID example, here's how it would work:

create table items as 
  select value as code
  from json_each('["ID123Y2023ABC", "ID456Y2022ABC", "ID789Y1984ABC"]')
;


select 
  items.code,
  regex_capture(
    regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
    items.code,
    'id'
  ) as id,
  regex_capture(
    regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
    items.code,
    'year'
  ) as year
from items;

There's also the regex_captures() table function, which is only really good if you want to extract multiple matches from the same thing, but might be useful:

select
  rowid,
  regex_capture(captures, 'id') as id,
  regex_capture(captures, 'year') as year
from regex_captures(
  regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
  "ID123Y2023ABC, ID456Y2022ABC, and ID789Y1984ABC"
);
/*
┌───────┬─────┬──────┐
│ rowid │ id  │ year │
├───────┼─────┼──────┤
│ 0     │ 123 │ 2023 │
│ 1     │ 456 │ 2022 │
│ 2     │ 789 │ 1984 │
└───────┴─────┴──────┘
*/

Due to limitations in the SQLite virtual table interface, we can't return custom column names with a table function, so we can't do something like select id, year from regex_captures("...", "..."). But I'm tracking a possible solution to that in #9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants