Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Move StringRef find() implementation to Span #3548

Open
wants to merge 66 commits into
base: nightly
Choose a base branch
from

Conversation

martinvuyk
Copy link
Contributor

@martinvuyk martinvuyk commented Sep 25, 2024

Move StringRef find() implementation to Span

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@martinvuyk martinvuyk marked this pull request as ready for review September 25, 2024 03:21
@martinvuyk martinvuyk requested a review from a team as a code owner September 25, 2024 03:21
@martinvuyk martinvuyk changed the title [stdlib] Move StrRef find() implementation to Span [stdlib] Move StringRef find() implementation to Span Sep 25, 2024
martinvuyk and others added 7 commits September 24, 2024 23:23
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@@ -335,3 +338,194 @@ struct Span[
return Span[T, _lit_mut_cast[lifetime, False].result](
unsafe_ptr=self._data, len=self._len
)

fn find[
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question It seems a bit weird from an API design perspective to have this "find" function in span, can you help me understand why we'd want this? It feels more coupled to string algorithms, and Span isn't just for a view over string data.

Copy link
Contributor Author

@martinvuyk martinvuyk Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be used if you want to find occurrences of a specific scalar value sequence. My main thought was that we'll lose the ability to do that once StringSlice.find() is fixed to use unicode codepoints and it introduces quite some overhead for algorithms that are faster using raw bytes. We could also make this private WDYT?

Also, once PR's #3577 DType.get_dtype() gets merged, List.index() can also delegate to this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoeLoser gentle ping and also to add something, since it seems like we are moving toward unifying Python's bytes into Span[UInt8]in PR #3636, request #3634 and many other places. Python's bytes.find is actually a method, and many other things that we put in StringSlice and StringRef are actually operations from bytes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great change for overall performance when manipulating slices. It opens a lot of possibilities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoeLoser per discussion on discord, Span[T].find is intended to be memchr or memmem as appropriate, but that the subspan of bytes must be aligned to alignof[T](). We can probably expand it to any T: Trivial once we have a Trivial trait.

The proper string version likely needs be specialized for UTF-8 comparisons. @martinvuyk and I have discussed an "AsciiString" which can more aggressively optimize for using byte-oriented intrinsics since UTF-8 is annoying to do with SIMD.

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
martinvuyk and others added 11 commits October 6, 2024 16:48
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
martinvuyk and others added 15 commits November 4, 2024 09:07
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
@martinvuyk
Copy link
Contributor Author

@JoeLoser would it be possible to land this before the next stable release? I'd like to use this to be able to bypass string's .find() for algorithms which work on raw bytes. And using that bypass, I'd like to make the breaking change to switch over to full unicode:

  • String.__len__() should return unicode length and not byte length
  • String.find() should work on unicode offsets
  • String.__getitem__() should work on unicode offsets

I would like to make the change as soon as possible because it will break a lot of projects, and many more are coming because the community is growing.

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants