Optimization for sparse automatons #78
Hmm. I don't think I understand your question quite yet. Could you maybe write out a simple example? Additionally, do either the |
OK, I'll try to explain a little better. I have a [u8] „pattern“. In its basic form, it matches directly, like Map::get would. However, the pattern is also allowed to skip some part of it, based on some marks. So, let's say the pattern is
Allowed matches are the pattern itself, but also I do use the Is this easier to understand, or am I still confusing? 😇 |
OK, so I'm going through some of the issues on this repo and I've re-read your comments here. But I still just don't quite grok what's going on. It's probably because I haven't fully context switched back into this library, but I'm not sure. Is the patch you're envisioning difficult? If not, the best path forward might be to just submit the patch and hopefully that will help make what you're trying to do clearer to me. If it is difficult to write, then I'm not sure, maybe a different way of explaining it? |
I've also context-switched to other things since, and this is not a big priority for me right now, but I'll try. I don't know how hard the full patch would be; I think the implementation might require some amount of work. But I'll try to write an actual proposed interface change and what the usage I envisioned would look like. First, the (simplified) use case. Let's say I create a huge

    use fst::Automaton;

    struct MyAutomaton<'a>(&'a [u8]);

    #[derive(Clone, Copy)]
    enum State {
        // The first `n` bytes of the pattern have matched so far.
        InStr(usize),
        // The input has diverged from the pattern; no match is possible.
        Outside,
    }

    impl<'a> Automaton for MyAutomaton<'a> {
        type State = State;

        fn start(&self) -> State {
            State::InStr(0)
        }

        fn is_match(&self, state: &State) -> bool {
            match *state {
                State::InStr(n) if n == self.0.len() => true,
                _ => false,
            }
        }

        fn accept(&self, state: &State, byte: u8) -> State {
            match *state {
                State::InStr(n) if self.0.get(n) == Some(&byte) => State::InStr(n + 1),
                _ => State::Outside,
            }
        }

        fn can_match(&self, state: &State) -> bool {
            match *state {
                State::InStr(_) => true,
                State::Outside => false,
            }
        }
    }

The
etc. While the
Apparently, the So my proposal was to add a (tentatively named) prefix_hint method. In my simplified example, the method would look something like this:

    fn prefix_hint(&self, state: &State) -> &[u8] {
        match *state {
            // Everything left in the pattern must follow; empty if `n` is out of range.
            State::InStr(n) => self.0.get(n..).unwrap_or(b""),
            _ => b"",
        }
    }

Is this explanation with example better? (It's all too possible that the example contains some off-by-one errors; I hope they don't matter.) |
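For readers who want to try the idea without patching fst, here is a minimal, self-contained sketch of the proposal. Everything in it is an assumption for illustration: `HintedAutomaton` is a local stand-in that only mirrors the shape of fst's `Automaton` trait, `Exact` is a hypothetical exact-match automaton, and `hint_is_consistent` is a made-up helper that checks the contract stated in this thread (the hint is a prefix of the rest of any match), so feeding the hinted bytes one by one must never leave the states that can still match.

```rust
/// Local stand-in for the shape of fst's `Automaton` trait, extended with the
/// proposed `prefix_hint` method. This is NOT the real fst trait.
trait HintedAutomaton {
    type State;
    fn start(&self) -> Self::State;
    fn is_match(&self, state: &Self::State) -> bool;
    fn can_match(&self, state: &Self::State) -> bool;
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State;

    /// Bytes that every match must continue with from `state`.
    /// The empty default means "no hint", so existing automatons are unaffected.
    fn prefix_hint(&self, _state: &Self::State) -> &[u8] {
        b""
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    InStr(usize), // the first `n` pattern bytes have matched
    Outside,      // the input diverged; no match is possible
}

/// Hypothetical automaton that matches exactly one key.
struct Exact<'a>(&'a [u8]);

impl<'a> HintedAutomaton for Exact<'a> {
    type State = State;

    fn start(&self) -> State {
        State::InStr(0)
    }

    fn is_match(&self, state: &State) -> bool {
        matches!(*state, State::InStr(n) if n == self.0.len())
    }

    fn can_match(&self, state: &State) -> bool {
        matches!(*state, State::InStr(_))
    }

    fn accept(&self, state: &State, byte: u8) -> State {
        match *state {
            State::InStr(n) if self.0.get(n) == Some(&byte) => State::InStr(n + 1),
            _ => State::Outside,
        }
    }

    fn prefix_hint(&self, state: &State) -> &[u8] {
        match *state {
            State::InStr(n) => self.0.get(n..).unwrap_or(b""),
            State::Outside => b"",
        }
    }
}

/// Checks the assumed contract: stepping through the hinted bytes with
/// `accept` keeps the automaton in a state that can still match.
fn hint_is_consistent<A: HintedAutomaton>(aut: &A, mut state: A::State) -> bool {
    let hint = aut.prefix_hint(&state);
    for &byte in hint {
        state = aut.accept(&state, byte);
        if !aut.can_match(&state) {
            return false;
        }
    }
    true
}
```

Giving `prefix_hint` an empty default is one way to keep the change backwards compatible: an automaton that does not override it simply falls back to byte-at-a-time traversal.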
@vorner Ahhhh okay, thank you so much, that helps me understand things quite a bit better. I really appreciate you taking the time to write that. The key to my understanding was the inequality between But not totally privileged. IIRC, That's not to say that I think your Note that Lines 410 to 429 in 4993d14
Although, now that I write this out, I'm realizing that your whole point is that you want |
Kind of, yes. I want something a little bit more powerful: I want to interleave the full-featured automaton search with the simplified get-style traversal. In the pattern I search for, there are interesting points (where the full power of the automaton is very useful) and paths where nothing interesting happens. It's similar to the optimization of a compressed trie over a plain trie. The hint would always return the prefix of the rest of the match from the current state.

To give some more context, I've used fst to implement a service at work, and searching the fst turned out to be the most prominent function in the profile (which made sense, since it did the heavy lifting of the service; the rest was mostly just wrapping it into HTTP). I was thinking about how to improve that, and using the raw API to skip the simple parts of matching was a possibility. Nevertheless, I felt that using it would be a bit painful and would produce much less readable code.

We also have a policy that if some internal project needs a change in an open-source library, it's OK to create the pull request as part of work time and publish it. So I looked for a way to abstract some part that could be usable by others, that would also make my code more readable, and that I could implement during work time. The idea is this extension of the trait, but it turned out the service is fast enough as it is. So the evil master plan to use company time to improve fst won't fly. I'm still willing to attempt it in my free time if you like the interface, but it might take a while to find some. If you think it would only complicate matters and wouldn't bring enough improvement, I'm OK with that too. |
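The interleaving described above can be sketched on a toy structure. The code below is only an illustration under stated assumptions: `Node` is a plain byte trie standing in for an fst (not fst's real layout or API), `Sparse` is a hypothetical pattern type in which `?` matches any single byte, the automaton state is just a position in the pattern, and `offered` counts how many bytes are offered to `accept`. With `use_hint` set, literal runs are followed get-style, one edge lookup per byte, and the full enumeration of outgoing edges only happens at the `?` positions.

```rust
use std::collections::BTreeMap;

/// Toy byte trie standing in for an fst (hypothetical, for illustration only).
#[derive(Default)]
struct Node {
    children: BTreeMap<u8, Node>,
    is_final: bool,
}

impl Node {
    fn insert(&mut self, key: &[u8]) {
        let mut node = self;
        for &b in key {
            node = node.children.entry(b).or_default();
        }
        node.is_final = true;
    }
}

/// Hypothetical sparse pattern: `?` matches any byte, the rest is literal.
struct Sparse<'a>(&'a [u8]);

impl<'a> Sparse<'a> {
    fn accept(&self, pos: usize, byte: u8) -> Option<usize> {
        match self.0.get(pos) {
            Some(&b'?') => Some(pos + 1),
            Some(&p) if p == byte => Some(pos + 1),
            _ => None,
        }
    }

    fn is_match(&self, pos: usize) -> bool {
        pos == self.0.len()
    }

    /// Literal bytes that must come next: the pattern up to the next `?`.
    fn prefix_hint(&self, pos: usize) -> &'a [u8] {
        let pattern: &'a [u8] = self.0;
        let rest = &pattern[pos..];
        let end = rest.iter().position(|&b| b == b'?').unwrap_or(rest.len());
        &rest[..end]
    }
}

fn search(
    node: &Node,
    aut: &Sparse<'_>,
    pos: usize,
    use_hint: bool,
    offered: &mut usize,
    key: &mut Vec<u8>,
    out: &mut Vec<Vec<u8>>,
) {
    if use_hint {
        let hint = aut.prefix_hint(pos);
        if !hint.is_empty() {
            // Get-style: follow the hinted bytes directly, skipping siblings.
            let mut cur = node;
            for b in hint {
                match cur.children.get(b) {
                    Some(next) => cur = next,
                    None => return, // the hinted path does not exist
                }
            }
            let depth = key.len();
            key.extend_from_slice(hint);
            // The state here is a plain position, so it can jump by hint.len().
            search(cur, aut, pos + hint.len(), use_hint, offered, key, out);
            key.truncate(depth);
            return;
        }
    }
    if node.is_final && aut.is_match(pos) {
        out.push(key.clone());
    }
    // Automaton-style: offer every outgoing byte to the automaton.
    for (&b, child) in &node.children {
        *offered += 1;
        if let Some(next) = aut.accept(pos, b) {
            key.push(b);
            search(child, aut, next, use_hint, offered, key, out);
            key.pop();
        }
    }
}

/// Returns the matching keys and the number of bytes offered to `accept`.
fn run(root: &Node, pattern: &[u8], use_hint: bool) -> (Vec<Vec<u8>>, usize) {
    let (mut out, mut offered) = (Vec::new(), 0);
    search(root, &Sparse(pattern), 0, use_hint, &mut offered, &mut Vec::new(), &mut out);
    (out, offered)
}
```

On a trie holding `cart`, `cast`, `cat`, `cost` and `dog`, searching for `ca?t` returns the same keys either way, but the hinted run offers far fewer bytes to the automaton. In a real implementation the state would presumably still have to be advanced through `accept` (or some dedicated method) for each hinted byte; the saving is in not enumerating and offering the sibling transitions.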
@vorner Thanks very much! I am nearly sold, but not 100%, so I'd say hold off on a patch for now. I will keep on noodling about this though and might just do it myself. One other question: does this optimization only apply to prefixes? It sounds like it ought to be generalizable to other parts, e.g., the prefix might be variable but the suffix might be fixed. Although I guess it's not possible to take advantage of this because one does not know how many transitions remain to the next match state at any given point in traversal. So I guess skipping past prefixes is a special case. I think Does that all make sense to you? |
Also, given that you're using this in production, it would be good to give my plans in #96 a quick perusal to make sure that makes sense to you. (And if you have any suggestions for breaking changes, now would be a good time! :-)) |
Well, as I envision it, it would be prefix of the rest of the match (so maybe the word prefix might be a bit misleading). If I wanted to match With the forwarding and combining of automatons, I think you're right. |
Hello
The fst can be queried with a specific key or with an automaton. The key is optimized by just „going forward“, while the automaton is „offered“ one byte at a time to modify the state.
Now, I have a sparse automaton, one that doesn't describe one specific key only, but that has only a few branching points; in most of its states there is just a single possible next byte.
I wonder if it's possible to somehow make the querying faster, given the above. Instead of offering it all the bytes that won't match for sure, the traversal could just go forward on the straight parts. I could write code specific to my use case, but is there something that could be done in general?
Maybe extending the automaton with a
fn hint(&self) -> Option<&[u8]>
(defaulting to None) that would not store the intermediate visited states and just go forward? Would it make it faster? Would it make sense to have something like that as part of fst itself? Or is there a better way? If so, I might try to implement it, but I want to ask first before I invest the time.

Thank you