Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[meta] kv and grok filters timeout handling #10976

Closed
colinsurprenant opened this issue Jul 18, 2019 · 13 comments
Closed

[meta] kv and grok filters timeout handling #10976

colinsurprenant opened this issue Jul 18, 2019 · 13 comments

Comments

@colinsurprenant
Copy link
Contributor

colinsurprenant commented Jul 18, 2019

Description

Both the kv and grok filters share a similar strategy for supporting the timeout option to abort long running regex operations.

The strategy, initially put in place for grok, is to rely on the Java interrupt status flag. The basic idea is to create a watchdog thread which monitor the time spent from the worker thread inside the regex parsing calls (that are ultimately handled by the Java joni regex library). Upon figuring that a worker thread has spent too much time in a regex call, to abort it, it sets that worker thread underlying Java thread interrupt status by using Thread.interrupt(). See kv implementation for example.

Problem

We have realized that logstash was sometimes randomly crashing with InterruptedException and figured these were tied to the grok and kv filters timeout handling and also figured that the timeout was not working at all in other cases.

The problem seems to be related to the direct manipulation of the Java interrupt status flag on the underlying Java thread from a JRuby execution context. JRuby threads + runtime has its own layer to handle the Java interrupt status flag, see the RubyThread implementation. We can see that the RubyThread.pollThreadEvents() method calls RubyThreads.anyInterrupted() which is turn call Thread.interrupted() which reads and resets the interrupt status flag :

The interrupted status of the thread is cleared by this method. In other words, if this method were to be called twice in succession, the second call would return false

The problem is that that the JRuby RubyThread.pollThreadEvents() methods is invoked in many places in the JRuby runtime and it is very hard to predict the behaviour of explicitly setting a Java thread interrupt flag from a JRuby context because once set in Java from JRuby, the JRuby runtime will continue to be invoked up until the final joni Java code is executed.

Also looking at this JRuby issue we can see that figuring the right interruption semantics between Java and Ruby in not trivial and that these implementation details can change. Relying on such implementation details can become future problems if these implementation details change.

Solution

I am not sure. There are possibly two paths;

  • Finding a fix for the current Ruby plugins implementations of grok and kv, maybe look into using the Timeout class?
  • Rewriting both plugins in pure Java using the Java plugins API and make sure that setting the Java thread interrupt flag on a thread is never inter winded with JRuby execution on that same thread?
@colinsurprenant
Copy link
Contributor Author

/cc @headius @kares do you think this is a correct assessment of a strategy of directly using Java Thread.interrupt() from JRuby?

I see that this strategy was actually suggested back then (1.7.3) by @headius in jruby/jruby#792 because Timeout did not seem to be working correctly with joni - I will have to see what the state of this is.

@kares
Copy link
Contributor

kares commented Jul 18, 2019

will need to look at details, but there's a somehow recent interrupt related change, altough pbly not regexp related (certainly smt where logstash was considered). anyways, there's some support in joni itself for regexp timeouts that isn't used much but could be fine tuned if it would help logstash's case ...

@colinsurprenant
Copy link
Contributor Author

@kares thanks - you probably refer to jruby/jruby#5683 ?

While ultimately the end goal is about interrupting a long regex call in joni (and obviously all suggestions are welcome) my question is more about the potential dangers of using Java Thread.interrupt() from JRuby? My assessment is that it is dangerous and can lead to hard to predict behaviours. Feedback about that would be useful in helping to decide if we should spend more energy in trying to make it work correctly or is it probably better to avoid that?

@jsvd
Copy link
Member

jsvd commented Jul 19, 2019

Finding a fix for the current Ruby plugins implementations of grok and kv, maybe look into using the Timeout class?

From what I tested I see no current way of interrupting joni from jruby, be it using interrupt or ruby's Timeout.

@colinsurprenant
Copy link
Contributor Author

FYI I found some history on joni's interruptibility; jruby/joni#6 and digging further I realize that an option was added to control joni's interruptability from JRuby https://github.com/jruby/jruby/blob/493822c134fe74e9630d2dc49325e291008b3101/core/src/main/java/org/jruby/util/cli/Options.java#L147 and it is off by default.

@jsvd running your above interrupt example using

jruby -Xregexp.interruptible=true using_interrupt.rb

actually works - so that solves the mystery of why joni's interruption stopped working at some point.

@jsvd
Copy link
Member

jsvd commented Jul 19, 2019

yep, this makes both java interrupt and ruby's Timeout work again.

@colinsurprenant
Copy link
Contributor Author

@untergeek
Copy link
Member

👏 Well done, gentlemen. I know a few sites I've visited had kv timeout issues, and this is a tremendous bit of investigation and correction. Bravo!

@colinsurprenant
Copy link
Contributor Author

Also related and somewhat full-circle logstash-plugins/logstash-filter-grok#130 logstash-plugins/logstash-filter-grok#131

@kares
Copy link
Contributor

kares commented Jul 22, 2019

@kares thanks - you probably refer to jruby/jruby#5683 ?

right, my shady memory forgot that the work-around was specific to mutex-es. shouldn't matter here.

While ultimately the end goal is about interrupting a long regex call in joni (and obviously all suggestions are welcome) my question is more about the potential dangers of using Java Thread.interrupt() from JRuby? My assessment is that it is dangerous and can lead to hard to predict behaviours. Feedback about that would be useful in helping to decide if we should spend more energy in trying to make it work correctly or is it probably better to avoid that?

believe you are correct esp. as you dissected JRuby's Thread.interrupted() usage.
there's some historical reasons for that, I am not sure JRuby could change that in 9.2 (to keep the native interrupted status) maybe in a new major but take my words lightly.
if logstash needs it we could do some experiments.

that being said the interruptible regexp is probably a better way, the reason why its not on by default is likely just performance (slows down matching - not sure how much pbly worth an isolated performance test).

@colinsurprenant
Copy link
Contributor Author

grok v4.1.0 and kv v4.4.0 have been published both using Timeout.timeout instead direct Java Thread.interrupt() for timeout handling.

@colinsurprenant
Copy link
Contributor Author

For reference, we traced back to v6.7.0 (where JRuby was updated from 9.1.13.0 to 9.2.6.0) when joni interruptability was disabled by default and resulted in both the kv and grok filters timeout options just not working. This is tracked in #10978

@colinsurprenant colinsurprenant changed the title kv and grok filters timeout handling [meta] kv and grok filters timeout handling Nov 12, 2019
@colinsurprenant
Copy link
Contributor Author

I am closing this issue since a fix was pushed.

Note that we discovered a substantial performance regression with the current fix as described in #11302

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants