Since most needle-in-a-haystack tests inject a line into a pre-defined book text (which may even be part of the model's training data), it can be hypothesized that the LLM is simply "smelling" for something that does not fit the surrounding context.
So, is it possible to create a haystack that is a mix of multiple articles, or just a list of one-liners, so that the model cannot guess from context alone?
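A minimal sketch of what such a construction could look like, assuming the sources are already available as plain strings (the naive sentence splitting and the function name are purely illustrative, not part of any existing tool):

```python
import random

def build_mixed_haystack(sources: list[str], needle: str,
                         target_sentences: int = 500, seed: int = 0) -> str:
    """Shuffle sentences drawn from several unrelated sources and hide the needle at a random depth."""
    rng = random.Random(seed)
    # Very naive sentence splitting; a real harness would use a proper sentence tokenizer.
    sentences = [s.strip() + "." for src in sources for s in src.split(".") if s.strip()]
    rng.shuffle(sentences)
    haystack = sentences[:target_sentences]
    # Insert the needle at a random position so depth is not a giveaway either.
    haystack.insert(rng.randint(0, len(haystack)), needle)
    return " ".join(haystack)
```

Because the surrounding sentences are unrelated to each other, the needle no longer stands out as the single "off-topic" line in an otherwise coherent book.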
What would you consider "fair" conditions for ad-hoc generation of the haystack and needle? Are there tools to help with randomized construction of the haystack (and perhaps with averaging performance over multiple runs)? A few dimensions worth controlling (a sketch of such a harness follows the list):
- Word length of the needle (a single sentence vs. a single paragraph)
- Size of the haystack relative to the needle (a single book vs. a whole anthology)
- Variance of the haystack (a single source vs. multiple sources shuffled into one collection)
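One way such a harness might look, reusing the `build_mixed_haystack` sketch above and averaging retrieval accuracy over several randomized trials; the `ask_model` callable and the substring-match scoring rule are assumptions for illustration:

```python
from typing import Callable

def run_trials(sources: list[str], needle: str, question: str, expected: str,
               ask_model: Callable[[str, str], str],  # (context, question) -> answer
               haystack_sentences: int = 500,         # controls haystack size
               trials: int = 10) -> float:
    """Average retrieval accuracy over several independently randomized haystacks."""
    hits = 0
    for seed in range(trials):
        context = build_mixed_haystack(sources, needle, haystack_sentences, seed=seed)
        answer = ask_model(context, question)
        hits += int(expected.lower() in answer.lower())
    return hits / trials
```

Varying `sources` (one vs. many), `haystack_sentences`, and the needle text itself covers the three dimensions listed above.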
Bonus question: could this also be used to evaluate FOSS models, especially those not served through an OpenAI-compatible API? Would Ollama or something similar do the job?
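A rough sketch of plugging a local model in as the `ask_model` callable, assuming a local Ollama instance on its default port with a model already pulled (the model name and prompt wording are placeholders):

```python
import requests

def ask_ollama(context: str, question: str, model: str = "llama3") -> str:
    """Query a locally running Ollama server via its generate endpoint."""
    prompt = f"{context}\n\nBased only on the text above, answer: {question}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# e.g. accuracy = run_trials(sources, needle, question, expected, ask_ollama)
```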