Contextual Retrieval #17367

Merged
merged 26 commits into from
Jan 20, 2025
Changes from 23 commits
3c9f7d9
1st commit of DocumentContextExtractor
cklapperich3 Dec 21, 2024
e5796e5
integrated upates to documentcontextextractor.
cklapperich3 Dec 21, 2024
815e2f8
removed unused import
cklapperich3 Dec 21, 2024
7f04900
fixed test code
cklapperich3 Dec 23, 2024
316e72a
mypy compatibility
cklapperich3 Dec 23, 2024
3628ffe
more typing shenanigans
cklapperich3 Dec 23, 2024
839f263
typypesafety
cklapperich3 Dec 24, 2024
fd5c8ba
unused imports and an #ignore removed
cklapperich3 Dec 24, 2024
dfcaa49
removed dumb comments
cklapperich3 Dec 24, 2024
a690b28
node sorting.
cklapperich3 Dec 24, 2024
5b121e5
moved into core.
cklapperich3 Dec 24, 2024
6d51d86
final commit
cklapperich3 Dec 25, 2024
b7eb5e5
all tests pass.
cklapperich3 Dec 25, 2024
8e88e2e
linting
cklapperich3 Dec 25, 2024
6776695
cleanup old stuff
cklapperich3 Dec 25, 2024
08b9fe0
added a notebook
cklapperich3 Dec 30, 2024
995acb6
added documentation
cklapperich3 Dec 30, 2024
5c4b8ba
notebook done
cklapperich3 Jan 1, 2025
3763427
fixed commentt
cklapperich3 Jan 1, 2025
75a23d1
Merge branch 'main' into contextual_retrieval
cklapperich Jan 1, 2025
4091a87
updates per PR comments
cklapperich3 Jan 12, 2025
d916f9a
Merge branch 'contextual_retrieval' of https://github.com/cklapperich…
cklapperich3 Jan 12, 2025
8173a27
bugs in notebook fixed; stateless document context added
cklapperich3 Jan 14, 2025
7252c54
Merge branch 'main' into contextual_retrieval
logan-markewich Jan 20, 2025
e7de522
nits
logan-markewich Jan 20, 2025
7dfbbbc
fix type check
logan-markewich Jan 20, 2025
4 changes: 4 additions & 0 deletions docs/docs/api_reference/extractors/documentcontext.md
@@ -0,0 +1,4 @@
::: llama_index.extractors
    options:
      members:
        - DocumentContextExtractor
@@ -0,0 +1,42 @@
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. They were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The district's machine happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. The space was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the reader and press a button to load the code into memory and run it. The result would ordinarily be to print something on the spectacularly loud device. I was puzzled by the machine. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on cards, and I didn't have any information stored on them. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any code I wrote, because it can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the manager's expression made clear. With microcomputers, everything changed. 
Now you could have one sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punched inputs and then stopping.

I shifted to writing essays again, and created several new ones over the next few months. Some even ventured beyond startup topics. Then in March 2015 I began working on Lisp again.
Lisp's unique characteristic is that its core is a language defined by writing an interpreter in itself. It wasn't originally intended as a standard programming language. It was created as a formal model of computation, an alternative to the Turing machine. If you want to write an interpreter for a language in itself, what's the minimum set of predefined operators you need? The Lisp that John McCarthy invented, or more accurately discovered, is an answer to that question.
McCarthy didn't realize the language could even be used to program computers until his grad student Steve Russell suggested it. Russell translated McCarthy's interpreter into IBM 704 machine language, and from then on Lisp also became a programming language in the conventional sense. But its origins as a model of computation gave it a power and elegance that other languages couldn't match. This quality was what attracted me in college, though I didn't understand why at the time.
McCarthy's 1960 version did nothing more than interpret Lisp expressions. It was missing many features you'd want in a programming language. So these had to be added, and when they were, they weren't defined using his original axiomatic approach. That wouldn't have been feasible at the time. McCarthy tested his interpreter by hand-simulating the execution of programs. But it was already getting close to the limit of interpreters you could test that way — indeed, there was a bug in it that he had overlooked. To test a more complicated system, you'd have had to run it, and computers then weren't powerful enough.

Now they are powerful enough. Now you could continue using the axiomatic approach till you'd defined a complete programming language. And as long as every change you made to the original system was a discoveredness-preserving transformation, you could, in principle, end up with a complete language that had this quality. Harder to do than to talk about, of course, but if it was possible in principle, why not try? So I decided to take a shot at it. The work took 4 years, from March 26, 2015 to October 12, 2019. It was fortunate that I had a precisely defined goal, or it would have been hard to keep at it for so long.
I wrote this new Lisp, called Bel, in itself in Arc. That may sound like a contradiction, but it's an indication of the sort of trickery I had to engage in to make this work. By means of an egregious collection of hacks I managed to make something close enough to an interpreter written in itself that could actually run. Not fast, but fast enough to test.
I had to ban myself from writing essays during most of this time, or I'd never have finished. In late 2015 I spent 3 months writing essays, and when I went back to working on Bel I could barely understand the code. Not so much because it was badly written as because the problem is so convoluted. When you're working on an interpreter written in itself, it's hard to keep track of what's happening at what level, and errors can be practically encrypted by the time you get them.
So I said no more writing till the project was done. But I told few people about it while I was working on it. So for years it must have seemed that I was doing nothing, when in fact I was working harder than I'd ever worked on anything. Occasionally after wrestling for hours with some gruesome bug I'd check Twitter or HN and see someone asking "Does Paul Graham still code?"

Working on the language was hard but satisfying. I worked on it so intensively that at any given time I had a decent chunk of the code in my head and could write more there. I remember taking the boys to the coast on a sunny day in 2015 and figuring out how to deal with some problem involving continuations while I watched them play in the tide pools. This experience felt like I was doing life right. I remember that moment because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.
In the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that country seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of the work was written in England.
In the fall of 2019, Bel was finally finished. Like McCarthy's original version, it was a spec rather than an implementation, although like McCarthy's work it's a spec expressed as code.
Now that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.

[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made the latter seem all the more exciting.
[2] Italian words for abstract concepts can nearly always be predicted from their English cognates (except for occasional traps like polluzione). It's the everyday words that differ. So if you string together a lot of abstract concepts with a few simple verbs, you can make a little Italian go a long way.
[3] I lived at Piazza San Felice 4, so my walk to the Accademia went straight down the spine of old Florence: past the Pitti, across the bridge, past Orsanmichele, between the Duomo and the Baptistery, and then up Via Ricasoli to Piazza San Marco. I saw the city at street level in every possible condition, from empty dark winter evenings to sweltering summer days when the streets were packed with tourists.
[4] You can of course paint people like still lives if you want to, and they're willing. That sort of portrait is arguably the apex of still life painting, though the long sitting does tend to produce pained expressions in the sitters.
[5] Interleaf was one of many companies that had smart people and built impressive technology, and yet got crushed by Moore's Law. In the 1990s the exponential growth in the power of commodity (i.e. Intel) processors rolled up high-end, special-purpose hardware and software companies like a bulldozer.
[6] The signature style seekers at RISD weren't specifically mercenary. In the art world, money and coolness are tightly coupled. Anything expensive comes to be seen as fashionable, and anything seen as trendy will soon become equally costly.
[7] Technically the apartment wasn't rent-controlled but rent-stabilized, but this is a refinement only New Yorkers would know or care about. The point is that the place was really cheap, less than half market price.
[8] Most software you can launch as soon as it's done. But when the software is an online store builder and you're hosting the stores, if you don't have any users yet, that fact will be painfully obvious. So before we could launch publicly we had to launch privately, in the sense of recruiting an initial set of users and making sure they had decent-looking shops.
[9] We'd had a code editor in Viaweb for users to define their own page styles. They didn't know it, but they were editing Lisp expressions underneath. But this wasn't an app editor, because the code ran when the merchants' sites were generated, not when shoppers visited them.
[10] This was the first instance of what is now a familiar experience, and so was what happened next, when I read the comments and found they were full of angry people. How could I claim that Lisp was better than other languages? Weren't they all Turing complete? People who see the responses to essays I write sometimes tell me how sorry they feel for me, but I'm not exaggerating when I reply that things have always been like this, since the very beginning. It comes with the territory. An essay must tell readers things they don't already know, and some people dislike being told such information.
Continuing with the notes:
[11] People put plenty of stuff on the internet in the 90s of course, but putting something online is not the same as publishing it online. Publishing online means you treat the online version as the (or at least a) primary version.
[12] There is a general lesson here that our experience with Y Combinator also teaches: Customs continue to constrain you long after the restrictions that caused them have disappeared. Customary VC practice had once, like the customs about publishing essays, been based on real constraints. Startups had once been much more expensive to start, and proportionally rare. Now they could be cheap and common, but the VCs' customs still reflected the old world, just as customs about writing essays still reflected the constraints of the print era.
Which in turn implies that people who are independent-minded (i.e. less influenced by custom) will have an advantage in fields affected by rapid change (where customs are more likely to be obsolete).
Here's an interesting point, though: you can't always predict which fields will be affected by rapid change. Obviously software and venture capital will be, but who would have predicted that essay writing would be?
[13] Y Combinator was not the original name. At first we were called Cambridge Seed. But we didn't want a regional name, in case someone copied us in Silicon Valley, so we renamed ourselves after one of the coolest tricks in the lambda calculus, the Y combinator.
I picked orange as our color partly because it's the warmest, and partly because no VC used it. In 2005 all the VCs used staid colors like maroon, navy blue, and forest green, because they were trying to appeal to LPs, not founders. The YC logo itself is an inside joke: the Viaweb logo had been a white V on a red circle, so I made the new one a white Y on an orange square.
[14] YC did become a fund for a couple years starting in 2009, because it was getting so big I could no longer afford to fund it personally. But after Heroku got bought we had enough money to go back to being self-funded.
[15] I've never liked the term "deal flow," because it implies that the number of new startups at any given time is fixed. This assumption is not only false, but it's the purpose of YC to falsify it, by causing startups to be founded that would not otherwise have existed.
[16] She reports that the air conditioners were all different shapes and sizes, because there was a run on them and she had to get whatever she could, but that they were all heavier than she could carry now.
[17] Another problem with HN was a bizarre edge case that occurs when you both write essays and run a forum. When you run a forum, you're assumed to see if not every conversation, at least every conversation involving you. And when you write essays, people post highly imaginative misinterpretations of them on forums. Individually these two phenomena are tedious but bearable, but the combination is disastrous. You actually have to respond to the misinterpretations, because the assumption that you're present in the conversation means that not responding to any sufficiently upvoted criticism reads as a tacit admission that it's correct. But that response in turn encourages more; anyone who wants to pick a fight with you senses that now is their chance.
[18] The worst thing about leaving YC was not working with Jessica anymore. We'd been working on the company almost the whole time we'd known each other, and we'd neither tried nor wanted to separate it from our personal lives, so leaving was like pulling up a deeply rooted tree.
[19] One way to get more precise about the concept of invented vs discovered is to talk about space aliens. Any sufficiently advanced alien civilization would certainly know about the Pythagorean theorem, for example. I believe, though with less certainty, that they would also know about the Lisp in McCarthy's 1960 paper.
But if so there's no reason to suppose that this is the limit of the language that might be known to them. Presumably aliens need numbers and errors and I/O too. So it seems likely there exists at least one path out of McCarthy's Lisp along which discoveredness is preserved.
Thanks to Trevor Blackwell, John Collison, Patrick Collison, Daniel Gackle, Ralph Hazell, Jessica Livingston, Robert Morris, and Harj Taggar for reading drafts of this.