This tool converts scientific PDFs into plain text for use with LLMs.
- Converts the PDF to LaTeX using the Mathpix API, which is tailored to scientific papers.
- Extracts images and tables from the LaTeX and replaces them with text generated by a multimodal LLM.
- The prompts are designed to extract all values and relationships represented in each table or graph, minimizing information loss.
See hierarchical_retrieval.ipynb for an example LlamaIndex workflow.
It uses hierarchical retrieval, in which the text descriptions generated by GPT are used to retrieve the original tables and images.
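The idea of retrieving an original table or image through its text description can be illustrated with a toy sketch. This is not the notebook's implementation: the real workflow uses LlamaIndex, whereas the example below scores descriptions by simple word overlap. The Node class, the retrieve_original function, and the file paths are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    description: str  # text summary, e.g. produced by a multimodal LLM
    original: str     # stand-in for the original table or image (here, a made-up path)


def _tokens(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def retrieve_original(query, nodes):
    """Return the original of the node whose description shares the most words with the query."""
    if not nodes:
        return None
    query_tokens = _tokens(query)
    best = max(nodes, key=lambda n: len(query_tokens & _tokens(n.description)))
    return best.original


# Hypothetical example data: descriptions point back to the originals.
nodes = [
    Node("table of melting points for alloy samples", "tables/table_1.tex"),
    Node("plot of reaction yield versus temperature", "images/figure_2.png"),
]
```

Searching happens over the descriptions, but the caller gets back the original item, so no detail from the table or image is lost at answer time.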
- Set MATHPIX_APP_ID and MATHPIX_APP_KEY in your environment. We suggest using a .env file.
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(".env")) # read local .env file
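Before running a conversion, it may help to confirm that both variables were actually picked up. The following is a minimal sketch using only the standard library; the helper name missing_mathpix_vars is an assumption made for this example and is not part of the tool.

```python
import os

# The two variables named in the setup step above.
REQUIRED_VARS = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY")


def missing_mathpix_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_mathpix_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```

Calling the helper right after load_dotenv surfaces a missing or misspelled key early, rather than as an authentication error during conversion.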
- Instantiate a text and a vision model. This tool uses LlamaIndex abstractions to interface with LLMs.
from llama_index.llms import OpenAI
from llama_index.multi_modal_llms import OpenAIMultiModal
text_model = OpenAI()
vision_model = OpenAIMultiModal(max_new_tokens=4096)
Next, pass those models to the converter.
converter = MathpixPdfConverter(text_model=text_model, vision_model=vision_model)
- Convert the PDF and write the result to a file.
from pathlib import Path
pdf_path = Path("path/to/file.pdf")
pdf_result = converter.convert(pdf_path)
# Write the extracted plain text to disk.
with Path("output.txt").open("w") as f:
    f.write(pdf_result.content)
To persist intermediate results or run processing in parallel, you can use MathpixProcessor and MathpixResultParser directly.
processor = MathpixProcessor()
parser = MathpixResultParser(text_model=text_model, vision_model=vision_model)
mathpix_result = processor.submit_pdf(pdf_path)
mathpix_result = processor.await_result(mathpix_result)
pdf_result = parser.parse_result(mathpix_result)
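One possible way to run several PDFs in parallel with these three steps is sketched below. The convert_many helper is hypothetical and takes the submit, await, and parse steps as plain callables, so it makes no claims about the library's internals. It assumes that submitting returns quickly and that the methods are safe to call from multiple threads, neither of which is confirmed here.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_many(pdf_paths, submit, await_result, parse, max_workers=4):
    """Submit every PDF first, then await and parse each job in a worker thread."""
    # Submitting up front lets all jobs be in flight before any waiting starts.
    pending = [submit(path) for path in pdf_paths]

    def finish(job):
        # Block until this job is done, then turn it into the final result.
        return parse(await_result(job))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input paths.
        return list(pool.map(finish, pending))
```

Under those assumptions, it could be called as convert_many(paths, processor.submit_pdf, processor.await_result, parser.parse_result), where paths is a list of PDF paths.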