paperweight backs up and extracts data from the pdf links in your markdown files
screenshots: the script running in the CLI; the dash app showing embeddings in 3d space; the sqlite db (shown in Beekeeper Studio Ultimate)
- finds links to pdfs in markdown files
- backs up the pdfs in a sqlite database
  - full pdf saved as a blob
- text extraction via https://github.com/pymupdf/PyMuPDF (a minimal sketch follows this list)
- text embeddings from openai stored as encoded JSON
- metadata via gpt-3.5-turbo function calling
  - title
  - keywords
  - authors
  - abstract
  - published_date
  - summary
  - institution
  - location
  - doi
- screenshot of first page saved as a blob
- 3d viz of embeddings via dash and plotly
- cloud backup via cloudflare r2 or amazon s3
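to give a flavor of the extraction step, here's a minimal sketch using PyMuPDF on a pdf already held in memory (the function name and structure are illustrative, not paperweight's actual internals):

```python
import fitz  # PyMuPDF

def extract_pdf(pdf_bytes: bytes) -> tuple[str, bytes]:
    """pull the full text and a first-page screenshot out of an in-memory pdf."""
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    text = "".join(page.get_text() for page in doc)  # plain text, page by page
    screenshot = doc[0].get_pixmap().tobytes("png")  # first page rendered to a PNG blob
    doc.close()
    return text, screenshot
```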
embeddings are currently limited to the first 8191 tokens of the pdf, which is the max input size of text-embedding-3-small. chunking the text and sending it in parts to support full embeddings is a future feature.
(2/15/24 update) - perhaps huge context windows are around the corner anyways
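for reference, the truncation behaves roughly like this, assuming tiktoken and the openai python client (the function is illustrative, not paperweight's actual code):

```python
import tiktoken
from openai import OpenAI

MAX_TOKENS = 8191  # input limit of text-embedding-3-small

def embed_truncated(text: str) -> list[float]:
    enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by text-embedding-3-small
    tokens = enc.encode(text)[:MAX_TOKENS]      # everything past the limit is dropped
    client = OpenAI()                           # picks up OPENAI_API_KEY from the environment
    resp = client.embeddings.create(model="text-embedding-3-small", input=enc.decode(tokens))
    return resp.data[0].embedding
```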
full text is currently limited to 10mb per row. this is arbitrary and will be configurable in the future.
run via the command line with the following command:

```bash
python main.py --directory ~/path/to/your/mds
```
the dash app will be accessible at http://127.0.0.1:8050/. see the `REMAIN_OPEN` arg below for keeping the dash running when processing is complete.
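for context, a toy version of the 3d embedding scatter looks something like this (the dataframe and column names are made up; the real app plots reduced embeddings from the database):

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc

# stand-in for dimensionality-reduced embeddings, one row per paper
df = pd.DataFrame({
    "x": [0.1, 0.5, 0.9], "y": [0.2, 0.4, 0.1], "z": [0.3, 0.9, 0.6],
    "title": ["paper a", "paper b", "paper c"],
})

app = Dash(__name__)
app.layout = dcc.Graph(figure=px.scatter_3d(df, x="x", y="y", z="z", hover_name="title"))

if __name__ == "__main__":
    app.run(debug=False)  # dash serves on http://127.0.0.1:8050/ by default
```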
- `--directory` - path to the directory containing markdown files. defaults to the value of the `DIRECTORY_NAME` environment variable, or the current directory if not set
- `--db-name` - the name for the database. defaults to the value of the `DB_NAME` environment variable, or `papers.db` if not specified
- `--model-name` - specifies the OpenAI model name to be used. defaults to the value of the `MODEL_NAME` environment variable, or `gpt-3.5-turbo-0125` if not provided
- `--verbose` - enables verbose mode, providing detailed logging. defaults to the boolean value of the `VERBOSE` environment variable, or `False` if not set
- `--remain-open` - keeps the application running even after processing is complete, useful for continuous operation or debugging. defaults to the boolean value of the `REMAIN_OPEN` environment variable, or `False` if not specified
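putting the flags together, a fully specified invocation looks like this (assuming `--verbose` and `--remain-open` are plain on/off flags):

```bash
python main.py --directory ~/path/to/your/mds --db-name papers.db --model-name gpt-3.5-turbo-0125 --verbose --remain-open
```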
To enhance security and flexibility, certain configurations are managed through environment variables:

- `OPENAI_API_KEY` - your OpenAI API key, required for generating embeddings and extracting data. this is not explicitly called for anywhere in the application code, but is rather automagically used by the openai library
- `DIRECTORY_NAME` - (optional) can be set to define a default directory for `--directory`, overriding the default current directory
- `DB_NAME` - (optional) sets a default database name for `--db-name`, overriding the default `papers.db`
- `MODEL_NAME` - (optional) determines the default model name for `--model-name` if not specified via CLI, defaulting to `gpt-3.5-turbo-0125`
- `VERBOSE` - (optional) can be set to `true` to enable verbose mode by default, overriding the CLI `--verbose` flag
- `REMAIN_OPEN` - (optional) when set to `true`, the application remains open after processing, overriding the `--remain-open` CLI flag. this is useful for continuing to look at the dash app after processing is complete
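the two boolean variables are set via the string `true`; one common way to parse flags like that looks like the following sketch (illustrative, not paperweight's actual code):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    # treat "true" (any case) as on; anything else falls back to the default
    return os.getenv(name, str(default)).strip().lower() == "true"

VERBOSE = env_flag("VERBOSE")
REMAIN_OPEN = env_flag("REMAIN_OPEN")
```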
the following environment variables are used for cloud backup functionality:

- `S3_BUCKET_NAME` - specifies the S3 bucket name where backups are stored
- `S3_ENDPOINT_URL` - the endpoint URL for S3 services
- `AWS_ACCESS_KEY_ID` - your AWS access key ID
- `AWS_SECRET_ACCESS_KEY` - your AWS secret access key
- `S3_REGION_NAME` - defines the AWS region for the S3 service. defaults to `auto` if not explicitly set, allowing automatic determination based on the endpoint URL
To configure your script with environment variables, you can use a `.env` file. Here's an example that you can customize:
```
# Application Configuration
DIRECTORY_NAME=./path/to/markdowns
DB_NAME=papers.db
MODEL_NAME=gpt-3.5-turbo-0125
VERBOSE=true
REMAIN_OPEN=false

# AWS S3 Configuration
S3_BUCKET_NAME=your_bucket_name
S3_ENDPOINT_URL=https://s3.your-region.amazonaws.com
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
S3_REGION_NAME=your_region_name
```
replace the placeholder values with your actual configuration details. this file should be named `.env` and should not be committed to version control for security reasons (people can steal your openai key). this is why `.env` is in the `.gitignore` file.
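if you're wiring this up yourself, the usual way to load a `.env` file in python is python-dotenv (a sketch under that assumption; paperweight may load its configuration differently):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory into os.environ
db_name = os.getenv("DB_NAME", "papers.db")  # falls back to the documented default
```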
- openai api key
- python 3.11.1
  - probably works with 3.12, but has not been tested
- for cloud backup you will need the following. see https://developers.cloudflare.com/r2/examples/aws/boto3/ for more information on how to set up your keys and what the `endpoint_url` should be:
  - `endpoint_url`
  - `access_key`
  - `secret_key`

see usage above for how to configure these as environment variables.
you can optionally set up a cloud backup of the sqlite database to aws s3/cloudflare r2. see the usage section above for detailed instructions on how to set up the backup.
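a rough sketch of what that backup amounts to with boto3 (the client kwargs mirror the environment variables documented above; the function itself is illustrative):

```python
import os

import boto3

def backup_db(db_path: str) -> None:
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["S3_ENDPOINT_URL"],  # r2 or s3 endpoint
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        region_name=os.getenv("S3_REGION_NAME", "auto"),  # "auto" works for r2
    )
    # upload the database file under its own filename
    s3.upload_file(db_path, os.environ["S3_BUCKET_NAME"], os.path.basename(db_path))
```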
I evaluated using turso embedded replicas for this, but turso does not seem to support BLOB columns, so I did not think it would be a good fit. The primary purpose of these backups is so your PDFs do not disappear to link rot, so the blob columns are important.
if you want to turn the non-blob columns into a distributed database, perhaps to create an API to serve your papers with a JS based frontend, you might want to consider using turso. I have not tried this, but it seems like it would be a good fit for that use case.
I did not run into any issues serving the dash app locally which pulls from the sqlite database file, as papers are inserted into the same database in a separate thread.
if I were to run into issues, or wanted to pull from the database more intensely, turning on WAL mode would be the first thing I would try. the main downside seems to be the creation of two more files.
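enabling WAL mode is a one-liner, and the pragma is persistent, so running it once against the database file is enough (this creates the `-wal` and `-shm` sidecar files mentioned above):

```python
import sqlite3

conn = sqlite3.connect("papers.db")
conn.execute("PRAGMA journal_mode=WAL;")  # persists across connections for this db file
conn.close()
```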
results can be improved by better prompt engineering of the extractor in `models.py`
- n-shot training to extract data better
- turbo mode with many modal containers
- support more link types
  - arxiv
  - wikipedia
- support local pdf files
- backing up full text and embeddings
- make embeddings and NER optional
- modularize the existing functionality to reduce core dependencies
  - dash server is a separate service
  - cloud backup is a separate service
- test coverage
- datasette plugin
- gptstore plugin (?)
- service to run in the cloud on different data sources
inspired by and hopefully mostly compatible with
inspired by varepsilon's rsrch.space
inspired to work with files over apps by kepano