Home
1FileLLM is a command-line and Flask-based web tool that consolidates text from various sources—such as GitHub repositories, GitHub pull requests/issues, local folders, academic papers, YouTube transcripts, and web pages—into a single, LLM-ready text file. The main goal is to enable fast and seamless creation of information-dense prompts for Large Language Models (LLMs). Text is automatically preprocessed, optionally compressed, and copied to the clipboard for immediate use in LLMs.
Key Objectives
- Streamline text ingestion from multiple external data sources (local files, GitHub repos, YouTube transcripts, academic PDFs, etc.).
- Provide both a command-line and a web-based interface to process and download results.
- Preprocess text by removing stopwords and unnecessary characters, and optionally encapsulate it in XML tags for better LLM performance.
- Report token counts for both compressed and uncompressed text, simplifying prompt-size management.
Features
- Automatic detection of source type (local, GitHub, ArXiv, etc.).
- Support for multiple file formats: `.py`, `.ipynb`, `.txt`, `.md`, PDFs, etc.
- Web crawling with user-defined link depth.
- Optionally retrieve research papers from Sci-Hub via DOI/PMID.
- Clipboard copy of the processed text.
- Token count reporting for immediate insight into LLM prompt sizing.
1FileLLM’s architecture comprises:
- Command-Line Interface (CLI) (`onefilellm.py`):
  - Entry point via `main()` to process URLs/paths.
  - Detects source type and dispatches to the relevant processing function.
  - Encapsulates logic for reading/writing output files, preprocessing text, and handling environment variables.
- Web Interface (`web_app.py`):
  - A Flask-based HTTP server providing a front-end form for users to submit paths/URLs.
  - Mirrors CLI functionality by calling the same underlying processing functions from `onefilellm.py`.
  - Returns processed outputs as downloadable files and token counts in a web page.
- Processing Modules (all in `onefilellm.py` but grouped conceptually):
  - GitHub Operations
    - `process_github_repo`: Recursively fetches files from a repository, downloading only allowed file types.
    - `process_github_pull_request` & `process_github_issue`: Gathers PR or issue details, plus full repo content.
  - PDF/Text Retrieval
    - `process_arxiv_pdf`: Retrieves and extracts text from ArXiv PDFs.
    - `process_doi_or_pmid`: Pulls PDFs via Sci-Hub for a given DOI/PMID, then extracts text.
    - `process_local_folder`: Reads local directories, processing recognized file types.
    - `fetch_youtube_transcript`: Uses the YouTubeTranscriptApi to retrieve transcripts.
  - Web Crawling
    - `crawl_and_extract_text`: Recursively scrapes web pages (HTML and optional PDFs), collecting text up to a user-defined depth.
  - Preprocessing & Token Counting
    - `preprocess_text`: Normalizes text, removing punctuation and stopwords, optionally preserving an XML structure.
    - `get_token_count`: Tokenizes text for reporting, using `tiktoken`.
- Shared Utilities
  - Token & Stopword Handling (`nltk`, `tiktoken`): For text tokenization, compression, and cleaning.
  - Clipboard Integration (`pyperclip`): Copies final uncompressed text.
  - PDF Libraries (`PyPDF2`): Extracts text from PDFs.
  - HTML Parsing (`BeautifulSoup`): Used in web crawling and GitHub API JSON parsing.
  - Environment Variable Handling: Leverages `GITHUB_TOKEN` for private GitHub repo access.
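The preprocessing and token-handling pieces listed above can be illustrated with a small standard-library sketch. This is not the project's `preprocess_text`: the names `compress_text` and `rough_token_count` are hypothetical, the tiny `STOPWORDS` set stands in for the `nltk` stopword list, and the whitespace word count is only a rough proxy for what `tiktoken` would report.

```python
import re

# Hypothetical, deliberately tiny stopword set; the real tool relies on nltk's list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for"}


def compress_text(text: str) -> str:
    """Lowercase, drop punctuation and stopwords, collapse whitespace."""
    # Replace anything that is not a word character or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    words = [w for w in cleaned.split() if w not in STOPWORDS]
    return " ".join(words)


def rough_token_count(text: str) -> int:
    """Whitespace word count, used here only as a stand-in for a real tokenizer."""
    return len(text.split())


sample = "The parser reads the file, and the output is copied to the clipboard."
compressed = compress_text(sample)
print(compressed)
print(rough_token_count(sample), "->", rough_token_count(compressed))
```

Even this crude version shows the trade-off behind the compressed output: fewer tokens, at the cost of punctuation and function words that some prompts may need.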
Data Flow
- User Input
  - Provided via command-line arguments or via a Flask web form.
- Dispatch
  - `main()` or `web_app.py` identifies the source and calls the corresponding processing function.
- Fetch & Parse
  - PDF text extraction, web crawling, GitHub file downloads, or local file reads.
- Preprocess
  - Normalizes text by stripping stopwords and extraneous characters.
  - Optionally wraps output in XML tags for structured LLM input.
- Output
  - Generates `uncompressed_output.txt` and `compressed_output.txt`.
  - Copies `uncompressed_output.txt` to clipboard.
  - Provides optional download from the web interface.
  - Shows token counts for each output.
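The Dispatch step above can be sketched as a small classifier that maps an input string to a source label. The function `detect_source_type`, its labels, and its matching rules are assumptions made for illustration; the actual checks in `onefilellm.py` may differ, and DOI/PMID inputs are not handled here.

```python
from urllib.parse import urlparse


def _host_is(host: str, domain: str) -> bool:
    """True for the domain itself or any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def detect_source_type(source: str) -> str:
    """Return a coarse label for an input path or URL (illustrative rules only)."""
    parsed = urlparse(source)
    if parsed.scheme not in ("http", "https"):
        # Anything that is not an http(s) URL is treated as a local path here.
        return "local"
    host = parsed.netloc.lower()
    path = parsed.path
    if _host_is(host, "github.com"):
        if "/pull/" in path:
            return "github_pull_request"
        if "/issues/" in path:
            return "github_issue"
        return "github_repo"
    if _host_is(host, "arxiv.org"):
        return "arxiv"
    if _host_is(host, "youtube.com") or _host_is(host, "youtu.be"):
        return "youtube"
    return "web"


for src in ["https://github.com/jimmc414/onefilellm", "./docs", "https://example.com/page"]:
    print(src, "->", detect_source_type(src))
```

A caller would then look up the returned label in a table of processing functions, which keeps the CLI and the web form on the same code path.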
git clone https://github.com/jimmc414/1filellm.git
cd 1filellm
Use the provided `requirements.txt`:
pip install -r requirements.txt
(Optionally, create a virtual environment before installing.)
For private GitHub repository access, set a `GITHUB_TOKEN` environment variable:
Windows
setx GITHUB_TOKEN "YourGitHubToken"
Linux/macOS
echo 'export GITHUB_TOKEN="YourGitHubToken"' >> ~/.bashrc
source ~/.bashrc
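Once the variable is set, a script can read it and attach it to GitHub API requests. The following is a minimal sketch rather than the project's code: the helper `github_headers` is hypothetical, and the `Authorization: token ...` form is one of the header styles GitHub accepts.

```python
import os


def github_headers(env=os.environ) -> dict:
    """Build request headers, adding authorization only when a token is set."""
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        headers["Authorization"] = f"token {token}"
    return headers


# Report whether a token is visible to this process without printing it.
print("token configured:", "Authorization" in github_headers())
```

If this prints `False` right after running `setx` on Windows, open a new terminal: `setx` only affects sessions started afterwards.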
You can run 1FileLLM in two ways:
python onefilellm.py
- You will be prompted to enter a local path, GitHub URL, YouTube link, etc.
Or directly provide an argument:
python onefilellm.py https://github.com/jimmc414/onefilellm
- The script detects the source type (e.g., GitHub repo) and processes accordingly.
- Output files:
  - `uncompressed_output.txt`: Full text (copied to clipboard).
  - `compressed_output.txt`: Preprocessed text (cleaner, fewer tokens).
  - `processed_urls.txt`: (For web crawls) Contains each visited URL.
- Launch the web server:
  python web_app.py
- Access `http://localhost:5000` in your browser.
- Provide the input path/URL in the text field, click Process, and view/download the output.
- File Extensions:
  - Editable in `onefilellm.py` under `is_allowed_filetype()`. By default, includes `.py`, `.txt`, `.md`, `.ipynb`, etc.
- Max Depth for Crawling:
  - In `crawl_and_extract_text`, the `max_depth` argument controls how deeply linked pages are followed. The default is `2`.
- ArXiv & Sci-Hub:
  - The tool constructs standard PDF URLs from ArXiv links.
  - Sci-Hub domain is hardcoded (`sci-hub.se`). Modify code if needed.
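As an illustration of the ArXiv item above, the sketch below turns an abstract link into a PDF link. It assumes links of the form `https://arxiv.org/abs/<id>`; the function name `arxiv_pdf_url` is hypothetical and the project's own construction may differ.

```python
from urllib.parse import urlparse, urlunparse


def arxiv_pdf_url(url: str) -> str:
    """Map an arXiv abstract URL (/abs/<id>) to its PDF URL (/pdf/<id>)."""
    parsed = urlparse(url)
    path = parsed.path
    if path.startswith("/abs/"):
        path = "/pdf/" + path[len("/abs/"):]
    elif not path.startswith("/pdf/"):
        raise ValueError(f"not an arXiv abstract or PDF URL: {url}")
    # Drop query string and fragment; keep scheme and host as given.
    return urlunparse((parsed.scheme, parsed.netloc, path, "", "", ""))


print(arxiv_pdf_url("https://arxiv.org/abs/1706.03762"))
```

Links that already point at `/pdf/` pass through unchanged, and anything else raises an error instead of guessing.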
- Fork the repository and create your own feature branch from `main`.
- Add/modify tests in `test_onefilellm.py` to cover any new functionality.
- Run the test suite to ensure everything passes:
  python -m unittest test_onefilellm.py
- Submit a pull request with a detailed description of your changes.
Preferred Contributions
- Improvements to text preprocessing or token counting.
- Additional source type integrations (e.g., other API endpoints).
- Performance optimizations or caching.
- Security enhancements (token encryption, better error handling).
Q1: Why am I getting an error about `GITHUB_TOKEN`?
A1: You must set a valid `GITHUB_TOKEN` if you want to access private repos. Public repos will still work without it.
Q2: The script can’t find my local PDF.
A2: Confirm the file’s absolute path or current directory context. Also check allowed file extensions.
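Both checks from A2 can be done quickly from a Python prompt. The sketch below is an assumption-laden stand-in, not the project's `is_allowed_filetype()`: the names `ALLOWED_EXTENSIONS`, `is_allowed`, and `describe` are hypothetical, and the allow-list (including `.pdf`) is illustrative.

```python
from pathlib import Path

# Illustrative allow-list; the project's actual list lives in is_allowed_filetype().
ALLOWED_EXTENSIONS = {".py", ".txt", ".md", ".ipynb", ".pdf"}


def is_allowed(path: str) -> bool:
    """True when the file's extension (case-insensitive) is on the allow-list."""
    return Path(path).suffix.lower() in ALLOWED_EXTENSIONS


def describe(path: str) -> str:
    """Explain why a path might be skipped: missing, or wrong file type."""
    p = Path(path).expanduser().resolve()
    if not p.exists():
        return f"{p}: not found (check the working directory or use an absolute path)"
    if p.is_file() and not is_allowed(p.name):
        return f"{p}: extension {p.suffix!r} is not on the allow-list"
    return f"{p}: ok"


print(describe("paper.pdf"))
```

Because `resolve()` expands the path against the current working directory, the printed absolute path shows exactly where the script looked.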
Q3: Web crawling is slow or fails unexpectedly.
A3: Large or complex sites can cause performance or request-limit issues. Consider reducing `max_depth` or limiting the domain.
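To see how much the two suggestions in A3 shrink a crawl, here is a depth-limited breadth-first traversal over an in-memory link graph, so it runs without network access. The `LINKS` table, the `crawl` function, and the `same_domain_only` flag are hypothetical; the exact depth semantics of `crawl_and_extract_text` may differ.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages: URL -> links found on it.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.example.org/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
    "https://other.example.org/x": [],
}


def crawl(start: str, max_depth: int = 2, same_domain_only: bool = True) -> list:
    """Breadth-first traversal that stops following links past max_depth."""
    start_host = urlparse(start).netloc
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # This page was visited, but its links are not followed.
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if same_domain_only and urlparse(link).netloc != start_host:
                continue
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited


print(crawl("https://example.com/", max_depth=2))
print(crawl("https://example.com/", max_depth=1, same_domain_only=False))
```

On real sites the number of pages usually grows quickly with depth, so lowering the limit by one level can remove most of the requests.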
Q4: Sci-Hub isn’t returning a PDF.
A4: Sci-Hub might be unavailable or blocking requests from your region. Try again later, or update the Sci-Hub domain in `process_doi_or_pmid()`.
Q5: Token counts differ from my LLM’s actual usage.
A5: Different LLMs and tokenizers can yield slightly different counts. The included `tiktoken` library implements OpenAI tokenizers, so its counts are only an approximation for other model families.