This project is a comprehensive system for processing PDFs, generating audio content, and creating scene visualizations. It provides end-to-end functionality from text extraction to audio generation and scene visualization.
- Preserves complex text formatting
- Supports chapter-specific audio processing
- Detects and handles quotes with unique voice characteristics
- Maintains text structure in intermediate JSON format
- Interactive chatbot for scene visualization (DALL-E)
- Extracts text from PDFs using PyMuPDF
- Preserves advanced text formatting (headings, quotes, italics)
- Generates metadata-rich text outputs
- Integrates Tortoise TTS for speech generation
- Converts formatted text to speech
- Dynamically adjusts voice characteristics based on text formatting
- Creates audiobooks with contextually appropriate pauses
- Implements scene visualization using DALL-E
- Provides interactive chatbot interface
- Generates images based on textual descriptions
- Central application entry point
- Supports multiple operational modes
- Handles command-line arguments
- Offers interactive chat functionality
- Python 3.8+
- OpenAI API Key (for DALL-E integration)
Clone the repository + install dependencies:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python main.py --pdf_path book.pdf --output_dir output --mode pdf2text
python main.py --output_dir output --mode text2audio
python main.py --output_dir output --mode chat --openai_key YOUR_API_KEY