Automated-Document-Analysis

Designed and implemented a Python-based system that automates the extraction, processing, and analysis of text from unstructured PDF and DOCX documents. The system uses Natural Language Processing (NLP) techniques for sentence tokenization, pattern matching, and the identification of specific text elements such as numeric lists and references.
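The tokenization and pattern-matching steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: the three patterns and the helper names `split_sentences`, `find_numeric_items`, and `find_references` are hypothetical, and the regex sentence splitter is only a simplified stand-in for a proper tokenizer such as NLTK's.

```python
import re

# Hypothetical patterns; the notebook's actual expressions are not shown in this README.
# A numeric list item: "1. text", "2) text", or a nested label such as "2.3) text".
NUMERIC_ITEM = re.compile(r"^\s*(\d+(?:\.\d+)*)[.)]\s+(.*\S)")
# A bracketed numeric citation such as [12].
REFERENCE = re.compile(r"\[(\d+)\]")
# Naive sentence boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences with a simple regex.

    This is a rough stand-in for a trained tokenizer; it will split
    incorrectly after abbreviations such as "e.g." or "Fig.".
    """
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def find_numeric_items(text):
    """Return (label, body) pairs for lines that begin with a numeric list label."""
    items = []
    for line in text.splitlines():
        match = NUMERIC_ITEM.match(line)
        if match:
            items.append((match.group(1), match.group(2)))
    return items


def find_references(text):
    """Return the numbers of bracketed citations in order of appearance."""
    return [int(number) for number in REFERENCE.findall(text)]
```

For example, `find_numeric_items("1. Load the file\nAbout 1.5 million rows")` keeps only the first line, because a decimal such as `1.5` is not followed by the whitespace that the list pattern requires after the label.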

• Used regular expressions (regex) together with NLTK and pandas to clean and filter document content, so that large batches of documents can be processed while key information is extracted and irrelevant content is discarded.

• Applied pre-trained transformer models (sentence embeddings) to compute semantic similarity between sections of text, enabling content comparison and context matching across documents.
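The similarity step in the last bullet can be illustrated as follows. This is a minimal sketch, assuming cosine similarity as the comparison measure and the `sentence-transformers` package as the embedding source; neither is confirmed by this README. The model name `all-MiniLM-L6-v2` and the helper names are likewise assumptions chosen for illustration. The embedding library is imported lazily so that the similarity functions work on any list of vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarity for a list of embedding vectors."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


def embed_sections(sections, model_name="all-MiniLM-L6-v2"):
    """Encode text sections into vectors (assumes sentence-transformers is installed).

    The model name is an assumption; the notebook may use a different model.
    """
    from sentence_transformers import SentenceTransformer  # optional dependency

    model = SentenceTransformer(model_name)
    return model.encode(sections).tolist()
```

A typical call would be `similarity_matrix(embed_sections(sections))`, where `sections` is a list of strings; entry `[i][j]` of the result then scores how close section `i` is to section `j`.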

Repository files: PJT FINAL.ipynb, README.md

sThapaswi/Automated-Document-Analysis
