An object-oriented Python script for extracting structured data from medical documents. Successfully processed 2,000+ files, combining OCR technology and data parsing into a modular, maintainable codebase to output clean datasets for analytics. Includes collaboration with medical professionals and statistical analysis via RMarkdown.

Tanguy9862/Medical-OCR-Data-Extraction

Medical OCR Data Extraction

Automates the extraction of medical data from scanned documents using Optical Character Recognition (OCR) and regular expressions, so the data can be processed and analyzed efficiently.

📄 Overview

The script reads text from scanned images with OCR, then applies regular expressions to the recognized text to extract specific data fields. It is designed to handle several types of medical reports, such as IPS (Intravascular Pressure System) and EFR (Electrocardiogram Frequency Response).
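As a rough illustration of the OCR-then-regex flow described above, here is a minimal sketch. The field names, patterns, and sample text are hypothetical, since this README does not list the actual fields or report layouts. `ocr_image` and `extract_fields` are illustrative helpers and are not taken from the repository. The OCR helper relies on `pytesseract.image_to_string` and needs pytesseract, Pillow, and a Tesseract install; the parsing step uses only `re`.

```python
import re

# Hypothetical field patterns: the real field names and report layouts
# are not given in this README.
FIELD_PATTERNS = {
    "patient_id": re.compile(r"Patient\s*ID\s*[:\-]\s*(\w+)", re.IGNORECASE),
    "systolic_pressure": re.compile(
        r"Systolic\s*[:\-]\s*(\d{2,3})\s*mmHg", re.IGNORECASE
    ),
}


def ocr_image(path, lang="eng", config=""):
    """Read text from a scanned page (illustrative helper).

    Imports are local so the parsing code below can run without OCR installed.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), lang=lang, config=config)


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return the first captured value for each field, or None if absent."""
    record = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


# Stand-in for OCR output, so the example runs without an image.
sample = "Patient ID: A1234\nSystolic : 128 mmHg\n"
print(extract_fields(sample))
```

Returning `None` for a missing field, rather than raising, keeps one unreadable value from discarding the rest of a document's record; whether the original script does the same is not stated.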

🛠️ Technical Highlights

  • Programming Language: Python
  • Libraries:
    • pytesseract
    • Pandas
    • NumPy
    • langdetect
    • re
  • Reporting: RMarkdown

🧰 Challenges and Solutions

  • Language Variability: The documents come in multiple languages. The script uses langdetect to identify the language of each document and then applies the regular expressions written for that language.
  • Complex Data Fields: Some medical data fields needed intricate regular expressions to be extracted accurately, which required a solid understanding of both medical terminology and regular expression capabilities.
  • Data Quality: Scan quality varies from document to document, which affects OCR accuracy. The project accepts custom Tesseract configurations to handle such cases.
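The first and third points can be sketched as follows. This is an assumption-laden illustration, not the repository's code: the pattern table, `extract_for_language`, and `tesseract_config` are hypothetical names, and the English fallback is a design choice made for the example. `langdetect.detect` returns a language code such as `"fr"`, and `--oem` and `--psm` are standard Tesseract command-line options that pytesseract accepts through its `config` argument.

```python
import re

# Hypothetical per-language patterns for a single illustrative field.
PATTERNS_BY_LANG = {
    "en": {
        "birth_date": re.compile(
            r"Date of birth\s*:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE
        )
    },
    "fr": {
        "birth_date": re.compile(
            r"Date de naissance\s*:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE
        )
    },
}


def detect_language(text):
    """Detect the language code with langdetect (imported lazily)."""
    from langdetect import detect

    return detect(text)


def extract_for_language(text, detector=detect_language, default_lang="en"):
    """Pick the pattern set for the detected language and apply it.

    Falls back to default_lang when no patterns exist for the detected code.
    Returns the language actually used and the extracted record.
    """
    lang = detector(text)
    if lang not in PATTERNS_BY_LANG:
        lang = default_lang
    record = {}
    for name, pattern in PATTERNS_BY_LANG[lang].items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return lang, record


def tesseract_config(psm=6, oem=3):
    """Build a Tesseract option string to pass as pytesseract's `config`."""
    return f"--oem {oem} --psm {psm}"


# The detector is injectable, so the logic can run without langdetect installed.
print(extract_for_language("Date de naissance : 02/03/1980", detector=lambda t: "fr"))
print(tesseract_config(psm=4))
```

A lower-quality scan might call for a different page segmentation mode, for example `tesseract_config(psm=4)` passed to `pytesseract.image_to_string(image, config=...)`; which settings the project actually uses is not documented here.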

👨‍⚕️ Collaboration and Statistical Analysis

This project involved regular collaboration with medical professionals to identify and prioritize critical data fields relevant for medical diagnoses and treatment plans. Their expertise guided the selection of key metrics to focus on during data extraction and analysis. A comprehensive RMarkdown document was also developed to perform statistical tests on the extracted data, featuring interactive graphs powered by Plotly to aid decision-making. Due to confidentiality, the RMarkdown file is not included in this repository.
