End-to-end fine-tuning for RAG models using PyTorch Lightning, Hydra, W&B and MosaicML Streaming

das-projects/DeepRAGTuner

Dr. Arijit Das

Location: Koeln, North Rhine-Westphalia, Germany
Email: arijitd@gmail.com
Phone: +49 17684376170
LinkedIn: linkedin.com/in/arijitdas1986
GitHub: github.com/das-projects


Summary

Data Scientist and Research Engineer with over 10 years of experience in machine learning, computational biology, and statistical data analysis. Proven track record of developing and deploying advanced models for intelligent document processing, automated medical screening, and signal processing. Expertise in Python, PyTorch, and various MLOps tools, with a strong background in managing large-scale datasets and implementing fairness-aware algorithms. Recognized for improving system performance, reducing bias in automated decision-making, and enhancing the accuracy of predictive models. Seeking to leverage my skills and experience to drive innovation and efficiency in a dynamic research or industry setting.


Experience

Data Scientist

ERGO Group AG
March 2022 - Present
Düsseldorf, North Rhine-Westphalia, Germany

  • Developed and deployed 5+ models into production for large-scale Intelligent Document Processing, automating 40% of 100 million documents that were previously manually processed.
  • Led the development of 3+ prototypes and MVPs for Retrieval-Augmented Generation (RAG) using AWS and Azure, achieving over 95% accuracy in retrieval performance.
  • Designing an MLOps framework to automate the deployment of classification and extraction models, reducing deployment time by 50%.
  • Modernized the software stack to Python (PyTorch, PyTorch Lightning, MLflow, FastAPI) and JavaScript (Next.js, Svelte) to enhance pre-processing, training/validation, experiment management, and server deployment, leading to a 25% improvement in system performance.

Chair: Algorithmic Fairness Working Group

Institute and Faculty of Actuaries
August 2021 - December 2023

  • Spearheaded the development and implementation of fairness-aware algorithms, decreasing bias in automated decision-making processes by 25%, enhancing ethical standards across 10 institutions, and improving decision accuracy by 20% through comprehensive data analysis and cross-disciplinary collaboration.
  • Published a comprehensive paper in the British Actuarial Journal titled "From Bias to Black Boxes: Understanding and Managing the Risks of AI," influencing industry standards and practices in managing AI risks.

Group Leader

Uniklinik Köln
April 2019 - December 2021
Köln, North Rhine-Westphalia, Germany

  • Secured a Köln Fortune Research Grant of €120,000 for developing Automated Breast Cancer Screening technology, advancing early detection methods by 30%.
  • Supervised four master's theses and collaborated with two doctors on their PhD theses, resulting in three published papers and advancements in Statistically Robust Machine Learning.
  • Developed anomaly detection methods in multi-parametric MRIs using Deep Convolutional Neural Networks with FDR control, enhancing detection accuracy by 25%.
  • Implemented Lie Group covariant representation of 3D data, improving 3D model accuracy by 20%.
  • Conducted Non-linear Independent Component Analysis using Random Fourier Features, improving data separation quality by 15%.

Postdoctoral Researcher

Uniklinik Köln
January 2018 - March 2019
Cologne

  • Developed a Discrete Compound Process model for single-cell modeling, incorporating a novel cost function with regularization, improving parameter estimation consistency in under-sampled regimes by 20%.
  • Automated breast cancer screening using multiparametric MRI and Deep Convolutional Neural Networks, enhancing early detection accuracy by 30%.
  • Enhanced interpretability of Deep Bayesian Convolutional Networks, ensuring invariance to rotations in 3D imaging, leading to a 25% increase in diagnostic reliability.
  • Implemented model selection techniques in Deep Neural Networks, controlling false discoveries of features and improving predictive model reliability by 15%.

Research Scientist

Max Planck Institute: MPIPZ
September 2013 - December 2017
Cologne Area, Germany

  • Designed and analyzed algorithms to control false discoveries, developing machine learning techniques to manage generalization errors. Achieved state-of-the-art results in Genome-Wide Association Studies (GWAS) for breast cancer, reducing false discoveries by 25%.
  • Developed an efficient sampling algorithm to sparsify a kernel matrix with bounded error in O(n log n) time, improving computational efficiency by 50% over the standard O(n^2) complexity.
  • Facilitated efficient implementations of Gaussian process regression and kernel-based hypothesis testing algorithms for large datasets, reducing processing time by 40%.
  • Constructed a regularized cost function for Deep Convolutional Networks to classify Diabetic Retinopathy from retinal images, ensuring controlled false discovery rates and improving classification accuracy by 30%.

Research Engineer

Trinity College Dublin
January 2011 - August 2012
Electronics Engineering Department

  • Developed variational Bayes techniques for signal processing, focusing on turbo-coding algorithms, which improved inference speed and accuracy by 25%.
  • Conducted an in-depth study of signal inference under Rayleigh fading of wireless signals in a noisy environment, improving the resulting signal processing algorithms by 20%.
  • Taught Statistical Signal Processing to Master's and PhD students, receiving excellent feedback and improving course engagement by 30%.

Research Engineer

INRIA Sud-Ouest
December 2009 - November 2010
Bordeaux, France

  • Designed and analyzed unsupervised learning algorithms for time series prediction, improving prediction accuracy by 20%.
  • Collaborated with EDF (Électricité de France) on forecasting daily and weekly consumption patterns of millions of customers across Europe, using advanced time series analysis techniques to improve prediction accuracy by 40% from their baseline.
  • Managed and processed large-scale datasets, handling hundreds of gigabytes of data, which improved data processing efficiency by 30%.

Summer Project: FlexMix Package

R Foundation for Statistical Computing
May 2008 - August 2008

  • Implemented the EM algorithm in C to exploit multi-core architectures, providing an API for parallel computing, which improved computational speed by 35%.

Education

Doctor of Philosophy (PhD), Machine Learning and Computational Biology
Max Planck Institute, Cologne, Germany, 2018
Magna cum laude

Master's Degree, Mathematics and Statistics
Indian Institute of Technology, Kanpur, India, 2009

Bachelor's Degree, Statistics
Delhi University, New Delhi, 2007


Certifications

MLOps Engineering on AWS
Amazon Web Services (AWS), 2022


Skills

Industry Knowledge:

  • Generative AI
  • Large Language Models (LLM) fine-tuning
  • Reinforcement Learning from Human Feedback (RLHF)
  • Direct Preference Optimization (DPO)
  • Proximal Policy Optimization (PPO)
  • Prompt Engineering
  • Semantic Search
  • Text-to-Speech
  • Speech-to-Text
  • Statistical Data Analysis
  • Time Series Analysis
  • Project Management
  • Agile Project Management

Programming Skills:

  • Python: PyTorch, PyTorch Lightning, HuggingFace Transformers, Docker, MLflow, Weights & Biases
  • MLOps: AWS SageMaker, Azure ML
  • Web Development: Next.js, Svelte, FastAPI

Tools and Technologies:

  • Data Processing: Custom OCR, Document Classifier, Named Entity Recognition (NER)
  • Machine Learning: Deep Convolutional Neural Networks, Gaussian Process Regression, Kernel-Based Hypothesis Testing
  • Cloud Platforms: AWS, Azure
  • Software Development: C, Gradio
