Python-based pipeline for classifying Mycobacterium tuberculosis resistance to first-line drugs from DNA genome sequences, powered by a machine learning (ML) model.
Tools and libraries used in the pipeline :
- Tabula : extracts DST data from PDF files into CSV
- enaBrowserTools (FTP) : https://github.com/enasequence/enaBrowserTools
- Aspera Connect client (FASP transfers) : https://download.asperasoft.com/download/sw/connect/3.9.9/ibm-aspera-connect-3.9.9.177872-linux-g2.12-64.tar.gz
- ARIBA : https://github.com/sanger-pathogens/ariba
- scikit-learn : https://github.com/scikit-learn/scikit-learn
- Random forest (RF) and decision tree (DT) code written from scratch : https://github.com/zhaoxingfeng/RandomForest
- Other Python 3 libraries needed : NumPy, pandas, scikit-learn (sklearn), matplotlib, etc., plus the command-line tools Bowtie and CD-HIT
Future work :
- Check the threading process (it still appears to produce errors)
- Apply the approach to other multilabel cases
- This MLRF could be integrated with other modified RF/DT algorithms, or pipelined into other classifier processes
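As a rough illustration of the multilabel idea mentioned above, the sketch below trains scikit-learn's `RandomForestClassifier` on a multi-output target, one binary label per drug. It is not this project's MLRF code. The drug list, the synthetic variant matrix, and the helper names `make_synthetic_data` and `train_multilabel_rf` are all assumptions made up for the example. In the real pipeline the features would presumably come from ARIBA output and the labels from the DST data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split

# Assumed label set for illustration only (not taken from the project).
DRUGS = ["isoniazid", "rifampicin", "ethambutol", "pyrazinamide"]


def make_synthetic_data(n_samples=200, n_variants=30, seed=0):
    """Build a made-up binary variant matrix X and resistance labels Y."""
    rng = np.random.default_rng(seed)
    # X[i, j] = 1 means hypothetical variant j is present in isolate i.
    X = rng.integers(0, 2, size=(n_samples, n_variants))
    # Toy rule: resistance to drug k follows variant column k, with 5% label noise.
    Y = X[:, :len(DRUGS)].copy()
    flip = rng.random(Y.shape) < 0.05
    Y[flip] = 1 - Y[flip]
    return X, Y


def train_multilabel_rf(X, Y, seed=0):
    """Fit one random forest on all drug labels and report held-out Hamming loss."""
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.25, random_state=seed
    )
    # A 2-D Y makes the forest predict every label jointly (multi-output).
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)
    # Hamming loss: fraction of (isolate, drug) pairs predicted incorrectly.
    return clf, hamming_loss(Y_test, Y_pred)


if __name__ == "__main__":
    X, Y = make_synthetic_data()
    clf, loss = train_multilabel_rf(X, Y)
    print(f"Hamming loss on held-out isolates: {loss:.3f}")
    # Predicted resistance profile (1 = resistant) for the first isolate.
    print(dict(zip(DRUGS, clf.predict(X[:1])[0])))
```

Because the forest is an ordinary scikit-learn estimator, it could be swapped for another classifier or placed inside a scikit-learn `Pipeline` without changing the surrounding code.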
Google Docs : https://docs.google.com/document/d/1HKc87iLV8qUzujZ_jzEfRqFR9IoTSv-x7UFBGVyLq54/edit?usp=sharing