Skip to content
/ TCGPR Public

[Small 2024] Machine Learning Algorithm : outlier identifying, feature selection

License

Notifications You must be signed in to change notification settings

Bin-Cao/TCGPR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TCGPR

Citation

If you use this code in your research, please cite the following:

  • Paper: Machine Learning-Engineered Nanozyme System for Synergistic Anti-Tumor Ferroptosis/Apoptosis Therapy. Tianliang Li(李天亮)&, Bin Cao(曹斌)&, Tianhao Su(苏天昊)&, ..., Lingyan Feng^(冯凌燕), Tongyi Zhang^(张统一). ( SMALL)
  • Paper: Divide and conquer: Machine learning accelerated design of lead-free solder alloys with high strength and high ductility Qinghua Wei(魏清华)&, Bin Cao(曹斌)&, Hao Yuan (元皓)&, ..., Ziqiang Dong^(董自强), Tong-Yi Zhang^(张统一). (NPJcm
  • Patent: Zhang Tongyi (张统一), Cao Bin (曹斌), Yuan Hao, Wei Qinghua, Dong Ziqiang. Authorized Chinese Patent.

History

  • 2022: I proposed TCGPR and developed the first version. In collaboration with Mr. Hao Yuan (元皓) (experimental) and Mr. Qinghua Wei (魏清华) (experimental), this method was successfully applied to lead solder optimization. Our first paper was published in npj Computational Materials. News

  • 2024: After two years of development, we introduced the sequential forward/backward feature selection methods and outlier detection techniques. In collaboration with Mr. Tianliang Li (李天亮) (experimental) and Mr. Tianhao Su (苏天昊) (computional), we successfully applied TCGPR to anti-tumor ferroptosis studies. The paper has been published in SMALL. News


Source Code

PyPI - TCGPR

This Python-based library is compatible with Windows, Linux, and macOS operating systems.


Patent

Patent Image

Algorithm Introduction

For detailed algorithm information, refer to the Introduction.


Installation

To install TCGPR, use pip:

pip install PyTcgpr

Checking Installation

You can check the installation details with:

pip show PyTcgpr

Updating the Package

Update TCGPR to the latest version using:

pip install --upgrade PyTcgpr

Running the Algorithm

1. Data Screening | Partition

from PyTcgpr import TCGPR
dataSet = "data.csv"
initial_set_cap = 3
sampling_cap = 2
up_search = 500
CV = 'LOOCV'
Task = 'Partition'

TCGPR.fit(
    filePath = dataSet, 
    initial_set_cap = initial_set_cap, 
    Task = Task, 
    sampling_cap = sampling_cap,
    up_search = up_search, 
    CV = CV
)
# Note: Mission is set to 'DATA' by default. No need to declare it explicitly.

2. Data Screening | Identification

from PyTcgpr import TCGPR
dataSet = "data.csv"
sampling_cap = 2
up_search = 500
Task = 'Identification'
CV = 'LOOCV'

TCGPR.fit(
    filePath = dataSet, 
    Task = Task, 
    sampling_cap = sampling_cap,
    up_search = up_search, 
    CV = CV
)
# Note: 'Mission' is 'DATA' by default; no need to declare it. 'initial_set_cap' is masked in this case.

3. Feature Selection Module

from PyTcgpr import TCGPR
dataSet = "data.csv"
sampling_cap = 2
Mission = 'FEATURE'
up_search = 500
CV = 'LOOCV'

TCGPR.fit(
    filePath = dataSet, 
    Mission = Mission, 
    sampling_cap = sampling_cap,
    up_search = up_search, 
    CV = CV
)
# Note: For feature selection, 'Mission' must be explicitly set to 'FEATURE'.

Parameters

:param Mission: str, default='DATA'
    The task to perform:
    - 'DATA' for data screening
    - 'FEATURE' for feature selection

:param filePath: str
    Path to the input dataset in CSV format.

:param initial_set_cap: int or list
    Initial set capacity. For 'Partition' under 'DATA', defaults to 3.
    Can also be a list specifying the indices of the initial set.

:param sampling_cap: int, default=1
    Number of data points or features added at each iteration.

:param measure: str, default='Pearson'
    Correlation criteria. Can be 'Pearson' (R values) or 'Determination' (R² values).

:param ratio: float
    Tolerance ratio for correlation. Varies based on the mission and task.

:param target: int, default=1
    Used in feature selection. Specifies the number of targets in regression tasks.

:param weight: float, default=0.2
    Weight factor for calculating the GGMF score.

:param up_search: int, default=500
    Upper boundary for brute-force search.

:param exploit_coef: float, default=2
    Constraint on the variance in the Cal_EI function.

:param exploit_model: bool, default=False
    If True, only R values will be used for the search (GGMF will not be considered).

:param CV: int or str, default=10
    Cross-validation setting. Can be an integer (e.g., 5, 10) or 'LOOCV' for leave-one-out cross-validation.

Output

The algorithm will output a CSV file containing the processed dataset: Dataset_remained_by_TCGPR.csv.


Maintainers

This project is maintained by Bin Cao. If you encounter any issues or have suggestions, feel free to open an issue on GitHub or contact:


Contributing

We welcome contributions and suggestions! You can submit issues for questions, bugs, and feature requests, or submit a pull request directly. We are also open to research collaborations—please get in touch if you're interested!

About

[Small 2024] Machine Learning Algorithm : outlier identifying, feature selection

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published