Improve Data Preprocessing Speed by Switching from Pandas to Polars #4

achrefbenammar404 · 2024-11-05T00:41:59Z

Description

The current ModelPreprocessor class relies on the Pandas library for data manipulation and preprocessing. While Pandas is effective, it can be slow with large datasets. Switching to Polars, a faster DataFrame library optimized for parallel processing, could significantly improve preprocessing speed, especially for computationally intensive tasks like feature transformation and data type conversion.

Proposed Solution

Replace Pandas with Polars in the ModelPreprocessor class.
Update the following methods to use Polars syntax for efficient parallel processing:
- feature_selection
- convert_data_types
- transform_categories
- create_log1p_features
- preprocess
Benchmark the Performance:
- Compare the preprocessing time between Pandas and Polars to confirm performance improvements.
- Document any notable speedups or changes in memory usage.
Test Compatibility:
- Ensure compatibility with other parts of the pipeline, especially CatBoost, which may require converting Polars DataFrames to formats compatible with CatBoostClassifier.

Updated Code Example

Replace Pandas functions with equivalent Polars functions in ModelPreprocessor. Below is a partial example:

import polars as pl

class ModelPreprocessor:
    def feature_selection(self, df: pl.DataFrame) -> pl.DataFrame:
        # Keep only the selected features, if any have been configured
        if hasattr(self, 'selected_features'):
            df = df.select(self.selected_features)
        return df

    def convert_data_types(self, df: pl.DataFrame) -> pl.DataFrame:
        # Cast every column in one with_columns call so Polars can
        # evaluate the expressions in parallel
        return df.with_columns(
            # Categorical columns
            [pl.col(column).cast(pl.Categorical) for column in self.categorical_features]
            # Numerical columns
            + [pl.col(column).cast(pl.Float32) for column in self.numerical_features]
        )
    # Continue refactoring other methods similarly...

Tasks

Refactor ModelPreprocessor class to use Polars instead of Pandas.
Update all methods to use Polars syntax for data manipulation.
Test compatibility with CatBoostClassifier and make adjustments as needed.
Run benchmarks to compare preprocessing speed with Pandas and document results.
Update the documentation to reflect the change from Pandas to Polars.
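For the benchmarking task, a small timing harness along these lines might be enough to compare the two implementations. It uses only the standard library; the function name `time_preprocessing` and the choice of the median over several runs are assumptions, not part of the existing code.

```python
import time
from statistics import median
from typing import Callable


def time_preprocessing(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time in seconds over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # run the preprocessing step under test
        timings.append(time.perf_counter() - start)
    return median(timings)


# Hypothetical usage, assuming both preprocessors and input frames exist:
# pandas_s = time_preprocessing(lambda: pandas_preprocessor.preprocess(pandas_df))
# polars_s = time_preprocessing(lambda: polars_preprocessor.preprocess(polars_df))
```

Running both versions on the same input data and reporting the two medians side by side would give the comparison asked for above. Memory usage would need a separate measurement.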

Expected Outcome

Faster data preprocessing for improved efficiency in prediction workflows.
Reduced memory usage, especially with large datasets.
Cleaner, more concise code for data manipulation tasks.

Additional Notes

Polars does not cover every Pandas feature (for example, it has no row index), so some operations may need workarounds or a fallback to Pandas.
Ensure that any Polars-specific dependencies are added to the requirements file and documented in the README.

achrefbenammar404 assigned safina57 Nov 5, 2024

achrefbenammar404 added the enhancement (New feature or request) label Nov 5, 2024
