Improve Data Preprocessing Speed by Switching from Pandas to Polars #4

Open
5 tasks
achrefbenammar404 opened this issue Nov 5, 2024 · 0 comments
Description

The current ModelPreprocessor class relies on the Pandas library for data manipulation and preprocessing. While Pandas is effective, it can be slow with large datasets. Switching to Polars, a faster DataFrame library optimized for parallel processing, could significantly improve preprocessing speed, especially for computationally intensive tasks like feature transformation and data type conversion.

Proposed Solution

  1. Replace Pandas with Polars in the ModelPreprocessor class.

  2. Update the following methods to use Polars syntax for efficient parallel processing:

    • feature_selection
    • convert_data_types
    • transform_categories
    • create_log1p_features
    • preprocess
  3. Benchmark the Performance:

    • Compare the preprocessing time between Pandas and Polars to confirm performance improvements.
    • Document any notable speedups or changes in memory usage.
  4. Test Compatibility:

    • Ensure compatibility with other parts of the pipeline, especially CatBoost, which may require converting Polars DataFrames to formats compatible with CatBoostClassifier.

Updated Code Example

Replace Pandas functions with equivalent Polars functions in ModelPreprocessor. Below is a partial example:

import polars as pl

class ModelPreprocessor:
    def feature_selection(self, df: pl.DataFrame) -> pl.DataFrame:
        # Keep only the selected features, if any were configured
        if hasattr(self, 'selected_features'):
            return df.select(self.selected_features)
        return df

    def convert_data_types(self, df: pl.DataFrame) -> pl.DataFrame:
        # Note: the method is with_columns (plural); with_column was removed from Polars
        # Build every cast as an expression, then apply them all in a single call
        casts = [pl.col(column).cast(pl.Categorical) for column in self.categorical_features]
        casts += [pl.col(column).cast(pl.Float32) for column in self.numerical_features]
        return df.with_columns(casts)
    # Continue refactoring other methods similarly...

Tasks

  • Refactor ModelPreprocessor class to use Polars instead of Pandas.
  • Update all methods to use Polars syntax for data manipulation.
  • Test compatibility with CatBoostClassifier and make adjustments as needed.
  • Run benchmarks to compare preprocessing speed with Pandas and document results.
  • Update the documentation to reflect the change from Pandas to Polars.

Expected Outcome

  • Faster data preprocessing for improved efficiency in prediction workflows.
  • Reduced memory usage, especially with large datasets.
  • Cleaner, more concise code for data manipulation tasks.

Additional Notes

  • Polars does not cover all Pandas functionality (for example, it has no row index), so some operations may need to be rewritten or fall back to Pandas.
  • Ensure that any Polars-specific dependencies are added to the requirements file and documented in the README.
achrefbenammar404 added the enhancement (New feature or request) label on Nov 5, 2024