Evaluate polars as a replacement of pyspark #25

Closed
kpto opened this issue Nov 4, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

kpto (Collaborator) commented Nov 4, 2024

Following #18, Polars is preferred because its syntax resembles Pandas and is perhaps easier for a wider audience who may not have much experience with SQL, which DuckDB favors. A minimal working sample will be implemented soon and the findings will be recorded here.

Initial findings regarding Polars:

  1. Streaming support was added recently and is not yet mature: https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.collect.html#polars.LazyFrame.collect
  2. Unlike DuckDB, which spills to disk to handle memory-heavy operators, Polars seems to have no corresponding mechanism.
  3. Because of the two points above, developers need to understand the memory usage of each operator well to avoid potential memory issues. This is a good reference: https://www.rhosignal.com/posts/streaming-operations-in-polars/

kpto added this to OTAR3088 Nov 4, 2024
kpto added the enhancement (New feature or request) label Nov 4, 2024
kpto added this to the Version v0.4.0 milestone Nov 5, 2024

kpto (Collaborator, Author) commented Nov 7, 2024

Unfortunately, the streaming feature seems to be very new and not suitable for production use. Its documentation is still under construction:
https://docs.pola.rs/user-guide/lazy/streaming/

It is also unclear how to iterate over rows in the Python runtime. The argument streaming=True of collect turns on streaming, but the call still returns a DataFrame, which does not seem to be a generator.
https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.collect.html#polars.LazyFrame.collect

Here is also a relevant issue showing that a LazyFrame either cannot be iterated in the Python runtime or can only be iterated inefficiently:
pola-rs/polars#10683

Polars remains on our watchlist but is not suitable for use for now. Closing the issue.

Projects
Status: Done