
Replace pyspark with duckdb #18

Open

kpto opened this issue Oct 30, 2024 · 22 comments

@kpto (Collaborator) commented Oct 30, 2024

Due to the difficulties of running pyspark on Windows (#13 and #17), and since its sole purpose in our application is to query a parquet file, an alternative that is less OS-dependent is preferred so the project can serve a wider audience. duckdb, a fast-growing and well-liked data science package that provides functions and interfaces similar to pyspark's, seems to be a good candidate. I have conducted a preliminary test of the package and I think it is promising for our use. I provide a minimal working sample on branch duckdb-evaluation.

Advantages:

  1. Query in SQL syntax just like pyspark does
  2. Fast
  3. Easy to install

Disadvantages:

  1. The Python API for SQL clauses is not as complete as pyspark's, so SQL is preferred for consistency
  2. Returned rows are plain tuples rather than dictionaries keyed by column name (see the sketch below)
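
For illustration, a minimal sketch of the kind of query the sample runs (the file and column names here are placeholders, not the real dataset schema):

import duckdb

con = duckdb.connect()

# read_parquet lets SQL query the parquet file directly.
rows = con.execute(
    "SELECT id, approvedSymbol FROM read_parquet('targets.parquet') LIMIT 5"
).fetchall()

# Each returned row is a plain tuple, not a dict keyed by column name.
for row in rows:
    print(row)
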
@slobentanzer (Collaborator)

I agree with giving duckdb a try; I just would not replace the current adapter, but rather create a new one in the adapter directory. Makes it easier to compare and reference in the future.

@slobentanzer (Collaborator) commented Oct 30, 2024

Can you also list the advantages and disadvantages of Polars? You decided to use duckdb instead of Polars; what was the reason for that?

slobentanzer moved this to Todo in OTAR3088 Oct 30, 2024
@kpto (Collaborator, Author) commented Oct 30, 2024

I simply haven't tried Polars, and if duckdb is already satisfactory, I don't see a reason to spend time on Polars, as we are not searching for the best one.

@slobentanzer (Collaborator)

Why are we not searching for the best one? I am not very familiar with SQL for example, so this would be a potential barrier for my understanding.

On a side note: I would like to provide solutions for as many scenarios as possible. If someone is familiar with Polars, an existing adapter that uses it "as per our recommendation" would be very useful.

@kpto (Collaborator, Author) commented Oct 30, 2024

Our need is simple: it's just reading a parquet file, and parquet itself is a simple format as well. Any package that opens a parquet file should do the job, so there is hardly a "best" one in terms of the functions we need, unless speed is critical.

But perhaps Polars resembles the patterns you already know and would be easier for our audience to learn; in that case we may consider it as well. I can create a minimal working sample for it; do you want me to start?

@slobentanzer (Collaborator) commented Oct 30, 2024

I think there are many facets to "the best" package: user friendliness, maintainability, dependency considerations, consistency with the rest of the ecosystem, flexibility in deployment, speed, validation capabilities, probably some others.

We don't only want to select a package that reads parquet; if that were our only goal, I'd agree. Our scope however is to develop a user-friendly community platform and ecosystem that provides flexibility for a broad range of users; I am thinking beyond a "working Open Targets adapter" here. It needs to fit into the overall mission of the project.

Hence, yes, I would definitely at least consider it and compare all potential solutions that seem worth trying. I would suggest adjusting your branch to introduce additional adapters instead of replacing the existing one, in order to be able to compare syntax, performance, feature-richness, API, etc. Feel free to modify the adapter structure itself if you think it can be optimised.

@slobentanzer (Collaborator)

See also #22

@kpto (Collaborator, Author) commented Oct 30, 2024

If that's the case, then I suggest we refactor the code base first. It was not easy for me to dirty-patch the code to use duckdb when I was creating the working example, and creating more examples in this fashion is obviously inefficient. An improved structure is needed to support the work you mentioned.

@slobentanzer (Collaborator)

What do you mean by code base? What kind of refactor?

@kpto (Collaborator, Author) commented Oct 30, 2024

> What do you mean by code base? What kind of refactor?

Nothing complicated; I merely meant all the Python files in this repo. A simple start would be to put different definitions into different files rather than one monolithic file, which reduces scrolling and makes it easier to find the code one wants to focus on.

The next step would be to reduce duplication by extracting repeated code into generic functions, so that adapting to a different package becomes easier: only the reused functions need to be modified.
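
A sketch of what I mean (duckdb here is just an example backend, and the helper is hypothetical):

import duckdb

def query(sql: str) -> list[tuple]:
    # The single place where the backend is touched; swapping packages
    # would only require changing this one function.
    con = duckdb.connect()
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()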

@kpto (Collaborator, Author) commented Oct 30, 2024

One notable finding regarding Polars: quite a lot of people seem to think Polars is less stable than duckdb, and I found this method argument in the Polars documentation:

use_pyarrow

    Use PyArrow instead of the Rust-native Parquet reader. The PyArrow reader is more stable.
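
If we did go with Polars, the fallback would presumably look like this (the path is a placeholder):

import polars as pl

# use_pyarrow switches from the Rust-native reader to the reportedly
# more stable PyArrow one.
df = pl.read_parquet("data.parquet", use_pyarrow=True)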

@slobentanzer (Collaborator)

Refactor plan sounds good, feel free to create an issue in the v1 milestone for that as well.

The stability observation on Polars seems relevant. I am still interested in pyarrow, generally.

@kpto (Collaborator, Author) commented Oct 31, 2024

I think this is a good article showing the syntax differences between duckdb, Polars and pyarrow:

https://www.confessionsofadataguy.com/pyarrow-vs-polars-vs-duckdb-for-data-pipelines

@slobentanzer (Collaborator) commented Oct 31, 2024

I like that article, but it actually points towards Polars as the best candidate. Second best (most interesting) IMO is arrow native (at least in the bigger scope of BioCypher, as opposed to just one input adapter).

DuckDB seems very SQL-focussed and as such is not fully aligned with the BioCypher philosophy. For me, with little SQL background, the DuckDB snippet in this article is by far the hardest to comprehend. I don't get the author's comment that the DuckDB version is the most compact / least code. It has a huge plain-text SQL statement, which cannot be reasonably debugged, and which encodes the same logic as the far more readable and debuggable Polars snippet, or even the raw arrow snippet.

FYI, there is this milestone in BioCypher that I created for outlining future options; this relates to this discussion as well as the more general one on modularisation and internal data structure: https://github.com/biocypher/biocypher/milestone/4

@slobentanzer (Collaborator)

I think the best pythonic solution would in any case be to refactor the main() function in these examples into readable functions such as connect(), load_df(), create_column(), and write(). Seems that with both Polars and with native arrow, this would be simple enough. Polars seems to be the fastest (not unexpected, since this is their declared focus) and offers similar syntax to Pandas, which many are familiar with.
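
A rough sketch of that structure with Polars (paths and column names are placeholders, not actual adapter code; connect() would only apply to engines that need a session):

import polars as pl

def load_df(path: str) -> pl.LazyFrame:
    # Lazy scan, so the engine can push filters down to the file.
    return pl.scan_parquet(path)

def create_column(lf: pl.LazyFrame) -> pl.LazyFrame:
    # Placeholder derived column for illustration.
    return lf.with_columns(pl.col("id").alias("node_id"))

def write(lf: pl.LazyFrame, out: str) -> None:
    lf.collect().write_csv(out)

def main() -> None:
    write(create_column(load_df("data.parquet")), "out.csv")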

@kpto (Collaborator, Author) commented Oct 31, 2024

Not selling it, but just to explain why some people prefer SQL.

SQL is a commonly used language for heterogeneous data queries in data science; pyspark and Polars actually support querying with SQL as well. To people who are familiar with SQL, the query in the duckdb demonstration is not hard to understand, so I agree with the author's opinion. You can easily find similar comments elsewhere saying that SQL is more concise at expressing a query, which is natural because, after all, SQL stands for Structured Query Language.

These packages are all query engines, and their main feature is pushing some processing down to the data side, as it is more efficient to do it there. To achieve this, an intermediate language is needed to express your query; the engine accepts it and produces an execution plan, the actual code, generated on demand, that runs on the data side to fulfill your query. Whether you use SQL or an expression API, both are translated in the end. The use of SQL draws a clear boundary between the data side and the client side: you know exactly what is passed to, and guaranteed to run on, the engine, whereas with an expression API you may not be sure unless you check the API docs. It is also hard for an expression API to fully hide this machinery, which sometimes creates awkward patterns like this:

pyarrow:

table.group_by("ride_id").aggregate([("ride_id", "count")])

The count here is a command pushed down to the data side, hence the weird string literal.

polars is better:

df.group_by("ride_id").agg(pl.count("ride_id").alias("ride_id_count"))

It declares special functions (pl.count here) for pushed-down commands, so you get a bit of edit-time checking.

Here is the SQL equivalent:

SELECT
    COUNT(`ride_id`) AS `ride_id_count`
FROM `table`
GROUP BY `ride_id`

For me, being familiar with both Python and SQL, SQL is a lot better in terms of clarity. Of course, I admit that SQL is an additional learning barrier, so I don't plan to sell it.

SQL basically doesn't hide anything, which I think results in a more consistent coding style, free of the awkward pattern above. Another advantage of SQL is that it is platform-agnostic, making a switch between packages easier. duckdb actually has an expression API as well, and I tried to use it at first, but I found it quite limited (or I didn't understand it), so I later changed all my queries to SQL.

Debugging-wise, SQL and expression APIs are not so different. The SQL can be broken down into smaller statements for intermediate-state inspection. Deeper debugging requires inspecting the generated execution plan, which is difficult in both cases.
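
For example, with duckdb an intermediate step can be registered as a view and inspected on its own (the file name is a placeholder):

import duckdb

con = duckdb.connect()

# Materialise the intermediate step as a view so it can be checked
# before the aggregation is layered on top.
con.execute("CREATE VIEW rides AS SELECT * FROM read_parquet('rides.parquet')")
con.sql("SELECT * FROM rides LIMIT 5").show()  # inspect intermediate state
con.sql("SELECT ride_id, COUNT(ride_id) AS ride_id_count FROM rides GROUP BY ride_id").show()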

@slobentanzer (Collaborator)

Thanks for the explanation, makes sense. From what I understand, Polars is still the best solution, isn't it? It is the fastest, has an API for those who need it, and can do SQL for those who want it.
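
Looking at the Polars docs, both routes seem available; a toy sketch (data and names are just for illustration, not our adapter code):

import polars as pl

df = pl.DataFrame({"ride_id": [1, 1, 2]})

# Expression API:
counts = df.group_by("ride_id").agg(pl.col("ride_id").count().alias("ride_id_count"))

# SQL, via the built-in SQL context:
ctx = pl.SQLContext(frames={"df": df})
counts_sql = ctx.execute(
    "SELECT ride_id, COUNT(ride_id) AS ride_id_count FROM df GROUP BY ride_id",
    eager=True,
)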

If there are no other large issues, I would make Polars the preferred candidate.

@laulopezreal (Contributor)

Hi both @slobentanzer @kpto

In my experience, Polars is a great solution that has proven quite resilient in other projects I've worked on.
This might be a subjective view, but I feel that Polars is becoming the standard alternative to pandas for bigger datasets within the Python community. See my comment in #17 (comment)

All in all, I agree with @slobentanzer in making it the preferred candidate.

@kpto (Collaborator, Author) commented Nov 4, 2024

@slobentanzer I did some more research on Polars: apart from its uncertain stability in parquet file handling, Polars seems to have a slightly longer history and a larger user base, so I'd say it shouldn't be a bad choice anyway.

I will prioritise creating a working sample with Polars, and this issue shall therefore be closed.

@kpto (Collaborator, Author) commented Nov 4, 2024

Closing this one as Polars is preferred. The follow-up issue is #25.

kpto closed this as completed Nov 4, 2024
github-project-automation bot moved this from Todo to Done in OTAR3088 Nov 4, 2024
@kpto (Collaborator, Author) commented Nov 7, 2024

> Hi both @slobentanzer @kpto
>
> In my experience, Polars is a great solution that has proven quite resilient in other projects I've worked on. This might be a subjective view, but I feel that Polars is becoming the standard alternative to pandas for bigger datasets within the Python community. See my comment in #17 (comment)
>
> All in all, I agree with @slobentanzer in making it the preferred candidate.

Hi @laulopezreal, I have replied to you there. Again, sorry for missing your comment.

@kpto (Collaborator, Author) commented Nov 7, 2024

Reopening this issue: after evaluating Polars, pyarrow and duckdb, duckdb seems to be the best choice, at least for now. See #25 for the reasons for giving up on Polars. pyarrow lacks query operators, and its memory control is not as good as duckdb's. duckdb maintains excellent memory stability while streaming rows from the dataset to the Python runtime, and it has the operators we are already using in the adapter.
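
A minimal sketch of the kind of streaming I mean (the file glob is a placeholder, and the per-row work is stubbed out):

import duckdb

con = duckdb.connect()
con.execute("SELECT * FROM read_parquet('dataset/*.parquet')")

# fetchmany pulls rows in fixed-size batches instead of materialising
# the whole result, which keeps memory usage flat on large datasets.
total = 0
while batch := con.fetchmany(10_000):
    total += len(batch)  # stand-in for real per-row processing
print(total)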

kpto reopened this Nov 7, 2024