Replace pyspark with duckdb #18
Comments
I agree with giving duckdb a try; I just would not replace the current adapter, but rather create a new one in the adapter directory. That makes it easier to compare and reference in the future.
Can you also list the advantages and disadvantages of Polars? Since you have decided to use duckdb instead of Polars, what was the reason for that?
I simply haven't tried Polars, and if duckdb is already satisfactory, I don't see the reason to spend time on Polars, as we are not searching for the best one.
Why are we not searching for the best one? I am not very familiar with SQL, for example, so this would be a potential barrier for my understanding. On a side note: I would like to provide solutions for as many scenarios as possible. If someone is familiar with Polars, an existing adapter that uses it "as per our recommendation" would be very useful.
Our need is simple: it's just parquet file reading, and parquet itself is a simple format as well. Any package that opens a parquet file should do the job, so there is hardly a "best one" in terms of the functions we need, unless speed is critical. But perhaps Polars resembles a pattern you know better and is easier for our audience to learn; in that case we may consider it as well. I can create a minimal working sample for it; do you want me to start?
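For illustration, a minimal sketch of how small the task really is, here with pyarrow and a placeholder file name:

```python
import pyarrow.parquet as pq

# Reading a parquet file is essentially a one-liner in most packages.
# "targets.parquet" is a hypothetical file name used only for illustration.
table = pq.read_table("targets.parquet")
print(table.num_rows, table.schema)
```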
I think there are many facets to "the best" package: user friendliness, maintainability, dependency considerations, consistency with the rest of the ecosystem, flexibility in deployment, speed, validation capabilities, and probably some others. We don't only want to select a package that reads parquet; if that were our only goal, I'd agree. Our scope, however, is to develop a user-friendly community platform and ecosystem that provides flexibility for a broad range of users; I am thinking beyond a "working Open Targets adapter" here. It needs to fit into the overall mission of the project. Hence, yes, I would definitely at least consider it and compare all potential solutions that seem worth trying.

I would suggest adjusting your branch to introduce additional adapters instead of replacing the existing one, in order to be able to compare syntax, performance, feature richness, API, etc. Feel free to modify the adapter structure itself if you think it can be optimised.
See also #22
If that's the case, then I suggest we refactor the code base first. It was not easy for me to dirty-patch the code to use duckdb when I was creating the working example, and creating further examples in this fashion is obviously inefficient. An improved structure is needed to support the work you mentioned.
What do you mean by code base? What kind of refactor?
Nothing complicated; I merely meant all Python files in this repo. A simple start would be to put different definitions into different files rather than one monolithic file, which reduces scrolling and makes it easier to find the code one wants to focus on. The next step would be to reduce duplication by extracting repeated code into a generic function, so that adapting to a different package becomes easier because only that shared function needs to be modified.
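To make the idea concrete, here is a minimal sketch of such a shared helper; the module and function names are hypothetical, and the duckdb backend is only an example that could be swapped out inside this one function:

```python
# adapters/parquet_io.py (hypothetical module): shared parquet access,
# so that switching the backing package means touching only this function.
from typing import Any, Iterable

import duckdb


def read_parquet_columns(path: str, columns: Iterable[str]) -> list[tuple[Any, ...]]:
    """Return the selected columns of a parquet file as plain Python tuples."""
    column_list = ", ".join(columns)  # column names are assumed to be trusted
    return duckdb.sql(f"SELECT {column_list} FROM read_parquet('{path}')").fetchall()
```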
One notable finding regarding Polars: it seems quite a lot of people think Polars is less stable than duckdb, and I found this method argument in the Polars documentation:
The refactor plan sounds good, feel free to create an issue in the v1 milestone for that as well. The stability observation on Polars seems relevant. I am still interested in pyarrow, generally.
I think this is a good article to show the syntax differences between duckdb, Polars and pyarrow: https://www.confessionsofadataguy.com/pyarrow-vs-polars-vs-duckdb-for-data-pipelines
I like that article, but it actually points towards Polars as the best candidate. Second best (most interesting) IMO is arrow native, at least in the bigger scope of BioCypher, as opposed to just one input adapter. DuckDB seems very SQL-focussed and as such is not fully aligned with the BioCypher philosophy. For me, with little SQL background, the DuckDB snippet in this article is by far the hardest to comprehend. I don't get the author's comment that DuckDB is the most compact / needs the least code: it has a huge plain-text SQL statement, which cannot be reasonably debugged, and which encodes the same logic as the far more readable / debuggable Polars snippet, or even the raw arrow snippet.

FYI, there is this milestone in BioCypher that I created for outlining future options; it relates to this discussion as well as the more general one on modularisation and internal data structure: https://github.com/biocypher/biocypher/milestone/4
I think the best pythonic solution would in any case be to refactor the
Not selling it, just to explain why SQL is preferred by some people. SQL is a commonly used language for heterogeneous data querying in data science; pyspark and Polars actually support querying with SQL as well. To people who are familiar with SQL, the query in the duckdb demonstration is not hard to understand, so I agree with the author's opinion, and you can easily find similar comments elsewhere saying that SQL is more concise at expressing a query, which is natural because, after all, SQL stands for Structured Query Language.

These packages are all query engines, and their main feature is to push some processing down to the data side, as it is more efficient to do it there. To achieve this, an intermediate language is needed to express your query; the engine accepts it and comes up with an execution plan, which is the actual code generated on demand and executed on the data side to fulfil your query. Whether you use SQL or an expression API, both are translated in the end. The use of SQL draws a clear boundary between the data side and the client side: one clearly knows what is passed to, and guaranteed to run on, the engine, whereas with an expression API you may not be sure unless you check the API docs. It is also hard for an expression API to fully cover things up, which sometimes creates awkward patterns like this one in pyarrow:

```python
table.aggregate([("ride_id", "count")])
```

The Polars version is better:

```python
df.agg(pl.count("ride_id").alias("ride_id_count"))
```

It declares special functions, `pl.count` and `alias`. Here is the SQL equivalent:

```sql
SELECT
    COUNT(`ride_id`) AS `ride_id_count`
FROM `table`
GROUP BY `ride_id`
```

For me, being familiar with both Python and SQL, SQL is a lot better in terms of clarity IMO. Of course I admit that SQL is an additional learning barrier, so I don't plan to sell it. The use of SQL basically does not cover anything up, which I think results in a more consistent coding style, as it avoids the awkward pattern above. Another advantage of SQL is that it is platform agnostic, making a switch between packages easier. duckdb actually has an expression API as well, and I tried to use it at first, but it is quite limited (or I don't understand it), so I later changed all my queries to SQL.

Debugging-wise, SQL and expression APIs are not so different: the SQL can be broken down into smaller statements for intermediate state inspection, and deeper debugging requires inspecting the generated execution plan, which is difficult in both cases.
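As an illustration of that last point, a minimal sketch of breaking a duckdb query into inspectable steps; the file name is hypothetical, and it relies on duckdb's ability to reference a Python-side relation by its variable name in a later query:

```python
import duckdb

# Hypothetical parquet file used only for illustration
rides = duckdb.sql("SELECT ride_id FROM read_parquet('rides.parquet')")

# Intermediate state can be inspected before the aggregation is added on top
print(rides.limit(5))

# The relation is referenced by its Python variable name in the next statement
counts = duckdb.sql(
    "SELECT ride_id, COUNT(ride_id) AS ride_id_count FROM rides GROUP BY ride_id"
)
print(counts.df())
```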
Thanks for the explanation, that makes sense. From what I understand, Polars is still the best solution, isn't it? It is the fastest, has an API for those who need it, and can do SQL for those who want it. If there are no other large issues, I would make Polars the preferred candidate.
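A minimal sketch of those two faces of Polars, using a hypothetical parquet file; exact method names can differ slightly between Polars versions:

```python
import polars as pl

# Lazy scan of a hypothetical parquet file; nothing is loaded until .collect()
lf = pl.scan_parquet("rides.parquet")

# Expression API
counts = (
    lf.group_by("ride_id")
    .agg(pl.col("ride_id").count().alias("ride_id_count"))
    .collect()
)

# SQL interface over the same lazy frame, for those who prefer SQL
ctx = pl.SQLContext(rides=lf)
counts_sql = ctx.execute(
    "SELECT ride_id, COUNT(ride_id) AS ride_id_count FROM rides GROUP BY ride_id",
    eager=True,
)
```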
Hi both, @slobentanzer @kpto. In my experience, Polars is a great solution that has proven quite resilient in other projects I've worked on. All in all, I agree with @slobentanzer in making it the preferred candidate.
@slobentanzer I did some more searching on Polars: apart from its uncertain stability in parquet file handling, Polars seems to have a slightly longer history and user base, so I'd say it shouldn't be a bad choice anyway. I will prioritise creating a working sample with Polars, and therefore this one shall be closed.
Closing this one as Polars is preferred. The follow-up issue is #25
Hi @laulopezreal, I have replied to you there; again, sorry for missing your comment.
Reopening this issue: after evaluations of Polars, pyarrow and duckdb, duckdb seems to be the best choice, at least for now. See #25 for the reasons for giving up Polars. pyarrow lacks query operators and its memory control is not as good as duckdb's. duckdb maintains excellent memory usage stability while streaming rows from a dataset to the Python runtime, and it has the operators we are already using in the adapter.
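For context, a minimal sketch of the streaming pattern referred to above, with a placeholder dataset path; fetching in fixed-size batches keeps memory bounded regardless of dataset size:

```python
import duckdb

con = duckdb.connect()
# A glob covers a partitioned parquet dataset; the path is a placeholder
con.execute("SELECT * FROM read_parquet('open_targets/*.parquet')")

while True:
    batch = con.fetchmany(10_000)  # stream rows instead of materialising everything
    if not batch:
        break
    for row in batch:
        pass  # e.g. map the row to a BioCypher node/edge tuple here
```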
Due to the difficulties of running pyspark on Windows (#13 and #17), and because its sole purpose in our application is just to query a parquet file, an alternative that is less OS-dependent is preferred so that the project can serve a wider audience. duckdb, a fast-growing and well-liked data science package that provides similar functions and interfaces to pyspark, seems to be a good candidate. I have conducted a preliminary test of the package and I think it is promising for our use. Here I provide a minimal working sample on branch duckdb-evaluation.
Advantages:
Disadvantages:
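For reference, a minimal sketch of the swap described above; the parquet path is a placeholder, and the commented pyspark lines only illustrate the pattern being replaced:

```python
# pyspark pattern being replaced (requires a JVM and a Spark session):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# df = spark.read.parquet("targets.parquet")

# duckdb equivalent: a plain pip-installable package, no JVM required
import duckdb

df = duckdb.sql("SELECT * FROM read_parquet('targets.parquet')").df()
print(df.head())
```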