Support inserting data into BigQuery directly from a Polars DataFrame #1979

Open
henryharbeck opened this issue Jul 18, 2024 · 10 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

@henryharbeck

henryharbeck commented Jul 18, 2024

Is your feature request related to a problem? Please describe.
The Polars DataFrame library has been gaining a lot of traction and many are writing new pipelines in Polars and/or moving from pandas to Polars. It would be great to add native support between the BigQuery client library and Polars.

Describe the solution you'd like
This request is to allow inserting data directly from a Polars DataFrame into a BigQuery table.
An additional bonus would be not requiring PyArrow to be installed.
I would be open to expanding client.load_table_from_dataframe to also accept Polars DataFrames, or new dedicated method(s) being created.

Describe alternatives you've considered

  1. Convert to pandas at the end of the pipeline and use client.load_table_from_dataframe to insert the data. Not ideal to require an additional dependency just to insert data. Furthermore, I don't believe that pandas supports complex types available in both BigQuery and Polars, such as structs and arrays.

  2. Write the DataFrame to a bytes stream as a parquet file and insert the data with client.load_table_from_file. The intent of this code is a lot less obvious, and it would be much nicer to have more native support. Note that this is also the suggested approach in the Polars user guide (rightfully so IMO as it does not require any additional dependencies).

  3. Do not support Polars directly, but instead support inserting data from a PyArrow table. This is not currently feasible, but would be an alternative feature request. This is not preferable as the option above already allows inserting data without a PyArrow dependency. From looking at the docs (haven't checked the source), this potentially looks to have some overlap with what client.load_table_from_dataframe already does.
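
For reference, alternative 2 could look roughly like the sketch below. This is only an illustration under stated assumptions, not an API of the client library: the helper names load_polars_frame and parquet_load_config are hypothetical, and the table ID in the usage note is a placeholder. The calls themselves (DataFrame.write_parquet, client.load_table_from_file, LoadJobConfig, SourceFormat.PARQUET, ParquetOptions.enable_list_inference) exist in Polars and google-cloud-bigquery.

```python
import io


def parquet_load_config():
    """Build a load job config for Parquet input (needs google-cloud-bigquery)."""
    from google.cloud import bigquery  # deferred: only needed when this is called

    parquet_options = bigquery.ParquetOptions()
    # Ask BigQuery to map Parquet LIST columns to REPEATED fields.
    parquet_options.enable_list_inference = True
    return bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        parquet_options=parquet_options,
    )


def load_polars_frame(client, df, destination, job_config):
    """Hypothetical helper: load a Polars DataFrame into a BigQuery table."""
    stream = io.BytesIO()
    df.write_parquet(stream)  # Polars' own Parquet writer, no pandas involved
    stream.seek(0)  # rewind so the client reads the file from the start
    job = client.load_table_from_file(stream, destination, job_config=job_config)
    return job.result()  # block until the load job finishes
```

Usage would be along the lines of load_polars_frame(bigquery.Client(), df, "my-project.my_dataset.my_table", parquet_load_config()), where the table ID is made up. Keeping the config separate from the upload keeps the upload step free of any dependency beyond Polars and the client.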

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Jul 18, 2024
@tswast tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Aug 16, 2024
@tswast
Contributor

tswast commented Aug 16, 2024

I have some thoughts on this: I think a separate package for BigQuery + Polars (similar to pandas-gbq) would be most appropriate here, especially given the desire to avoid pyarrow as a dependency.

@henryharbeck
Author

henryharbeck commented Aug 31, 2024

Thanks for the response @tswast

It would be great if the BigQuery client accepted Polars DataFrames in addition to pandas DataFrames in client.load_table_from_dataframe. It would currently be possible without a pyarrow dependency as outlined in point 2 in the issue description. I would be happy to provide an implementation if accepted. I do acknowledge that this would be an additional optional dependency.

I can also appreciate that a separate package would provide a consolidated reading and writing interface, which I am definitely in support of. Do you have in mind that this package would be developed / owned by Google? If yes, would it be possible / in-scope to support reading from BigQuery into Polars without a pyarrow dependency? At the moment it seems that polars.from_arrow(client.query(...).to_arrow()) is (from my understanding) the most performant option.
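
Spelled out, the read path mentioned above is roughly the following. It is a minimal sketch: the helper name query_to_polars is hypothetical, while client.query, QueryJob.to_arrow and polars.from_arrow are existing calls. Note that to_arrow is the step that needs pyarrow.

```python
def query_to_polars(client, sql):
    """Hypothetical helper: run a query and return a Polars DataFrame."""
    import polars as pl  # deferred so polars stays an optional dependency

    arrow_table = client.query(sql).to_arrow()  # this step requires pyarrow
    return pl.from_arrow(arrow_table)  # convert the Arrow table to Polars
```

It would be called as query_to_polars(bigquery.Client(), "SELECT 1 AS x"), assuming default credentials are configured.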

As a heads up, I think I will also request that Polars provide some support for the BigQuery client, and implement the approaches mentioned in their user guide.

EDIT: here is that request - pola-rs/polars#18547

@tswast
Contributor

tswast commented Sep 6, 2024

Do you have in mind that this package would be developed / owned by Google?

Yes, that's my thought, though if the community were to create one before I make it through all the red tape needed to make such a thing happen, I'd gladly contribute there, instead. ;-)

If yes, would it be possible / in-scope to support reading from BigQuery into Polars without a pyarrow dependency?

That's my thought with regards to a separate package. No pyarrow necessary if the focus is just polars. The test suite for google-cloud-bigquery is complicated enough as it is without adding a test environment where polars is installed but pyarrow is not.

@tswast
Contributor

tswast commented Sep 6, 2024

I know you asked for writes, but I figured I'd try the read path today, and I was able to get BigQuery table -> polars DataFrame without pyarrow. Check out this gist: https://gist.github.com/tswast/99b017b20386e324f5c7d2bd49f21b5f#file-bigquery-to-polars-no-pyarrow-ipynb

Obviously, it's single threaded, missing a lot of boilerplate, and doesn't support query inputs, but as a proof of concept, I was happy to see it's possible.

@tswast
Contributor

tswast commented Dec 13, 2024

I've confirmed it is possible to write to BigQuery from polars without pyarrow in this gist: https://gist.github.com/tswast/4e2fb2cca1c1fecf8fb697e94102358f

I've mailed pola-rs/polars#20292 to update the polars docs to set this option.

@henryharbeck
Author

figured I'd try the read path today, and I was able to get BigQuery table -> polars DataFrame without pyarrow

This is great! I must say I need to familiarise myself with the BigQuery Storage Read API.

Thank you for your Polars PR as well.

Is there any movement on creating a separate Polars / BQ package within Google?

@tswast
Contributor

tswast commented Jan 3, 2025

Since I see there is some appetite to include such functionality in polars itself (pola-rs/polars#17326) or possibly in a polars plugin package now that I/O plugins seem to be supported (https://github.com/pola-rs/pyo3-polars/tree/main/example/io_plugin, pola-rs/pyo3-polars#94), my approach has been to use some 20% time to try and contribute first a read plugin and then a write plugin. I don't have anything working yet, but https://github.com/tswast/polars/blob/issue17326-scan-bigquery/py-polars/polars/io/bigquery.py is the start of the branch I have for that.

The other approach we could take is to integrate this functionality into BigQuery DataFrames (googleapis/python-bigquery-dataframes#735), which I would definitely like to support and would be a bit simpler to implement, since bigframes already depends on pyarrow.

@henryharbeck
Author

Thank you @tswast, that is awesome! I very much appreciate your work on this.

In the Polars repo or a Polars I/O plugin sounds great. A scan function supporting column/predicate pushdown would be amazing.
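
To illustrate what pushdown could mean here: the BigQuery Storage Read API accepts a column list (selected_fields) and a server-side filter string (row_restriction) in its table read options, so a scan function could forward its projection and predicate there. The sketch below only shows that mapping. The helper name build_read_options is hypothetical, and translating a Polars expression into a filter string is assumed to happen elsewhere.

```python
def build_read_options(columns=None, row_filter=None):
    """Hypothetical helper: map a scan's pushdowns to Storage Read API options."""
    options = {}
    if columns:
        # Projection pushdown: only the requested columns are read.
        options["selected_fields"] = list(columns)
    if row_filter:
        # Predicate pushdown: a SQL-like filter evaluated on the server side.
        options["row_restriction"] = row_filter
    return options
```

For example, build_read_options(["name", "year"], "year >= 2000") yields a dict that could be unpacked into the read options of a read session in google-cloud-bigquery-storage, so that only two columns and the matching rows leave BigQuery.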

BigFrames support also seems in reasonable demand, but as you can probably tell I am personally trying to avoid the pyarrow dependency, haha

@CesarArroyo09

Hey @tswast, I am very interested in participating in this development. I have around 6 years working with Python and very noob on rust but eager to learn.

If you have specific tasks that need to be executed and you can point to me, I can dedicate ~3 hours a week to this.

@tswast
Contributor

tswast commented Jan 21, 2025

Hi @CesarArroyo09 , thanks for reaching out!

I think the most promising way to get this done would be to contribute to the polars project directly. I'm making some good progress on a scan_bigquery function (pola-rs/polars#17326 (comment)) a few hours a week as well.

I could definitely use some help. Besides a "scan" method that I'm working on, there's also read_database and write_database in the polars library that could use some BigQuery support.

I have around 6 years working with Python and very noob on rust but eager to learn.

I'm eager to learn Rust, too, but official Rust clients for Google Cloud are still very much in their infancy (https://github.com/googleapis/google-cloud-rust). I suspect it'll probably be another year or so before we can really take advantage of them. In the meantime, I think the Polars + BigQuery connectors can be written in pure Python, even if it won't be as efficient as a Rust integration.
