-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Converting non-32-bit integers from python to R to use bit64 #323
Comments
Related to #265 ? Although a bit different because I am talking about native integer types, which differ in Python to R and thus do not have a one-to-one mapping (i.e. multiple Python integers map to the same R integer). |
library(reticulate)
py_run_string("x = 123")
main <- import_main(convert = FALSE)
class(main$x)
#> [1] "python.builtin.int" "python.builtin.object"
py_to_r.python.builtin.int <- function(x) {
message("hello!")
reticulate:::py_to_r.default(x)
}
py$x
#> hello!
#> [1] 123
Created on 2018-07-25 by the reprex package (v0.2.0). That said, I'm not sure what the best solution here is. I think that whatever is done it should be stable and consistent (ie whether you get an R integer vector or a Maybe # user-defined conversion functions
converters <- list(python.builtin.int = function(x) { <makes a bit64 vector> })
# would call user-defined converter in preference to built-in one
py_to_r(py$x, converters = converters) This way, at least the user could request a specific conversion type when the default doesn't suffice. Will have to think about this a bit more. |
It looks like there is actually some inconsistency in how we handle integer type conversion. For scalars, integers are always converted to integers: https://github.com/rstudio/reticulate/blob/master/src/python.cpp#L529-L553 So for PyLong scalars (which are probably 32-bits on 32-bit platforms and 64-bits on 64-bit platforms?) we will always coerce to R 32 bit integers, which has potential for lossage. However, for NumPy arrays we are careful to convert wider integer types to double: https://github.com/rstudio/reticulate/blob/master/src/python.cpp#L191-L238 So if we could know that data was a 64-bit integer we could explicitly force it to R double (as we already do for NumPy) |
Was just on my way here to submit a similar issue. Adding in another example: library(reticulate)
boto3 <- import("boto3")
session <- boto3$session$Session(profile_name = "x")
client <- session$client("athena")
res <- client$get_query_execution(QueryExecutionId = "x")
res$QueryExecution$Statistics$DataScannedInBytes
## 1814623667 vs import boto3
session = boto3.session.Session(profile_name = "x")
client = session.client("athena")
res = client.get_query_execution(QueryExecutionId = "x")
res["QueryExecution"]["Statistics"]["DataScannedInBytes"]
## 104893838771 Caught it when I was trying to write a cost estimator. It's an |
FWIW using a double is also dangerous since that only represents values up to 2^53 without loss of precision and that's definitely caused issues in the past (e.g. database keys overflowing storage). From what I understand, Python has two types for representing integers:
E.g. on my 64bit machine:
I'm not sure what the right solution here is :-/ In theory converting Python |
Aye. A "for instance" is that But, now I'm rly reticent abt doing a great deal with At the very least this current, "errant" behaviour (IMO) needs some top-level emphasis outside of a GH issue since I think quite a few folks assume it "just works" (like I had). Honestly, the |
So would the proposal be for us to depend on the Even though some packages might not like seeing |
I think that's reasonable. I also think that if we make this choice we might want to have a more global conversation about making |
It's popping up in a few places. My experience (from building |
There is one thing to be aware of for Also worth noting that there are some cases where > mean(as.integer64(1:2))
integer64
[1] 1 That said, I think it's overall worth taking this on assuming we don't step on any particularly egregious land mines in the switch over. |
Consider > mean(as.integer64(1:3))
integer64
[1] 2
> though -- it "simply" insists that its type sticks. Might be an issue here, but as I recall, "documented" over there. |
Right;
so while most things behave the same as one would expect for R integers, this isn't true for all operations -- so we definitely need to keep in mind that if we substitute |
The CRAN page (https://cran.r-project.org/web/packages/bit64/index.html) even states:
|
The lack of subset means that e.g. this doesn't work as expected: suppressPackageStartupMessages(library(bit64))
df <- data.frame(x = 1, y = 2, z = 3)
df[as.integer64(2)]
#> data frame with 0 columns and 1 row and I think this will be problematic :-/ |
I think we could get something a little bit better here with ALTREP. By default, we would allow the underlying representation to remain as 64-bit integers, but convert to R's own 32-bit integers as needed for e.g. indexing and so on. Other mathematical operations could use regular old S3 dispatch (as |
For what it's worth, my issue is also related to representing database keys, which are normally stored as long / bigint. One thing I have been using as a workaround is to represent the key as a string, then use reticulate to pull it in, and then finally cast that to an integer64 |
So I have been trying to figure out a workaround for this one, but it seems like the library(reticulate)
py_to_r.python.builtin.int <- function(x) {
message("hello!")
x <- as.character(x)
x <- bit64::as.integer64(x)
}
reticulate::py_run_string('obj = 2435823760;')
py$obj
#> hello!
#> integer64
#> [1] 2435823760
reticulate::py_run_string('obj2 = {"key": 2435823760};')
py$obj2
#> $key
#> [1] -1859143536
# not a different class?
main <- import_main(convert = FALSE)
class(main$obj2)
#> [1] "python.builtin.dict" "python.builtin.object"
class(main$obj2$key)
#> [1] "python.builtin.int" "python.builtin.object"
# but now it fires?
main$obj2
#> hello!
#> {'key': 2435823760}
main$obj2$key
#> 2435823760
# just not on the py$ object
py$obj2$key
#> [1] -1859143536 Created on 2019-01-27 by the reprex package (v0.2.1) |
This is because the default method We could implement a |
Interesting. Thanks Kevin. Is there any conceivable way for a user to work-around or modify this type of behavior? I.e. short of just rewriting all of these methods? I ask b/c I am working with an OOM in this case and have no qualms about working around the issue, but it feels like I'm sort of stuck atm. I tried poking at a dict method, but I felt very much out of my depth and it wasn't clear if that would even help 😄 Maybe it would be just importing the main manually with |
I don't think there's any way to override this right now (short of registering your own |
Any further thinking on this one and what the best long-term approach is? I would love to have a nice way to resolve this, if possible 😄 |
I fear that because of the lack of binary representation on the R side you always need a special case / trick. R after all only has
We have the same problem with |
We talked about an int64 backend for vctrs. It should work much better than bit64, for instance with |
How might one go about implementing the soln that @eddelbuettel uses in nanotime? Lack of 64bit conversion all but prevents me from using reticulate daily |
Took a swing at it @ #590. |
I hope it is helpful for people who need a short term workaround while waiting #590 merged. In general, we can convert a python object to JSON, then transfer it to R. It is not an idea solution for super long integer that require special handling to parse JSON string as discussed in https://stackoverflow.com/questions/65009389/parse-json-with-big-integer-in-r. Input
Output
|
I tried to trace
py_to_r
for integer conversion, but I have struggled to do so. I'm not sure whether this is something that could be implemented inreticulate
, or if I should just implement some type of S3 workaround in my code.In any case, I thought it would be desirable for reticulate to convert python integers that do not fit into 32bit integers into a 64-bit analog in R (i.e. the
bit64
package). Not sure if we are waiting for R to adopt a standard here - the Python conversion is super smooth for the objects I am working with minus these IDs that get destroyed by the 32bit issue.If I were to implement a workaround, overriding with a string conversion would work fine for IDs too. It looked like
int
dispatch may go throughpy_to_r.default
? Would a workaround implement a custompy_to_r.int
function? I may be missing something about how this should work.The text was updated successfully, but these errors were encountered: