Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV reader issue on Windows with polars rs-0.44.2 #1288

Closed
eitsupi opened this issue Nov 17, 2024 · 8 comments · Fixed by #1292
Closed

CSV reader issue on Windows with polars rs-0.44.2 #1288

eitsupi opened this issue Nov 17, 2024 · 8 comments · Fixed by #1292
Labels
bug Something isn't working

Comments

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 17, 2024

It seems that an error is occurring in the following location only on Windows.

url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv"
dtypes = list(
"first_name" = pl$Categorical(),
"gender" = pl$Categorical(),
"type" = pl$Categorical(),
"state" = pl$Categorical(),
"party" = pl$Categorical()
)
# dtypes argument
dataset = pl$read_csv(url)$with_columns(pl$col("birthday")$str$strptime(pl$Date, "%Y-%m-%d"))

Execution halted with the following contexts
   0: In R: in pl$read_csv():
   0: During function call [tools::buildVignettes(dir = ".", tangle = TRUE)]
   1: Encountered the following error in Rust-Polars:
      	could not parse `"George Read (American politician, born 1733)"
` as dtype `str` at column 'wikipedia_id
' (column number 36)
      The current offset in the file is 4916 bytes.
      You might want to try:
      - increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
      - specifying correct dtype with the `schema_overrides` argument
      - setting `ignore_errors` to `True`,
      - adding `"George Read (American politician, born 1733)"
` to the `null_values` list.
      Original error: ```invalid csv file
      Field `"George Read (American politician, born 1733)"
` is not properly escaped.```

The cell in the CSV in issue is now this, and a comma can be found there.

wikipedia_id
"George Read (American politician, born 1733)"

@etiennebacher Could you please check if you can reproduce the problem in Windows (R and Python)?

Maybe related to this change: pola-rs/polars#19088

Originally posted by @eitsupi in #1271 (comment)

@eitsupi eitsupi added bug Something isn't working help wanted Extra attention is needed labels Nov 17, 2024
@eitsupi eitsupi changed the title CSV reader issue on Windows CSV reader issue on Windows with polars 0.44.2 Nov 17, 2024
@eitsupi eitsupi changed the title CSV reader issue on Windows with polars 0.44.2 CSV reader issue on Windows with polars rs-0.44.2 Nov 17, 2024
@etiennebacher
Copy link
Collaborator

Could you please check if you can reproduce the problem in Windows (R and Python)?

I don't have access to a Windows machine, I'll need a VM. I'll try that later today if possible

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 18, 2024

Oh, sorry. I misunderstood that you were using Windows for development.

I guess one way is to do a release and wait for user feedback rather than spending time debugging on Windows, what do you think about doing a release in this state?

@etiennebacher
Copy link
Collaborator

etiennebacher commented Nov 18, 2024

Oh, sorry. I misunderstood that you were using Windows for development.

Actually I had a windows machine until recently but I don't have access to it anymore, you couldn't know.

I guess one way is to do a release and wait for user feedback rather than spending time debugging on Windows, what do you think about doing a release in this state?

I'd rather try to fix this before the release. I'll take a look tonight.

Also I'd like to check that it doesn't break anything in tidypolars before releasing this. Is that ok with you? Besides that, I'm not planning to add things here since I'm now focusing more on the rewrite.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 18, 2024

Ok, thanks.
Of course, if you could confirm this prior to release, I would appreciate it.

In fact I usually use a Windows machine but only use Docker for programming and in the past I had a bad experience with Rust and cmake on Windows so I am not confident to build polars on Windows.......

@etiennebacher
Copy link
Collaborator

etiennebacher commented Nov 18, 2024

I can reproduce this error with r-polars on Windows.

If I use py-polars, using the URL directly works fine. However, if I use R's download.file() first (what we do internally) and pass the downloaded CSV file to Python, then I can reproduce the error also in py-polars. Therefore, the error must come from the way we download the CSV file. I'll look more into it.

(It's very annoying to do this in a VM so I'm looking for an easy fix on the R side, I don't plan on doing more important changes in Rust.)

@etiennebacher
Copy link
Collaborator

@eitsupi I'm going to bed, if you're fine with the PR to close this issue then you can make a new release, I prepared tidypolars for it.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 18, 2024

Thanks!

Another thing I noticed is that mode="wb" is missing; the equivalent line in the arrow package appears to have it.

download.file(url = actual_url, destfile = cache_temp_file[[actual_url]])

https://github.com/apache/arrow/blob/ad752482f819df638d9438feff6cad3c49946fc7/r/R/io.R#L247

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 19, 2024

I think I understand the problem: if mode is not specified (mode="w"), download.file() replaces the newline character LF with CRLF on Windows.
The manual says:

The choice of binary transfer (mode = "wb" or "ab") is important on Windows, since unlike Unix-alikes it does distinguish between text and binary files and for text transfers changes ‘⁠\n⁠’ line endings to ‘⁠\r\n⁠’ (aka ‘CRLF’).

In this case, the newline character in the original CSV file was CRLF, so all lines would end in CRCRLF.
We can see this by looking at the files downloaded in the two ways with the following commands.

Original one is ended with \r\n (CRLF), broken one is ended with \r\r\n (CRCRLF).

$ od -c orig.csv | tail
5621460       B   u   i   l   d   i   n   g       W   a   s   h   i   n
5621500   g   t   o   n       D   C       2   0   5   1   5   -   0   9
5621520   0   1   ,   2   0   2   -   2   2   5   -   4   1   3   6   ,
5621540   ,   ,   ,   ,   ,   ,   ,   ,   G   0   0   0   5   7   8   ,
5621560   ,   N   0   0   0   3   9   5   0   3   ,   ,   H   6   F   L
5621600   0   1   1   1   9   ,   1   0   4   5   2   8   ,   4   1   2
5621620   6   9   0   ,   1   1   7   1   0   1   ,   M   a   t   t    
5621640   G   a   e   t   z   ,   ,   2   1   7   1   9   ,   M   a   t
5621660   t       G   a   e   t   z  \r  \n
5621671
$ od -c broken.csv | tail
5651240   f   f   i   c   e       B   u   i   l   d   i   n   g       W
5651260   a   s   h   i   n   g   t   o   n       D   C       2   0   5
5651300   1   5   -   0   9   0   1   ,   2   0   2   -   2   2   5   -
5651320   4   1   3   6   ,   ,   ,   ,   ,   ,   ,   ,   ,   G   0   0
5651340   0   5   7   8   ,   ,   N   0   0   0   3   9   5   0   3   ,
5651360   ,   H   6   F   L   0   1   1   1   9   ,   1   0   4   5   2
5651400   8   ,   4   1   2   6   9   0   ,   1   1   7   1   0   1   ,
5651420   M   a   t   t       G   a   e   t   z   ,   ,   2   1   7   1
5651440   9   ,   M   a   t   t       G   a   e   t   z  \r  \r  \n
5651457

So I think mode="wb" will solve the problem.

@eitsupi eitsupi removed the help wanted Extra attention is needed label Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants