Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read.xlsx() does not return the correct number of rows #307

Open
daattali opened this issue Dec 7, 2021 · 10 comments
Open

read.xlsx() does not return the correct number of rows #307

daattali opened this issue Dec 7, 2021 · 10 comments
Labels

Comments

@daattali
Copy link

daattali commented Dec 7, 2021

(Duplicate of #304 but it was closed prematurely and I don't have the option of commenting+reopening).

To continue from the last comment there: @JanMarvin I think you misunderstood the issue. You'll see the problem is you use cols = 2 in your example.

You can use this test file
test.xlsx

Calling openxlsx::read.xlsx("test.xlsx", rows = 2:4, cols = 2, colNames = FALSE) returns

  X1
1  a
2  b

But it SHOULD return

  X1
1  a
2  b
3

Another way to make the problem obvious is trying to read rows 4:6, which will return the following:

NULL
Warning message:
No data found on worksheet. 

But asking for 4:7 returns a 1-row dataframe. It should return 4 rows.

@JanMarvin
Copy link
Collaborator

Okay, now - that you provide an example - I see what you mean and I agree that this behavior is unpleasant. Though I don't see an easy way to fix this cleanly. The issue is within getCellInfo(). This function returns the data from the worksheet, but in this case, there is no data on the worksheet and nothing is returned. We could hackishly try to pad the dataframe we are going to return, with cells until the user selected number of rows/cols is matched, but I'd say the fix should be added prior. Anyways, I'm not going to fix the getCellInfo() function, it's old code and probably needs to be left alone.

@daattali
Copy link
Author

daattali commented Dec 7, 2021

I won't pretend to know anything about the internals/technical details so I can't suggest the right fix. But with the risk of being ignorant, I would suggest adding some sort of padding with NULL values that can happen in read.xlsx if you think it doesn't belong in getCellInfo. Or perhaps getCellInfo can gain a boolean parameter of whether or not to always return the expected dimensions or if to simplify.

@daattali
Copy link
Author

Another way to see the dangerous bugs this can cause:

Suppose you have this Excel data
test.xlsx

Row Value
1
2 a
3

When I try to grab different sets of rows, I always get the same output, which makes it impossible for me to know what the actual data in the sheet looks like! Rows 2:3, 2:4, 3:4, 2:5 - all of them return just a single value "a" and I don't know what row it's in.

@JanMarvin JanMarvin linked a pull request Dec 15, 2021 that will close this issue
@JanMarvin
Copy link
Collaborator

JanMarvin commented Dec 15, 2021

I have added a pull request #309 but merely to draft the issue. With this pull request the output looks as follows. Though this draft is not ment for merging.

There are two reasons for this:

  • I think we are way to late in for such a change. Even though it might be reasonable, but and we could break legacy code. Obviously I understand why one might have a different impression
  • The way I have implemented it - in base R - might work for small files, but most likely is rather slow with larger files. Like I have said, this should be fixed on the C++ level and I am not going to fix that for this project.
  • There are most likely a few corner cases to think about, even if CI is fine (edit: CI isn't fine)
> library(openxlsx)

> read.xlsx("~/gh_issue_307.xlsx")
  Row Value
1   1  <NA>
2   2     a
3   3  <NA>

> read.xlsx("~/gh_issue_307.xlsx", rows = 2:3, cols = 2:3, 
+           colNames = FALSE, skipEmptyRows = FALSE, skipEmptyCols = TRUE)
    X2
1    a
2 <NA>

> read.xlsx("~/gh_issue_307.xlsx", rows = 2:3, cols = 2:3, 
+           colNames = FALSE, skipEmptyRows = TRUE, skipEmptyCols = FALSE)
  X2 X3
1  a NA

> read.xlsx("~/gh_issue_307.xlsx", rows = 2:3, cols = 2:3, 
+           colNames = FALSE, skipEmptyRows = FALSE, skipEmptyCols = FALSE)
    X2 X3
1    a NA
2 <NA> NA


> read.xlsx("~/gh_issue_304.xlsx")
  Row Value
1   1     a
2   2     b
3   3  <NA>
4   4  <NA>
5   5  <NA>
6   6     c

> read.xlsx("~/gh_issue_304.xlsx", rows = 2:4, cols = 2,
+           colNames = FALSE, skipEmptyRows = F)
    X2
1    a
2    b
3 <NA>

> read.xlsx("~/gh_issue_304.xlsx", rows = 6:8, cols = 1:3,
+           colNames = FALSE, skipEmptyRows = F, skipEmptyCols = F)
  X1   X2 X3
1  5 <NA> NA
2  6    c NA
3 NA <NA> NA

@daattali
Copy link
Author

Regarding the first point: I'm also extremely conservative myself wrt to backwards compatibility/breaking changes, so I would think it's best to add a parameter and not change the default behaviour.

JanMarvin referenced this issue in JanMarvin/openxlsx2 Jan 23, 2022
* change a few checks for cc as uninitializedField to NULL
* update C++ loadvals() to use Environment (R6) rather than Reference
* add more missing `self$`
* some cleanup in other files from debugging
@github-actions
Copy link

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 17, 2022
@daattali
Copy link
Author

This is still relevant, bot

@github-actions github-actions bot removed the Stale label Dec 24, 2022
Copy link

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 30, 2023
@daattali
Copy link
Author

@ycphs are there plans to resolve this bug?

@github-actions github-actions bot removed the Stale label Jan 6, 2024
Copy link

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jan 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants