read.xlsx() does not return the correct number of rows #307

daattali · 2021-12-07T01:58:09Z

(Duplicate of #304 but it was closed prematurely and I don't have the option of commenting+reopening).

To continue from the last comment there: @JanMarvin I think you misunderstood the issue. You'll see the problem if you use cols = 2 in your example.

You can use this test file
test.xlsx

Calling openxlsx::read.xlsx("test.xlsx", rows = 2:4, cols = 2, colNames = FALSE) returns

  X1
1  a
2  b

But it SHOULD return

Another way to make the problem obvious is trying to read rows 4:6, which will return the following:

NULL
Warning message:
No data found on worksheet.

But asking for 4:7 returns a 1-row dataframe. It should return 4 rows.
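
The report above can be reproduced with a sketch along these lines. It assumes that test.xlsx has a Row/Value header in sheet row 1 followed by the values a, b, three empty cells, and c (the layout shown later in the thread for gh_issue_304.xlsx). The temporary file path and variable names are illustrative only.

```r
library(openxlsx)

# Assumed layout: header in sheet row 1, data in sheet rows 2 to 7.
df <- data.frame(
  Row = 1:6,
  Value = c("a", "b", NA, NA, NA, "c"),
  stringsAsFactors = FALSE
)

# NA values are written as empty cells by default (keepNA = FALSE).
path <- tempfile(fileext = ".xlsx")
write.xlsx(df, path)

# Three sheet rows are requested; the thread reports that only two come back.
res_2_4 <- read.xlsx(path, rows = 2:4, cols = 2, colNames = FALSE)
print(res_2_4)

# Sheet rows 4 to 6 are empty in column 2; the thread reports NULL plus a warning.
res_4_6 <- read.xlsx(path, rows = 4:6, cols = 2, colNames = FALSE)
print(res_4_6)
```

The exact output may differ between openxlsx versions, so no expected result is asserted here.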

JanMarvin · 2021-12-07T18:08:05Z

Okay, now that you have provided an example, I see what you mean, and I agree that this behavior is unpleasant. However, I don't see an easy way to fix this cleanly. The issue is within getCellInfo(). This function returns the data from the worksheet, but in this case there is no data on the worksheet and nothing is returned. We could hackishly pad the data frame we are going to return with cells until the user-selected number of rows/cols is matched, but I'd say the fix should be applied earlier than that. Anyway, I'm not going to fix the getCellInfo() function; it's old code and probably needs to be left alone.

daattali · 2021-12-07T19:49:53Z

I won't pretend to know anything about the internals/technical details, so I can't suggest the right fix. But at the risk of being ignorant, I would suggest adding some sort of padding with NULL values that can happen in read.xlsx if you think it doesn't belong in getCellInfo. Or perhaps getCellInfo can gain a boolean parameter that controls whether to always return the expected dimensions or to simplify.
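
A minimal base R sketch of the padding idea follows. It is not openxlsx code: pad_rows() is a hypothetical helper, and it assumes the caller already knows which sheet rows the returned data came from (found_rows), which read.xlsx does not report today. Empty rows are filled with NA rather than NULL, since a data frame column cannot hold NULL.

```r
# Hypothetical helper: expand `df` so it has one row per requested sheet row.
# `found_rows` gives the sheet row index of each existing row of `df`.
pad_rows <- function(df, found_rows, requested_rows) {
  stopifnot(nrow(df) == length(found_rows))
  # match() yields NA for requested rows that had no data;
  # indexing a data frame with NA produces an all-NA row.
  idx <- match(requested_rows, found_rows)
  out <- df[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# What the thread reports for rows = 2:4: only the two non-empty rows.
found <- data.frame(X1 = c("a", "b"), stringsAsFactors = FALSE)

padded <- pad_rows(found, found_rows = c(2L, 3L), requested_rows = 2:4)
print(padded)
```

With these inputs the result has three rows, the last one NA, which is the shape the report asks for.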

daattali · 2021-12-13T22:03:52Z

Another way to see the dangerous bugs this can cause:

Suppose you have this Excel data
test.xlsx

Row	Value
1
2	a
3

When I try to grab different sets of rows, I always get the same output, which makes it impossible for me to know what the actual data in the sheet looks like! Rows 2:3, 2:4, 3:4, 2:5 - all of them return just a single value "a" and I don't know what row it's in.

JanMarvin · 2021-12-15T20:34:26Z

I have added a pull request #309, but merely to draft the issue. With this pull request the output looks as follows, though this draft is not meant for merging.

There are two reasons for this:

1. I think we are way too late for such a change. Even though it might be reasonable, we could break legacy code. Obviously I understand why one might have a different impression.
2. The way I have implemented it, in base R, might work for small files but is most likely rather slow with larger files. As I have said, this should be fixed at the C++ level, and I am not going to fix that for this project.
3. There are most likely a few corner cases to think about, even if CI is fine (edit: CI isn't fine).

> library(openxlsx)

> read.xlsx("~/gh_issue_307.xlsx")
  Row Value
1   1  <NA>
2   2     a
3   3  <NA>

> read.xlsx("~/gh_issue_307.xlsx", rows = 2:3, cols = 2:3, 
+           colNames = FALSE, skipEmptyRows = FALSE, skipEmptyCols = TRUE)
    X2
1    a
2 <NA>

> read.xlsx("~/gh_issue_307.xlsx", rows = 2:3, cols = 2:3, 
+           colNames = FALSE, skipEmptyRows = TRUE, skipEmptyCols = FALSE)
  X2 X3
1  a NA

> read.xlsx("~/gh_issue_307.xlsx", rows = 2:3, cols = 2:3, 
+           colNames = FALSE, skipEmptyRows = FALSE, skipEmptyCols = FALSE)
    X2 X3
1    a NA
2 <NA> NA


> read.xlsx("~/gh_issue_304.xlsx")
  Row Value
1   1     a
2   2     b
3   3  <NA>
4   4  <NA>
5   5  <NA>
6   6     c

> read.xlsx("~/gh_issue_304.xlsx", rows = 2:4, cols = 2,
+           colNames = FALSE, skipEmptyRows = F)
    X2
1    a
2    b
3 <NA>

> read.xlsx("~/gh_issue_304.xlsx", rows = 6:8, cols = 1:3,
+           colNames = FALSE, skipEmptyRows = F, skipEmptyCols = F)
  X1   X2 X3
1  5 <NA> NA
2  6    c NA
3 NA <NA> NA

daattali · 2021-12-16T10:26:57Z

Regarding the first point: I'm also extremely conservative myself with respect to backwards compatibility/breaking changes, so I would think it's best to add a parameter and not change the default behaviour.
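
One way to picture such an opt-in parameter is the base R sketch below. Everything in it is an assumption made for illustration: select_rows(), its keep_dims argument, and the cells table (one record per non-empty cell) are hypothetical and do not reflect the openxlsx API or internals.

```r
# Hypothetical cell table: sheet row index and value of each non-empty cell
# in column 2 of the file discussed above (header row excluded).
cells <- data.frame(
  row = c(2L, 3L, 7L),
  value = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Illustrative helper, not part of openxlsx.
select_rows <- function(cells, rows, keep_dims = FALSE) {
  hit <- cells[cells$row %in% rows, , drop = FALSE]
  if (!keep_dims) {
    # Default: only rows that contain data, so existing callers see no change.
    return(data.frame(X1 = hit$value, stringsAsFactors = FALSE))
  }
  # Opt-in: one output row per requested sheet row, NA where the cell is empty.
  data.frame(X1 = hit$value[match(rows, hit$row)], stringsAsFactors = FALSE)
}

legacy <- select_rows(cells, rows = 4:7)
padded <- select_rows(cells, rows = 4:7, keep_dims = TRUE)
print(legacy)
print(padded)
```

Because keep_dims defaults to FALSE, code written against the current behaviour keeps getting the one-row result, while new callers can request all four rows.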

github-actions · 2022-12-17T02:13:23Z

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

daattali · 2022-12-17T02:38:29Z

This is still relevant, bot

github-actions · 2023-12-30T01:58:50Z

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

daattali · 2023-12-30T03:44:01Z

@ycphs are there plans to resolve this bug?

github-actions · 2025-01-11T02:28:42Z

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

JanMarvin linked a pull request Dec 15, 2021 that will close this issue

Gh issue 307 #309

Closed

github-actions bot added the Stale label Dec 17, 2022

github-actions bot removed the Stale label Dec 24, 2022

github-actions bot added the Stale label Dec 30, 2023

github-actions bot removed the Stale label Jan 6, 2024

github-actions bot added the Stale label Jan 11, 2025
