Column type guessing #491

Open
raffaem opened this issue Aug 23, 2024 · 1 comment

raffaem commented Aug 23, 2024

readxl::read_excel guesses column types from the values in the first guess_max rows, where guess_max is an argument of the function (reference).

This causes problems on import: a numeric cell far down in my Excel file is silently converted into a boolean, without any warning.

This problem doesn't happen with openxlsx.

Can you specify in the documentation how openxlsx guesses column types?

My understanding is that Excel only provides cell types, not column types.

mlell commented Sep 25, 2024

openxlsx looks at the whole table, not only at the first guess_max rows. In readxl you can set guess_max to a very high value (e.g. Inf) to avoid the problematic behaviour.
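To illustrate why the size of the sampling window matters, here is a minimal C++ sketch. It is not the actual code of readxl or openxlsx: the `CellType` enum, the `guess_column_type` function, and the type precedence (text over numeric over logical over blank) are all assumptions made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical cell kinds, ordered by the precedence assumed in this sketch:
// Blank < Logical < Numeric < Text.
enum class CellType { Blank, Logical, Numeric, Text };

// Guess a column type by inspecting at most `guess_max` leading cells.
// Cells beyond that window do not influence the result.
CellType guess_column_type(const std::vector<CellType>& cells,
                           std::size_t guess_max) {
    const std::size_t n = std::min(guess_max, cells.size());
    CellType guess = CellType::Blank;
    for (std::size_t i = 0; i < n; ++i) {
        // Keep the highest-precedence type seen so far.
        guess = std::max(guess, cells[i]);
    }
    return guess;
}
```

With a column of three logical cells followed by one numeric cell, a window of 3 yields `Logical`, while a window covering the whole column yields `Numeric`. This mirrors the kind of mismatch described in the original report.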

In openxlsx, the process is aided by the fact that Excel stores every string only once in a central store, and every cell containing that string holds a reference to it. Something similar applies to logical values. Dates are treated differently, in a way I do not recall right now. A column that has no cells of type "string reference" is numeric.

    // If it contains any strings it will be a character column
    IntegerVector char_cols_unique;
    if (all(is_na(string_inds))) {
      char_cols_unique = -1;
    } else {
      IntegerVector columns_which_are_characters = cols[string_inds - 1];
      char_cols_unique = unique(columns_which_are_characters);
    }
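For readers who do not use Rcpp, the idea in the excerpt above can be sketched in plain C++. This is an illustration only, not openxlsx's implementation: the `Cell` struct and the `character_columns` function are hypothetical names, and an empty result stands in for the `-1` sentinel used in the excerpt.

```cpp
#include <set>
#include <vector>

// Hypothetical flattened cell record (an assumption for this sketch).
struct Cell {
    int column;          // 1-based column index of the cell
    bool is_string_ref;  // true if the cell refers to the shared string store
};

// Return the distinct columns that contain at least one string cell.
// Every cell of the sheet is inspected, not just a leading sample.
std::set<int> character_columns(const std::vector<Cell>& cells) {
    std::set<int> result;
    for (const Cell& cell : cells) {
        if (cell.is_string_ref) {
            result.insert(cell.column);
        }
    }
    return result;
}
```

Any column missing from the returned set has no string cells, so under the rule described above it would be read as numeric (leaving aside logical values and dates).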
