Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISSUE 393 : Decoding the content into utf-8 to avoid encoding issues #394

Merged
merged 2 commits into from
Dec 18, 2023

Conversation

BrayanMnz
Copy link
Contributor

The issue is that we are wrongly encoding a bytes object parsed to a string
what we need to do instead is, decode the bytes object directly.

This commit changes that, to decoding the content directly from bytes into utf-8

See, the results after these changes:

#393 (comment)

…tats_bref function.

The issue is because we are wrongly encoding a bytes object parsed to string
what we need to do instead is, decode the bytes object directly.
…stats_bref function.

    The issue is because we are wrongly encoding a bytes object parsed to string
    what we need to do instead is, decode the bytes object directly.
@bdilday
Copy link
Contributor

bdilday commented Dec 1, 2023

I think this change was on purpose in order to workaround a different issue - see
#223
#218

you may want to double check that this PR doesn't reintroduce the intermittent truncated dataframe issue mentioned in #218

@BrayanMnz
Copy link
Contributor Author

Could you elaborate on how to replicate the old issue? @bdilday

Just to validate this not introduce a regression

@BrayanMnz
Copy link
Contributor Author

Hi, @bdilday

Replicating the old issue, there is no data truncated with my fix.

I used the same example as the previous issue

pybaseball-2023-12-04_20.36.11.mp4

@bdilday
Copy link
Contributor

bdilday commented Dec 8, 2023

Could you elaborate on how to replicate the old issue? @bdilday

Just to validate this not introduce a regression

it's tricky to replicate because it depends not only on the data but on the (randomly generated) links in the page header. That is, if the page contained a link to the german language or spanish language fbref pages, AND a player in the table had a non-ascii character in their name, then the table got truncated.

but the problem seems to be resolved now. Here's an example that should reproduce the original issue but works fine
https://gist.github.com/bdilday/6e6bef77a5e42776a7e6da7bee12178d

@BrayanMnz
Copy link
Contributor Author

BrayanMnz commented Dec 8, 2023

So, my PR should also solve the issue reported on ISSUE 393.
Could you review it, again?

Also tested the snippet you've shared with my changes and is working as expected.

FYI @bdilday

@bdilday
Copy link
Contributor

bdilday commented Dec 15, 2023

So, my PR should also solve the issue reported on ISSUE 393. Could you review it, again?

Also tested the snippet you've shared with my changes and is working as expected.

FYI @bdilday

Yes, I think this PR looks good to go!

@BrayanMnz
Copy link
Contributor Author

Hi, @tjburch @schorrm

could someone merge this?

@schorrm schorrm merged commit 1c05162 into jldbc:master Dec 18, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants