improve paragraph breaks #2

Open
mengwong opened this issue Mar 1, 2023 · 2 comments

mengwong commented Mar 1, 2023

image

mengwong commented Apr 4, 2023

We discovered that a lot of the spurious paragraph breaks come from crossing footnote boundaries.

We also discovered that the footnotes aren't actually footnotes, even in the original Word doc. They're just manually typed superscripts, with the note text manually set to font size 10 at the bottom of each page; presumably someone there doesn't know how to use Word's actual footnote functionality and has been editing the files by hand.

We deleted all the font-size-10 text using find-and-replace in Word.

Proposed algorithm for deleting spurious line breaks, again: run diff; for each diff chunk, construct a list of words; before and after each line break, are the words the same? If so, delete the line break.
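A minimal sketch of that idea in Python, under one reading of the proposal: diff the two versions line by line, and where a changed chunk has exactly the same word sequence on both sides, treat the difference as line breaks only and join the chunk into a single line. The function name `collapse_break_only_chunks` is hypothetical, and `difflib` stands in here for whatever diff tool is actually being run.

```python
import difflib


def collapse_break_only_chunks(old_lines, new_lines):
    """Return (old, new) line lists in which every diff chunk whose two
    sides contain the same words is joined into a single line.

    Assumption: a chunk whose sides differ only in where the line breaks
    fall is a spurious break, not a real textual change.
    """
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_chunk = old_lines[i1:i2]
        new_chunk = new_lines[j1:j2]
        # Flatten each side of the chunk into its list of words.
        old_words = " ".join(old_chunk).split()
        new_words = " ".join(new_chunk).split()
        if tag == "replace" and old_words == new_words:
            # Same words on both sides: only the breaks moved, so drop them.
            joined = " ".join(old_words)
            old_out.append(joined)
            new_out.append(joined)
        else:
            # Unchanged lines and genuine edits pass through untouched.
            old_out.extend(old_chunk)
            new_out.extend(new_chunk)
    return old_out, new_out


old = ["The quick brown fox", "jumps over the lazy dog.", "Second paragraph."]
new = ["The quick brown", "fox jumps over the lazy dog.", "Second paragraph."]
fixed_old, fixed_new = collapse_break_only_chunks(old, new)
```

After this pass the two versions agree wherever only the breaks differed, so a second diff run should report just the real edits. Chunks that mix a moved break with a genuine wording change are left alone in this sketch.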

mengwong commented Apr 4, 2023

LOL, without improved breaks, we see that diff loses the plot and gets offset by one, like missing a button when buttoning up your shirt

aebf735
