Formatted versions of my transcription of Anne of Green Gables.
This repo was created mainly for producing items for Project Gutenberg. The PG ebook was deployed on 2021-01-22. Jacqueline Jeremy of Distributed Proofreaders assisted me with creating the PG ebook.
The PG (HTML) version of the transcription is here.
This repo has a Creative Commons Zero v1.0 Universal license. That's roughly equivalent to public domain.
Project Gutenberg has two main formats:
- a single plain text file (UTF-8, .txt extension), with standards on formatting
- an XHTML file (as .html), used in turn to generate .epub and .mobi formats. There are some general standards, but they aren't numerous. They focus on identifying chapters and ensuring basic XHTML/CSS validity.
Project Gutenberg (PG) works pretty closely with Distributed Proofreaders (DP). DP has a lot of good documentation about the required formats and standards (links below). The general idea is to follow their recommendations, and to use validators before submitting files to PG.
- DP's Post Processing FAQ - they refer to standardized formatting as post-processing (as in post proof-reading and post formatting).
- Short summary from PG
- PG Copyright Clearance - the start of their process. They will send you a 'clearance key' when approved; you need that to upload the book to PG.
- PG upload for your .zip
- PG Bookmaker tool
- W3C XHTML validator
- W3C CSS validator
- Emily of New Moon - an example to follow, for formatting. It likely used this copy-text.
After you submit files to PG, they will be examined closely by PG whitewashers, who prepare the text for final publication. File names are formatted
like_this_or-that-01.txt
all in lower case.
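The naming rule above can be checked mechanically before zipping. This is a minimal sketch, not an official PG check: the class name `PgFileNames` and the exact regular expression are my assumptions, inferred only from the example name and the "all in lower case" rule.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is an assumption based on the example name above.
final class PgFileNames {
    // Lower-case letters, digits, underscores and hyphens, then a dot and a lower-case extension.
    private static final Pattern VALID = Pattern.compile("[a-z0-9_-]+\\.[a-z0-9]+");

    static boolean isValid(String fileName) {
        return VALID.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"like_this_or-that-01.txt", "Cover.JPG", "my file.txt"};
        for (String name : samples) {
            System.out.println(name + " -> " + isValid(name));
        }
    }
}
```

Upper-case letters and spaces are rejected; anything stricter than that should be confirmed against the DP documentation.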
When you upload, you zip together all of the formats you have into a single zip file. There are validators that you can use to help you through the process.
Distributed Proofreaders acts as the main input into Project Gutenberg. They have tons of information on what is required:
- Post Processing FAQ - they refer to standardized formatting as post-processing (as in post proof-reading and post formatting).
- ppgen - their tool that generates both the plain text and HTML versions from a common source
- pptext - checks the output of ppgen
- DP Formatting Guidelines
- DP Errors and Misspellings
There are 3 main parts (wrapped later by PG boilerplate):
- front matter
- main body (38 chapters, in this case)
  - plain text
  - poetry
  - letters
  - illustrations
- transcriber's notes, at the end
Make a branch called 1908-LCP-4th-pg off the uncorrected branch, to hold the desired text. Apply a small number of corrections to that branch.
After that, the text needs to be formatted for PG. My formatting is only semi-automated:
- create a template file to hold the title page, transcriber notes and so on, entered manually
- run a script (Java, in Eclipse, against the correct branch) to inject the bulk of the text into the template file (with line wrap at 72)
- apply manual formatting for some items
- zip the result and send it to PG
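The last step can be scripted as well. The sketch below is an illustration, not the script I actually used: `PgZip` and `zipDirectory` are hypothetical names, and it assumes all files to upload sit in one directory (with `images` beside the html) and that the zip file is written outside that directory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: zips every regular file under sourceDir into zipFile.
final class PgZip {
    static void zipDirectory(Path sourceDir, Path zipFile) throws IOException {
        // Collect the files first, so the listing is finished before the zip is written.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Zip entry names use forward slashes on every platform.
                String entryName = sourceDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(entryName));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
    }
}
```

Sub-directories are kept as path prefixes in the entry names, so `images/cover.jpg` stays beside the html file inside the archive.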
- THE END text; classes: mt3, center
- the name of the novel, at the start of chapter 1; classes: center, p180
- poetry: copy-paste divs with classes; careful with quotes; careful with no-indent on the following para
- correspondence; careful with quotes; careful with no-indent on the following para; classes: mb0, center; mt0, center-right4, smcap
- placement of illustrations: put near the corresponding text
These are special places where I need to format the text manually.
P: poetry, L: correspondence (a letter).
- dedication: P
- Ch 02: P little birds sang
- Ch 07: L (odd, starts as normal text) Gracious Heavenly Father
- Ch 11: P Midian
- Ch 17: 2P, 2L; when twilight drops; shorn of Brutus
- Ch 18: P nothing but death
- Ch 19: P not a sister
- Ch 24: P heart farewell (not special, inline!)
- Ch 29: P stubborn spearsmen
- Ch 31: P hills peeped
- Ch 32: L (embedded in normal text, long)
- Ch 33: P one moonbeam
- UTF-8 encoding, .txt extension
- end of line is CR-LF (Windows style)
- byte-order mark (BOM) removed (they strip it out if present; don't worry about it)
- line width 60-70, max 75 (recommend 72)
- italic like _this_, bold like =this=
- using em-dash is OK
- using curly quotes is OK
- no tab characters allowed
- no spaces at end of lines
- no extra spaces between words
- use ligatures when the source copy-text does
- match the copy-text closely, including errors; they can be noted in the transcriber's notes
- it's ok to end a line in Mr. or Mrs.; no need for a non-breaking space - example
- 4 empty lines: at very top (gap after PG boilerplate)
- 4 empty lines: at the top of a chapter
- 4 empty lines: between frontispiece and title page
- 2 empty lines between chapter headings and chapter body
- hyphens at the end of the line: compare the spelling of the same word as found elsewhere in the text
- transcriber's note at the end; this helps to flag items to PG whitewashers
- poetry: indent by 1-4 spaces, so that line wrapping by tools is turned off
- blockquotes: treat as poetry
- letters/correspondence: these are the trickiest items
- front matter: seems to be some freedom there, as long as it's reasonable
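Several of the mechanical rules above (CR-LF line endings, no tabs, no trailing spaces, no extra spaces between words, maximum width 75) are easy to test in code. The following is a minimal sketch and not a replacement for pptext or the PG validators; the class name `PgTextCheck` is hypothetical. It assumes that leading spaces are allowed, since poetry is indented, and it measures width in Unicode code points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker for a few of the plain-text rules listed above.
final class PgTextCheck {
    static final int MAX_WIDTH = 75;

    static List<String> check(String text) {
        List<String> problems = new ArrayList<>();
        // After removing every CR-LF pair, any LF left over is a bare (Unix-style) line ending.
        if (text.replace("\r\n", "").indexOf('\n') >= 0) {
            problems.add("found a line ending that is not CR-LF");
        }
        String[] lines = text.split("\r?\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            String where = "line " + (i + 1) + ": ";
            if (line.indexOf('\t') >= 0) {
                problems.add(where + "tab character");
            }
            if (line.endsWith(" ")) {
                problems.add(where + "trailing space");
            }
            // trim() first, so that poetry indentation is not reported as extra spaces.
            if (line.trim().contains("  ")) {
                problems.add(where + "extra space between words");
            }
            if (line.codePointCount(0, line.length()) > MAX_WIDTH) {
                problems.add(where + "longer than " + MAX_WIDTH + " characters");
            }
        }
        return problems;
    }
}
```

An empty list means none of these particular rules were broken. The blank-line counts and the markup for italics and bold are not checked here.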
Line-width/line-wrapping is a bit tricky. Be careful with that. (Java has a BreakIterator class, but it's a bit quirky in its behaviour.)
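For comparison, here is a minimal sketch of a greedy word wrap that avoids BreakIterator altogether. It is not the code I used: `ParagraphWrap` is a hypothetical name, and the approach assumes that breaking only at whitespace is acceptable. A word longer than the width is left on its own over-long line, and end-of-line hyphens are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical greedy wrapper: fills each line with as many words as fit within the width.
final class ParagraphWrap {
    static List<String> wrap(String paragraph, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : paragraph.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for an empty paragraph
            }
            // Start a new line when this word (plus a separating space) would not fit.
            if (current.length() > 0 && current.length() + 1 + word.length() > width) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

Calling `ParagraphWrap.wrap(text, 72)` and joining the result with `"\r\n"` gives lines that have no trailing spaces and single spaces between words, which also matches the plain-text rules.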
In my case, most items are auto-generated. Poetry and letters are two items that are handled manually, by editing the generated output.
- UTF-8 encoding
- XHTML 1.0 Strict or 1.1 (epub uses XHTML)
- modern example
- CSS 2.1 or below; CSS 3 can also be used if needed; CSS appears to be embedded, not in a separate .css file
- handheld is used in the CSS media query; that media type is valid in CSS 2.1, but deprecated in Media Queries Level 4.
- use W3C validators for markup and CSS; use HTML Tidy; remove unused styles
- use PG's ebookmaker converter to convert your book and review the result carefully (checks epub/mobi formats)
- image file formats: .jpg (.png for line drawings)
- font: not specific, font-family only
- don't use `<br>` (except in poems?), `&nbsp;`, or empty tags to control spacing; use CSS margin and padding instead
- title: H1
- chapter: H2
- images are placed in a conventional images directory beside the html
- cover image: no larger than 256K, width-height from 650x1000 to 5000x5000. Name is cover.jpg.
- inline image: no larger than 256K, up to 5000x5000; state the width-height explicitly for all images.
- example of title-tag: The Project Gutenberg eBook of Alice's Adventures in Wonderland, by Lewis Carroll
- metadata or comments about the text can be placed in the header, or in HTML comments
- page numbers are optional; some people put them in comments
- no external links allowed
- css: div.chapter, p.poem, p.letter
- the text flow uses max width similar to plain text files
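The image limits in the list above can be checked with the standard `javax.imageio` classes. This is a minimal sketch under stated assumptions: `PgImageCheck` is a hypothetical name, "256K" is taken to mean 256 * 1024 bytes, and "650x1000" is read as a minimum width of 650 and a minimum height of 1000 for the cover.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical checker for the image size limits noted above.
final class PgImageCheck {
    static final long MAX_BYTES = 256L * 1024; // assumption: "256K" means 256 KiB

    static List<String> check(Path image, boolean isCover) throws IOException {
        List<String> problems = new ArrayList<>();
        if (Files.size(image) > MAX_BYTES) {
            problems.add("file is larger than 256K");
        }
        BufferedImage img = ImageIO.read(image.toFile());
        if (img == null) {
            // ImageIO.read returns null when no registered reader understands the file.
            problems.add("not a readable image");
            return problems;
        }
        int width = img.getWidth();
        int height = img.getHeight();
        if (width > 5000 || height > 5000) {
            problems.add("larger than 5000x5000");
        }
        if (isCover && (width < 650 || height < 1000)) {
            problems.add("cover is smaller than 650x1000");
        }
        return problems;
    }
}
```

The width and height read here are also the values to state explicitly on each img tag.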
The PG (HTML) version of the transcription is deployed here.