Add special handling for Soft Hyphen (SHY) unicode symbol to DOCX emi… (#1180)

Add special handling for Soft Hyphen (SHY) unicode symbol to DOCX emitter and PDF emitter as follows:

The SOFT HYPHEN Unicode symbol (often abbreviated as "SHY", code point 173 = '\u00ad') is something like a shy dash: it is invisible except when a line break occurs at its position.
Its intention is to mark good locations for hyphenation. 
For US/English readers: Long words are quite common in some languages, so hyphenation is much more important there than in English texts.
For example, the German word "Bundestag" (the parliament) can be hyphenated as "Bun-des-tag". Let's assume that the text is stored as "Bun\u00addes\u00adtag" (with a SHY symbol instead of the ASCII MINUS symbol between the syllables). If a line break occurs, this may result in "Bundes-" at the end of line 1 and "tag" at the start of line 2.
TTF fonts usually calculate a width of 0 for this symbol.
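
The behavior described above can be illustrated with a small, self-contained sketch. It is not BIRT code: the class `SoftHyphenDemo` and its methods `unbroken` and `breakAt` are hypothetical names, and the visible hyphen is assumed to be rendered as an ASCII "-".

```java
/** Illustrative sketch only; not part of BIRT. */
final class SoftHyphenDemo {
    static final char SOFT_HYPHEN = '\u00ad'; // code point 173

    /** The text as it appears when no line break occurs: every SHY is hidden. */
    static String unbroken(String s) {
        return s.replace(String.valueOf(SOFT_HYPHEN), "");
    }

    /**
     * The two lines that result when the line is broken at the n-th soft hyphen
     * (0-based). Only the soft hyphen at the break becomes visible.
     */
    static String[] breakAt(String s, int n) {
        int idx = -1;
        for (int i = 0; i <= n; i++) {
            idx = s.indexOf(SOFT_HYPHEN, idx + 1);
            if (idx < 0) {
                throw new IllegalArgumentException("fewer than " + (n + 1) + " soft hyphens");
            }
        }
        return new String[] { unbroken(s.substring(0, idx)) + "-", unbroken(s.substring(idx + 1)) };
    }

    public static void main(String[] args) {
        String word = "Bun\u00addes\u00adtag";
        System.out.println(unbroken(word));
        String[] lines = breakAt(word, 1);
        System.out.println(lines[0]);
        System.out.println(lines[1]);
    }
}
```

Breaking at the second soft hyphen gives the "Bundes-" / "tag" split from the example; breaking at the first one would give "Bun-" / "destag".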

BIRT behaves as if the CSS property hyphens: auto were set (see https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Text/Wrapping_Text).
But until now, BIRT did not handle the SHY symbol correctly with the PDF emitter and (in some cases) with the DOCX emitter.

This PR adds special handling for the SHY symbol to the DOCX and PDF emitters to handle this correctly.
Other emitters are not changed.
hvbtup authored Jan 27, 2023
1 parent 85a0fb5 commit f5cb70e
Showing 4 changed files with 239 additions and 16 deletions.
@@ -32,6 +32,10 @@
import org.eclipse.birt.report.engine.layout.pdf.util.PropertyUtil;
import org.w3c.dom.css.CSSValue;

/**
* This is used for writing WordML by the DocxEmitter and by the old Word 2003
* emitter.
*/
public abstract class AbstractWordXmlWriter {

protected XMLWriter writer;
@@ -50,6 +54,30 @@ public abstract class AbstractWordXmlWriter {

public static final int INDEX_NOTFOUND = -1;

/**
* <p>
* The soft hyphen Unicode symbol is intended to be visible only when a line
* break occurs there.
* </p>
* <p>
* This hiding logic of the SHY symbol needs special attention in many emitters.
* </p>
* <p>
* SOFT HYPHEN is often abbreviated as SHY, which also is very descriptive,
* because this symbol is hiding inside the surrounding words most of the time.
* </p>
* <p>
* In most fonts, its width is defined as zero, which of course is correct only
* if it is hidden. If it is rendered, it looks similar to the minus sign.
* </p>
* <p>
* The Unicode standard also defines a HYPHEN symbol, which should look the same
* as the SHY symbol, but doesn't have the hiding logic. However, the HYPHEN
* symbol is rarely defined in TTF fonts.
* </p>
*/
public static final char SOFT_HYPHEN = '\u00ad';

protected int imageId = 75;

protected int bookmarkId = 0;
@@ -554,9 +582,17 @@ private void writeString(String txt, IStyle style) {
start++;
}
end = start + 1;
continue;
} else if (ch == SOFT_HYPHEN) {
// Output a special WordML tag for the SHY symbol.
writeText(txt.substring(start, end));
writer.closeTag("w:t"); //$NON-NLS-1$
writer.cdata("<w:softHyphen/>"); //$NON-NLS-1$
writer.openTag("w:t"); //$NON-NLS-1$
start = end + 1;
end++;
} else {
end++;
}
end++;
}
writeText(txt.substring(start));
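
The idea behind this hunk (close the current text element, emit a dedicated soft-hyphen element, then reopen a text element) can be sketched in isolation. This is a simplified illustration, not the emitter's code: `SoftHyphenRunSketch`, `toRunContent`, `appendText` and `escape` are hypothetical names, a plain `StringBuilder` stands in for BIRT's `XMLWriter`, and details such as `xml:space` handling are left out.

```java
/** Illustrative sketch only; a StringBuilder stands in for BIRT's XMLWriter. */
final class SoftHyphenRunSketch {
    private static final char SOFT_HYPHEN = '\u00ad';

    /** Builds the content of one WordML run, splitting the text at each SHY. */
    static String toRunContent(String txt) {
        StringBuilder sb = new StringBuilder();
        int start = 0;
        for (int i = 0; i < txt.length(); i++) {
            if (txt.charAt(i) == SOFT_HYPHEN) {
                appendText(sb, txt.substring(start, i));
                sb.append("<w:softHyphen/>");
                start = i + 1; // continue after the soft hyphen
            }
        }
        appendText(sb, txt.substring(start));
        return sb.toString();
    }

    /** Emits a text element, skipping empty fragments. */
    private static void appendText(StringBuilder sb, String s) {
        if (!s.isEmpty()) {
            sb.append("<w:t>").append(escape(s)).append("</w:t>");
        }
    }

    /** Minimal XML escaping for element content. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

With this approach Word decides at layout time whether the hyphen is shown, so the exported document does not contain a literal U+00AD character inside the text elements.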

@@ -1002,11 +1038,8 @@ public void writeTextInRun(int type, String txt, IStyle style, String fontFamily
* @param cellWidth the width of the container in points
* @return String with truncated words that surpasses the cell width
*/
public String cropOverflowString(String text, IStyle style, String fontFamily, int cellWidth) {// TODO: retrieve
// font type and
// replace plain
// with
// corresponding
public String cropOverflowString(String text, IStyle style, String fontFamily, int cellWidth) {
// TODO: retrieve font type and replace plain with corresponding
Font font = new Font(fontFamily, Font.PLAIN, WordUtil
.parseFontSize(PropertyUtil.getDimensionValue(style.getProperty(StyleConstants.STYLE_FONT_SIZE))));
Canvas c = new Canvas();
@@ -10,14 +10,72 @@
*
* Contributors:
* Actuate Corporation - initial API and implementation
* Henning von Bargen - Added at least a bit of JavaDoc, added SOFT HYPHEN support.
***********************************************************************/

package org.eclipse.birt.report.engine.layout.pdf.hyphen;

/**
* <p>
* Despite its name, this describes a <em>fragment</em> of a word of text.
* </p>
* <p>
* If the word does not contain possible hyphenation / line-breaking points,
* then it is a whole word. But if the word contains Unicode MINUS or HYPHEN or
* SOFT HYPHEN symbols, then the {@link BreakIterator} splits this whole word
* into more than one Word instances.
* </p>
* <p>
* For example, "extra-ordinary" will be split into two Word instances "extra-"
* and "ordinary".
* </p>
*/
public class Word {
protected int start;
protected int end;
protected String text;

private boolean keepTrailingSoftHyphen = true;

/**
* Should a trailing Unicode SOFT HYPHEN (SHY) symbol be kept or omitted?
*
* @return true if a trailing soft hyphen should be kept, false if it should be
* omitted.
*
* @since 4.13
*/
public boolean isKeepTrailingSoftHyphen() {
return keepTrailingSoftHyphen;
}

/**
* Set whether a trailing Unicode SOFT HYPHEN (SHY) symbol should be kept or
* omitted. The default value is <tt>true</tt>, so this is usually only called
* to omit it.
*
* @apiNote This is not really used inside the Word class. But a Word object is
* used to transmit the information piggyback to the
* {@link org.eclipse.birt.report.engine.nLayout.area.impl.TextArea}
* object, where the information is needed.
*
* @param keepTrailingSoftHyphen whether to keep the last soft hyphen or not.
*
* @since 4.13
*/
public void setKeepTrailingSoftHyphen(boolean keepTrailingSoftHyphen) {
this.keepTrailingSoftHyphen = keepTrailingSoftHyphen;
}

/**
* Create a Word instance as a substring of a given text.
*
* @see String#substring(int,int)
*
* @param text Text
* @param start start index of the substring
* @param end end index of the substring (exclusive).
*/
public Word(String text, int start, int end) {
this.text = text;
this.start = start;
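
How a line-break iterator cuts a text into the fragments that the Javadoc above calls "words" can be tried out with a small sketch. Note the assumption: BIRT uses `com.ibm.icu.text.BreakIterator`, whereas this sketch uses the JDK's `java.text.BreakIterator` to stay self-contained, so the exact break positions may differ. `LineBreakSketch` and `fragments` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; uses the JDK iterator instead of ICU. */
final class LineBreakSketch {

    /** Splits the text at every line-break opportunity reported by the iterator. */
    static List<String> fragments(String text) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each fragment corresponds to new Word(text, start, end).
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String fragment : fragments("extra-ordinary and extra\u00adordinary")) {
            // Show soft hyphens as "[SHY]" so that they are visible in the console.
            System.out.println("[" + fragment.replace("\u00ad", "[SHY]") + "]");
        }
    }
}
```

The fragments always concatenate back to the original text; each `(start, end)` pair has the same meaning as the constructor arguments of `Word`.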
@@ -21,13 +21,64 @@
import org.eclipse.birt.report.engine.nLayout.area.style.TextStyle;

import com.ibm.icu.text.Bidi;

import com.ibm.icu.text.BreakIterator;

/**
* <p>
* An abstract representation of a line of styled text (e.g. with a font and font
* size specified etc.) or a fragment thereof.
* </p>
*/
public class TextArea extends AbstractArea implements ITextArea {

protected String text;

protected String cachedText = null;

/**
* <p>
* The soft hyphen Unicode symbol.
* </p>
* <p>
* It needs special handling, because it should only be visible when a
* line-break occurs there and hidden otherwise.
* </p>
* <p>
* See
* {@link org.eclipse.birt.report.engine.emitter.wpml.writer.AbstractWordXmlWriter#SOFT_HYPHEN}
* for more detail.
* </p>
*/
private static final char SOFT_HYPHEN = '\u00ad';

/**
* <p>
* This controls if Unicode SOFT HYPHEN symbols in a text should be removed from
* the output. The default value is <tt>true</tt> - remove soft hyphens.
* </p>
* <p>
* By setting the system property <tt>org.eclipse.birt.softhyphen.remove</tt> to
* <tt>false</tt>, the old, incorrect behavior of keeping them can be restored.
* </p>
*/
private boolean removeSoftHyphens = "true".equals(System.getProperty("org.eclipse.birt.softhyphen.remove", "true")); // $NON-NLS-1

/**
* <p>
* This controls if a Unicode SOFT HYPHEN at the end of the text area should be
* kept in the output or removed with the other SOFT HYPHENs when
* {@link #removeSoftHyphens} is set.
* </p>
* <p>
* Note that sometimes the same visible line of text can consist of more than
* one TextArea. The text content of these text areas is the result of a
* {@link BreakIterator}. A pre-hyphenated word, e.g. "extra\u00adordinary", will
* be split by the {@link BreakIterator} into two
* {@link org.eclipse.birt.report.engine.layout.pdf.hyphen.Word "words"}, which can
* result in two TextAreas with the texts "
*/
private boolean keepTrailingSoftHyphen = true;

protected int runLevel;

protected TextStyle style;
@@ -115,12 +166,35 @@ public int getTextLength() {
return textLength;
}

/**
* <p>
* Get a string with the text this TextArea represents.
* </p>
* <p>
* SOFT HYPHEN Unicode symbols inside the text are usually removed (depending on
* {@link #removeSoftHyphens}), except a trailing one (depending on
* {@link #keepTrailingSoftHyphen}).
* </p>
*
* @return The unformatted text.
*/
private String calculateText() {
if (blankLine || text == null) {
return "";
} else {
return text.substring(offset, offset + textLength);
}
String textResult = text.substring(offset, offset + textLength);
if (removeSoftHyphens) {
// Remove all Unicode SOFT HYPHEN symbols except a trailing one.
// FIXME: This is possibly worth performance tuning!
int indxSoftHyphen = textResult.indexOf(SOFT_HYPHEN);
for (; indxSoftHyphen >= 0; indxSoftHyphen = textResult.indexOf(SOFT_HYPHEN)) {
String remaining = textResult.substring(indxSoftHyphen + 1);
if (keepTrailingSoftHyphen && remaining.strip().length() == 0)
break;
textResult = textResult.substring(0, indxSoftHyphen) + remaining;
}
}
return textResult;
}

public void addWord(int textLength, float wordWidth) {
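
The FIXME in `calculateText` notes that the removal loop may be worth tuning, since it rebuilds the string once per soft hyphen. One possible single-pass variant is sketched below. It is only a sketch under the assumption that the intended semantics are exactly those of the loop above (remove every SHY, except one that is followed only by whitespace when `keepTrailingSoftHyphen` is set); `SoftHyphenFilter` and `removeSoftHyphens` are hypothetical names and not part of this commit.

```java
/** Illustrative sketch only; a possible single-pass alternative, not BIRT code. */
final class SoftHyphenFilter {
    private static final char SOFT_HYPHEN = '\u00ad';

    /**
     * Removes all soft hyphens from the text. If keepTrailing is true, a soft
     * hyphen that is followed only by whitespace (or nothing) is kept.
     */
    static String removeSoftHyphens(String text, boolean keepTrailing) {
        int keepIndex = -1;
        if (keepTrailing) {
            // Find the last non-whitespace character; keep it if it is a SHY.
            int last = text.length() - 1;
            while (last >= 0 && Character.isWhitespace(text.charAt(last))) {
                last--;
            }
            if (last >= 0 && text.charAt(last) == SOFT_HYPHEN) {
                keepIndex = last;
            }
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch != SOFT_HYPHEN || i == keepIndex) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

For example, `removeSoftHyphens("Bun\u00addes\u00ad", true)` keeps only the final soft hyphen, while passing `false` removes both.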
@@ -237,4 +311,30 @@ public void setWhiteSpaceNumber(int whiteSpaceNumber) {
public boolean needClip() {
return needClip;
}

/**
* Whether a Unicode SOFT HYPHEN at the end of the text area should be kept in
* the output or removed.
*
* @see #keepTrailingSoftHyphen
*
* @return true if the soft hyphen shall be kept.
*/
public boolean isKeepTrailingSoftHyphen() {
return keepTrailingSoftHyphen;
}

/**
* Control whether a Unicode SOFT HYPHEN at the end of the text area should be
* kept in the output or removed.
*
* @see #keepTrailingSoftHyphen
*
* @param keepTrailingSoftHyphen true if the soft hyphen shall be kept.
*/
public void setKeepTrailingSoftHyphen(boolean keepTrailingSoftHyphen) {
this.keepTrailingSoftHyphen = keepTrailingSoftHyphen;
}


}
@@ -39,6 +39,11 @@ public class TextCompositor {
private FontInfo fontInfo;
private int runLevel;

/**
* @see TextArea#isKeepTrailingSoftHyphen()
*/
private static final String SOFT_HYPHEN = "\u00ad";

/** offset relative to the text in the textContent. */
int offset = 0;

@@ -48,7 +53,7 @@ public class TextCompositor {
private IWordRecognizer remainWords;
/** the remain word */
private Word remainWord;
/** the remain characters in current word after hyphenation */
/** the remain characters in current word after word-breaking / hyphenation */
private Word wordVestige;

/**
@@ -157,8 +162,18 @@ private TextArea getNextTextArea(int maxLineWidth) {
textArea.setMaxWidth(maxLineWidth);
textArea.setWidth(0);
addWordIntoTextArea(textArea, remainWord);
textArea.setKeepTrailingSoftHyphen(remainWord.isKeepTrailingSoftHyphen());
remainWord = null;
return textArea;
// FIXME: Why do we return here already?
// This return here in a way contradicts the idea of the algorithm, which is to
// stuff as many words as possible into a TextArea,
// because it results in a (e.g. PDF) text line consisting of two (more than
// one) TextAreas A and B, where A is a TextArea with exactly one Word (= word
// fragment) that did not fit into the previous line, and B contains the next
// Words.
// This results in slightly larger PDF files than necessary and makes it
// slightly harder for accessibility software to understand the file.
}
// iterate the remainWords.
if (null == remainWords || !remainWords.hasWord()) {
@@ -250,13 +265,20 @@ private void addWordsIntoTextArea(TextArea textArea, IWordRecognizer words) {
*
*/
private void addWordIntoTextArea(TextArea textArea, Word word) {

// get the word's size
int textLength = word.getLength();
int wordWidth = getWordWidth(fontInfo, word);
// append the letter spacing
wordWidth += textStyle.getLetterSpacing() * textLength;
int adjustWordSize = fontInfo.getItalicAdjust() + wordWidth;
if (textArea.hasSpace(adjustWordSize)) {
int hyphenWidth = 0;
if (word.getValue().endsWith(SOFT_HYPHEN)) {
hyphenWidth = getTextWidth(fontInfo, "-");
// We are using the ASCII HYPHEN-MINUS here for computing the hyphen dash size,
// because getTextWidth for the SOFT HYPHEN would return 0 width.
}
if (textArea.hasSpace(adjustWordSize + hyphenWidth)) {
addWord(textArea, textLength, wordWidth);
wordVestige = null;
if (remainWords.hasWord()) {
@@ -289,6 +311,18 @@ private void addWordIntoTextArea(TextArea textArea, Word word) {
} else {
wordVestige = null;
remainWord = word;
if (remainWords.hasWord()) {
// The soft hyphen symbol should be omitted except for the last word in the
// line.
// Please note: This condition is not quite correct, but OK for real-world data.
// If the soft hyphen is inside a word, then the breakIterator has at least
// one more "word", which is actually (part of) the rest of this word.
// But if someone comes up with a word that *ends* with a soft-hyphen,
// then there might be no more remaining "words", so this results in
// hiding the soft hyphen. However, a word ending with a soft-hyphen
// doesn't make sense at all, so we don't care about this.
remainWord.setKeepTrailingSoftHyphen(false);
}
}
textArea.setLineBreak(true);
hasLineBreak = true;
@@ -310,10 +344,8 @@ private void doWordBreak(String str, TextArea area) {
if (endHyphenIndex == 0 && area.getWidth() == 0) {
addWordVestige(area, 1, getTextWidth(fi, wb.getHyphenText(0, 1)), str.substring(1));
} else {
addWordVestige(area, endHyphenIndex,
getTextWidth(fi, wb.getHyphenText(0, endHyphenIndex))
+ textStyle.getLetterSpacing() * (endHyphenIndex - 1),
str.substring(endHyphenIndex));
addWordVestige(area, endHyphenIndex, getTextWidth(fi, wb.getHyphenText(0, endHyphenIndex))
+ textStyle.getLetterSpacing() * (endHyphenIndex - 1), str.substring(endHyphenIndex));
}
}
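
The width check added in `addWordIntoTextArea` reserves room for a visible hyphen whenever a word fragment ends with a soft hyphen, because the font reports a width of 0 for the SHY itself. The sketch below isolates that rule. It is not BIRT code: `HyphenWidthSketch` and `fits` are hypothetical names, and the `measure` function is an assumed stand-in for `getTextWidth(fontInfo, ...)`.

```java
import java.util.function.ToIntFunction;

/** Illustrative sketch only; "measure" stands in for a font-based width function. */
final class HyphenWidthSketch {
    private static final String SOFT_HYPHEN = "\u00ad";

    /**
     * Checks whether a word fragment fits into the available width. A fragment
     * ending with a soft hyphen needs extra room for the "-" that becomes
     * visible if the line is broken right after it.
     */
    static boolean fits(String word, int availableWidth, ToIntFunction<String> measure) {
        int required = measure.applyAsInt(word);
        if (word.endsWith(SOFT_HYPHEN)) {
            required += measure.applyAsInt("-");
        }
        return required <= availableWidth;
    }

    public static void main(String[] args) {
        // Toy metric: 10 units per character, 0 for the soft hyphen.
        ToIntFunction<String> toyMetric = s -> 10 * (int) s.chars().filter(c -> c != 0xAD).count();
        System.out.println(fits("Bundes\u00ad", 65, toyMetric));
        System.out.println(fits("Bundes\u00ad", 70, toyMetric));
    }
}
```

With the toy metric, "Bundes" followed by a soft hyphen measures 60 units but requires 70, so it does not fit into 65 units. Without the extra reservation the visible hyphen could overflow the line.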
