-
Notifications
You must be signed in to change notification settings - Fork 398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDF/UA support, part 2 #2018
base: master
Are you sure you want to change the base?
PDF/UA support, part 2 #2018
Conversation
…rfluous NonStruct tags for table caption
…hich is only allocated when a container is actually split.
…pends on context, eg TD or TDF
… remove Flash support
…nts for string literals
- Added some example PDF/UA reports. - Add support for declaring newer PDF/A versions. - Add support for declaring PDF/UA-2. - Add support for declaring PDF 2.0. - Use "auto" as default value for tagType property for some report element types to allow automatic context-depending setting as well as explicit setting. - Support generating "Caption" tag for tables. - Fix page-break handling of table rows and cells. - Create corresponding PDF tags for some HTML elements. - Add hyperlink support. - Allow creation of a PDF that conforms to PDF/A and PDF/UA at the same time. - Fixed specifying the document language based on report locale setting. - Added THead, TBody, TFoot tags for tables. - Fixed PDF syntax errors (contentByte.setColorStroke was called at wrong place in drawing sequence). - Removed support for Acrobat Flash. - Refactored attributes needed for split tracking into a dedicated class. - Refactoring of some PDF-tag related constants. - Added or improved some comments. - Removed some warnings. I originally developed this using my hobby account hvbargen over several weeks.
Reminder: Page numbering with the auto-text item causes and exception. I already added a commit and an example report with this commit,which I should add here when I'm at work again. |
Was the issue existing before or is it a reason of the changes? |
The issue was existing before when you tried to generate PDF/UA. |
I'll see what I can do this weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most changes are added to get comments and some coding styles.
Also the question is added if it make sense to add PDF 2.0 support when we throw an exception if it is used through the report developer.
PdfArray kids; | ||
PdfObject kido = currentPageDevice.structureCurrentLeaf.get(PdfName.K); | ||
if (kido == null) { | ||
kids = new PdfArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standard convention is to use name like "child" & "children" instead of "kid" and "kids".
Therefore change it to child & children.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming is used this way inside OpenPDF and the rare examples I could find in the net. Thus I used the same naming convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the BIRT developer used Schild & children, you are the first with Kids.
There is no Methode with "getKids" but you will finde "getChildren" and "getFirstChild()".
So it would be good to keep on child & children.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even for English it's poor English and I don't recall seeing such terminology used elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, lets use the children and child convention. I understand your idea to follow their pattern, but we should stick to our own.
break; | ||
} | ||
} catch (Exception e) { | ||
logger.log(Level.WARNING, e.getMessage(), e); | ||
} | ||
if (currentPageDevice.isTagged()) { | ||
PdfArray kids; | ||
PdfObject kido = currentPageDevice.structureCurrentLeaf.get(PdfName.K); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better name for "kido" = "kid-object", normally we name it child e.r. "childPdfObj" or "childPdf"
import com.lowagie.text.pdf.PdfName; | ||
|
||
/** | ||
* @since 4.19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A description of the class is missing. You are the developer with the deepest knowledge and for other developers it would be helpfull to have a short description.
package org.eclipse.birt.report.engine.emitter.pdf; | ||
|
||
/** | ||
* @since 4.19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A description of the class is missing. You are the developer with the deepest knowledge and for other developers it would be helpfull to have a short description.
* @since 4.19 | ||
* | ||
*/ | ||
@SuppressWarnings("javadoc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core pain of the old code is that there is no documentation is given on class, methode and property level.
So it would be much better to avoid the "SuppressWarnings" and add regularly comments to all elements.
super(); | ||
} | ||
/** | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description is missing.
private static class PDFAExtensionSchema extends XmpSchema { | ||
|
||
/** | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description is missing.
*/ | ||
private static final long serialVersionUID = 6654512771721220538L; | ||
|
||
public PDFAExtensionSchema() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment would be helpful which header details for which version woll be created.
case PdfWriter.PDFA1B: | ||
a1.addPart("1"); | ||
break; | ||
case PDFA2A: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the class of the constant or remove it from the constants before.
} | ||
switch (PdfXConformance) { | ||
case PdfWriter.PDFA1A: | ||
case PDFA2A: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the class of the constant or remove it from the constants before.
Great work, I would like to encourage you to complete this and merge it. Well done. |
Regarding adding a test with VeraPDF, I'll probably need help. I added this to the pom.xml of org.eclipse.birt.report.engine.emitter.pdf.tests/pom.xml:
But all the VeraPDF imports, e.g.
do not work with an error message that the import cannot be resolved. |
It needs to be in the target platform so added to the *.target and the dependency needs to be specified in the MAINFEST.MF. If it doesn’t work in the ide, it won’t work in the build. So I would need to add this to orbit and it would need ip review. I can look into that tomorrow. Hopefully it doesn’t drag in a whack of dependencies. |
Can't we take it as a fat jar? We only need it in a test bundle. |
The dependency fan-out is turning out not too bad. The most annoying questionable thing is this extremely old dependency:
But given we aren't shipping these dependencies in the product, that's of minimal concern. I did have to make one package import optional: According to the pom that is an optional dependency: https://repo1.maven.org/maven2/org/verapdf/validation-model/1.26.5/validation-model-1.26.5.pom I have no knowledge to know if this subset is fully functional for the intended purpose. |
BTW there's a javax and a jakarta version of verapdf, see https://docs.verapdf.org/develop/ |
I assume one would tend to prefer the jakarta version these days. |
BTW, I've hit a roadblock in that the source bundles produced retain the original jar signature and of course the jar has been modified to bundlize it such that the signature is invalid and the artifact will appear to be tampered which in fact is the case: I'm not sure if the build machine, which signs these jars, will remove the original signature. I will need to investigate, but it's most inconvenient not to be able to test this locally. This also seems a lot of overhead for a test... Note too that when I build a target platform where I let it pull in all recursive dependencies, it's an incredibly large set of 123 bundles: So I'm not sure how fat the fat jar would be or how you create (and plan to maintain) the fat jar. |
Right now, when I test manually, I use the command-line tool which is provided by the |
This looks a bit like a fat jar with everything in it similar in size to the jar in the download you mentioned: https://repo1.maven.org/maven2/org/verapdf/apps/pdfbox-apps/1.26.6/ It would be harder to get approval (and use) because sources are missing... |
Can't we download and extract it on the fly in a @beforeAll context? |
To which "it" do you refer? This appears to contain an installer as a jar, but I don't know if that jar can be used directly to run the application. https://software.verapdf.org/releases/verapdf-pdfbox-installer.zip I'm also not sure about this one: https://repo1.maven.org/maven2/org/verapdf/apps/pdfbox-apps/1.26.6/pdfbox-apps-1.26.6.jar It appears to want to run the GUI: Main-Class: org.verapdf.apps.PdfBoxGuiWrapper Maybe if a different main class is specified on the command line it can do other things. |
I was referring to Hennings #2018 (comment) where he now downloads a zip and extracts it to run the test. |
There seems to be a headless installer on this page https://docs.verapdf.org/install/. Look for section Automated Installation in the bottom. |
Yes that is the zip link I showed. But it’s an installer so one would need to run that first and after one could use the installation. Nicer would be if there exists a jar one can download and use, even if the jar was in a zip. Probably what’s needed is in the installer jar or the installer jar could be used that way. |
I just saw the automated installation documentation Wim mentioned a few minutes ago. So maybe it's possible to install VeraPDF on-the-fly before using it for tests. If accessing the Java API is not possible for various reasons, using the CLI is the second-best option IMO, despite the installer-, process- and Java initialization overhead. @merks Ed, if we could solve the technical issues with using the installer, would we get an approval to use it? Sigh... It seems rather complicated. |
I'm going to test the following idea (in a stand-alone program) as a POC: Initialization: Use the "automated installation" to install VeraPDF to a temp directory. Figure out what's needed to call the API methods (by looking at the batch scripts to start the CLI. Set up the class-path etc accordingly. Write a Java method validatePDF which uses the API to validate a given PDF. Clean up: Remove the temp directory. Once this works, rewrite the validatPDF method in a way which does not need the JARs at compile time. When this POC works, integrate this into the BIRT tests in such a ways that we can use the API directly from the tests, instead of creating and calling a child process for each PDF file. |
Sounds great, Henning. |
I created a POC in https://github.com/hvbargen/verapdf-install: This uses the VeraPDFInstaller ZIP archive as a basis, runs the installer in automatic mode, then creates a new class loader which contains the extracted JAR, sets some system properties which VeraPDF uses (maybe this can be customized in others ways) and then uses Javascript (Mozilla Rhino) with this class loader to validate the PDF. Well, that's how it should work. |
If you have a "javax" error, my first idea would be that the module-handling of the JVM create some troubles. Because my thinking was if you use a test-report with Rhino the JVM would need according to your javax-error the global parameters to load directly your missing module(s) at starting moment of JVM. But it is only an assumption because I don't know in which way you tried to use Rhino. |
Ah, I see: I was going to write: You can take a look at the POC code at https://github.com/hvbargen/verapdf-install. It's really just a single class. |
I originally developed this using my hobby account hvbargen over several weeks.