Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF/UA support, part 2 #2018

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

hvbtup
Copy link
Contributor

@hvbtup hvbtup commented Jan 2, 2025

  • Added some example PDF/UA reports.
  • Add support for declaring newer PDF/A versions.
  • Add support for declaring PDF/UA-2.
  • Add support for declaring PDF 2.0.
  • Use "auto" as default value for tagType property for some report element types to allow automatic context-depending setting as well as explicit setting.
  • Support generating "Caption" tag for tables.
  • Fix page-break handling of table rows and cells.
  • Create corresponding PDF tags for some HTML elements.
  • Add hyperlink support.
  • Allow creation of a PDF that conforms to PDF/A and PDF/UA at the same time.
  • Fixed specifying the document language based on report locale setting.
  • Added THead, TBody, TFoot tags for tables.
  • Fixed PDF syntax errors (contentByte.setColorStroke was called at wrong place in drawing sequence).
  • Removed support for Acrobat Flash.
  • Refactored attributes needed for split tracking into a dedicated class.
  • Refactoring of some PDF-tag related constants.
  • Added or improved some comments.
  • Removed some warnings.

I originally developed this using my hobby account hvbargen over several weeks.

hvbargen and others added 21 commits December 1, 2024 20:46
…hich is only allocated when a container is actually split.
- Added some example PDF/UA reports.
- Add support for declaring newer PDF/A versions.
- Add support for declaring PDF/UA-2.
- Add support for declaring PDF 2.0.
- Use "auto" as default value for tagType property for some report element types to allow automatic context-depending setting as well as explicit setting.
- Support generating "Caption" tag for tables.
- Fix page-break handling of table rows and cells.
- Create corresponding PDF tags for some HTML elements.
- Add hyperlink support.
- Allow creation of a PDF that conforms to PDF/A and PDF/UA at the same time.
- Fixed specifying the document language based on report locale setting.
- Added THead, TBody, TFoot tags for tables.
- Fixed PDF syntax errors (contentByte.setColorStroke was called at wrong place in drawing sequence).
- Removed support for Acrobat Flash.
- Refactored attributes needed for split tracking into a dedicated class.
- Refactoring of some PDF-tag related constants.
- Added or improved some comments.
- Removed some warnings.

I originally developed this  using my hobby account hvbargen over several weeks.
@hvbargen
Copy link
Contributor

hvbargen commented Jan 4, 2025

Reminder: Page numbering with the auto-text item causes and exception. I already added a commit and an example report with this commit,which I should add here when I'm at work again.

@speckyspooky
Copy link
Contributor

Was the issue existing before or is it a reason of the changes?

@hvbargen
Copy link
Contributor

hvbargen commented Jan 5, 2025

The issue was existing before when you tried to generate PDF/UA.

@hvbtup
Copy link
Contributor Author

hvbtup commented Jan 6, 2025

I'll see what I can do this weekend.

Copy link
Contributor

@speckyspooky speckyspooky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most changes are added to get comments and some coding styles.
Also the question is added if it make sense to add PDF 2.0 support when we throw an exception if it is used through the report developer.

PdfArray kids;
PdfObject kido = currentPageDevice.structureCurrentLeaf.get(PdfName.K);
if (kido == null) {
kids = new PdfArray();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard convention is to use name like "child" & "children" instead of "kid" and "kids".
Therefore change it to child & children.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is used this way inside OpenPDF and the rare examples I could find in the net. Thus I used the same naming convention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the BIRT developer used Schild & children, you are the first with Kids.
There is no Methode with "getKids" but you will finde "getChildren" and "getFirstChild()".

So it would be good to keep on child & children.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even for English it's poor English and I don't recall seeing such terminology used elsewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, lets use the children and child convention. I understand your idea to follow their pattern, but we should stick to our own.

break;
}
} catch (Exception e) {
logger.log(Level.WARNING, e.getMessage(), e);
}
if (currentPageDevice.isTagged()) {
PdfArray kids;
PdfObject kido = currentPageDevice.structureCurrentLeaf.get(PdfName.K);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better name for "kido" = "kid-object", normally we name it child e.r. "childPdfObj" or "childPdf"

import com.lowagie.text.pdf.PdfName;

/**
* @since 4.19
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A description of the class is missing. You are the developer with the deepest knowledge and for other developers it would be helpfull to have a short description.

package org.eclipse.birt.report.engine.emitter.pdf;

/**
* @since 4.19
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A description of the class is missing. You are the developer with the deepest knowledge and for other developers it would be helpfull to have a short description.

* @since 4.19
*
*/
@SuppressWarnings("javadoc")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core pain of the old code is that there is no documentation is given on class, methode and property level.
So it would be much better to avoid the "SuppressWarnings" and add regularly comments to all elements.

super();
}
/**
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description is missing.

private static class PDFAExtensionSchema extends XmpSchema {

/**
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description is missing.

*/
private static final long serialVersionUID = 6654512771721220538L;

public PDFAExtensionSchema() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment would be helpful which header details for which version woll be created.

case PdfWriter.PDFA1B:
a1.addPart("1");
break;
case PDFA2A:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the class of the constant or remove it from the constants before.

}
switch (PdfXConformance) {
case PdfWriter.PDFA1A:
case PDFA2A:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the class of the constant or remove it from the constants before.

@andreasrosdal
Copy link

Great work, I would like to encourage you to complete this and merge it. Well done.

@hvbargen
Copy link
Contributor

Regarding adding a test with VeraPDF, I'll probably need help.

I added this to the pom.xml of org.eclipse.birt.report.engine.emitter.pdf.tests/pom.xml:

<dependencies>
	<dependency>
		<groupId>org.verapdf</groupId>
		<artifactId>validation-model</artifactId>
		<version>1.26.1</version>
	</dependency>
</dependencies>

But all the VeraPDF imports, e.g.

import org.verapdf.pdfa.PDFAParser;

do not work with an error message that the import cannot be resolved.
It seems that I am missing some additional step required to add a new Maven dependency.

@merks
Copy link
Contributor

merks commented Jan 12, 2025

It needs to be in the target platform so added to the *.target and the dependency needs to be specified in the MAINFEST.MF. If it doesn’t work in the ide, it won’t work in the build. So I would need to add this to orbit and it would need ip review. I can look into that tomorrow. Hopefully it doesn’t drag in a whack of dependencies.

@merks
Copy link
Contributor

merks commented Jan 12, 2025

There are a lot of dependencies

image

and they're not OSGi bundles. This looks like a significant workload for my free spare time...

@wimjongman
Copy link
Contributor

Can't we take it as a fat jar? We only need it in a test bundle.

@merks
Copy link
Contributor

merks commented Jan 13, 2025

The dependency fan-out is turning out not too bad. The most annoying questionable thing is this extremely old dependency:

        <dependency>
          <groupId>net.java.dev.stax-utils</groupId>
          <artifactId>stax-utils</artifactId>
          <version>20070216</version>
          <type>jar</type>
        </dependency>

But given we aren't shipping these dependencies in the product, that's of minimal concern.

I did have to make one package import optional:

image

According to the pom that is an optional dependency:

https://repo1.maven.org/maven2/org/verapdf/validation-model/1.26.5/validation-model-1.26.5.pom

image

I have no knowledge to know if this subset is fully functional for the intended purpose.

@hvbtup
Copy link
Contributor Author

hvbtup commented Jan 13, 2025

BTW there's a javax and a jakarta version of verapdf, see https://docs.verapdf.org/develop/

@merks
Copy link
Contributor

merks commented Jan 13, 2025

I assume one would tend to prefer the jakarta version these days.

@merks
Copy link
Contributor

merks commented Jan 13, 2025

BTW, I've hit a roadblock in that the source bundles produced retain the original jar signature and of course the jar has been modified to bundlize it such that the signature is invalid and the artifact will appear to be tampered which in fact is the case:

image

I'm not sure if the build machine, which signs these jars, will remove the original signature. I will need to investigate, but it's most inconvenient not to be able to test this locally. This also seems a lot of overhead for a test...

Note too that when I build a target platform where I let it pull in all recursive dependencies, it's an incredibly large set of 123 bundles:

image

So I'm not sure how fat the fat jar would be or how you create (and plan to maintain) the fat jar.

@hvbtup
Copy link
Contributor Author

hvbtup commented Jan 13, 2025

Right now, when I test manually, I use the command-line tool which is provided by the verapdf-greenfield-1.27.33-installer.zip.
But while this distribution is nice for manual tests once it is installed, it needs a manual installation step: The installer is a ZIP file, which needs to be extracted into a (temp) directory, there you call the actual installer program, which in turn extracts the final contents into a user-defined directory (a directory structure with 750 files).
Probably, once installed, I could call the Java API directly instead of using the shell and the CLI.
But extracting the ZIP and running the installer before being able to use it for tests seems not very elegant.

@merks
Copy link
Contributor

merks commented Jan 13, 2025

This looks a bit like a fat jar with everything in it similar in size to the jar in the download you mentioned:

https://repo1.maven.org/maven2/org/verapdf/apps/pdfbox-apps/1.26.6/

It would be harder to get approval (and use) because sources are missing...

@wimjongman
Copy link
Contributor

Can't we download and extract it on the fly in a @beforeAll context?

@merks
Copy link
Contributor

merks commented Jan 13, 2025

@wimjongman

To which "it" do you refer?

This appears to contain an installer as a jar, but I don't know if that jar can be used directly to run the application.

https://software.verapdf.org/releases/verapdf-pdfbox-installer.zip

I'm also not sure about this one:

https://repo1.maven.org/maven2/org/verapdf/apps/pdfbox-apps/1.26.6/pdfbox-apps-1.26.6.jar

It appears to want to run the GUI:

Main-Class: org.verapdf.apps.PdfBoxGuiWrapper

Maybe if a different main class is specified on the command line it can do other things.

@wimjongman
Copy link
Contributor

which "it"

I was referring to Hennings #2018 (comment) where he now downloads a zip and extracts it to run the test.

@wimjongman
Copy link
Contributor

which "it"

I was referring to Hennings #2018 (comment) where he now downloads a zip and extracts it to run the test.

There seems to be a headless installer on this page https://docs.verapdf.org/install/. Look for section Automated Installation in the bottom.

@merks
Copy link
Contributor

merks commented Jan 13, 2025

Yes that is the zip link I showed. But it’s an installer so one would need to run that first and after one could use the installation. Nicer would be if there exists a jar one can download and use, even if the jar was in a zip. Probably what’s needed is in the installer jar or the installer jar could be used that way.

@hvbtup
Copy link
Contributor Author

hvbtup commented Jan 13, 2025

I just saw the automated installation documentation Wim mentioned a few minutes ago.
It seems a bit cumbersome to use (I would prefer if I could just specify the target directory directly on the command line instead of creating an XML file for this), but usable.

So maybe it's possible to install VeraPDF on-the-fly before using it for tests.
But then we could still only use the command-line version in a subprocess, which means a bit of overhead per PDF file tested.

If accessing the Java API is not possible for various reasons, using the CLI is the second-best option IMO, despite the installer-, process- and Java initialization overhead.

@merks Ed, if we could solve the technical issues with using the installer, would we get an approval to use it?

Sigh... It seems rather complicated.
Should we move the "adding automated tests" topic out of this PR?

@hvbargen
Copy link
Contributor

I'm going to test the following idea (in a stand-alone program) as a POC:

Initialization: Use the "automated installation" to install VeraPDF to a temp directory.

Figure out what's needed to call the API methods (by looking at the batch scripts to start the CLI.

Set up the class-path etc accordingly.

Write a Java method validatePDF which uses the API to validate a given PDF.
Call this method in a loop for several PDFs, to see if and what cleanup is needed inbetween.

Clean up: Remove the temp directory.

Once this works, rewrite the validatPDF method in a way which does not need the JARs at compile time.
For example, by using reflection or by using Rhino.

When this POC works, integrate this into the BIRT tests in such a ways that we can use the API directly from the tests, instead of creating and calling a child process for each PDF file.

@wimjongman
Copy link
Contributor

Sounds great, Henning.

@hvbargen
Copy link
Contributor

I created a POC in https://github.com/hvbargen/verapdf-install:

This uses the VeraPDFInstaller ZIP archive as a basis, runs the installer in automatic mode, then creates a new class loader which contains the extracted JAR, sets some system properties which VeraPDF uses (maybe this can be customized in others ways) and then uses Javascript (Mozilla Rhino) with this class loader to validate the PDF.

Well, that's how it should work.
But it doesn't, unfortunately.
Calling VeraPDF directly from Java worked fine (see https://github.com/hvbargen/verapdf-cal, yes, I know, there's a typo in the project name).
But calling it through Rhino, it complains that it needs javax, I don't know why. I give up for this weekend.

@speckyspooky
Copy link
Contributor

If you have a "javax" error, my first idea would be that the module-handling of the JVM create some troubles.
But I don't know how you have tried to use Rhino - is it a kind of test report?

Because my thinking was if you use a test-report with Rhino the JVM would need according to your javax-error the global parameters to load directly your missing module(s) at starting moment of JVM.

But it is only an assumption because I don't know in which way you tried to use Rhino.

@hvbtup
Copy link
Contributor Author

hvbtup commented Jan 27, 2025

Ah, I see: I was going to write: You can take a look at the POC code at https://github.com/hvbargen/verapdf-install. It's really just a single class.
But I created this as a private repository. Will turn it into public tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants