Skip to content

Commit

Permalink
Refactor PDF import (#12310)
Browse files Browse the repository at this point in the history
* Mega changes

* Make PdfBibExtractor be Importer

* Add merge dialog to import

* Renaming

* Revert to old name

* Rename to old names

* Revert to old code

* Fix style

* Add documentation

* Fix ids

* Update checkstyle to 10.21.0

# Conflicts:
#	build.gradle

* Rename method (and add JavaDoc)

* Use Java23 _

* Add Java comment

* Fix name

* Refine adr-template.md

* Fix ADR0043

* Refine comment

* Point to ADR-0043

* Revert to stream-based forEach (enabled by adding "final")

* Fix ADR

* Update a bit ADR

* Fix tests

* Fix tests

* Fix checkers

---------

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>
  • Loading branch information
InAnYan and koppor authored Jan 25, 2025
1 parent 5046e06 commit df438b8
Show file tree
Hide file tree
Showing 49 changed files with 616 additions and 597 deletions.
2 changes: 1 addition & 1 deletion .vscode/settings.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,5 @@
"java.configuration.updateBuildConfiguration": "interactive",
"java.format.settings.url": "/config/VSCode Code Style.xml",
"java.checkstyle.configuration": "${workspaceFolder}/config/checkstyle/checkstyle_reviewdog.xml",
"java.checkstyle.version": "10.3.4"
"java.checkstyle.version": "10.21.0"
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
---
nav_order: 43
parent: Decision Records
---
# Show merge dialog when importing a single PDF

## Context and Problem Statement

PDF files are one of the main format for transferring various documents, especially scientific papers. However, by itself,
PDF is like a picture, it contains commands solely for displaying the human-readable text, but it might not contain
computer-readable metadata.

To overcome these problems various heuristics and AI models are used to "convert" a PDF into a BibTeX entry. However, it
also introduces a level of problems, as heuristics are not ideal: sometimes it works perfectly, but on others it generates
random output.

PDF importing in JabRef is done via `PdfImporter` abstract class and its descendants, and via `PdfMergeMetadataImporter`.
`PdfImporter` is typically a single heuristics or method of extracting a `BibEntry` from PDF. `PdfMergeMetadataImporter`
collects `BibEntry` candidates from all `PdfImporter`s and merges them automatically into a single `BibEntry`.

The specific problem JabRef has: should JabRef automate all heuristics (automatically merge all `BibEntry`ies from
several `PdfImporter`s) when importing PDF files or should every file be analysed thoroughly by users?

## Decision Drivers

* Option should provide a good-enough quality.
* It is desired to have a fine-grained controls of PDF importing for power-users.

## Considered Options

* Automatically merge all `BibEntry` candidates from `PdfImporters`.
* Open a merge dialog with all candidates.
* Open a merge dialog with all candidates if a single PDF is imported.

## Decision Outcome

Chosen option: "Open a merge dialog with all candidates if a single PDF is imported", because comes out best (see below).

## Pros and Cons of the Options

### Automatically merge all `BibEntry` candidates from `PdfImporters`

* Good, because minimal user interaction and disruption of flow. It also allows batch-processing.
* Bad, because heuristics are not ideal, and it is even harder to develop a "smarter" merging algorithm.

### Open a merge dialog with all candidates

* Good, because allows for fine-grained import. Some correct field may be overridden by a wrong field from other importer,
which is undesirable for power-users.
* Bad, because it is a dialog. If lots of PDFs are imported, then there will be lots of dialogs, which might be
too daunting to process manually.

### Open a merge dialog with all candidates if a single PDF is imported

Explanation:

- If a single PDF is imported, then open a merge dialog.
- If several PDFs are imported, merge candidates for each PDF automatically.

Outcomes:

* Good, because it combines the best of the other two options: Allow both for PDF batch-processing and for fine-grained control.

<!-- markdownlint-disable-file MD004 -->
3 changes: 2 additions & 1 deletion docs/decisions/adr-template.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,8 @@ parent: Decision Records
# consulted: {list everyone whose opinions are sought (typically subject-matter experts); and with whom there is a two-way communication}
# informed: {list everyone who is kept up-to-date on progress; and with whom there is a one-way communication}
---
<!-- we need to disable MD025, because we use the different heading "ADR Template" in the homepage (see above) than it is foreseen in the template -->
<!-- We need to disable MD025, because we use the different heading "ADR Template" in the homepage (see above) than it is foreseen in the template. -->
<!-- Remove these three lines when creating a real ADR - and keep the title consistent. -->
<!-- markdownlint-disable-next-line MD025 -->
# {short title, representative of solved problem and found solution}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,7 @@ Activate the CheckStyle configuration file by ticking it in the list
![JabRef's checkstyle config is activated](guidelines-intellij-checkstyle-jabref-active.png)
{% endfigure %}

Ensure that the [latest CheckStyle version](https://checkstyle.org/releasenotes.html) is selected (10.3.4 or higher).
Ensure that the [latest CheckStyle version](https://checkstyle.org/releasenotes.html) is selected (10.21.0 or higher).
Also, set the "Scan Scope" to "Only Java sources (including tests)".

{% figure caption:"Checkstyle is the highest version - and tests are also scanned" %}
Expand Down
22 changes: 17 additions & 5 deletions src/main/java/org/jabref/gui/externalfiles/ImportHandler.java
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
import org.jabref.gui.duplicationFinder.DuplicateResolverDialog;
import org.jabref.gui.fieldeditors.LinkedFileViewModel;
import org.jabref.gui.libraryproperties.constants.ConstantsItemModel;
import org.jabref.gui.mergeentries.MultiMergeEntriesView;
import org.jabref.gui.preferences.GuiPreferences;
import org.jabref.gui.undo.UndoableInsertEntries;
import org.jabref.gui.util.DragDrop;
Expand Down Expand Up @@ -100,6 +101,7 @@ public ExternalFilesEntryLinker getFileLinker() {
}

public BackgroundTask<List<ImportFilesResultItemViewModel>> importFilesInBackground(final List<Path> files, final BibDatabaseContext bibDatabaseContext, final FilePreferences filePreferences, TransferMode transferMode) {
// TODO: Make a utility class out of this. Package: org.jabref.logic.externalfiles.
return new BackgroundTask<>() {
private int counter;
private final List<ImportFilesResultItemViewModel> results = new ArrayList<>();
Expand Down Expand Up @@ -128,11 +130,21 @@ public List<ImportFilesResultItemViewModel> call() {

try {
if (FileUtil.isPDFFile(file)) {
ParserResult pdfImporterResult = contentImporter.importPDFContent(file, bibDatabaseContext, filePreferences);
List<BibEntry> pdfEntriesInFile = pdfImporterResult.getDatabase().getEntries();

if (pdfImporterResult.hasWarnings()) {
addResultToList(file, false, Localization.lang("Error reading PDF content: %0", pdfImporterResult.getErrorMessage()));
final List<BibEntry> pdfEntriesInFile;

// Details: See ADR-0043
if (files.size() == 1) {
pdfEntriesInFile = new ArrayList<>(1);
UiTaskExecutor.runAndWaitInJavaFXThread(() -> {
MultiMergeEntriesView dialog = PdfMergeDialog.createMergeDialog(new BibEntry(), file, preferences, taskExecutor);
dialogService.showCustomDialogAndWait(dialog).ifPresent(pdfEntriesInFile::add);
});
} else {
ParserResult pdfImporterResult = contentImporter.importPDFContent(file, bibDatabaseContext, filePreferences);
pdfEntriesInFile = pdfImporterResult.getDatabase().getEntries();
if (pdfImporterResult.hasWarnings()) {
addResultToList(file, false, Localization.lang("Error reading PDF content: %0", pdfImporterResult.getErrorMessage()));
}
}

if (pdfEntriesInFile.isEmpty()) {
Expand Down
68 changes: 68 additions & 0 deletions src/main/java/org/jabref/gui/externalfiles/PdfMergeDialog.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
package org.jabref.gui.externalfiles;

import java.io.IOException;
import java.nio.file.Path;
import java.util.function.Supplier;

import org.jabref.gui.mergeentries.MultiMergeEntriesView;
import org.jabref.gui.preferences.GuiPreferences;
import org.jabref.logic.importer.Importer;
import org.jabref.logic.importer.ParserResult;
import org.jabref.logic.importer.fileformat.pdf.PdfContentImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfEmbeddedBibFileImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfGrobidImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfVerbatimBibtexImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfXmpImporter;
import org.jabref.logic.l10n.Localization;
import org.jabref.logic.util.TaskExecutor;
import org.jabref.model.entry.BibEntry;

public class PdfMergeDialog {

/**
* Constructs a merge dialog for a PDF file. This dialog calls various {@link PdfImporter}s, collects the results, and lets the user choose between them.
* <p>
* {@link PdfImporter}s try to extract a {@link BibEntry} out of a PDF file,
* but it does not perform this 100% perfectly, it is only a set of heuristics that in some cases might work, in others not.
* Thus, JabRef provides this merge dialog that collects the results of all {@link PdfImporter}s
* and gives user a choice between field values.
*
* @param entry the entry to merge with
* @param filePath the path to the PDF file. This PDF is used as the source for the {@link PdfImporter}s.
* @param preferences the preferences to use. Full preference object is required, because of current implementation of {@link MultiMergeEntriesView}.
* @param taskExecutor the task executor to use when the multi merge dialog executes the importers.
*/
public static MultiMergeEntriesView createMergeDialog(BibEntry entry, Path filePath, GuiPreferences preferences, TaskExecutor taskExecutor) {
MultiMergeEntriesView dialog = new MultiMergeEntriesView(preferences, taskExecutor);

dialog.setTitle(Localization.lang("Merge PDF metadata"));

dialog.addSource(Localization.lang("Entry"), entry);
dialog.addSource(Localization.lang("Verbatim"), wrapImporterToSupplier(new PdfVerbatimBibtexImporter(preferences.getImportFormatPreferences()), filePath));
dialog.addSource(Localization.lang("Embedded"), wrapImporterToSupplier(new PdfEmbeddedBibFileImporter(preferences.getImportFormatPreferences()), filePath));

if (preferences.getGrobidPreferences().isGrobidEnabled()) {
dialog.addSource("Grobid", wrapImporterToSupplier(new PdfGrobidImporter(preferences.getImportFormatPreferences()), filePath));
}

dialog.addSource(Localization.lang("XMP metadata"), wrapImporterToSupplier(new PdfXmpImporter(preferences.getXmpPreferences()), filePath));
dialog.addSource(Localization.lang("Content"), wrapImporterToSupplier(new PdfContentImporter(), filePath));

return dialog;
}

private static Supplier<BibEntry> wrapImporterToSupplier(Importer importer, Path filePath) {
return () -> {
try {
ParserResult parserResult = importer.importDatabase(filePath);
if (parserResult.isInvalid() || parserResult.isEmpty() || !parserResult.getDatabase().hasEntries()) {
return null;
}
return parserResult.getDatabase().getEntries().getFirst();
} catch (IOException e) {
return null;
}
};
}
}
34 changes: 2 additions & 32 deletions src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,6 @@
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;
import java.util.function.Supplier;

import javafx.beans.Observable;
import javafx.beans.binding.Bindings;
Expand All @@ -23,6 +22,7 @@
import org.jabref.gui.AbstractViewModel;
import org.jabref.gui.DialogService;
import org.jabref.gui.desktop.os.NativeDesktop;
import org.jabref.gui.externalfiles.PdfMergeDialog;
import org.jabref.gui.externalfiletype.ExternalFileType;
import org.jabref.gui.externalfiletype.ExternalFileTypes;
import org.jabref.gui.icon.IconTheme;
Expand All @@ -35,13 +35,6 @@
import org.jabref.gui.util.ControlHelper;
import org.jabref.logic.FilePreferences;
import org.jabref.logic.externalfiles.LinkedFileHandler;
import org.jabref.logic.importer.Importer;
import org.jabref.logic.importer.ParserResult;
import org.jabref.logic.importer.fileformat.PdfContentImporter;
import org.jabref.logic.importer.fileformat.PdfEmbeddedBibFileImporter;
import org.jabref.logic.importer.fileformat.PdfGrobidImporter;
import org.jabref.logic.importer.fileformat.PdfVerbatimBibtexImporter;
import org.jabref.logic.importer.fileformat.PdfXmpImporter;
import org.jabref.logic.l10n.Localization;
import org.jabref.logic.util.TaskExecutor;
import org.jabref.logic.util.io.FileUtil;
Expand Down Expand Up @@ -457,34 +450,11 @@ public ValidationStatus fileExistsValidationStatus() {

public void parsePdfMetadataAndShowMergeDialog() {
linkedFile.findIn(databaseContext, preferences.getFilePreferences()).ifPresent(filePath -> {
MultiMergeEntriesView dialog = new MultiMergeEntriesView(preferences, taskExecutor);
dialog.setTitle(Localization.lang("Merge PDF metadata"));
dialog.addSource(Localization.lang("Entry"), entry);
dialog.addSource(Localization.lang("Verbatim"), wrapImporterToSupplier(new PdfVerbatimBibtexImporter(preferences.getImportFormatPreferences()), filePath));
dialog.addSource(Localization.lang("Embedded"), wrapImporterToSupplier(new PdfEmbeddedBibFileImporter(preferences.getImportFormatPreferences()), filePath));
if (preferences.getGrobidPreferences().isGrobidEnabled()) {
dialog.addSource("Grobid", wrapImporterToSupplier(new PdfGrobidImporter(preferences.getImportFormatPreferences()), filePath));
}
dialog.addSource(Localization.lang("XMP metadata"), wrapImporterToSupplier(new PdfXmpImporter(preferences.getXmpPreferences()), filePath));
dialog.addSource(Localization.lang("Content"), wrapImporterToSupplier(new PdfContentImporter(), filePath));
MultiMergeEntriesView dialog = PdfMergeDialog.createMergeDialog(entry, filePath, preferences, taskExecutor);
dialogService.showCustomDialogAndWait(dialog).ifPresent(newEntry -> {
databaseContext.getDatabase().removeEntry(entry);
databaseContext.getDatabase().insertEntry(newEntry);
});
});
}

private Supplier<BibEntry> wrapImporterToSupplier(Importer importer, Path filePath) {
return () -> {
try {
ParserResult parserResult = importer.importDatabase(filePath);
if (parserResult.isInvalid() || parserResult.isEmpty() || !parserResult.getDatabase().hasEntries()) {
return null;
}
return parserResult.getDatabase().getEntries().getFirst();
} catch (IOException e) {
return null;
}
};
}
}
5 changes: 2 additions & 3 deletions src/main/java/org/jabref/gui/importer/ImportCommand.java
Original file line number Diff line number Diff line change
Expand Up @@ -23,8 +23,8 @@
import org.jabref.logic.importer.ImportFormatReader;
import org.jabref.logic.importer.Importer;
import org.jabref.logic.importer.ParserResult;
import org.jabref.logic.importer.fileformat.PdfGrobidImporter;
import org.jabref.logic.importer.fileformat.PdfMergeMetadataImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfGrobidImporter;
import org.jabref.logic.l10n.Localization;
import org.jabref.logic.preferences.CliPreferences;
import org.jabref.logic.util.BackgroundTask;
Expand Down Expand Up @@ -153,8 +153,7 @@ private ParserResult doImport(List<Path> files, Importer importFormat) throws IO
imports.add(importFormatReader.importUnknownFormat(filename, fileUpdateMonitor));
} else {
UiTaskExecutor.runAndWaitInJavaFXThread(() -> {
if (((importer.get() instanceof PdfGrobidImporter)
|| (importer.get() instanceof PdfMergeMetadataImporter))
if (((importer.get() instanceof PdfGrobidImporter) || (importer.get() instanceof PdfMergeMetadataImporter))
&& GrobidUseDialogHelper.showAndWaitIfUserIsUndecided(dialogService, preferences.getGrobidPreferences())) {
importFormatReader.reset();
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -164,11 +164,12 @@ private void addColumn(MultiMergeEntriesViewModel.EntrySource entrySourceColumn)
constraint.prefWidthProperty().bind(header.widthProperty());
optionsGrid.getColumnConstraints().add(constraint);

// Copy BibEntry to column - either immediately (if available) or after loading
if (!entrySourceColumn.isLoadingProperty().getValue()) {
writeBibEntryToColumn(entrySourceColumn, columnIndex);
} else {
header.setDisable(true);
entrySourceColumn.isLoadingProperty().addListener((observable, oldValue, newValue) -> {
entrySourceColumn.isLoadingProperty().addListener((_, _, newValue) -> {
if (!newValue && entrySourceColumn.entryProperty().get() != null) {
writeBibEntryToColumn(entrySourceColumn, columnIndex);
header.setDisable(false);
Expand Down Expand Up @@ -372,7 +373,7 @@ public FieldRow(Field field, int rowIndex) {

fieldEditorCell.addEventFilter(KeyEvent.KEY_PRESSED, event -> toggleGroup.selectToggle(null));

toggleGroup.selectedToggleProperty().addListener((obs, oldValue, newValue) -> {
toggleGroup.selectedToggleProperty().addListener((_, _, newValue) -> {
if (newValue == null) {
viewModel.mergedEntryProperty().get().setField(field, "");
} else {
Expand Down
10 changes: 5 additions & 5 deletions src/main/java/org/jabref/logic/importer/ImportFormatReader.java
Original file line number Diff line number Diff line change
Expand Up @@ -24,14 +24,14 @@
import org.jabref.logic.importer.fileformat.ModsImporter;
import org.jabref.logic.importer.fileformat.MsBibImporter;
import org.jabref.logic.importer.fileformat.OvidImporter;
import org.jabref.logic.importer.fileformat.PdfContentImporter;
import org.jabref.logic.importer.fileformat.PdfEmbeddedBibFileImporter;
import org.jabref.logic.importer.fileformat.PdfGrobidImporter;
import org.jabref.logic.importer.fileformat.PdfMergeMetadataImporter;
import org.jabref.logic.importer.fileformat.PdfVerbatimBibtexImporter;
import org.jabref.logic.importer.fileformat.PdfXmpImporter;
import org.jabref.logic.importer.fileformat.RepecNepImporter;
import org.jabref.logic.importer.fileformat.RisImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfContentImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfEmbeddedBibFileImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfGrobidImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfVerbatimBibtexImporter;
import org.jabref.logic.importer.fileformat.pdf.PdfXmpImporter;
import org.jabref.logic.l10n.Localization;
import org.jabref.model.database.BibDatabases;
import org.jabref.model.entry.BibEntry;
Expand Down
Loading

0 comments on commit df438b8

Please sign in to comment.