Using setMetadata() and setToC()
These methods change the meta information of PDF documents only. Like the previously introduced method select(), they are methods of the Document class. Both also support the incremental save technique.
For every MuPDF-supported document type, doc.metadata is a Python dictionary with the keys format, author, creator, producer, creationDate, modDate, subject, title, encryption and keywords. This is true whether or not this information (completely) exists for any given document.
Except for format and encryption, all of these values can be changed if the document is a PDF.
All you have to do is prepare a Python dictionary m with some or all of the above key-value pairs and invoke doc.setMetadata(m).
Any of the above keys not contained in this dictionary will receive a value of none.
If you provide an empty dictionary m = {}, all information will be cleared in this way.
If you want to clear metadata for data protection / data security reasons, make sure you save your PDF to a new file using the save option garbage. This ensures that the old information is physically removed from the file (an incremental save does not do that).
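Clearing metadata and writing a cleaned copy can be sketched as follows. This is a minimal sketch, not a definitive implementation: it uses the method names as written on this page, and the helper name clear_metadata, the output path and the garbage level 3 are assumptions for illustration (the text only says to use the garbage option).

```python
def clear_metadata(doc, new_path, garbage=3):
    """Clear all standard metadata and save a cleaned copy.

    doc is assumed to be an open PyMuPDF Document for a PDF.
    new_path and the garbage level are illustrative assumptions.
    """
    # An empty dictionary clears every changeable metadata key.
    doc.setMetadata({})
    # Save to a NEW file with garbage collection so that the old values
    # are not left behind (an incremental save would keep them).
    doc.save(new_path, garbage=garbage)
```

A call would look like clear_metadata(doc, "cleaned.pdf"), where the file name is a placeholder.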
If you want to change selected values only (and keep the others), take a modified copy of doc.metadata and use it directly as the parameter. Any format and encryption keys present in m will be silently ignored.
Except for the date keys (whose values must be strings), any unicode value is acceptable. See section PDF String Handling.
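Changing only selected values could look like the sketch below. The helper name change_metadata is hypothetical; doc is assumed to be an open PyMuPDF Document for a PDF, and the method name is the one used on this page.

```python
def change_metadata(doc, **changes):
    """Update selected metadata values and keep all others."""
    m = dict(doc.metadata)   # work on a copy of the current dictionary
    m.update(changes)        # overwrite only the keys that were passed in
    # Per the text above, format / encryption entries in m are ignored.
    doc.setMetadata(m)
    return m
```

For example, change_metadata(doc, title="New title", author="Jane Doe") would leave subject, keywords and the date values as they were. The title and author shown here are placeholders.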
The examples directory contains a pair of utilities, csv2meta.py and meta2csv.py, which export / import metadata to / from a csv file.
Apart from standard metadata, XML-based metadata are supported since PDF version 1.4. PDF maintenance software often uses this feature to store more complex information than is possible with standard metadata.
PyMuPDF contains no XML processing logic and therefore does not directly support maintaining such data. However, you can delete, extract and replace XML metadata (inserting new XML metadata is currently not supported).
- Document._delXmlMetadata() deletes XML metadata (if any; no exception is raised if there are none). Can be used to enhance data privacy or reduce file size.
- Use xref = Document._getXmlMetadataXref() to get the xref number (int) of the XML metadata. If zero, none exist.
- Use data = Document._getXrefStream(xref) to retrieve the data (a bytes object). Then interpret or change these data with a package like lxml.
- Use Document._updateStream(xref, data) to update the metadata.
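The extract / change / replace cycle above can be sketched as one helper. The name replace_xml_metadata and the transform callback are hypothetical; the three Document methods are the ones named in the list, and doc is assumed to be an open PyMuPDF Document for a PDF.

```python
def replace_xml_metadata(doc, transform):
    """Pass existing XML metadata through transform() and store the result.

    transform is a caller-supplied function taking and returning bytes,
    for example one that edits the XML with a package such as lxml.
    Returns False if the document has no XML metadata.
    """
    xref = doc._getXmlMetadataXref()
    if xref == 0:                      # zero means: no XML metadata exist
        return False
    data = doc._getXrefStream(xref)    # current XML as a bytes object
    doc._updateStream(xref, transform(data))
    return True
```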
Bookmarks or outlines form a quite complex forward-backward chained set of objects in PDFs. Together they are known as table of contents (TOC).
A TOC structure as found in books is much simpler: it just contains a list of lines with titles, page references and hierarchy levels. Relationship between such lines is only implicitly established by their sequence of occurrence.
Maintaining a book-like TOC (instead of single, separate bookmark items) is therefore exactly what we have decided to implement in PyMuPDF. Changing anything in a TOC means changing the complete TOC: a TOC is inserted, changed or deleted as one single item with setToC(). We believe that this approach meets both practical requirements and the need for intuitive handling:
- everyone knows what TOCs in books are and how to use them
- hierarchy relations between lines in a TOC can simply be expressed by the entry's hierarchy level
- forward / backward relationships between entries are established implicitly by the sequence in which they occur
In addition, the existing method doc.getToC() already provides an intuitive picture of all bookmark items of a document in exactly the way described above. So, maintaining the TOC of a PDF can be done in the following simple steps:
1. toc = doc.getToC(simple = True or False)
2. Modify toc as required
3. doc.setToC(toc)
In step 3, behind the scenes, a new outline chain will be created using toc to completely replace the old one. If you wish to delete an existing TOC, you can also set toc = [].
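The read / modify / write cycle can be sketched as follows. The helper name retitle_toc and the rename callback are hypothetical; doc is assumed to be an open PyMuPDF Document for a PDF, and the sketch assumes that getToC(simple=True) returns entries of the form [level, title, page].

```python
def retitle_toc(doc, rename):
    """Apply rename() to every TOC title and write the TOC back."""
    toc = doc.getToC(simple=True)            # read the complete TOC
    # Modify as required: here only the titles change.
    new_toc = [[lvl, rename(title), page] for lvl, title, page in toc]
    doc.setToC(new_toc)                      # replaces the whole TOC
    return new_toc
```

For example, retitle_toc(doc, str.upper) would rewrite all titles in upper case while keeping levels and page numbers unchanged.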
If you wish to give a PDF a completely new TOC, provide a list of lists like toc = [[lvl1, title1, page1], [lvl2, title2, page2], ...].
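A new TOC is just a plain Python list, as the sketch below shows. The titles and page numbers are made up, and the helper check_toc is hypothetical. The rules it checks (the first level is 1, and a level never exceeds the previous level by more than 1) are assumptions about what a well-formed hierarchy looks like; they are not stated in the text above.

```python
def check_toc(toc):
    """Sanity-check hierarchy levels of a [[lvl, title, page], ...] list."""
    previous = 0
    for lvl, title, page in toc:
        # Assumed rule: start at level 1, never skip a level going deeper.
        if lvl < 1 or lvl > previous + 1:
            raise ValueError("bad hierarchy level %r for %r" % (lvl, title))
        previous = lvl
    return True


# Illustrative TOC: two chapters, the first with one sub-section.
toc = [
    [1, "Introduction", 1],
    [1, "Chapter 1", 3],
    [2, "Section 1.1", 4],
    [1, "Chapter 2", 10],
]
check_toc(toc)
```

After such a check, the list would be handed to doc.setToC(toc).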
As with meta data above, title entries may be provided using the full unicode character set (see following section).
Example program PDFoutline.py implements all of the above using the wxPython GUI.
A pair of utilities, toc2csv.py and csv2toc.py can be used to export / import a TOC to / from a csv file.
Outside of document content text, PDF supports two character encodings, namely PDFDocEncoding and Unicode (see appendix D of the Adobe manual). Both are now fully implemented in PyMuPDF for use in the methods setMetadata() and setToC() in the following way (this applies to the above-mentioned metadata fields and the TOC title entries):
- if an entry contains only ASCII characters (ord(c) <= 127), it will be used unchanged / as is;
- else, any character with 127 < ord(c) <= 255 will be replaced by the string \nnn, where nnn is the octal representation of ord(c); the resulting string will be used;
- else, if a string contains any character with ord(c) > 255, the complete string is encoded using UTF-16BE, prefixed with 0xfeff, and this result, converted to its hexadecimal representation, will be used.
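The three rules can be illustrated with a small Python 3 function. This is a sketch of the rules as described here, not PyMuPDF's actual code; the name pdf_string is hypothetical, and any delimiters the PDF syntax places around the result are left out.

```python
def pdf_string(text):
    """Encode text following the three rules described above."""
    if all(ord(c) <= 127 for c in text):
        return text                              # pure ASCII: unchanged
    if all(ord(c) <= 255 for c in text):
        # Replace each character above 127 by \nnn (three octal digits).
        return "".join(
            c if ord(c) <= 127 else "\\%03o" % ord(c) for c in text
        )
    # At least one character above 255: UTF-16BE with leading 0xfeff,
    # written out in hexadecimal.
    return (b"\xfe\xff" + text.encode("utf-16-be")).hex()


print(pdf_string("abc"))      # abc
print(pdf_string("\u00e9"))   # \351
print(pdf_string("\u20ac"))   # feff20ac
```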
Differences and similarities of string handling between Python 2 and Python 3 are covered in the following way:
- The argument will be decoded with UTF-8.
- If it was bytes or bytearray, it will be converted to unicode (Python 2 and Python 3).
- A str in Python 2 will become unicode; a unicode (Python 2) and a str (Python 3) will remain unaffected (i.e. stay unicode).
- The resulting str / unicode will then be treated as mentioned above.
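For Python 3 only, the conversion step can be sketched as below. The function name to_text is hypothetical, and the Python 2 cases from the list are not covered.

```python
def to_text(value):
    """Return value as a Python 3 str, decoding bytes-like input as UTF-8."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode("utf-8")
    return value                     # a str stays as it is


print(to_text(b"caf\xc3\xa9"))       # café
print(to_text(bytearray(b"abc")))    # abc
```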
All of the above results in considerable flexibility: metadata and title fields can be provided as strings, unicode, bytes or bytearray objects!