veraPDF CLI Configuration
Introduction
The veraPDF configuration lives in a sub-directory called config, found either
below the veraPDF installation directory or in the verapdf folder
in the user directory. It contains the XML configuration files for the veraPDF software
components. To see the contents of this directory, open a terminal session
in the installation root directory and type ls config/ on Mac or Linux
machines, or dir config on Windows machines. On my Windows test VM
this outputs the following:
C:\Users\cfw\verapdf>dir config
Volume in drive C has no label.
Volume Serial Number is 1C45-2074
Directory of C:\Users\cfw\verapdf\config
22/01/2023 12:44 <DIR> .
22/01/2023 12:44 <DIR> ..
22/01/2023 12:44 411 app.xml
22/01/2023 12:44 186 features.xml
22/01/2023 12:44 109 fixer.xml
22/01/2023 12:44 131 validator.xml
4 File(s) 837 bytes
2 Dir(s) 3,695,038,464 bytes free
If you can’t see any files then it’s likely you’ve not run the application after installation. The software generates default configuration files on start-up if none exist. Try running verapdf --version which should generate the missing files.
If you are running a version of the application that you have built yourself rather than installed, the config folder is located in the verapdf folder
in the user directory.
veraPDF config files
There are four config files available:
- app.xml configures the veraPDF CLI and GUI applications;
- validator.xml sets defaults for PDF/A or PDF/UA validation;
- fixer.xml provides configuration of the metadata fixer; and
- features.xml configures feature extraction.
The sections below give a brief overview of these files and their options.
Configuring the veraPDF application
A default application config file looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<appConfig type="VALIDATE" format="XML" isVerbose="false">
<fixerFolder></fixerFolder>
<wikiPath>https://github.com/veraPDF/veraPDF-validation-profiles/wiki/</wikiPath>
<policyFile></policyFile>
</appConfig>
appConfig
The appConfig element has a set of attributes that can be used as follows:
type
controls the default processing model for the GUI. Legal values are:
- VALIDATE: PDF/A or PDF/UA validation.
- VALIDATE_FIX: PDF/A or PDF/UA validation and metadata fixing.
- EXTRACT: feature extraction.
- VALIDATE_EXTRACT: PDF/A or PDF/UA validation and feature extraction.
- EXTRACT_FIX: PDF/A or PDF/UA validation, feature extraction and metadata fixing.
- POLICY: policy checking; this also enables PDF/A or PDF/UA validation and feature extraction, as the policy checker depends upon them.
- POLICY_FIX: policy checking and metadata fixing; again, PDF/A or PDF/UA validation and feature extraction are also enabled.
format
chooses the default reporting format. Valid values are:
- XML (MRR): machine-readable report, an XML file that has been formatted for machine parsing and reporting.
- RAW: the raw XML report format, which contains all application configuration properties and the ungrouped, unsorted list of all failed checks (assertions). This is the raw XML data used by the veraPDF APIs; it’s not quite as readable as the XML format but can be de-serialised by the veraPDF API for further processing.
- HTML: a formatted HTML report intended for human consumption.
- TEXT: very brief single-line text output.
isVerbose
can be set to false for brief output, which is the default, or true for verbose output in the text report.
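As an illustration of these attributes, the sketch below shows an app.xml that switches the default processing model to validation plus feature extraction and the default report to verbose text. The attribute values are taken from the lists above; the child elements are copied unchanged from the default file.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Sketch only: validate and extract features by default, verbose text report -->
<appConfig type="VALIDATE_EXTRACT" format="TEXT" isVerbose="true">
  <fixerFolder></fixerFolder>
  <wikiPath>https://github.com/veraPDF/veraPDF-validation-profiles/wiki/</wikiPath>
  <policyFile></policyFile>
</appConfig>
```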
fixerFolder
The fixerFolder
element sets a default folder where the repaired files generated by the metadata fixer are written.
wikiPath
The wikiPath
element defines the base URL used to create reference links in the HTML report. You’re unlikely to want to change this unless you intend to host your own local version
of the veraPDF validation rule wiki.
policyFile
The policyFile
element defines the default policy file to be applied by the veraPDF policy checker.
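A sketch of an app.xml set up for policy checking with metadata fixing might look like the following. Both paths are hypothetical placeholders rather than locations that veraPDF creates, and the policy file name is invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<appConfig type="POLICY_FIX" format="XML" isVerbose="false">
  <!-- Hypothetical folder that receives the files repaired by the metadata fixer -->
  <fixerFolder>/home/user/verapdf-fixed</fixerFolder>
  <wikiPath>https://github.com/veraPDF/veraPDF-validation-profiles/wiki/</wikiPath>
  <!-- Hypothetical policy file applied by the policy checker by default -->
  <policyFile>/home/user/policies/my-policy.sch</policyFile>
</appConfig>
```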
Configuring PDF/A or PDF/UA validation
The default validation config file contains:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<validatorConfig flavour="NO_FLAVOUR" defaultFlavour="PDFA_1_B" recordPasses="false" maxFails="-1" debug="false" showErrorMessages="true" isLogsEnabled="true" loggingLevel="WARNING" maxNumberOfDisplayedFailedChecks="100" showProgress="false"/>
The validatorConfig element
The validatorConfig
element defines the following attributes:
- flavour: the default flavour to use when none is specified by the user; can be PDFA_1_A, PDFA_1_B, PDFA_2_A, PDFA_2_B, PDFA_2_U, PDFA_3_A, PDFA_3_B, PDFA_3_U, PDFA_4, PDFA_4_E, PDFA_4_F, PDFUA_1, PDFUA_2 or NO_FLAVOUR (for automatic detection).
- recordPasses: set true to report passed validation checks, false to report failures only.
- maxFails: specifies the maximum number of failed checks before validation is terminated; the default value of -1 means report all failures.
- maxNumberOfDisplayedFailedChecks: specifies how many failed tests are reported per validation rule.
- debug: set true to output all processed file names.
- showErrorMessages: set true to add a detailed error message for each check (xml, json, raw or html).
- isLogsEnabled: set true to add logs to the report (xml, json or html).
- loggingLevel: determines the log level; can be “OFF”, “SEVERE”, “WARNING”, “CONFIG” or “ALL”.
- showProgress: set true to show the current status of the validation job (CLI only).
- defaultFlavour: the flavour to use when automatic detection does not work.
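For example, a validator.xml that always validates against one flavour instead of detecting it, records passed checks and stops after ten failed checks could be sketched as follows. The flavour spelling follows the PDFA_1_B form used in the default file above; the other values are illustrative choices, not recommendations.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Sketch only: fixed flavour, passes recorded, validation stops after 10 failed checks -->
<validatorConfig flavour="PDFA_1_B" defaultFlavour="PDFA_1_B" recordPasses="true" maxFails="10" debug="false" showErrorMessages="true" isLogsEnabled="true" loggingLevel="SEVERE" maxNumberOfDisplayedFailedChecks="100" showProgress="false"/>
```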
Configuring feature extraction
The config/features.xml
file configures the types of PDF features extracted by the
veraPDF software. The default file contains a single entry:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<featuresConfig>
<enabledFeatures>
<feature>INFORMATION_DICTIONARY</feature>
</enabledFeatures>
</featuresConfig>
This enables the extraction of the PDF document metadata held in the information
dictionary. You can enable the extraction of other features by adding new
<feature>
sub-elements to the <enabledFeatures>
element.
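As a small illustration, the following sketch enables three features at once. The feature names are taken from the complete list in this section; which ones you enable is up to you.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<featuresConfig>
  <enabledFeatures>
    <!-- Document information dictionary (enabled in the default file) -->
    <feature>INFORMATION_DICTIONARY</feature>
    <!-- Document-level XMP metadata -->
    <feature>METADATA</feature>
    <!-- Fonts used in the document -->
    <feature>FONT</feature>
  </enabledFeatures>
</featuresConfig>
```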
For reference here’s a version of features.xml
with every type of feature
enabled:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<featuresConfig>
<enabledFeatures>
<feature>ACTION</feature>
<feature>ANNOTATION</feature>
<feature>COLORSPACE</feature>
<feature>DOCUMENT_SECURITY</feature>
<feature>EMBEDDED_FILE</feature>
<feature>EXT_G_STATE</feature>
<feature>FONT</feature>
<feature>FORM_XOBJECT</feature>
<feature>ICCPROFILE</feature>
<feature>IMAGE_XOBJECT</feature>
<feature>INFORMATION_DICTIONARY</feature>
<feature>INTERACTIVE_FORM_FIELDS</feature>
<feature>LOW_LEVEL_INFO</feature>
<feature>METADATA</feature>
<feature>OUTLINES</feature>
<feature>OUTPUTINTENT</feature>
<feature>PAGE</feature>
<feature>PATTERN</feature>
<feature>POSTSCRIPT_XOBJECT</feature>
<feature>PROPERTIES</feature>
<feature>SHADING</feature>
<feature>SIGNATURE</feature>
</enabledFeatures>
</featuresConfig>
ACTION
Lists all action elements associated with various document, page and interactive form events. The extracted action element contains information about the action type and the location (document, page, annotation, outline) with which the action is associated.
ANNOTATION
Lists all of the annotations found within the document. The extracted annotation elements contain detailed information about each annotation, e.g. its type, location, and references to the annotation resources and other annotations it uses.
COLORSPACE
Lists all colour spaces contained in the document. The description of each colour space contains the details relevant to its colour space family. The family is specified in the family attribute. Possible colour space families are:
- DeviceGray
- DeviceRGB
- DeviceCMYK
- CalGray
- CalRGB
- Lab
- ICCBased
- Indexed
- Pattern
- Separation
- DeviceN
DOCUMENT_SECURITY
Requests information about document security including encryption, password protection and permissions.
EMBEDDED_FILE
Extracts information about any embedded files contained within a PDF document.
EXT_G_STATE
Lists the graphic states used in the document and their properties, e.g. transparency.
FONT
Lists any fonts used in the document. The description of each font contains the details relevant to the given font type. The child elements of the font element are:
- subtype
- name
- baseName
- firstChar
- lastChar
- widths
- encoding
- embedded
- subset
- fontDescriptor (the font descriptor describing the font’s metrics other than its glyph widths)
FORM_XOBJECT
Extracts information about any forms contained in the document.
ICCPROFILE
Configures the extraction of ICC profiles contained in the PDF document.
IMAGE_XOBJECT
Extracts information about the images contained in the document, such as height, width and the compression used.
INFORMATION_DICTIONARY
This enables the extraction of key-value pairs from the PDF document information dictionary. The dictionary key name is saved as the value of the key attribute; the dictionary value is saved as the value of the entry element.
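Going by that description, the extracted entries might look roughly like the fragment below. The keys are standard information dictionary keys, the values are invented for illustration, and the enclosing report elements are omitted because they are not described here.

```xml
<!-- Illustrative fragment only, not captured veraPDF output -->
<entry key="Title">Annual Report 2022</entry>
<entry key="Author">Jane Doe</entry>
<entry key="Producer">Example PDF Library 1.0</entry>
```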
INTERACTIVE_FORM_FIELDS
Extracts information about all interactive form fields found in the document. The extracted information includes the name of the form field and its value.
LOW_LEVEL_INFO
Extracts information about indirect objects, the document ID, and the compression / decoding filters used in the document.
METADATA
Requests reporting of the document-level XMP metadata package exactly as it is in the original PDF document or, if automatic XMP metadata fixing is enabled, in the resulting PDF document. Since XMP serialization is based on XML, the serialized XMP packet needs no changes except for its encoding. If the encoding used by the XMP differs from the encoding used for report generation, the XMP is re-encoded to make it consistent with the rest of the report.
OUTLINES
Extracts information pertaining to any bookmarks in the document.
OUTPUTINTENT
Requests the extraction of information about the document’s output intents.
PAGE
Lists the page elements, each representing a page in the PDF document. This includes information about:
- media boxes;
- crop boxes;
- trim boxes;
- bleed boxes;
- art boxes;
- rotation;
- scaling;
- thumbnails;
- resources, including reference to fonts and images used on a page; and
- annotations.
PATTERN
Gathers information about the patterns contained in the PDF.
POSTSCRIPT_XOBJECT
Extracts information about any PostScript fragments used when printing to a PostScript device.
PROPERTIES
Lists the properties dictionaries.
SHADING
Lists the shadings used in the document.
SIGNATURE
Extracts information about any digital signatures contained in the document.
Configuring plugins
The config/plugins.xml
file configures plug-in components for veraPDF. The default file contains an empty entry:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pluginsConfig/>
To have a plug-in executed, add a plugin
element for it.
For reference here’s an example of plugins.xml
with a single plugin enabled:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pluginsConfig>
<plugin enabled="true">
<name>Plugin Name</name>
<version>1.0.1</version>
<description>Some plugin description</description>
<pluginJar>pluginPath/plugin.jar</pluginJar>
<attributes>
<attribute key="attrKey" value="attrValue"/>
<attribute key="attr2Key" value="attr2Value"/>
</attributes>
</plugin>
</pluginsConfig>
enabled
The enabled
attribute specifies whether the plug-in is executed during feature extraction. It
can be used to disable the plug-in temporarily without removing its configuration data.
name
The plug-in name, which is added to the features report.
version
The plug-in version, which is added to the features report.
description
The plug-in description, which is added to the features report.
pluginJar
The path to the plug-in jar file. It must be either absolute or relative to the veraPDF installation folder.
attributes
A list of attribute
nodes, each of which carries two XML attributes, key
and value.
The resulting map is used as the attributes map for the plug-in.
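To illustrate the enabled attribute, the sketch below keeps a plugin's configuration in place but switches the plugin off. The plugin name, jar path and attribute key are hypothetical placeholders.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pluginsConfig>
  <!-- enabled="false": the configuration is kept but the plugin is not executed -->
  <plugin enabled="false">
    <name>Example Plugin</name>
    <version>1.0.0</version>
    <description>Hypothetical plugin used for illustration</description>
    <!-- Absolute path; a path relative to the veraPDF installation folder is also allowed -->
    <pluginJar>/opt/verapdf/plugins/example-plugin.jar</pluginJar>
    <attributes>
      <attribute key="exampleKey" value="exampleValue"/>
    </attributes>
  </plugin>
</pluginsConfig>
```

Setting enabled back to "true" re-activates the plugin without retyping any of its settings.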
Configuring the metadata fixer
The default fixer config file contains:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fixerConfig fixId="true" fixesPrefix="veraFixMd_"/>