veraPDF CLI Configuration

Introduction

The veraPDF configuration lives in a sub-directory called config, located either below the veraPDF installation directory or in the verapdf folder in the user directory. It contains the XML configuration files for the veraPDF software components. To see the contents of this directory from a terminal session in the installation root directory, type ls config/ on Mac or Linux machines or dir config on Windows machines. On my Windows test VM this outputs the following:

C:\Users\cfw\verapdf>dir config
 Volume in drive C has no label.
 Volume Serial Number is 1C45-2074

 Directory of C:\Users\cfw\verapdf\config

22/01/2023  12:44    <DIR>          .
22/01/2023  12:44    <DIR>          ..
22/01/2023  12:44               411 app.xml
22/01/2023  12:44               186 features.xml
22/01/2023  12:44               109 fixer.xml
22/01/2023  12:44               131 validator.xml
               4 File(s)            837 bytes
               2 Dir(s)   3,695,038,464 bytes free

If you can’t see any files then it’s likely you’ve not run the application after installation. The software generates default configuration files on start-up if none exist. Try running verapdf --version, which should generate the missing files.

If you are running a version of the application that you have built yourself rather than installed, the config folder is located in the verapdf folder in the user directory.

veraPDF config files

There are four config files available:

app.xml configures the veraPDF CLI and GUI applications;
validator.xml sets defaults for PDF/A or PDF/UA validation;
fixer.xml provides configuration of the metadata fixer; and
features.xml configures feature extraction.

The sections below give a brief overview of these files and their options.

Configuring the veraPDF application

A default application config file looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<appConfig type="VALIDATE" format="XML" isVerbose="false">
    <fixerFolder></fixerFolder>
    <wikiPath>https://github.com/veraPDF/veraPDF-validation-profiles/wiki/</wikiPath>
    <policyFile></policyFile>
</appConfig>

appConfig

The appConfig element has a set of attributes that can be used as follows:

type controls the default processing model for the GUI. Legal values are:
- VALIDATE : PDF/A or PDF/UA validation.
- VALIDATE_FIX : PDF/A or PDF/UA validation and metadata fixing.
- EXTRACT : Feature extraction.
- VALIDATE_EXTRACT : PDF/A or PDF/UA validation and feature extraction.
- EXTRACT_FIX : PDF/A or PDF/UA validation, feature extraction and metadata fixing.
- POLICY : Policy checking. This also enables PDF/A or PDF/UA validation and feature extraction, as the policy checker depends upon them.
- POLICY_FIX : Policy checking and metadata fixing. Again, PDF/A or PDF/UA validation and feature extraction are also enabled.
format chooses the default reporting format. Valid values are:
- XML (MRR) : machine readable report, an XML file that has been formatted for machine parsing and reporting.
- RAW : the RAW XML report format contains all application configuration properties and the ungrouped, unsorted list of all failed checks (assertions). This is the raw XML data used by the veraPDF APIs; it’s not quite as readable as the XML format but can be de-serialised by the veraPDF API for further processing.
- HTML : a formatted HTML report intended for human consumption.
- TEXT : very brief single line text output.
isVerbose can be set to false for brief output, which is the default, or true for verbose output in the text report.

fixerFolder

The fixerFolder element sets a default folder where the repaired files generated by the metadata fixer are written.

wikiPath

The wikiPath element defines the base URL used to create reference links in the HTML report. You’re unlikely to want to change this unless you intend to host your own local version of the veraPDF validation rule wiki.

policyFile

The policyFile element defines the default policy file to be applied by the veraPDF policy checker.
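
As a rough illustration of the options above, the following sketch shows an app.xml that defaults to policy checking with HTML reports. The policyFile path is a hypothetical placeholder rather than a file shipped with veraPDF; the attribute values are taken from the lists above and the remaining elements are left as in the default file.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Sketch only: policy checking by default, reported as HTML -->
<appConfig type="POLICY" format="HTML" isVerbose="false">
    <!-- left empty, as in the default file -->
    <fixerFolder></fixerFolder>
    <wikiPath>https://github.com/veraPDF/veraPDF-validation-profiles/wiki/</wikiPath>
    <!-- hypothetical path to a policy file; replace with your own -->
    <policyFile>/home/user/policies/example-policy.sch</policyFile>
</appConfig>
```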

Configuring PDF/A or PDF/UA validation

The default validation config file contains:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<validatorConfig flavour="NO_FLAVOUR" defaultFlavour="PDFA_1_B" recordPasses="false" maxFails="-1" debug="false" showErrorMessages="true" isLogsEnabled="true" loggingLevel="WARNING" maxNumberOfDisplayedFailedChecks="100" showProgress="false"/>

The validatorConfig element

The validatorConfig element defines the following attributes:

flavour the default flavour to use when none is specified by the user; can be PDFA_1_A, PDFA_1_B, PDFA_2_A, PDFA_2_B, PDFA_2_U, PDFA_3_A, PDFA_3_B, PDFA_3_U, PDFA_4, PDFA_4_E, PDFA_4_F, PDFUA_1, or NO_FLAVOUR (for automatic detection).
recordPasses set true to report passed validation checks, false to report failures only.
maxFails specifies the maximum number of failed checks before validation is terminated; the default value of -1 means report all failures.
maxNumberOfDisplayedFailedChecks specifies how many failed checks are reported per validation rule.
debug set true to output all processed file names.
showErrorMessages set true to add a detailed error message for each check (xml, json, raw or html).
isLogsEnabled set true to add logs to the report (xml, json or html).
loggingLevel determines the log level; can be “OFF”, “SEVERE”, “WARNING”, “CONFIG” or “ALL”.
showProgress set true to show the current status of the validation job (CLI only).
defaultFlavour the flavour to use when automatic detection fails.
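
To make the attributes above more concrete, here is a minimal sketch of a validator.xml that differs from the default in four places: it records passed checks, stops after 10 failed checks, logs only severe messages, and shows progress. The specific values are illustrative choices, not recommendations.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Sketch only: changed from the default are recordPasses, maxFails, loggingLevel and showProgress -->
<validatorConfig flavour="NO_FLAVOUR" defaultFlavour="PDFA_1_B" recordPasses="true" maxFails="10" debug="false" showErrorMessages="true" isLogsEnabled="true" loggingLevel="SEVERE" maxNumberOfDisplayedFailedChecks="100" showProgress="true"/>
```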

Configuring feature extraction

The config/features.xml file configures the types of PDF features extracted by the veraPDF software. The default file contains a single entry:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<featuresConfig>
    <enabledFeatures>
        <feature>INFORMATION_DICTIONARY</feature>
    </enabledFeatures>
</featuresConfig>

This enables the extraction of the PDF document metadata held in the information dictionary. You can enable the extraction of other features by adding new <feature> sub-elements to the <enabledFeatures> element.
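
For instance, a features.xml that also extracts font and image information might look like the sketch below. The two added feature names are taken from the full list in this section; which features you enable is an illustrative choice.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<featuresConfig>
    <enabledFeatures>
        <!-- enabled in the default file -->
        <feature>INFORMATION_DICTIONARY</feature>
        <!-- added for illustration -->
        <feature>FONT</feature>
        <feature>IMAGE_XOBJECT</feature>
    </enabledFeatures>
</featuresConfig>
```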

For reference here’s a version of features.xml with every type of feature enabled:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<featuresConfig>
    <enabledFeatures>
        <feature>ACTION</feature>
        <feature>ANNOTATION</feature>
        <feature>COLORSPACE</feature>
        <feature>DOCUMENT_SECURITY</feature>
        <feature>EMBEDDED_FILE</feature>
        <feature>EXT_G_STATE</feature>
        <feature>FONT</feature>
        <feature>FORM_XOBJECT</feature>
        <feature>ICCPROFILE</feature>
        <feature>IMAGE_XOBJECT</feature>
        <feature>INFORMATION_DICTIONARY</feature>
        <feature>INTERACTIVE_FORM_FIELDS</feature>
        <feature>LOW_LEVEL_INFO</feature>
        <feature>METADATA</feature>
        <feature>OUTLINES</feature>
        <feature>OUTPUTINTENT</feature>
        <feature>PAGE</feature>
        <feature>PATTERN</feature>
        <feature>POSTSCRIPT_XOBJECT</feature>
        <feature>PROPERTIES</feature>
        <feature>SHADING</feature>
        <feature>SIGNATURE</feature>
    </enabledFeatures>
</featuresConfig>

ACTION

Lists all action elements associated with document, page and interactive form events. Each extracted action element contains information about the action type and the location (document, page, annotation, outline) with which the action is associated.

ANNOTATION

Lists all of the annotations found within the document. The extracted annotation elements contain detailed information about each annotation, e.g. its type, location, references to the annotation resources, and other annotations it uses.

COLORSPACE

Lists all colour spaces contained in the document. The description of each colour space contains details relevant to its colour space family, which is specified in the family attribute. Possible colour space families are:

DeviceGray
DeviceRGB
DeviceCMYK
CalGray
CalRGB
Lab
ICCBased
Indexed
Pattern
Separation
DeviceN

DOCUMENT_SECURITY

Requests information about document security including encryption, password protection and permissions.

EMBEDDED_FILE

Extracts information about any embedded files contained within a PDF document.

EXT_G_STATE

Lists the graphic states used in the document and their properties, e.g. transparency.

FONT

Lists any fonts used in the document. The description of each font contains the details relevant to its font type. The child elements of the font element are:

subtype
name
baseName
firstChar
lastChar
widths
encoding
embedded
subset
fontDescriptor (the font descriptor describing the font’s metrics other than its glyph widths)

FORM_XOBJECT

Extracts information about any forms contained in the document.

ICCPROFILE

Configures the extraction of ICC profiles contained in the PDF document.

IMAGE_XOBJECT

Extracts information about the images contained in the document, such as height, width and the compression used.

INFORMATION_DICTIONARY

This enables the extraction of key-value pairs from the PDF document information dictionary. The dictionary key name is saved as the value of the key attribute; the dictionary value is saved as the value of the entry element.
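
As a hedged illustration of that mapping, a document whose information dictionary holds Title and Producer entries might be reported along these lines. The values are invented for illustration, and the enclosing report elements are omitted because they are not described here.

```xml
<!-- Sketch only: hypothetical values; surrounding report elements omitted -->
<entry key="Title">Annual Report</entry>
<entry key="Producer">Example PDF Library 1.0</entry>
```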

INTERACTIVE_FORM_FIELDS

Extracts information about all interactive form fields found in the document. The extracted information includes the name of the form field and its value.

LOW_LEVEL_INFO

Extracts information about indirect objects, the document ID, and the compression / decoding filters used in the document.

METADATA

Requests reporting of the document-level XMP metadata package exactly as it is in the original PDF document or, if automatic XMP metadata fixing is enabled, in the resulting PDF document. Since XMP serialization is based on XML, the serialized XMP packet needs no changes except for encoding. If the encoding used by the XMP differs from the encoding used for report generation, the XMP will be re-encoded to make it consistent with the rest of the report.

OUTLINES

Extracts information pertaining to any bookmarks in the document.

OUTPUTINTENT

Requests the extraction of information about the document’s output intents.

PATTERN

Gathers information about the patterns contained in the PDF.

POSTSCRIPT_XOBJECT

Extracts information about any PostScript fragments used when printing to a PostScript device.

PROPERTIES

Lists the properties dictionaries.

SHADING

Lists the shadings used in the document.

SIGNATURE

Extracts information about any digital signatures contained in the document.

Configuring plugins

The config/plugins.xml file configures plugin components for veraPDF. The default file contains an empty entry:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pluginsConfig/>

To have a plugin executed, a plugin element must be specified.

For reference, here’s an example of plugins.xml with a single plugin enabled:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pluginsConfig>
    <plugin enabled="true">
        <name>Plugin Name</name>
        <version>1.0.1</version>
        <description>Some plugin description</description>
        <pluginJar>pluginPath/plugin.jar</pluginJar>
        <attributes>
            <attribute key="attrKey" value="attrValue"/>
            <attribute key="attr2Key" value="attr2Value"/>
        </attributes>
    </plugin>
</pluginsConfig>

enabled

The enabled attribute specifies whether the plugin is executed during feature extraction. It can be used to temporarily disable a plugin without removing its configuration data.

name

The plugin name, which will be added to the features report.

version

The plugin version, which will be added to the features report.

description

The plugin description, which will be added to the features report.

pluginJar

The path to the plugin jar file. It must be either absolute or relative to the veraPDF installation folder.

attributes

A list of attribute nodes. Each contains two XML attributes, key and value. The resulting map is used as the attributes map for the plugin.
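
To illustrate the enabled attribute, the sketch below keeps a plugin's full configuration in plugins.xml but switches the plugin off. The plugin name, jar path and attribute keys are the same placeholder values used in the example above, not a real plugin.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pluginsConfig>
    <!-- Sketch only: enabled="false" keeps the configuration but skips the plugin -->
    <plugin enabled="false">
        <name>Plugin Name</name>
        <version>1.0.1</version>
        <description>Some plugin description</description>
        <!-- placeholder path, relative to the veraPDF installation folder -->
        <pluginJar>pluginPath/plugin.jar</pluginJar>
        <attributes>
            <attribute key="attrKey" value="attrValue"/>
        </attributes>
    </plugin>
</pluginsConfig>
```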

Configuring the metadata fixer

The default metadata fixer config file contains:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fixerConfig fixId="true" fixesPrefix="veraFixMd_"/>