The plugin is usually executed fully automatically within the workflow. It starts the configured checking process and then outputs whether the required checking level has been reached. If one of the checked documents does not reach the required level, the plugin fails.
This plugin is integrated into the workflow in such a way that it is executed automatically. Manual interaction with the plugin is not necessary. For use within a workflow step, it should be configured as shown in the screenshot below.
Configuration
The configuration of the plugin is done via the configuration file plugin_intranda_step_file_validation.xml and can be adjusted during operation. The following is an example configuration file:
<config_plugin><!-- order of configuration is: 1.) project name and step name matches 2.) step name matches and project is * 3.) project name matches and step name is * 4.) project name and step name are * --> <config><!-- which projects to use for (can be more then one, otherwise use *) --> <project>*</project><!-- which stepss to use for (can be more then one, otherwise use *) --> <step>*</step><!-- input folder where the documents are located, is only used in the STEP-Plugin --> <inputFolder>{processpath}/pdf</inputFolder><!-- outputfolder where the folder with the tool reports will be created, is only used in the STEP-Plugin --> <outputFolder>{processpath}/validation</outputFolder><!-- fileFilter: regex-Pattern that allows to filter by filename and fileextension --> <fileFilter>(?i).*\.pdf|.*\.epub</fileFilter><!-- name of the profile that shall be used by this config blog --> <profileName>epubPdf</profileName><!--targetLevel that must be reached for a successful plugin run --> <targetLevel>2</targetLevel> <config> </config><!-- which institution to use for (can be more then one, otherwise use *) --> <institution>*</institution> <profileName>epubPdf</profileName> <targetLevel>0</targetLevel> </config><!-- global has the child elements profile, namespaces and tools profile contains the definition of the ingest levels. It also has the attribute name, so you can refer to the profile in the config-blog element profileName the order of the levels defines their numbering, the first level element defines level zero and so on. a level element can contain check- and setValue- elements. a check element has following attributes name: name of the check dependsOn: name of the check that must have been successful, if this check shall be executed a check can only depend on a check that was defined before it. the parameter is optional tool: name of the tool that must be executed to create the report group: checks can be grouped. grouped checks are OR-operated which means, that the level won't fail if one check of the group is successful. ( i.e. check for isPDF-A and isPDFx in one level) code: Errorcode or Errormessage that shall be displayed when the check fails /regex doesn't match node does not exist
xpathSelector: xpathSelector to selct the node or attribute value regex: regular expression that will be matched with the read value, if no regular expression is provided the check will only test if the node exists. namespace: (only needed if the specified check uses another namespace than the tool and if namespaces are used) a setValue element has following attributes name: name of the check dependsOn: name of the check that must have been successful if this setValue-Element shall be executed a setValue-Element can only depend on a check not on other setValue Elements. set value Elements will always be executed after the checks. the parameter is mandatory tool: name of the tool that must have ben executed to create the report code: Errorcode or Errormessage that shall be displayed when value retrival fails. xpathSelector: xpathSelector to selct an attribute value namespace: (only needed if the specified setValue-Element uses another namespace than the tool and if namespaces are used) a tools element contains multiple tool elements a tool element hast the attributes name: name of the tool cmd: the command that must be run to create the xml report. you can use the {pv.outputFile} variable to refer to stdout: if stdout is true, the reportfile will be generated from the commandline output of the file. if it is set to false
the plugins assumes the tool is able to create the file by itself xmlNamespace: the name of the xml-namespace the generated report uses "jhove" a namespaces element can contain multiple namespace elements a namespace element has the attributes: name: the name of the xml namespace used in the xml and to address it in xmlNamespace attributes of tool-, check- and setValue-Elements
uri: the uri of the xml namespace --> <global> <profilename="epubPdf" > <level><!-- 0 DI check Integrity of Document --><!--checksum test should be done here --> </level> <level><!-- 1 ID Document with File --> <checkname="isPDF"tool="file"group="fileformat"code="This is not a PDF-File"xpathSelector="//format"regEx="(?i)^pdf document.*" /> <checkname="isEPUB"tool="file"group="fileformat"code="This is not an EPUB-File"xpathSelector="//format"regEx="(?i)^epub document.*" /> </level> <level><!-- 2 BF check for encryption or access restrictions --> <checkname="checkEncryption"dependsOn="isPDF"tool="pdfinfo"code="The file is encrypted or has access restrictions"xpathSelector="//pdfinfo/Encrypted"regEx="^no$" /> </level> <level><!-- 3 MD Extraction of Metadata --> <setValuename="PdfVersion"dependsOn="isPDF"tool="jhove_pdf"code="Could not read Version Information!"xpathSelector="//jhove:repInfo/jhove:version"processProperty="PDFVersion" /> <setValuename="FilesizePDF"dependsOn="isPDF"tool="jhove_pdf"code="Couldn't obtain Filesize"xpathSelector="//jhove:repInfo/jhove:size"processProperty="Filesize" /> <setValuename="EPUBVersion"dependsOn="isEPUB"tool="jhove_epub"code="Could not read Version Information!"xpathSelector="//jhove:repInfo/jhove:version"processProperty="EPUBVersion" /> <setValuename="FilesizeEPUB"dependsOn="isEPUB"tool="jhove_epub"code="Couldn't obtain Filesize"xpathSelector="//jhove:repInfo/jhove:size"processProperty="Filesize" /> </level> <level><!-- 4 V Validity --> <checkname="checkPDFVersion"dependsOn="isPDF"tool="jhove_pdf"code="The Version of the PDF-File is not supported by this Version of JHOVE"xpathSelector="//jhove:repInfo/jhove:version"regEx="^1\.[012456]$|^2\.0$" /> <checkname="isValidPDF"dependsOn="checkPDFVersion"tool="jhove_pdf"code="PDF Validation failed"xpathSelector="//jhove:repInfo/jhove:status"regex="Well-Formed and valid" /> <checkname="isValidEPUB"dependsOn="isEPUB"tool="jhove_epub"code="EPUB Validation failed"xpathSelector="//jhove:repInfo/jhove:status"regex="Well-Formed and valid" /><!-- <check name="pdf-a validation" dependsOn="isPDF" tool="verapdf" code="pdfa_validation_failed" xpathSelector="xpathSelector" regEx="regEx" /> --> </level> </profile> <namespaces> <namespacename="jhove"uri="http://hul.harvard.edu/ois/xml/ns/jhove" /> </namespaces> <tools> <toolname="jhove_pdf"cmd="/usr/bin/jhove -m PDF-hul -h XML -o {pv.outputFile} {pv.inputFile}"stdout="false"xmlNamespace ="jhove" /> <toolname="jhove_epub"cmd="/usr/bin/jhove -m EPUB-ptc -h XML -o {pv.outputFile} {pv.inputFile}"stdout="false"xmlNamespace ="jhove" /> <toolname="verapdf"cmd="/opt/digiverso/verapdf/verapdf --format mrr {pv.inputFile}"stdout="true" /> <toolname="pdfinfo"cmd="/opt/digiverso/goobi/config/pdfinfogawk.sh {pv.inputFile}"stdout="true" /> <toolname="file"cmd="/opt/digiverso/goobi/config/filegawk.sh {pv.inputFile}"stdout="true" /> </tools> </global></config_plugin>
The config_plugin element can have two child element types: config and global. First, the functionality of the config element is described here.
Structure of the config element
Structure of the global element
The global element can have 3 child element types: profile, namespaces and tools.
Structure of the namespace element
The namespace element can have several children of the type namespace. A namespace here describes an XML namespace and has the following attributes:
Structure of the tools element
The tools element can have several children of the type tool. The tool element can be used to describe the parameters needed to execute a tool/script from the plugin.
Structure of the profile element
The profile element can have several children of the type tool. It also has the attribute name with whose value it can be referenced in the 'config' element profileName. A profile has several elements of type level. Each level can contain several check and setValue elements. The levels are numbered internally according to their order. The first level element is level 0, the second level 1 and so on.
Structure of check and setValue elements
A check makes it possible to check a value in one of the generated xml-reports. A regular expression is used to check the value. If no regular expression is specified, only the existence of the specified xml element is checked. If a check fails, the level is considered to have failed. Unless the failed check is in a group, then all other checks in the group must also fail for the level to be considered failed.
The attributes of the check element look like this:
Ein setValue-Element ermöglicht es einen Wert aus einem der erzeugten Reports auszulesen und in den Prozesseigenschaften oder den Metadaten des obersten Strukturelements zu speichern. Die Attribute des setValue-Elements sehen wie folgt aus:
Solution for programmes that do not generate XML output
A basic requirement of this plugin is that the tools used generate XML output. However, it often happens that the desired tool does not generate XML output. In this case we advise you to transform the output to XML with a GAWK script. The output of the file command serves as an example here:
LoremIpsum-a3b.pdf:PDFdocument,version1.6
Instead of calling the tool directly, one would now create a shell script with the following content and store it in the cmd attribute of the tool:
If we only need the second parameter from the output of the file command, the (g)awk script could look like this:
BEGIN{ FS="|";printf("<?xml version=\"1.0\" ?>\n<file>\n");}{split($1,a,":");# remove whitespacegsub(/^[ \t]+/,"",a[2]);# remove unwanted Version Informationsub(/,.*/,"",a[2])# ignore key value and only print second valueprintf("<format>%s</format>\n",a[2]); }END{printf("</file>\n");}
The result would then be the following XML output:
Complete example of plugin configuration within the file plugin_intranda_step_file_validation.xml:
<config_plugin><!-- order of configuration is: 1.) project name and step name matches 2.) step name matches and project is * 3.) project name matches and step name is * 4.) project name and step name are * --> <config><!-- which projects to use for (can be more then one, otherwise use *) --> <project>*</project><!-- which stepss to use for (can be more then one, otherwise use *) --> <step>*</step><!-- input folder where the documents are located, is only used in the STEP-Plugin --> <inputFolder>/opt/digiverso/pdf</inputFolder><!-- outputfolder where the folder with the tool reports will be created, is only used in the STEP-Plugin --> <outputFolder>{processpath}/validation</outputFolder><!-- fileFilter: regex-Pattern that allows to filter by filename and fileextension --> <fileFilter>(?i).*\.pdf|.*\.epub</fileFilter><!-- name of the profile that shall be used by this config blog --> <profileName>epubPdf</profileName><!--targetLevel that must be reached for a successful plugin run --> <targetLevel>4</targetLevel> <config> </config><!-- which institution to use for (can be more then one, otherwise use *) --> <institution>*</institution> <profileName>epubPdf</profileName> <targetLevel>0</targetLevel> </config><!-- global has the child elements profile, namespaces and tools profile contains the definition of the ingest levels. It also has the attribute name, so you can refer to the profile in the config-blog element profileName the order of the levels defines their numbering, the first level element defines level zero and so on. a level element can contain check- and setValue- elements. a check element has following attributes name: name of the check dependsOn: name of the check that must have been successful, if this check shall be executed a check can only depend on a check that was defined before it. the parameter is optional tool: name of the tool that must be executed to create the report group: checks can be grouped. grouped checks are OR-operated which means, that the level won't fail if one check of the group is successful. ( i.e. check for isPDF-A and isPDFx in one level) code: Errorcode or Errormessage that shall be displayed when the check fails /regex doesn't match node does not exist
xpathSelector: xpathSelector to selct the node or attribute value regex: regular expression that will be matched with the read value, if no regular expression is provided the check will only test if the node exists. namespace: (only needed if the specified check uses another namespace than the tool and if namespaces are used) a setValue element has following attributes name: name of the check dependsOn: name of the check that must have been successful if this setValue-Element shall be executed a setValue-Element can only depend on a check not on other setValue Elements. set value Elements will always be executed after the checks. the parameter is mandatory tool: name of the tool that must have ben executed to create the report code: Errorcode or Errormessage that shall be displayed when value retrival fails. xpathSelector: xpathSelector to selct an attribute value namespace: (only needed if the specified setValue-Element uses another namespace than the tool and if namespaces are used) a tools element contains multiple tool elements a tool element hast the attributes name: name of the tool cmd: the command that must be run to create the xml report. you can use the {pv.outputFile} variable to refer to stdout: if stdout is true, the reportfile will be generated from the commandline output of the file. if it is set to false
the plugins assumes the tool is able to create the file by itself xmlNamespace: the name of the xml-namespace the generated report uses "jhove" a namespaces element can contain multiple namespace elements a namespace element has the attributes: name: the name of the xml namespace used in the xml and to address it in xmlNamespace attributes of tool-, check- and setValue-Elements
uri: the uri of the xml namespace --> <global> <profilename="epubPdf" > <level><!-- 0 DI check Integrity of Document --><!--checksum test should be done here --> </level> <level><!-- 1 ID Document with JHOVE --> <checkname="isPDF"tool="jhove"group="fileformat"code="This is not a PDF-File"xpathSelector="//jhove:repInfo/jhove:format"regEx="(?i)pdf$" /> <checkname="isEPUB"tool="jhove"group="fileformat"code="This is not an EPUB-File"xpathSelector="//jhove:repInfo/jhove:format"regEx="(?i)epub$" /> </level> <level> <!-- 2 BF check for encryption or access restrictions --> <checkname="checkEncryption"dependsOn="isPDF"tool="pdfinfo"code="The file is encrypted or has access restrictions"xpathSelector="//pdfinfo/Encrypted"regEx="^no$" /> </level> <level><!-- 3 MD Extraction of Metadata --> <setValuename="PdfVersion"dependsOn="isPDF"tool="jhove"code="Could not read Version Information!"xpathSelector="//jhove:repInfo/jhove:version"processProperty="PDFVersion" /> <setValuename="FilesizePDF"dependsOn="isPDF"tool="jhove"code="Couldn't obtain Filesize"xpathSelector="//jhove:repInfo/jhove:size"processProperty="Filesize" /> <setValuename="FilesizeEPUB"dependsOn="isEPUB"tool="jhove"code="Couldn't obtain Filesize"xpathSelector="//jhove:repInfo/jhove:version"processProperty="EPUBVersion" /> </level> <level><!-- 4 V Validity --> <checkname="checkPDFVersion"dependsOn="isPDF"tool="jhove"code="The Version of the PDF-File is not supported by this Version of JHOVE"xpathSelector="//jhove:repInfo/jhove:version"regEx="^1\.[012456]$|^2\.0$" /> <checkname="isValidPDF"dependsOn="checkPDFVersion"tool="jhove"code="PDF Validation failed"xpathSelector="//jhove:repInfo/jhove:status"regex="Well-Formed and valid" /> <checkname="isValidEPUB"dependsOn="isEPUB"tool="jhove"code="EPUB Validation failed"xpathSelector="//jhove:repInfo/jhove:status"regex="Well-Formed and valid" /><!-- <check name="pdf-a validation" tool="verapdf" code="pdfa_validation_failed" xpathSelector="xpathSelector" regEx="regEx" /> --> </level> </profile> <namespaces> <namespacename="jhove"uri="http://schema.openpreservation.org/ois/xml/ns/jhove" /> </namespaces> <tools> <toolname="jhove"cmd="/opt/digiverso/tools/jhove/jhove -h XML -m PDF-hul -o {pv.outputFile} {pv.inputFile}"stdout="false"xmlNamespace ="jhove" /> <toolname="verapdf"cmd="/home/michael/verapdf/verapdf --format mrr {pv.inputFile}"stdout="true" /> <toolname="pdfinfo"cmd="/opt/digiverso/tools/pdfinfogawk.sh {pv.inputFile}"stdout="true" /> </tools> </global></config_plugin>
Example for PDF validation
Example for PDF validation call using pdfinfogawk.sh: