Goobi workflow Plugins
Documentation homeGoobi workflow ManualGoobi workflow Digests
English
English
  • Overview
  • Administration
    • Archive Management
      • Using the plugin
      • Configuration of the plugin
    • Configuration editor
    • Copy Master-Anchor
    • Data Poller
    • Goobi-to-Goobi
      • Installation and configuration
      • Creation of the export directories
      • Transfer of the export directories
      • Importing the export directories
    • Reset pagination
    • Restoring archived image folders
    • Ruleset Compatibility
    • Ruleset editor
  • Dashboards
    • Barcode scanner Dashboard
    • Extended Dashboard
  • Exports
    • Customised export for the DMS Imagen Media Archive Management
    • Single Page Export
    • Configurable export
    • Fedora Export
    • Fedora Export PROV
    • Heris Export
    • Export for newspapers to the portal of the German Digital Library
    • PDF export to the NLI directory structure
    • Export of selected images
    • Stanford Export
    • VLM Export
    • HAAB Export
    • ZOP Export
  • Generic
    • Barcode Scanner
  • Imports
    • Legacy data import for the Austrian Federal Monuments Authority
    • Archive data import
    • Data import without catalogue query for ETH Zurich
    • Importing records from an Excel file
    • Import of card catalogues from KatZoom
    • Importing MAB Files
    • Import of Sisis SunRise Files
    • Import for journal articles from an Endnote Export
    • Data import with ALMA catalogue query for Zurich Central Library
    • Data import with CMI catalogue query for Zurich Central Library
    • Data import without catalogue query for the Zentralbibliothek Zurich
  • Metadata
    • Change Publication Type
    • Metadata extension for the creation of structural elements per image
  • OPAC
    • Ariadne Import
    • EAD data transfer
    • Generic XML Import
    • Generic JSON Import
    • Kalliope Import
    • MARC Import
    • PICA Import
    • Soutron Import
  • Repeated Jobs
    • Data import for the Austrian Housing Promotion Fund
    • HERIS Vocabulary Update
  • Statistics
    • Sudan Memory Translations
    • Visualisation of the throughput per user
  • Steps
    • ALMA API Plugin
    • Automatic pagination based on file names
    • Archiving image folders
    • Generating Archival Resource Keys (ARK)
    • Libsafe Integration
    • Assign batch
    • Batch Progress Plugin
    • Catalogue Request
    • Changing the workflow based on process properties
    • Generation of PDF files
    • Plugin for registering DOI via the DataCite API
    • Delay Workflow
    • Conditional workflow status delay
    • Delete Content
    • Display of metadata in a task
    • Plugin for DOI registration
    • Downloading and verifying files
    • Duplication of work steps
    • ePIC PID Registration (Handle & DOI)
    • EWIG Long term archiving
    • Metadata enrichment via Excel file
    • Package Export
    • Copying files from metadata fields
    • Upload files
    • File validation
    • Flex Editor
    • Generate ALTO IDs
    • Generate Identifier
    • Geonames Annotation
    • GeoNames Correction
    • Automatic Handle Assignment
    • Heris data import
    • Extraction of image metadata
    • Image scaling and watermarking
    • Selection of images
    • Quality control of images
    • Metadata transfer from a directory
    • Metadata Cleaning
    • Layout Wizzard
      • Using the plugin
        • Preview
          • Image area
          • Display and navigation options
        • Single page view
          • Folder and file options
          • Current image
          • General settings
          • File list
          • Save view
          • Working steps
          • Selected analysis step
          • Global cutting options
      • Technical details
        • Installation
        • Configuration of the LayoutWizzard
        • Configuration of the user interface
        • Workflow
    • Metadata edition
    • Capture metadata per image
    • Automatic enrichment of metadata from own vocabularies
    • Structure data import from an Excel file
    • Update Metadata Fields
    • Automatic METS enrichment with image files and pagination
    • Enrich METS file
    • Data migration from Visual Library
    • MIX Metadata Enrichment Plugin
    • OCR execution with mixed fonts
    • OCR page selection
    • Transfer OCR result to metadata field
    • Object Identifier Generation
    • Correction of tables of contents after an OLR
    • Data import for Book Interchange files
    • Split PDFs, extract full text and read table of contents
    • Electronic Publications
    • Generation of placeholder images
    • Process folder migration
    • Renaming files
    • Renaming files before the Rosetta ingest
    • Renaming Processes
    • Reorder Images
    • Replace images
    • Automatic setting of the representative
    • Reverse Image Order
    • Generation of docket files
    • Sending emails
    • Import of ECHO files as TEI
    • Tif-Validation
    • Transcription of image content
    • OCR using Transkribus
    • Import and download from Transkribus Collections
    • Creation of Uniform Resource Names (URN)
    • User Assignment
    • Vocabulary enrichment
    • Writing XMP metadata to image files
    • Metadata validation within a task
    • Invoices and delivery notes for user orders
  • Workflow
    • AEON data transfer
    • Barcode Generator
    • Close steps
    • Entity Editor - Artist Dictionary
    • Generic import plugin for excel files including validation
    • Process creation through file upload
    • Mass import from Excel data with EAD enrichment
    • Generic import plugin for JSON files
    • LayoutWizzard workflow plugin
    • Create process relationships
    • Mass upload
    • Import of newspaper issues as single pages
    • Project export as folder with images and Excel file
    • Mass import for brand studies and advertising material
    • Data transfer from AIM25
Powered by GitBook
On this page
  • Overview
  • Introduction
  • Installation
  • Configuration
  • Overview and functionality
  • Configuration
  • General configuration of the plugin
  • Configuration of the individual metadata fields
Export as PDF
  1. Workflow

Generic import plugin for excel files including validation

This workflow plugin allows the mass import of data from Excel files including the display of validation messages.

Overview

Name
Wert

Identifier

intranda_workflow_excelimport

Repository

Licence

GPL 2.0 or newer

Last change

24.07.2024 20:14:01

Introduction

This workflow plugin was implemented in order to provide an individual user interface in addition to the pure import routine, which provides extensive feedback on the validation of the content before it is actually imported.

Installation

To configure how this import plugin imports an Excel file different values in its configuration file can be adapted. The configuration file is located here:

/opt/digiverso/goobi/config/plugin_intranda_workflow_excelimport.xml

Configuration

The content of the configuration file looks like this:

<config_plugin>

    <!-- which file types shall be allowed for uploading these -->
    <allowed-file-extensions>/(\\.|\\/)(xls|xlsx)$/</allowed-file-extensions>

    <qaStepName>Quality assurance</qaStepName>

    <config>
        <!-- publication type to create -->
        <publicationType>Collection</publicationType>
        <!-- which digital collection to use -->

        <collection>General</collection>
        <rowIdentifier>2</rowIdentifier>
        <!-- define in which row the header is written, usually 1 -->
        <rowHeader>3</rowHeader>
        <!-- define in which row the data starts, usually 2 -->
        <rowDataStart>4</rowDataStart>
        <!-- define in which row the data ends, usually 20000 -->
        <rowDataEnd>100</rowDataEnd>

        <identifierHeaderName>1</identifierHeaderName>
        <!-- This block defines which column is mapped to which metadatum and what kind of validation it will be checked against
        Format: <metadata {list of attributes}/>

        Possible attributes include:
            - ugh: the name of the metadatum as defined in the ruleset
            - identifier: this metadatum will be associated with the column containing this in the row configured as <rowIdentifier> above

        Further there are a variety of possible validation criteria:
            - required: if this is set to true every cell of this metadatum will be expected to have some content
            - pattern: the contents of this attribute are interpreted as a regular expression, every cell of this metadatum is checked against it, note that in these regular expressions "," and "\" have to be escaped by "\", e.g. the expression for a single digit would be "\\d"
            - list: content of this attribute is expected to be the path to a plain text file with one expression per line, cells of this metadatum are checked to match one line of this file, cells may contain several elements of the list separated by "; "
            - either: checks that either this metadatum or the one provided by its identifier in this attribute have content
            - requiredFields: allows to configure fields that must have content before this metadatum is allowed to have content, takes a list of identifiers separated by "; "
            - wordcount: checks that a certain number of words are in the field

        Additionally it is possible to configure specific error messages for every validation criterion which are then displayed as a Mouse over Text in the validation area for any fields that failed a validation:
            - requiredErrorMessage: displayed if cell has no content despite being "required
            - patternErrorMessage: dispayed if cell contents do not match provided regex "pattern"
            - listErrorMessage: displayed if cell contents are not in provided "list"
            - eitherErrorMessage: displayed if neither this cell or the one configured as "either" have content
            - requiredFieldsErrorMessage: displayed if a cell had contents despite at least one of the ones listed as "requiredFields" being empty
            - wordcountErrorMessage: displayed if a cell has less Words than required
-->
        <metadata ugh="CatalogIDDigital" identifier="1" required="true" requiredErrorMessage="Field is required" pattern="^[A-Z]*-\\d*$" patternErrorMessage="Needs to be of the Format AAA-000" />
        <metadata ugh="Country" identifier="2" required="true" requiredErrorMessage="Field is required" list="/opt/digiverso/goobi/config/plugin_intranda_workflow_excelimport_ISO3166-1.txt" listErrorMessage="Content not in list of valid options"/>
        <metadata ugh="Institution" identifier="3" required="true" requiredErrorMessage="Field is required" />
        <metadata ugh="InstitutionOfficial" identifier="4" required="true" requiredErrorMessage="Field is required" />
        <metadata ugh="LanguageInstitution" identifier="4a" required="true" requiredErrorMessage="Field is required" list="/opt/digiverso/goobi/config/plugin_intranda_workflow_excelimport_ISO639-2.txt" listErrorMessage="Content not in list of valid options"/>
        <metadata ugh="ContactPostal" identifier="5a"  required="true" requiredErrorMessage="Field is required" />
        <metadata ugh="ContactPhone" identifier="5b" required="true" requiredErrorMessage="Field is required" pattern="^[\\d-+() ]*$" patternErrorMessage="May only contain numbers, +, -, ( and )"/>
        <metadata ugh="ContactWeb" identifier="5c" required="true" requiredErrorMessage="Field is required" pattern="^(https?|ftp|file)://[-a-zA-Z0-9+&amp;@#/%?=~_|!:\,.;]*[-a-zA-Z0-9+&amp;@#/%=~_|]" patternErrorMessage="Must be valid web address"/>
        <metadata ugh="ContactEmail" identifier="5d" required="true" requiredErrorMessage="Field is required" pattern="(?:[a-z0-9!#$%&amp;'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&amp;'*+/=?^_`{|}~-]+)*|\&quot;(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\&quot;)@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])$" patternErrorMessage="Must be valid Email address"/>
        <metadata ugh="ReferenceNumber" identifier="6a" required="true" requiredErrorMessage="Field is required" />
        <metadata ugh="ReferenceNumberType" identifier="6b" required="true" requiredErrorMessage="Field is required" />
        <metadata ugh="TitleDocMain" identifier="7" required="true" requiredErrorMessage="Field is required" />
        <metadata ugh="TitleDocMainOfficial" identifier="8" required="true" requiredErrorMessage="Field is required"/>
        <metadata ugh="TitleDocMainLanguage" identifier="8a" required="true" requiredErrorMessage="Field is required" list="/opt/digiverso/goobi/config/plugin_intranda_workflow_excelimport_ISO639-2.txt" listErrorMessage="Content not in list of valid options"/>
        <metadata ugh="Creator" identifier="9" required="true" requiredErrorMessage="Field is required" />
        <metadata ugh="DateOfOrigin" identifier="10" either="11" eitherErrorMessage="Either this or 11 must have content" pattern="(^\\d{4}$|^\\d{4}/\\d{4}$|^\\d{4}-\\d{2}$|^\\d{4}-\\d{2}/\\d{4}-\\d{2}$)" patternErrorMessage="Invalid date Format"/>
        <metadata ugh="DateNote" identifier="11" either="10" eitherErrorMessage="Either this or 10 must have content"/>
        <metadata ugh="DocLanguage" identifier="12" required="true" requiredErrorMessage="Field is required" list="/opt/digiverso/goobi/config/plugin_intranda_workflow_excelimport_ISO639-2.txt" listErrorMessage="Content not in list of valid options"/>
        <metadata ugh="Extent" identifier="13" required="true" requiredErrorMessage="Field is required"/>
        <metadata ugh="MaterialDescription" identifier="14a" required="true" requiredErrorMessage="Field is required" pattern="(^[Aa]udio$|^[Cc]artographic [Mm]aterial$|^[Gg]raphic [Mm]aterial$|^[Pp]hotographic [Ii]mages$|^[Tt]extual [Mm]aterial$|^[Mm]oving [Ii]mages$)" patternErrorMessage="Must be one of Audio\, Cartographic Material\, Graphic Material\, Photographic Images\, Textual Material\, Moving Images"/>
        <metadata ugh="MaterialDescription" identifier="14b" requiredFields="14a" requiredFieldsErrormessage="May only have content if 14a is filled" pattern="(^[Aa]udio$|^[Cc]artographic [Mm]aterial$|^[Gg]raphic [Mm]aterial$|^[Pp]hotographic [Ii]mages$|^[Tt]extual [Mm]aterial$|^[Mm]oving [Ii]mages$)" patternErrorMessage="Must be one of Audio\, Cartographic Material\, Graphic Material\, Photographic Images\, Textual Material\, Moving Images"/>
        <metadata ugh="MaterialDescription" identifier="14c" requiredFields="14a; 14b" requiredFieldsErrormessage="May only have content if 14a and 14b are filled" pattern="(^[Aa]udio$|^[Cc]artographic [Mm]aterial$|^[Gg]raphic [Mm]aterial$|^[Pp]hotographic [Ii]mages$|^[Tt]extual [Mm]aterial$|^[Mm]oving [Ii]mages$)" patternErrorMessage="Must be one of Audio\, Cartographic Material\, Graphic Material\, Photographic Images\, Textual Material\, Moving Images"/>
        <metadata ugh="Condition" identifier="15" required="true" requiredErrorMessage="Field is required" pattern="(^[Gg]ood$|^[Pp]oor$)" patternErrorMessage="Needs to be either good or poor"/>
        <metadata ugh="ScopeContent" identifier="16" required="true" requiredErrorMessage="Field is required" wordcount="400" wordcountErrorMessage="Must have at least 400 words" />
        <metadata ugh="ArchivalHistory" identifier="17" required="true" requiredErrorMessage="Field is required" wordcount="50" wordcountErrorMessage="Must have at least 50 words"/>
        <metadata ugh="AdministrativeHistory" identifier="18" required="true" requiredErrorMessage="Field is required" wordcount="100" wordcountErrorMessage="Must have at least 100 words"/>
        <metadata ugh="AccessLocations" identifier="19a" />
        <metadata ugh="AccessPersons" identifier="19b" />
        <metadata ugh="AccessCorporate" identifier="19c" />
        <metadata ugh="AccessSubject" identifier="19d" />
        <metadata ugh="Arrangement" identifier="20" wordcount="50" wordcountErrorMessage="Must have at least 50 words"/>
        <metadata ugh="AccessRestrictions" identifier="21" wordcount="40" wordcountErrorMessage="Must have at least 40 words"/>
        <metadata ugh="FindingAids" identifier="22" wordcount="40" wordcountErrorMessage="Must have at least 40 words"/>
        <metadata ugh="FindingAidsLink" identifier="22a" pattern="^(https?|ftp|file)://[-a-zA-Z0-9+&amp;@#/%?=~_|!:\,.;]*[-a-zA-Z0-9+&amp;@#/%=~_|]" patternErrorMessage="Must be valid web address"/>
        <metadata ugh="Copies" identifier="23" />
        <metadata ugh="Originals" identifier="24" />
        <metadata ugh="AuthorOfDescription" identifier="25" />
    </config>
</config_plugin>

Overview and functionality

After installing and starting the plugin, it is available within the Workflow menu. After the call, an Excel file can be uploaded here, so that all validation messages are immediately visible. You can then decide whether the processes should still be generated in Goobi despite any validation messages or whether the import should be cancelled.

Configuration

General configuration of the plugin

Some configurations of the plugin generally apply to all records to be imported. These are as follows:

Value
Description

allowed-file-extensions

The entry allowed-file-extensions is used to allow only files of certain formats to be uploaded. It contains a regular expression which is compared with the name of the file to be uploaded, if they do not match, the file cannot be uploaded.

qaStepName

The entry qaStepName is the Name of the quality assurance step in goobi.

publicationType

The entry publicationType is the type of structure Element the Processes generated by this Plugin will contain as the top level.

collection

The entry collection is the name of the Collection the Processes generated by this Plugin will be added to.

rowIdentifier

The entry rowIdentifier determines in which row the plugin will look for text identifying the columns.

rowHeader

The entry rowHeader determines in which row the plugin will look for the name to display with this column.

rowDataStart

The entry rowDataStart tells the plugin which row is the first containing the content to parse.

rowDataEnd

The entry rowDataEnd tells the plugin which row is the last containing content. Rows before this may be empty.

Configuration of the individual metadata fields

For each individual metadata, you can specify how it is to be imported. Each field corresponds to the contents of a cell in an Excel column. The following values are possible.

Value
Description

identifier

this tells the plugin which column this entry refers to (e.g. 1, 2, 4a or 14b)

ugh

this tells the plugin which metadatum the column should be written to (e.g. CatalogIDDigital, Country or Institution). In addition to this one may want the plugin to check the fields in a column for certain characteristics, for this there are more optional values:

required

if this has the value true, the plugin knows an entry in this column is required, and will mark rows where there are none (only the value true has an effect, others are ignored)

wordcount

the plugin checks how many words are in this column and marks rows, where there are fewer than required here (e.g. 50 or 100)

either

if this has the value of another columns identifier (e.g. 1, 2, 4a or 14b) the plugin will check that either of them have content or mark both

requiredFields

similar to either, this is used to tell the plugin another column must have content before there is any allowed in this one, value must be the identifier of another column (e.g. 1, 2, 4a or 14b)

list

this entry is used to ensure that this metadatum is from a list of allowed words, the value needs to be the location of a textfile containing the allowed words, rows with content not on the list will be marked(e.g. 1, 2, 4a or 14b)

pattern

this entry allows to use regular expressions to compare a metadatum with, the value is the regular expression the contents of this column is compared with, rows not matching will be marked (e.g. ^\\d{4}$ to check that the metadatum consists of exactly 4 digits)

Attention: Please note that columns which do not exist in the excel file (e.g. 14d) are ignored by the plugin but have to be configured here.

PreviousEntity Editor - Artist DictionaryNextProcess creation through file upload

Last updated 9 months ago

https://github.com/intranda/goobi-plugin-workflow-excel-import