Legacy data import for the Austrian Federal Monuments Authority
Import plugin for importing legacy data for the Federal Monuments Office in Austria
Overview
Introduction
This documentation describes the installation, configuration and use of the plugin for the mass import of existing old data of the Federal Monuments Office in Austria. The starting point for the import are existing Excel files as well as provided directories with image files. The special structure of the Excel file made a significant revision of the standard Excel import plugin necessary, so that this plugin differs significantly from it.
Installation
To be able to use the plugin, the following files must be installed:
To use the import, the mass import area must be opened in the production templates and the plugin intranda_import_bka_bda must be selected in the File Upload Import tab. An Excel file can then be uploaded and imported.
The import is then carried out line by line. A new process is created for each object and the configured rules are applied. If a valid data record has been created and the generated process title has not yet been assigned, the process is actually created and saved. Within the Excel file, subsequent lines belonging to the Goobi process to be generated are created with the desired structure type depending on the configuration. Associated images are also automatically transferred and assigned to the generated structural elements and processes.
Configuration
The configuration is done via the file plugin_intranda_import_bka_bda.xml. This file can be adapted during operation.
<config_plugin> <config><!-- which workflow template shall be used --> <template>BDA_Bildarchiv</template><!-- publication type to create --> <publicationType>Kleinbilddia</publicationType> <imageType>Dia</imageType><!-- which digital collection to use --> <collection>Dias</collection> <processTitleColumn>Buchtitel</processTitleColumn><!-- define in which row the header is written, usually 1 --> <rowHeader>1</rowHeader><!-- define in which row the data starts, usually 2 --> <rowDataStart>2</rowDataStart><!-- define in which row the data ends, usually 20000 --> <rowDataEnd>10000</rowDataEnd><!-- define if import shall use GoobiScript to run in the background --> <runAsGoobiScript>true</runAsGoobiScript><!-- column for the path to the images --> <imageFolderHeaderName>filename</imageFolderHeaderName><!-- s3 data --> <s3use="true"> <endpoint>http://127.0.0.1:9000</endpoint> <bucketName>workflow-upload-testing</bucketName> <accessKey>minioadmin</accessKey> <accessSecret>minioadmin</accessSecret> <prefix>prefix/</prefix> </s3><!-- individual metadata fields read from the excel columns --> <mainMetadatarulesetName="CatalogIDDigital"columnName="scanid" /> <mainMetadatarulesetName="InternalNumber"columnName="interne_nummerierung" /> <mainMetadatarulesetName="BookTitle"columnName="Buchtitel" /> <mainMetadatarulesetName="BookID"columnName="BuchId" /> <mainMetadatarulesetName="InventoryNumber"columnName="inventarnummer" /> <mainMetadatarulesetName="Country"columnName="staat" /> <mainMetadatarulesetName="ZIPCode"columnName="plz" /> <mainMetadatarulesetName="FederalState"columnName="bundesland" /> <mainMetadatarulesetName="PoliticalCommunity"columnName="gemeindename" /> <mainMetadatarulesetName="KGNr"columnName="gemeindekz" /> <mainMetadatarulesetName="Community"columnName="ortschaftsname" /> <mainMetadatarulesetName="PGNr"columnName="ortskz" /> <mainMetadatarulesetName="ObjectDescription"columnName="objekt" /> <mainMetadatarulesetName="TitleDocMain"columnName="titel" /> <mainMetadatarulesetName="Photographer"columnName="fotograf" /> <mainMetadatarulesetName="Photographer"columnName="zweiter_fotograf" /> <mainMetadatarulesetName="PublicationYear"columnName="Aufnahmejahr" /> <mainMetadatarulesetName="Series"columnName="verweise_entstehung" /> <mainMetadatarulesetName="SubSeries"columnName="verweise_serien" /> <mainMetadatarulesetName="Remarks"columnName="anmerkungen" /> <mainMetadatarulesetName="BookPage"columnName="buchseite" /> <mainMetadatarulesetName="Label"columnName="objekttraegerbeschriftung" /> <mainMetadatarulesetName="BoxNumber"columnName="Ablageort_Box" /> <mainMetadatarulesetName="RegisterNumber"columnName="Ablageort_Register" /> <mainMetadatarulesetName="Ressort"columnName="ressort" /> <mainMetadatarulesetName="area"columnName="bereich" /> <mainMetadatarulesetName="PhysicalLocation"columnName="standort" /> <mainMetadatarulesetName="BDAArea"columnName="projekt" /> <mainMetadatarulesetName="DateRecorded"columnName="created" /> <mainMetadatarulesetName="DateModified"columnName="lastModified" /> <imageMetadatarulesetName="ObjectType"columnName="type" /> <imageMetadatarulesetName="ImportPath"columnName="filename" /> <imageMetadatarulesetName="filesize"columnName="filesize" /> <imageMetadatarulesetName="width"columnName="width" /> <imageMetadatarulesetName="height"columnName="height" /> <imageMetadatarulesetName="verticalresolution"columnName="verticalresolution" /> <imageMetadatarulesetName="horizontalresolution"columnName="horizontalresolution" /> </config> <config><!-- which workflow template shall be used --> <template>*</template><!-- publication type to create --> <publicationType>Document</publicationType> <imageType>Photograph</imageType><!-- which digital collection to use <collection>General</collection> --> <processTitleColumn>Vorgang</processTitleColumn><!-- define in which row the header is written, usually 1 --> <rowHeader>1</rowHeader><!-- define in which row the data starts, usually 2 --> <rowDataStart>2</rowDataStart><!-- define in which row the data ends, usually 20000 --> <rowDataEnd>10000</rowDataEnd><!-- define if import shall use GoobiScript to run in the background --> <runAsGoobiScript>true</runAsGoobiScript><!-- column for the path to the images --> <imageFolderHeaderName>Importpfad Bildobjekt</imageFolderHeaderName><!-- individual metadata fields read from the excel columns --> <mainMetadatarulesetName="CatalogIDDigital"columnName="DMDB-ID" /> <mainMetadatarulesetName="HerisID"columnName="HERIS-ID" /> <mainMetadatarulesetName="FederalState"columnName="Bundesland" /> <mainMetadatarulesetName="Location"columnName="Ort" /> <mainMetadatarulesetName="Address"columnName="Adresse" /> <mainMetadatarulesetName="TitleDocMain"columnName="Objekttitel" /> <mainMetadatarulesetName="Photographer"columnName="FotografIn" /> <mainMetadatarulesetName="DateRecorded"columnName="Aufnahmejahr" /> <mainMetadatarulesetName="DateRecordedInformation"columnName="Aufnahmedatum Zusatzinfo" /> <mainMetadatarulesetName="KeyNumber"columnName="GZ" /> <mainMetadatarulesetName="Condition"columnName="Erhaltungszustand des Objekts" /> <mainMetadatarulesetName="Remarks"columnName="Sonstiges/Anmerkungen" /> <mainMetadatarulesetName="ObjectType"columnName="Objekttyp" /> <mainMetadatarulesetName="Series"columnName="Serie" /> <mainMetadatarulesetName="SubSeries"columnName="Unterserie" /> <imageMetadatarulesetName="View"columnName="Ansicht" /> <imageMetadatarulesetName="shelfmarksource"columnName="Signatur" /> <imageMetadatarulesetName="FormerFilePath"columnName="Ursprünglicher Dateipfad" /> <imageMetadatarulesetName="Copyright"columnName="Rechteinhaber" /> <imageMetadatarulesetName="ImportPath"columnName="Importpfad Bildobjekt" /> </config></config_plugin>
Individual configurability
It is possible to create a global configuration for all production templates as well as individual settings for single production templates. To do this, the config element can be repeated in the XML file. If mass import is selected in Goobi, the system searches for the configuration block that contains the name of the selected production template in the template element. If such an entry does not exist, the default configuration is used. This is indicated by *.
<!-- which workflow template shall be used --><template>*</template>
Publication type
The following parameter can be used to globally define the publication type to be used:
<!-- publication type to create --><publicationType>Monograph</publicationType>
Every process that is created in Goobi with this plugin receives the application type defined here.
Structure types
The special feature of this plugin is that structural elements are to be generated from the partially repeating Excel table rows, which are to be created as sub-elements for the previously created publication type. The type to be used for this is specified with this parameter:
<imageType>Photograph</imageType>
Collection
With the optional element collection it is possible to define a collection to be inserted in all records. In addition, collections can also be selected from the interface, or the collection can be imported as part of the Excel file or from the catalogue.
<!-- which digital collection to use --><collection>Example collection</collection>
Row range
The following elements describe the structure of the Excel file to be imported.
In rowHeader it is defined in which row the column headers are entered that are later relevant for the mapping. Usually this is the first line. However, this can also deviate in the case of multi-line entries.
<!-- define in which row the header is written, usually 1 --><rowHeader>1</rowHeader>
The elements rowDataStart and rowDataEnd describe the area that contains the data. Usually these are the lines that directly follow the rowHeader, but in the case of special formatting there may also be empty lines that can be removed via this.
<!-- define in which row the data starts, usually 2 --><rowDataStart>2</rowDataStart><!-- define in which row the data ends, usually 20000 --><rowDataEnd>20000</rowDataEnd>
Identifier
The entry identifierHeaderName contains the heading of the column that contains an identifier. This field is used internally to identify the rows. In an OPAC query, this value is used. In addition, this value is also used for generating the case title if no other generation for case titles has been specified.
<!-- define which column is the one to use for catalogue requests and to identify the row during the import --><identifierHeaderName>Identifier</identifierHeaderName>
Process title
The processTitleRule element is used to generate the process title. The same options are available here that can be used in the Goobi configuration file goobi_projects.xml.
<!-- Rules to generate the process title, the same syntax as in goobi_projects.xml can be used. Use the column names to get the right metadata values. If the field is missing or empty, the value of the identifier column is used.--><processTitleRule>'StaticPrefix_'+Identifier</processTitleRule>
Importing images
With the help of the elements imageFolderHeaderName, imageFolderPath and moveImages, images can be imported in addition to the metadata. In imageFolderHeaderName the column name is entered for this purpose, in which the folder names containing the images can be found in the Excel file. Either an absolute path or a relative path can be entered. If a relative path is specified, the element imageFolderPath must contain the root path to the images.
The element moveImages can be used to control whether the images are to be copied or moved.
<!-- define which column contains the image folder name. Can be combined with <imageFolderPath> prefix or an absolute path.
If the field is missing, empty or does not contain an existing directory, no images will be imported --><imageFolderHeaderName>image folder</imageFolderHeaderName><!-- prefix path to the image folder. Can be empty or missing if the import doesn't contain images or if the excel field contains absolute path -->
<imageFolderPath>/mnt/images/</imageFolderPath><!-- defines, if images are moved from the source folder to the destination (true) or copied (false) --><moveImages>true</moveImages>
Importing images from an S3 storage
To import images from an S3 storage, the <imageFolderHeaderName> parameter described above must also be set. The other two elements when importing images relate to file system operations and are therefore not necessary. The following area is used instead:
The element runAsGoobiScript controls whether an import should be processed asynchronously in the background via the GoobiScript queue or whether the import should be processed directly within the user session. Here you have to decide which setting makes sense. If an import is to include images or if the Excel file contains a large number of data records, it is probably more sensible to perform this import as a GoobiScript.
<!-- Run the import as GoobiScript --><runAsGoobiScript>true</runAsGoobiScript>
Attention: If the column identifierHeaderName does not contain a unique identifier or has not been configured, the option runAsGoobiScript cannot be used.
Configuration of the individual Excel columns
The fields metadata, person and group can be used to import individual columns as metadata or as transaction properties. For this purpose, each field contains a number of attributes and sub-elements.
Importing metadata
The element metadata is used to create descriptive metadata.
The attribute headerName contains the column title. The rule only applies if the Excel file contains a column with this title and the cell is not empty. At least one of the two attributes ugh and name must exist. The field ugh can contain the name of a metadatum. If this is the case (and the metadatum is allowed for the configured publication type), a new metadatum is created. A property with this name is created using name.
The attribute docType becomes relevant if a multi-volume work or a journal has been imported from the catalogue. It can be used to control whether the field should belong to the complete record or to the volume.
If, in addition to the content, another column with standard data identifiers or URIs exists, this column can be added in the attribute normdataHeaderName.