
File system

As a workflow management application for the library environment, Goobi has to be able to deal with a wide range of specific configurations and project-specific requirements. To this end, it has been designed in line with established conventions. These cover individual directory structures and the way Goobi uses these structures in different areas of the application. This section outlines the directory structures that have proven most effective and explains how external storage is integrated into the system.

Global directory structure

As a web-based application, Goobi has its own structure and is located on a defined path in the file system independently of the servlet container being used. This section explains how to organise the directory structures within which Goobi saves its data and the different configuration files.

The base path for all digitisation software in the Goobi environment is:

/opt/digiverso/

The following directories are usually located on this base path:

/opt/digiverso/goobi/
/opt/digiverso/logs/
/opt/digiverso/itm/
/opt/digiverso/viewer/

The logs directory is the main directory for log files. Provided the system is configured accordingly, Goobi log files are stored here as well. The other directories listed above belong to frequently used applications: viewer for the Goobi viewer, itm for the intranda Task Manager and goobi for Goobi itself.

The base path for Goobi is:

/opt/digiverso/goobi/

In most cases, this base path will accommodate the following folder structure (see below for details of each sub-directory):

/opt/digiverso/goobi/config/
/opt/digiverso/goobi/import/
/opt/digiverso/goobi/metadata/
/opt/digiverso/goobi/plugins/
/opt/digiverso/goobi/rulesets/
/opt/digiverso/goobi/scripts/
/opt/digiverso/goobi/xslt/
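
As a rough illustration, the layout above can be checked with a short script. This is a minimal sketch, not part of Goobi: the function name missing_subdirs and the constant EXPECTED_SUBDIRS are hypothetical, and the list of sub-directories is simply the one given above.

```python
from pathlib import Path

# Sub-directories listed above for the Goobi base path.
EXPECTED_SUBDIRS = (
    "config", "import", "metadata", "plugins", "rulesets", "scripts", "xslt",
)


def missing_subdirs(base="/opt/digiverso/goobi"):
    """Return the expected sub-directory names that do not exist under base."""
    base_path = Path(base)
    return [name for name in EXPECTED_SUBDIRS if not (base_path / name).is_dir()]
```

Calling missing_subdirs() on a server returns an empty list if every directory is present. Because the structure may differ between installations, a non-empty result is not necessarily an error.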

‘config’ sub-directory

The config directory contains all the Goobi configuration files that do not have to be located within the application itself. These are listed below:

config_contentServer.xml
goobi_activemq.xml
goobi_config.properties
goobi_digitalCollections.xml
goobi_exportXml.xml
goobi_mail.xml
goobi_metadataDisplayRules.xml
goobi_normdata.xml
goobi_opac.xml
goobi_opacUmlaut.txt
goobi_processProperties.xml
goobi_projects.xml
goobi_rest.xml
goobi_webapi.xml
messages_de.properties
messages_en.properties

Depending on the specific installation, the config directory may also contain other configuration files in addition to those related to the application’s core components. Accordingly, we recommend that you also use this central configuration directory to store configurations for individual plug-ins that provide additional functionality.

plugin_abc.xml
plugin_xyz.xml

For subsequent ease of maintenance, the paths and file names relating to the configuration of any new Goobi plug-ins that may be developed should also adhere to this convention.

‘import’ sub-directory

Depending on the way Goobi has been installed, the import directory will contain a range of data, mostly on a temporary basis. By way of example, import plug-ins use this directory to enter metadata and associated digital content in order to create processes. The respective import plug-ins are also responsible for deleting files that are no longer needed.

‘metadata’ sub-directory

The metadata sub-directory is the central directory for storing metadata and digital content generated by Goobi. For each Goobi process, it contains a directory with the name of the process ID. Directories for individual Goobi processes are structured as follows:

1234/
1234/meta.xml
1234/images/
1234/ocr/
1234/import/
1234/validation/
1234/taskmanager/
1234/thumbs/

Depending on the configuration, the central metadata file meta.xml may be accompanied by other back-up files, e.g. meta.xml.1, meta.xml.2, meta.xml.3.
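
A script that needs to inspect these back-up files could collect them as sketched below. This is only an illustration: the helper name list_meta_backups is hypothetical, and the pattern assumes that back-ups always follow the meta.xml.<number> naming shown in the example.

```python
import re
from pathlib import Path

# Matches meta.xml.1, meta.xml.2, ... (assumed naming, taken from the example).
BACKUP_PATTERN = re.compile(r"^meta\.xml\.(\d+)$")


def list_meta_backups(process_dir):
    """Return the meta.xml back-up files in process_dir, ordered by numeric suffix."""
    backups = []
    for entry in Path(process_dir).iterdir():
        match = BACKUP_PATTERN.match(entry.name)
        if match and entry.is_file():
            backups.append((int(match.group(1)), entry))
    # Sort numerically so that meta.xml.10 comes after meta.xml.2.
    return [path for _, path in sorted(backups, key=lambda item: item[0])]
```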

Images

Within the workflow, the images directory is accessible for a limited period to various users. Its structure is shown below:

1234/images/abc_media/
1234/images/abc_jpg/
1234/images/abc_source/
1234/images/master_abc_media/

When you are working with digital content, the most important directory is the one ending in _media. The directory beginning with master_ is normally used to store all master images in their unmodified state. The other directories can be used freely within the workflow, and further directories can be added whenever necessary. Both the directory ending in _source and the directory ending in _media are copied when exporting to the presentation system (e.g. intranda viewer).

OCR

The images directory may be accompanied by an ocr directory. This contains all the OCR results that are generated within the workflow and added to the process. There is a separate directory for each format of the OCR results.

1234/ocr/abc_alto/
1234/ocr/abc_doc/
1234/ocr/abc_pdf/
1234/ocr/abc_wc/
1234/ocr/abc_xml/

Thumbnails

Smaller versions of the images in images can be saved in the thumbs folder, which Goobi uses to display the images in low resolution. This considerably increases the speed of image display for larger images. For each subfolder of images, one or more subfolders can be created in thumbs with the same name as the images subfolder, extended by an additional underscore _ and a size specification in pixels. This size specification must correspond to the maximum height and width of the images in the respective subfolder. The file names of the images in the thumbs subfolder must correspond to those of the images in the corresponding images subfolder, but with the file extension .jpg.

1234/thumbs/abc_media_800/
1234/thumbs/abc_media_3000/
1234/thumbs/master_abc_media_800/
1234/thumbs/master_abc_media_3000/

If there are matching images in thumbs for an image file in images, these are automatically used in Goobi to display thumbnails and zoomable images when zoomed out.
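
The naming convention for thumbnails can be expressed as a small helper. The following is a sketch under the rules stated above; the function name thumbnail_path and the file name 00000001.tif used in the usage note are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def thumbnail_path(process_dir, image_subfolder, image_name, size):
    """Build the thumbs path that corresponds to an image under images/.

    The thumbs subfolder is the images subfolder name plus "_" and the
    pixel size; the file keeps its base name but takes the .jpg extension.
    """
    folder = f"{image_subfolder}_{size}"
    file_name = PurePosixPath(image_name).with_suffix(".jpg").name
    return PurePosixPath(process_dir) / "thumbs" / folder / file_name
```

For example, thumbnail_path("1234", "abc_media", "00000001.tif", 800) yields 1234/thumbs/abc_media_800/00000001.jpg.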

Validation

The validation directory is used in cases where automatic validation (e.g. of the images) is performed on the Goobi server and in the workflows.

A sub-folder is created within this directory each time validation is performed. This makes it possible to retain older validation results. As illustrated below, the name generated for each sub-folder contains the date, time and type of validation.

1234/validation/2012-11-20_11-20-01_jpylyzer/
1234/validation/2012-11-20_12-02-13_jhove/
1234/validation/2012-11-23_08-12-56_jpylyzer/
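
The sub-folder names above follow the pattern date, time and type, joined by underscores. A minimal sketch of how such names could be built and read back is given here; both function names are hypothetical and the pattern is inferred from the examples.

```python
from datetime import datetime

# Timestamp layout inferred from the examples above, e.g. 2012-11-20_11-20-01.
TIMESTAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"


def validation_folder_name(when, validation_type):
    """Build a sub-folder name such as 2012-11-20_11-20-01_jpylyzer."""
    return f"{when.strftime(TIMESTAMP_FORMAT)}_{validation_type}"


def parse_validation_folder_name(name):
    """Split a sub-folder name into its timestamp and validation type."""
    # Only the first two underscores separate date, time and type;
    # the type itself may contain further underscores.
    date_part, time_part, validation_type = name.split("_", 2)
    when = datetime.strptime(f"{date_part}_{time_part}", TIMESTAMP_FORMAT)
    return when, validation_type
```

Because the timestamp comes first, sorting the folder names alphabetically also sorts them chronologically.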

If Goobi is being used with the intranda TaskManager, a taskmanager directory will also be found within the process folder. This is where the TaskManager stores temporary data for tasks that require lengthy processing. Depending on the configuration, it is also used to permanently store and maintain all the ticket and template files created each time the TaskManager is called. The directory is structured as follows:

1234/taskmanager/2012-11-23_08-12-56_jp2validate/
1234/taskmanager/2012-11-25_14-38-15_iii-create_jpeg/

Import

Again depending on the installation, each Goobi process has an import directory, which import plug-ins use to store the original source files for the process in question. Catalogue dataset files and other source files that have been read in and imported manually can also be stored here and used in scripts as part of workflow processing. The folder structure could be as follows:

1234/import/abc.mrc
1234/import/abc.original.pdf
1234/import/eod/

‘plugins’ sub-directory

Depending on the way Goobi has been installed, the plugins directory may contain a number of plug-ins that perform imports or call Web API commands. Depending on the task, the compiled plug-ins are located in one of the directories shown below:

/opt/digiverso/goobi/plugins/administration/
/opt/digiverso/goobi/plugins/command/
/opt/digiverso/goobi/plugins/dashboard/
/opt/digiverso/goobi/plugins/export/
/opt/digiverso/goobi/plugins/GUI/
/opt/digiverso/goobi/plugins/import/
/opt/digiverso/goobi/plugins/opac/
/opt/digiverso/goobi/plugins/statistics/
/opt/digiverso/goobi/plugins/step/
/opt/digiverso/goobi/plugins/validation/

‘rulesets’ sub-directory

Within Goobi, the UGH class library is used to process metadata, map PICA imports and generate METS files. In order to manage the huge variety of configuration options, UGH uses a mechanism known as rulesets. The rulesets directory is the central storage location for these rulesets. It allows you to make individual configurations available for different projects and types of publication.

‘scripts’ sub-directory

A range of scripts can be made available centrally in the scripts directory. These scripts can be used within the workflow to automate certain tasks.

‘xslt’ sub-directory

Goobi uses a mechanism called XSLT transformation to generate dockets as PDF files. This involves generating PDF documents from existing XML files on the basis of XSLT files located centrally in the xslt directory.

Directory structure of the application

The Goobi installation path may vary from one installation to another. On an Ubuntu Linux system running an Apache Tomcat servlet container, the base path for web applications is typically:

/var/lib/tomcat7/webapps

Accordingly, the Goobi application is located on the following path within the file system:

/var/lib/tomcat7/webapps/goobi/

Integrating external storage

General

Most digitisation projects involve handling very large volumes of data. In most cases, this makes it necessary to link external storage capacity to the server. This can be done in a number of ways. We recommend that the external storage is linked to the following folder in the directory tree:

/opt/digiverso/

This means that all Goobi data can be found in a central location.

Two solutions for integrating external storage are outlined below. We do not recommend linking via CIFS as this can affect performance and functionality. Furthermore, CIFS does not allow you to create symbolic links or set read-only permissions.

The following information is required if you wish to integrate external storage via an NFS share:

• exporting server
• exporting directory

You can then add the storage to the directory tree via NFS. It is a good idea to add an entry to the file /etc/fstab so that the link is set up automatically when the system starts. This entry could be as follows:

example.net:/path/to/share /opt/digiverso nfs vers=3,rsize=8192,wsize=8192,soft,intr,rw,auto 0 0

Logical volume in the virtual machine

Another way of integrating external storage is to attach it to the virtual machine as an independent device. The devices can be iSCSI targets or SAN LUNs. They are subsequently combined into a logical volume in the virtual machine using LVM. The result is an aggregated storage unit based on a number of devices.

Integration of S3 as storage

Goobi workflow can be operated with S3-compatible storage. Note that a local file system is still required to store the metadata: the files meta.xml and meta_anchor.xml, together with their backups, exist for each process and continue to be stored in the file system. All other data, such as images and OCR results, is stored in S3.

To run Goobi with S3 as storage, the following two settings must be set within the configuration file goobi_config.properties:

# global config if s3 should be used
useS3=true

# the bucket that is used for the content that would normally live in /opt/digiverso/goobi/metadata/
S3bucket=workflow-data

Goobi workflow uses the AWS Java SDK internally. This means that the credentials for accessing the storage system are read either from $HOME/.aws or from environment variables. If an S3 provider other than AWS is to be used, the connection can be configured in more detail. This requires a few additional settings within the same configuration file:

S3AccessKeyID=keyID
S3SecretAccessKey=secretkey
S3Endpoint=http://s3.mygoobi.tld
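
Before restarting Goobi, it can be useful to check that the S3 settings are consistent. The sketch below is an assumption-laden illustration rather than Goobi code: parse_properties and s3_problems are hypothetical helpers, and the parser only handles plain key=value lines and # comments, not the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values such as URLs stay intact.
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


def s3_problems(settings):
    """Return a list of problems found in the S3-related settings."""
    problems = []
    if settings.get("useS3", "").lower() != "true":
        return problems  # S3 is not enabled, so there is nothing to check.
    if not settings.get("S3bucket"):
        problems.append("useS3 is true but S3bucket is not set")
    return problems
```

The file content can be read with open(...).read() and passed to parse_properties; an empty list from s3_problems means that both required settings are present.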

Using S3 as a storage system should in principle work with any S3-compatible API. Both Amazon S3 and MinIO were used during development of the S3 functionality.