2.6 Metadata

Each metadata field to be used in the Goobi viewer must be configured in the Goobi viewer Indexer in the fields configuration element. The following schema is used for this:

<fields>
    <PI>
        <list>
            <item>
                <xpath>
                    <list>
                        <item prefix="bibprefix_" suffix="_bibsuffix">mets:xmlData/mods:mods/mods:identifier [@type="ppn" or @type="PPN"]</item>
                        <item>lido:administrativeMetadata/lido:recordWrap/lido:recordID</item>
                    </list>
                </xpath>
                <getnode>first</getnode>
                <addToDefault>true</addToDefault>
                <addUntokenizedVersion>false</addUntokenizedVersion>
            </item>
        </list>
    </PI>
</fields>

The uppermost element has the name of the Solr field into which the configured metadata is to be written (here: PI).

Field configurations whose name does not begin with MD_ im are mostly mandatory fields for the Goobi viewer and must not be removed! Basic knowledge of XPath is required for configuring Solr metadata fields. Further details can be found at the following address: http://de.wikipedia.org/wiki/XPATH

Changes in the <fields> section are applied directly without a restart.

item

These elements contain XPath expressions under which the respective metadata can be found. All specified expressions are checked so that a metadata field (if allowed for multiple values) can contain multiple values from different XPath expressions. In the example above, one expression is defined for METS/MODS and one for LIDO, which means that this field configuration is valid for both formats.

The prefix and suffix attributes can optionally each contain a static prefix or suffix that is appended to each value found using the respective XPath expression.

For METS documents, the expression can either be formulated relative to the element, for example:

mets.xml
<mets:mets>
    <mets:mdWrap MDTYPE=”MODS”>
    [...]
    </mets:mdWrap>
</mets:mets>
<item>mets:xmlData/mods:mods/mods:identifier[@type="ppn" or @type="PPN"]</item>

or relative to the top element:

mets.xml
<mets:mets>
    [...]
</mets:mets>
<xpath>mets:amdSec[@ID="AMD"]/mets:rightsMD[@ID="RIGHTS"]/mets:mdWrap[@MDTYPE='OTHER' and @MIMETYPE='text/xml' and @OTHERMDTYPE='DVRIGHTS']/mets:xmlData/dv:rights/dv:ownerContact</xpath>

The variant does not have to be specified explicitly - the Goobi viewer indexer checks all configured expressions relative to both nodes.

LIDO formulates XPath expressions for general metadata relative to the top element, for example:

lido.xml
<lido:lido>
    [...]
</lido:lido>
<item>lido:administrativeMetadata/lido:recordWrap/lido:recordID</item>

Metadata of LIDO events use XPath expressions relative to the respective event element, for example:

lido.xml
<lido:lido>
     <lido:eventWrap>
          <lido:eventSet>
               <lido:event>
                      ...
               </lido:event>
          </lido:eventSet>
     </lido:eventWrap>
</lido:lido>
<item>lido:eventDate/lido:date/lido:earliestDate</item>

addSortField

If true, a second field is written in addition to the configured metadata field, which can be used to sort search hits. The second field has the prefix SORT_ in its name. This is particularly useful for fields whose names begin with the prefix MD_ (MD_ is replaced by SORT_). These are always configured in the Solr schema so that they can contain several values as a list - such fields cannot be used for sorting in Solr. Note that a sort field does not contain all values found, but only the first one (for example, the first author of a record). The default value is false.

addSortFieldToTopstruct

Special setting for metadata fields from LIDO events. If search hits are to be sorted by a metadata field from a LIDO event, the setting true writes a sort field from this metadata field into the document of the main element. The default value is false.

addToDefault

If true, the value of the configured metadata field is appended to the DEFAULT field. The latter contains all tokens that can be searched for in the Goobi viewer. If a specific metadata is to be searchable, this option must be activated in the corresponding field configuration. The default value is false.

addUntokenizedVersion

Metadata fields with the prefix MD_ are always split into individual tokens (roughly speaking, individual words in a character string). To be able to browse a field (see Browsing), the entire character string must be present as a token (otherwise, for example, you cannot browse main titles, but only individual words from main titles). For this purpose, a copy can be created in addition to the metadata field, whose value exists as a token (this receives the suffix _UNTOKENIZED). Such fields can be configured for the Browse function. The default value is true.

groupEntity

Individuals and entities may contain additional metadata, such as life data or external links. This metadata may be grouped to make it clear that they belong together.

For this purpose, there is a possibility to write related metadata, e.g. of persons and corporations, in a separate Solr Index document. These can then be parsed by the Goobi viewer and graphically processed accordingly.

If this configuration element is set to true, the following MODS structure is assumed for the relevant metadata field:

mods.xml
<mods:name type="personal" xlink:href=http://de.wikipedia.org/wiki/Eike_von_Repgow  valueURI="http://d-nb.info/gnd/118529501">

      <mods:namePart type=”family”>&lt;von Repgow&gt;</mods:namePart>
      <mods:namePart type=”given”>Eike</mods:namePart>
      <mods:namePart type="date">1180 - 1233</mods:namePart>

      <mods:role>
            <mods:roleTerm authority="marcrelator" type="code">aut</mods:roleTerm>
      </mods:role>

      <mods:displayForm>&lt;von Repgow&gt;, Eike</mods:displayForm>
</mods:name>

The configuration is done in the configuration element . This can optionally contain the type attribute (permissible values: PERSON, CORPORATION, LOCATION, SUBJECT, ORIGININFO, CITATION, OTHER), which under certain circumstances can trigger a special treatment of the metadata in the Goobi viewer (for example, displaying a person icon for search hits based on a person metadata).

<MD_AUTHOR>
    <list>
        <item>
            <xpath>
                <list>
                    <item>mets:xmlData/mods:mods/mods:name[@type="personal"][mods:role/mods:roleTerm="aut"[@authority='marcrelator'][@type='code']]</item>
                </list>
            </xpath>
            <groupEntity type="PERSON" addAuthorityDataToDocstruct="false" addCoordsToDocstruct="false">
                <field name="MD_VALUE" addSortField="true">mods:displayForm</field>
                <field name="MD_LINK">@xlink:href</field>
                <field name="MD_CORPORATION">mods:namePart[not(@type)]</field>
                <field name="MD_LASTNAME" defaultValue="Doe">mods:namePart[@type="family"]</field>
                <field name="MD_FIRSTNAME" defaultValue="John" multivalued="false">mods:namePart[@type="given"]</field>
                <field name="MD_LIFEPERIOD">mods:namePart[@type="date"]</field>
                <field name="NORM_URI">@valueURI</field>
            </groupEntity>
        </item>
    </list>
</MD_AUTHOR>

The optional attribute addAuthorityDataToDocstruct causes with the setting true that certain authority data values (from external databases) are not stored in the corresponding page-doc but in the structural element doc to which the respective page belongs. In this way it can be avoided, if desired, that authority data values which contain entered search terms are displayed as sub-hits (optical preference). Default value is false.

The addCoordsToDocstruct attribute works similarly. This writes optional searchable geo-coordinates such as the WKT_* or NORM_COORDS_* fields into the associated structure element doc instead of the page doc. Default value is false.

The individual metadata fields within the index document are configured in <field> elements. The attribute name describes the desired name of the index field within the document, the text value of the <field> element contains the XPath expression of the respective metadata.

The optional addSortField attribute causes an additional SORT_* field to be generated for the field. This can be used to sort multiple metadata values within a grouped metadata. Default is false.

With the optional attribute multivalued=false only the first found value of the respective field can be transferred to the index. The default value is true.

The optional attribute defaultValue can be used to write a default value if the XPath expression to which the attribute belongs does not return a value. To avoid that the default value is erroneously written in addition to the value from the XPath expression (e.g. if a LIDO-XPath expression does not return a value when indexing METS), it is best to configure separatem <item> elements for such constellations.

The XPath expressions are defined relative to the element found by the XPath expression in an <item> element. In the example above, the XPath expressions in elements are relative to <mods:name>.

Any links are possible. The display mode with the Wikipedia symbol in the above example was configured in messages.properties (see Metadata Configuration). For technical reasons, however, only one display type can be configured per Solr field.

The MD_VALUE subfield contains the "main value" of the metadata (such as the names of persons, entities, and keywords). It is mandatory for the grouped metadata to be included in the index.

If the configuration element is missing, a simple string with the name of the person or body is expected in mods:displayForm.

If type="CITATION" is configured, a URL attribute can be configured to retrieve field contents from an external source. In the example below, the value from MD_VALUE is used instead of the placeholder in the URL to retrieve the record. The XPath expressions of the other subfields are used to retrieve values from the Primo document.

<groupEntity type="CITATION" url="https://example.com/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs/xml/L/ALEPH_EXMPL01${MD_VALUE}?vid=BIBNET\&amp;inst=BIBNET\&amp;showPnx=true">
    <field name="MD_VALUE">lido:objectID[@lido:source="my authoritah"]</field>
    <field name="MD_CITATION_RECORDID">pnx:control/pnx:recordid</field>
    <field name="MD_CITATION_TITLE">pnx:display/pnx:title</field>
    <field name="MD_CITATION_CONTRIBUTOR">pnx:display/pnx:contributor</field>
    <field name="MD_CITATION_PUBLISHER">pnx:display/pnx:publisher</field>
</groupEntity>

getnode

If the value is set to first, the system aborts after the first value found for the current XPath expression. Note that several XPath expressions can be defined for a Solr field, for which values can also exist.

getchildren

If the value is set to all, values for this metadata are copied from all directly subordinate structure items.

getparents

If the value is set to first, values for this metadata are also copied from the directly superior force element. In all cases, values for this metadata are copied from all higher-level force elements up to the highest level.

lowercase

If this option is set to true, all uppercase letters in the string are replaced by lowercase letters. The default value is false. Example:

"Book Print" ==> "book print"

onefield

If true, all values found will be written to the Solr field as one string. The values are separated by the string "; ". If false, a new string is written for each value. In the latter case, the relevant field in the Solr schema must be configured so that lists of several values are permitted. The default value is false.

With the optional attribute separator, the default separator ; can be replaced by a self-defined one:

<onefield separator="#SPACE#">true</onefield>

onetoken / splittingCharacter

Values (onetoken): true|false values
(splittingCharacter): any character string 

If the configuration element onetoken is set to true, the values are processed in such a way that Solr cannot split them into several tokens. In concrete terms, this means that non-alphanumeric characters are removed. The default value is false. Example

"Book Print, Hello World" ==> "BookPrintHelloWorld". 

If any character string in the element splittingCharacter is additionally configured for this field, all occurrences of this character string are replaced by a dot (.). Example:

<splittingCharacter>#<splittingCharacter>
"Book#Print, Hello World"==> "Book.PrintHelloWorld"

These configuration elements exist specifically for handling collection names in the Goobi viewer (field DC) and do not add value to other metadata fields.

normalizeYear

This configuration element is intended for metadata fields that contain dates or years. If this element is set to 'true', the Goobi viewer Indexer attempts to extract a normalized year (yyyy) from the value of this field. Currently, pure years and dates are recognized according to the patterns dd.MM.yyyy and yyyy-MM-dd. All recognized years are written into the field YEAR , YEARMONTH and YEARMONTHDAY. In addition, the respective century is calculated from these years and written into the CENTURY field.

Both fields can be used, for example, for faceting search hits. However, since these are fields with several possible values per field, they cannot be used for sorting search hits. The default value is false.

Optionally the attribute <normalizeYear minYearDigits="n"> can be configured. The value n indicates the number of digits above which free-standing numbers in a character string are to be interpreted as years. Minimum value is 1. Default value is 3.

replace

This configuration element replaces individual ASCII characters or character strings with another character string. A character to be replaced is specified in the char attribute as an ASCII numeric code. Alternatively, a character string can be specified in the attribute string or a regular expression in the attribute regex. The character string to be written instead is specified within the element.

Example 1:

<replace char="10">&lt;br /&gt;</replace>

In the above example, an ASCII character for the line break (10) is replaced by the HTML character string for the line break.

Eine Liste der ASCII-Zeichen finden Sie z.B. unter https://ascii.cl/

Example 2:

<replace string="¬"></replace>

In this example, the non-sorting character '¬' is replaced by nothing (it is removed).

Example 3:

<replace regex="(.*)\[(.*) (.*)\]">$1[$3 $2]</replace>

In this example, three groups are formed and the second and third are swapped within the square brackets, for example, "The Holy Seat [Max Mustermann]" would become "The Holy Seat [Mustermann Max]".

The placeholder #SPACE# is supported for spaces.

If a value is to be split into several values, the {SPLIT} placeholder can be used for this purpose:

config_indexer.xml
<replace regex="^international$">LUX_YES{SPLIT}LUX_NO</replace>

nonSortCharacters

As an alternative to replacing non-sorting characters (see chapter 3.8.12), this configuration element can be used to configure the detection of non-sorting parts for metadata and sort fields. For displayed metadata fields the non-sort characters are filtered out, for sort fields the entire part of the string discriminated against by the non-sort characters is removed.

If the part that is not relevant for sorting is surrounded on both sides by non-sorting characters, for example « and », « must be configured in the prefix attribute, » in the suffix attribute.

Example 1:

mods.xml
<mods:name>
    <mods:displayForm>«von» Goethe, Johann Wolfgang</mods:displayForm>
<mods:name>
<nonSortCharacters prefix="«" suffix="»" />

Result:

<MD_TITLE>von Goethe, Johann Wolfgang</MD_TITLE>
<SORT_TITLE>Goethe, Johann Wolfgang<SORT_TITLE>

If the part not relevant for sorting is only separated from the rest of the string on one page, this character must only be configured as prefix (non-relevant part at the beginning of the string) or suffix (non-relevant part at the end of the string).

Example 2:

mods.xml
<mods:name>
    <mods:displayForm>von¬ Goethe, Johann Wolfgang</mods:displayForm>
<mods:name>
<nonSortCharacters suffix="\u00AC" />

Result:

<MD_TITLE>von Goethe, Johann Wolfgang</MD_TITLE>
<SORT_TITLE>Goethe, Johann Wolfgang<SORT_TITLE>

addToChildren

If this parameter is set to true, the values of this metadata are passed on to subordinate structure type documents. This is important, for example, for license types that use a Solr condition query. The field names used in the query must usually have this parameter activated. The DC field must always have this parameter activated. The default value is false.

addToPages

If this parameter is set to true, the values of this metadata are passed on to the page documents assigned to the respective structure hierarchy. This is important, for example, for license types that use a Solr condition query. The field names used in the query must usually have this parameter activated. And also metadata fields that are to be combined with full texts in the extended search. in the page documentsThe DC field must always have this parameter activated. The default value is false.

normalizeValue

<normalizeValue convertRoman="false" length="5" filler="0" position="FRONT" regex="foo([0-9]+).*$" />

Especially for sorting fields, it is sometimes necessary to normalize any numerical values contained in them in order to obtain correct sorting.

Example: "foo1138bar" and "foo123bar". "123" is smaller as a number than "1138", but by string sorting the value is sorted after "1138", because "11" comes before "12".

By using the configuration described above, the numerical value is filled to five digits: "foo00123bar" is sorted before "foo01138bar" after normalization.

Multiple <normalizeValue> elements can be configured per field, which are successively applied to the metadata value.

convertRoman

If true, Roman numerals can be searched for in the regex (e.g. via ([C\|I\|M\|V\|X]+)), which will then be converted to Arabic. In this mode no padding of the value is possible (but can be done afterwards in another <normalizeValue> element.

length

The desired number of characters in the normalized part of the string.

filler

A fill character that is used to fill up to the desired length.

position

Position at which the fill characters are appended. Possible values are FRONT and REAR. Default value is frFRONTont.

regex

Optional regular expression to define the part relevant for normalization. Can contain up to one capture group (otherwise the entire expression will be matched).

addToParents

If this parameter is set to true; all metadata values found in subelements are moved to the main element (both for simple and grouped metadata). The default value is false.

The attribute keepOnChildren causes highly inherited metadata not only to end up in the top element, but to remain in the entire hierarchy. Default is false.

geoJSONSource

This allows geocoordinates of points and polygons from GML or mods:coordinates to be written as geoJSON instead of in the original format.

The following values are currently supported:

  • gml:Point (points in "lon lat" format)

  • gml:Point:4326 (points in "lat lon" format)

  • gml:Polygon (points in "lon lat" format)

  • gml:Polygon:4326 (points in "lat lon" format)

  • mods:coordindates/point

Optionally the attribute addSearchField="true" can be defined. In this case, the WKT_COORDS field is also written, which contains the coordinates of the point or polygon in WKT format.

<MD_GEOJSON_POLYGON>
    <list>
        <item>
            <xpath>denkxweb:geoReference/denkxweb:surface/gml:MultiSurface[@srsName="http://www.opengis.net/def/crs/epsg/0/4326"]/gml:surfaceMember/gml:Polygon/gml:exterior/gml:LinearRing/gml:posList</xpath>
            <geoJSONSource addSearchField="true">gml:Polygon:4326</geoJSONSource>
        </item>
    </list>
</MD_GEOJSON_POLYGON>

interpolateYears

The switch <interpolateYears /> can be used to specify whether years in the field YEAR should be interpolated. This is particularly useful if, when faceting by year, data records are also to be found for which no concrete year but a period was entered. For example, a period can be entered as 1400-1499 or 1400/1499. The default value is false.

The switch should be used with caution. For example, if an object originates from the Cretaceous period and this is stored, then a lot of values can be written and the usefulness of the additional information is no longer given.

allowDuplicateValues

As a rule, duplicate metadata (field + value identic) is written only once. The switch <allowDuplicateValues> (if true) allows this rule to be suspended for certain fields and allows multiple identical metadata fields to be written. Default value is false.

addExistenceBoolean

This switch can be used to automatically generate a boolean field that describes the existence of the respective metadata field. If this switch is set to true during the field configuration of MD_TITLE, a field BOOL_TITLE=true is written into the index if a value is found for MD TITLE. If no value is found for MD_TITLE, BOOL_TITLE=false will be in the index. This boolean field can be used to filter objects according to the presence of a metadata field.

Last updated