2.6 Metadata
Each metadata field to be used in the Goobi viewer must be configured in the Goobi viewer Indexer in the fields
configuration element. The following schema is used for this:
The uppermost element has the name of the Solr field into which the configured metadata is to be written (here: PI
).
Field configurations whose name does not begin with MD_ im are mostly mandatory fields for the Goobi viewer and must not be removed! Basic knowledge of XPath is required for configuring Solr metadata fields. Further details can be found at the following address: http://de.wikipedia.org/wiki/XPATH
Changes in the <fields>
section are applied directly without a restart.
item
These elements contain XPath expressions under which the respective metadata can be found. All specified expressions are checked so that a metadata field (if allowed for multiple values) can contain multiple values from different XPath expressions. In the example above, one expression is defined for METS/MODS and one for LIDO, which means that this field configuration is valid for both formats.
The prefix and suffix attributes can optionally each contain a static prefix or suffix that is appended to each value found using the respective XPath expression.
For METS documents, the expression can either be formulated relative to the element, for example:
or relative to the top element:
The variant does not have to be specified explicitly - the Goobi viewer indexer checks all configured expressions relative to both nodes.
LIDO formulates XPath expressions for general metadata relative to the top element, for example:
Metadata of LIDO events use XPath expressions relative to the respective event element, for example:
addSortField
If true
, a second field is written in addition to the configured metadata field, which can be used to sort search hits. The second field has the prefix SORT_
in its name. This is particularly useful for fields whose names begin with the prefix MD_
(MD_
is replaced by SORT_). These are always configured in the Solr schema so that they can contain several values as a list - such fields cannot be used for sorting in Solr. Note that a sort field does not contain all values found, but only the first one (for example, the first author of a record). The default value is false
.
addSortFieldToTopstruct
Special setting for metadata fields from LIDO events. If search hits are to be sorted by a metadata field from a LIDO event, the setting true
writes a sort field from this metadata field into the document of the main element. The default value is false
.
addToDefault
If true
, the value of the configured metadata field is appended to the DEFAULT
field. The latter contains all tokens that can be searched for in the Goobi viewer. If a specific metadata is to be searchable, this option must be activated in the corresponding field configuration. The default value is false
.
addUntokenizedVersion
Metadata fields with the prefix MD_
are always split into individual tokens (roughly speaking, individual words in a character string). To be able to browse a field (see Browsing), the entire character string must be present as a token (otherwise, for example, you cannot browse main titles, but only individual words from main titles). For this purpose, a copy can be created in addition to the metadata field, whose value exists as a token (this receives the suffix _UNTOKENIZED
). Such fields can be configured for the Browse function. The default value is true
.
groupEntity
Individuals and entities may contain additional metadata, such as life data or external links. This metadata may be grouped to make it clear that they belong together.
For this purpose, there is a possibility to write related metadata, e.g. of persons and corporations, in a separate Solr Index document. These can then be parsed by the Goobi viewer and graphically processed accordingly.
If this configuration element is set to true
, the following MODS structure is assumed for the relevant metadata field:
The configuration is done in the configuration element . This can optionally contain the type attribute (permissible values: PERSON
, CORPORATION
, LOCATION
, SUBJECT
, ORIGININFO
, CITATION
, OTHER
), which under certain circumstances can trigger a special treatment of the metadata in the Goobi viewer (for example, displaying a person icon for search hits based on a person metadata).
The optional attribute addAuthorityDataToDocstruct
causes with the setting true
that certain authority data values (from external databases) are not stored in the corresponding page-doc but in the structural element doc to which the respective page belongs. In this way it can be avoided, if desired, that authority data values which contain entered search terms are displayed as sub-hits (optical preference). Default value is false
.
The addCoordsToDocstruct
attribute works similarly. This writes optional searchable geo-coordinates such as the WKT_*
or NORM_COORDS_*
fields into the associated structure element doc instead of the page doc. Default value is false
.
The individual metadata fields within the index document are configured in <field>
elements. The attribute name describes the desired name of the index field within the document, the text value of the <field>
element contains the XPath expression of the respective metadata.
The optional addSortField
attribute causes an additional SORT_*
field to be generated for the field. This can be used to sort multiple metadata values within a grouped metadata. Default is false
.
With the optional attribute multivalued=false
only the first found value of the respective field can be transferred to the index. The default value is true
.
The optional attribute defaultValue
can be used to write a default value if the XPath expression to which the attribute belongs does not return a value. To avoid that the default value is erroneously written in addition to the value from the XPath expression (e.g. if a LIDO-XPath expression does not return a value when indexing METS), it is best to configure separatem <item>
elements for such constellations.
The XPath expressions are defined relative to the element found by the XPath expression in an <item>
element. In the example above, the XPath expressions in elements are relative to <mods:name>
.
Any links are possible. The display mode with the Wikipedia symbol in the above example was configured in messages.properties (see Metadata Configuration). For technical reasons, however, only one display type can be configured per Solr field.
The MD_VALUE subfield contains the "main value" of the metadata (such as the names of persons, entities, and keywords). It is mandatory for the grouped metadata to be included in the index.
If the configuration element is missing, a simple string with the name of the person or body is expected in mods:displayForm
.
If type="CITATION"
is configured, a URL attribute can be configured to retrieve field contents from an external source. In the example below, the value from MD_VALUE is used instead of the placeholder in the URL to retrieve the record. The XPath expressions of the other subfields are used to retrieve values from the Primo document.
getnode
If the value is set to first
, the system aborts after the first value found for the current XPath expression. Note that several XPath expressions can be defined for a Solr field, for which values can also exist.
getchildren
If the value is set to all
, values for this metadata are copied from all directly subordinate structure items.
getparents
If the value is set to first
, values for this metadata are also copied from the directly superior force element. In all
cases, values for this metadata are copied from all higher-level force elements up to the highest level.
lowercase
If this option is set to true, all uppercase letters in the string are replaced by lowercase letters. The default value is false. Example:
onefield
If true
, all values found will be written to the Solr field as one string. The values are separated by the string ";
". If false
, a new string is written for each value. In the latter case, the relevant field in the Solr schema must be configured so that lists of several values are permitted. The default value is false
.
With the optional attribute separator
, the default separator ;
can be replaced by a self-defined one:
onetoken / splittingCharacter
If the configuration element onetoken
is set to true
, the values are processed in such a way that Solr cannot split them into several tokens. In concrete terms, this means that non-alphanumeric characters are removed. The default value is false. Example
If any character string in the element splittingCharacter is additionally configured for this field, all occurrences of this character string are replaced by a dot (.). Example:
These configuration elements exist specifically for handling collection names in the Goobi viewer (field DC) and do not add value to other metadata fields.
normalizeYear
This configuration element is intended for metadata fields that contain dates or years. If this element is set to 'true
', the Goobi viewer Indexer attempts to extract a normalized year (yyyy
) from the value of this field. Currently, pure years and dates are recognized according to the patterns dd.MM.yyyy
and yyyy-MM-dd
. All recognized years are written into the field YEAR
, YEARMONTH
and YEARMONTHDAY
. In addition, the respective century is calculated from these years and written into the CENTURY
field.
Both fields can be used, for example, for faceting search hits. However, since these are fields with several possible values per field, they cannot be used for sorting search hits. The default value is false
.
Optionally the attribute <normalizeYear minYearDigits="n">
can be configured. The value n
indicates the number of digits above which free-standing numbers in a character string are to be interpreted as years. Minimum value is 1. Default value is 3
.
replace
This configuration element replaces individual ASCII characters or character strings with another character string. A character to be replaced is specified in the char
attribute as an ASCII numeric code. Alternatively, a character string can be specified in the attribute string
or a regular expression in the attribute regex
. The character string to be written instead is specified within the element.
Example 1:
In the above example, an ASCII character for the line break (10) is replaced by the HTML character string for the line break.
Eine Liste der ASCII-Zeichen finden Sie z.B. unter https://ascii.cl/
Example 2:
In this example, the non-sorting character '¬' is replaced by nothing (it is removed).
Example 3:
In this example, three groups are formed and the second and third are swapped within the square brackets, for example, "The Holy Seat [Max Mustermann]" would become "The Holy Seat [Mustermann Max]".
The placeholder #SPACE#
is supported for spaces.
If a value is to be split into several values, the {SPLIT}
placeholder can be used for this purpose:
nonSortCharacters
As an alternative to replacing non-sorting characters (see chapter 3.8.12), this configuration element can be used to configure the detection of non-sorting parts for metadata and sort fields. For displayed metadata fields the non-sort characters are filtered out, for sort fields the entire part of the string discriminated against by the non-sort characters is removed.
If the part that is not relevant for sorting is surrounded on both sides by non-sorting characters, for example «
and »
, «
must be configured in the prefix attribute, »
in the suffix attribute.
Example 1:
Result:
If the part not relevant for sorting is only separated from the rest of the string on one page, this character must only be configured as prefix
(non-relevant part at the beginning of the string) or suffix
(non-relevant part at the end of the string).
Example 2:
Result:
addToChildren
If this parameter is set to true
, the values of this metadata are passed on to subordinate structure type documents. This is important, for example, for license types that use a Solr condition query. The field names used in the query must usually have this parameter activated. The DC field must always have this parameter activated. The default value is false
.
addToPages
If this parameter is set to true
, the values of this metadata are passed on to the page documents assigned to the respective structure hierarchy. This is important, for example, for license types that use a Solr condition query. The field names used in the query must usually have this parameter activated. And also metadata fields that are to be combined with full texts in the extended search. in the page documentsThe DC field must always have this parameter activated. The default value is false
.
normalizeValue
Especially for sorting fields, it is sometimes necessary to normalize any numerical values contained in them in order to obtain correct sorting.
Example: "foo1138bar" and "foo123bar". "123" is smaller as a number than "1138", but by string sorting the value is sorted after "1138", because "11" comes before "12".
By using the configuration described above, the numerical value is filled to five digits: "foo00123bar" is sorted before "foo01138bar" after normalization.
Multiple <normalizeValue>
elements can be configured per field, which are successively applied to the metadata value.
convertRoman
If true
, Roman numerals can be searched for in the regex (e.g. via ([C\|I\|M\|V\|X]+)
), which will then be converted to Arabic. In this mode no padding of the value is possible (but can be done afterwards in another <normalizeValue>
element.
length
The desired number of characters in the normalized part of the string.
filler
A fill character that is used to fill up to the desired length.
position
Position at which the fill characters are appended. Possible values are FRONT
and REAR
. Default value is frFRONTont
.
regex
Optional regular expression to define the part relevant for normalization. Can contain up to one capture group (otherwise the entire expression will be matched).
addToParents
If this parameter is set to tru
e; all metadata values found in subelements are moved to the main element (both for simple and grouped metadata). The default value is false
.
The attribute keepOnChildren
causes highly inherited metadata not only to end up in the top element, but to remain in the entire hierarchy. Default is false
.
geoJSONSource
This allows geocoordinates of points and polygons from GML or mods:coordinates
to be written as geoJSON instead of in the original format.
The following values are currently supported:
gml:Point (points in "lon lat" format)
gml:Point:4326 (points in "lat lon" format)
gml:Polygon (points in "lon lat" format)
gml:Polygon:4326 (points in "lat lon" format)
mods:coordindates/point
Optionally the attribute addSearchField="true"
can be defined. In this case, the WKT_COORDS
field is also written, which contains the coordinates of the point or polygon in WKT format.
interpolateYears
The switch <interpolateYears />
can be used to specify whether years in the field YEAR
should be interpolated. This is particularly useful if, when faceting by year, data records are also to be found for which no concrete year but a period was entered. For example, a period can be entered as 1400-1499
or 1400/1499
. The default value is false
.
The switch should be used with caution. For example, if an object originates from the Cretaceous period and this is stored, then a lot of values can be written and the usefulness of the additional information is no longer given.
allowDuplicateValues
As a rule, duplicate metadata (field + value identic) is written only once. The switch <allowDuplicateValues>
(if true
) allows this rule to be suspended for certain fields and allows multiple identical metadata fields to be written. Default value is false
.
addExistenceBoolean
This switch can be used to automatically generate a boolean field that describes the existence of the respective metadata field. If this switch is set to true
during the field configuration of MD_TITLE
, a field BOOL_TITLE=true
is written into the index if a value is found for MD TITLE
. If no value is found for MD_TITLE
, BOOL_TITLE=false
will be in the index. This boolean field can be used to filter objects according to the presence of a metadata field.
Last updated