4.5. Full-text recognition - OCR

To automatically execute a text recognition the TaskManager can be used by connecting it to Goobi with this plugin. Dependent on the configuration you can define which data formats shall be transferred back to Goobi from the OCR engine.

Starting the plugin

The TaskClient call for an OCR job is similar to the calls for other job types::

/usr/bin/java -jar /opt/digiverso/itm/bin/TaskClient.jar 
    -itm http://localhost:8080/itm/service 
    -s {tifpath} 
    -d {processpath} 
    -e -gid {processid} 
    -i {stepid} 
    -T {processtitle} 
    -f {process.Schrifttyp} 
    -n template.xml 
    -l ${metas.DocLanguage} 
    -st intranda-abbyy

Parameters

The command parameters are explained in the following table:

Parameter

Possible Goobi variable

Meaning

-itm

http://localhost/itm/service

URL to intranda TaskManager interface

-e, --returnError

-

If entered, the TaskClient will end with an error code to prevent the workflow from continuing automatically

-p

0 – 10

Priority to execute this job

-gid

{processid}

Goobi process ID

-i

{stepid}

ID for the workflow step that launches the call

-T, --title

{processtitle}

Goobi process title for which the call is launched

-t, --jobtype

OCR

Job type

-n, --templatename

template.xml

Name of previously generated configuration file

-s, --source

{tifpath}

Path to media directory for the process

-d, --destination

{processpath}

Path to main process directory

-f, --fontType

e.g. {product.fonttype} or {process. fonttype}

Typeface Fraktur and Antiqua are supported

-l, --language

${metas.DocLanguage}

Language of text being recognised

-st

{intranda-abbyy}

OCR server type

Operation of the plugin

When a new OCR job is received by the intranda TaskManager, different tickets are generated depending on the number of images to be recognised. Typically, a ticket will cover up to 500 images requiring up to 10 GB of storage. Each ticket comprises a list of image files and an instruction specifying the OCR output format.

Tickets are processed individually. The intranda TaskManager loads the ticket and the corresponding images into the OCR input folder. This can be a local folder, a mounted folder or a WebDav folder. The OCR application monitors this folder and begins the text recognition process once all the data has been transferred.

The intranda TaskManager now monitors both the error folder for possible errors and the OCR application’s control folder for results messages. If a suitable control file is received, all the data belonging to this ticket will be gathered together in the output folder and downloaded. The data is then saved into individual sub-folders in the ocr sub-folder (within the corresponding process folder) based on their file suffix..

Last updated