4.7. Internet Archive download

Working with the intranda TaskManager also allows Goobi to harvest the Internet Archive efficiently. Two requirements have to be met for this. First, the WebDavCommunicator plugin must be present in TaskManager’s plugin folder. The default path is:

/opt/digiverso/itm/plugins/WebDavCommunicator-<version>.jar

Second, the download plugin itself (IADownloadPlugin) must also be copied to the plugin folder. The default path is:

/opt/digiverso/itm/plugins/IADownloadPlugin-<version>.jar

Starting the plugin

Within Goobi, the call to the Internet Archive harvesting is configured in a workflow step as follows:

/usr/bin/java -jar /opt/digiverso/itm/bin/TaskClient.jar
-itm http://localhost/itm/service
-s http://archive.org/download/${meta.CatalogIDDigital}
-d {imagepath}/source/
-n template
-e -i {stepid}
-T {processtitle}
-gid {processid}
-t IADOWNLOAD
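
To make the call above more concrete, the following Python sketch shows what the argument list might look like once the placeholders have been replaced. Goobi performs this substitution itself; the function name and all example values (identifier, paths, IDs) are hypothetical and only serve as illustration.

```python
# Illustrative sketch only: Goobi expands the placeholders itself.
# All example values below are hypothetical.

def build_task_client_args(variables):
    """Return the TaskClient call as an argument list with placeholders expanded."""
    template = [
        "/usr/bin/java", "-jar", "/opt/digiverso/itm/bin/TaskClient.jar",
        "-itm", "http://localhost/itm/service",
        "-s", "http://archive.org/download/${meta.CatalogIDDigital}",
        "-d", "{imagepath}/source/",
        "-n", "template",
        "-e",
        "-i", "{stepid}",
        "-T", "{processtitle}",
        "-gid", "{processid}",
        "-t", "IADOWNLOAD",
    ]
    args = []
    for token in template:
        # Replace every known placeholder that occurs in this token.
        for placeholder, value in variables.items():
            token = token.replace(placeholder, value)
        args.append(token)
    return args


example_variables = {
    "${meta.CatalogIDDigital}": "examplebook00auth",               # hypothetical identifier
    "{imagepath}": "/opt/digiverso/goobi/metadata/1234/images",    # hypothetical path
    "{stepid}": "5678",
    "{processtitle}": "examplebook_1234",
    "{processid}": "1234",
}

if __name__ == "__main__":
    print(" ".join(build_task_client_args(example_variables)))
```

Keeping the call as a list of separate arguments avoids quoting problems if a process title ever contains spaces.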

Parameters

The command parameters are explained in the following table:

| Parameter | Value or Goobi variable | Meaning |
|---|---|---|
| -itm | http://localhost/itm/service | URL of the intranda TaskManager interface |
| -e, --returnError | - | If this parameter is set, the TaskClient ends with an error code to prevent the workflow from continuing automatically |
| -p | 0 to 10 | Priority with which this job is executed |
| -gid | {processid} | Goobi process ID |
| -i | {stepid} | ID of the workflow step that launches the call |
| -T, --title | {processtitle} | Title of the Goobi process for which the call is launched |
| -t, --jobtype | IADOWNLOAD | The job type |
| -n, --templatename | template | Name of the previously generated configuration file |
| -s, --source | http://archive.org/download/${meta.CatalogIDDigital} | URL of the record in the Internet Archive |
| -d, --destination | {processpath} | Path to the process main directory |
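
As a way of reading the parameter list, the following Python sketch mirrors the options in an argparse parser. TaskClient is a Java program and its actual option handling is not shown here, so this is only an illustrative model; the parser, the attribute names and the example values are assumptions.

```python
import argparse

# Illustrative model of the documented options; NOT the real TaskClient parser.


def build_parser():
    parser = argparse.ArgumentParser(prog="taskclient-sketch", allow_abbrev=False)
    parser.add_argument("-itm", dest="itm", help="URL of the TaskManager interface")
    parser.add_argument("-e", "--returnError", dest="return_error", action="store_true",
                        help="end with an error code so the workflow does not continue")
    parser.add_argument("-p", dest="priority", type=int, choices=range(0, 11),
                        help="job priority, 0 to 10")
    parser.add_argument("-gid", dest="process_id", help="Goobi process ID")
    parser.add_argument("-i", dest="step_id", help="ID of the calling workflow step")
    parser.add_argument("-T", "--title", dest="title", help="Goobi process title")
    parser.add_argument("-t", "--jobtype", dest="job_type", help="job type, e.g. IADOWNLOAD")
    parser.add_argument("-n", "--templatename", dest="template_name",
                        help="name of the configuration file")
    parser.add_argument("-s", "--source", dest="source", help="Internet Archive URL")
    parser.add_argument("-d", "--destination", dest="destination", help="target directory")
    return parser


if __name__ == "__main__":
    # Hypothetical values, as they might look after Goobi has expanded its variables.
    options = build_parser().parse_args([
        "-itm", "http://localhost/itm/service",
        "-s", "http://archive.org/download/examplebook00auth",
        "-d", "/tmp/source/",
        "-n", "template",
        "-e", "-i", "5678",
        "-T", "examplebook_1234",
        "-gid", "1234",
        "-t", "IADOWNLOAD",
    ])
    print(options)
```

Note that -T (title) and -t (job type) differ only in case, so the capitalisation in the workflow step configuration matters.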

Operation of the plugin

The plugin receives the URL of a volume in the Internet Archive together with a target directory. It then downloads the following files from that URL into the target directory:

scandata.xml
marc.xml
abbyy.gz
jp2.zip
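
The mapping from a volume URL to the individual downloads could be sketched as follows. The sketch assumes that each file is named `<identifier>_<suffix>` (for example `examplebook00auth_marc.xml`), which is common for scanned books in the Internet Archive but is not stated above; the function name and the example identifier are hypothetical. No network access takes place, only the source and target paths are computed.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# File suffixes taken from the list above.
FILE_SUFFIXES = ["scandata.xml", "marc.xml", "abbyy.gz", "jp2.zip"]


def download_plan(identifier, target_dir):
    """Return (source URL, target path) pairs for one Internet Archive volume."""
    base_url = "http://archive.org/download/" + quote(identifier)
    plan = []
    for suffix in FILE_SUFFIXES:
        # ASSUMPTION: files are prefixed with the item identifier.
        filename = f"{identifier}_{suffix}"
        source = f"{base_url}/{quote(filename)}"
        target = str(PurePosixPath(target_dir) / filename)
        plan.append((source, target))
    return plan


if __name__ == "__main__":
    for source, target in download_plan("examplebook00auth", "/tmp/source/"):
        print(source, "->", target)
```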

If a download fails, the priority of that job is reduced and TaskManager moves on to the next job. Each failed attempt is reported to Goobi and recorded as a warning in the process log.

Initially, every job in TaskManager is assigned priority level 10. The higher the priority, the sooner a job will be processed. If a job’s priority is reduced to zero, four more download attempts will be made. If all of these fail, the job will be regarded as unsuccessful and will be sent back to Goobi with an error message.
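
The retry behaviour described above can be modelled for a single job as in the following sketch. The text does not say by how much the priority drops per failure; the sketch assumes a step of 1, and it ignores the interleaving with other jobs in the queue. All names are hypothetical.

```python
INITIAL_PRIORITY = 10        # from the text: every job starts at priority 10
EXTRA_ATTEMPTS_AT_ZERO = 4   # from the text: four more attempts at priority zero
PRIORITY_STEP = 1            # ASSUMPTION: reduction per failed attempt


def run_job(try_download):
    """Simulate one job; try_download(attempt_number) returns True on success.

    Returns (status, attempts_made, final_priority).
    """
    priority = INITIAL_PRIORITY
    retries_at_zero = 0
    attempts = 0
    while True:
        attempts += 1
        if try_download(attempts):
            return "done", attempts, priority
        if priority > 0:
            # Failed attempt: lower the priority, never below zero.
            priority = max(0, priority - PRIORITY_STEP)
        else:
            # Priority is already zero: count the remaining attempts.
            retries_at_zero += 1
            if retries_at_zero == EXTRA_ATTEMPTS_AT_ZERO:
                return "error", attempts, priority


if __name__ == "__main__":
    print(run_job(lambda attempt: False))         # download never succeeds
    print(run_job(lambda attempt: attempt == 3))  # succeeds on the third attempt
```

Under these assumptions, a job whose download never succeeds is attempted 14 times in total (10 attempts while the priority falls from 10 to 0, then 4 more) before it is returned to Goobi with an error.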