Asynchronous jobs and worker management¶

AtoM relies on a job scheduler in order to execute certain long-running tasks asynchronously in the background (instead of synchronously via your web browser, making you wait until the task is done and the page loaded before continuing), to guarantee that web requests are handled promptly and work loads can be distributed across multiple machines. Example tasks in AtoM that use the job scheduler include:

Generating finding aids
Importing records via the user interface
Exporting records via the Clipboard
Using the Move module to reorganize archival description hierarchies
Ingesting DIPs from Archivematica

We use Gearman as our job scheduler in AtoM. Gearman “provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages.”

You’ll need a good understanding of all the components involved to find out what the best configuration and arrangement of services will be for you. You could configure it so every service runs in the same machine, or scale up to many machines across the network. You can always start small in a single-node environment and add more hardware later.

Jump to:

Gearman Job Server configuration
Gearman Job Worker management
Job and job scheduler management

Gearman Job Server configuration¶

You should have completed the initial installation and configuration of Gearman as part of your AtoM installation process. For more information, see:

Deployment of workers

Other considerations¶

An AtoM worker needs to know where the job server is running, which is defined in an application setting under config/gearman.yml and defaults to 127.0.0.1:4730.

By default, Gearman’s job queues are stored in memory. However you are free to deploy the queue storage in a different machine or replace it with a durable solution like MySQL or SQLite3. For more configuration options available, visit: http://gearman.org/manual/job_server/.

Note that the job server will perfectly handle multiple workers running simultaneously and the work load will be distributed across all available workers. If there are no workers available because they are busy completing other tasks, the job server will store the job in the queues and deliver them once a worker becomes available.

If you’re planing to connect multiple AtoM instances to the same Gearman server, make sure to set a different string value for the workers_key setting located in config/app.yml. This will avoid collisions between those instances and the workers will only take the jobs that belong to their related AtoM install.

Whenever you change any of these settings, make sure that the Symfony cache is cleared and the workers are restarted.

Gearman Job Worker management¶

In AtoM, a worker is just a CLI task that you can run in a terminal or supervise with specific tools like upstart, supervisord or circus. The worker will wait for jobs that are assigned by the job server.

The simplest way to run a worker is from your terminal:

php symfony jobs:worker

A better way to run a worker is to use a process supervisor like systemd. This is the recommended method documented in the Ubuntu AtoM installation guide. See:

Deployment of workers

With this method, you can control the service execution status with the following commands:

sudo systemctl enable atom-worker   # Enables the worker (on boot)
sudo systemctl start atom-worker    # Starts the worker
sudo systemctl stop atom-worker     # Stops the worker
sudo systemctl restart atom-worker  # Restarts the workers
sudo systemctl status atom-worker   # Obtains current status

Restarting the atom-worker¶

Restarting the atom-worker is often a useful troubleshooting step. You should consider restarting the atom-worker if:

You have encountered a 500 error related to a task in AtoM supported by the job scheduler
You have made changes to the worker configuration
You have a job that never seems to complete in the queue

Tip

When encountering a 500 error, we recommend first consulting the webserver error log for more information. See:

Web server logs

If you see the following message in the related error log output, it’s likely that the 500 error can be resolved by restarting the atom-worker:

"No Gearman worker available that can handle the job."

Important

The systemd service configuration also includes a start rate limit of 3 restarts in 24 hours of the atom-worker, after which the the fail counter will need to be manually reset to be able to restart the worker. See below for more information:

Atom-worker start rate limit

Atom-worker start rate limit¶

The systemd service configuration file created during the Ubuntu installation process includes several parameters that govern how the service will attempt to automatically restart itself following a failure. The following lines in the [Service] section of the configuration block tell the service to attempt a restart in 30-second intervals following a failure:

Restart=on-failure
RestartSec=30

However, to prevent the service from being caught in a loop when restarting does not resolve the source cause of the failure, the configuration block also includes a restart attempt limit of 3 attempts in 24 hours, in the [Unit] section:

# High interval and low restart limit to increase the possibility
# of hitting the rate limits in long running recurrent jobs.
StartLimitIntervalSec=24h
StartLimitBurst=3

If the worker hits the start rate limit (3 starts in 24h), the failed status will need to be manually cleared before the worker can be restarted. You can reset the fail counter with the following:

sudo systemctl reset-failed atom-worker
sudo systemctl start atom-worker

If you see the following message when attempting to restart the atom-worker, it is likely you have hit the start rate limit and will need to clear it before successfullly restarting:

Job for atom-worker.service failed.
See "systemctl status atom-worker.service" and "journalctl -xe" for details.

Atom-worker journal¶

When configured via systemd, you can have access to the journal of the atom-worker unit as follows:

sudo journalctl -f -u atom-worker

This is useful in case you need to troubleshoot the worker.

Job and job scheduler management¶

The status of individual jobs executed via the user interface can be viewed via the Manage > Jobs page. For more information, see:

Manage jobs

All authenticated (e.g. logged in) users can access the Jobs page. However, most users will only see jobs and be able to clear jobs which they have started. Only an administrator can see the status of all jobs - however, like all users, administrators can only clear jobs that belong to them via the user interface.

For system administrators with command-line access, AtoM has a task that can be used to clear all jobs from the queue as well as clear the job history of previous jobs from the database.

Important

The following task clears ALL jobs, including those currently running and queued. This means you will need to manually restart any jobs you would like to continue after running this command!

Note that the history of previous jobs is also cleared from the database. If you would like to preserve job history information before running this task, the Manage jobs page in the AtoM user interface includes an option to download a CSV of the job history. For more information, see:

Jobs page button block options

To clear all jobs and job history, run the following from AtoM’s root installation directory:

php symfony jobs:clear

Troubleshooting jobs that will not complete¶

Occasionally, a long-running or resource-intensive job can exhaust all available system resources - generally this will result in a job shown on the Manage jobs page of the AtoM user interface that remains indefinitely stuck at the “Running” status.

You can use the jobs:clear command-line task to clear all jobs if desired. This will terminate the stalled job, as well as any other jobs in the queue, and delete the prior job history from AtoM’s database. Depending on the cause of the stall, you may need to restart the atom-worker after - see above:

Restarting the atom-worker

However, this approach means you will need to manually restart any previously queued jobs you would like to retry.

Alternatively, if you have access to the MySQL command-prompt, you can also use SQL to kill just a specific stalled job if you don’t want to lose other jobs in the queue, and/or the job history. For more information, see:

Kill a specific atom-worker job

Asynchronous jobs and worker management¶

Gearman Job Server configuration¶

Other considerations¶

Gearman Job Worker management¶

Restarting the atom-worker¶

Atom-worker start rate limit¶

Atom-worker journal¶

Job and job scheduler management¶

Troubleshooting jobs that will not complete¶

Version 2.7

Contents

Available documents

Asynchronous jobs and worker management¶

Gearman Job Server configuration¶

Other considerations¶

Gearman Job Worker management¶

Restarting the atom-worker¶

Atom-worker start rate limit¶

Atom-worker journal¶

Job and job scheduler management¶

Troubleshooting jobs that will not complete¶

Version 2.7

Contents

Search

Available documents