Job Retry Module
Jobs can fail for different reasons. The Job Retry Module simplifies operations by automatically applying a predefined action based on the error code and error message of a failed job.
Actions
Here is a description of the currently available actions.
| NAME | DESCRIPTION |
|---|---|
| no_retry | Do not retry the job again for certain hopeless errors. |
| limit_retry | Limit the number of retries to a certain maximum. |
| increase memory | Submit next job retries with a higher memory requirement. |
| increase CPU time | Submit next job retries with a longer walltime requirement. |
Retry actions are recorded in the database table RETRYACTIONS. New actions need to be implemented and then registered in the table.
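As a rough illustration of what the four actions amount to, the sketch below models them as functions that adjust the parameters of the next retry. Everything in it is an assumption made for illustration: the `Job` fields, the `maxAttempt` parameter name, the growth factor of 2, and the `ACTIONS` registry are hypothetical and are not taken from the actual module.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # Hypothetical job attributes, not the real job specification.
    attempt_nr: int = 1
    max_attempt: int = 5
    min_ram_mb: int = 2000
    max_walltime_s: int = 3600


def no_retry(job: Job, params: Optional[dict] = None) -> None:
    # Cap the allowed attempts at the current attempt, so no retry follows.
    job.max_attempt = job.attempt_nr


def limit_retry(job: Job, params: Optional[dict] = None) -> None:
    # "maxAttempt" is an assumed parameter name for the retry limit.
    limit = int((params or {}).get("maxAttempt", job.max_attempt))
    job.max_attempt = min(job.max_attempt, limit)


def increase_memory(job: Job, params: Optional[dict] = None) -> None:
    # Assumed policy: double the memory requirement for the next retry.
    job.min_ram_mb *= 2


def increase_cpu_time(job: Job, params: Optional[dict] = None) -> None:
    # Assumed policy: double the walltime requirement for the next retry.
    job.max_walltime_s *= 2


# Hypothetical in-code registry keyed by the action names listed above.
ACTIONS = {
    "no_retry": no_retry,
    "limit_retry": limit_retry,
    "increase memory": increase_memory,
    "increase CPU time": increase_cpu_time,
}
```

With this layout, adding a new action would mean writing one more function and adding one more registry entry, which mirrors the "implement, then register" step described above.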
Rules
Rules are recorded in the database table RETRYERRORS. For new rules you have to specify:
- ID: unique ID
- Retry action: which action from the previous section you want to invoke
- Error source: the source of the error (payload, pilot, job dispatcher, task buffer)
- Error code
- Error message: a regular expression in Python syntax (https://docs.python.org/3/library/re.html) to match the error message. You can check your regular expressions in online services like https://pythex.org/ if you don't want to write a Python snippet.
- Active: you can choose to run the rule in passive mode. In this case there will only be a log message indicating that the rule would have been invoked, but it has no effect. This option is useful when you are not sure of the scope of your new rule.
- Parameters: valid only for certain actions, such as limit_retry, where you want to specify the limit of retries.
- Scope: there are a couple of columns (architecture, release, workqueue) where you can limit the scope of the new rule, for example if you want to apply the rule only for a certain software release.
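The following minimal sketch shows how a rule with the fields above could be matched against a failed job, including the passive mode. It is not the module's actual implementation: the `RetryRule` class, the `evaluate` function, the `maxAttempt` parameter name, the use of `re.search` (rather than `re.match`), and the example error code are all assumptions for illustration, and only the release column of the scope is modelled.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RetryRule:
    # Hypothetical in-memory view of one rule row.
    rule_id: int
    action: str
    error_source: str
    error_code: int
    error_message: Optional[str] = None  # regex; None matches any message
    active: bool = True                  # False means passive mode
    parameters: dict = field(default_factory=dict)
    release: Optional[str] = None        # scope; None means any release


def evaluate(rules, source, code, message, release=None):
    """Return (actions_to_apply, log_lines) for one failed job."""
    to_apply, log = [], []
    for rule in rules:
        if rule.error_source != source or rule.error_code != code:
            continue
        if rule.error_message and not re.search(rule.error_message, message):
            continue
        if rule.release is not None and rule.release != release:
            continue
        if rule.active:
            to_apply.append((rule.action, rule.parameters))
        else:
            # Passive mode: only report what would have happened.
            log.append(
                f"rule {rule.rule_id} would have invoked {rule.action} (passive mode)"
            )
    return to_apply, log


# Example rules; the error code and messages are made up.
rules = [
    RetryRule(1, "limit_retry", "pilot", 1305, r"out of memory",
              parameters={"maxAttempt": 3}),
    RetryRule(2, "no_retry", "pilot", 1305, r"out of memory", active=False),
]
to_apply, log = evaluate(rules, "pilot", 1305, "payload killed: out of memory")
```

In this example the first rule is returned for execution, while the second one only produces a log line because it runs in passive mode.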