Running hyperparameter optimization



Introduction

PanDA and iDDS integrate geographically distributed GPU/CPU resources so that users can run a special type of analysis that automatically optimizes the hyperparameters of machine learning models.

Each HPO workflow is typically composed of iterations of three steps:

  • Sampling step

    To choose hyperparameter points in a search space.

  • Training step

    To evaluate each hyperparameter point with an objective function and a training dataset.

  • Optimization step

    To redefine the search space based on loss values of evaluated hyperparameter points.

In the system, the sampling and optimization steps are executed on central resources while the training step is executed on distributed resources. The former is called steering and the latter is called evaluation, i.e., steering = sampling + optimization, and evaluation = training. Users can submit HPO tasks to PanDA using phpo, which is available in panda-client-1.4.24 or higher.

../_images/HPO_service.png

Once tasks are injected into the system, iDDS orchestrates JEDI, PanDA, Harvester, and the pilot to execute the HPO steps on the relevant resources in a timely manner, as shown in the figure above. Users can see what’s going on in the system using PanDA monitoring. The iDDS document explains how the system works, but end-users don’t have to know all the details. One important point, however, is that a single PanDA job evaluates one or more hyperparameter points, so it is worth looking at the log files in PanDA monitoring if something goes wrong.



Preparation

The main trick to run hyperparameter optimization (HPO) on distributed resources is the separation of steering and evaluation and their asynchronous execution, as shown in the figure below.

../_images/ask_and_tell.png

Users need to provide two types of containers: one for steering and the other for evaluation.

For steering, users can use predefined containers or their own containers. Note that when making their own steering containers, users need to use ML packages that support the ask-and-tell pattern, such as skopt and nevergrad. This page explains how steering containers communicate with the system. Users provide execution strings to specify how new hyperparameter points are generated in their steering containers. Each execution string can contain several placeholders which are dynamically replaced with actual values when the container is executed. Input and output are done through json files in the initial working directory $PWD, so that directory needs to be mounted when the container starts.
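
For illustration only, here is a minimal sketch of an ask-and-tell steering script based on skopt. It is not part of the system: the script name, the file names passed through the %IN/%OUT placeholders, the json layout of the evaluated points, and the search space dimensions are all assumptions made for this example; the actual protocol is defined in the page linked above.

$ cat steering_example.py

#!/usr/bin/env python
# Minimal ask-and-tell steering sketch (illustrative only). The file names
# given through the %IN/%OUT placeholders, the json layout of the evaluated
# points, and the search space below are assumptions for this example.
import json
import sys

from skopt import Optimizer

in_file, out_file, n_points = sys.argv[1], sys.argv[2], int(sys.argv[3])

# Hypothetical search space: a log-uniform float and an integer range.
opt = Optimizer([(1e-5, 1e-1, "log-uniform"), (16, 256)])

# "tell" the optimizer about the points that have already been evaluated.
with open(in_file) as f:
    evaluated = json.load(f)  # assumed: list of {"point": [...], "loss": x}
for rec in evaluated:
    if rec.get("loss") is not None:
        opt.tell(rec["point"], rec["loss"])

# "ask" for new points and write them to the output json in $PWD.
new_points = opt.ask(n_points=n_points)
with open(out_file, "w") as f:
    json.dump([[v.item() if hasattr(v, "item") else v for v in p]
               for p in new_points], f)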

For evaluation, users also provide execution strings to specify what is executed in their containers to train the ML model. There are two input files (one for the hyperparameter point to be evaluated and the other for the training data) and three output files (the first to report the loss, the second to report job metadata, and the last to preserve training metrics). The input file for the hyperparameter point and the output file to report the loss are mandatory, while the other files are optional. See this page for the details of their format. Note that evaluation containers are executed in read-only mode, so file-writing operations have to be done in the initial working directory /srv/workDir, which is bound to the host working directory where the containers and the system communicate through the input and output files. It is better to get the path of the initial working directory dynamically, using os.getcwd(), echo $PWD, and so on, when applications are executed in evaluation containers, rather than hard-coding /srv/workDir in the applications, since the convention might change in the future.
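
As a companion to the steering sketch, here is a minimal evaluation script. It is illustrative only: the script name, the hyperparameter names, and the {"loss": ...} layout of the output json are assumptions for this example; the exact file formats are defined in the page linked above. Only the default file names input.json and output.json come from the --evaluationInput and --evaluationOutput options described below.

$ cat evaluation_example.py

#!/usr/bin/env python
# Minimal evaluation sketch (illustrative only). The hyperparameter names
# and the {"loss": ...} layout of output.json are assumptions; see the
# format page for the exact schema.
import json
import os

work_dir = os.getcwd()  # do not hard-code /srv/workDir

# Read the hyperparameter point to evaluate (--evaluationInput, input.json).
with open(os.path.join(work_dir, "input.json")) as f:
    point = json.load(f)

# Dummy "training": replace with the real objective function.
loss = (point["batch_size"] - 20) ** 2 + point["epoch"]

# Report the loss (--evaluationOutput, output.json).
with open(os.path.join(work_dir, "output.json"), "w") as f:
    json.dump({"loss": loss}, f)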



phpo

phpo is a command-line tool specialized for HPO task submission. The following options are available in addition to the usual options such as --site and --verbose. All options can be loaded from a json file using --loadJson if preferred.

--nParallelEvaluation NPARALLELEVALUATION
                      The number of hyperparameter points being evaluated
                      concurrently. 1 by default
--maxPoints MAXPOINTS
                      The max number of hyperparameter points to be
                      evaluated in the entire search. 10 by default.
--maxEvaluationJobs MAXEVALUATIONJOBS
                      The max number of evaluation jobs in the entire
                      search. 2*maxPoints by default. The task is terminated
                      when all hyperparameter points are evaluated or the
                      number of evaluation jobs reaches MAXEVALUATIONJOBS

For steering,

--nPointsPerIteration NPOINTSPERITERATION
                      The max number of hyperparameter points generated in
                      each iteration. 2 by default. Simply speaking, the
                      steering container is executed
                      maxPoints/nPointsPerIteration times when
                      minUnevaluatedPoints is 0. The number of new points is
                      nPointsPerIteration-minUnevaluatedPoints
--minUnevaluatedPoints MINUNEVALUATEDPOINTS
                      The next iteration is triggered to generate new
                      hyperparameter points when the number of unevaluated
                      hyperparameter points goes below minUnevaluatedPoints.
                      0 by default
--steeringContainer STEERINGCONTAINER
                      The container image for steering run by docker
--steeringExec STEERINGEXEC
                      Execution string for steering. If --steeringContainer
                      is specified, the string is executed inside of the
                      container. Otherwise, the string is used as command-
                      line arguments for the docker command
--searchSpaceFile SEARCHSPACEFILE
                      External json filename to define the search space.
                      None by default. If this option is used together with
                      --segmentSpecFile the json file contains a list of
                      search space dictionaries. It is possible to contain
                      only one search space dictionary if all models use the
                      search space. In this case the search space dictionary
                      is cloned for every segment

For evaluation,

--evaluationContainer EVALUATIONCONTAINER
                      The container image for evaluation
--evaluationExec EVALUATIONEXEC
                      Execution string to run evaluation in singularity.
--evaluationInput EVALUATIONINPUT
                      Input filename for evaluation where a json-formatted
                      hyperparameter point is placed. input.json by default
--evaluationTrainingData EVALUATIONTRAININGDATA
                      Input filename for evaluation where a json-formatted
                      list of training data filenames is placed.
                      input_ds.json by default. Can be omitted if the
                      payload directly fetches the training data using wget
                      or something
--evaluationOutput EVALUATIONOUTPUT
                      Output filename of evaluation. output.json by default
--evaluationMeta EVALUATIONMETA
                      The name of metadata file produced by evaluation
--evaluationMetrics EVALUATIONMETRICS
                      The name of metrics file produced by evaluation
--trainingDS TRAININGDS
                      Name of training dataset

--checkPointToSave CHECKPOINTTOSAVE
                      A comma-separated list of files and/or directories to
                      be periodically saved to a tarball for checkpointing.
                      Note that those files and directories must be placed
                      in the working directory. None by default
--checkPointToLoad CHECKPOINTTOLOAD
                      The name of the saved tarball for checkpointing. The
                      tarball is given to the evaluation container when the
                      training is resumed, if this option is specified.
                      Otherwise, the tarball is automatically extracted in
                      the working directories
--checkPointInterval CHECKPOINTINTERVAL
                      Frequency to check files for checkpointing in minute.
                      5 by default

To see the latest or full list of options:

phpo --helpGroup ALL


How to submit HPO tasks

There are a couple of concrete examples on this HPO page.

The most important options of phpo are --steeringContainer, --steeringExec, --evaluationContainer, and --evaluationExec, i.e., the container names for steering and evaluation, and what is executed in each container. Here is an example showing what those options look like.

$ cat config_dev.json

{
   "evaluationContainer": "docker://gitlab-registry.cern.ch/zhangruihpc/evaluationcontainer:mlflow",
   "evaluationExec": "bash ./exec_in_container.sh",
   "evaluationMetrics": "metrics.tgz",
   "searchSpaceFile": "search_space_example2.json",
   "steeringExec": "/bin/bash -c \"hpogrid generate --n_point=%NUM_POINTS --max_point=%MAX_POINTS --infile=$PWD/%IN  --outfile=$PWD/%OUT -l=nevergrad\"",
   "steeringContainer": "gitlab-registry.cern.ch/zhangruihpc/steeringcontainer:latest",
   "trainingDS": "user.hoge.my_training_dataset",
}

Note that the execution string for the evaluation container is written in a local file, exec_in_container.sh. All files matching *.json, *.sh, *.py, or *.yaml in the local current directory are automatically sent to the remote working directory, so users don’t have to specify a complicated execution string in --evaluationExec. E.g.

$ cat exec_in_container.sh

export CURRENT_DIR=$PWD
export CALO_DNN_DIR=/ATLASMLHPO/payload/CaloImageDNN
export PYTHONPATH=$PYTHONPATH:$CALO_DNN_DIR/deepcalo
curl -sSL https://cernbox.cern.ch/index.php/s/HfHYEsmJNWiefu3/download | tar -xzvf -;
python $CALO_DNN_DIR/scripts/make_input.py input.json input_new.json
cp -r $CALO_DNN_DIR/exp_scalars $CURRENT_DIR/
python /ATLASMLHPO/payload/CaloImageDNN/run_model.py -i input_new.json --exp_dir $CURRENT_DIR/exp_scalars/ --data_path $CURRENT_DIR/dataset/event100.h5 --rm_bad_reco True --zee_only True -g 0
rm -fr $CURRENT_DIR/exp_scalars/
tar cvfz $CURRENT_DIR/metrics.tgz mlruns/*
rm -fr mlruns dataset
ls $CURRENT_DIR/

The initial search space can be described in a json file.

$ cat search_space_example2.json

{
  "auto_lr": {
    "method": "categorical",
    "dimension": {
      "categories": [
          true,
          false
      ],
      "grid_search": 0
    }
  },
  "batch_size": {
    "method": "uniformint",
    "dimension": {
      "low": 10,
      "high": 30
    }
  },
  "epoch": {
    "method": "uniformint",
    "dimension": {
      "low": 5,
      "high": 10
    }
  },
  "cnn_block_depths_1": {
    "method": "categorical",
    "dimension": {
        "categories": [1, 1, 2],
        "grid_search": 0
    }
  },
  "cnn_block_depths_2": {
    "method": "uniformint",
    "dimension": {
      "low": 1,
      "high": 3
    }
  }
}

Then

phpo --loadJson config_dev.json --site XYZ --outDS user.blah.`uuidgen`

Once tasks are submitted, users can see what’s going on in the system by using PanDA monitor.

If --trainingDS is specified, each PanDA job gets all files in the dataset unless the task is segmented. Segmented HPO is explained later.



FAQ

Protection against bad hyperparameter points

Each hyperparameter point is evaluated at most 3 times. If all attempts time out, the system considers the hyperparameter point hopeless and registers a very large loss so that the task can continue.


Visualization of the search results

It is possible to upload a tarball of metrics files to a grid storage when evaluating each hyperparameter point. For example, the example above uses MLflow for logging parameters and metrics, collects all files under ./mlruns into tarballs, and uploads them to grid storages. The filename of the tarball needs to be specified using the --evaluationMetrics option. The tarballs are registered in the output dataset so that they can be downloaded using the rucio client. It is easy to combine MLflow metrics files. The procedure is as follows:

rucio download --no-subdir <output dataset>
tar xvfz *
tar xvfz metrics*
mlflow ui
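
For those who prefer scripting the extraction step, here is a hedged Python equivalent of the two tar commands above; the script name and the file-name patterns are assumptions based on this example.

$ cat combine_metrics.py

#!/usr/bin/env python
# Hedged Python equivalent of the two tar commands above; the file-name
# patterns are assumptions based on this example.
import glob
import tarfile

# First pass: unpack the tarballs downloaded with rucio.
for name in glob.glob("*.tgz") + glob.glob("*.tar.gz"):
    with tarfile.open(name) as tar:
        tar.extractall()

# Second pass: unpack the metrics tarballs extracted in the first pass.
for name in glob.glob("metrics*"):
    if tarfile.is_tarfile(name):
        with tarfile.open(name) as tar:
            tar.extractall()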

Then accessing http://127.0.0.1:5000 in your browser will show something like the figure below.

../_images/mlflow.png

There is ongoing development to dynamically spin up MLflow services on PanDA monitoring, which would perform the above procedure on behalf of users and centrally provide the MLflow UI.


Relationship between nPointsPerIteration and minUnevaluatedPoints

../_images/n_points.png

The relationship between nPointsPerIteration and minUnevaluatedPoints is illustrated in the figure above. The steering is executed to generate new hyperparameter points every time the number of unevaluated points goes below minUnevaluatedPoints. The number of new points is nPointsPerIteration-minUnevaluatedPoints. The main reason to set minUnevaluatedPoints to a non-zero value is to keep the task running even if some hyperparameter points take a very long time to evaluate.


What “Logged status: skipped since no HP point to evaluate or enough concurrent HPO jobs” means in PanDA monitor

PanDA jobs are generated every 10 minutes when the number of active PanDA jobs is less than nParallelEvaluation and there is at least one unevaluated hyperparameter point. The log message means that there are already enough PanDA jobs running or queued in the system, or that the system has evaluated, or is currently evaluating, all hyperparameter points generated so far. Note that there is a delay before iDDS triggers the next iteration after enough hyperparameter points have been evaluated in the previous iteration.


Checkpointing

If evaluation containers support checkpointing, it is possible to terminate evaluation in the middle and resume it afterward, which is typically useful for running long training on short-lived and/or preemptible resources. Evaluation containers need to

  • periodically produce checkpoint file(s) in the initial working directory, or in sub-directories under it, by using the relevant functions of ML packages as in the Keras example, and

  • resume the training if checkpoint file(s) are available in the initial working directory, otherwise, start a fresh training.

Users can specify the names of the checkpoint files and/or sub-directories using the --checkPointToSave option. The system periodically checks the files and/or sub-directories and saves them in a persistent location if some of them were updated since the previous check cycle. The check interval is defined with --checkPointInterval, which is 5 minutes by default. Note that the total size of the checkpoint files must be less than 100 MB. When PanDA jobs are terminated while evaluating hyperparameter points, they are automatically retried, and the latest saved checkpoint files are provided to the retried PanDA jobs. If the --checkPointToLoad option is specified, the checkpoint files/directories are archived into a tarball that is placed in the initial working directory; otherwise, they are copied to the initial working directory with their original file/directory names.
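
As an illustration, here is a framework-agnostic sketch of the save/resume logic an evaluation script could implement. The script name, the checkpoint file name checkpoint.json, and the dummy training loop are hypothetical; a real payload would normally rely on the checkpointing utilities of its ML framework (e.g., Keras callbacks) instead.

$ cat checkpoint_example.py

#!/usr/bin/env python
# Framework-agnostic sketch of checkpoint save/resume inside an evaluation
# container (illustrative only). The file name checkpoint.json and the dummy
# training loop are hypothetical placeholders.
import json
import os
import time

work_dir = os.getcwd()
ckpt_path = os.path.join(work_dir, "checkpoint.json")  # --checkPointToSave checkpoint.json

# Resume from the checkpoint if one was restored into the working directory;
# otherwise start a fresh training.
if os.path.exists(ckpt_path):
    with open(ckpt_path) as f:
        state = json.load(f)
else:
    state = {"epoch": 0, "loss": None}

for epoch in range(state["epoch"], 10):
    time.sleep(1)  # ... one epoch of (dummy) training ...
    state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}

    # Overwrite the checkpoint in the working directory so that the system
    # can pick it up at every --checkPointInterval cycle.
    with open(ckpt_path, "w") as f:
        json.dump(state, f)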


Segmented HPO

../_images/segment_hpo.png

It is possible to define multiple ML models in a single HPO task and optimize the hyperparameters for each model independently. This is typically useful when you have a target object that can be logically or practically partitioned into sub-objects and want to optimize their ML models in one go. For example, in some use-cases it would be reasonable to logically break down a puppet into several parts, such as arms, body, and legs, but it would be a nightmare to submit an HPO task for each part if there are many of them. Instead, the user would submit a single task for the puppet and let the system split the workload based on the user-defined breakdown instruction, which significantly simplifies bookkeeping from the user’s point of view. It is enough to prepare a single training dataset that contains files for all models, but the user needs to specify how the training dataset is partitioned by using the --segmentSpecFile option.

--segmentSpecFile SEGMENTSPECFILE
                      External json filename to define segments for
                      segmented HPO which has one model for each segment to
                      be optimized independently. The file contains a list of
                      dictionaries {'name': arbitrary_unique_segment_name,
                      'files': [filename_used_for_the_segment_in_the_training_dataset,
                      ... ]}. It is also possible to specify 'datasets' instead of 'files'
                      in those dictionaries if the training dataset has constituent
                      datasets and is partitioned with constituent dataset boundaries.
                      None by default

For example, when a dataset contains file_1, file_2, file_3, …, and file_N, the json would be something like

$ cat seg.json
[
    {
        "files": [
            "file_1",
            "file_3"
        ],
        "name": "name_A"
    },
    {
        "files": [
            "file_2"
        ],
        "name": "name_B"
    }
]

so that there are two segments in the task. Note that it is also possible to specify ‘datasets’ instead of ‘files’ in dictionaries in the json if the training dataset has constituent datasets and is partitioned with constituent dataset boundaries.

Then

phpo --segmentSpecFile seg.json --trainingDS blah ...

The first segment is called name_A, and PanDA jobs for that segment take only file_1 and file_3 from the training dataset, while the second segment is called name_B, and PanDA jobs for that segment take only file_2. It is possible to use %SEGMENT_NAME in --evaluationExec, which is replaced with the actual segment name, such as name_A or name_B, so that the evaluation container can dynamically choose the model relevant to the segment, as shown in the figure below.

../_images/segment_hpo_eval.png

The segment name is prepended to metrics files to show which segment each metrics file contains information for. For example,

"evaluationExec": "python toy.py %SEGMENT_NAME",
"evaluationMetrics": "metrics.tgz",

with those options, PanDA jobs for the first segment would execute the evaluation container with “python toy.py name_A”, so that toy.py can change its configuration based on sys.argv[1], and the system would rename metrics.tgz to name_A.XYZ.metrics.tgz.
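
Note that toy.py is a user-provided script, not something shipped with the system. A hypothetical sketch of what it might look like, reusing the input.json/output.json conventions from the evaluation sketch earlier and inventing the per-segment settings, is:

$ cat toy.py

#!/usr/bin/env python
# Hypothetical toy.py for a segmented HPO task (illustrative only). The
# per-segment settings and the input.json/output.json conventions follow
# the evaluation sketch earlier and are not part of the system.
import json
import os
import sys

segment = sys.argv[1]  # e.g. "name_A" or "name_B", via %SEGMENT_NAME
work_dir = os.getcwd()

# Choose a (hypothetical) model configuration for this segment.
model_config = {
    "name_A": {"layers": 3},
    "name_B": {"layers": 5},
}[segment]

# Read the hyperparameter point and report the loss for this segment's model.
with open(os.path.join(work_dir, "input.json")) as f:
    point = json.load(f)

loss = model_config["layers"] * point["batch_size"]  # dummy objective

with open(os.path.join(work_dir, "output.json"), "w") as f:
    json.dump({"loss": loss}, f)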