Slurm Script Arguments

These scripts are used to launch training runs on a Slurm-managed cluster. It takes several arguments to configure the training process.

NOTE: there are several placholders throughout the run scripts that should be replaced appropriately.

Arguments:

G_INTRA_R=$1: The G_x part of the config.
G_INTRA_C=$2: The G_y part of the config.
G_INTRA_D=$3: The G_z part of the config.
TRIAL_NUM=$4: Trial number used in the output filename.
TRAIN_FILE=$5: Path to train.py file (e.g., from Plexus repository's examples directory).
DATA_DIR=$6: Path to directory containing the preprocessed data (unpartitioned or partitioned).
RESULT_DIR=$7: Directory to save output file to.

For running ogbn-papers100M on 64 and 128 GPUs on Perlmutter, request the 80GB GPU nodes:

#SBATCH -C gpu&hbm80g

For GPU OOM issues (fragmentation warnings from PyTorch), try:

export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
get_rank.sh		get_rank.sh
run_frontier.sh		run_frontier.sh
run_perlmutter.sh		run_perlmutter.sh