Frontier scripts for DeepCAM #3

Draft: wants to merge 17 commits into base: develop
62 changes: 59 additions & 3 deletions AMG2023/README.md
@@ -1,9 +1,9 @@
# AMG2023 README
For more detailed installation parameters, please refer to the [installation document](https://github.com/pssg-int/AMG2023/blob/main/amg-doc.pdf).

## Perlmutter Compilation
Repository: [AMG2023](https://github.com/hpcgroup/AMG2023/)

Repository: [AMG2023](https://github.com/pssg-int/AMG2023)
## Perlmutter Compilation

### Steps to Compile

@@ -50,5 +50,61 @@ Repository: [AMG2023](https://github.com/pssg-int/AMG2023)
cmake -DHYPRE_PREFIX=/pscratch/sd/c/cunyang/AMG2023 ..
```

## Frontier Installation
## Frontier Compilation

### Steps to Compile

1. Load modules
```sh
module reset

module load cray-mpich/8.1.30
module load craype-accel-amd-gfx90a
module load rocm/6.1.3
export MPICH_GPU_SUPPORT_ENABLED=1

# load compatible cmake version
module load Core/24.07
module load cmake/3.27.9
```
2. Configure hypre (v2.32.0)
- Clone hypre v2.32.0 and navigate to src:
```sh
git clone -b v2.32.0 https://github.com/hypre-space/hypre.git
cd hypre/src
```
- Configure hypre (in hypre/src)
```sh
./configure --with-hip --enable-device-memory-pool --enable-mixedint --with-gpu-arch=gfx90a \
--with-MPI-lib-dirs="${MPICH_DIR}/lib" --with-MPI-libs="mpi" \
--with-MPI-include="${MPICH_DIR}/include" \
CFLAGS="-I${ROCM_PATH}/include/ -I${ROCM_PATH}/llvm/include/ -I${ROCM_PATH}/include/rocsparse/" \
LDFLAGS="-L${ROCM_PATH}/lib/ -L${ROCM_PATH}/llvm/lib/ -lrocsparse"
```
- Compile hypre (in hypre/src)
```sh
# build with make
make
```
3. Configure AMG2023
- Clone repo:
```sh
git clone https://github.com/pssg-int/AMG2023
cd AMG2023
```
- Add mpiP to LD_LIBRARY_PATH
```sh
export LD_LIBRARY_PATH=/ccs/home/keshprad/mpiP:$LD_LIBRARY_PATH
```
- Configure cmake
```sh
mkdir build && cd build

cmake .. -DHYPRE_PREFIX=/ccs/home/keshprad/hypre/src/hypre/ \
-DCMAKE_C_FLAGS="-I${ROCM_PATH}/include/ -I${ROCM_PATH}/llvm/include/ -I${ROCM_PATH}/include/rocsparse/" \
-DCMAKE_EXE_LINKER_FLAGS="-L${ROCM_PATH}/lib/ -L${ROCM_PATH}/llvm/lib/ -lrocsparse -lrocrand"
```
- Compile AMG2023 (in AMG2023/build)
```sh
make install
```
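After `make install`, a quick check that the build produced the executable can save a failed batch submission. A minimal sketch, assuming the binary lands at `build/amg` under the AMG2023 checkout (the path the run scripts in this PR invoke); the function name `check_amg_build` is hypothetical:

```sh
# Hypothetical helper: confirm the AMG2023 executable exists and is executable.
# Assumes the binary is at <AMG2023 root>/build/amg, as used by the run scripts.
check_amg_build() {
    app_root="$1"
    if [ -x "${app_root}/build/amg" ]; then
        echo "found ${app_root}/build/amg"
    else
        echo "missing ${app_root}/build/amg" >&2
        return 1
    fi
}

# Example: check_amg_build "$HOME/AMG2023"
```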
57 changes: 57 additions & 0 deletions AMG2023/run_frontier_16.sh
@@ -0,0 +1,57 @@
#!/bin/bash
#SBATCH -N 16
#SBATCH -n 128
#SBATCH -q normal
#SBATCH -J amg
#SBATCH --gpu-bind none
#SBATCH -t 00:30:00
#SBATCH -A csc569
#SBATCH --output /lustre/orion/csc569/scratch/keshprad/perfvar/AMG2023_logs/16nodes/%x-%j/output-AMG2023.log
#SBATCH --error /lustre/orion/csc569/scratch/keshprad/perfvar/AMG2023_logs/16nodes/%x-%j/error-AMG2023.log
#SBATCH --exclusive
# Run like: sbatch run_frontier_16.sh

OUTPUT_DIR=/lustre/orion/csc569/scratch/keshprad/perfvar/AMG2023_logs/16nodes/$SLURM_JOB_NAME-$SLURM_JOB_ID
OUTPUT_FILE=$OUTPUT_DIR/output-AMG2023.log
ERROR_FILE=$OUTPUT_DIR/error-AMG2023.log

# Run gpu benchmarks
COMM_TYPE=mpi
ROCM_VERSION=6.1.3
PERF_VARIABILITY_ROOT=/ccs/home/keshprad/perf-variability
echo running allreduce benchmark
bash $PERF_VARIABILITY_ROOT/gpu-benchmarks/allreduce/run_frontier.sh $COMM_TYPE $ROCM_VERSION $SLURM_JOB_NUM_NODES $OUTPUT_DIR
# echo running allgather benchmark
# bash $PERF_VARIABILITY_ROOT/gpu-benchmarks/allgather/run_frontier.sh $COMM_TYPE $ROCM_VERSION $SLURM_JOB_NUM_NODES $OUTPUT_DIR
echo running gemm benchmark
bash $PERF_VARIABILITY_ROOT/gpu-benchmarks/gemm/run_frontier.sh $ROCM_VERSION $SLURM_JOB_NUM_NODES $OUTPUT_DIR

APP_ROOT=/ccs/home/keshprad/AMG2023
cd $APP_ROOT

# reset modules
echo resetting modules:
module reset
# load modules
echo loading modules:
module load cray-mpich/8.1.30
module load craype-accel-amd-gfx90a
module load rocm/6.1.3

export MPICH_GPU_SUPPORT_ENABLED=1
export CRAY_ACCEL_TARGET=gfx90a
export HYPRE_INSTALL_DIR=/ccs/home/keshprad/hypre/src/hypre/
# mpiP
export LD_LIBRARY_PATH=/ccs/home/keshprad/mpiP:$LD_LIBRARY_PATH
export MPIP="-o -f $OUTPUT_DIR"

# log start date
echo start AMG2023: $(date)
# define command
cmd="srun --output $OUTPUT_FILE --error $ERROR_FILE \
./build/amg -P 4 4 8 -n 128 64 64 -problem 1 -iter 500"
echo solving:
echo $cmd
$cmd
# log end date
echo end AMG2023: $(date)
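The `-P 4 4 8` processor grid is expected to multiply out to the number of MPI ranks requested with `#SBATCH -n` (4 * 4 * 8 = 128 here; the 64-node script uses 8 * 8 * 8 = 512). A small guard along these lines could catch a mismatch before `srun` is launched; `grid_matches_ranks` is a hypothetical helper, not part of AMG2023:

```bash
# Hypothetical guard: check that the -P grid Px*Py*Pz equals the requested rank count.
grid_matches_ranks() {
    ranks=$1
    px=$2
    py=$3
    pz=$4
    product=$(( px * py * pz ))
    if [ "$product" -ne "$ranks" ]; then
        echo "grid ${px}x${py}x${pz} = ${product}, expected ${ranks}" >&2
        return 1
    fi
    echo "grid ${px}x${py}x${pz} matches ${ranks} ranks"
}

# Example with the values from this script:
# grid_matches_ranks 128 4 4 8
```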
57 changes: 57 additions & 0 deletions AMG2023/run_frontier_64.sh
@@ -0,0 +1,57 @@
#!/bin/bash
#SBATCH -N 64
#SBATCH -n 512
#SBATCH -q normal
#SBATCH -J amg
#SBATCH --gpu-bind none
#SBATCH -t 00:30:00
#SBATCH -A csc569
#SBATCH --output /lustre/orion/csc569/scratch/keshprad/perfvar/AMG2023_logs/64nodes/%x-%j/output-AMG2023.log
#SBATCH --error /lustre/orion/csc569/scratch/keshprad/perfvar/AMG2023_logs/64nodes/%x-%j/error-AMG2023.log
#SBATCH --exclusive
# Run like: sbatch run_frontier_64.sh

OUTPUT_DIR=/lustre/orion/csc569/scratch/keshprad/perfvar/AMG2023_logs/64nodes/$SLURM_JOB_NAME-$SLURM_JOB_ID
OUTPUT_FILE=$OUTPUT_DIR/output-AMG2023.log
ERROR_FILE=$OUTPUT_DIR/error-AMG2023.log

# Run gpu benchmarks
COMM_TYPE=mpi
ROCM_VERSION=6.1.3
PERF_VARIABILITY_ROOT=/ccs/home/keshprad/perf-variability
echo running allreduce benchmark
bash $PERF_VARIABILITY_ROOT/gpu-benchmarks/allreduce/run_frontier.sh $COMM_TYPE $ROCM_VERSION $SLURM_JOB_NUM_NODES $OUTPUT_DIR
# echo running allgather benchmark
# bash $PERF_VARIABILITY_ROOT/gpu-benchmarks/allgather/run_frontier.sh $COMM_TYPE $ROCM_VERSION $SLURM_JOB_NUM_NODES $OUTPUT_DIR
echo running gemm benchmark
bash $PERF_VARIABILITY_ROOT/gpu-benchmarks/gemm/run_frontier.sh $ROCM_VERSION $SLURM_JOB_NUM_NODES $OUTPUT_DIR

APP_ROOT=/ccs/home/keshprad/AMG2023
cd $APP_ROOT

# reset modules
echo resetting modules:
module reset
# load modules
echo loading modules:
module load cray-mpich/8.1.30
module load craype-accel-amd-gfx90a
module load rocm/6.1.3

export MPICH_GPU_SUPPORT_ENABLED=1
export CRAY_ACCEL_TARGET=gfx90a
export HYPRE_INSTALL_DIR=/ccs/home/keshprad/hypre/src/hypre/
# mpiP
export LD_LIBRARY_PATH=/ccs/home/keshprad/mpiP:$LD_LIBRARY_PATH
export MPIP="-o -f $OUTPUT_DIR"

# log start date
echo start AMG2023: $(date)
# define command
cmd="srun --output $OUTPUT_FILE --error $ERROR_FILE \
./build/amg -P 8 8 8 -n 128 64 64 -problem 1 -iter 500"
echo solving:
echo $cmd
$cmd
# log end date
echo end AMG2023: $(date)
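The 16- and 64-node scripts differ only in node count, rank count, log directory, and the `-P` grid; both request 8 ranks per node (128/16 and 512/64). A sketch of deriving the rank count and grid from the node count, assuming that ratio and the two grids used in this PR stay fixed; `amg_layout` is a hypothetical helper:

```bash
# Hypothetical helper: print "<ranks> <Px> <Py> <Pz>" for a supported node count.
# The 8 ranks per node and the two grids are taken from the scripts in this PR.
amg_layout() {
    case "$1" in
        16) grid="4 4 8" ;;
        64) grid="8 8 8" ;;
        *)  echo "unsupported node count: $1" >&2; return 1 ;;
    esac
    echo "$(( $1 * 8 )) $grid"
}

# Example: amg_layout 16   prints "128 4 4 8"
```

Other node counts are rejected on purpose, since the source only specifies grids for 16 and 64 nodes.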
19 changes: 19 additions & 0 deletions AMG2023/run_frontier_crontab.sh
@@ -0,0 +1,19 @@
#!/bin/bash
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <number_of_nodes>"
    exit 1
fi
# `16` or `64`
NUM_NODES=$1

PERF_VARIABILITY_ROOT=/ccs/home/keshprad/perf-variability

# load lmod
source /usr/share/lmod/lmod/init/bash
# load default LMOD_SYSTEM_DEFAULT_MODULES and MODULEPATH
export LMOD_SYSTEM_DEFAULT_MODULES=craype-x86-trento:craype-network-ofi:perftools-base:xpmem:cray-pmi:PrgEnv-cray:DefApps
export MODULEPATH=/sw/frontier/spack-envs/modules/cce/17.0.0/cray-mpich-8.1.28/cce-17.0.0:/sw/frontier/spack-envs/modules/cce/17.0.0/cce-17.0.0:/sw/frontier/spack-envs/modules/Core/24.07:/opt/cray/pe/lmod/modulefiles/mpi/crayclang/17.0/ofi/1.0/cray-mpich/8.0:/opt/cray/pe/lmod/modulefiles/comnet/crayclang/17.0/ofi/1.0:/opt/cray/pe/lmod/modulefiles/compiler/crayclang/17.0:/opt/cray/pe/lmod/modulefiles/mix_compilers:/opt/cray/pe/lmod/modulefiles/perftools/23.12.0:/opt/cray/pe/lmod/modulefiles/net/ofi/1.0:/opt/cray/pe/lmod/modulefiles/cpu/x86-trento/1.0:/opt/cray/pe/modulefiles/Linux:/opt/cray/pe/modulefiles/Core:/opt/cray/pe/lmod/lmod/modulefiles/Core:/opt/cray/pe/lmod/modulefiles/core:/opt/cray/pe/lmod/modulefiles/craype-targets/default:/sw/frontier/modulefiles:/opt/cray/modulefiles

# run sbatch script
script=$PERF_VARIABILITY_ROOT/AMG2023/run_frontier_${NUM_NODES}.sh
sbatch "$script"
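Because the wrapper builds the script path from its argument, a typo in the node count only surfaces when `sbatch` fails. A possible guard, assuming only the run scripts added in this PR exist under `AMG2023/`; `resolve_run_script` is a hypothetical name:

```bash
# Hypothetical helper: map a node count to its sbatch script and verify the file exists.
resolve_run_script() {
    root=$1
    nodes=$2
    script="${root}/AMG2023/run_frontier_${nodes}.sh"
    if [ ! -f "$script" ]; then
        echo "no run script for ${nodes} nodes: ${script}" >&2
        return 1
    fi
    echo "$script"
}

# Example (sbatch is only called when the script is found):
# script=$(resolve_run_script "$PERF_VARIABILITY_ROOT" "$NUM_NODES") && sbatch "$script"
```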
131 changes: 131 additions & 0 deletions DeepCAM/README.md
@@ -0,0 +1,131 @@
# DeepCAM README
For more detailed installation parameters, please refer to the DeepCAM install guide.

Perlmutter Repository: [hpc_results_v3.0](https://github.com/hpcgroup/hpc_results_v3.0)
Frontier Repository: [hpc](https://github.com/hpcgroup/hpc)


## Perlmutter Setup

### Setup steps

## Frontier Setup

### Setup steps

#### 1. PyTorch Install
- Load modules
```bash
module reset
module load PrgEnv-gnu/8.5.0
module load rocm/6.1.3
module load craype-accel-amd-gfx90a
module load cray-python/3.9.13.1
```
- Create env variables
```bash
DEEPCAM_ROOT=/lustre/orion/csc569/scratch/keshprad/deepcam/
PYVENV_ROOT=${DEEPCAM_ROOT}/.venv
PYVENV_SITEPKGS=${PYVENV_ROOT}/lib/python3.9/site-packages

cd ${DEEPCAM_ROOT}
```
- Create python virtual env
```bash
python -m venv ${PYVENV_ROOT}
source ${PYVENV_ROOT}/bin/activate
```
- Install torch and mpi4py
```bash
# torch==2.5.0
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/rocm6.1

MPICC="cc -shared" pip install --no-cache-dir --no-binary=mpi4py mpi4py
```
- Install AWS-OFI-RCCL plugin
```bash
mkdir -p ${DEEPCAM_ROOT}/repos
cd ${DEEPCAM_ROOT}/repos

rocm_version=6.1.3
# Load modules
module load PrgEnv-gnu/8.5.0
module load rocm/$rocm_version
module load craype-accel-amd-gfx90a
module load gcc-native/12.3
module load cray-mpich/8.1.30
#module load libtool
libfabric_path=/opt/cray/libfabric/1.15.2.0

# Download the plugin repo
git clone --recursive https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl
cd aws-ofi-rccl

# Build the plugin
./autogen.sh
export LD_LIBRARY_PATH=/opt/rocm-$rocm_version/hip/lib:$LD_LIBRARY_PATH
PLUG_PREFIX=$PWD

CC=hipcc CFLAGS=-I/opt/rocm-$rocm_version/rccl/include ./configure \
--with-libfabric=$libfabric_path --with-rccl=/opt/rocm-$rocm_version --enable-trace \
--prefix=$PLUG_PREFIX --with-hip=/opt/rocm-$rocm_version/hip --with-mpi=$MPICH_DIR

make
make install

# Reminder to export the plugin to your path
echo $PLUG_PREFIX
echo "Add the following line in the environment to use the AWS OFI RCCL plugin"
echo "export LD_LIBRARY_PATH=${PLUG_PREFIX}/lib:\$LD_LIBRARY_PATH"
```
- Install supporting dependencies
```bash
cd ${DEEPCAM_ROOT}

pip install wandb
pip install gym
pip install pyspark
pip install scikit-learn
pip install scikit-image
pip install opencv-python
pip install wheel
pip install tomli
pip install h5py

# tensorboard
pip install tensorboard
pip install tensorboard_plugin_profile
pip install tensorboard-plugin-wit
pip install tensorboard-pytorch

pip install git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git
```
- Install mlperf-logging
```bash
mkdir -p ${DEEPCAM_ROOT}/repos
cd ${DEEPCAM_ROOT}/repos

git clone -b hpc-1.0-branch https://github.com/mlcommons/logging mlperf-logging
# may need to manually change mlperf-logging/VERSION to a valid version number (e.g. 1.0.0.rc2)
pip install -e mlperf-logging

rm ${PYVENV_SITEPKGS}/mlperf-logging.egg-link
cp -r ./mlperf-logging/mlperf_logging ${PYVENV_SITEPKGS}/mlperf_logging
cp -r ./mlperf-logging/mlperf_logging.egg-info ${PYVENV_SITEPKGS}/mlperf_logging.egg-info
```
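The manual `VERSION` edit mentioned in the mlperf-logging step can be scripted. A minimal sketch, assuming a simplified pattern for what counts as a valid version (digits and dots with an optional `rc` suffix) and using the `1.0.0.rc2` example as the fallback; `fix_version_file` is a hypothetical helper:

```bash
# Hypothetical helper: replace the contents of a VERSION file with a fallback
# when it does not look like a version number. The regex is a simplification,
# not the full Python packaging version grammar.
fix_version_file() {
    version_file=$1
    fallback=${2:-1.0.0.rc2}
    if grep -Eq '^[0-9]+(\.[0-9]+)*(\.?rc[0-9]+)?$' "$version_file"; then
        echo "kept $(cat "$version_file")"
    else
        echo "$fallback" > "$version_file"
        echo "replaced with $fallback"
    fi
}

# Example, run from ${DEEPCAM_ROOT}/repos before `pip install -e mlperf-logging`:
# fix_version_file mlperf-logging/VERSION
```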

#### 2. Download source code
- Download from PSSG Frontier repo for DeepCAM (linked at top of README)
```bash
# REPLACE WITH YOUR PATH
PRFX=/lustre/orion/csc569/scratch/keshprad
DEEPCAM_ROOT=${PRFX}/deepcam

mkdir -p ${DEEPCAM_ROOT}
cd ${DEEPCAM_ROOT}

git clone https://github.com/hpcgroup/hpc.git hpc
```

#### 3. Download dataset with Globus
- [Globus Link](https://app.globus.org/file-manager?origin_id=0b226e2c-4de0-11ea-971a-021304b0cca7&origin_path=%2F)
- Download to `$DEEPCAM_ROOT/data`