Perlmutter (NERSC)

Build and run guidance for ERF on NERSC Perlmutter. For shared build concepts (machine profiles, Cray detection, script reference), see Machine Profiles, Cray Detection, Build Scripts, and Workstation Builds.

Simple GNU Make build using the GNU compilers and CUDA:

# Load environment
module load PrgEnv-gnu cudatoolkit cray-mpich cray-netcdf-hdf5parallel

# Navigate to Exec
cd ${ERF_HOME}/Exec

# Build
make -j4 COMP=gnu USE_MPI=TRUE USE_CUDA=TRUE

This produces an executable like ERF3d.gnu.MPI.CUDA.ex.

CMake build using the provided build script:

# Load environment
source $ERF_HOME/Build/machines/perlmutter_erf.profile

# Configure and build (out-of-source)
mkdir build && cd build
../Build/cmake_with_kokkos_many_cuda.sh

Executable location: build/Exec/erf_exec (or install/bin/erf_exec if installed)

Or configure manually with CMake from the build directory:

cmake -DCMAKE_BUILD_TYPE=Release \
      -DERF_ENABLE_MPI=ON \
      -DERF_ENABLE_CUDA=ON \
      -DERF_ENABLE_NETCDF=ON \
      -DERF_ENABLE_RRTMGP=ON \
      ..
make -j4
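
A CMake build can leave the executable in either of the two locations listed above, so a run script may want to check both. The following is a minimal sketch, not part of the ERF scripts: `erf_exec_path` is a hypothetical helper, and it assumes both paths are relative to the ERF source root and that a manual configuration uses the same layout as the build script.

```shell
# Hypothetical helper: print the first erf_exec found under the given ERF root,
# checking the build tree first and the install tree second.
# Assumption: both locations are relative to the ERF source root.
erf_exec_path() {
  root=${1:-$ERF_HOME}
  for candidate in "$root/build/Exec/erf_exec" "$root/install/bin/erf_exec"; do
    if [ -x "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  echo "erf_exec not found under $root" >&2
  return 1
}
```

For example, `EXE=$(erf_exec_path "$ERF_HOME") || exit 1` stops a script early when neither file exists.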

This example runs ERF on 4 nodes (16 MPI ranks, one per GPU) with GPU-aware MPI enabled.

Before submitting, load the environment:

source $ERF_HOME/Build/machines/perlmutter_erf.profile

# Run from scratch filesystem with executable and inputs in same directory
mkdir -p $PSCRATCH/ERF/rundir
cd $PSCRATCH/ERF/rundir

# Verify paths before launching
ls -lh ./ERF3d.*.ex inputs
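
The `ls` check above can also be written as a function that fails loudly, which is convenient at the top of a job script. This is a sketch under stated assumptions: `preflight` is a hypothetical name, and it assumes the executable matches `ERF3d.*.ex` and the inputs file is literally named `inputs`, as in the commands above.

```shell
# Hypothetical helper: succeed only if an executable ERF3d.*.ex and an
# inputs file are both present in the given run directory.
preflight() {
  dir=${1:-.}
  found=1
  for exe in "$dir"/ERF3d.*.ex; do
    if [ -x "$exe" ]; then
      found=0
      break
    fi
  done
  if [ "$found" -ne 0 ]; then
    echo "no executable ERF3d.*.ex in $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/inputs" ]; then
    echo "no inputs file in $dir" >&2
    return 1
  fi
  return 0
}
```

Calling `preflight "$PSCRATCH/ERF/rundir" || exit 1` before `srun` avoids spending queue time on a job that cannot start.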

Job submission script:

#!/bin/bash
#SBATCH --account=m4106_g
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --gpu-bind=none
#SBATCH --time=00:30:00
#SBATCH --constraint=gpu&hbm40g
#SBATCH --job-name=ERF
#SBATCH --output=erf_%j.out

# GPU-aware MPI optimizations
export MPICH_OFI_NIC_POLICY=GPU
export MPICH_GPU_SUPPORT_ENABLED=1
export SLURM_CPU_BIND="cores"

# Launch with CUDA device ordering
srun -n 16 --cpus-per-task=4 --cpu-bind=cores bash -c "
  export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID));
  ./ERF3d.gnu.MPI.CUDA.ex inputs amrex.use_gpu_aware_mpi=1"

Submit with: sbatch job_script.sh
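
Two pieces of arithmetic in the job script are easy to get wrong when the node count changes: the reversed device index `3-SLURM_LOCALID` and the rank count passed to `srun -n`. The sketch below spells both out. The function names `gpu_for_local_rank` and `total_ranks` are hypothetical, and the default of 4 GPUs per node is taken from the `--gpus-per-node=4` directive above.

```shell
# Sketch of the arithmetic used in the job script; helper names are hypothetical.

# Map a node-local rank (0..gpus-1) to a device index in reverse order,
# matching CUDA_VISIBLE_DEVICES=$((3-SLURM_LOCALID)) for 4 GPUs per node.
gpu_for_local_rank() {
  local_id=$1
  gpus_per_node=${2:-4}
  echo $(( gpus_per_node - 1 - local_id ))
}

# Total MPI ranks = nodes * tasks per node; this is the value given to srun -n.
total_ranks() {
  echo $(( $1 * $2 ))
}

for id in 0 1 2 3; do
  echo "local rank $id -> CUDA device $(gpu_for_local_rank "$id")"
done
echo "srun -n $(total_ranks 4 4)"
```

With 4 nodes and 4 tasks per node this gives the `-n 16` used in the script. Local ranks 0 through 3 map to devices 3 through 0.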

To target the 256 nodes whose GPUs each have 80 GB of HBM, replace:

#SBATCH --constraint=gpu&hbm40g

with:

#SBATCH --constraint=gpu&hbm80g
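
The same substitution can be scripted when several job scripts need an 80 GB variant. This is a minimal sketch: `to_hbm80g` is a hypothetical helper, and it assumes the directive is written exactly as shown above.

```shell
# Hypothetical helper: print a copy of a job script with the 40 GB constraint
# swapped for the 80 GB one, leaving the original file untouched.
to_hbm80g() {
  sed 's/^\(#SBATCH --constraint=gpu&\)hbm40g$/\1hbm80g/' "$1"
}

# Usage (assumes job_script.sh is in the current directory):
#   to_hbm80g job_script.sh > job_script_80g.sh
```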

Advanced: AMReX Scaling Tests

Production script from the AMReX FFT scaling repository that demonstrates the NIC pinning optimization:

#!/bin/bash
#SBATCH --account=mp111_g
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=none
#SBATCH --time=00:10:00
#SBATCH --constraint=gpu&hbm40g
#SBATCH --qos=debug

export MPICH_GPU_SUPPORT_ENABLED=1
export SLURM_CPU_BIND="cores"

srun -n 16 ../../old-order.ex amrex.use_gpu_aware_mpi=1 \
                    n_cell_x=1024 n_cell_y=512 n_cell_z=512 >& run-4-oldorder.ou
srun -n 16 ../../new-order.ex amrex.use_gpu_aware_mpi=1 \
                    n_cell_x=1024 n_cell_y=512 n_cell_z=512 >& run-4-neworder.ou

export MPICH_OFI_NIC_POLICY=GPU

srun -n 16 ../../old-order.ex amrex.use_gpu_aware_mpi=1 \
                    n_cell_x=1024 n_cell_y=512 n_cell_z=512 >& run-4-oldorder-nci.ou
srun -n 16 ../../new-order.ex amrex.use_gpu_aware_mpi=1 \
                    n_cell_x=1024 n_cell_y=512 n_cell_z=512 >& run-4-neworder-nci.ou

Key features:

  • Uses --gpus-per-task=1 for fine-grained GPU binding

  • Compares default NIC policy vs MPICH_OFI_NIC_POLICY=GPU

  • Demonstrates multiple runs with different configurations
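
The four runs in the script form a 2x2 comparison (two executables, NIC policy unset or `GPU`). The dry-run sketch below prints one command per combination and launches nothing. `print_runs` is a hypothetical name, `env -u` is assumed to be available (it is in GNU coreutils), and the problem-size arguments and output redirections from the script are omitted for brevity.

```shell
# Dry-run sketch: print one srun command per executable/NIC-policy pair.
# "default" means MPICH_OFI_NIC_POLICY is left unset for that run.
print_runs() {
  for policy in default GPU; do
    for exe in old-order.ex new-order.ex; do
      if [ "$policy" = "default" ]; then
        prefix="env -u MPICH_OFI_NIC_POLICY"
      else
        prefix="env MPICH_OFI_NIC_POLICY=$policy"
      fi
      echo "$prefix srun -n 16 ../../$exe amrex.use_gpu_aware_mpi=1"
    done
  done
}

print_runs
```

Setting the policy per command, rather than exporting it partway through the script, keeps each run independent of the order in which the runs appear.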

WarpX Production Script

Reference implementation from the WarpX project:

#!/bin/bash -l

# Copyright 2021-2023 Axel Huebl, Kevin Gott
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL

#SBATCH -t 00:10:00
#SBATCH -N 2
#SBATCH -J WarpX
#    note: <proj> must end on _g
#SBATCH -A <proj>
#SBATCH -q regular
# A100 40GB (most nodes)
#SBATCH -C gpu
# A100 80GB (256 nodes)
#S BATCH -C gpu&hbm80g
#SBATCH --exclusive
#SBATCH --cpus-per-task=32
# ideally single:1, but NERSC cgroups issue
#SBATCH --gpu-bind=none
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH -o WarpX.o%j
#SBATCH -e WarpX.e%j

# python interpreter & script here
EXE=python3
INPUTS=run_script.py
# or executable & inputs file
#EXE=./warpx
#INPUTS=inputs

# environment setup
if [[ -z "${MY_PROFILE}" ]]; then
    echo "WARNING: FORGOT TO"
    echo "   source $HOME/perlmutter_gpu_warpx.profile"
    echo "before submission. Doing that now."

    source $HOME/perlmutter_gpu_warpx.profile
fi

# pin to closest NIC to GPU
export MPICH_OFI_NIC_POLICY=GPU

# threads for OpenMP and threaded compressors per MPI rank
#   note: 16 avoids hyperthreading (32 virtual cores, 16 physical)
export OMP_NUM_THREADS=16

# GPU-aware MPI optimizations
export AMREX_DEFAULT_INIT="amrex.use_gpu_aware_mpi=1"

# CUDA visible devices are ordered inverse to local task IDs
#   Reference: nvidia-smi topo -m
srun --cpu-bind=cores bash -c "
    export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID));
    ${EXE} ${INPUTS}" \
  > output.txt
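
The script hard-codes `OMP_NUM_THREADS=16` to match `--cpus-per-task=32`. If the CPU count per task changes, the thread count can be derived instead. This is a sketch, assuming two hardware threads per physical core as stated in the script's comment: `omp_threads_for` is a hypothetical helper, and `SLURM_CPUS_PER_TASK` is the variable Slurm sets when `--cpus-per-task` is given.

```shell
# Hypothetical helper: one OpenMP thread per physical core, assuming
# two hardware threads per core (32 virtual cores -> 16 physical).
omp_threads_for() {
  cpus_per_task=${1:-32}
  echo $(( cpus_per_task / 2 ))
}

# Fall back to 32 CPUs per task when run outside a Slurm allocation.
export OMP_NUM_THREADS=$(omp_threads_for "${SLURM_CPUS_PER_TASK:-32}")
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```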