Perlmutter (NERSC)
Build and run guidance for ERF on NERSC Perlmutter. For shared build concepts (machine profiles, Cray detection, script reference), see Machine Profiles, Cray Detection, Build Scripts, and Workstation Builds.
Simple build with GNU Make, using the GNU compiler and CUDA:
# Load environment
module load PrgEnv-gnu cudatoolkit cray-mpich cray-netcdf-hdf5parallel
# Navigate to Exec
cd ${ERF_HOME}/Exec
# Build
make -j4 COMP=gnu USE_MPI=TRUE USE_CUDA=TRUE
This produces an executable like ERF3d.gnu.MPI.CUDA.ex.
CMake build using the provided build script:
# Load environment
source $ERF_HOME/Build/machines/perlmutter_erf.profile
# Configure and build (out-of-source)
mkdir build && cd build
../Build/cmake_with_kokkos_many_cuda.sh
Executable location: build/Exec/erf_exec (or install/bin/erf_exec if installed)
Or configure manually with CMake from the build directory:
cmake -DCMAKE_BUILD_TYPE=Release \
      -DERF_ENABLE_MPI=ON \
      -DERF_ENABLE_CUDA=ON \
      -DERF_ENABLE_NETCDF=ON \
      -DERF_ENABLE_RRTMGP=ON \
      ..
make -j4
This example runs ERF on 4 GPU nodes (16 MPI ranks, one per GPU) with GPU-aware MPI enabled.
Before submitting, load the environment:
source $ERF_HOME/Build/machines/perlmutter_erf.profile
# Run from scratch filesystem with executable and inputs in same directory
mkdir -p $PSCRATCH/ERF/rundir
cd $PSCRATCH/ERF/rundir
# Verify paths before launching
ls -lh ./ERF3d.*.ex inputs
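The path check above can also be scripted so that a missing file stops the submission. The following is a minimal sketch, not part of ERF: the function name check_run_files is hypothetical, and the file names in the commented example assume the GNU Make build described earlier.

```shell
#!/bin/bash
# Hypothetical helper (not provided by ERF): succeed only if the
# executable exists and is executable, and the inputs file exists.
check_run_files() {
    local exe="$1" inputs="$2"
    if [[ ! -x "$exe" ]]; then
        echo "missing or non-executable: $exe" >&2
        return 1
    fi
    if [[ ! -f "$inputs" ]]; then
        echo "missing inputs file: $inputs" >&2
        return 1
    fi
    echo "ok: $exe $inputs"
}

# Example call from the run directory (names assumed from the Make build):
# check_run_files ./ERF3d.gnu.MPI.CUDA.ex inputs && sbatch job_script.sh
```

Chaining the check with && means sbatch is only reached when both files are in place.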
Job submission script:
#!/bin/bash
#SBATCH --account=m4106_g
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --gpu-bind=none
#SBATCH --time=00:30:00
#SBATCH --constraint=gpu&hbm40g
#SBATCH --job-name=ERF
#SBATCH --output=erf_%j.out
# GPU-aware MPI optimizations
export MPICH_OFI_NIC_POLICY=GPU
export MPICH_GPU_SUPPORT_ENABLED=1
export SLURM_CPU_BIND="cores"
# Launch with CUDA device ordering
srun -n 16 --cpus-per-task=4 --cpu-bind=cores bash -c "
    export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID));
    ./ERF3d.gnu.MPI.CUDA.ex inputs amrex.use_gpu_aware_mpi=1"
Submit with: sbatch job_script.sh
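Two pieces of arithmetic in the job script are easy to get wrong when the node count changes: the rank count passed to srun -n, and the inverse device ordering. The sketch below reproduces both outside of Slurm. The variable and function names are illustrative only, and the values are copied from the #SBATCH lines above.

```shell
#!/bin/bash
# Total MPI ranks = nodes x tasks per node (values from the job script).
nodes=4
tasks_per_node=4
total_ranks=$((nodes * tasks_per_node))
echo "srun -n ${total_ranks}"

# Each rank selects device (3 - local ID), so local IDs 0..3 map to
# devices 3..0. gpu_for_local_id is an illustrative name.
gpu_for_local_id() {
    echo $((3 - $1))
}

for local_id in 0 1 2 3; do
    echo "SLURM_LOCALID=${local_id} -> CUDA_VISIBLE_DEVICES=$(gpu_for_local_id "${local_id}")"
done
```

If --nodes is changed, the value given to srun -n has to change with it. The device mapping is per node and stays the same.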
To target the 256 nodes whose GPUs have 80 GB of HBM each, replace:
#SBATCH --constraint=gpu&hbm40g
with:
#SBATCH --constraint=gpu&hbm80g
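The same substitution can be applied to a copy of the job script rather than by hand. This is a sketch under assumptions: make_hbm80_script is a hypothetical helper, and it only matches a constraint line written exactly as shown above.

```shell
#!/bin/bash
# Hypothetical helper: write a copy of a job script that requests the
# 80 GB GPU nodes. In sed, "&" is literal in the pattern but means
# "whole match" in the replacement, so the prefix is captured as \1
# instead of being retyped in the replacement.
make_hbm80_script() {
    local src="$1" dst="$2"
    sed 's/^\(#SBATCH --constraint=gpu&\)hbm40g$/\1hbm80g/' "$src" > "$dst"
}

# Example (file names are illustrative):
# make_hbm80_script job_script.sh job_script_80g.sh
```

Alternatively, options given on the sbatch command line take precedence over #SBATCH directives in the script, so sbatch --constraint="gpu&hbm80g" job_script.sh should have the same effect. The quotes are needed there because an unquoted & is interpreted by the shell.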
Advanced: AMReX Scaling Tests
Production script from the AMReX FFT scaling repository, demonstrating the NIC pinning optimization:
#!/bin/bash
#SBATCH --account=mp111_g
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=none
#SBATCH --time=00:10:00
#SBATCH --constraint=gpu&hbm40g
#SBATCH --qos=debug
export MPICH_GPU_SUPPORT_ENABLED=1
export SLURM_CPU_BIND="cores"
srun -n 16 ../../old-order.ex amrex.use_gpu_aware_mpi=1 \
    n_cell_x=1024 n_cell_y=512 n_cell_z=512 >& run-4-oldorder.ou
srun -n 16 ../../new-order.ex amrex.use_gpu_aware_mpi=1 \
    n_cell_x=1024 n_cell_y=512 n_cell_z=512 >& run-4-neworder.ou
export MPICH_OFI_NIC_POLICY=GPU
srun -n 16 ../../old-order.ex amrex.use_gpu_aware_mpi=1 \
    n_cell_x=1024 n_cell_y=512 n_cell_z=512 >& run-4-oldorder-nci.ou
srun -n 16 ../../new-order.ex amrex.use_gpu_aware_mpi=1 \
    n_cell_x=1024 n_cell_y=512 n_cell_z=512 >& run-4-neworder-nci.ou
Key features:
- Uses --gpus-per-task=1 for fine-grained GPU binding
- Compares default NIC policy vs MPICH_OFI_NIC_POLICY=GPU
- Demonstrates multiple runs with different configurations
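The four runs in the script above follow a regular pattern: two executables, each run with and without the NIC policy. The dry-run sketch below prints the equivalent srun commands from a loop instead of launching them. The helper run_name and the variable names are hypothetical, and the output file names are copied from the script as written.

```shell
#!/bin/bash
# Dry-run sketch: print the four srun commands instead of launching them.
args="amrex.use_gpu_aware_mpi=1 n_cell_x=1024 n_cell_y=512 n_cell_z=512"

# Illustrative helper: $1 = old|new, $2 = suffix ("" or "-nci").
run_name() {
    echo "run-4-${1}order${2}.ou"
}

for suffix in "" "-nci"; do
    # In the script above, the "-nci" runs are the ones launched after
    # exporting MPICH_OFI_NIC_POLICY=GPU.
    for order in old new; do
        echo "srun -n 16 ../../${order}-order.ex ${args} >& $(run_name "${order}" "${suffix}")"
    done
done
```

A real version would export MPICH_OFI_NIC_POLICY=GPU before the second pass of the outer loop and run the commands rather than echo them.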
WarpX Production Script
Reference implementation from the WarpX project:
#!/bin/bash -l
# Copyright 2021-2023 Axel Huebl, Kevin Gott
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL
#SBATCH -t 00:10:00
#SBATCH -N 2
#SBATCH -J WarpX
# note: <proj> must end on _g
#SBATCH -A <proj>
#SBATCH -q regular
# A100 40GB (most nodes)
#SBATCH -C gpu
# A100 80GB (256 nodes)
#S BATCH -C gpu&hbm80g
#SBATCH --exclusive
#SBATCH --cpus-per-task=32
# ideally single:1, but NERSC cgroups issue
#SBATCH --gpu-bind=none
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH -o WarpX.o%j
#SBATCH -e WarpX.e%j
# python interpreter & script here
EXE=python3
INPUTS=run_script.py
# or executable & inputs file
#EXE=./warpx
#INPUTS=inputs
# environment setup
if [[ -z "${MY_PROFILE}" ]]; then
    echo "WARNING: FORGOT TO"
    echo " source $HOME/perlmutter_gpu_warpx.profile"
    echo "before submission. Doing that now."
    source $HOME/perlmutter_gpu_warpx.profile
fi
# pin to closest NIC to GPU
export MPICH_OFI_NIC_POLICY=GPU
# threads for OpenMP and threaded compressors per MPI rank
# note: 16 avoids hyperthreading (32 virtual cores, 16 physical)
export OMP_NUM_THREADS=16
# GPU-aware MPI optimizations
export AMREX_DEFAULT_INIT="amrex.use_gpu_aware_mpi=1"
# CUDA visible devices are ordered inverse to local task IDs
# Reference: nvidia-smi topo -m
srun --cpu-bind=cores bash -c "
    export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID));
    ${EXE} ${INPUTS}" \
  > output.txt
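The OMP_NUM_THREADS=16 setting in the script above is the 32 CPUs per task divided by the two hardware threads per physical core. The sketch below derives it instead of hard-coding it. The function name omp_threads_for is hypothetical, and the fallback value of 32 is an assumption taken from the script's --cpus-per-task line.

```shell
#!/bin/bash
# Illustrative helper: physical cores available to one rank, assuming
# 2 hardware threads per core as the script comment states.
omp_threads_for() {
    local cpus_per_task="$1"
    echo $((cpus_per_task / 2))
}

# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# outside a job, fall back to the value used in the script (assumption).
export OMP_NUM_THREADS="$(omp_threads_for "${SLURM_CPUS_PER_TASK:-32}")"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Deriving the value keeps the thread count consistent if --cpus-per-task is later changed.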