Kestrel (NREL)

The Kestrel cluster is an HPE Cray system with Intel Xeon Sapphire Rapids CPU nodes (104 cores) and NVIDIA H100 GPUs (4 per node).

Note

Kestrel has separate login nodes for GPU work. Access GPU login nodes via kestrel-gpu.hpc.nrel.gov. GPU jobs should only be submitted from GPU login nodes.

Building

CPU build (GNU Make):

# Reset to default environment
module restore

# Build with Cray compilers
cd ${ERF_HOME}/Exec
make realclean
make -j COMP=cray

GPU build (GNU Make):

# Load GPU modules
module purge
module load PrgEnv-gnu/8.5.0
module load cuda/12.3
module load craype-x86-milan

# Build
cd ${ERF_HOME}/Exec
make realclean
make -j COMP=gnu USE_CUDA=TRUE

GPU build (CMake):

# Load GPU modules
module purge
module load PrgEnv-gnu/8.5.0
module load cuda/12.3
module load craype-x86-milan

# Configure and build
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DERF_ENABLE_MPI=ON \
      -DERF_ENABLE_CUDA=ON \
      ..
make -j

Warning

System updates on Kestrel periodically change required modules. Verify current module names with module avail before building.
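As a quick check before building, the modules named in the build steps can be listed in one pass. This is a minimal sketch: the function name check_erf_modules is hypothetical, and the module names are only those used in this section, which may no longer match what is installed.

```shell
# List what is currently installed for each module family used in this section.
# Hypothetical helper; compare its output against the versions you intend to load.
check_erf_modules() {
    local m
    for m in PrgEnv-gnu cuda craype-x86-milan; do
        echo "== ${m} =="
        # Some module implementations print listings to stderr; merge it into stdout.
        module avail "${m}" 2>&1
    done
}

# Usage on a login node:
# check_erf_modules
```

If a version listed here differs from the one in the build steps, adjust the module load lines accordingly.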

Memory Allocation

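Kestrel allows partial node allocations, so a job may receive only part of a node's memory. For memory-intensive operations (e.g., CUDA compilation), request the memory explicitly in the batch script: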

# Option 1: Request exclusive node access
#SBATCH --exclusive

# Option 2: Request specific memory
#SBATCH --mem=240G
# or
#SBATCH --mem-per-cpu=2G

Without these flags, CUDA compilation may fail due to insufficient memory.
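One way to apply these flags is to run the build itself as a batch job. The sketch below writes a hypothetical batch script, build_erf.sbatch, that uses Option 1 together with the GPU build commands from this section. The job name, time limit, and account placeholder are assumptions, and partition or GPU request flags are left out because they depend on your allocation.

```shell
# Write a batch script that builds ERF with a whole node's memory available.
# Assumes ERF_HOME is exported in the environment the job is submitted from.
cat > build_erf.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=erf-build
#SBATCH --account=<your_allocation>
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# Option 1: exclusive node access (replace with --mem=240G for Option 2)
#SBATCH --exclusive

# Same modules as the GPU build steps in this section
module purge
module load PrgEnv-gnu/8.5.0
module load cuda/12.3
module load craype-x86-milan

cd "${ERF_HOME}/Exec"
make realclean
make -j COMP=gnu USE_CUDA=TRUE
EOF

# Submit once the placeholders are filled in:
# sbatch build_erf.sbatch
```

The heredoc delimiter is quoted so that ${ERF_HOME} is expanded when the job runs, not when the file is written.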

Performance and Cost Considerations

GPU node hours on Kestrel are charged at 10× the rate of CPU node hours. Understanding performance trade-offs is essential for efficient use of allocations.

Typical Performance Characteristics:

  • GPU nodes (4× H100): 10-20× faster than CPU nodes (96-104 cores)

  • Best efficiency with >1M cells per GPU

  • Smaller problems may not fully utilize GPU capability

When to Use GPU vs CPU:

Problem Size          GPU Nodes (10× cost)           CPU Nodes (1× cost)
< 500K cells/GPU      May not justify 10× cost       Recommended
500K - 1M cells/GPU   Marginal benefit               Consider for development
> 1M cells/GPU        Recommended - cost effective   Slower time to solution
> 5M cells/GPU        Excellent utilization          May exceed wall-time limits
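To place a case in the table above, divide the total cell count by the number of GPUs. The sketch below does this for a single-level domain; the function names and the example domain size are hypothetical, and the thresholds are copied from the table rather than measured.

```shell
# Total cells divided evenly across GPUs (integer division).
cells_per_gpu() {
    # $1 $2 $3: cell counts in each direction; $4: total number of GPUs
    echo $(( $1 * $2 * $3 / $4 ))
}

# Map a cells-per-GPU count onto the guidance in the table.
suggest_node_type() {
    if [ "$1" -lt 500000 ]; then
        echo "cpu"
    elif [ "$1" -le 1000000 ]; then
        echo "marginal"
    else
        echo "gpu"
    fi
}

# Example: a hypothetical 512 x 512 x 128 domain on one GPU node (4 GPUs)
n=$(cells_per_gpu 512 512 128 4)
echo "${n} cells/GPU -> $(suggest_node_type "${n}")"
# prints "8388608 cells/GPU -> gpu"
```

Cases with mesh refinement would need the cells on every level counted, which this sketch does not attempt.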

Recommendations:

  1. Profile your specific case - Performance varies with physics packages and I/O frequency

  2. Development on CPU - Use CPU nodes for code development and small test cases

  3. Production on GPU - Use GPU nodes for production runs with well-sized domains

  4. Monitor utilization - Check GPU usage with nvidia-smi to verify saturation
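For the utilization check in item 4, nvidia-smi can emit machine-readable CSV that is easy to summarize. The sketch below averages the utilization across the GPUs on a node; the helper name mean_gpu_util is hypothetical, and how you reach the compute node while the job is running depends on your workflow.

```shell
# Average the utilization column of nvidia-smi CSV output read from stdin.
# Expects lines like "0, 97" (GPU index, utilization in percent).
mean_gpu_util() {
    awk -F', *' '{ sum += $2; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

# On a GPU compute node while the job is running:
# nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | mean_gpu_util
```

A single sample can be misleading around I/O steps, so it may be worth repeating the query several times during a run.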

Note

The 10-20× performance gain typically justifies the 10× cost increase for production runs: time-to-solution is faster, and cost efficiency in allocation units per simulation is 1-2× that of CPU nodes (the speedup divided by the 10× charge factor).
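The arithmetic behind this note is the measured speedup divided by the charge factor. The sketch below computes that ratio for a few speedups; the function name gpu_cost_efficiency is hypothetical, and the speedup should come from profiling your own case rather than from the quoted range.

```shell
# Cost efficiency of a GPU run relative to a CPU run:
#   efficiency = speedup / charge_factor
# A value above 1 means the GPU run uses fewer allocation units per simulation.
gpu_cost_efficiency() {
    # $1: measured GPU-node speedup over a CPU node; $2: charge factor (default 10)
    awk -v s="$1" -v c="${2:-10}" 'BEGIN { printf "%.2f\n", s / c }'
}

gpu_cost_efficiency 20   # prints 2.00 (upper end of the quoted range)
gpu_cost_efficiency 10   # prints 1.00 (break-even)
gpu_cost_efficiency 5    # prints 0.50 (GPU run costs twice as much)
```

A case that achieves less than a 10× speedup costs more on GPU nodes, although the shorter time-to-solution may still make it worthwhile.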