Kestrel (NREL)¶
The Kestrel cluster is an HPE Cray system with Intel Xeon Sapphire Rapids CPU nodes (104 cores) and NVIDIA H100 GPUs (4 per node).
Note
Kestrel has separate login nodes for GPU work. Access GPU login nodes via kestrel-gpu.hpc.nrel.gov. GPU jobs should only be submitted from GPU login nodes.
Building
# Reset to default environment
module restore
# Build with Cray compilers
cd ${ERF_HOME}/Exec
make realclean
make -j COMP=cray
# Load GPU modules
module purge
module load PrgEnv-gnu/8.5.0
module load cuda/12.3
module load craype-x86-milan
# Build
cd ${ERF_HOME}/Exec
make realclean
make -j COMP=gnu USE_CUDA=TRUE
# Load GPU modules
module purge
module load PrgEnv-gnu/8.5.0
module load cuda/12.3
module load craype-x86-milan
# Configure and build
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release \
-DERF_ENABLE_MPI=ON \
-DERF_ENABLE_CUDA=ON \
..
make -j
Warning
System updates on Kestrel periodically change required modules. Verify current module names with module avail before building.
Memory Allocation
Kestrel allows partial node allocations. For memory-intensive operations (e.g., CUDA compilation):
# Option 1: Request exclusive node access
#SBATCH --exclusive
# Option 2: Request specific memory
#SBATCH --mem=240G
# or
#SBATCH --mem-per-cpu=2G
Without these flags, CUDA compilation may fail due to insufficient memory.
Performance and Cost Considerations
GPU node hours on Kestrel are charged at 10× the rate of CPU node hours. Understanding performance trade-offs is essential for efficient use of allocations.
Typical Performance Characteristics:
GPU nodes (4× H100): 10-20× faster than CPU nodes (96-104 cores)
Best efficiency with >1M cells per GPU
Smaller problems may not fully utilize GPU capability
When to Use GPU vs CPU:
Problem Size |
GPU Nodes (10× cost) |
CPU Nodes (1× cost) |
|---|---|---|
< 500K cells/GPU |
May not justify 10× cost |
Recommended |
500K - 1M cells/GPU |
Marginal benefit |
Consider for development |
> 1M cells/GPU |
Recommended - cost effective |
Slower time to solution |
> 5M cells/GPU |
Excellent utilization |
May exceed wall-time limits |
Recommendations:
Profile your specific case - Performance varies with physics packages and I/O frequency
Development on CPU - Use CPU nodes for code development and small test cases
Production on GPU - Use GPU nodes for production runs with well-sized domains
Monitor utilization - Check GPU usage with
nvidia-smito verify saturation
Note
The 10-20× performance gain typically justifies the 10× cost increase for production runs, providing faster time-to-solution and 1-2× better overall cost efficiency measured in allocation units per simulation.