Build Troubleshooting
This guide helps diagnose and resolve ERF build issues. For library configuration problems, see Library Configuration. For HPC-specific issues, see Machine Profiles, Cray Detection, Build Scripts, and Workstation Builds.
Quick Diagnostic
Where’s the problem?
CMake configuration fails
Common causes:
Missing craype-accel-* module on Cray GPU builds → See Missing craype-accel Module
NetCDF/HDF5 not found → See Library Configuration
Wrong compiler detected → Check module list
Quick checks:
module list
echo $CRAY_ACCEL_TARGET
echo $NETCDF_DIR
Cray GPU link error: cannot find -lcudart
If configure/link fails with cannot find -lcudart, check what the Cray wrapper is injecting:
CC --cray-print-opts=libs | grep -E 'cuda|mpi_gtl|mpich|libsci'
If CUDA -L paths are missing (or stale), reload your machine profile/module stack and reconfigure from a clean build directory.
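The check above can be scripted. The following is a minimal sketch, assuming the wrapper prints ordinary whitespace-separated linker flags; has_cuda_libdir is a hypothetical helper name, not part of ERF or the Cray tools. It only tests whether some -L search path mentions "cuda" and does not verify that libcudart actually exists there.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ERF): succeed if a string of linker
# flags contains a -L search path whose name mentions "cuda".
has_cuda_libdir() {
    # Put one flag per line, then look for a -L flag containing "cuda"
    # in any letter case. The -L itself is matched case-sensitively so
    # that -l<library> flags such as -lcudart do not count.
    printf '%s\n' "$1" | tr -s '[:space:]' '\n' | grep -Eq '^-L.*[Cc][Uu][Dd][Aa]'
}
```

On a Cray system it could be called as has_cuda_libdir "$(CC --cray-print-opts=libs)" || echo "no CUDA -L path injected by the wrapper".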
Compilation fails
Common causes:
Out of memory during CUDA/HIP compilation → See Out of Memory During Compilation
Missing source files → Run git submodule update --init --recursive
Stale CMake cache → See Stale CMake Cache
Quick fixes:
# Check memory
free -h
# Update submodules
git submodule update --init --recursive
# Reduce parallel jobs
make -j4
Linking fails
Common causes:
Parallel/serial library mismatch → See Library Configuration for MPI linker errors
Missing libraries → Check ldd ./ERF3d.*.ex
GPU-aware MPI issues (now auto-fixed on Cray)
Executable fails to run
Common causes:
Wrong GPU architecture
Missing runtime libraries
MPI misconfiguration
Verification:
# Check dependencies
ldd ./ERF3d.*.ex
# Try short run
mpiexec -n 4 ./ERF3d.*.ex inputs max_step=10
→ See Verifying Successful Builds for full verification steps
Build Process Issues
Missing craype-accel Module
Symptom: CMake error during GPU build on Cray systems.
Error:
CMake Error: CRAY_ACCEL_TARGET not set for GPU build
Cause: GPU builds on Cray require a craype-accel-* module, which sets $CRAY_ACCEL_TARGET.
Solution:
Load the module for your hardware:
# NVIDIA A100 (Perlmutter, Polaris)
module load craype-accel-nvidia80
# AMD MI250X (Frontier)
module load craype-accel-amd-gfx90a
# Intel GPUs (Aurora)
module load craype-accel-intel-gpu
Best practice: Use machine profiles:
source Build/machines/perlmutter_erf.profile
cmake -DERF_ENABLE_CUDA=ON ..
Out of Memory During Compilation
Symptom: Compilation killed with memory errors.
Error:
nvcc fatal: Memory allocation failure
c++: fatal error: Killed signal terminated program
Cause: GPU compilation needs more memory than the default allocation provides on systems that hand out partial nodes.
Solution:
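# Option 1: request a full node
# SLURM script
#SBATCH --exclusive
# Interactive
salloc --exclusive -N 1
# Option 2: request memory explicitly
#SBATCH --mem=240G
# or
#SBATCH --mem-per-cpu=4G
# Option 3: reduce the number of parallel compile jobs
make -j4 # Instead of make -j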
Note
This is common on Kestrel, where partial-node allocations are the default. Always use --exclusive or an explicit memory request.
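Instead of hard-coding -j4, the job count can be derived from the memory actually available. This is a minimal sketch; safe_jobs is a hypothetical helper, and the memory budget per compile job is an assumption you should tune for your compiler and GPU backend (4 GB below simply mirrors the --mem-per-cpu=4G example).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ERF): pick a parallel job count that
# fits in memory. Usage: safe_jobs <mem_gb> <gb_per_job> <ncpu>
safe_jobs() {
    local mem_gb=$1 per_job=$2 ncpu=$3
    local jobs=$(( mem_gb / per_job ))   # integer division
    if (( jobs < 1 )); then jobs=1; fi   # always run at least one job
    if (( jobs > ncpu )); then jobs=$ncpu; fi   # never exceed the core count
    echo "$jobs"
}
```

For example, make -j"$(safe_jobs 16 4 "$(nproc)")" limits a 16 GB allocation to at most four compile jobs.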
Stale CMake Cache
Symptom: Unexpected failures after changing modules or compilers.
Cause: CMake caches library locations, which become invalid when the environment changes.
Solution:
make distclean
cmake ..
make
Or manually:
rm -rf CMakeCache.txt CMakeFiles/
cmake ..
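One way to spot a stale cache before rebuilding is to compare the compiler recorded in CMakeCache.txt with the one currently in your environment. The sketch below relies on the standard NAME:TYPE=VALUE cache line format; cached_cxx and cache_is_stale are hypothetical helper names, and the comparison is a plain string match, so symlinked compiler paths may be reported as different.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ERF).

# Print the C++ compiler path recorded in a CMake cache file.
cached_cxx() {
    # Cache entries look like: CMAKE_CXX_COMPILER:FILEPATH=/path/to/CC
    sed -n 's/^CMAKE_CXX_COMPILER:[A-Z]*=//p' "$1"
}

# Succeed if the cache records a compiler that differs from $2.
cache_is_stale() {
    local cached
    cached=$(cached_cxx "$1")
    [ -n "$cached" ] && [ "$cached" != "$2" ]
}
```

A possible use from the build directory: cache_is_stale CMakeCache.txt "$(command -v CC)" && rm -rf CMakeCache.txt CMakeFiles/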
Debugging Tools
CMake Debugging
# Verbose output
cmake --log-level=VERBOSE ..
# With context (shows hierarchy)
cmake --log-context --log-level=VERBOSE ..
Example output:
[ERF.Cray] Detected Cray Programming Environment
[ERF.Cray] Setting Cray compiler wrappers...
[ERF.NetCDF] Found NetCDF: /opt/cray/pe/netcdf/4.9.0.9
Inspect cache:
cmake -LAH .. | less
grep NETCDF CMakeCache.txt
GNU Make Debugging
# Print variable values
make print-CXXFLAGS
make print-LIBRARIES
# Verbose build
make VERBOSE=1
Library Dependencies
# Check linked libraries
ldd ./ERF3d.*.ex | grep netcdf
# Check for symbols
nm ERF3d.*.ex | grep nc_
nm ERF3d.*.ex | grep MPI_
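When the executable fails to start, the libraries that matter most are the ones ldd cannot resolve. A small filter, sketched here with the hypothetical name missing_libs, prints only those entries from ldd output read on standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ERF): read ldd output on stdin and
# print the name of every shared library reported as "not found".
missing_libs() {
    # Unresolved entries look like: "libfoo.so.1 => not found"
    awk '/=> not found/ { print $1 }'
}
```

Usage: ldd ./ERF3d.*.ex | missing_libs. Empty output means every dependency was resolved in the current environment.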
Verifying Successful Builds
Quick Test
# Run short simulation
cd build/install/bin
mpiexec -n 4 ./ERF3d.*.ex inputs max_step=10
Regression Tests
# Configure with tests
cmake -DERF_ENABLE_TESTS=ON ..
make
# Run tests
ctest -L regression -VV
Check Build Info
./ERF3d.*.ex --describe
Shows compiler versions, enabled features, and GPU architecture.
Resolved Issues (Automated)
These issues are now handled automatically by the build system.
Cray GPU-Aware MPI Linking
Historical problem: Linking failed with GPU-aware MPI due to Cray’s --as-needed flag removing GTL libraries.
Automated solution:
Detects GPU-aware MPI (MPICH_GPU_SUPPORT_ENABLED=1)
Identifies MPI base library (e.g., mpi_gnu_123)
Identifies required GTL library:
- mpi_gtl_cuda for NVIDIA
- mpi_gtl_hsa for AMD
Adds to CMAKE_CXX_STANDARD_LIBRARIES and CMAKE_CUDA_STANDARD_LIBRARIES
If automation fails: Check that MPICH_GPU_SUPPORT_ENABLED=1 is set and that a craype-accel-* module is loaded.
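If you need to reproduce the GTL selection by hand while debugging, the vendor-to-library mapping listed above can be sketched as a shell function. This is only an illustration of the mapping, not the build system's actual CMake logic; gtl_for_target is a hypothetical name, and targets other than the NVIDIA and AMD examples given in this guide are deliberately left unhandled.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ERF): map a CRAY_ACCEL_TARGET value
# to the GTL library named in this guide.
gtl_for_target() {
    case "$1" in
        nvidia*) echo mpi_gtl_cuda ;;   # e.g. nvidia80
        amd_*)   echo mpi_gtl_hsa ;;    # e.g. amd_gfx90a
        *)       return 1 ;;            # unknown or empty target
    esac
}
```

For example, CC --cray-print-opts=libs | grep "$(gtl_for_target "$CRAY_ACCEL_TARGET")" shows whether the wrapper already links the expected GTL library.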
NetCDF/HDF5 Detection on Cray
Historical problem: find_package failed because the parallel libraries live in module-managed paths.
Automated solution:
Queries Cray compiler wrapper:
CC --cray-print-opts=PKG_CONFIG_PATH
Prepends path to PKG_CONFIG_PATH, enabling pkg-config to find parallel libraries.
If automation fails: Load cray-netcdf-hdf5parallel manually.
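The prepend step can also be done manually in your shell before running cmake. A minimal sketch follows; prepend_pkg_config_path is a hypothetical helper name, and it assumes the wrapper prints a colon-separated directory list.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ERF): prepend a directory list to
# PKG_CONFIG_PATH without leaving a stray ':' when the variable is empty.
prepend_pkg_config_path() {
    [ -n "$1" ] || return 0   # nothing to add
    PKG_CONFIG_PATH="$1${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
    export PKG_CONFIG_PATH
}
```

On a Cray system: prepend_pkg_config_path "$(CC --cray-print-opts=PKG_CONFIG_PATH)", after which pkg-config --exists netcdf should succeed if the NetCDF module is loaded.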
GPU Architecture Auto-Detection
Historical problem: Users manually specified architecture for all dependencies.
Automated solution:
Reads $CRAY_ACCEL_TARGET (e.g., nvidia80, amd_gfx90a)
Maps to architecture flags:
- AMReX: AMReX_CUDA_ARCH=8.0
- Kokkos: Kokkos_ARCH_AMPERE80=ON
If automation fails: Check that a craype-accel-* module is loaded:
echo $CRAY_ACCEL_TARGET
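If you have to pass the architecture by hand, the NVIDIA part of the mapping can be sketched as below. This illustrates the nvidia80 to 8.0 conversion shown above and is not the build system's own code; accel_to_amrex_cuda_arch is a hypothetical name, and only two-digit nvidiaNN targets are handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ERF): convert a CRAY_ACCEL_TARGET
# such as "nvidia80" into the "8.0" form used by AMReX_CUDA_ARCH.
accel_to_amrex_cuda_arch() {
    case "$1" in
        nvidia[0-9][0-9])
            local cc=${1#nvidia}       # strip the prefix: "80"
            echo "${cc%?}.${cc#?}"     # insert the decimal point: "8.0"
            ;;
        *) return 1 ;;                 # not an NVIDIA target
    esac
}
```

A manual configure could then look like cmake -DERF_ENABLE_CUDA=ON -DAMReX_CUDA_ARCH="$(accel_to_amrex_cuda_arch "$CRAY_ACCEL_TARGET")" ..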
Getting Help
Before submitting an issue:
Search existing issues
Check this guide and Library Configuration
Run diagnostic commands above
Creating a bug report:
Include this information in your GitHub issue:
**System:**
- OS: [e.g., Perlmutter/CrayOS, Ubuntu 22.04]
- Compiler: [gcc --version or CC --version]
- MPI: [mpirun --version]
- Modules: [module list]
**Build command:**
[Complete cmake command or script]
**Error:**
[Complete, unedited terminal output]
Attach files:
CMakeCache.txt
Build log:
make 2>&1 | tee build.log
Diagnostic output:
cmake --log-level=VERBOSE --log-context .. 2>&1 | tee cmake_verbose.log
echo $CRAY_ACCEL_TARGET
echo $NETCDF_DIR
module list
Contributing Fixes
If you solve a build problem, contribute your solution!
Contributions welcome:
Machine profiles (Build/machines/*.profile)
Build system improvements
Documentation enhancements
Troubleshooting examples
How to contribute:
Fork ERF repository
Create feature branch
Make changes
Submit Pull Request
See contribution guidelines in the repository.
Note
Community contributions are essential. Your solutions help other users and improve ERF for everyone.