Command:        srun -n 502 ./ccsm.exe
Resources:      4 nodes (128 physical, 256 logical cores per node)
Memory:         251 GiB per node
Tasks:          502 processes, OMP_NUM_THREADS was 1
Machine:        b1362.betzy.sigma2.no
Start time:     Tue Apr 20 18:16:15 2021
Total time:     285 seconds (about 5 minutes)
Full path:      /cluster/work/users/ingo/noresm/norcpm1_piControl_00010101/norcpm1_piControl_00010101_mem01/run

Summary: ccsm.exe is MPI-bound in this configuration
Compute:                                     31.3% |==|
MPI:                                         68.5% |======|
I/O:                                          0.2% ||
This application run was MPI-bound. A breakdown of this time and advice for investigating further is in the MPI section below.

CPU:
A breakdown of the 31.3% CPU time:
Single-core code:                            99.8% |=========|
OpenMP regions:                               0.2% ||
Scalar numeric ops:                          27.7% |==|
Vector numeric ops:                           7.4% ||
Memory accesses:                             64.9% |=====|
The per-core performance is memory-bound. Use a profiler to identify time-consuming loops and check their cache performance.
Little time is spent in vectorized instructions. Check the compiler's vectorization advice to see why key loops could not be vectorized.

MPI:
A breakdown of the 68.5% MPI time:
Time in collective calls:                    44.8% |===|
Time in point-to-point calls:                55.2% |=====|
Effective process collective rate:            4.26 MB/s
Effective process point-to-point rate:        72.5 MB/s
Most of the time is spent in point-to-point calls with a low transfer rate. This can be caused by inefficient message sizes, such as many small messages, or by imbalanced workloads causing processes to wait.
The collective transfer rate is very low. This suggests load imbalance is causing synchronization overhead; use an MPI profiler to investigate.
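
The MPI percentages above are fractions of MPI time, not of the whole run. As a rough sketch, they can be converted to shares of wall-clock time by multiplying through the nested percentages. All figures are copied from this report; the helper name wallclock_breakdown is illustrative only and is not part of any tool.

```python
# Figures copied from the report above.
total_seconds = 285
compute_share = 0.313      # fraction of wall-clock time in compute
mpi_share = 0.685          # fraction of wall-clock time in MPI
collective_share = 0.448   # fraction of MPI time in collective calls
p2p_share = 0.552          # fraction of MPI time in point-to-point calls


def wallclock_breakdown(total, parent, child):
    """Return (fraction of the whole run, seconds) for a nested percentage."""
    fraction = parent * child
    return fraction, fraction * total


collective = wallclock_breakdown(total_seconds, mpi_share, collective_share)
p2p = wallclock_breakdown(total_seconds, mpi_share, p2p_share)

for name, (fraction, seconds) in (("collective", collective),
                                  ("point-to-point", p2p)):
    print(f"{name:>15}: {fraction:6.1%} of the run, about {seconds:.0f} s")
```

On these numbers, point-to-point calls alone take roughly 38% of wall-clock time, which is more than the 31.3% spent in compute.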

I/O:
A breakdown of the 0.2% I/O time:
Time in reads:                               87.9% |========|
Time in writes:                              12.1% ||
Effective process read rate:                   209 MB/s
Effective process write rate:                  242 MB/s
Most of the time is spent in read operations with an average effective transfer rate. It may be possible to achieve faster effective transfer rates using asynchronous file operations.

OpenMP:
A breakdown of the 0.2% time in OpenMP regions:
Computation:                                  0.0% |
Synchronization:                              0.0% |
Physical core utilization:                   98.0% |=========|
System load:                                 98.2% |=========|
No measurable time is spent in OpenMP regions.
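
The physical core utilization is a measured value, but it is consistent with a simple count of processes against available cores. The sketch below is only a consistency check and assumes one single-threaded MPI process per physical core, which the report does not state explicitly.

```python
# Figures copied from the report header.
nodes = 4
physical_cores_per_node = 128
mpi_processes = 502
omp_threads = 1  # OMP_NUM_THREADS was 1

# Assumption: each process occupies exactly one physical core.
busy_cores = mpi_processes * omp_threads
total_cores = nodes * physical_cores_per_node
utilization = busy_cores / total_cores

print(f"{busy_cores} of {total_cores} physical cores: {utilization:.1%}")
```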

Memory:
Per-process memory usage may also affect scaling:
Mean process memory usage:                     354 MiB
Peak process memory usage:                     739 MiB
Peak node memory usage:                      27.0% |==|
There is significant variation between peak and mean memory usage. This may be a sign of workload imbalance or a memory leak.
The peak node memory usage is very low. Running with fewer MPI processes and more data on each process may be more efficient.
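
To put the memory figures on a common scale, the following sketch converts them to GiB per node and computes the peak-to-mean ratio per process. It assumes the 502 processes are spread evenly over the 4 nodes, which the report does not confirm, so the per-node mean is only an estimate.

```python
MIB_PER_GIB = 1024

# Figures copied from the report above.
nodes = 4
processes = 502
node_memory_gib = 251
mean_process_mib = 354
peak_process_mib = 739
peak_node_fraction = 0.27  # "Peak node memory usage: 27.0%"

# Assumption: processes are distributed evenly across nodes.
processes_per_node = processes / nodes

# Estimated mean memory in use per node, from the per-process mean.
mean_node_gib = processes_per_node * mean_process_mib / MIB_PER_GIB

# Peak memory in use on a node, from the reported percentage.
peak_node_gib = peak_node_fraction * node_memory_gib

# Ratio behind the "significant variation" remark.
peak_to_mean = peak_process_mib / mean_process_mib

print(f"processes per node (assumed even): {processes_per_node:.1f}")
print(f"estimated mean node usage: {mean_node_gib:.1f} GiB")
print(f"peak node usage: {peak_node_gib:.1f} GiB of {node_memory_gib} GiB")
print(f"peak/mean process memory: {peak_to_mean:.2f}x")
```

The peak process uses about twice the mean, and even the peak node usage leaves most of the 251 GiB free, which is what the two remarks above are pointing at.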

Energy:
A breakdown of how energy was used:
CPU:                                      not supported
System:                                   not supported
Mean node power:                          not supported
Peak node power:                              0.00 W
Energy metrics are not available on this system.
CPU metrics are not supported (no intel_rapl module).