fea-nux

Distributed ANSYS – NUMA and MPI

With NUMA (non-uniform memory access), a processor can access its local memory faster than non-local memory. NUMA support is usually enabled in the BIOS when the workstation boots.

For the SMP version of ANSYS, using NUMA does not have any noticeable impact. On the other hand, using NUMA correctly can help speed up Distributed ANSYS (DANSYS), since it runs multiple processes (one per core used).

On Linux, running numactl --hardware provides information on the processors and memory. An example listing is shown below:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 24576 MB
node 0 free: 23788 MB
node 1 cpus: 4 5 6 7
node 1 size: 24573 MB
node 1 free: 23466 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

The above system has two quad-core CPUs. Each CPU appears as one NUMA node with about 24 GB of local memory, so the processes running on a node should together use no more than the free RAM on that node (under 24 GB here).
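To pull just the free-memory figures out of such a listing, a small filter can help. The following is a minimal sketch that assumes the output format shown above; the function name numa_free_mb is made up for illustration.

```shell
# Print "<node id> <free MB>" for each NUMA node found in
# `numactl --hardware` output read from standard input.
# Assumes lines of the form "node 0 free: 23788 MB".
numa_free_mb() {
    awk '/^node [0-9]+ free:/ { print $2, $4 }'
}

# On a live Linux system (requires numactl to be installed):
#   numactl --hardware | numa_free_mb
```

Each output line pairs a node number with its free memory in MB, which is the limit that the processes pinned to that node should stay under.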

Enabling NUMA in the system BIOS is usually not sufficient, since the operating system (both Linux and Windows) often moves a process between cores, sometimes even to a different CPU, which may defeat any gains seen by using NUMA.

MPI software, such as Platform MPI (Platform Computing was acquired by IBM, so the software product is now called “IBM Platform MPI”), supports environment variables that can fix the processor affinity of the DANSYS processes. Specifically, MPIRUN_OPTIONS, MPI_CPU_AFFINITY, and MPI_BIND_MAP can be used to control how the processes are associated with cores.

For example, on Windows, one can create a DOS batch file to run a DANSYS job and set associated environment variables beforehand:

@echo off
set PATH=%ANSYS145_DIR%\bin\%ANSYS_SYSDIR%;%PATH%
set ANSYS145_PRODUCT=ansys
set MPIRUN_OPTIONS=-v
set MPI_CPU_AFFINITY=MAP_CPU
set MPI_BIND_MAP=0,4,2,6,1,5,3,7
ansys145 -b nolist -i input.inp -o output.out -np 4 -dis

In the above example, the PATH environment variable is extended to include the ANSYS 14.5 directory, so Windows will recognize the “ansys145” command without its full pathname. The MPIRUN_OPTIONS environment variable specifies arguments to use when mpirun is launched (this is how the ANSYS processes are executed); “-v” gives verbose output (echoed in the DOS Command Prompt window), which can be helpful for getting detailed information. MPI_CPU_AFFINITY=MAP_CPU indicates that the MPI processes will be mapped to physical cores. The MPI_BIND_MAP environment variable defines the order in which cores are assigned to the MPI rank IDs. In this case, core 0 (on node 0) will be used for the first process, then core 4 (on node 1) for the second, and so on. With -np 4, only the first four entries are used, so (assuming the core numbering in the numactl listing above) the processes land on cores 0, 4, 2, and 6, two per node.
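One way to build a bind map that alternates between nodes is to interleave the per-node core lists reported by numactl. The helper below is a hypothetical sketch (interleave_cores is not part of ANSYS or Platform MPI). It produces a plain round-robin order, which differs from the 0,4,2,6,1,5,3,7 order used above but likewise alternates between the two nodes.

```shell
# Build a comma-separated core list that alternates between two NUMA nodes.
# Arguments: two quoted, space-separated core lists, one per node.
interleave_cores() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, " ")
        nb = split(b, y, " ")
        n = (na > nb) ? na : nb
        out = ""
        for (i = 1; i <= n; i++) {
            # Take the i-th core of each node in turn, when it exists.
            if (i <= na) out = out (out == "" ? "" : ",") x[i]
            if (i <= nb) out = out (out == "" ? "" : ",") y[i]
        }
        print out
    }'
}

# Core lists taken from the numactl listing shown earlier.
interleave_cores "0 1 2 3" "4 5 6 7"   # prints 0,4,1,5,2,6,3,7
```

The result could then be assigned to MPI_BIND_MAP. Whether one alternating order performs better than another on a given machine is something to measure, not assume.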

While the above solution is running, one can open Windows Task Manager and go to the “Processes” tab. There, right-click on an “ANSYS.exe” process and select “Set Affinity…” to see that the process is mapped to a single core. (Compare this with other processes, which will typically show that they may run on any core.)

For Linux, the same environment variables can be used. In the above batch script, “@echo off” would be replaced by “#!/bin/bash”, and the PATH line by “export PATH=/ansys_inc/v145/ansys/bin:$PATH” (in bash, “export” is used instead of the DOS “set”, and existing variables are referenced as “$variable” instead of “%variable%”). To verify whether or not a process is pinned to a given core, use “ps -ef | grep” to find the ANSYS process ID, then use “taskset” to view the CPU affinity for that process ID. One should see that each process is tied to a single core when using the above script.
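Putting those substitutions together, a bash version of the batch file might look like the sketch below. The install path /ansys_inc/v145 is the one quoted in the text and may differ on a given system. The function names, and the process-name pattern passed to show_affinity, are assumptions made for illustration.

```shell
#!/bin/bash

# Export the same Platform MPI settings as the Windows batch file.
set_dansys_affinity() {
    export PATH=/ansys_inc/v145/ansys/bin:$PATH   # install path may differ
    export ANSYS145_PRODUCT=ansys
    export MPIRUN_OPTIONS=-v
    export MPI_CPU_AFFINITY=MAP_CPU
    export MPI_BIND_MAP=0,4,2,6,1,5,3,7
}

# Launch the same 4-process distributed run as the batch file.
run_dansys() {
    set_dansys_affinity
    ansys145 -b nolist -i input.inp -o output.out -np 4 -dis
}

# Print the CPU affinity of every process whose command line matches
# the given pattern. The pattern is an assumption: check the actual
# ANSYS process name with "ps -ef" first.
show_affinity() {
    for pid in $(pgrep -f "$1"); do
        taskset -cp "$pid"
    done
}
```

One would call run_dansys in one terminal and, while the solve is in progress, show_affinity with a suitable pattern in another. Each ANSYS process should report a single core if the pinning took effect.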

One can easily perform a series of runs in a DOS or bash script to compare the performance of DANSYS on a NUMA system with and without the above environment variables. Speedups of roughly 5-15% may be experienced. The performance gains may not be worth the extra effort for some users, so using these environment variables is not necessarily for everyone. However, for users wishing to optimize their hardware configuration, especially those performing many runs (e.g., an optimization or probabilistic study done through Design Exploration), being able to cut the solution time down by 10% may be worthwhile.
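To put a number on such a comparison, the wall-clock times of two runs can be turned into a percentage reduction. This is a minimal sketch with a hypothetical helper name; the timings in the usage line are made up, not measured.

```shell
# Percentage reduction in solution time between a baseline run and a
# run with the affinity variables set. Arguments are times in seconds.
time_saved_percent() {
    awk -v base="$1" -v tuned="$2" \
        'BEGIN { printf "%.1f\n", (base - tuned) / base * 100 }'
}

# Hypothetical timings: 1000 s without pinning, 900 s with pinning.
time_saved_percent 1000 900   # prints 10.0
```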

Note that the total memory used by the processes on a node must not exceed the free memory local to that node. The user can run the solution once, then view the amount of memory used by all processes in the ANSYS output files to determine the memory needed.
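That check is simple arithmetic and can be scripted. The function below is a hypothetical sketch: it takes a node's free memory and the per-process memory figures (all in whole MB) and succeeds only if the sum fits. The figures in the usage line are invented for illustration.

```shell
# Succeed if the summed per-process memory does not exceed a node's
# free memory. All values are whole numbers of MB.
# Usage: node_mem_ok FREE_MB PROC_MB [PROC_MB ...]
node_mem_ok() {
    free_mb=$1
    shift
    total=0
    for mb in "$@"; do
        total=$((total + mb))
    done
    [ "$total" -le "$free_mb" ]
}

# Two processes of 9000 MB each on node 0 (23788 MB free in the listing).
if node_mem_ok 23788 9000 9000; then
    echo "fits in local memory"
else
    echo "exceeds local memory"
fi
```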
