#!/bin/bash
#SBATCH --job-name="sample"
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --export=ALL
#SBATCH --time=00:10:00
#SBATCH --out=%J.out
#SBATCH --error=%J.err
#SBATCH --cpus-per-task=18
:<<++++
Author: Tim Kaiser
Script that runs a hybrid MPI/OpenMP program "phostone" on the
number of nodes requested with --nodes (1 here), with a given
number of OpenMP threads per task.
Here we run with 2 tasks and 18 threads. We "default" to 18
threads because we have --cpus-per-task=18. This can also be
set with the environment variable OMP_NUM_THREADS.
If you set OMP_NUM_THREADS within a script you should add
--cpus-per-task=$OMP_NUM_THREADS to your srun command line.
In this version we show the effects of setting the variable
KMP_AFFINITY. KMP_AFFINITY is used to control mappings of
threads to cores when the Intel compilers are used.
The issue is that, if KMP_AFFINITY is not set, multiple threads or
tasks can end up on the same core. We will look at three settings.
With
  KMP_AFFINITY=verbose
a report is sent to stderr, %J.err in our case, where %J
is the job number. The mapping of threads to cores is the "default",
which is somewhat arbitrary.
With
  KMP_AFFINITY=verbose,scatter
and
  KMP_AFFINITY=verbose,compact
we still get the report, but the runtime tries not to map multiple
threads to the same core.
For each run we put the output in a separate file. If we run
the program phostone with the -F option we get a report of the
mappings of MPI tasks and threads to cores.
task    thread    node name    first task    # on node    core
0000      0011      r4i7n35          0000         0000    0011
0000      0015      r4i7n35          0000         0000    0015
...
...
0001      0017      r4i7n35          0000         0001    0035
Each line gives the MPI task, the OpenMP thread number, the node and
the core number. Ideally, on each node every core would run exactly
one thread. While we could look at the individual output files to see
if there is duplication, we instead use the nested for loops below to
find and report cores that are over- or underloaded. We note that
this only happened when we had KMP_AFFINITY=verbose.
The variable KMP_AFFINITY is specific to the Intel compilers. There are
similar "OMP" variables that work with both GCC and Intel compilers.
For example, the following settings give results similar to KMP_AFFINITY=scatter:
OMP_PLACES=cores
OMP_PROC_BIND=spread
USAGE:
sbatch -A hpcapps --partition=debug affinity.sh
++++
# needed for threading if you use Intel compilers
module load comp-intel
# load our version of MPI
module load mpt
# Go to the directory from which our job was launched
cd "$SLURM_SUBMIT_DIR"
echo running glorified hello world
1>&2 echo "***** running verbose *****"
export KMP_AFFINITY=verbose
srun --ntasks=2 ./phostone -t 10 -F > $SLURM_JOBID.noaffinity
1>&2 echo "***** running verbose,scatter *****"
export KMP_AFFINITY=verbose,scatter
srun --ntasks=2 ./phostone -t 10 -F > $SLURM_JOBID.scatter
1>&2 echo "***** running verbose,compact *****"
export KMP_AFFINITY=verbose,compact
srun --ntasks=2 ./phostone -t 10 -F > $SLURM_JOBID.compact
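# A hedged sketch, not one of the original runs: the header comment notes
# that OMP_PLACES=cores with OMP_PROC_BIND=spread gives results similar to
# KMP_AFFINITY=scatter. The function below shows how such a run could look.
# Assumptions: the name run_omp_spread, the RUN_OMP_SPREAD switch and the
# .ompspread suffix are hypothetical. KMP_AFFINITY is unset so the two
# mechanisms are not mixed. The body runs in a subshell, so the exports do
# not leak into the rest of the script.
run_omp_spread() (
  unset KMP_AFFINITY
  export OMP_PLACES=cores
  export OMP_PROC_BIND=spread
  srun --ntasks=2 ./phostone -t 10 -F > "$SLURM_JOBID.ompspread"
)
# Off by default; submit with RUN_OMP_SPREAD=1 in the environment to try it.
# To see the result in the core report, add ompspread to the for-loop list.
if [ "${RUN_OMP_SPREAD:-0}" = 1 ] ; then
  1>&2 echo "***** running OMP_PLACES=cores OMP_PROC_BIND=spread *****"
  run_omp_spread
fi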
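# For each run, list the cores that do not carry exactly one thread.
for f in noaffinity scatter compact ; do
  echo "Core report - cores over/under loaded KMP_AFFINITY=" $f
  for c in $(seq -w 0 35) ; do
    echo -n "$c "
    # count the lines whose last column (the core) ends in 00$c
    grep -c "00$c\$" "$SLURM_JOBID.$f"
  done | grep -v ' 1$'   # hide cores with exactly one thread
  echo " "
done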
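# A minimal sketch of an alternative to the grep loop above: tally the
# threads per core in a single awk pass over one output file.
# Assumptions (not from the original script): the function name core_tally
# is hypothetical, data lines have a numeric first field and at least six
# fields, the core number is the last field, and cores are numbered from 0
# (36 of them unless a second argument says otherwise).
core_tally() {
  local file=$1 ncores=${2:-36}
  awk -v n="$ncores" '
    # data lines only; "$NF + 0" turns "0011" into the number 11
    $1 ~ /^[0-9]+$/ && NF >= 6 { count[$NF + 0]++ }
    END {
      # print every core whose thread count differs from 1
      for (c = 0; c < n; c++)
        if (count[c] + 0 != 1) printf "%02d %d\n", c, count[c] + 0
    }' "$file"
}
# Possible usage, left commented out so the job output is unchanged:
#   core_tally "$SLURM_JOBID.noaffinity"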
:<<++++
Example Output:
el2:collect> sbatch -A hpcapps --partition=short affinity.sh
Submitted batch job 5368560
el2:collect>
el2:collect> ls -l 5368560*
-rw-rw----. 1 tkaiser2 tkaiser2 2600 Dec 18 07:32 5368560.compact
-rw-rw----. 1 tkaiser2 tkaiser2 22175 Dec 18 07:32 5368560.err
-rw-rw----. 1 tkaiser2 tkaiser2 2600 Dec 18 07:32 5368560.noaffinity
-rw-rw----. 1 tkaiser2 tkaiser2 136 Dec 18 07:32 5368560.out
-rw-rw----. 1 tkaiser2 tkaiser2 2600 Dec 18 07:32 5368560.scatter
el2:collect> cat 5368560.out
running glorified hello world
Core report - cores over/under loaded
noaffinity
24 2
25 3
27 2
31 0
32 0
33 0
34 0
scatter
compact
el2:collect> head 5368560.err
running verbose
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-17
OMP: Info #214: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #156: KMP_AFFINITY: 18 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #285: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #285: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #285: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #285: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #285: KMP_AFFINITY: topology layer "thread" is equivalent to "core".
el2:collect> head 5368560.noaffinity
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0001 r1i1n15 0000 0000 0005
0000 0015 r1i1n15 0000 0000 0004
0000 0003 r1i1n15 0000 0000 0007
0000 0007 r1i1n15 0000 0000 0009
0000 0014 r1i1n15 0000 0000 0006
0000 0016 r1i1n15 0000 0000 0003
0000 0006 r1i1n15 0000 0000 0010
el2:collect> head 5368560.scatter
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0013 r1i1n15 0000 0000 0013
0000 0012 r1i1n15 0000 0000 0012
0000 0003 r1i1n15 0000 0000 0003
0000 0007 r1i1n15 0000 0000 0007
0000 0008 r1i1n15 0000 0000 0008
0000 0016 r1i1n15 0000 0000 0016
0000 0006 r1i1n15 0000 0000 0006
el2:collect> head 5368560.compact
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0003 r1i1n15 0000 0000 0003
0000 0012 r1i1n15 0000 0000 0012
0000 0002 r1i1n15 0000 0000 0002
0000 0014 r1i1n15 0000 0000 0014
0000 0001 r1i1n15 0000 0000 0001
0000 0004 r1i1n15 0000 0000 0004
0000 0015 r1i1n15 0000 0000 0015
el2:collect>
++++