------------------- Upcoming version 9.0 -----------------------------
Major features:
- For LLVM runtimes based on LLVM 13.0 and newer, Score-P now offers
function instrumentation similar to that available for GCC.
This includes compile-time filtering via
`scorep --instrument-filter=`. Note that not all compilers provide
the necessary libraries or headers.
Features and improvements:
- With CUDA 11.6 and newer, the CUDA adapter now uses Score-P timestamps
for CUPTI events directly instead of converting CUPTI timestamps. Older
CUDA versions still use the previous implementation. This may prevent
timestamp issues seen with previous Score-P versions.
- The OMPT adapter gained improved support for OpenMP tasks, including
detach, yield, taskwait depend, taskloops, and cancelled tasks, as well
as support for OpenMP reduction clauses.
User tools and API improvements and changes:
- The 'xnonblock' option of SCOREP_MPI_ENABLE_GROUPS is deprecated.
Measurements always record extended non-blocking events.
- The `scorep` tool now also provides the Git revision via the
`--revision` flag, similar to `scorep-config`.
- Add a filter generation option to `scorep-score`, that generates a
maximal filter including all filterable regions. This serves as
starting point for a fully manual approach without the need to
copy and paste from the default output.
- The help command of the Score-P instrumenter `scorep --help` will now
only print the help for available Score-P adapters for its installation.
- For OpenMP instrumentation, the OMPT adapter is now used by default,
if available. To use OPARI2 as the default instrumenter, please use
`--enable-default=opari2` during configure. The option
`--enable-default=ompt` will be removed in a future Score-P release.
- The HIP adapter is now enabled by default when configure detects
the required libraries and a suitable Clang-based compiler.
- Intel compilers supporting `-tcollect` for instrumentation (i.e., the
classic icc, icpc, ifort) will switch back to this option, instead of
using `-finstrument-functions`. The former allows for compile-time
filtering. The Clang-based icx, icpx, and ifx will continue to use
`-finstrument-functions`.
Compatibility:
- The usage of POMP user instrumentation, aka `#pragma pomp ...`, is
deprecated. Use the `SCOREP_USER_*` macros instead.
- Non-LLVM based CCE compilers are no longer supported. Please upgrade.
- Added the environment variable `NVCC` to change the compiler
command used for checking the NVIDIA CUDA compiler during configure.
- Remove OpenACC configure options and environment variables to
specify include paths to `openacc.h` and `acc_prof.h`. Compilers
implementing OpenACC know where to find the header files.
- Added flag `--disable-libwrap-generator` to disable the build
of the library wrapper generator. While `--without-llvm` still
disables this build as well, `--with-llvm=<path>` will no longer
cause configure to fail if the requirements for the generator
build cannot be satisfied.
- External dependencies which can be downloaded via `--with-lib<foo>=download`
at configure time, can now be provided via `build-config/packages` or
`--with-package-cache=<path>`. Run `build-config/packages.sh` to list
all packages and their respective download URL.
Bugfixes:
- Prevent the OMPT adapter from aborting when nested undeferred
OpenMP tasks report task underflows.
- Enabling the CUDA / HIP adapter together with the OMPT adapter will no
longer cause a segmentation fault with LLVM 16.0 and newer when OpenMP
offloading flags are used.
- Ensure that usage of the HIP API from within Score-P does not leave
lingering last error values.
- Fix memory corruption in cases where the HIP communicator was not
initialized before unification.
------------------- Released version 8.4 -----------------------------
User tools and API improvements and changes:
- Fixed assignment of functions to the `COM` category in Score-P's scoring tool.
User-wrapped libraries now induce `COM` along call paths that reach them,
and Score-P internal regions such as `TRACE BUFFER FLUSH` no longer induce
`COM`.
Compatibility:
- The classic Intel compilers will now use the flag `-diag-disable=10441`
to suppress the deprecation warning for each compilation unit.
Bugfixes:
- Fix crash when issuing multiple consecutive split collective IO operations
on the same file handle.
- Fix linking issues with LLVM based compilers when OpenMP target flags
are used and the compiler instrumentation is active.
- HIP instrumentation is now consistent with CUDA instrumentation in its
handling of small rounding errors in device timestamp interpolation and
will simply adjust the beginning of the next event per stream according
to the events already seen.
- Fix incorrect communicator attribute in MPI_COLLECTIVE_END events for
`MPI_Comm_create_group`, `MPI_Comm_create_from_group`, `MPI_Comm_join`,
`MPI_Intercomm_create`, and `MPI_Intercomm_create_from_groups`.
- Fix generation of 'MpiRequestTested' events from `MPI_Test[all|any|some]`
wrappers for unsuccessful tests when enabled via
`SCOREP_MPI_ENABLE_GROUPS=xreqtest`.
- The configure check for `pthread_spin_init` now uses the correct
data types for the arguments.
- Combined compile-and-link commands with the `nvcc` compiler no longer
fail when library archives (`.a`) are given.
- Fix pointers getting cut off for OpenMP regions when the source code lookup
fails.
- Fix handling of multiple or unknown values for `--enable-default` during
configure.
- Fix classification of various MPI functions (mostly introduced with
MPI 4.0) by the report post-processing.
- Fix build with ROCm 6.0 and device ID in stream names `HIP[D:S]`.
- Provide the missing wrapper for `MPI_Request_get_status` and properly
handle the case where the operation hasn't been completed yet.
- Fix abort of OMPT adapter with nested parallel regions inside of tasks.
- Fix duplicate MPI inter-communicator definitions for cases with
distinct peer communicators.
- Fix inconsistent measurement with memory copies being done with changed
contexts. The exit event is now properly ignored as well.
- Measurements with HIP and `memcpy` enabled in `SCOREP_HIP_ENABLE` do
not abort anymore if no data transfers were performed.
- By default, use Cray compiler wrappers cc, CC, ftn instead of MPI
and SHMEM compiler wrappers on Cray EX platforms.
- Make OpenACC profiling tool registration compliant to specification.
- Fix profile metric names in the case of hierarchies. Only names of siblings
have to be unique.
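
The HIP timestamp entry in the list above describes adjusting the beginning
of the next event per stream according to the events already seen. A minimal
Python sketch of that idea follows. It is not Score-P's implementation: the
function name and the `(stream, begin, end)` tuple layout are hypothetical,
and it assumes that events on one stream must not overlap.

```python
from collections import defaultdict


def clamp_stream_events(events):
    """Return events with per-stream overlaps removed.

    `events` is an iterable of (stream, begin, end) tuples in recording
    order (hypothetical layout). If an event begins before the previous
    event on the same stream ended, its begin is moved to that end.
    """
    last_end = defaultdict(lambda: None)  # stream -> end of last event seen
    adjusted = []
    for stream, begin, end in events:
        previous_end = last_end[stream]
        if previous_end is not None and begin < previous_end:
            begin = previous_end  # absorb the small rounding error
        end = max(end, begin)  # keep the event well-formed
        adjusted.append((stream, begin, end))
        last_end[stream] = end
    return adjusted


events = [(0, 10, 20), (0, 19, 30), (1, 5, 8)]
print(clamp_stream_events(events))
```

Only the second event on stream 0 is changed here; stream 1 is tracked
independently.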
------------------- Released version 8.3 -----------------------------
Bugfixes:
- Fix abort in OMPT adapter for OpenMP loops if runtime supports
reporting loop schedules.
- Fix 'inconsistent profile' abort in OMPT adapter seen with NVHPC and
nested OpenMP parallel region where the outer parallel region uses a
single thread only.
------------------- Released version 8.2 -----------------------------
Features and improvements:
- The OMPT adapter is now able to report the schedule type selected
in OpenMP loops. However, this needs to be supported by the runtime.
- Communicators created with one of the procedures added in MPI 4.0 are now
properly tracked by Score-P.
Compatibility:
- The HIP adapter requires the ROCm SMI library.
- Score-P now requires CubeW and CubeLib in version 4.8.2.
Bugfixes:
- The OpenMP detection now additionally checks for the `-fiopenmp` flag
used by Intel oneAPI.
- The OMPT adapter now aborts when it detects non-conforming OpenMP
behavior observed with runtimes that don't support OMPT target
callbacks (here, helper threads are created that lack thread-begin
and implicit-task-begin but dispatch parallel-begin).
- Support for OpenMP's tool interface OMPT will be partially disabled
if configure detects shortcomings of the OpenMP runtime that can be
worked around by disabling parts of the interface. The reason will
be reported in the configure summary as `OMPT remediable checks`,
whereas shortcomings that can't be worked around are now reported as
`OMPT critical checks`.
- Additional compiler-generated functions are excluded from automatic
compiler instrumentation, as they are known to cause measurement
failures. The names of these functions are unique and won't collide
with user-code functions, with one exception, though. Functions
containing the substring `_tree_reduce_` are neglected when using
Intel's oneAPI compilers. In this case, renaming the functions would
prevent them from being ignored.
- Fix the errors in topology coordinate mapping, introduced by
non-process location groups.
- Fix abort on nested synchronization regions in OpenMP programs
when using the OMPT adapter.
- Acquiring a lock through `omp_test_[nest_]lock` no longer aborts
the measurement when using OMPT. Note that test_[nest_]lock events
are recorded only if lock acquisition was successful.
- Ensure that each OpenMP compilation unit receives `-mp=ompt` as a compile
flag under NVHPC, as this is now mandatory as of version 23.7.
- Fix race condition in CUDA adapter leading to a failed assertion.
- Fix erroneous attribution of visits from GPU and asynchronous
thread activities to the program root node in profiling.
- Fix missing or inconsistent contributions of GPU activities in the
`scorep-score` calculations.
- Compiler-generated functions are now checked by their mangled and
demangled function name, as only checking demangled function
names could lead them to pass the checks incorrectly.
- Removed flags for pre-LLVM CCE memory instrumentation, which were
inappropriately applied to other compilers on Cray/HPE systems.
- Remove a misleading warning message from `MPI_Comm_join`.
- Restore HIP memory transfer recording.
- Limit metrics to CPU thread locations, to avoid aborts when used in
combination with accelerator recording.
- Fix erroneous creation of system tree branches without locations as
leaves, as they don't contribute and conflict with the Cube system tree model.
- Allow per-component SCOREP_METRIC configurations in MPMD scenarios.
- Fix calculation of put and get bytes in MPI accumulate functions.
------------------- Released version 8.1 -----------------------------
Bugfixes:
- Score-P now gracefully handles undefined behavior exploited by the
application when using the Pthread API instead of aborting the
application.
- Allow building against LLVM's libunwind.
- Fix segmentation fault when trying to parse certain function names.
- Fix segmentation fault when executing OpenMP target regions on the
host.
- Score-P records program arguments again. The feature was disabled due
to issues on some HPC systems. Please report if there are any issues
again.
- Support for OpenMP's tool interface OMPT will be disabled if
configure detects shortcomings of the OpenMP runtime that cannot be
worked around. The reason will be reported in the configure summary.
- Fix the primary output of scorep-score by keeping it strictly human
readable. Additional escaping of special characters for filterable
region types is enabled for the `-m` option for regions without a
mangled name.
- Prevent deadlock in the Kokkos adapter when used in an inhomogeneous
multi-process GPU setup.
- Downloaded external libraries (--with-libbfd|libunwind=download) are
now being installed into lib, even if the system preference is to
install under lib64 (e.g., for SUSE systems). This guarantees
picking up the downloaded installation of these libraries.
- Add C wrappers for `MPI_Info_get_nkeys` and `MPI_Info_get_nthkey` to the
MPI adapter.
------------------- Released version 8.0 -----------------------------
Major features:
- Add support to record CUDA NVTX instrumentation.
- Support compiler instrumentation even if compilers use different
flags and different instrumentation interfaces per language (C, C++,
Fortran). This applies to combinations where the C/C++ compilers are
Clang-based but the Fortran compiler still supports only the vendor's
traditional instrumentation (e.g., Cray and Fujitsu).
Intel's `-tcollect` instrumentation and compile-time filtering
are being phased out. `-tcollect` is only used if no alternative is
available. Usually `-finstrument-functions(-after-inlining)`
serves as the replacement with classic/oneAPI compilers.
Support for IBM XL older than version 11 was removed.
Support for PGI compilers that don't support the
`__cyg_profile_func` instrumentation API was removed.
- Support for recording AMD HIP activities. It requires an LLVM/Clang
based compiler.
- Add support for MPI intercommunicators.
- Support OpenMP's tool interface OMPT for host events as a
replacement for OPARI2 instrumentation. The OMPT adapter is
considered experimental. To use OMPT as default OpenMP
instrumentation, add `--enable-default=ompt` to your configure line.
You may switch between OPARI2 and OMPT instrumentation using the
`--thread=omp:opari2` and `--thread=omp:ompt` instrumentation
options. Currently, recent Intel, AMD, and Clang compilers support
the interface.
In contrast to OPARI2, the `test_lock` and `test_nest_lock` routines
cannot be handled by OMPT. The `atomic` construct isn't implemented.
The new adapter handles the `taskgroup` construct, though.
Source code locations usually point to the OpenMP construct. As an
exception, implicit barriers for parallel regions point to the
corresponding `parallel` construct.
Features and improvements:
- Add support for recording the offset parameter to MPI I/O
functions with explicit offsets.
- Record events for the ISO C I/O 'remove' function.
- Score-P provides limited support for `MPI_THREAD_SERIALIZED` and, if
thread-local storage is detected, `MPI_THREAD_MULTIPLE`.
- Record detailed information about MPI non-blocking collective
operations.
- Score-P now supports only version 7 and up of the CUDA Toolkit.
- Score-P now requires OTF2 3.0 and CubeLib/W 4.8.
- The `SCOREP_ENABLE_SYSTEM_TREE_SEQUENCE_DEFINITIONS` feature is now
also disabled when recording accelerator applications.
- Add proper support for NVIDIA HPC SDK compilers to build system
via `--with-nocross-compiler-suite=nvhpc`.
- Add proper support for Intel oneAPI compilers to build system
via `--with-nocross-compiler-suite=oneapi`.
- Add proper support for AMD ROCm compilers to build system via
`--with-nocross-compiler-suite=amdclang`.
- Add events for communicator creation and destruction to Score-P
corresponding to the new records in OTF2 3.0.
- Add Fortran TYPE(C_PTR) overload wrappers to the MPI adapter for
- `MPI_Alloc_mem`
- `MPI_Win_allocate`
- `MPI_Win_allocate_shared`
- `MPI_Win_shared_query`
- With NVHPC compilers from version 21.1 on, Score-P now automatically
ignores OpenMP outlined functions that caused measurement aborts
when not manually filtered.
- Score-P now matches kernel launch sites to their execution instances
during measurement by providing a numeric parameter, useful for
distinguishing kernel instances when the same kernel is launched from
different callpaths.
This feature currently supports CUDA and HIP based kernels and is
controlled via the `kernel_callsite` option to `SCOREP_CUDA_ENABLE` or
`SCOREP_HIP_ENABLE`, respectively.
- Detect BeeGFS and WEKA as distributed filesystems.
- Score-P now generates ENTER and EXIT events for MPI procedures added in version
4.0 of the MPI Standard.
In particular, the following wrappers for procedures listed in section
B.1.2 were added:
- Item 6: `MPI_Isendrecv`, `MPI_Isendrecv_replace`
- Item 7: persistent collectives and persistent neighborhood collectives
- Item 9: partitioned communication
- Item 13: `MPI_Comm_idup_with_info`
- Item 22: `MPI_Info_get_string`
- Item 24: Sessions model
- Item 25: `MPI_Info_create_env`
- Kernel parameters for CUDA kernels are now recorded as parameters in Cube and
OTF2, instead of metrics.
- CUDA instrumentation is now heuristically enabled by the Score-P instrumenter
when not linking with the `nvcc` compiler but the object files reference the
CUDA runtime or the library is specified on the command line.
User tools and API improvements and changes:
- Support for the online access interface was removed.
- Support for NEC and Sun compilers was removed.
- Improvement of `SCOREP_CUDA_ENABLE` options to include implicit
dependencies and define a default in line with the overall
measurement strategy of Score-P. Check `scorep-info` for more
information.
- Lustre stripe I/O handle attributes now also support Lustre
Progressive File Layouts (PFL). The `Number of Extents` attribute
holds the number of components. `Extent Begin` holds the extent
begin offsets. This and the existing `Stripe Count` and `Stripe
Size` attributes are now a comma separated list of values. Some
values can also be constants like `DEFAULT` or `WIDE`.
- Deprecate measurements without extended non-blocking communication events.
- If OMPT (see above) is supported by the compiler, `--thread=omp` is
supplemented by the two variants `--thread=omp:ompt` and
`--thread=omp:opari2`; `--thread=omp` and `--thread=omp:ompt` are
used interchangeably in this case. The same is true for `--thread=omp`
and `--thread=omp:opari2` if OMPT is not supported.
- Deprecate instrumentation using the Program Database Toolkit, i.e.,
using the `--pdt` option. Please use compiler or user
instrumentation instead.
- The `SCOREP_ENABLE_SYSTEM_TREE_SEQUENCE_DEFINITIONS` feature,
introduced in Score-P 4.0, is deprecated.
- Add a `DEMANGLED` keyword to the filter parser as a counterpart to
the existing `MANGLED` keyword to switch back and forth between
matching against mangled or demangled region names.
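
The Lustre entry in the list above says that `Stripe Count`, `Stripe Size`,
and `Extent Begin` are now comma-separated lists whose values may also be
constants such as `DEFAULT` or `WIDE`. A minimal Python sketch of how a
consumer of these attributes might split such a value follows. The function
name and the sample strings are hypothetical and are not taken from Score-P
or OTF2.

```python
def parse_stripe_attribute(value):
    """Split a comma-separated attribute value into a list.

    Numeric entries become integers; anything else (for example the
    constants DEFAULT or WIDE) is kept as a string.
    """
    parsed = []
    for item in value.split(","):
        item = item.strip()
        parsed.append(int(item) if item.isdigit() else item)
    return parsed


# Hypothetical attribute values for a file with three PFL components.
print(parse_stripe_attribute("1,4,WIDE"))
print(parse_stripe_attribute("1048576,DEFAULT,4194304"))
```

Keeping the constants as strings leaves their interpretation to the caller,
since their numeric meaning depends on the file system configuration.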
Compatibility:
- Score-P now requires a shared or PIC libbfd. Therefore, the
configure option --with-libbfd now also accepts 'download' to
download, build and install a libbfd at make time.
- Address-to-line lookup now requires the availability of
dl_iterate_phdr from link.h where previously /proc/self/maps was
parsed. In addition, symbols from dlopened shared libraries are
considered if linker auditing is available and the LD_AUDIT
environment variable is set to
<prefix>/lib[/backend]/libscorep_rtld_audit.so when executing an
instrumented binary.
- The C++ compiler requirements to build Score-P were raised from
C++98 to C++11.
- For building scorep-score, use CC and CXX provided by cubelib-config
but allow for individual flags via (C|CXX|LD)FLAGS_FOR_BUILD_SCORE.
- For building the library-wrapper generator used by `scorep-libwrap-init`,
use CC and CXX provided by `llvm-config` or via `(CC|CXX)_FOR_BUILD_LIBWRAP`.
Additionally, allow for individual flags via
`((CPP|C|CXX|LD)FLAGS|LIBS)_FOR_BUILD_LIBWRAP`. The latter supersedes the
previous `LIBCLANG_((CPP|CXX|LD)FLAGS|LIBS)`.
- Support for Intel MIC platforms is deprecated.
- Support for IBM Blue Gene/Q platforms is deprecated.
Bugfixes:
- MPI request management now uses internal Score-P memory management.
However, a measurement will now have increased memory requirements.
Please be aware of this.
- Add missing Fortran wrappers to the MPI adapter for
- `MPI_Alloc_mem`
- `MPI_Free_mem`
- `MPI_Win_shared_query`
- Non-blocking MPI I/O events now produce the correct `IoOperationTest`
event instead of the erroneous `MpiRequestTested` event for an
unsuccessful MPI_Test on the request.
- Events for unsuccessful `MPI_Test` on a request are now consistently triggered
by the `xreqtest` group.
- Fix a crash in the MPI adapter when no active requests are given to `MPI_Waitany`.
- Fix the calculation of sent bytes in `MPI_Reduce_scatter` with `MPI_IN_PLACE`.
- Fix a bug in the conversion of request handles in the Fortran wrapper of
`MPI_Wait`.
------------------- Released version 7.1 -----------------------------
Bugfixes:
- Properly handle `nvcc` compiler flags beginning with -o that do
not set output files.
- Ensure that Score-P's compiler wrappers are not called recursively. This was
possible as a result of `scorep-nvcc` using a `scorep-*` host
compiler wrapper.
- scorep-score: fix event size estimation of I/O sync events.
- Allow for OpenMP tasks outside of an OpenMP parallel region.
- Fix Fortran wrappers for MPI_Alltoallw and MPI_Ialltoallw when using
MPI_IN_PLACE. This resolves a rare crash when passing legal NULL
array arguments to these functions.
- Communication completed in MPI_Request_get_status is now correctly
recorded.
------------------- Released version 7.0 -----------------------------
Features and improvements:
- Add support for recording calls to OpenCL 2.1/2.2 functions.
- Add support for recording events from the Kokkos tools interface.
The Kokkos CUDA and HIP back ends are stable on a single device
(see OPEN_ISSUES). The OpenMP and Pthread back ends should be
treated as experimental.
- Issue individual I/O events in POSIX vectorized I/O operations.
- Add recording of transfer offsets of POSIX I/O operations.
- Add wrapping of more vectorized I/O operations:
- `preadv2`, `preadv64`, `preadv64v2`
- `pwritev2`, `pwritev64`, `pwritev64v2`
- Add stripe count/size for recorded files on the Lustre file system.
- Add process ID (PID) and thread ID (TID) as attributes on program
begin or thread creation events, respectively.
- Record node-level unique identifiers for NVIDIA and AMD GPUs as
CUDA and OpenCL location properties to separate devices in a
multi-GPU environment.
- A new mutex implementation based on atomic intrinsics replaces all
existing mutex implementations.
- Change default of CUDA instrumentation to force a flush of CUDA
activity buffers at program exit. This should resolve issues with
measurements failing to include CUDA activity.
`SCOREP_CUDA_ENABLE=flushatexit` is deprecated and replaced with the
new `SCOREP_CUDA_ENABLE=dontflushatexit` option for programs that already
perform a device synchronize or reset before exit and don't need an
additional flush.
User tools and API improvements and changes:
- Remove the configure option `--with-extra-instrumentation-flags`.
It was introduced to work around GCC compiler instrumentation issues
that vanished with the advent of the recommended GCC compiler
instrumentation plug-in.
- Remove the instrumenter option `--config=<file>` as it was
considered of little use.
- Add ability to generate an initial filter file with optional
control parameters using buffer values, visits and region types.
This includes the ability to iteratively refine the generated filter
file using existing filters.
- Compile-time filtering via `scorep --instrument-filter` is now
also available for builds using Intel compilers.
- Add additional `scorep-score` sorting modes `name`, `totaltime`,
`timepervisit`, and `visits`, besides the default `maxbuffer`.
Select a sorting mode via `-s <mode>`.
- Remove the `scorep` and `scorep-config` option `--mutex` due to
changes in the mutex implementation, see above.
- Allow building against the `libcuda.so` stubs library from the
CUDA SDK. Specify `--with-libcuda-lib=<cuda-sdk>/lib64/stubs` when
configuring. At runtime the `libcuda.so` library must be found by
the system-library path though.
Bugfixes:
- Support the BFD API changes introduced by binutils-2.34.
- Fix aborts when user library wrappers were first called in a
thread-parallel context.
- Unify and fix representation of artificial root nodes for threads,
GPU kernels, and OpenMP tasks in profiling.
- Fix allocation metrics being lost on MPI RMA window allocation functions.
- Honor `CUDA_VISIBLE_DEVICES` when creating CUDA location names.
- Improve error handling of calls to `realpath` on kernel files in
`/proc` or `/sys` when recording I/O activities.
- Allow selecting 'runtime' wrapping of OpenCL in the instrumenter again.
- Fix event sequence and attributes when recording non-blocking
`lio_listio` operations.
- Improve thread-safety of CUDA adapter.
- Improve mount point extraction for some corner cases.
Compatibility:
- Score-P now requires an MPI implementation which is compliant with
at least the MPI 2.2 standard and provides the `USE mpi` Fortran
bindings, instead of the discouraged `INCLUDE 'mpif.h'`.
Note that `USE mpi_f08` is not yet supported and Score-P will
abort during MPI initialization if this is detected.
------------------- Released version 6.0 -----------------------------
Major features:
- Support for recording I/O activities: Calls to POSIX I/O and MPI-I/O
are wrapped and meta data about individual I/O operations is
recorded. Whereas MPI-I/O events are recorded by default, POSIX I/O
recording needs to be activated using the instrumenter option
--io=posix.
Features and improvements:
- Created separate enable group for request handling functions in MPI.
MPI functions dealing with the completion of non-blocking requests
(i.e., the Test/Wait family of calls) are no longer part of the P2P
enable group and moved to a separate enable group, which is enabled
or disabled automatically by the Score-P runtime system.
- Adapted remapper specification to reflect that Test/Wait functionality
is no longer specific to point-to-point communication.
- Added support for the Clang compiler suite. Select via
`--with-nocross-compiler-suite=clang`. Additionally, experimental
support for macOS-based systems was added, but it needs to be enabled
explicitly with `--enable-experimental-platform`.
- Building with the PGI compiler suite now selects the 'pgfortran'
compiler for F77 and FC. Added support for the PGI/LLVM variant.
- Added support for tracking MPI-3 one-sided communication.
- The previously unused environment variable
`SCOREP_MPI_MAX_ACCESS_EPOCHS` was renamed to `SCOREP_MPI_MAX_EPOCHS`
and is now used in tracking MPI one-sided communication.
- Changed the presentation of parameter-based profiling. Instead of
nested call tree nodes under the source code region, create multiple
nodes for the region on the same level and attach Cube-Parameters to
them. In this context, the API of libscorep-estimator (used for
scoring profiles, e.g., in scorep-score) changed. Consider this API
'experimental'.
Bugfixes:
- For OPARI2-instrumented codes that use OpenMP criticals the mapping
to Score-P critical objects was erroneous. As a consequence,
lock-contention analysis for these criticals was unfortunately
erroneous too.
------------------- Released version 5.0 -----------------------------
Major features:
- Orphan thread support: Score-P now records events from POSIX threads
that were not instrumented, e.g., threads created from `std::thread`,
Intel TBB, Intel Cilk Plus, or any other runtime which is based on
POSIX threads. Previously, events from such threads caused a
'TPD == 0' measurement abort. Note that if your link-line does not
need a POSIX thread option like -pthread, you need to use the
Score-P option `--thread=pthread` to activate this feature.
This feature also includes support for POSIX threads that are
running longer than main. For these threads, Score-P will exit all
active regions and end the thread (from the measurement point of
view).
- Added support for cartesian topologies.
Supported topology types:
1) MPI cartesian topologies via MPI_Cart_create.
2) Platform/Hardware specific topologies:
- IBM Blue Gene/Q
- K Computer
3) Process x Threads topology: Generic 2D topology,
currently only for CPU threads.
4) User topologies via user instrumentation API.
By default all available topology types will be recorded. They can
selectively be disabled based on type through environment variables,
see `scorep-info config-vars`. Viable topology results require a
distinct thread binding.
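
For the MPI cartesian topologies mentioned above, each rank of the
cartesian communicator corresponds to one coordinate tuple. The Python
sketch below shows the rank-to-coordinate mapping for the row-major
ordering that MPI uses for cartesian communicators. It is an illustration
only and not Score-P code: the function name `cart_coords` is hypothetical,
and `rank` is assumed to be the rank in the communicator returned by
`MPI_Cart_create`.

```python
def cart_coords(rank, dims):
    """Return the coordinates of `rank` in a grid of shape `dims`.

    Ranks are assumed to be laid out in row-major order, so the last
    dimension varies fastest.
    """
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)  # position within this dimension
        rank //= extent               # remaining index for outer dimensions
    return tuple(reversed(coords))


# A 2 x 3 process grid: ranks 0..5 map to (row, column) pairs.
for r in range(6):
    print(r, cart_coords(r, (2, 3)))
```

In an MPI program, `MPI_Cart_coords` provides the same information.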
Features and improvements:
- Score-P now generates a dynamic `MANIFEST.md` file for each
experiment and copies files, like the filter or selective
configuration files, to the experiment directory.
- In profiling mode, add the file `<DATADIR>/scorep/scorep.spec` to
the `profile.cubex` container, thus making the profile output more
self-contained.
- On thread creation, request internal memory on the fly instead of in
advance. Depending on the measurement configuration this will save
some memory.
- As Open MPI has provided a C++ compiler wrapper for SHMEM since
version 3.0, Score-P will also provide an instrumentation wrapper
`scorep-oshcxx` in this case.
- Values in config variables of type Set can now be negated by
preceeding it with '~', e.g., 'SCOREP_MPI_ENABLE_GROUPS=default,~cg'.
- Functions excluded from instrumentation by the GCC plug-in, because
they were declared as inline, can now be instrumented by providing
an instrumentation filter to 'scorep' where the function is matched
by an explicit 'INCLUDE' rule, which is not the match-all '*' one.
Functions excluded from instrumentation can be listed by adding
`--verbose=2` to the `scorep` command-line.
- Changes to the experimental `scorep-preload-init` script:
- Also preloads the Score-P constructor to be able to early
initialize the measurement.
- Issues a warning for options which are not suitable for
uninstrumented applications.
- 'MPI_Comm_idup' is now supported and does not abort the measurement
anymore.
- Added support for the high bandwidth memory interface (hbw_malloc)
of the memkind library, allowing memory tracking for the Intel KNL
MCDRAM with Score-P.
- All Fortran wrappers now support 64-bit character length arguments
with GCC 8.
- Multiple improvements in the `scorep` instrumenter command to better
interact with build systems:
- All warnings and errors are prefixed with '[Score-P] ', for better
identification.
- All output goes to stderr, to not interfere when catching output
from the compiler/linker in process substitutions.
- When no source files could be identified, the command is executed
as is.
- Since Score-P version 2.x, measurement initialization is done before
entering 'main' using compiler-provided constructor functions, if
available. As a consequence, MPI- or SHMEM-only instrumented
programs lacked the artificial 'PARALLEL' region that was used to
enclose all following regions. Instead of the 'PARALLEL' region
Score-P now generates program-begin and program-end events that
enclose the entire application. If program arguments are given,
these are recorded as well. In tracing mode program-begin/end are
mapped to ProgramBegin/End event records; in profiling mode this
feature is modeled as enter/exit of an additional region with the
name of the executable, if available.
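The '~' negation syntax for Set-type config variables mentioned in the
list above can be pictured with the following sketch. It only splits a
value such as 'default,~cg' into plain and negated entries; the names
`SetValue` and `parse_set_value` are hypothetical, and how Score-P
itself parses and resolves such values is not shown here.

```cpp
#include <set>
#include <sstream>
#include <string>

// Illustrative result type (hypothetical, not part of Score-P).
struct SetValue
{
    std::set<std::string> selected;  // entries written without '~'
    std::set<std::string> negated;   // entries written as '~name'
};

// Splits a comma-separated value and sorts each entry into one of the
// two sets. Whitespace is not trimmed and empty entries are skipped.
SetValue parse_set_value(const std::string& text)
{
    SetValue result;
    std::istringstream stream(text);
    std::string entry;
    while (std::getline(stream, entry, ','))
    {
        if (entry.empty())
        {
            continue;
        }
        if (entry[0] == '~')
        {
            if (entry.size() > 1)
            {
                result.negated.insert(entry.substr(1));
            }
        }
        else
        {
            result.selected.insert(entry);
        }
    }
    return result;
}
```

For the value 'default,~cg' this sketch yields "default" as a selected
entry and "cg" as a negated one.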
Bugfixes:
- Instrumentation of Fortran OpenMP programs that use untied tasks
failed with undefined references. Fixed.
- So far, programs that `pthread_exit()` the main thread crashed
because the program's main thread was required to perform the
measurement finalization. This requirement was removed, along with
multiple improvements for threads lasting longer than main.
- Restored the ability to run with `SCOREP_TOTAL_MEMORY=4G`.
- Instrumentation failed for codes that include system headers via
local headers of the same name. This is fixed for compilers that
support the '-iquote' option (most of the compilers do, PGI
doesn't). Note that this bugfix is overruled if scorep's '--pdt'
option is used.
- Fix memory recording of C++14 applications: Score-P did not wrap
the `delete`/`delete[]` operators with size argument.
- Fix possible overflow of send/recv bytes in MPI_Bcast, MPI_Sendrecv,
and MPI_Sendrecv_replace.
- In selecting MPI groups to be recorded (SCOREP_MPI_ENABLE_GROUPS),
fix handling of MPI subgroups.
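The entry on send/recv byte overflow above does not state the cause. As
an assumption for illustration only, one common way such an overflow
arises is multiplying an element count by a datatype size (e.g., as
reported by MPI_Type_size) in 32-bit `int` arithmetic. The sketch below
widens both operands before multiplying; `transfer_bytes` is a
hypothetical helper, not Score-P code.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Score-P): computes the number of
// transferred bytes in 64-bit arithmetic so that large messages do not
// overflow. Negative inputs are treated as "no bytes".
std::uint64_t transfer_bytes(int count, int type_size)
{
    if (count < 0 || type_size < 0)
    {
        return 0;
    }
    return static_cast<std::uint64_t>(count) *
           static_cast<std::uint64_t>(type_size);
}
```

With 2^20 elements of 4096 bytes each, the product is 2^32 bytes, which
no longer fits into a 32-bit integer but is represented exactly here.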
------------------- Released version 4.1 -----------------------------
Bugfixes:
- scorep-score: fixed potentially wrong output of SCOREP_TOTAL_MEMORY
which was caused by an uninitialized variable.
- Improve robustness of wrapping memory-related function calls
during link-time.
- Fixed PGI compiler adapter to prevent the corruption of register
values in some cases.
- Fixed calculation of memory statistics in out-of-memory condition.
- Honor --libdir and --dis|enable-shared|static when building and
installing libscorep_estimator.
------------------- Released version 4.0 -----------------------------
Major features:
- User Library Wrapping: Using scorep-libwrap-init, you can now
automatically generate library wrappers supplying only the
headers and library files of the target library.
You then install the wrapper into SCOREP_LIBWRAP_PATH and use it
with the new instrumenter flag --libwrap=<wrapper>.
For this only linking with Score-P is necessary, except when
the library is called from threads, then the threading paradigm
has to be instrumented as well.
Features and improvements:
- The utility "scorep-score" is provided now as a library application
to allow using its functionality in third-party software. Obtain
compile flags via
"scorep-config --target score --cflags|--ldflags|--libs".
- Improve detection and compiler selection for SGI MPT
implementations.
- Provide the Substrate Plugin interface, which enables plugins to
consume Score-P runtime events for recording, analysis, and
optimization purposes.
- Added the option SCOREP_FORCE_CFG_FILES, which enables users to
force the creation of the experiment directory even if there are no
active substrates that write any output. Defaults to true.
- Provided the option to use sequence definitions for the system tree.
They provide a constant size system tree description. The trade-off
is the loss of individual names and properties for locations,
location groups and system tree nodes. Currently supported only for
MPI.
- Added possibilities to aggregate the locations within a thread to
reduce the report size. The aggregation can be enabled via the
SCOREP_PROFILING_FORMAT environment variable. The new formats
THREAD_SUM, THREAD_TUPLE, KEY_THREADS, and CLUSTER_THREADS are
available.
- Replace the two threading variants --thread=omp:pomp_tpd and
--thread=omp:ancestry by only one: --thread=omp. The possible
options are detected at configure time. If both are available,
the ancestry variant will be used by default.
- As compressing OTF2 traces was not supported by any OTF2 release in
the past and probably won't be in the foreseeable future either, the
support for this feature in Score-P was removed.
- Score-P no longer ships with the Cube GUI. Cube was componentized
and Score-P just includes Cube's library components that are
necessary for measurements and scoring. The configure option
--with-cube was replaced by --with-cubew and --with-cubelib. They
need to be provided a PATH to cubew-config and cubelib-config,
respectively, if not already in PATH. The Cube GUI is separately
available from http://www.scalasca.org.
- An experimental script named `scorep-preload-init` is provided
which helps set up a measurement done through the `LD_PRELOAD`
mechanism. Score-P needs to be built with shared libraries to
enable this feature, and not all instrumentations are supported.
Bugfixes:
- Improve the extraction of topology information from the Slurm
topology/tree plugin to create the system tree. There were cases
where the Slurm topology information wasn't correctly distributed to
the individual compute nodes. This resulted in a system tree with a
single node parenting all processes instead of several nodes
parenting subsets of processes.
- Recording of synchronous metrics (SCOREP_METRIC_SYNC), i.e.,
per-process metrics or metrics provided by a 'sync' plugin, resulted
in wrong values in profiling mode. Fixed.
- Added a time-based string to temporary results files of the
preprocessing step during instrumentation. This should avoid name
clashes if the same source file is concurrently processed twice during
the build process.
- The support for a modularized OPARI2, introduced in Score-P 2.0,
attributed wrong names for the inner regions of the OpenMP
constructs critical, ordered, section, single, and task. This is
fixed now.
------------------- Released version 3.1 -----------------------------
Features and improvements:
- The induced penalty to access thread-local storage variables was
considerably reduced for some compilers, notably for the Intel
compilers.
- If both OpenMP instrumentation options, omp:tpd and omp:ancestry,
are supported, use omp:ancestry as default. This works around a
problem found with recent Intel compilers (e.g., 17.0.0) and the
omp:tpd option.
- The GCC compiler instrumentation plug-in now instruments functions
that will not return in the usual way, e.g., a Pthread
start_routine that calls pthread_exit.
Bugfixes:
- Fix compilation error during instrumentation, if the command line
contains a header file.
- Fix losing parameter call-paths by avoiding multiple definitions of
the same parameters.
- Fix that memory allocation measurements are disabled if the
user explicitly specifies --memory.
- Fix conflict of function wrapping with IPA on BlueGene systems.
- Do not preprocess assembler files anymore.
- Fix race condition in parallel make (make -j). Note that parallel
'make check' still exhibits race conditions due to Fortran
dependency issues.
- Fix segmentation fault in the profile when memory operations
and metric counters are recorded at the same time.
- Improve detection of ARM and Cray platforms.
- Allow for shell variables in configure options. Options like
'--includedir=\${prefix}/include' caused configure to fail.
------------------- Released version 3.0 -----------------------------
Note: In this version, we switch from a 'major.minor.bugfix'
versioning scheme to a 'major.bugfix' scheme. New user-relevant
features will be introduced by increasing the major number. Bugfix
releases will not add new user-relevant features but might contain,
in addition to bugfixes, Score-P-internal improvements.
Major features:
- Support for instrumentation of OpenACC codes based on the profiling
interface specified in OpenACC 2.5.
Features and improvements:
- Extract topology information from the Slurm topology/tree plugin to
create the system tree. This feature is available in Slurm since
version 2.1 (around 09/2009) and documented since 01/2014. Please
refer to the Slurm documentation for how to enable this feature:
http://slurm.schedmd.com/topology.html
- Change PGI C++ compiler settings (selected via
--with-nocross-compiler-suite=pgi) from pgCC to pgc++. PGI removed
pgCC in version 16.1. If your installation still provides pgCC and
you want to use it, please add CXX=pgCC to your configure line.
Bugfixes:
- Prevent sampling/unwinding when Intel MPI is used. This combination,
even when sampling is not active, may mysteriously alter the
application output just by linking libunwind.
- Fixed possible underestimation of the trace size and memory footprint
in scorep-score due to counting timestamps only for enter/leave
records.
- Fixed function signatures of SHMEM API functions that changed in
Open MPI 2.0.
------------------- Released version 2.0.2 ---------------------------
Bugfixes:
- The preprocessing of source files before they will be instrumented
with OPARI2 was broken. This is fixed.
- Prevent potential division by zero error during calculation of tsc
timer frequency.
- Compiler-specific CXXFLAGS might break the 'build-score' configure
as the CXX used to build 'scorep-score' might differ from the CXX
used to build the Score-P libraries. CXXFLAGS in build-score are now
ignored. To set build-score related CXXFLAGS, use
CXXFLAGS_FOR_BUILD_SCORE.
- Fix bug in configuration of SHMEM support triggered by change in
shmem.fh header of Open MPI 1.10.2.
- Fix PAPI configure check when additional libraries are needed to
successfully link to PAPI. This was a regression introduced with
version 2.0.
- Fix typos in remapping specification file which caused the
point-to-point and collective bytes transferred metrics to always
be zero.
- Build-system hardening.
- The configure check for libunwind now also works if libunwind
depends on liblzma.
- Documentation improvements.
- Fixed memory leaks in sampling and CUDA mode.
------------------- Released version 2.0.1 ---------------------------
Bugfixes:
- Prevent the memory adapter from initializing the measurement system
as this leads to program crashes if it happens too early, e.g., on
Blue Gene systems. If memory instrumentation is the only means of
instrumentation, the measurement system is initialized via the
feature 'compiler constructor'. If this feature isn't available
(search for 'compiler constructor: yes' in 'scorep-info
config-summary'), you need to add, e.g., user instrumentation to
initialize the measurement system.
------------------- Released version 2.0 -----------------------------
Major features:
- Score-P supports a new data collecting mode based on sampling.
Sampling can be used in conjunction with the usual instrumentation
of parallel paradigms. Therefore it combines the lower overhead of
statistical sampling and the accuracy of instrumentation. Both
call-path profiling and event tracing are supported. As this is
rather a major change in the Score-P internals and also for the user
experience, we appreciate any feedback but need to declare the
sampling support as experimental in this first release.
- Support for OPARI2 2.0 was integrated. OPARI2 is now more flexible
to enable support for other pragma/directive based paradigms.
- Support for MPI-3.1 functions (except 'MPI_Comm_idup'). Most new
functions currently provide plain enter/exit wrappers.
- Support for tracking memory allocations was added to Score-P. This
includes C/C++, MPI, and SHMEM API calls. The instrumentation is
done by default, though must be enabled at measurement time
explicitly.
Features and improvements:
- When using compiler instrumentation with GNU (not the gcc-plugin but
the '-finstrument-functions' variant), Cray, or Fujitsu compilers,
one can provide a file containing symbols that will trigger
measurement events when the corresponding function is called. These
symbols are subject to filtering. Providing symbols this way is
useful when obtaining symbols during measurement via 'nm' or
'libbfd' is not an option, e.g., on Blue Gene systems. The symbol
file needs to be specified in the environment variable
'SCOREP_NM_SYMBOLS'. The accepted format is as in
'nm -l <executable>'.
- Transparent changes to the event-dispatching. Currently events are
consumed by either the profiling or tracing substrate (or both).
- The timer selection was moved from configure time to measurement
time. During configure we detect all available timers and provide
the environment variable 'SCOREP_TIMER' to select one. The timer
defaults to a low-overhead time stamp counter, if available. Note
that we assume all processes to use the same timer and time stamp
counter timers to run at the same frequency.
- Building the entire Score-P package on Blue Gene/Q systems using GNU
compilers is now supported. The installation currently needs some
extra steps, please see 'share/bg-gnu/README' for details. The
installation on older Blue Gene systems, though not tested, might
work as well.
- Source-to-source instrumentation via PDT on Blue Gene systems was
re-enabled for PDT versions newer than 3.18.
- Score-P takes advantage of compilers to initialize the measurement
system automatically before triggering any event. This also ensures
that the interrupt sources for sampling are registered as early as
possible and in the case when no compiler instrumentation is
available.
- Score-P now uses the '-Minstrument=functions' flag for PGI compiler
instrumentation (64-bit targets only). The '-Mprof=func' flag is no
longer supported by PGI compiler version 16. To our knowledge,
'-Minstrument=functions' is available at least since PGI compiler
version 11. However, older PGI compiler versions may not support
'-Minstrument=functions' and are not supported by Score-P anymore.
- A synchronization callback was added to the metric plugin API. A
metric plugin can register a synchronization callback which is
called every time Score-P starts clock synchronization. The
synchronization callback contains one argument specifying the point
in time in more detail. At the moment we distinguish
synchronization at initialization, during measurement run, and at
finalization. As a result, the synchronization callback allows
metric plugins to detect start and end points of measurement
intervals.
- The manual user instrumentation for Fortran 90 now performs region
initialization checks based on handle values instead of comparing
names. This reduces overhead. It does not apply when using PGI
compilers though.
- Support tracing of applications with more than 500000 tasks.
User tools and API improvements and changes:
- A Score-P installation provides new instrumentation wrappers which
simplify the application instrumentation of autotools and CMake
based projects. Please consult the usage instruction of the
'scorep-wrapper' command.
- The option '--pomp' does not take any options any more.
- Specific options for OPARI2 are passed via the
'--opari=<parameter-list>' option.
- To control instrumentation of OpenMP the options '--openmp' and
'--noopenmp' have been added. Note that for compilations using the
OpenMP compiler-flag, instrumentation is enabled by default.
However, when manually disabling instrumentation via
'--noopenmp', some instrumentation must still be carried out to
ensure a thread-safe execution of the measurement system.
- POMP user instrumentation is no longer automatically activated
together with OpenMP instrumentation. The '--pomp' flag has to be
explicitly specified with the 'scorep' command.
- On Cray systems, compiler instrumentation does not add '-G2' option
anymore because '-G2' disables some optimizations.
- The instrumenter now warns the user if the provided instrumentation
filter won't be used by the active instrumentations.
- The option '--disable-preprocessing' was added to the instrumenter.
It tells the instrumenter to skip all preprocessing related
activities. Useful, e.g., if the input files are already
preprocessed.
Bugfixes:
- Fixed possible mistreatment of a profile node as being in an untied
task.
- Fixed bug in obtaining executable names longer than 512 characters
when using the GNU compiler adapter (applies also to Cray and
Fujitsu compilers).
- The GCC compiler instrumentation plug-in was non-functional for
GCC 5 because of an unnoticed API change. Additionally, the custom
demangling of Fortran module functions is working again.
- The GCC instrumentation plug-in does not instrument the `main`
function in Fortran programs anymore as the main entry point for the
user is `MAIN__`.
- Names assigned to MPI communicators by calls to 'MPI_Comm_set_name'
are now also tracked, even if the corresponding API calls won't be
recorded.
- Fixed MPI library interposition if the link command lists explicitly
'libmpifort' or 'libmpigi'.
------------------- Released version 1.4.2 ---------------------------
Features and improvements:
- The GCC plug-in can also be built on cross build machines and with
the GCC 5 release series.
Bugfixes:
- The OpenMP flag for PGI compilers (-mp) may have a value appended.
In this case, the instrumenter did not detect the OpenMP paradigm
properly. Fixed.
- On Cray systems, a conflict between the -eZ and the -eP flag
occurred if the instrumenter performed preprocessing before OPARI2
instrumentation and the command line contains -eZ. Fixed.
- If the user explicitly requires static Score-P libraries by
specifying --static on the command line, scorep-config provides
also full paths to the dependencies of its libraries, which might
cause problems if the libraries are linked with dynamic
libraries. Fixed.
- The preprocessing step of CUDA source files for the OPARI2
instrumentation did not add preprocessing flags to the preprocessor
invocation. Thus, it became a full compilation step. Fixed.
- Fix exponent in the CUDA metric definitions.
- Fix scorep-config bug on MIC, which always showed an 'Unsupported
target mic. Abort' message.
- Configure checks for PAPI on MIC failed with unresolved symbols to
libpfm. Fixed.
- Help text for --target attribute of scorep-config added.
------------------- Released version 1.4.1 ---------------------------
Bugfixes:
- BG/Q: use optimized MPI rank to SION file mapping (one file per I/O node)
- Fixes in the OpenCL adapter:
- The Score-P instrumenter misinterpreted the OpenCL library as an
input file, if it was given as '-l opencl' on the command line. Fixed.
- Fixed segmentation fault of clReleaseEvent during Score-P OpenCL flush.
- Fixed wrappers of OpenCL 2.0 functions.
- Revised mutex locking.