diff --git a/Checks/PWR004/README.md b/Checks/PWR004/README.md
index e226665..2f86758 100644
--- a/Checks/PWR004/README.md
+++ b/Checks/PWR004/README.md
@@ -21,8 +21,10 @@ improves code readability.
 ### Code example
 
-In the following code, a variable `factor` is used in each iteration of the loop
-to initialize the array `result`.
+#### C
+
+In the following code, a variable `factor` is used in each iteration of the
+loop to initialize the array `result`:
 
 ```c
 void example() {
@@ -36,22 +38,55 @@ void example() {
 }
 ```
 
-Having the scope declared explicitly for each variable improves readability
+Having the scope declared explicitly for each variable improves readability,
 since it makes explicit the scope of all the variables within the parallel
-region.
+region:
 
 ```c
 void example() {
   int factor = 42;
   int result[10];
 
-  #pragma omp parallel for shared(result, factor)
+  #pragma omp parallel for default(none) shared(result, factor)
   for (int i = 0; i < 10; i++) {
     result[i] = factor * i;
   }
 }
 ```
 
+#### Fortran
+
+In the following code, a variable `factor` is used in each iteration of the
+loop to initialize the array `result`:
+
+```f90
+subroutine example()
+  integer :: factor = 42
+  integer :: result(10)
+  integer :: i
+
+  !$omp parallel do
+  do i = 1, 10
+    result(i) = factor * i
+  end do
+end subroutine example
+```
+
+Having the scope declared explicitly for each variable improves readability,
+since it makes explicit the scope of all the variables within the parallel
+region:
+
+```f90
+subroutine example()
+  integer :: factor = 42
+  integer :: result(10)
+  integer :: i
+
+  !$omp parallel do default(none) shared(factor, result) private(i)
+  do i = 1, 10
+    result(i) = factor * i
+  end do
+end subroutine example
+```
+
 ### Related resources
 
 * [PWR004 examples](../PWR004)
diff --git a/Checks/PWR005/README.md b/Checks/PWR005/README.md
index d6a0459..787d1f8 100644
--- a/Checks/PWR005/README.md
+++ b/Checks/PWR005/README.md
@@ -21,10 +21,12 @@ variable.
 ### Code example
 
+#### C
+
 In the following code, a variable `t` is used in each iteration of the loop to
 hold a value that is then assigned to the array `result`. Since no data scoping
-is declared for those variables the default will be used. This makes the
-variable `t` shared which is incorrect since it introduces a race condition.
+is declared for those variables, the default will be used. This makes the
+variable `t` shared, which is incorrect since it introduces a race condition:
 
 ```c
 void example() {
@@ -39,8 +41,8 @@ void example() {
 }
 ```
 
-The following code disables the default scoping which will make the compiler
-raise an error due to unspecified scopes.
+The following code disables the default scoping, which will make the compiler
+raise an error due to unspecified scopes:
 
 ```c
 void example() {
@@ -55,15 +57,15 @@ void example() {
 }
 ```
 
-To fix the code the scope of each variable must be specified. The variable `t`
-must be made private to prevent the race condition.
+To fix the code, the scope of each variable must be specified. The variable `t`
+must be made private to prevent the race condition:
 
 ```c
 void example() {
   int t;
   int result[10];
 
   #pragma omp parallel for default(none) shared(result) private(t)
   for (int i = 0; i < 10; i++) {
     t = i + 1;
     result[i] = t;
@@ -71,6 +73,58 @@ void example() {
   }
 }
 
+#### Fortran
+
+In the following code, a variable `t` is used in each iteration of the loop to
+hold a value that is then assigned to the array `result`. 
Since no data scoping
+is declared for those variables, the default will be used. This makes the
+variable `t` shared, which is incorrect since it introduces a race condition:
+
+```f90
+subroutine example()
+  integer :: i, t
+  integer :: result(10)
+
+  !$omp parallel do
+  do i = 1, 10
+    t = i + 1
+    result(i) = t
+  end do
+end subroutine example
+```
+
+The following code disables the default scoping, which will make the compiler
+raise an error due to unspecified scopes:
+
+```f90
+subroutine example()
+  integer :: i, t
+  integer :: result(10)
+
+  !$omp parallel do default(none)
+  do i = 1, 10
+    t = i + 1
+    result(i) = t
+  end do
+end subroutine example
+```
+
+To fix the code, the scope of each variable must be specified. The variable `t`
+must be made private to prevent the race condition:
+
+```f90
+subroutine example()
+  integer :: i, t
+  integer :: result(10)
+
+  !$omp parallel do default(none) shared(result) private(i, t)
+  do i = 1, 10
+    t = i + 1
+    result(i) = t
+  end do
+end subroutine example
+```
+
 ### Related resources
 
 * [PWR005 examples](../PWR005)
diff --git a/Checks/PWR006/README.md b/Checks/PWR006/README.md
index 65fd471..fba9de8 100644
--- a/Checks/PWR006/README.md
+++ b/Checks/PWR006/README.md
@@ -18,9 +18,11 @@ possible.
 ### Code example
 
+#### C
+
 In the following code, arrays `A` and `B` are never written to. However, they
 are privatized and thus each thread will hold a copy of each array, effectively
-using more memory and taking more time to create private copies.
+using more memory and taking more time to create private copies:
 
 ```c
 #define SIZE 5
@@ -39,7 +41,7 @@ To save memory, change their scope to shared. This may also prevent memory
 issues when using arrays, as codes may easily run out of memory for a high
-number of threads.
+number of threads:
 
 ```c
 #define SIZE 5
@@ -49,13 +51,55 @@ void example() {
   int B[SIZE] = {5, 4, 3, 2, 1};
   int sum[SIZE];
 
   #pragma omp parallel for shared(sum, A, B)
   for (int i = 0; i < SIZE; i++) {
     sum[i] = A[i] + B[i];
   }
 }
 ```
 
+#### Fortran
+
+In the following code, arrays `A` and `B` are never written to. However, they
+are privatized and thus each thread will hold a copy of each array, effectively
+using more memory and taking more time to create private copies:
+
+```f90
+subroutine example()
+  implicit none
+  integer :: i
+  integer :: a(5) = [1, 2, 3, 4, 5]
+  integer :: b(5) = [6, 7, 8, 9, 10]
+  integer :: sum(5)
+
+  !$omp parallel do default(none) firstprivate(a, b) shared(sum) private(i)
+  do i = 1, 5
+    sum(i) = a(i) + b(i)
+  end do
+  !$omp end parallel do
+end subroutine example
+```
+
+To save memory, change their scope to shared. 
This may also prevent memory
+issues when using arrays, as codes may easily run out of memory for a high
+number of threads:
+
+```f90
+subroutine example()
+  implicit none
+  integer :: i
+  integer :: a(5) = [1, 2, 3, 4, 5]
+  integer :: b(5) = [6, 7, 8, 9, 10]
+  integer :: sum(5)
+
+  !$omp parallel do default(none) shared(a, b, sum) private(i)
+  do i = 1, 5
+    sum(i) = a(i) + b(i)
+  end do
+  !$omp end parallel do
+end subroutine example
+```
+
 ### Related resources
 
 * [PWR006 examples](../PWR006)
diff --git a/Checks/PWR006/example.f90 b/Checks/PWR006/example.f90
index 07f1c11..cd79c21 100644
--- a/Checks/PWR006/example.f90
+++ b/Checks/PWR006/example.f90
@@ -1,15 +1,15 @@
 ! PWR006: Avoid privatization of read-only variables
 
-program example
+subroutine example()
   implicit none
   integer :: i
   integer :: a(5) = [1, 2, 3, 4, 5]
   integer :: b(5) = [6, 7, 8, 9, 10]
   integer :: sum(5)
 
-  !$omp parallel do default(none) firstprivate(a, b) shared(sum)
+  !$omp parallel do default(none) firstprivate(a, b) shared(sum) private(i)
   do i = 1, 5
-    sum(i) = a(i) + b(i);
+    sum(i) = a(i) + b(i)
   end do
   !$omp end parallel do
-end program example
+end subroutine example
diff --git a/Checks/PWR009/README.md b/Checks/PWR009/README.md
index 1a5be47..61e7bc9 100644
--- a/Checks/PWR009/README.md
+++ b/Checks/PWR009/README.md
@@ -38,6 +38,8 @@ is used.
 ### Code example
 
+#### C
+
 The following code offloads a matrix multiplication computation through the
 `target` construct and then creates a parallel region and distributes the work
 through `for` construct (note that the matrices are statically sized arrays):
@@ -60,9 +62,9 @@ through `for` construct (note that the matrices are statically sized arrays):
 }  // end target
 ```
 
-When offloading to the GPU it is recommended to use an additional level of
+When offloading to the GPU, it is recommended to use an additional level of
 parallelism. This can be achieved by using the `teams` and `distribute`
-constructs, in this case in combination with `parallel for`:
+constructs; in this case, in combination with `parallel for`:
 
 ```c
 #pragma omp target teams distribute parallel for \
@@ -77,6 +79,48 @@ for (size_t i = 0; i < m; i++) {
 }
 ```
 
+#### Fortran
+
+The following code offloads a matrix multiplication computation through the
+`target` construct and then creates a parallel region and distributes the work
+through the `do` construct:
+
+```f90
+!$omp target map(to: A, B) map(tofrom: C)
+!$omp parallel default(none) private(i, j, k) shared(A, B, C)
+!$omp do
+do j = 1, size(C, 2)
+  do k = 1, size(C, 2)
+    do i = 1, size(C, 1)
+      C(i, j) = C(i, j) + A(i, k) * B(k, j)
+    end do
+  end do
+end do
+!$omp end do
+!$omp end parallel
+!$omp end target
+```
+
+When offloading to the GPU, it is recommended to use an additional level of
+parallelism. 
This can be achieved by using the `teams` and `distribute`
+constructs; in this case, in combination with `parallel do`:
+
+```f90
+!$omp target teams distribute parallel do default(none) &
+!$omp& shared(A, B, C) private(i, j, k) map(to: A, B) map(tofrom: C)
+do j = 1, size(C, 2)
+  do k = 1, size(C, 2)
+    do i = 1, size(C, 1)
+      C(i, j) = C(i, j) + A(i, k) * B(k, j)
+    end do
+  end do
+end do
+!$omp end target teams distribute parallel do
+```
+
 ### Related resources
 
 * [PWR009 examples](../PWR009)
diff --git a/Checks/PWR012/README.md b/Checks/PWR012/README.md
index f06ef1f..b303005 100644
--- a/Checks/PWR012/README.md
+++ b/Checks/PWR012/README.md
@@ -31,6 +31,8 @@ movements impacting correctness and even crashes impacting code quality.
 ### Code example
 
+#### C
+
 In the following example, a struct containing two arrays is passed to the `foo`
 function, which only uses one of the arrays:
 
@@ -89,6 +91,90 @@ void example() {
 }
 ```
 
+#### Fortran
+
+In the following example, a derived type containing two arrays is passed to the
+`foo` function, which only uses one of the arrays:
+
+```f90
+program example
+
+  implicit none
+
+  type data
+    integer :: a(10)
+    integer :: b(10)
+  end type data
+
+contains
+
+  subroutine foo(d)
+    implicit none
+    type(data), intent(in) :: d
+    integer :: i, sum
+
+    do i = 1, 10
+      sum = sum + d%a(i)
+    end do
+  end subroutine foo
+
+  subroutine bar()
+    implicit none
+    type(data) :: d
+    integer :: i
+
+    do i = 1, 10
+      d%a(i) = 1
+      d%b(i) = 1
+    end do
+
+    call foo(d)
+  end subroutine bar
+
+end program example
+```
+
+This can be easily addressed by only passing the required array and rewriting
+the procedure body accordingly:
+
+```f90
+program example
+
+  implicit none
+
+  type data
+    integer :: a(10)
+    integer :: b(10)
+  end type data
+
+contains
+
+  subroutine foo(a)
+    implicit none
+    integer, intent(in) :: a(:)
+    integer :: i, sum
+
+    do i = 1, size(a, 1)
+      sum = sum + a(i)
+    end do
+  end subroutine foo
+
+  subroutine bar()
+    implicit none
+    type(data) :: d
+    integer :: i
+
+    do i = 1, 10
+      d%a(i) = 1
+      d%b(i) = 1
+    end do
+
+    call foo(d%a)
+  end subroutine bar
+
+end program example
+```
+
 ### Related resources
 
 * [PWR012 examples](../PWR012)
diff --git a/Checks/PWR012/example.f90 b/Checks/PWR012/example.f90
index 12aba17..75b1a01 100644
--- a/Checks/PWR012/example.f90
+++ b/Checks/PWR012/example.f90
@@ -1,23 +1,37 @@
 ! PWR012: Pass only required fields from derived type as parameters
 
 program example
+
   implicit none
+
   type data
     integer :: a(10)
     integer :: b(10)
   end type data
+
 contains
+
   subroutine foo(d)
     implicit none
     type(data), intent(in) :: d
     integer :: i, sum
+
     do i = 1, 10
       sum = sum + d%a(i)
     end do
   end subroutine foo
+
   subroutine bar()
     implicit none
     type(data) :: d
+    integer :: i
+
+    do i = 1, 10
+      d%a(i) = 1
+      d%b(i) = 1
+    end do
+
     call foo(d)
   end subroutine bar
+
 end program example
diff --git a/Checks/PWR013/README.md b/Checks/PWR013/README.md
index 4eae066..c054536 100644
--- a/Checks/PWR013/README.md
+++ b/Checks/PWR013/README.md
@@ -19,13 +19,15 @@ should be copied to or from the GPU memory. 
### Code example
 
+#### C
+
 In the following example, matrix `B` is copied to the GPU even when it is not
 used:
 
 ```c
 void example(double *A, double *B, double *C) {
   #pragma omp target teams distribute parallel for schedule(auto) shared(A, B) \
       map(to: A[0:100], B[0:100]) map(tofrom: C[0:100])
   for (int i = 0; i < 100; i++) {
     C[i] += A[i];
   }
@@ -37,13 +39,52 @@ This can be easily corrected by removing references to B from all the clauses:
 
 ```c
 void example(double *A, double *B, double *C) {
   #pragma omp target teams distribute parallel for schedule(auto) shared(A) \
       map(to: A[0:100]) map(tofrom: C[0:100])
   for (int i = 0; i < 100; i++) {
     C[i] += A[i];
   }
 }
 ```
 
+#### Fortran
+
+In the following example, matrix `B` is copied to the GPU even when it is not
+used:
+
+```f90
+subroutine example(A, B, C)
+  implicit none
+  integer, intent(in) :: A(:), B(:)
+  integer, intent(inout) :: C(:)
+  integer :: i
+
+  !$omp target teams distribute parallel do schedule(auto) default(none) &
+  !$omp& shared(A, B, C) private(i) map(to: A, B) map(tofrom: C)
+  do i = 1, size(C, 1)
+    C(i) = C(i) + A(i)
+  end do
+  !$omp end target teams distribute parallel do
+end subroutine example
+```
+
+This can be easily corrected by removing references to B from all the clauses:
+
+```f90
+subroutine example(A, B, C)
+  implicit none
+  integer, intent(in) :: A(:), B(:)
+  integer, intent(inout) :: C(:)
+  integer :: i
+
+  !$omp target teams distribute parallel do schedule(auto) default(none) &
+  !$omp& shared(A, C) private(i) map(to: A) map(tofrom: C)
+  do i = 1, size(C, 1)
+    C(i) = C(i) + A(i)
+  end do
+  !$omp end target teams distribute parallel do
+end subroutine example
+```
+
 ### Related resources
 
 * [PWR013 examples](../PWR013)
diff --git a/Checks/PWR013/example-omp.f90 b/Checks/PWR013/example-omp.f90
index b5363ae..4b49670 100644
--- a/Checks/PWR013/example-omp.f90
+++ b/Checks/PWR013/example-omp.f90
@@ -7,7 +7,7 @@ subroutine example(A, B, C)
   integer :: i
 
   !$omp target teams distribute parallel do schedule(auto) default(none) &
-  !$omp& shared(A, B, C) map(to: A, B) map(tofrom: C)
+  !$omp& shared(A, B, C) private(i) map(to: A, B) map(tofrom: C)
   do i = 1, size(C, 1)
     C(i) = C(i) + A(i)
   end do
diff --git a/Checks/PWR015/README.md b/Checks/PWR015/README.md
index 9b9c1d4..63697c7 100644
--- a/Checks/PWR015/README.md
+++ b/Checks/PWR015/README.md
@@ -19,13 +19,15 @@ required data should be copied to or from the GPU memory. 
### Code example
 
+#### C
+
 The following code performs the sum of two arrays:
 
 ```c
 void example() {
   int A[100], B[100], sum[100];
   #pragma omp target map(to: A[0:100], B[0:100]) map(from: sum[0:100])
   #pragma omp parallel for
   for (int i = 0; i < 50; i++) {
     sum[i] = A[i] + B[i];
   }
@@ -39,13 +41,54 @@ there is no need to transfer the entire arrays:
 
 ```c
 void example() {
   int A[100], B[100], sum[100];
   #pragma omp target map(to: A[0:50], B[0:50]) map(from: sum[0:50])
   #pragma omp parallel for
   for (int i = 0; i < 50; i++) {
     sum[i] = A[i] + B[i];
   }
 }
 ```
 
+#### Fortran
+
+The following code performs the sum of two arrays:
+
+```f90
+subroutine example(A, B, sum)
+  implicit none
+  integer, intent(in) :: A(:), B(:)
+  integer, intent(out) :: sum(:)
+  integer :: i
+
+  !$omp target parallel do default(none) shared(A, B, sum) private(i) &
+  !$omp& map(to: a, b) map(from: sum)
+  do i = 1, size(sum, 1) / 2
+    sum(i) = A(i) + B(i)
+  end do
+  !$omp end target parallel do
+end subroutine example
+```
+
+However, only half of the total array elements are actually being used. Thus,
+there is no need to transfer the entire arrays:
+
+```f90
+subroutine example(A, B, sum)
+  implicit none
+  integer, intent(in) :: A(:), B(:)
+  integer, intent(out) :: sum(:)
+  integer :: i, half_size
+
+  half_size = size(sum, 1) / 2
+
+  !$omp target parallel do default(none) shared(A, B, sum) private(i) &
+  !$omp& map(to: a(1:half_size), b(1:half_size)) map(from: sum(1:half_size))
+  do i = 1, half_size
+    sum(i) = A(i) + B(i)
+  end do
+  !$omp end target parallel do
+end subroutine example
+```
+
 ### Related resources
 
 * [PWR015 examples](../PWR015)
diff --git a/Checks/PWR015/example-omp.f90 b/Checks/PWR015/example-omp.f90
index 58a4064..66c7b66 100644
--- a/Checks/PWR015/example-omp.f90
+++ b/Checks/PWR015/example-omp.f90
@@ -6,7 +6,7 @@ subroutine example(A, B, sum)
   integer, intent(out) :: sum(:)
   integer :: i
 
-  !$omp target parallel do default(none) shared(A, B, sum) &
+  !$omp target parallel do default(none) shared(A, B, sum) private(i) &
   !$omp& map(to: a, b) map(from: sum)
   do i = 1, size(sum, 1) / 2
     sum(i) = A(i) + B(i)
diff --git a/Checks/PWR016/README.md b/Checks/PWR016/README.md
index ce34989..1bc1e6b 100644
--- a/Checks/PWR016/README.md
+++ b/Checks/PWR016/README.md
@@ -21,6 +21,8 @@ field.
 ### Code example
 
+#### C
+
 The following example shows a loop processing the `x` and `y` coordinates for
 an array of points:
 
@@ -41,8 +43,8 @@ void example() {
 ```
 
 This could seem like an example where using an Array-of-Structs is justified.
-However, since the `z` coordinate is never accessed, the memory subsystem is not
-used optimally. This could be avoided by creating one array for each
+However, since the `z` coordinate is never accessed, the memory subsystem is
+not used optimally. 
This could be avoided by creating one array for each coordinate: ```c @@ -57,6 +59,55 @@ void example() { } ``` +#### Fortran + +The following example shows a loop processing the `x` and `y` coordinates for +an array of points: + +```f90 +program main + implicit none + + type point + integer :: x + integer :: y + integer :: z + end type point + +contains + + subroutine foo() + implicit none + type(point) :: points(1000) + integer :: i + + do i = 1, 1000 + points(i)%x = 1 + points(i)%y = 1 + end do + end subroutine foo + +end program main +``` + +This could seem like an example where using an Array-of-Structs is justified. +However, since the `z` coordinate is never accessed, the memory subsystem is +not used optimally. This could be avoided by creating one array for each +coordinate: + +```f90 +subroutine foo() + implicit none + integer :: points_x(1000), points_y(1000), points_z(1000) + integer :: i + + do i = 1, 1000 + points_x(i) = 1 + points_y(i) = 1 + end do +end subroutine foo +``` + ### Related resources * [PWR016 examples](../PWR016) diff --git a/Checks/PWR016/example.f90 b/Checks/PWR016/example.f90 index 15c5d7e..9956153 100644 --- a/Checks/PWR016/example.f90 +++ b/Checks/PWR016/example.f90 @@ -2,19 +2,24 @@ program main implicit none + type point integer :: x integer :: y integer :: z end type point + contains + subroutine foo() implicit none - type(point) :: points(100) + type(point) :: points(1000) integer :: i - do i = 1, 100 + + do i = 1, 1000 points(i)%x = 1 points(i)%y = 1 end do end subroutine foo + end program main diff --git a/Checks/PWR018/README.md b/Checks/PWR018/README.md index 185939a..878b99d 100644 --- a/Checks/PWR018/README.md +++ b/Checks/PWR018/README.md @@ -23,8 +23,10 @@ control flow logic which the compilers cannot vectorize automatically. ### Code example +#### C + In the following example, the loop is invoking a recursive function computing -the Fibonacci number. This recursion inhibits the vectorization of the loop. +the Fibonacci number. This recursion inhibits the vectorization of the loop: ```c double fib(unsigned n) { @@ -46,7 +48,7 @@ double example(unsigned times) { } ``` -Fibonacci's sequence can be calculated non-recursively: +As an alternative, Fibonacci's sequence can be calculated non-recursively: ```c double example(unsigned times) { @@ -63,6 +65,66 @@ double example(unsigned times) { } ``` +#### Fortran + +In the following example, the loop is invoking a recursive function computing +the Fibonacci number. 
This recursion inhibits the vectorization of the loop: + +```f90 +module mod_fibonacci + contains + recursive function fibonacci(n) result(fibo) + implicit none + integer, intent(in) :: n + integer :: fibo + + if (n == 0) then + fibo = 0 + else if (n == 1) then + fibo = 1 + else + fibo = fibonacci(n - 1) + fibonacci(n - 2) + end if + end function fibonacci +end module mod_fibonacci + +subroutine example(times) + use mod_fibonacci, only : fibonacci + + implicit none + integer, intent(in) :: times + integer :: i, sum + + sum = 0 + + do i = 0, times - 1 + sum = sum + fibonacci(i) + end do +end subroutine example +``` + +As an alternative, Fibonacci's sequence can be calculated non-recursively: + +```f90 +subroutine example(times) + implicit none + integer, intent(in) :: times + integer :: i, sum + integer :: fib_0, fib_1, fib + + sum = 0 + fib_0 = 0 + fib_1 = 1 + + do i = 2, times - 1 + fib = fib_0 + fib_1 + sum = sum + fib + fib_0 = fib_1 + fib_1 = fib + end do +end subroutine example +``` + ### Related resources * [PWR018 examples](../PWR018) diff --git a/Checks/PWR018/example.f90 b/Checks/PWR018/example.f90 index 21184c7..4ae66c0 100644 --- a/Checks/PWR018/example.f90 +++ b/Checks/PWR018/example.f90 @@ -24,7 +24,9 @@ subroutine example(times) integer, intent(in) :: times integer :: i, sum - do i = 1, times + sum = 0 + + do i = 0, times - 1 sum = sum + fibonacci(i) end do end subroutine example diff --git a/Checks/PWR019/README.md b/Checks/PWR019/README.md index 3d12f4f..0cf5bc7 100644 --- a/Checks/PWR019/README.md +++ b/Checks/PWR019/README.md @@ -24,6 +24,8 @@ increase [vectorization](../../Glossary/Vectorization.md) performance. ### Code example +#### C + The following code shows two nested loops, where the outer one has a larger trip count than the inner one: @@ -42,7 +44,7 @@ The value of `margin` is not known at compile time, but it is typically low. We can increase the loop trip count of the innermost loop by performing loop interchange. To do loop interchange, the loop over `j` and the loop over `k` need to be perfectly nested. We can make them perfectly nested by moving the -initialization `bb[i][j] = 0.0` into a separate loop. +initialization `bb[i][j] = 0.0` into a separate loop: ```c for (int i = 0; i < n; i++) { @@ -58,6 +60,43 @@ for (int i = 0; i < n; i++) { } ``` +#### Fortran + +The following code shows two nested loops, where the outer one has a larger +trip count than the inner one: + +```f90 +do i = 1, n + do j = margin, n - margin + bb(i, j) = 0.0 + + do k = -margin, margin + bb(i, j) = bb(i, j) + aa(i, j + k) + end do + end do +end do +``` + +The value of `margin` is not known at compile time, but it is typically low. We +can increase the loop trip count of the innermost loop by performing loop +interchange. To do loop interchange, the loop over `j` and the loop over `k` +need to be perfectly nested. We can make them perfectly nested by moving the +initialization `bb(i, j) = 0.0` into a separate loop: + +```f90 +do i = 1, n + do j = margin, n - margin + bb(i, j) = 0.0 + end do + + do k = -margin, margin + do j = margin, n - margin + bb(i, j) = bb(i, j) + aa(i, j + k) + end do + end do +end do +``` + ### Related resources * [PWR019 examples](../PWR019) diff --git a/Checks/PWR022/README.md b/Checks/PWR022/README.md index 6d445c7..8572e58 100644 --- a/Checks/PWR022/README.md +++ b/Checks/PWR022/README.md @@ -29,19 +29,23 @@ it will always be either true or false. ### Code example +#### C + The following loop contains a condition that is invariant for all its iterations. 
Not only may this introduce an unnecessary redundant comparison, it
-may also make the vectorization of the loop more difficult for some compilers.
+may also make the vectorization of the loop more difficult for some compilers:
 
 ```c
 int example(int *A, int n) {
   int total = 0;
+
   for (int i = 0; i < n; ++i) {
     if (n < 10) {
       total++;
     }
     A[i] = total;
   }
+
   return total;
 }
 ```
@@ -53,6 +57,7 @@ follows:
 
 ```c
 int example(int *A, int n) {
   int total = 0;
+
   if (n < 10) {
     for (int i = 0; i < n; ++i) {
       A[i] = ++total;
@@ -62,10 +67,57 @@ int example(int *A, int n) {
     }
   }
+
   return total;
 }
 ```
 
+#### Fortran
+
+The following loop contains a condition that is invariant for all its
+iterations. Not only may this introduce an unnecessary redundant comparison, it
+may also make the vectorization of the loop more difficult for some compilers:
+
+```f90
+subroutine example(array)
+  integer, intent(out) :: array(:)
+  integer :: i, total
+
+  total = 0
+
+  do i = 1, size(array, 1)
+    if (size(array, 1) < 10) then
+      total = total + 1
+    end if
+    array(i) = total
+  end do
+end subroutine example
+```
+
+The loop invariant can be extracted out of the loop in a simple way, by
+duplicating the loop body and removing the condition. The resulting code is as
+follows:
+
+```f90
+subroutine example(array)
+  integer, intent(out) :: array(:)
+  integer :: i, total
+
+  total = 0
+
+  if (size(array, 1) < 10) then
+    do i = 1, size(array, 1)
+      total = total + 1
+      array(i) = total
+    end do
+  else
+    do i = 1, size(array, 1)
+      array(i) = total
+    end do
+  end if
+end subroutine example
+```
+
 ### Related resources
 
 * [PWR022 examples](../PWR022)
diff --git a/Checks/PWR029/README.md b/Checks/PWR029/README.md
index 9e5d326..0c390c5 100644
--- a/Checks/PWR029/README.md
+++ b/Checks/PWR029/README.md
@@ -34,8 +34,10 @@ alternative ways of coding that are more hardware-friendly.
 ### Code example
 
+#### C
+
 In this example, the access to array `a` using the variable `k` can be
-problematic for some compilers to optimize.
+challenging to optimize for some compilers:
 
 ```c
 void example(float *a, float *b, unsigned size) {
@@ -47,8 +49,8 @@ void example(float *a, float *b, unsigned size) {
 }
 ```
 
-We can fix it by removing the variable `k` and the corresponding increment
-statement:
+Since `k == i` in this context, we can fix the issue by removing the variable
+`k` altogether and the corresponding increment statement:
 
 ```c
 for (unsigned i = 0; i < size; i++) {
@@ -56,6 +58,40 @@ for (unsigned i = 0; i < size; i++) {
 }
 ```
 
+#### Fortran
+
+In this example, the access to array `a` using the variable `k` can be
+challenging to optimize for some compilers:
+
+```f90
+subroutine example(a, b)
+  real, intent(in) :: a(:)
+  real, intent(out) :: b(:)
+  integer :: i, k
+
+  k = 1
+  do i = 1, size(b, 1)
+    b(i) = a(k) + 1
+    k = k + 1
+  end do
+end subroutine example
+```
+
+Since `k == i` in this context, we can fix the issue by removing the variable
+`k` altogether and the corresponding increment statement:
+
+```f90
+subroutine example(a, b)
+  real, intent(in) :: a(:)
+  real, intent(out) :: b(:)
+  integer :: i
+
+  do i = 1, size(b, 1)
+    b(i) = a(i) + 1
+  end do
+end subroutine example
+```
+
 ### Related resources
 
 * [PWR029 examples](../PWR029)
diff --git a/Checks/PWR034/README.md b/Checks/PWR034/README.md
index b46e446..c5cec69 100644
--- a/Checks/PWR034/README.md
+++ b/Checks/PWR034/README.md
@@ -21,9 +21,10 @@ is an example non-unit stride, where the stride is the column width. 
### Code example
 
+#### C
+
 The following code shows a loop with a strided access to array `a` with stride
-`2`. Avoiding it would require changing the data layout of the program, in
-general.
+`2`. Avoiding it would require changing the data layout of the program:
 
 ```c
 void example(float *a, unsigned size) {
@@ -34,7 +35,7 @@ void example(float *a, unsigned size) {
 ```
 
 Another code with strided accesses is show below. In this case, both variables
-`a` and `b` have a stride `LEN`.
+`a` and `b` have a stride `LEN`:
 
 ```c
 for (int i = 0; i < LEN; ++i) {
@@ -44,10 +45,11 @@ for (int i = 0; i < LEN; ++i) {
 }
 ```
 
-Note that by using loop interchange, the loop order changes from `ij` to
-`ji`. The resulting code shown below has sequential accesses (i.e. stride
-`1`) for variables `ij` and `b` in the scope of the innermost loop. Note in this
-case a code change solves the issue, no change in data layout is required.
+Note that by using loop interchange, the loop order changes from `ij` to `ji`.
+The resulting code shown below has sequential accesses (i.e., stride `1`) for
+variables `a` and `b` in the scope of the innermost loop. As a result, a
+simple code change solves the issue in this scenario, without requiring
+disruptive changes in data layout:
 
 ```c
 for (int j = 1; j < LEN; ++j) {
@@ -57,6 +59,47 @@ for (int j = 1; j < LEN; ++j) {
 }
 ```
 
+#### Fortran
+
+The following code shows a loop with a strided access to array `a` with stride
+`2`. Avoiding it would require changing the data layout of the program:
+
+```f90
+subroutine example(a)
+  real, intent(out) :: a(:)
+  integer :: i
+
+  do i = 1, size(a, 1), 2
+    a(i) = 0.0
+  end do
+end subroutine example
+```
+
+Another code with strided accesses is shown below. In this case, both variables
+`a` and `b` have dimensions `(LEN, LEN)`, and thus are implicitly accessed with
+a stride of `LEN` elements as Fortran uses column-major order:
+
+```f90
+do i = 1, size(a, 1)
+  do j = 2, size(a, 2)
+    a(i, j) = a(i, j - 1) + b(i, j)
+  end do
+end do
+```
+
+Note that by using loop interchange, the loop order changes from `ij` to `ji`.
+The resulting code shown below has sequential accesses (i.e., stride `1`) in the
+scope of the innermost loop. As a result, a simple code change solves the issue
+in this scenario, without requiring disruptive changes in data layout:
+
+```f90
+do j = 2, size(a, 2)
+  do i = 1, size(a, 1)
+    a(i, j) = a(i, j - 1) + b(i, j)
+  end do
+end do
+```
+
 ### Related resources
 
 * [PWR034 examples](../PWR034)
diff --git a/Checks/PWR035/README.md b/Checks/PWR035/README.md
index b2716aa..b089c41 100644
--- a/Checks/PWR035/README.md
+++ b/Checks/PWR035/README.md
@@ -20,11 +20,13 @@ consecutive positions because the latter maximises
 ### Code example
 
+#### C
+
 Consider the example code below to illustrate the presence of non-consecutive
 access patterns. The elements of array `a` are accessed in a non-consecutive
-manner. In the scope of the outer loop `for_i`, all the iterations access the
-first row of the array. Thus, the code exhibits repeated accesses to all the
-elements of the first row, a total number of times equal to `rows`.
+manner. In the scope of the outer loop, `for (i)`, all the iterations access
+the first row of the array. 
Thus, the code exhibits repeated accesses to all +the elements of the first row, a total number of times equal to `rows`: ```c void example(float **a, unsigned rows, unsigned cols) { @@ -36,6 +38,28 @@ void example(float **a, unsigned rows, unsigned cols) { } ``` +#### Fortran + +Consider the example code below to illustrate the presence of non-consecutive +access patterns. The elements of array `a` are accessed in a non-consecutive +manner. In the scope of the outer loop, `do j`, all the iterations access the +first column of the array. Thus, the code exhibits repeated accesses to all the +elements of the first column, a total number of times equal to `size(a, 2)`: + +```f90 +subroutine example(a) + implicit none + integer, intent(out) :: a(:, :) + integer :: i, j + + do j = 1, size(a, 2) + do i = 1, size(a, 1) + a(i, 1) = 0 + end do + end do +end subroutine example +``` + ### Related resources * [PWR035 examples](../PWR035) diff --git a/Checks/PWR035/example.f90 b/Checks/PWR035/example.f90 index fdd7ac8..86daf9c 100644 --- a/Checks/PWR035/example.f90 +++ b/Checks/PWR035/example.f90 @@ -7,7 +7,7 @@ subroutine example(a) do j = 1, size(a, 2) do i = 1, size(a, 1) - a(1, j) = 0 + a(i, 1) = 0 end do end do end subroutine example diff --git a/Checks/PWR036/README.md b/Checks/PWR036/README.md index b9cd931..b9dc90b 100644 --- a/Checks/PWR036/README.md +++ b/Checks/PWR036/README.md @@ -21,10 +21,12 @@ positions because the latter improves ### Code example +#### C + Consider the example code below to illustrate the presence of indirect access patterns. The elements of array `a` are accessed in an indirect manner through -the array b. Thus, the code exhibits random accesses that cannot be predicted -before the actual execution of the code. +the array `b`. Thus, the code exhibits random accesses that cannot be predicted +before the actual execution of the code: ```c void example(float *a, unsigned *b, unsigned size) { @@ -34,11 +36,11 @@ void example(float *a, unsigned *b, unsigned size) { } ``` -Next, consider another example code where memory access patterns are optimized -in order to improve locality of reference. More specifically, the elements of -array `a` are accessed indirectly, through array `index`. What this means is -that the program is accessing random elements of the array `a`, which leads to a -low performance because of the poor usage of the memory subsystem. +Next, consider another example code where memory access patterns can be +optimized to improve locality of reference. The elements of the array `a` are +accessed indirectly through the array `index`. Consequently, the program +accesses random elements of the array `a`, which leads to a low performance due +to a poor usage of the memory subsystem: ```c for (int i = 0; i < LEN_1D; ++i) { @@ -49,10 +51,12 @@ for (int i = 0; i < LEN_1D; ++i) { ``` The alternative implementation shown below takes advantage of loop interchange -to improve locality of reference. Now, the loop over `j` becomes the outer loop, -and the loop over `i` becomes the inner loop. By doing this, the access to -`a[index[j]]` is an access to a constant memory location, since the value of `j` -doesn't change inside the loop. This leads to performance improvement. +to improve locality of reference. Now, the loop over `j` becomes the outer +loop, and the loop over `i` becomes the inner loop. 
As a result, the access to
+`a[index[j]]` is repeated across the iterations of the inner loop since the
+value of `j` doesn't change, resulting in accesses to a constant memory
+location. This leads to a better usage of the memory subsystem, and thus, to a
+performance improvement:
 
 ```c
 for (int j = 1; j < LEN_1D; j++) {
@@ -62,6 +66,56 @@ for (int j = 1; j < LEN_1D; j++) {
 }
 ```
 
+#### Fortran
+
+Consider the example code below to illustrate the presence of indirect access
+patterns. The elements of array `a` are accessed in an indirect manner through
+the array `b`. Thus, the code exhibits random accesses that cannot be predicted
+before the actual execution of the code:
+
+```f90
+subroutine example(a, b)
+  implicit none
+  integer, intent(out) :: a(:)
+  integer, intent(in) :: b(:)
+  integer :: i
+
+  do i = 1, size(a, 1)
+    a(b(i)) = 0
+  end do
+end subroutine example
+```
+
+Next, consider another example code where memory access patterns can be
+optimized to improve locality of reference. The elements of the array `a` are
+accessed indirectly through the array `index`. Consequently, the program
+accesses random elements of the array `a`, which leads to a low performance due
+to a poor usage of the memory subsystem:
+
+```f90
+do i = 1, size(c, 1)
+  do j = 2, size(index, 1)
+    c(i) = c(i) + a(index(j))
+  end do
+end do
+```
+
+The alternative implementation shown below takes advantage of loop interchange
+to improve locality of reference. Now, the loop over `j` becomes the outer
+loop, and the loop over `i` becomes the inner loop. As a result, the access to
+`a(index(j))` is repeated across the iterations of the inner loop since the
+value of `j` doesn't change, resulting in accesses to a constant memory
+location. This leads to a better usage of the memory subsystem, and thus, to a
+performance improvement:
+
+```f90
+do j = 2, size(index, 1)
+  do i = 1, size(c, 1)
+    c(i) = c(i) + a(index(j))
+  end do
+end do
+```
+
 ### Related resources
 
 * [PWR036 examples](../PWR036)
diff --git a/Checks/PWR039/README.md b/Checks/PWR039/README.md
index 96e6235..73b8124 100644
--- a/Checks/PWR039/README.md
+++ b/Checks/PWR039/README.md
@@ -33,37 +33,80 @@ additionally improves performance.
 ### Code example
 
+#### C
+
 The following code shows two nested loops:
 
 ```c
 void example(double **A, int n) {
-  for (int i = 0; i < n; i++) {
-    for (int j = 0; j < n; j++) {
-      A[j][i] = 0.0;
+  for (int j = 0; j < n; j++) {
+    for (int i = 0; i < n; i++) {
+      A[i][j] = 0.0;
     }
   }
 }
 ```
 
-The matrix `A` is accessed column-wise, which is inefficient. To fix it, we
-perform the loop interchange of loops over `i` and `j`. After the interchange,
-the loop over `j` becomes the outer loop and loop over `i` becomes the inner
-loop.
-
-After this modification, the access to matrix `A` is no longer column-wise, but
-row-wise, which is much faster and more efficient. Additionally, the compiler
-can vectorize the inner loop.
+The matrix `A` is accessed column-wise, which is inefficient since C stores
+arrays using row-major order. To fix this issue, a loop interchange can be
+applied on loops `i` and `j`. 
As a result, the loop over `i` becomes the outer +loop, and the loop over `j` becomes the inner one: ```c void example(double **A, int n) { - for (int j = 0; j < n; j++) { - for (int i = 0; i < n; i++) { - A[j][i] = 0.0; + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + A[i][j] = 0.0; } } } ``` +After this modification, the access to matrix `A` is no longer column-wise, but +row-wise, resulting in a more efficient usage of the memory subsystem, and +thus, faster execution. Additionally, this optimization can help the compiler +vectorize the inner loop. + +#### Fortran + +The following code shows two nested loops: + +```f90 +subroutine example(A) + real, intent(out) :: A(:, :) + integer :: i, j + + do i = 1, size(A, 1) + do j = 1, size(A, 2) + A(i, j) = 0.0 + end do + end do +end subroutine example +``` + +The matrix `A` is accessed row-wise, which is inefficient since Fortran stores +arrays using column-major order. To fix this issue, a loop interchange can be +applied on loops `i` and `j`. As a result, the loop over `j` becomes the outer +loop, and the loop over `i` becomes the inner one: + +```f90 +subroutine example(A) + real, intent(out) :: A(:, :) + integer :: i, j + + do j = 1, size(A, 2) + do i = 1, size(A, 1) + A(i, j) = 0.0 + end do + end do +end subroutine example +``` + +After this modification, the access to matrix `A` is no longer row-wise, but +column-wise, resulting in a more efficient usage of the memory subsystem, and +thus, faster execution. Additionally, this optimization can help the compiler +vectorize the inner loop. + ### Related resources * [Source code examples and solutions](../PWR039) diff --git a/Checks/PWR040/README.md b/Checks/PWR040/README.md index 61f89be..fb09f24 100644 --- a/Checks/PWR040/README.md +++ b/Checks/PWR040/README.md @@ -36,11 +36,13 @@ speed. ### Code example +#### C + The following code shows two nested loops. The matrix `B` is accessed [column-wise](../../Glossary/Row-major-and-column-major-order.md), which is inefficient. [Loop interchange](../../Glossary/Loop-interchange.md) doesn't help either, because fixing the inefficient memory access pattern for `B` would -introduce an inefficient memory access pattern for `A`. +introduce an inefficient memory access pattern for `A`: ```c void example(double **A, double **B, int n) { @@ -67,6 +69,44 @@ for (int ii = 0; ii < n; ii += TILE_SIZE) { } ``` +#### Fortran + +The following code shows two nested loops. The matrix `B` is accessed +[row-wise](../../Glossary/Row-major-and-column-major-order.md), which is +inefficient. [Loop interchange](../../Glossary/Loop-interchange.md) doesn't +help either, because fixing the inefficient memory access pattern for `B` would +introduce an inefficient memory access pattern for `A`: + +```f90 +subroutine example(a, b) + implicit none + real, dimension(:, :), intent(out) :: a + real, dimension(:, :), intent(in) :: b + integer :: i, j + + do j = 1, size(a, 2) + do i = 1, size(a, 1) + a(i, j) = b(j, i) + end do + end do +end subroutine example +``` + +After applying loop tiling, the locality of reference is improved and the +performance is better. 
The tiled version of this loop nest is as follows:
+
+```f90
+do jj = 1, size(a, 2), TILE_SIZE
+  do ii = 1, size(a, 1), TILE_SIZE
+    do j = jj, MIN(jj + TILE_SIZE - 1, size(a, 2))
+      do i = ii, MIN(ii + TILE_SIZE - 1, size(a, 1))
+        a(i, j) = b(j, i)
+      end do
+    end do
+  end do
+end do
+```
+
 ### Related resources
 
 * [PWR040 examples](../PWR040)
diff --git a/Checks/PWR043/README.md b/Checks/PWR043/README.md
index 675f88d..13d9aa7 100644
--- a/Checks/PWR043/README.md
+++ b/Checks/PWR043/README.md
@@ -40,29 +40,31 @@ reduction variable, loop interchange is not directly applicable.
 ### Code example
 
-Have a look at the following code:
+#### C
 
 ```c
 for (int i = 0; i < n; i++) {
   double s = 0.0;
+
   for (int j = 0; j < n; j++) {
     s += a[j][i];
   }
+
   b[i] = s;
 }
 ```
 
 With regards to the innermost loop, the memory access pattern of the matrix `a`
-is strided, and this loop can profit from loop interchange, but reduction
-variable initialization on line 2 and reduction variable usage on line 6 prevent
-it.
+is strided, and this loop can benefit from loop interchange. However, the
+reduction variable initialization (line 2) and usage (line 8) prevent it.
 
-To make the loop vectorizable, we remove the temporary scalar value `s` and
+To make the loop vectorizable, we can remove the temporary scalar value `s` and
 replace it with direct writes to `b[i]`:
 
 ```c
 for (int i = 0; i < n; i++) {
   b[i] = 0.0;
+
   for (int j = 0; j < n; j++) {
     b[i] += a[j][i];
   }
@@ -76,6 +78,7 @@ separate loop. The result looks like this:
 for (int i = 0; i < n; i++) {
   b[i] = 0.0;
 }
+
 for (int i = 0; i < n; i++) {
   for (int j = 0; j < n; j++) {
     b[i] += a[j][i];
@@ -84,17 +87,18 @@ for (int i = 0; i < n; i++) {
 ```
 
 The first loop (line 1) is not performance critical, since it is not nested. On
-the other hand, the second loop (line 4) is performance critical, since it
+the other hand, the second loop (line 5) is performance critical, since it
 contains the loop nest.
 
-Fortunately, the loop nest is now perfectly nested, so that loop interchange is
-applicable. The final result looks like this (note the order of the nested loops
-is now `ji` instead of the original order `ij`):
+Fortunately, the loop nest is now perfectly nested, making loop interchange
+applicable. The final result has the nested loops in `ji` order instead of the
+original `ij` order:
 
 ```c
 for (int i = 0; i < n; i++) {
   b[i] = 0.0;
 }
+
 for (int j = 0; j < n; j++) {
   for (int i = 0; i < n; i++) {
     b[i] += a[j][i];
@@ -102,6 +106,72 @@ for (int j = 0; j < n; j++) {
 }
 ```
 
+#### Fortran
+
+```f90
+do i = 1, size(b, 1)
+  s = 0.0
+
+  do j = 1, size(a, 2)
+    s = s + a(i, j)
+  end do
+
+  b(i) = s
+end do
+```
+
+With regards to the innermost loop, the memory access pattern of the matrix `a`
+is strided, and this loop can benefit from loop interchange. However, the
+reduction variable initialization (line 2) and usage (line 8) prevent it.
+
+To make the loop vectorizable, we can remove the temporary scalar value `s` and
+replace it with direct writes to `b(i)`:
+
+```f90
+do i = 1, size(b, 1)
+  b(i) = 0.0
+
+  do j = 1, size(a, 2)
+    b(i) = b(i) + a(i, j)
+  end do
+end do
+```
+
+After doing this, we can use loop fission to move the statement on line 2 to a
+separate loop. The result looks like this:
+
+```f90
+do i = 1, size(b, 1)
+  b(i) = 0.0
+end do
+
+do i = 1, size(b, 1)
+  do j = 1, size(a, 2)
+    b(i) = b(i) + a(i, j)
+  end do
+end do
+```
+
+The first loop (line 1) is not performance critical, since it is not nested. 
On
+the other hand, the second loop (line 5) is performance critical, since it
+contains the loop nest.
+
+Fortunately, the loop nest is now perfectly nested, making loop interchange
+applicable. The final result has the nested loops in `ji` order instead of the
+original `ij` order:
+
+```f90
+do i = 1, size(b, 1)
+  b(i) = 0.0
+end do
+
+do j = 1, size(a, 2)
+  do i = 1, size(b, 1)
+    b(i) = b(i) + a(i, j)
+  end do
+end do
+```
+
 ### Related resources
 
 * [Source code examples and solutions](../PWR043/)
diff --git a/Checks/PWR049/README.md b/Checks/PWR049/README.md
index 7c1134c..e95db1f 100644
--- a/Checks/PWR049/README.md
+++ b/Checks/PWR049/README.md
@@ -27,9 +27,9 @@ vectorization efficiency.
 ### Code examples
 
-#### Example 1
+#### C
 
-Have a look at the following simple code:
+##### Example 1
 
 ```c
 for (int i = 0; i < n; ++i) {
@@ -48,6 +48,7 @@ without computing any conditional statement:
 
 ```c
 a[0] = 0;
+
 for (int i = 1; i < n; ++i) {
   a[i] = 1;
 }
@@ -59,7 +60,7 @@ For illustrative purposes, an example code with a loop nest is shown below:
 ```c
 for (int i = 0; i < n; ++i) {
   for (int j = 0; j < n; ++j) {
-    if (i == 0) {
+    if (j == 0) {
       a[i][j] = 0;
     } else {
       a[i][j] = a[i][j - 1] + b[i][j];
@@ -68,24 +69,23 @@ for (int i = 0; i < n; ++i) {
   }
 }
 ```
 
-The condition on line 3 depends on the iterator `i` of the outer loop and can be
-removed from the inner loop as follows:
+The condition on line 3 depends on the iterator `j` of the inner loop and can
+be removed as follows:
 
 ```c
 for (int i = 0; i < n; ++i) {
   a[i][0] = 0;
+
   for (int j = 1; j < n; ++j) {
     a[i][j] = a[i][j - 1] + b[i][j];
   }
 }
 ```
 
-In the example codes shown above the resulting loops are branchless, avoiding
+In the example codes shown above, the resulting loops are branchless, avoiding
 redundant computations of predictable conditional instructions.
 
-#### Example 2: Loop fission
-
-Have a look at the following code:
+##### Example 2: Loop fission
 
 ```c
 for (int i = 0; i < n; ++i) {
@@ -98,21 +98,22 @@ for (int i = 0; i < n; ++i) {
 ```
 
 The condition on line 2 depends on the iterator `i` and can be removed by
-splitting the inner loop over `i` into two loops:
+splitting the loop over `i` into two loops:
 
 ```c
 for (int i = 0; i < 10; ++i) {
   a[i] = 0;
 }
+
 for (int i = 10; i < n; ++i) {
   a[i] = 1;
 }
 ```
 
 The first loop iterates from `0` to `9`, and the second loop iterates from `10`
-until `n`. The condition is removed from the loop.
+until `n - 1`. The condition is removed from the loop.
 
-#### Example 3: Loop unrolling
+##### Example 3: Loop unrolling
 
 Here is another example of a iterator-dependent condition in the loop body:
 
 ```c
 for (int i = 0; i < n; i += 2) {
@@ -139,6 +140,118 @@ for (int i = 0; i < n; i += 2) {
 
 Loop unrolling changes the increment of iterator variable `i`, so now it is 2
 (see loop header at line 1). The condition is gone after this modification.
 
+#### Fortran
+
+##### Example 1
+
+```f90
+do i = 1, size(a, 1)
+  if (i == 1) then
+    a(i) = 0
+  else
+    a(i) = 1
+  end if
+end do
+```
+
+The condition on line 2 depends on the iterator `i` and can be removed by
+computing the first array element `a(1)` outside the loop. Thus, the loop
+iterator starts at 2 and the loop initializes the remaining array elements
+without computing any conditional statement:
+
+```f90
+a(1) = 0
+
+do i = 2, size(a, 1)
+  a(i) = 1
+end do
+```
+
+The iterator-dependent condition can appear in more complicated loops as well. 
+For illustrative purposes, an example code with a loop nest is shown below:
+
+```f90
+do j = 1, size(a, 2)
+  do i = 1, size(a, 1)
+    if (i == 1) then
+      a(i, j) = 0
+    else
+      a(i, j) = a(i - 1, j) + b(i, j)
+    end if
+  end do
+end do
+```
+
+The condition on line 3 depends on the iterator `i` of the inner loop and can
+be removed as follows:
+
+```f90
+do j = 1, size(a, 2)
+  a(1, j) = 0
+
+  do i = 2, size(a, 1)
+    a(i, j) = a(i - 1, j) + b(i, j)
+  end do
+end do
+```
+
+In the example codes shown above, the resulting loops are branchless, avoiding
+redundant computations of predictable conditional instructions.
+
+##### Example 2: Loop fission
+
+```f90
+do i = 1, size(a, 1)
+  if (i < 10) then
+    a(i) = 0
+  else
+    a(i) = 1
+  end if
+end do
+```
+
+The condition on line 2 depends on the iterator `i` and can be removed by
+splitting the loop over `i` into two loops:
+
+```f90
+do i = 1, 9
+  a(i) = 0
+end do
+
+do i = 10, size(a, 1)
+  a(i) = 1
+end do
+```
+
+The first loop iterates from `1` to `9`, and the second loop iterates from `10`
+until `size(a, 1)`. The condition is removed from the loop.
+
+##### Example 3: Loop unrolling
+
+Here is another example of an iterator-dependent condition in the loop body:
+
+```f90
+do i = 1, size(a, 1)
+  if (modulo(i, 2) == 0) then
+    a(i) = 1
+  else
+    a(i) = 0
+  end if
+end do
+```
+
+The iterator-dependent condition is on line 2, and can be removed through loop
+unrolling:
+
+```f90
+do i = 1, size(a, 1), 2
+  a(i) = 0
+  a(i + 1) = 1
+end do
+```
+
+Loop unrolling changes the increment of iterator variable `i`, so now it is 2
+(see loop header at line 1). The condition is gone after this modification.
+
 ### Related resources
 
 * [PWR049 examples](../PWR049)
diff --git a/Checks/PWR049/example.f90 b/Checks/PWR049/example.f90
index 16ce0d5..2dcbda7 100644
--- a/Checks/PWR049/example.f90
+++ b/Checks/PWR049/example.f90
@@ -8,10 +8,10 @@ subroutine example(a, b)
 
   do j = 1, size(a, 2)
     do i = 1, size(a, 1)
-      if (j == 1) then
+      if (i == 1) then
         a(i, j) = 0
       else
-        a(i, j) = a(i, j - 1) + b(i, j)
+        a(i, j) = a(i - 1, j) + b(i, j)
       end if
     end do
   end do
diff --git a/Checks/PWR050/README.md b/Checks/PWR050/README.md
index 4bd34ca..71c10c2 100644
--- a/Checks/PWR050/README.md
+++ b/Checks/PWR050/README.md
@@ -31,7 +31,7 @@ biggest challenge to speedup the code.
 ### Code example
 
-Have a look at the following code snippet:
+#### C
 
 ```c
 void example(double *D, double *X, double *Y, int n, double a) {
@@ -47,8 +47,8 @@ independent memory location. Thus, no race conditions can appear at runtime
 related to array `D`, so no specific synchronization is needed.
 
 The code snippet below shows an implementation that uses the OpenMP compiler
-directives for multithreading. Note no synchronization is required to avoid race
-conditions.
+directives for multithreading. Note how no synchronization is required to avoid
+race conditions:
 
 ```c
 void example(double *D, double *X, double *Y, int n, double a) {
@@ -62,6 +62,46 @@ void example(double *D, double *X, double *Y, int n, double a) {
 }
 ```
 
+#### Fortran
+
+```f90
+subroutine example(D, X, Y, a)
+  implicit none
+  real(kind=8), intent(out) :: D(:)
+  real(kind=8), intent(in) :: X(:), Y(:)
+  real(kind=8), intent(in) :: a
+  integer :: i
+
+  do i = 1, size(D, 1)
+    D(i) = a * X(i) + Y(i)
+  end do
+end subroutine example
+```
+
+The loop body has a `forall` pattern, meaning that each iteration of the loop
+can be executed independently and the result in each iteration is written to an
+independent memory location. 
Thus, no race conditions can appear at runtime
+related to array `D`, so no specific synchronization is needed.
+
+The code snippet below shows an implementation that uses the OpenMP compiler
+directives for multithreading. Note how no synchronization is required to avoid
+race conditions:
+
+```f90
+subroutine example(D, X, Y, a)
+  implicit none
+  real(kind=8), intent(out) :: D(:)
+  real(kind=8), intent(in) :: X(:), Y(:)
+  real(kind=8), intent(in) :: a
+  integer :: i
+
+  !$omp parallel do default(none) shared(D, X, Y, a) private(i) schedule(auto)
+  do i = 1, size(D, 1)
+    D(i) = a * X(i) + Y(i)
+  end do
+end subroutine example
+```
+
 ### Related resources
 
 * [PWR050 examples](../PWR050)
diff --git a/Checks/PWR050/example-forall.f90 b/Checks/PWR050/example-forall.f90
index d00c6a2..74f2ed1 100644
--- a/Checks/PWR050/example-forall.f90
+++ b/Checks/PWR050/example-forall.f90
@@ -1,14 +1,13 @@
 ! PWR050: Consider applying multithreading parallelism to forall loop
 
-subroutine example(D, X, Y, n, a)
+subroutine example(D, X, Y, a)
   implicit none
-  integer, intent(in) :: n
+  real(kind=8), intent(out) :: D(:)
+  real(kind=8), intent(in) :: X(:), Y(:)
   real(kind=8), intent(in) :: a
-  real(kind=8), dimension(1:n), intent(in) :: X, Y
-  real(kind=8), dimension(1:n), intent(out) :: D
   integer :: i
 
-  do i = 1, n
+  do i = 1, size(D, 1)
     D(i) = a * X(i) + Y(i)
   end do
 end subroutine example
diff --git a/Checks/PWR051/README.md b/Checks/PWR051/README.md
index 336ae9a..3f38703 100644
--- a/Checks/PWR051/README.md
+++ b/Checks/PWR051/README.md
@@ -31,50 +31,101 @@ biggest challenge to speedup the code.
 ### Code example
 
-Have a look at the following code snippet:
+#### C
 
 ```c
 double example(double *A, int n) {
   double sum = 0.0;
+
   for (int i = 0; i < n; ++i) {
     sum += A[i];
   }
+
   return sum;
 }
 ```
 
 The loop body has a `scalar reduction` pattern, meaning that each iteration of
-the loop *reduces* its computational result to a single value, in this case
-`sum`.
-
-Thus, two different iterations can potentially update the value of the scalar
-`sum`, which creates a potential race condition that must be handled through
-appropriate synchronization.
+the loop *reduces* its computational result to a single value; in this case,
+`sum`. Thus, any two iterations of the loop executing concurrently can
+potentially update the value of the scalar `sum` at the same time. This creates
+a potential race condition that must be handled through appropriate
+synchronization.
 
 The code snippet below shows an implementation that uses the OpenMP compiler
 directives for multithreading. Note the synchronization added to avoid race
-conditions.
+conditions:
 
 ```c
 double example(double *A, int n) {
   double sum = 0.0;
+
   #pragma omp parallel default(none) shared(A, n, sum)
   {
     #pragma omp for reduction(+: sum) schedule(auto)
     for (int i = 0; i < n; ++i) {
      sum += A[i];
    }
  } // end parallel
+
  return sum;
}
```
 
 >**Note**
 >Executing scalar reduction loops using multithreading incurs a synchronization
->overhead. The example above shows an implementation that uses an efficient
->implementation that balances synchronization and memory overheads, by taking
->advantage of a reduction mechanism typically supported by the APIs for
->multithreading.
+>overhead. The example above shows a code that uses an efficient implementation
+>balancing synchronization and memory overheads, by taking advantage of a
+>reduction mechanism typically supported by the APIs for multithreading. 
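+
+For illustration, the sketch below shows a hypothetical alternative (not part
+of the check's recommended solution) that protects each update with
+`omp atomic` instead of a `reduction` clause. It computes the same result, but
+every update of `sum` is synchronized, so it typically runs much slower:
+
+```c
+// Hypothetical variant, for comparison only: atomic updates instead of a
+// reduction clause.
+double example_atomic(double *A, int n) {  // illustrative name
+  double sum = 0.0;
+
+  #pragma omp parallel for shared(A, n, sum) schedule(auto)
+  for (int i = 0; i < n; ++i) {
+    // Every iteration contends for exclusive access to the shared scalar
+    #pragma omp atomic update
+    sum += A[i];
+  }
+
+  return sum;
+}
+```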
+ +#### Fortran + +```f90 +function example(A) result(sum) + implicit none + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum + integer :: i + + sum = 0.0 + do i = 1, size(A, 1) + sum = sum + A(i) + end do +end function example +``` + +The loop body has a `scalar reduction` pattern, meaning that each iteration of +the loop *reduces* its computational result to a single value; in this case, +`sum`. Thus, any two iterations of the loop executing concurrently can +potentially update the value of the scalar `sum` at the same time. This creates +a potential race condition that must be handled through appropriate +synchronization. + +The code snippet below shows an implementation that uses the OpenMP compiler +directives for multithreading. Note the synchronization added to avoid race +conditions: + +```f90 +function example(A) result(sum) + implicit none + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum + integer :: i + + sum = 0.0 + !$omp parallel do default(none) shared(A) private(i) reduction(+: sum) & + !$omp& schedule(auto) + do i = 1, size(A, 1) + sum = sum + A(i) + end do +end function example +``` + +>**Note** +>Executing scalar reduction loops using multithreading incurs a synchronization +>overhead. The example above shows a code that uses an efficient implementation +>balancing synchronization and memory overheads, by taking advantage of a +>reduction mechanism typically supported by the APIs for multithreading. ### Related resources diff --git a/Checks/PWR051/example-scalar.c b/Checks/PWR051/example-scalar.c index 296e1b1..15d2b23 100644 --- a/Checks/PWR051/example-scalar.c +++ b/Checks/PWR051/example-scalar.c @@ -2,8 +2,10 @@ double example(double *A, int n) { double sum = 0.0; + for (int i = 0; i < n; ++i) { sum += A[i]; } + return sum; } diff --git a/Checks/PWR051/example-scalar.f90 b/Checks/PWR051/example-scalar.f90 index 44de99a..f4a9874 100644 --- a/Checks/PWR051/example-scalar.f90 +++ b/Checks/PWR051/example-scalar.f90 @@ -1,14 +1,13 @@ ! PWR051: Consider applying multithreading parallelism to scalar reduction loop -subroutine example(A, n, sum) +function example(A) result(sum) implicit none - integer, intent(in) :: n - real(kind=8), dimension(1:n), intent(in) :: A - real(kind=8), intent(out) :: sum + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum integer :: i - sum = 0 - do i = 1, n + sum = 0.0 + do i = 1, size(A, 1) sum = sum + A(i) end do -end subroutine example +end function example diff --git a/Checks/PWR052/README.md b/Checks/PWR052/README.md index b0ecaa6..cec1e6d 100644 --- a/Checks/PWR052/README.md +++ b/Checks/PWR052/README.md @@ -31,7 +31,7 @@ multithreading is the biggest challenge to speedup the code. ### Code example -Have a look at the following code snippet: +#### C ```c void example(double *A, int *nodes, int n) { @@ -43,17 +43,18 @@ void example(double *A, int *nodes, int n) { The loop body has a `sparse reduction` pattern, meaning that each iteration of the loop *reduces* its computational result to a value, but the place where the -value is stored is known at runtime only. Thus, two different iterations can -potentially update the same element of the array `A`, which creates a potential -race condition that must be handled through appropriate synchronization. +value is stored is known at runtime only. Thus, any two iterations of the loop +executing concurrently can potentially update the same element of the array `A` +at the same time. This creates a potential race condition that must be handled +through appropriate synchronization. 
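+
+For instance, assuming a hypothetical input such as `nodes = {3, 7, 3}`, the
+iterations `nel = 0` and `nel = 2` both update `A[3]`; if they run on different
+threads, both may read the old value of `A[3]` before either write completes,
+and one of the two contributions is lost.
+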
The code snippet below shows an implementation that uses the OpenMP compiler
 directives for multithreading. Note the synchronization added to avoid race
-conditions.
+conditions:
 
 ```c
 void example(double *A, int *nodes, int n) {
   #pragma omp parallel default(none) shared(A, n, nodes)
   {
     #pragma omp for schedule(auto)
     for (int nel = 0; nel < n; ++nel) {
@@ -72,6 +73,55 @@ void example(double *A, int *nodes, int n) {
 >efficient implementation that balances synchronization and memory overheads
 >must be explored for each particular code.
 
+#### Fortran
+
+```f90
+subroutine example(A, nodes)
+  implicit none
+  real(kind=8), intent(inout) :: A(:)
+  integer, intent(in) :: nodes(:)
+  integer :: nel
+
+  do nel = 1, size(nodes, 1)
+    A(nodes(nel)) = A(nodes(nel)) + (nel * 1)
+  end do
+end subroutine example
+```
+
+The loop body has a `sparse reduction` pattern, meaning that each iteration of
+the loop *reduces* its computational result to a value, but the place where the
+value is stored is known at runtime only. Thus, any two iterations of the loop
+executing concurrently can potentially update the same element of the array `A`
+at the same time. This creates a potential race condition that must be handled
+through appropriate synchronization.
+
+The code snippet below shows an implementation that uses the OpenMP compiler
+directives for multithreading. Note the synchronization added to avoid race
+conditions:
+
+```f90
+subroutine example(A, nodes)
+  implicit none
+  real(kind=8), intent(inout) :: A(:)
+  integer, intent(in) :: nodes(:)
+  integer :: nel
+
+  !$omp parallel do default(none) shared(A, nodes) private(nel) schedule(auto)
+  do nel = 1, size(nodes, 1)
+    !$omp atomic update
+    A(nodes(nel)) = A(nodes(nel)) + (nel * 1)
+  end do
+end subroutine example
+```
+
+>**Note**
+>Executing sparse reduction loops using multithreading incurs a synchronization
+>overhead. The example above shows an implementation that uses atomic
+>protection. Other implementations reduce this high overhead by taking
+>advantage of privatization, which increases the memory requirements of the
+>code. An efficient implementation that balances synchronization and memory
+>overheads must be explored for each particular code.
+
 ### Related resources
 
 * [PWR052 examples](../PWR052)
diff --git a/Checks/PWR052/example-sparse.f90 b/Checks/PWR052/example-sparse.f90
index c0a3dea..7fd1dbd 100644
--- a/Checks/PWR052/example-sparse.f90
+++ b/Checks/PWR052/example-sparse.f90
@@ -1,13 +1,12 @@
 ! PWR052: Consider applying multithreading parallelism to sparse reduction loop
 
-subroutine example(A, nodes, n)
+subroutine example(A, nodes)
   implicit none
-  integer, intent(in) :: n
-  integer, dimension(1:n), intent(in) :: nodes
-  real(kind=8), dimension(1:n), intent(out) :: A
+  real(kind=8), intent(inout) :: A(:)
+  integer, intent(in) :: nodes(:)
   integer :: nel
 
-  do nel = 1, n
+  do nel = 1, size(nodes, 1)
     A(nodes(nel)) = A(nodes(nel)) + (nel * 1)
   end do
 end subroutine example
diff --git a/Checks/PWR053/README.md b/Checks/PWR053/README.md
index ffabdce..9233e2e 100644
--- a/Checks/PWR053/README.md
+++ b/Checks/PWR053/README.md
@@ -31,7 +31,7 @@ the capabilities of the compiler.
 
 ### Code example
 
-Have a look at the following code snippet:
+#### C
 
 ```c
 void example(double *D, double *X, double *Y, int n, double a) {
@@ -47,8 +47,8 @@ independent memory location. Thus, no race conditions can appear at runtime
 related to array `D`, so no specific synchronization is needed.
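+
+Conceptually, vectorization executes several consecutive iterations at once in
+the lanes of a SIMD unit. The strip-mined sketch below (hypothetical; the
+function name `example_strip_mined` is not part of the catalog sources, and
+real compilers emit SIMD instructions rather than this C code) illustrates the
+transformation for a 4-wide vector unit:
+
+```c
+void example_strip_mined(double *D, double *X, double *Y, int n, double a) {
+  int i = 0;
+
+  for (; i + 4 <= n; i += 4) {  // vector body: 4 independent lanes per step
+    D[i]     = a * X[i]     + Y[i];
+    D[i + 1] = a * X[i + 1] + Y[i + 1];
+    D[i + 2] = a * X[i + 2] + Y[i + 2];
+    D[i + 3] = a * X[i + 3] + Y[i + 3];
+  }
+
+  for (; i < n; ++i) {  // scalar remainder loop
+    D[i] = a * X[i] + Y[i];
+  }
+}
+```
+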
The code snippet below shows an implementation that uses the OpenMP compiler -directives to vectorize the loop explicitly. Note the synchronization added to -avoid race conditions. +directives to vectorize the loop explicitly. Note how no synchronization is +required to avoid race conditions: ```c void example(double *D, double *X, double *Y, int n, double a) { @@ -59,6 +59,46 @@ void example(double *D, double *X, double *Y, int n, double a) { } ``` +#### Fortran + +```f90 +subroutine example(D, X, Y, a) + implicit none + real(kind=8), intent(out) :: D(:) + real(kind=8), intent(in) :: X(:), Y(:) + real(kind=8), intent(in) :: a + integer :: i + + do i = 1, size(D, 1) + D(i) = a * X(i) + Y(i) + end do +end subroutine example +``` + +The loop body has a `forall` pattern, meaning that each iteration of the loop +can be executed independently and the result in each iteration is written to an +independent memory location. Thus, no race conditions can appear at runtime +related to array `D`, so no specific synchronization is needed. + +The code snippet below shows an implementation that uses the OpenMP compiler +directives to vectorize the loop explicitly. Note how no synchronization is +required to avoid race conditions: + +```f90 +subroutine example(D, X, Y, a) + implicit none + real(kind=8), intent(out) :: D(:) + real(kind=8), intent(in) :: X(:), Y(:) + real(kind=8), intent(in) :: a + integer :: i + + !$omp simd + do i = 1, size(D, 1) + D(i) = a * X(i) + Y(i) + end do +end subroutine example +``` + ### Related resources * [PWR053 examples](../PWR053) diff --git a/Checks/PWR053/example-forall.f90 b/Checks/PWR053/example-forall.f90 index 148b6ed..d4b9e44 100644 --- a/Checks/PWR053/example-forall.f90 +++ b/Checks/PWR053/example-forall.f90 @@ -1,14 +1,13 @@ ! PWR053: consider applying vectorization to forall loop -subroutine example(D, X, Y, n, a) +subroutine example(D, X, Y, a) implicit none - integer, intent(in) :: n + real(kind=8), intent(out) :: D(:) + real(kind=8), intent(in) :: X(:), Y(:) real(kind=8), intent(in) :: a - real(kind=8), dimension(1:n), intent(in) :: X, Y - real(kind=8), dimension(1:n), intent(out) :: D integer :: i - do i = 1, n + do i = 1, size(D, 1) D(i) = a * X(i) + Y(i) end do end subroutine example diff --git a/Checks/PWR054/README.md b/Checks/PWR054/README.md index 3c77b96..f4daab1 100644 --- a/Checks/PWR054/README.md +++ b/Checks/PWR054/README.md @@ -31,39 +31,86 @@ the capabilities of the compiler. ### Code example -Have a look at the following code snippet: +#### C ```c double example(double *A, int n) { double sum = 0.0; + for (int i = 0; i < n; ++i) { sum += A[i]; } + return sum; } ``` The loop body has a `scalar reduction` pattern, meaning that each iteration of -the loop *reduces* its computational result to a single value, in this case -`sum`. Thus, two different iterations can potentially update the value of the -scalar sum, which creates a potential race condition that must be handled -through appropriate synchronization. +the loop *reduces* its computational result to a single value; in this case, +`sum`. Thus, any two iterations of the loop executing concurrently can +potentially update the value of the scalar `sum` at the same time. This creates +a potential race condition that must be handled through appropriate +synchronization. The code snippet below shows an implementation that uses the OpenMP compiler -directives to vectorize the loop explicitly. Note the synchronization added to -avoid race conditions. +directives to explicitly vectorize the loop. 
Note the synchronization added to +avoid race conditions: ```c double example(double *A, int n) { double sum = 0.0; + #pragma omp simd reduction(+: sum) for (int i = 0; i < n; ++i) { sum += A[i]; } + return sum; } ``` +#### Fortran + +```f90 +function example(A) result(sum) + implicit none + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum + integer :: i + + sum = 0.0 + do i = 1, size(A, 1) + sum = sum + A(i) + end do +end function example +``` + +The loop body has a `scalar reduction` pattern, meaning that each iteration of +the loop *reduces* its computational result to a single value; in this case, +`sum`. Thus, any two iterations of the loop executing concurrently can +potentially update the value of the scalar `sum` at the same time. This creates +a potential race condition that must be handled through appropriate +synchronization. + +The code snippet below shows an implementation that uses the OpenMP compiler +directives to explicitly vectorize the loop. Note the synchronization added to +avoid race conditions: + +```f90 +function example(A) result(sum) + implicit none + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum + integer :: i + + sum = 0.0 + !$omp simd reduction(+: sum) + do i = 1, size(A, 1) + sum = sum + A(i) + end do +end function example +``` + ### Related resources * [PWR054 examples](../PWR054) diff --git a/Checks/PWR054/example-scalar.c b/Checks/PWR054/example-scalar.c index 297b81a..4ae9b4f 100644 --- a/Checks/PWR054/example-scalar.c +++ b/Checks/PWR054/example-scalar.c @@ -2,8 +2,10 @@ double example(double *A, int n) { double sum = 0.0; + for (int i = 0; i < n; ++i) { sum += A[i]; } + return sum; } diff --git a/Checks/PWR054/example-scalar.f90 b/Checks/PWR054/example-scalar.f90 index 63b5ac6..0f610ea 100644 --- a/Checks/PWR054/example-scalar.f90 +++ b/Checks/PWR054/example-scalar.f90 @@ -1,14 +1,13 @@ ! PWR054: consider applying vectorization to scalar reduction loop -subroutine example(A, n, sum) +function example(A) result(sum) implicit none - integer, intent(in) :: n - real(kind=8), dimension(1:n), intent(in) :: A - real(kind=8), intent(out) :: sum + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum integer :: i - sum = 0 - do i = 1, n + sum = 0.0 + do i = 1, size(A, 1) sum = sum + A(i) end do -end subroutine example +end function example diff --git a/Checks/PWR055/README.md b/Checks/PWR055/README.md index 9a7977a..70c8c32 100644 --- a/Checks/PWR055/README.md +++ b/Checks/PWR055/README.md @@ -33,7 +33,7 @@ code using accelerators. ### Code example -Have a look at the following code snippet: +#### C ```c void example(double *D, double *X, double *Y, int n, double a) { @@ -46,12 +46,12 @@ void example(double *D, double *X, double *Y, int n, double a) { The loop body has a `forall` pattern, meaning that each iteration of the loop can be executed independently and the result in each iteration is written to an independent memory location. Thus, no race conditions can appear at runtime -related to array D, so no specific synchronization is needed. +related to array `D`, so no specific synchronization is needed. The code snippet below shows an implementation that uses the OpenACC compiler -directives to offload the loop to an accelerator. Note no synchronization is -required to avoid race conditions and the data transfer clauses that manage the -data movement between the host memory and the accelerator memory. +directives to offload the loop to an accelerator. 
Note how no synchronization +is required to avoid race conditions, while the data transfer clauses manage +the data movement between the host memory and the accelerator memory: ```c void example(double *D, double *X, double *Y, int n, double a) { @@ -64,6 +64,51 @@ void example(double *D, double *X, double *Y, int n, double a) { } ``` +#### Fortran + +```f90 +subroutine example(D, X, Y, a) + implicit none + real(kind=8), intent(out) :: D(:) + real(kind=8), intent(in) :: X(:), Y(:) + real(kind=8), intent(in) :: a + integer :: i + + do i = 1, size(D, 1) + D(i) = a * X(i) + Y(i) + end do +end subroutine example +``` + +The loop body has a `forall` pattern, meaning that each iteration of the loop +can be executed independently and the result in each iteration is written to an +independent memory location. Thus, no race conditions can appear at runtime +related to array `D`, so no specific synchronization is needed. + +The code snippet below shows an implementation that uses the OpenACC compiler +directives to offload the loop to an accelerator. Note how no synchronization +is required to avoid race conditions, while the data transfer clauses manage +the data movement between the host memory and the accelerator memory: + +```f90 +subroutine example(D, X, Y, a) + implicit none + real(kind=8), intent(out) :: D(:) + real(kind=8), intent(in) :: X(:), Y(:) + real(kind=8), intent(in) :: a + integer :: i + + !$acc data copyin(X, Y, a) copyout(D) + !$acc parallel + !$acc loop + do i = 1, size(D, 1) + D(i) = a * X(i) + Y(i) + end do + !$acc end parallel + !$acc end data +end subroutine example +``` + ### Related resources * [PWR055 examples](../PWR055) diff --git a/Checks/PWR055/example-forall.f90 b/Checks/PWR055/example-forall.f90 index 0f1072d..0653347 100644 --- a/Checks/PWR055/example-forall.f90 +++ b/Checks/PWR055/example-forall.f90 @@ -1,14 +1,13 @@ ! PWR055: consider applying offloading parallelism to forall loop -subroutine example(D, X, Y, n, a) +subroutine example(D, X, Y, a) implicit none - integer, intent(in) :: n + real(kind=8), intent(out) :: D(:) + real(kind=8), intent(in) :: X(:), Y(:) real(kind=8), intent(in) :: a - real(kind=8), dimension(1:n), intent(in) :: X, Y - real(kind=8), dimension(1:n), intent(out) :: D integer :: i - do i = 1, n + do i = 1, size(D, 1) D(i) = a * X(i) + Y(i) end do end subroutine example diff --git a/Checks/PWR056/README.md b/Checks/PWR056/README.md index b2a501c..de9894b 100644 --- a/Checks/PWR056/README.md +++ b/Checks/PWR056/README.md @@ -34,42 +34,95 @@ challenge to speedup the code using accelerators**. ### Code example -Have a look at the following code snippet: +#### C ```c double example(double *A, int n) { double sum = 0; + for (int i = 0; i < n; ++i) { sum += A[i]; } + return sum; } ``` -The loop body has a `scalar reduction` pattern, meaning that each iteration of -the loop *reduces* its computational result to a single value, in this case -`sum`. Thus, two different iterations can potentially update the value of the -scalar `sum`, which creates a potential race condition that must be handled -through appropriate synchronization. +The loop body has a `scalar reduction` pattern, meaning that each iteration of +the loop *reduces* its computational result to a single value; in this case, +`sum`. Thus, any two iterations of the loop executing concurrently can +potentially update the value of the scalar `sum` at the same time. This creates +a potential race condition that must be handled through appropriate +synchronization. 
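+
+The underlying problem is that `sum += A[i]` is a read-modify-write sequence.
+The sketch below is sequential code that replays one possible interleaving of
+two unsynchronized threads (hypothetical, for illustration only; it is not
+part of the catalog sources) and shows how an update is lost:
+
+```c
+#include <stdio.h>
+
+int main(void) {
+  double A[2] = {1.0, 2.0};
+  double sum = 0.0;
+
+  double t0 = sum;  // thread 0 reads sum (0.0)
+  double t1 = sum;  // thread 1 reads sum (0.0), before thread 0 writes back
+  sum = t0 + A[0];  // thread 0 writes 1.0
+  sum = t1 + A[1];  // thread 1 writes 2.0: the contribution of A[0] is lost
+
+  printf("sum = %f (expected 3.0)\n", sum);  // prints 2.0
+  return 0;
+}
+```
+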
The code snippet below shows an implementation that uses the OpenACC compiler -directives to offload the loop to an accelerator. Note the synchronization added -to avoid race conditions and the data transfer clauses that manage the data -movement between the host memory and the accelerator memory. +directives to offload the loop to an accelerator. Note the synchronization +added to avoid race conditions, while the data transfer clauses manage the data +movement between the host memory and the accelerator memory: ```c double example(double *A, int n) { double sum = 0; + #pragma acc data copyin(A[0:n], n) copy(sum) #pragma acc parallel #pragma acc loop reduction(+: sum) for (int i = 0; i < n; ++i) { sum += A[i]; } + return sum; } ``` +#### Fortran + +```f90 +function example(A) result(sum) + implicit none + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum + integer :: i + + sum = 0.0 + do i = 1, size(A, 1) + sum = sum + A(i) + end do +end function example +``` + +The loop body has a `scalar reduction` pattern, meaning that each iteration of +the loop *reduces* its computational result to a single value; in this case, +`sum`. Thus, any two iterations of the loop executing concurrently can +potentially update the value of the scalar `sum` at the same time. This creates +a potential race condition that must be handled through appropriate +synchronization. + +The code snippet below shows an implementation that uses the OpenACC compiler +directives to offload the loop to an accelerator. Note the synchronization +added to avoid race conditions, while the data transfer clauses manage the data +movement between the host memory and the accelerator memory: + +```f90 +function example(A) result(sum) + implicit none + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum + integer :: i + + sum = 0.0 + + !$acc data copyin(A) copy(sum) + !$acc parallel + !$acc loop reduction(+: sum) + do i = 1, size(A, 1) + sum = sum + A(i) + end do + !$acc end parallel + !$acc end data +end function example +``` + ### Related resources * [PWR056 examples](../PWR056) diff --git a/Checks/PWR056/example-scalar.c b/Checks/PWR056/example-scalar.c index cbef396..c61207e 100644 --- a/Checks/PWR056/example-scalar.c +++ b/Checks/PWR056/example-scalar.c @@ -2,8 +2,10 @@ double example(double *A, int n) { double sum = 0.0; + for (int i = 0; i < n; ++i) { sum += A[i]; } + return sum; } diff --git a/Checks/PWR056/example-scalar.f90 b/Checks/PWR056/example-scalar.f90 index 1541d7c..af75132 100644 --- a/Checks/PWR056/example-scalar.f90 +++ b/Checks/PWR056/example-scalar.f90 @@ -1,14 +1,13 @@ ! PWR056: consider applying offloading parallelism to scalar reduction loop -subroutine example(A, n, sum) +function example(A) result(sum) implicit none - integer, intent(in) :: n - real(kind=8), dimension(1:n), intent(in) :: A - real(kind=8), intent(out) :: sum + real(kind=8), intent(in) :: A(:) + real(kind=8) :: sum integer :: i - sum = 0 - do i = 1, n + sum = 0.0 + do i = 1, size(A, 1) sum = sum + A(i) end do -end subroutine example +end function example diff --git a/Checks/PWR057/README.md b/Checks/PWR057/README.md index 4009a16..8fcafe9 100644 --- a/Checks/PWR057/README.md +++ b/Checks/PWR057/README.md @@ -36,7 +36,7 @@ code using accelerators. 
### Code example -Have a look at the following code snippet: +#### C ```c void example(double *A, int *nodes, int n) { @@ -46,16 +46,17 @@ void example(double *A, int *nodes, int n) { } ``` -The loop body has a `sparse reduction` pattern, meaning that each iteration of +The loop body has a `sparse reduction` pattern, meaning that each iteration of the loop *reduces* its computational result to a value, but the place where the -value is stored is known at runtime only. Thus, two different iterations can -potentially update the same element of the array `A`, which creates a potential -race condition that must be handled through appropriate synchronization. +value is stored is known at runtime only. Thus, any two iterations of the loop +executing concurrently can potentially update the same element of the array `A` +at the same time. This creates a potential race condition that must be handled +through appropriate synchronization. The code snippet below shows an implementation that uses the OpenACC compiler -directives to offload the loop to an accelerator. Note the synchronization added -to avoid race conditions and the data transfer clauses that manage the data -movement between the host memory and the accelerator memory. +directives to offload the loop to an accelerator. Note the synchronization +added to avoid race conditions, while the data transfer clauses manage the data +movement between the host memory and the accelerator memory: ```c void example(double *A, int *nodes, int n) { @@ -69,6 +70,52 @@ void example(double *A, int *nodes, int n) { } ``` +#### Fortran + +```f90 +subroutine example(A, nodes) + implicit none + real(kind=8), intent(inout) :: A(:) + integer, intent(in) :: nodes(:) + integer :: nel + + do nel = 1, size(nodes, 1) + A(nodes(nel)) = A(nodes(nel)) + (nel * 1) + end do +end subroutine example +``` + +The loop body has a `sparse reduction` pattern, meaning that each iteration of +the loop *reduces* its computational result to a value, but the place where the +value is stored is known at runtime only. Thus, any two iterations of the loop +executing concurrently can potentially update the same element of the array `A` +at the same time. This creates a potential race condition that must be handled +through appropriate synchronization. + +The code snippet below shows an implementation that uses the OpenACC compiler +directives to offload the loop to an accelerator. Note the synchronization +added to avoid race conditions, while the data transfer clauses manage the data +movement between the host memory and the accelerator memory: + +```f90 +subroutine example(A, nodes) + implicit none + real(kind=8), intent(inout) :: A(:) + integer, intent(in) :: nodes(:) + integer :: nel + + !$acc data copyin(nodes) copy(A) + !$acc parallel + !$acc loop + do nel = 1, size(nodes, 1) + !$acc atomic update + A(nodes(nel)) = A(nodes(nel)) + (nel * 1) + end do + !$acc end parallel + !$acc end data +end subroutine example +``` + ### Related resources * [PWR057 examples](../PWR057) diff --git a/Checks/PWR057/example-sparse.f90 b/Checks/PWR057/example-sparse.f90 index cbf3e42..0fec2e9 100644 --- a/Checks/PWR057/example-sparse.f90 +++ b/Checks/PWR057/example-sparse.f90 @@ -1,13 +1,12 @@ ! 
PWR057: consider applying offloading parallelism to sparse reduction loop -subroutine example(A, nodes, n) +subroutine example(A, nodes) implicit none - integer, intent(in) :: n - integer, dimension(1:n), intent(in) :: nodes - real(kind=8), dimension(1:n), intent(out) :: A + real(kind=8), intent(inout) :: A(:) + integer, intent(in) :: nodes(:) integer :: nel - do nel = 1, n + do nel = 1, size(nodes, 1) A(nodes(nel)) = A(nodes(nel)) + (nel * 1) end do end subroutine example