User understanding, compiler or driver problem? #3950

kevinboulain · 2024-04-15T09:02:13Z

This question might not be strictly related to Slang, but I'm honestly not sure where the problem lies, so I figured I could open a discussion: if I'm not wrong, it might be of interest to Slang that something is broken in some configurations. Apologies if I'm being too spammy. I asked a similar question (focused on semantics) on the Vulkan Discord a few days ago but haven't gotten any answer.

I have this minimal repro:

import glsl;

struct Constants {
  uint *device_memory;
}
[[vk::push_constant]] Constants push_constants;

groupshared uint shared_memory[1];

[shader("compute")] [numthreads(1, 1, 1)] void main(void) {
  shared_memory[0] = 0xffffffff;

  atomicMin(shared_memory[0], 21);

  push_constants.device_memory[0] = shared_memory[0];
}

It's compiled like this (v2024.1.7):

slangc compute.slang -target spirv -o compute.spv -emit-spirv-directly -force-glsl-scalar-layout -fvk-use-entrypoint-name -matrix-layout-column-major

And the resulting SPIR-V is:

; SPIR-V
; Version: 1.5
; Generator: Khronos; 40
; Bound: 28
; Schema: 0
               OpCapability VariablePointers
               OpCapability PhysicalStorageBufferAddresses
               OpCapability Shader
               OpExtension "SPV_KHR_variable_pointers"
               OpExtension "SPV_KHR_physical_storage_buffer"
               OpMemoryModel PhysicalStorageBuffer64 GLSL450
               OpEntryPoint GLCompute %main "main" %shared_memory %push_constants
               OpExecutionMode %main LocalSize 1 1 1
               OpSource Slang 1
               OpName %shared_memory "shared_memory"
               OpName %shared_memory "shared_memory"
               OpName %Constants_natural "Constants_natural"
               OpMemberName %Constants_natural 0 "device_memory"
               OpName %push_constants "push_constants"
               OpName %push_constants "push_constants"
               OpName %main "main"
               OpDecorate %_arr_uint_int_1 ArrayStride 4
               OpDecorate %_ptr_PhysicalStorageBuffer_uint ArrayStride 4
               OpDecorate %Constants_natural Block
               OpMemberDecorate %Constants_natural 0 Offset 0
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
        %int = OpTypeInt 32 1
      %int_1 = OpConstant %int 1
%_arr_uint_int_1 = OpTypeArray %uint %int_1
%_ptr_Workgroup__arr_uint_int_1 = OpTypePointer Workgroup %_arr_uint_int_1
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
      %int_0 = OpConstant %int 0
%uint_4294967295 = OpConstant %uint 4294967295
     %uint_1 = OpConstant %uint 1
    %uint_64 = OpConstant %uint 64
    %uint_21 = OpConstant %uint 21
%_ptr_PhysicalStorageBuffer_uint = OpTypePointer PhysicalStorageBuffer %uint
%Constants_natural = OpTypeStruct %_ptr_PhysicalStorageBuffer_uint
%_ptr_PushConstant_Constants_natural = OpTypePointer PushConstant %Constants_natural
%_ptr_PushConstant__ptr_PhysicalStorageBuffer_uint = OpTypePointer PushConstant %_ptr_PhysicalStorageBuffer_uint
%shared_memory = OpVariable %_ptr_Workgroup__arr_uint_int_1 Workgroup
%push_constants = OpVariable %_ptr_PushConstant_Constants_natural PushConstant
       %main = OpFunction %void None %3
          %4 = OpLabel
         %12 = OpAccessChain %_ptr_Workgroup_uint %shared_memory %int_0
               OpStore %12 %uint_4294967295
         %15 = OpAtomicUMin %uint %12 %uint_1 %uint_64 %uint_21
         %24 = OpAccessChain %_ptr_PushConstant__ptr_PhysicalStorageBuffer_uint %push_constants %int_0
         %25 = OpLoad %_ptr_PhysicalStorageBuffer_uint %24
         %26 = OpPtrAccessChain %_ptr_PhysicalStorageBuffer_uint %25 %int_0
         %27 = OpLoad %uint %12
               OpStore %26 %27 Aligned 4
               OpReturn
               OpFunctionEnd

It's dispatched as a single workgroup, so a single invocation performs the store, the atomic operation and the load (my understanding is that mixing these operations is okay in this particular context).
On Intel, device_memory[0] ends up as 21. On Nvidia, it's 0xffffffff, so the atomicMin is somehow lost.
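
As a point of reference, here is a minimal host-side model of what the single invocation is expected to compute. It is a sketch in Python, not GPU code: `atomic_umin` and `run_invocation` are hypothetical names, and the model only assumes that OpAtomicUMin stores the unsigned minimum and returns the original value.

```python
UINT32_MASK = 0xFFFFFFFF

def atomic_umin(memory, index, value):
    """Model OpAtomicUMin: store the unsigned minimum, return the original value."""
    original = memory[index]
    memory[index] = min(original, value & UINT32_MASK)
    return original

def run_invocation():
    shared_memory = [0]              # groupshared uint shared_memory[1]
    device_memory = [0]              # stands in for push_constants.device_memory
    shared_memory[0] = 0xFFFFFFFF    # plain store
    atomic_umin(shared_memory, 0, 21)
    device_memory[0] = shared_memory[0]  # plain load, then store to the buffer
    return device_memory[0]

print(run_invocation())  # 21, the value observed on Intel
```

With a single invocation there is no concurrency to reason about, so 21 is the only result this model can produce.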

As a comparison point, I tried out GLSL:

#version 450
#extension GL_EXT_buffer_reference : enable
#extension GL_EXT_scalar_block_layout : enable
#extension GL_EXT_shader_explicit_arithmetic_types_int64 : enable
#extension GL_EXT_control_flow_attributes : enable

const uint workgroup_size = 1;
layout(local_size_x = 1) in;

layout(buffer_reference, scalar) buffer uintBuffer {
  uint array[];
};

layout(push_constant, scalar) uniform PushConstants {
  uintBuffer device_memory;
} push_constants;

shared uint shared_memory[1];

void main(void) {
  shared_memory[0] = 0xffffffff;

  atomicMin(shared_memory[0], 21);

  push_constants.device_memory.array[0] = shared_memory[0];
}

Compiled like this:

glslc --target-spv=spv1.5 -fshader-stage=compute compute.glsl -o compute.spv

And the resulting SPIR-V is:

; SPIR-V
; Version: 1.5
; Generator: Google Shaderc over Glslang; 11
; Bound: 35
; Schema: 0
               OpCapability Shader
               OpCapability PhysicalStorageBufferAddresses
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel PhysicalStorageBuffer64 GLSL450
               OpEntryPoint GLCompute %main "main" %shared_memory %push_constants
               OpExecutionMode %main LocalSize 1 1 1
               OpSource GLSL 450
               OpSourceExtension "GL_EXT_buffer_reference"
               OpSourceExtension "GL_EXT_control_flow_attributes"
               OpSourceExtension "GL_EXT_scalar_block_layout"
               OpSourceExtension "GL_EXT_shader_explicit_arithmetic_types_int64"
               OpSourceExtension "GL_GOOGLE_cpp_style_line_directive"
               OpSourceExtension "GL_GOOGLE_include_directive"
               OpName %main "main"
               OpName %shared_memory "shared_memory"
               OpName %PushConstants "PushConstants"
               OpMemberName %PushConstants 0 "device_memory"
               OpName %uintBuffer "uintBuffer"
               OpMemberName %uintBuffer 0 "array"
               OpName %push_constants "push_constants"
               OpMemberDecorate %PushConstants 0 Offset 0
               OpDecorate %PushConstants Block
               OpDecorate %_runtimearr_uint ArrayStride 4
               OpMemberDecorate %uintBuffer 0 Offset 0
               OpDecorate %uintBuffer Block
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
%_arr_uint_uint_1 = OpTypeArray %uint %uint_1
%_ptr_Workgroup__arr_uint_uint_1 = OpTypePointer Workgroup %_arr_uint_uint_1
%shared_memory = OpVariable %_ptr_Workgroup__arr_uint_uint_1 Workgroup
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
%uint_4294967295 = OpConstant %uint 4294967295
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
    %uint_21 = OpConstant %uint 21
     %uint_0 = OpConstant %uint 0
               OpTypeForwardPointer %_ptr_PhysicalStorageBuffer_uintBuffer PhysicalStorageBuffer
%PushConstants = OpTypeStruct %_ptr_PhysicalStorageBuffer_uintBuffer
%_runtimearr_uint = OpTypeRuntimeArray %uint
 %uintBuffer = OpTypeStruct %_runtimearr_uint
%_ptr_PhysicalStorageBuffer_uintBuffer = OpTypePointer PhysicalStorageBuffer %uintBuffer
%_ptr_PushConstant_PushConstants = OpTypePointer PushConstant %PushConstants
%push_constants = OpVariable %_ptr_PushConstant_PushConstants PushConstant
%_ptr_PushConstant__ptr_PhysicalStorageBuffer_uintBuffer = OpTypePointer PushConstant %_ptr_PhysicalStorageBuffer_uintBuffer
%_ptr_PhysicalStorageBuffer_uint = OpTypePointer PhysicalStorageBuffer %uint
     %v3uint = OpTypeVector %uint 3
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
         %15 = OpAccessChain %_ptr_Workgroup_uint %shared_memory %int_0
               OpStore %15 %uint_4294967295
         %16 = OpAccessChain %_ptr_Workgroup_uint %shared_memory %int_0
         %19 = OpAtomicUMin %uint %16 %uint_1 %uint_0 %uint_21
         %27 = OpAccessChain %_ptr_PushConstant__ptr_PhysicalStorageBuffer_uintBuffer %push_constants %int_0
         %28 = OpLoad %_ptr_PhysicalStorageBuffer_uintBuffer %27
         %29 = OpAccessChain %_ptr_Workgroup_uint %shared_memory %int_0
         %30 = OpLoad %uint %29
         %32 = OpAccessChain %_ptr_PhysicalStorageBuffer_uint %28 %int_0 %int_0
               OpStore %32 %30 Aligned 4
               OpReturn
               OpFunctionEnd

It produces the expected result on both Nvidia and Intel.

I spotted two differences:

- The atomic semantics:
  - for Slang, that's device scope, uniform memory and relaxed,
  - for GLSL, that's device scope and relaxed.
- Pointers vs buffer references.
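
The first difference can be read straight off the scope and semantics operands of the two OpAtomicUMin instructions (`%uint_1 %uint_64` for Slang, `%uint_1 %uint_0` for GLSL). The sketch below is a host-side Python helper, not part of Slang or glslang; `decode_scope` and `decode_semantics` are hypothetical names, and the numeric values are the ones from the SPIR-V specification's Scope and Memory Semantics tables.

```python
SCOPES = {0: "CrossDevice", 1: "Device", 2: "Workgroup", 3: "Subgroup", 4: "Invocation"}

SEMANTICS_BITS = {
    0x2: "Acquire",
    0x4: "Release",
    0x8: "AcquireRelease",
    0x10: "SequentiallyConsistent",
    0x40: "UniformMemory",
    0x80: "SubgroupMemory",
    0x100: "WorkgroupMemory",
    0x200: "CrossWorkgroupMemory",
    0x400: "AtomicCounterMemory",
    0x800: "ImageMemory",
}

def decode_scope(value):
    """Name a SPIR-V Scope operand."""
    return SCOPES.get(value, f"unknown({value})")

def decode_semantics(mask):
    """List the bits set in a SPIR-V Memory Semantics operand (0 means relaxed)."""
    names = [name for bit, name in SEMANTICS_BITS.items() if mask & bit]
    unknown = mask & ~sum(SEMANTICS_BITS)  # bits this table does not cover
    if unknown:
        names.append(hex(unknown))
    return names or ["Relaxed"]

print(decode_scope(1), decode_semantics(64))  # Slang: Device ['UniformMemory']
print(decode_scope(1), decode_semantics(0))   # GLSL:  Device ['Relaxed']
```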

The atomic operation can easily be made the same, just in case:

// https://github.com/KhronosGroup/GLSL/blob/main/extensions/khr/GL_KHR_memory_scope_semantics.txt
public enum Scope: uint {
  Device = 1,
  Workgroup = 2,
}

[Flags] public enum MemorySemantics: uint {
  None = 0x0,
  Acquire = 0x2,
  Release = 0x4,
  AcquireRelease = 0x8,
  UniformMemory = 0x40,
  WorkgroupMemory = 0x100,
}

__glsl_extension(GL_KHR_memory_scope_semantics)
[ForceInline] [require(spirv)] public uint atomic_min<let scope: Scope, let semantics: MemorySemantics>(__ref uint memory, const uint value) {
  return spirv_asm {
    OpAtomicUMin $$uint result &memory $scope $semantics $value;
  };
}

struct Constants {
  uint *device_memory;
}
[[vk::push_constant]] Constants push_constants;

groupshared uint shared_memory[1];

[shader("compute")] [numthreads(1, 1, 1)] void main(void) {
  shared_memory[0] = 0xffffffff;

  // The built-in doesn't compile to the same instruction as GLSL:
  // atomicMin(shared_memory[0], 21);
  atomic_min<Scope::Device, MemorySemantics::None>(shared_memory[0], 21);

  push_constants.device_memory[0] = shared_memory[0];
}

And the SPIR-V:

; SPIR-V
; Version: 1.5
; Generator: Khronos; 40
; Bound: 28
; Schema: 0
               OpCapability VariablePointers
               OpCapability PhysicalStorageBufferAddresses
               OpCapability Shader
               OpExtension "SPV_KHR_variable_pointers"
               OpExtension "SPV_KHR_physical_storage_buffer"
               OpMemoryModel PhysicalStorageBuffer64 GLSL450
               OpEntryPoint GLCompute %main "main" %shared_memory %push_constants
               OpExecutionMode %main LocalSize 1 1 1
               OpSource Slang 1
               OpName %shared_memory "shared_memory"
               OpName %shared_memory "shared_memory"
               OpName %Constants_natural "Constants_natural"
               OpMemberName %Constants_natural 0 "device_memory"
               OpName %push_constants "push_constants"
               OpName %push_constants "push_constants"
               OpName %main "main"
               OpDecorate %_arr_uint_int_1 ArrayStride 4
               OpDecorate %_ptr_PhysicalStorageBuffer_uint ArrayStride 4
               OpDecorate %Constants_natural Block
               OpMemberDecorate %Constants_natural 0 Offset 0
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
        %int = OpTypeInt 32 1
      %int_1 = OpConstant %int 1
%_arr_uint_int_1 = OpTypeArray %uint %int_1
%_ptr_Workgroup__arr_uint_int_1 = OpTypePointer Workgroup %_arr_uint_int_1
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
      %int_0 = OpConstant %int 0
%uint_4294967295 = OpConstant %uint 4294967295
     %uint_1 = OpConstant %uint 1
     %uint_0 = OpConstant %uint 0
    %uint_21 = OpConstant %uint 21
%_ptr_PhysicalStorageBuffer_uint = OpTypePointer PhysicalStorageBuffer %uint
%Constants_natural = OpTypeStruct %_ptr_PhysicalStorageBuffer_uint
%_ptr_PushConstant_Constants_natural = OpTypePointer PushConstant %Constants_natural
%_ptr_PushConstant__ptr_PhysicalStorageBuffer_uint = OpTypePointer PushConstant %_ptr_PhysicalStorageBuffer_uint
%shared_memory = OpVariable %_ptr_Workgroup__arr_uint_int_1 Workgroup
%push_constants = OpVariable %_ptr_PushConstant_Constants_natural PushConstant
       %main = OpFunction %void None %3
          %4 = OpLabel
         %12 = OpAccessChain %_ptr_Workgroup_uint %shared_memory %int_0
               OpStore %12 %uint_4294967295
         %15 = OpAtomicUMin %uint %12 %uint_1 %uint_0 %uint_21
         %24 = OpAccessChain %_ptr_PushConstant__ptr_PhysicalStorageBuffer_uint %push_constants %int_0
         %25 = OpLoad %_ptr_PhysicalStorageBuffer_uint %24
         %26 = OpPtrAccessChain %_ptr_PhysicalStorageBuffer_uint %25 %int_0
         %27 = OpLoad %uint %12
               OpStore %26 %27 Aligned 4
               OpReturn
               OpFunctionEnd

Unfortunately, same results (the Nvidia graphics cards seem to ignore the atomicMin).

Now, I'm surprised that the device scope is used, but it shouldn't hurt, right? That's just requesting a larger scope than necessary. Still, I'm gonna change it to workgroup:

// https://github.com/KhronosGroup/GLSL/blob/main/extensions/khr/GL_KHR_memory_scope_semantics.txt
public enum Scope: uint {
  Device = 1,
  Workgroup = 2,
}

[Flags] public enum MemorySemantics: uint {
  None = 0x0,
  Acquire = 0x2,
  Release = 0x4,
  AcquireRelease = 0x8,
  UniformMemory = 0x40,
  WorkgroupMemory = 0x100,
}

__glsl_extension(GL_KHR_memory_scope_semantics)
[ForceInline] [require(spirv)] public uint atomic_min<let scope: Scope, let semantics: MemorySemantics>(__ref uint memory, const uint value) {
  return spirv_asm {
    OpAtomicUMin $$uint result &memory $scope $semantics $value;
  };
}

struct Constants {
  uint *device_memory;
}
[[vk::push_constant]] Constants push_constants;

groupshared uint shared_memory[1];

[shader("compute")] [numthreads(1, 1, 1)] void main(void) {
  shared_memory[0] = 0xffffffff;

  atomic_min<Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::None>(shared_memory[0], 21);

  push_constants.device_memory[0] = shared_memory[0];
}

GLSL:

#version 450
#extension GL_EXT_buffer_reference : enable
#extension GL_EXT_scalar_block_layout : enable
#extension GL_EXT_shader_explicit_arithmetic_types_int64 : enable
#extension GL_EXT_control_flow_attributes : enable
#extension GL_KHR_memory_scope_semantics : enable

const uint workgroup_size = 1;
layout(local_size_x = 1) in;

layout(buffer_reference, scalar) buffer uintBuffer {
  uint array[];
};

layout(push_constant, scalar) uniform PushConstants {
  uintBuffer device_memory;
} push_constants;

shared uint shared_memory[1];

void main(void) {
  shared_memory[0] = 0xffffffff;

  atomicMin(shared_memory[0], 21, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);

  push_constants.device_memory.array[0] = shared_memory[0];
}

Same results.

In the real shader, I would need to synchronize the invocations in this way:

#version 450
#extension GL_EXT_buffer_reference : enable
#extension GL_EXT_scalar_block_layout : enable
#extension GL_EXT_shader_explicit_arithmetic_types_int64 : enable
#extension GL_EXT_control_flow_attributes : enable
#extension GL_KHR_memory_scope_semantics : enable

const uint workgroup_size = 1;
layout(local_size_x = 1) in;

layout(buffer_reference, scalar) buffer uintBuffer {
  uint array[];
};

layout(push_constant, scalar) uniform PushConstants {
  uintBuffer device_memory;
} push_constants;

shared uint shared_memory[1];

void main(void) {
  // Each invocation initializes a chunk of the shared memory (no conflict).
  shared_memory[0] = 0xffffffff;

  barrier(); // Synchronizes execution and shared memory.

  // Multiple invocations might write to the same shared memory index, hence the atomic operation (conflict).
  atomicMin(shared_memory[0], 21, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);

  barrier(); // Synchronizes execution and shared memory.

  // Each invocation copies a chunk of the shared memory to the device memory (no conflict).
  push_constants.device_memory.array[0] = shared_memory[0];
}

The corresponding Slang:

// https://github.com/KhronosGroup/GLSL/blob/main/extensions/khr/GL_KHR_memory_scope_semantics.txt
public enum Scope: uint {
  Device = 1,
  Workgroup = 2,
}

[Flags] public enum MemorySemantics: uint {
  None = 0x0,
  Acquire = 0x2,
  Release = 0x4,
  AcquireRelease = 0x8,
  UniformMemory = 0x40,
  WorkgroupMemory = 0x100,
}

[ForceInline] [require(spirv)] public void barrier<let execution: Scope, let memory: Scope, let semantics: MemorySemantics>() {
  spirv_asm {
    OpControlBarrier $execution $memory $semantics;
  };
}

__glsl_extension(GL_KHR_memory_scope_semantics)
[ForceInline] [require(spirv)] public uint atomic_min<let scope: Scope, let semantics: MemorySemantics>(__ref uint memory, const uint value) {
  return spirv_asm {
    OpAtomicUMin $$uint result &memory $scope $semantics $value;
  };
}

struct Constants {
  uint *device_memory;
}
[[vk::push_constant]] Constants push_constants;

groupshared uint shared_memory[1];

[shader("compute")] [numthreads(1, 1, 1)] void main(void) {
  shared_memory[0] = 0xffffffff;

  barrier<Scope::Workgroup, Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::AcquireRelease>();

  atomic_min<Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::None>(shared_memory[0], 21);

  barrier<Scope::Workgroup, Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::AcquireRelease>();

  push_constants.device_memory[0] = shared_memory[0];
}

Same results...
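
When comparing disassemblies, it helps to know which integer operands these template arguments turn into. The following is a small Python sketch that mirrors the two enums from the Slang snippets above; `control_barrier_operands` and `atomic_operands` are hypothetical helper names, and the assumption is simply that the enum values are emitted unchanged as SPIR-V constants.

```python
from enum import IntEnum, IntFlag

class Scope(IntEnum):
    DEVICE = 1
    WORKGROUP = 2

class MemorySemantics(IntFlag):
    NONE = 0x0
    ACQUIRE = 0x2
    RELEASE = 0x4
    ACQUIRE_RELEASE = 0x8
    UNIFORM_MEMORY = 0x40
    WORKGROUP_MEMORY = 0x100

def control_barrier_operands(execution, memory, semantics):
    """Operands of OpControlBarrier: execution scope, memory scope, semantics."""
    return (int(execution), int(memory), int(semantics))

def atomic_operands(scope, semantics):
    """Scope and semantics operands of an atomic instruction."""
    return (int(scope), int(semantics))

barrier = control_barrier_operands(
    Scope.WORKGROUP,
    Scope.WORKGROUP,
    MemorySemantics.WORKGROUP_MEMORY | MemorySemantics.ACQUIRE_RELEASE,
)
atomic = atomic_operands(
    Scope.WORKGROUP,
    MemorySemantics.WORKGROUP_MEMORY | MemorySemantics.NONE,
)
print(barrier)  # (2, 2, 264)
print(atomic)   # (2, 256)
```

So the barrier should show up with a semantics constant of 264 (0x108) and the atomic with 256 (0x100).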

Finally, I resorted to trying this:

// https://github.com/KhronosGroup/GLSL/blob/main/extensions/khr/GL_KHR_memory_scope_semantics.txt
public enum Scope: uint {
  Device = 1,
  Workgroup = 2,
}

[Flags] public enum MemorySemantics: uint {
  None = 0x0,
  Acquire = 0x2,
  Release = 0x4,
  AcquireRelease = 0x8,
  UniformMemory = 0x40,
  WorkgroupMemory = 0x100,
}

[ForceInline] [require(spirv)] public void barrier<let execution: Scope, let memory: Scope, let semantics: MemorySemantics>() {
  spirv_asm {
    OpControlBarrier $execution $memory $semantics;
  };
}

__glsl_extension(GL_KHR_memory_scope_semantics)
[ForceInline] [require(spirv)] public void atomic_store<let scope: Scope, let semantics: MemorySemantics>(__ref uint memory, const uint value) {
  spirv_asm {
    OpAtomicStore &memory $scope $semantics $value;
  };
}

__glsl_extension(GL_KHR_memory_scope_semantics)
[ForceInline] [require(spirv)] public uint atomic_load<let scope: Scope, let semantics: MemorySemantics>(__ref uint memory) {
  return spirv_asm {
    OpAtomicLoad $$uint result &memory $scope $semantics;
  };
}

__glsl_extension(GL_KHR_memory_scope_semantics)
[ForceInline] [require(spirv)] public uint atomic_min<let scope: Scope, let semantics: MemorySemantics>(__ref uint memory, const uint value) {
  return spirv_asm {
    OpAtomicUMin $$uint result &memory $scope $semantics $value;
  };
}

struct Constants {
  uint *device_memory;
}
[[vk::push_constant]] Constants push_constants;

groupshared uint shared_memory[1];

[shader("compute")] [numthreads(1, 1, 1)] void main(void) {
  atomic_store<Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::None>(shared_memory[0], 0xffffffff);

  barrier<Scope::Workgroup, Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::AcquireRelease>();

  atomic_min<Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::None>(shared_memory[0], 21);

  barrier<Scope::Workgroup, Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::AcquireRelease>();

  push_constants.device_memory[0] = atomic_load<Scope::Workgroup, MemorySemantics::WorkgroupMemory | MemorySemantics::None>(shared_memory[0]);
}

GLSL:

#version 450
#extension GL_EXT_buffer_reference : enable
#extension GL_EXT_scalar_block_layout : enable
#extension GL_EXT_shader_explicit_arithmetic_types_int64 : enable
#extension GL_EXT_control_flow_attributes : enable
#extension GL_KHR_memory_scope_semantics : enable

const uint workgroup_size = 1;
layout(local_size_x = 1) in;

layout(buffer_reference, scalar) buffer uintBuffer {
  uint array[];
};

layout(push_constant, scalar) uniform PushConstants {
  uintBuffer device_memory;
} push_constants;

shared uint shared_memory[1];

void main(void) {
  atomicStore(shared_memory[0], 0xffffffffu, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);

  barrier();

  atomicMin(shared_memory[0], 21, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);

  barrier();

  push_constants.device_memory.array[0] = atomicLoad(shared_memory[0], gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsRelaxed);
}

The atomicMin is still ignored on Nvidia for the Slang shader only.

So I have a few questions now:

- Is the instruction emitted for atomicMin in the glsl.meta.slang module different on purpose? Is it generally safe to sprinkle SPIR-V assembly in a Slang program, especially for sensitive operations like this (e.g. risk of reordering)?
- Is my understanding of the memory model flawed? Does any of this make sense without opting into the Vulkan memory model?
- The only remaining difference appears to be the use of pointers vs buffer references... Does that ring any bells? That can't be a driver bug, right?

jkwak-work · 2024-04-16T13:54:02Z

Maintainer

I was not able to reproduce the problem on my RTX-3090.
I used the following test:

//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -output-using-type
//TEST(compute, vulkan):COMPARE_COMPUTE(filecheck-buffer=BUF):-vk -compute -entry computeMain -allow-glsl -emit-spirv-directly  -output-using-type

//TEST_INPUT:ubuffer(data=[0], stride=4):out,name outputBuffer
RWStructuredBuffer<float> outputBuffer;

groupshared uint shared_memory[1];

[shader("compute")]
[numthreads(1, 1, 1)]
void computeMain(void)
{
  shared_memory[0] = 0xffffffff;

  atomicMin(shared_memory[0], 21);

  //BUF: 21
  outputBuffer[0] = float(shared_memory[0]);
}

I saved it as "tests/d3950.slang" and ran it with the following command:

bin/windows-x64/release/slang-test.exe tests/d3950.slang

The test passes:

Supported backends: fxc dxc glslang spirv-dis visualstudio genericcpp nvrtc llvm spirv-opt
Check gl,ogl,opengl: Supported
Check vk,vulkan: Supported
Check dx12,d3d12: Supported
Check dx11,d3d11: Supported
Check cuda: Not Supported
passed test: 'tests/d3950.slang (vk)'
passed test: 'tests/d3950.slang.1 (vk)'

===
100% of tests passed (2/2)
===

What is your graphics driver version?
Can you retry with the latest graphics driver just in case?

kevinboulain · 2024-04-16T16:15:25Z

Author

Thanks for taking a look. With your test, the issue doesn't happen, though it's not exactly the same: it doesn't use push constants for example. I haven't checked yet if I can easily make it the same with the test framework.

I've extracted a 'minimal' repro here (not sure you're gonna like it though, it's in Rust and I never tried to run it on Windows since I don't have a physical instance...): https://gist.github.com/kevinboulain/282b6a5d24571ddd97324527254e27d7

> What is your graphics driver version?

The latest available on Linux: 550.67. The graphics card model, driver, kernel are in the readme.md of the repository linked above.

jkwak-work · 2024-04-16T21:01:02Z

Maintainer

Does this look like the same issue as the one reported in 3946?

kevinboulain · 2024-04-16T21:30:19Z

Author

I don't think so. You can see the examples I've posted always use a struct for the push constants (not sure how that's mapped to SPIR-V by Slang but Vulkan GLSL requires a block). And in the repro code you can see I've also enabled the validation layers, including the GPU-assisted validation, so it would have caught an issue in the SPIR-V.

2 replies

csyonghe Apr 17, 2024
Maintainer

The definition of the push constant variable seems suspicious. Are you able to confirm that the shader can indeed write anything into that buffer?

I think it needs to be
[vk::push_constant]
uniform Constant constant;

The host probably also needs to use a pipeline barrier before reading back the buffer content.

kevinboulain Apr 17, 2024
Author

> The definition of the push constant variable seems suspicious. [...] I think it needs to be [...]

Adding uniform doesn't change the behavior, nor the SPIR-V.

> Are you able to confirm that the shader can indeed write anything into that buffer?

Yeah, with a very suspicious change (looks like I forgot to include that in the original post, my bad):

-  atomic_min<Scope::Device, MemorySemantics::None>(shared_memory[0], 21);
+  atomic_store<Scope::Device, MemorySemantics::None>(shared_memory[0], 21);

Also, I updated the repro with its GLSL counterpart. None of the host code changes but it's now possible to switch between Slang and GLSL to observe the difference.

> The host probably also needs to use a pipeline barrier before reading back the buffer content.

Oh? Am I missing something? In main, the command buffer does:

- a dispatch of the compute shader,
- a compute shader write / transfer read pipeline barrier,
- a transfer from the GPU buffer to the host buffer,
- a transfer write / host read pipeline barrier,
- a fence on the queue,

before the host reads from the readback buffer.
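
The ordering described above can be written down as a toy model to check that every read is covered by a barrier. This is only a sketch in Python under simplifying assumptions: the stage and access names are shortened forms of the Vulkan `VK_PIPELINE_STAGE_*` and `VK_ACCESS_*` flags, `find_hazards` is a hypothetical helper, the fence is not modeled (the host read is assumed to happen after it is signaled), and real Vulkan synchronization has many more rules than this.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Access:
    stage: str   # shortened VK_PIPELINE_STAGE_* name
    access: str  # shortened VK_ACCESS_* name

@dataclass(frozen=True)
class Barrier:
    src: Access
    dst: Access

@dataclass(frozen=True)
class Command:
    name: str
    reads: Optional[Access] = None
    writes: Optional[Access] = None

SHADER_WRITE = Access("COMPUTE_SHADER", "SHADER_WRITE")
TRANSFER_READ = Access("TRANSFER", "TRANSFER_READ")
TRANSFER_WRITE = Access("TRANSFER", "TRANSFER_WRITE")
HOST_READ = Access("HOST", "HOST_READ")

SEQUENCE = [
    Command("dispatch", writes=SHADER_WRITE),
    Barrier(SHADER_WRITE, TRANSFER_READ),
    Command("copy to readback buffer", reads=TRANSFER_READ, writes=TRANSFER_WRITE),
    Barrier(TRANSFER_WRITE, HOST_READ),
    Command("host read", reads=HOST_READ),
]

def find_hazards(sequence):
    """Return the names of commands that read data no barrier made visible to them."""
    last_write = None   # most recent write in the chain
    visible_to = None   # access the last write has been made visible to
    hazards = []
    for item in sequence:
        if isinstance(item, Barrier):
            if item.src == last_write:
                visible_to = item.dst
            continue
        if item.reads is not None and last_write is not None and visible_to != item.reads:
            hazards.append(item.name)
        if item.writes is not None:
            last_write, visible_to = item.writes, None
    return hazards

print(find_hazards(SEQUENCE))                   # []
print(find_hazards(SEQUENCE[:3] + SEQUENCE[4:]))  # ['host read']
```

In this model the sequence as described has no uncovered read, while dropping the second barrier flags the host read.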

I'll spend some time today to update the Slang test framework to reflect the bindless/push constant usage.

kevinboulain · 2024-04-17T14:43:01Z

Author

I've made a mess out of the test framework to pass the buffer address as a push constant: master...kevinboulain:slang:master

~~I do get the correct result...~~ And if I trust the test's output, the SPIR-V is basically the same. ~~Which can only mean I'm doing something dumb. So I'm gonna close for now to not waste more of your time (sorry).~~

kevinboulain · 2024-04-17T17:03:04Z

Author

Silly me, I forgot that the driver selection depends on which function the application calls to get the devices, so the Slang test framework picks my Intel graphics card by default.
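
One way to take the guesswork out of device selection is to pin the Vulkan loader to a single ICD through the environment. The sketch below only builds that environment in Python; `env_for_icd` is a hypothetical helper, the ICD paths are the NixOS ones used in this post and will differ on other systems, and `VK_ICD_FILENAMES` and `VK_LOADER_DEBUG` are the Vulkan loader variables.

```python
import os

# ICD manifest paths from this post (NixOS); adjust for other systems.
INTEL_ICD = "/run/opengl-driver/share/vulkan/icd.d/intel_icd.x86_64.json"
NVIDIA_ICD = "/run/opengl-driver/share/vulkan/icd.d/nvidia_icd.x86_64.json"

def env_for_icd(icd_json, base_env=None, loader_debug=None):
    """Return a copy of the environment that restricts the loader to one ICD."""
    env = dict(os.environ if base_env is None else base_env)
    env["VK_ICD_FILENAMES"] = icd_json
    if loader_debug is not None:
        env["VK_LOADER_DEBUG"] = loader_debug
    return env

env = env_for_icd(NVIDIA_ICD, base_env={}, loader_debug="all")
print(env["VK_ICD_FILENAMES"])
print(env["VK_LOADER_DEBUG"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching the test binary, so the same command can be run once per driver.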

So, on the Intel graphics card:

VK_ICD_FILENAMES=/run/opengl-driver/share/vulkan/icd.d/intel_icd.x86_64.json VK_LOADER_DEBUG=all .../Release/bin/slang-test tests/d3950.slang
...
DRIVER | LAYER:           Using "Intel(R) Graphics (ADL GT2)" with driver: "/nix/store/5xylmnrx48vqlcfv1daj63kql91gflx3-mesa-24.0.3-drivers/lib/libvulkan_intel.so"
got device address: 19327352832
setting device address: 19327352832
setData this=0x2489de0 m_data.getCount=16 size=8
bindAsEntryPoint this=0x2482900 m_data.getCount=0
push constants device address: 19327352832
...
passed test: 'tests/d3950.slang.1 (vk)' 

===
100% of tests passed (2/2)
===

cat tests/d3950.slang.1.actual.txt 
type: uint32_t
21

And on the Nvidia graphics card:

VK_ICD_FILENAMES=/run/opengl-driver/share/vulkan/icd.d/nvidia_icd.x86_64.json VK_LOADER_DEBUG=all .../Release/bin/slang-test tests/d3950.slang
...
DRIVER | LAYER:           Using "NVIDIA GeForce RTX 3080 Ti Laptop GPU" with driver: "/nix/store/1pzz5b3aw9rcbbihg12axrwh7mpmvm99-nvidia-x11-550.67-6.8.4/lib/libGLX_nvidia.so.0"
got device address: 957644275712
setting device address: 957644275712
setData this=0x1c27680 m_data.getCount=16 size=8
bindAsEntryPoint this=0x1e27de0 m_data.getCount=0
push constants device address: 957644275712
error: slang-test: tests/d3950.slang:23:9: error: BUF: expected string not found in input
 //BUF: 21
        ^

error: slang-test: actual-output:1:1: note: scanning from here
type: uint32_t
^

error: slang-test: actual-output:1:12: note: possible intended match here
type: uint32_t
           ^

FAILED test: 'tests/d3950.slang.1 (vk)' 

===
50% of tests passed (1/2)
===

failing tests:
---
tests/d3950.slang.1 (vk)
---

cat tests/d3950.slang.1.actual.txt 
type: uint32_t
4294967295

Same behavior as in the OP and the repro code I posted. Would you like to try out the branch I've linked?

2 replies

csyonghe Apr 17, 2024
Maintainer

We can take a look, but this is likely a driver problem and we will need to redirect this to some driver experts if you can share your branch.

kevinboulain Apr 17, 2024
Author

(If that wasn't clear from the previous comment, the patch to test the push constants is kevinboulain@4134d7e. Sorry for the poor code, I couldn't figure out how to pass the address between the test inputs and setData didn't seem to actually set the push constant.)
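
For readers who want to reproduce the push constant setup outside the test framework, the data itself is small: the Constants block holds a single 64-bit device address at offset 0 (8 bytes, matching the `size=8` in the log). A minimal sketch in Python, assuming a little-endian host; `pack_push_constants` is a hypothetical helper and not part of the Slang test framework.

```python
import struct

def pack_push_constants(device_address):
    """Pack Constants { uint *device_memory; } as one little-endian 64-bit address."""
    if not 0 <= device_address < 2**64:
        raise ValueError("device address must fit in 64 bits")
    return struct.pack("<Q", device_address)

data = pack_push_constants(957644275712)  # address from the Nvidia log above
print(len(data))                          # 8
print(struct.unpack("<Q", data)[0])       # 957644275712
```

These are the bytes that would be handed to vkCmdPushConstants with offset 0 and size 8.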
