Skip to content
Thomas Smith edited this page Mar 9, 2024 · 1 revision

Overview

A simple library-less CUDA implementation of DeviceRadixSort and OneSweep. This implementation is purely a demo. Its main purpose is to demystify the implmentations of the algorithms, while also providing data on the performance of the implementations in this library. It is not intended for production or use, instead proper implementatios can be found in the CUB library.

Supported Features:

GPUSortingCUDA includes:

  • DeviceRadixSort: a general purpose 8-bit LSD radix sort using "reduce-then-scan" to perform the inter-threadblock prefix sum of digit counts. Although slower than OneSweep, use this whenever portability is a concern.

  • OneSweep: a general purpose 8-bit LSD radix sort using "chained-scan-with-decoupled-lookback" to perfom the the inter-threadblock prefix sum of digit counts.

GPUSortingCUDA currently supports:

  • keys only sorting
  • ascending order sorting
  • a maximum sorting size of $2^{32} - 128$ for DeviceRadixSort
  • a maxmimum sorting size of $2^{30} - 128$ for OneSweep
  • Thearling and Smith entropy controlled benchmarking.
  • uint32_t

Getting Started

  1. Clone or fork the repository.

  2. Download and install CUDA toolkit 12.3.2

  3. Download and install Visual Studio 2019 or greater.

  4. Open the solution file with Visual studio. If on Visual 2022, updating the platform toolset from v142 to v143 should work without issue.

  5. In Visual Studio go to Project > Properties > Configuration Properties > CUDA C/C++ > Device > Code Generation

  6. Change compute_x, sm_x to match the compute capability of your graphics card. You can lookup your graphics card's compute capability here. For example my 2080 super is compute capability 7.5 so I set the properties to compute_75, sm_75

  7. Build and run the solution.

Example Use

Create a sort object by calling the constructor for its wrapper class, and pass in the desired size of allocation for GPU resources:

OneSweepDispatcher* oneSweep = new OneSweepDispatcher(1 << 28);

Then run the tests:

oneSweep->TestAll();

Then benchmark the performance. Use the ENTROPY_PRESET enum to control the entropy of the distribution:

oneSweep->BatchTiming(1 << 28, 50, 10, ENTROPY_PRESET_1);
Clone this wiki locally