Releases: NVIDIA/cub
CUB 1.13.0 (NVIDIA HPC SDK 21.7)
CUB 1.13.0 is the major release accompanying the NVIDIA HPC SDK 21.7 release.

Notable new features include support for striped data arrangements in block load/store utilities, `bfloat16` radix sort support, and fewer restrictions on offset iterators in segmented device algorithms. Several bugs in `cub::BlockShuffle`, `cub::BlockDiscontinuity`, and `cub::DeviceHistogram` have been addressed. The amount of code generated in `cub::DeviceScan` has been greatly reduced, leading to significant compile-time improvements when targeting multiple PTX architectures.

This release also includes several user-contributed documentation fixes that will be reflected in CUB's online documentation in the coming weeks.
Breaking Changes
- #320: Deprecated `cub::TexRefInputIterator<T, UNIQUE_ID>`. Use `cub::TexObjInputIterator<T>` as a replacement (see the sketch below).
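The migration path is straightforward; here is a minimal host-side sketch, assuming a device buffer `d_in` of `num_items` floats (names chosen for illustration only):

```cuda
// Sketch only: replacing the deprecated cub::TexRefInputIterator with
// cub::TexObjInputIterator. Assumes d_in points to num_items floats in
// device memory; error checking is elided for brevity.
#include <cub/iterator/tex_obj_input_iterator.cuh>

void WrapInTexture(float *d_in, int num_items)
{
    cub::TexObjInputIterator<float> itr;
    itr.BindTexture(d_in, sizeof(float) * num_items); // host-side bind
    // ... pass `itr` as the input iterator of a device-wide algorithm ...
    itr.UnbindTexture();                              // host-side unbind
}
```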
New Features
- #274: Add `BLOCK_LOAD_STRIPED` and `BLOCK_STORE_STRIPED` functionality to `cub::BlockLoadAlgorithm` and `cub::BlockStoreAlgorithm` (see the sketch after this list). Thanks to Matthew Nicely (@mnicely) for this contribution.
- #291: `cub::DeviceSegmentedRadixSort` and `cub::DeviceSegmentedReduce` now support different types for begin/end offset iterators. Thanks to Sergey Pavlov (@psvvsp) for this contribution.
- #306: Add `bfloat16` support to `cub::DeviceRadixSort`. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
- #320: Introduce a new `CUB_IGNORE_DEPRECATED_API` macro that disables deprecation warnings on Thrust and CUB APIs.
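To illustrate the new striped arrangement, here is a minimal kernel sketch; the 128-thread block size, 4 items per thread, and the `d_in`/`d_out` buffers are assumptions made for this example:

```cuda
// Sketch: load a tile with the new BLOCK_LOAD_STRIPED algorithm and write it
// back with BLOCK_STORE_STRIPED. Block size, items per thread, and buffer
// names are assumptions for illustration.
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>

__global__ void StripedCopyKernel(const int *d_in, int *d_out)
{
    constexpr int BLOCK_THREADS    = 128;
    constexpr int ITEMS_PER_THREAD = 4;

    using BlockLoadT  = cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                                       cub::BLOCK_LOAD_STRIPED>;
    using BlockStoreT = cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                                        cub::BLOCK_STORE_STRIPED>;

    __shared__ union {
        typename BlockLoadT::TempStorage  load;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    int items[ITEMS_PER_THREAD];
    int tile_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    BlockLoadT(temp_storage.load).Load(d_in + tile_offset, items);
    __syncthreads(); // temp storage is reused by the store below
    BlockStoreT(temp_storage.store).Store(d_out + tile_offset, items);
}
```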
Bug Fixes
- #277: Fixed sanitizer warnings in `RadixSortScanBinsKernels`. Thanks to Andy Adinets (@canonizer) for this contribution.
- #287: `cub::DeviceHistogram` now correctly handles cases where `OffsetT` is not an `int`. Thanks to Dominique LaSalle (@nv-dlasalle) for this contribution.
- #311: Fixed several bugs and added tests for the `cub::BlockShuffle` collective operations.
- #312: Eliminate unnecessary kernel instantiations when compiling `cub::DeviceScan`. Thanks to Elias Stehle (@elstehle) for this contribution.
- #319: Fixed out-of-bounds memory access on debugging builds of `cub::BlockDiscontinuity::FlagHeadsAndTails`.
- #320: Fixed harmless missing return statement warning in unreachable `cub::TexObjInputIterator` code path.
Other Enhancements
- Several documentation fixes are included in this release.
- #275: Fixed comments describing the `cub::If` and `cub::Equals` utilities. Thanks to Rukshan Jayasekara (@rukshan99) for this contribution.
- #290: Documented that `cub::DeviceSegmentedReduce` will produce consistent results run-to-run on the same device for pseudo-associative reduction operators. Thanks to Himanshu (@himanshu007-creator) for this contribution.
- #298: `CONTRIBUTING.md` now refers to Thrust's build instructions for developer builds, which is the preferred way to build the CUB test harness. Thanks to Xiang Gao (@zasdfgbnm) for contributing.
- #301: Expand `cub::DeviceScan` documentation to include in-place support and add tests (a sketch of in-place usage follows this list). Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
- #307: Expand `cub::DeviceRadixSort` and `cub::BlockRadixSort` documentation to clarify stability, in-place support, and type-specific bitwise transformations. Thanks to Himanshu (@himanshu007-creator) for contributing.
- #316: Move `WARP_TIME_SLICING` documentation to the correct location. Thanks to Peter Han (@Peter9606) for this contribution.
- #321: Update URLs from deprecated github.com to preferred github.io. Thanks to Lilo Huang (@lilohuang) for this contribution.
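The in-place usage now covered by the `cub::DeviceScan` documentation follows the usual two-phase pattern; a minimal sketch, assuming a device buffer `d_data` of `num_items` ints and eliding error checks:

```cuda
// Sketch: in-place inclusive prefix sum with cub::DeviceScan, passing the
// same buffer as both input and output. d_data and num_items are assumptions
// for this example; error checking is omitted.
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

void InPlaceInclusiveSum(int *d_data, int num_items)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call only queries the temporary storage requirement.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_data, d_data, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call performs the scan in place.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_data, d_data, num_items);
    cudaFree(d_temp_storage);
}
```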
CUB 1.12.1 (CUDA Toolkit 11.4)
CUB 1.12.1 is a trivial patch release that slightly changes the phrasing of a deprecation message.
CUB 1.12.0 (NVIDIA HPC SDK 21.3, CUDA Toolkit 11.4)
Summary
CUB 1.12.0 is a bugfix release accompanying the NVIDIA HPC SDK 21.3 release and the CUDA Toolkit 11.4 release.
Radix sort is now stable when both +0.0 and -0.0 are present in the input (they are treated as equivalent).
Many compilation warnings and subtle overflow bugs were fixed in the device algorithms, including a long-standing bug that returned invalid temporary storage requirements when `num_items` was close to (but not exceeding) `INT32_MAX`.
Support for Clang < 7.0 and MSVC < 2019 (aka 19.20/16.0/14.20) is now deprecated.
Breaking Changes
- #256: Deprecate Clang < 7 and MSVC < 2019.
New Features
- #218: Radix sort now treats -0.0 and +0.0 as equivalent for floating
point types, which is required for the sort to be stable. Thanks to Andy
Adinets for this contribution.
Bug Fixes
- #247: Suppress newly triggered warnings in Clang. Thanks to Andrew Corrigan for this contribution.
- #249: Enable stricter warning flags. This fixes a number of outstanding issues:
- #258: Use correct `OffsetT` in `DispatchRadixSort::InitPassConfig`. Thanks to Felix Kallenborn for this contribution.
- #259: Remove some problematic `__forceinline__` annotations.
Other Enhancements
- #123: Fix incorrect issue number in changelog. Thanks to Peet Whittaker for this contribution.
CUB 1.11.0 (CUDA Toolkit 11.3)
Summary
CUB 1.11.0 is a major release providing bugfixes and performance enhancements. It includes a new `DeviceRadixSort` backend that improves performance by up to 2x on supported keys and hardware. Our CMake package and build system continue to see improvements with `add_subdirectory` support, installation rules, status messages, and other features that make CUB easier to use from CMake projects. The release includes several other bugfixes and modernizations, and received updates from 11 contributors.
Breaking Changes
- #201: The intermediate accumulator type used when `DeviceScan` is invoked with different input/output types is now consistent with P0571. This may produce different results for some edge cases when compared with earlier releases of CUB.
New Features
- #204: Faster `DeviceRadixSort`, up to 2x performance increase for 32/64-bit keys on Pascal and up (SM60+); see the usage sketch after this list. Thanks to Andy Adinets for this contribution.
- Unroll loops in `BlockRadixRank` to improve performance for 32-bit keys by 1.5-2x on Clang CUDA. Thanks to Justin Lebar for this contribution.
- #200: Allow CUB to be added to CMake projects via `add_subdirectory`.
- #214: Optionally add install rules when included with CMake's `add_subdirectory`. Thanks to Kai Germaschewski for this contribution.
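For reference, the improved sort is invoked through the standard two-phase CUB pattern; the buffer names and key type below are assumptions chosen for illustration:

```cuda
// Sketch: device-wide key sort with cub::DeviceRadixSort::SortKeys.
// d_keys_in, d_keys_out, and num_items are assumptions for this example;
// error checking is omitted.
#include <cub/device/device_radix_sort.cuh>
#include <cuda_runtime.h>

void SortKeysExample(const unsigned int *d_keys_in,
                     unsigned int *d_keys_out,
                     int num_items)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call computes the temporary storage requirement.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call performs the sort.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaFree(d_temp_storage);
}
```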
Bug Fixes
- #215: Fix integer truncation in `AgentReduceByKey`, `AgentScan`, and `AgentSegmentFixup`. Thanks to Rory Mitchell for this contribution.
- #225: Fix compile-time regression when defining `CUB_NS_PREFIX`/`CUB_NS_POSTFIX` macro. Thanks to Elias Stehle for this contribution.
- #210: Fix some edge cases in `DeviceScan`:
  - Use values from the input when padding temporary buffers. This prevents custom functors from getting unexpected values.
  - Prevent integer truncation when using large indices via the `DispatchScan` layer.
  - Use timesliced reads/writes for types > 128 bytes.
- #217: Fix and add test for cmake package install rules. Thanks to Keith Kraus and Kai Germaschewski for testing and discussion.
- #170, #233: Update CUDA version checks to behave correctly on Clang CUDA and `nvc++`. Thanks to Artem Belevich, Andrew Corrigan, and David Olsen for these contributions.
- #220, #216: Various fixes for Clang CUDA. Thanks to Andrew Corrigan for these contributions.
- #231: Fix signedness mismatch warnings in unit tests.
- #231: Suppress GPU deprecation warnings.
- #214: Use semantic versioning rules for our CMake package's compatibility checks. Thanks to Kai Germaschewski for this contribution.
- #214: Use `FindPackageHandleStandardArgs` to print standard status messages when our CMake package is found. Thanks to Kai Germaschewski for this contribution.
- #207: Fix `CubDebug` usage in `CachingDeviceAllocator::DeviceAllocate`. Thanks to Andreas Hehn for this contribution.
- Fix documentation for `DevicePartition`. Thanks to ByteHamster for this contribution.
- Clean up unused code in `DispatchScan`. Thanks to ByteHamster for this contribution.
Other Enhancements
- #213: Remove tuning policies for unsupported hardware (<SM35).
- References to the old Github repository and branch names were updated.
  - Github's `thrust/cub` repository is now `NVIDIA/cub`.
  - Development has moved from the `master` branch to the `main` branch.
CUB 1.10.0 (NVIDIA HPC SDK 20.9, CUDA Toolkit 11.2)
Summary
CUB 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release and the CUDA Toolkit 11.2 release. It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017. It also overhauls CMake support. Finally, we now have a Code of Conduct for contributors: https://github.com/thrust/cub/blob/main/CODE_OF_CONDUCT.md
Breaking Changes
- C++03 is no longer supported.
- GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
- C++11 is deprecated. Using this dialect will generate a compile-time warning. These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_CPP_DIALECT` or `CUB_IGNORE_DEPRECATED_CPP_11` (see the sketch after this list). Suppression is only a short term solution. We will be dropping support for C++11 in the near future.
- CMake < 3.15 is no longer supported.
- The default branch on GitHub is now called `main`.
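Suppressing the dialect warning is a one-line opt-in per translation unit; a minimal sketch (the define must precede any CUB include):

```cuda
// Sketch: opt out of CUB's C++ dialect deprecation warnings while a codebase
// still builds as C++11. The macro must be defined before including CUB.
#define CUB_IGNORE_DEPRECATED_CPP_DIALECT
#include <cub/cub.cuh>
```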
Other Enhancements
- Contributor documentation: https://github.com/thrust/cub/blob/main/CONTRIBUTING.md
- Code of Conduct: https://github.com/thrust/cub/blob/main/CODE_OF_CONDUCT.md. Thanks to Conor Hoekstra for this contribution.
- Added install targets to CMake builds.
- C++17 support.
Bug Fixes
- NVIDIA/thrust#1244: Check for macro collisions with system headers during header testing.
- thrust/thrust#1153: Switch to placement new instead of assignment to construct items in uninitialized memory. Thanks to Hugh Winkler for this contribution.
- thrust/cub#38: Fix `cub::DeviceHistogram` for `size_t` `OffsetT`s. Thanks to Leo Fang for this contribution.
- thrust/cub#35: Fix GCC-5 maybe-uninitialized warning. Thanks to Rong Ou for this contribution.
- thrust/cub#36: Qualify namespace for `va_printf` in `_CubLog`. Thanks to Andrei Tchouprakov for this contribution.
CUB 1.9.10-1 (NVIDIA HPC SDK 20.7, CUDA Toolkit 11.1)
Summary
CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release and the CUDA Toolkit 11.1 release.
Bug Fixes
- #1217: Move static local in `cub::DeviceCount` to a separate host-only function because NVC++ doesn't support static locals in host-device functions.
CUB 1.9.10 (NVIDIA HPC SDK 20.5)
Summary
CUB 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release. It adds CMake `find_package` support. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.
Breaking Changes
- Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11 deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP11` (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_COMPILER`. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.
New Features
- CMake `find_package` support. Just point CMake at the `cmake` folder in your CUB include directory (ex: `cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .`) and then you can add CUB to your CMake project with `find_package(CUB REQUIRED CONFIG)`.
CUB 1.9.9 (CUDA 11.0)
Summary
CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release. It introduces CMake support, version macros, platform detection machinery, and support for NVC++, which uses Thrust (and thus CUB) to implement GPU-accelerated C++17 Parallel Algorithms. Additionally, the scan dispatch layer was refactored and modernized. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.
Breaking Changes
- Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11 deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP11` (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_COMPILER`. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.
New Features
- CMake support. Thanks to Francis Lemaire for this contribution.
- Refactorized and modernized scan dispatch layer. Thanks to Francis Lemaire for this contribution.
- Policy hooks for device-wide reduce, scan, and radix sort facilities to simplify tuning and allow users to provide custom policies. Thanks to Francis Lemaire for this contribution.
- `<cub/version.cuh>`: `CUB_VERSION`, `CUB_VERSION_MAJOR`, `CUB_VERSION_MINOR`, `CUB_VERSION_SUBMINOR`, and `CUB_PATCH_NUMBER` (see the sketch after this list).
- Platform detection machinery:
  - `<cub/util_cpp_dialect.cuh>`: Detects the C++ standard dialect.
  - `<cub/util_compiler.cuh>`: Host and device compiler detection.
  - `<cub/util_deprecated.cuh>`: `CUB_DEPRECATED`.
  - `<cub/config.cuh>`: Includes `<cub/util_arch.cuh>`, `<cub/util_compiler.cuh>`, `<cub/util_cpp_dialect.cuh>`, `<cub/util_deprecated.cuh>`, `<cub/util_macro.cuh>`, and `<cub/util_namespace.cuh>`.
- `cub::DeviceCount` and `cub::DeviceCountUncached`, caching abstractions for `cudaGetDeviceCount`.
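The new version macros can be consumed directly; a minimal sketch using only the macros introduced in `<cub/version.cuh>`:

```cuda
// Sketch: print the CUB version macros introduced in <cub/version.cuh>.
#include <cub/version.cuh>
#include <cstdio>

int main()
{
    std::printf("CUB %d.%d.%d (encoded: %d)\n",
                CUB_VERSION_MAJOR, CUB_VERSION_MINOR,
                CUB_VERSION_SUBMINOR, CUB_VERSION);
    return 0;
}
```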
Other Enhancements
- Lazily initialize the per-device CUDA attribute caches, because CUDA context creation is expensive and adds up with large CUDA binaries on machines with many GPUs. Thanks to the NVIDIA PyTorch team for bringing this to our attention.
- Make `cub::SwitchDevice` avoid setting/resetting the device if the current device is the same as the target device.
Bug Fixes
- Add explicit failure parameter to CAS in the CUB attribute cache to workaround a GCC 4.8 bug.
- Revert a change in reductions that changed the signedness of the `lane_id` variable to suppress a warning, as this introduces a bug in optimized device code.
- Fix initialization in `cub::ExclusiveSum`. Thanks to Conor Hoekstra for this contribution.
- Fix initialization of the `std::array` in the CUB attribute cache.
- Fix `-Wsign-compare` warnings. Thanks to Elias Stehle for this contribution.
- Fix `test_block_reduce.cu` to build without parameters. Thanks to Francis Lemaire for this contribution.
- Add missing includes to `grid_even_share.cuh`. Thanks to Francis Lemaire for this contribution.
- Add missing includes to `thread_search.cuh`. Thanks to Francis Lemaire for this contribution.
- Add missing includes to `cub.cuh`. Thanks to Felix Kallenborn for this contribution.
CUB 1.9.8-1 (NVIDIA HPC SDK 20.3)
Summary
CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms.
CUB 1.9.8 (CUDA 11.0 Early Access)
Summary
CUB 1.9.8 is the first release of CUB to be officially supported and included in the CUDA Toolkit.
When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query APIs, which improves performance of these queries by 20x to 50x when they are called concurrently by multiple host threads.
Enhancements
- (C++11 or later) Cache calls to `cudaFuncGetAttributes` and `cudaDeviceGetAttribute` within `cub::PtxVersion` and `cub::SmVersion` (see the sketch after this list). These CUDA APIs acquire locks to CUDA driver/runtime mutexes and perform poorly under contention; with the caching, they are 20 to 50x faster when called concurrently. Thanks to Bilge Acun for bringing this issue to our attention.
- `DispatchReduce` now takes an `OutputT` template parameter so that users can specify the intermediate type explicitly.
- Radix sort tuning policies updated to fix performance issues for element types smaller than 4 bytes.
Bug Fixes
- Change initialization style from copy initialization to direct initialization (which is more permissive) in `AgentReduce` to allow a wider range of types to be used with it.
- Fix bad signed/unsigned comparisons in `WarpReduce`.
- Fix computation of valid lanes in warp-level reduction primitive to correctly handle the case where there are 0 input items per warp.