In-place Parallel Super Scalar Samplesort (IPS⁴o)

Related tags

Deep Learningips4o
Overview

In-place Parallel Super Scalar Samplesort (IPS⁴o)

This is the implementation of the algorithm IPS⁴o presented in the paper Engineering In-place (Shared-memory) Sorting Algorithms, which contains an in-depth description of its inner workings, as well as an extensive experimental performance evaluation. Here's the abstract:

We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account.

Our new comparison-based algorithm In-place Superscalar Samplesort (IPS⁴o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings.

Another surprising result is that IPS⁴o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS²Ra) turns out to be the best algorithm.

Claims to have the -- in some sense -- "best" sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.

An initial version of IPS⁴o has been described in our publication on the 25th Annual European Symposium on Algorithms.

Usage

Clone this repository and check out its submodule

git clone --recurse-submodules https://github.com/ips4o/ips4o.git

or use the following commands instead if you want to include this repository as a submodule:

git submodule add https://github.com/ips4o/ips4o.git
git submodule update --recursive --init

IPS⁴o provides a CMake library for simple usage:

add_subdirectory(<path-to-the-ips4o-repository>)
target_link_libraries(<your-target> PRIVATE ips4o)

A minimal working example:

#include "ips4o.hpp"

// sort sequentially
ips4o::sort(begin, end[, comparator]);

// sort in parallel (uses OpenMP if available, std::thread otherwise)
ips4o::parallel::sort(begin, end[, comparator]);

The parallel version of IPS⁴o requires 16-byte atomic compare-and-exchange instructions to run the fastest. Most CPUs and compilers support 16-byte compare-and-exchange instructions nowadays. If the CPU in question does so, IPS⁴o uses 16-byte compare-and-exchange instructions when you set your CPU correctly (e.g., -march=native) or when you enable the instructions explicitly (-mcx16). In this case, you also have to link against GCC's libatomic (-latomic). Otherwise, we emulate some 16-byte compare-and-exchange instructions with locks which may slightly mitigate the performance of IPS⁴o.

If you use the CMake example shown above, we automatically optimize IPS⁴o for the native CPU (e.g., -march=native). You can disable the CMake property IPS4O_OPTIMIZE_FOR_NATIVE to avoid native optimization and you can enable the CMake property IPS4O_USE_MCX16 if you compile with GCC or Clang to enable 16-byte compare-and-exchange instructions explicitly.

IPS⁴o uses C++ threads if not specified otherwise. If you prefer OpenMP threads, you need to enable OpenMP threads, e.g., enable the CMake property IPS4O_USE_OPENMP or add OpenMP to your target. If you enable the CMake property DISABLE_IPS4O_PARALLEL, most of the parallel code will not be compiled and no parallel libraries will be linked. Otherwise, CMake automatically enables C++ threads (e.g., -pthread) and links against TBB and GCC's libatomic. (Only when you compile your code for 16-byte compare-and-exchange instructions you need libatomic.) Thus, you need the Thread Building Blocks (TBB) library to compile and execute the parallel version of IPS⁴o. We search for TBB with find_package(TBB REQUIRED). If you want to execute IPS⁴o in parallel but your TBB library is not accessible via find_package(TBB REQUIRED), you can still compile IPS⁴o with parallel support. Just enable the CMake property DISABLE_IPS4O_PARALLEL, enable C++ threads for your own target and link your own target against your TBB library (and also link your target against libatomic if you want 16-byte atomic compare-and-exchange instruction support).

If you do not set a CMake build type, we use the build type Release which disables debugging (e.g., -DNDEBUG) and enables optimizations (e.g., -O3).

Currently, the code does not compile on Windows.

Licensing

IPS⁴o is free software provided under the BSD 2-Clause License described in the LICENSE file. If you use this implementation of IPS⁴o in an academic setting please cite the paper Engineering In-place (Shared-memory) Sorting Algorithms using the BibTeX entry

@misc{axtmann2020engineering,
  title =	 {Engineering In-place (Shared-memory) Sorting Algorithms},
  author =	 {Michael Axtmann and Sascha Witt and Daniel Ferizovic and Peter Sanders},
  howpublished = {Computing Research Repository (CoRR)},
  year =	 {Sept. 2020},
  archivePrefix ={arXiv},
  eprint =	 {2009.13569},
}
Code repo for EMNLP21 paper "Zero-Shot Information Extraction as a Unified Text-to-Triple Translation"

Zero-Shot Information Extraction as a Unified Text-to-Triple Translation Source code repo for paper Zero-Shot Information Extraction as a Unified Text

cgraywang 88 Dec 31, 2022
Voice control for Garry's Mod

WIP: Talonvoice GMod integrations Very work in progress voice control demo for Garry's Mod. HOWTO Install https://talonvoice.com/ Press https://i.imgu

Meta Construct 5 Nov 15, 2022
Rainbow DQN implementation that outperforms the paper's results on 40% of games using 20x less data 🌈

Rainbow 🌈 An implementation of Rainbow DQN which outperforms the paper's (Hessel et al. 2017) results on 40% of tested games while using 20x less dat

Dominik Schmidt 31 Dec 21, 2022
Recurrent Neural Network Tutorial, Part 2 - Implementing a RNN in Python and Theano

Please read the blog post that goes with this code! Jupyter Notebook Setup System Requirements: Python, pip (Optional) virtualenv To start the Jupyter

Denny Britz 863 Dec 15, 2022
PyTorch code for ICPR 2020 paper Future Urban Scene Generation Through Vehicle Synthesis

Future urban scene generation through vehicle synthesis This repository contains Pytorch code for the ICPR2020 paper "Future Urban Scene Generation Th

Alessandro Simoni 4 Oct 11, 2021
Joint parameterization and fitting of stroke clusters

StrokeStrip: Joint Parameterization and Fitting of Stroke Clusters Dave Pagurek van Mossel1, Chenxi Liu1, Nicholas Vining1,2, Mikhail Bessmeltsev3, Al

Dave Pagurek 44 Dec 01, 2022
The implementation of the algorithm in the paper "Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data" published in ICML 2020.

DS3L This is the code for paper "Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data" published in ICML 2020. Setups The code is implem

Guolz 36 Oct 19, 2022
Voice Gender Recognition

In this project it was used some different Machine Learning models to identify the gender of a voice (Female or Male) based on some specific speech and voice attributes.

Anne Livia 1 Jan 27, 2022
This is the implementation of the paper "Self-supervised Outdoor Scene Relighting"

Self-supervised Outdoor Scene Relighting This is the implementation of the paper "Self-supervised Outdoor Scene Relighting". The model is implemented

Ye Yu 24 Dec 17, 2022
Garbage Detection system which will detect objects based on whether it is plastic waste or plastics or just garbage.

Garbage Detection using Yolov5 on Jetson Nano 2gb Developer Kit. Garbage detection system which will detect objects based on whether it is plastic was

Rishikesh A. Bondade 2 May 13, 2022
A Free and Open Source Python Library for Multiobjective Optimization

Platypus What is Platypus? Platypus is a framework for evolutionary computing in Python with a focus on multiobjective evolutionary algorithms (MOEAs)

Project Platypus 424 Dec 18, 2022
Official repository of "Investigating Tradeoffs in Real-World Video Super-Resolution"

RealBasicVSR [Paper] This is the official repository of "Investigating Tradeoffs in Real-World Video Super-Resolution, arXiv". This repository contain

Kelvin C.K. Chan 566 Dec 28, 2022
PyTorch IPFS Dataset

PyTorch IPFS Dataset IPFSDataset(Dataset) See the jupyter notepad to see how it works and how it interacts with a standard pytorch DataLoader You need

Jake Kalstad 2 Apr 13, 2022
PyTorch implementations of the paper: "Learning Independent Instance Maps for Crowd Localization"

IIM - Crowd Localization This repo is the official implementation of paper: Learning Independent Instance Maps for Crowd Localization. The code is dev

tao han 91 Nov 10, 2022
An educational AI robot based on NVIDIA Jetson Nano.

JetBot Looking for a quick way to get started with JetBot? Many third party kits are now available! JetBot is an open-source robot based on NVIDIA Jet

NVIDIA AI IOT 2.6k Dec 29, 2022
JAX + dataclasses

jax_dataclasses jax_dataclasses provides a wrapper around dataclasses.dataclass for use in JAX, which enables automatic support for: Pytree registrati

Brent Yi 35 Dec 21, 2022
Cognition-aware Cognate Detection

Cognition-aware Cognate Detection The repository which contains our code for our EACL 2021 paper titled, "Cognition-aware Cognate Detection". This wor

Prashant K. Sharma 1 Feb 01, 2022
Source code for "OmniPhotos: Casual 360° VR Photography"

OmniPhotos: Casual 360° VR Photography Project Page | Video | Paper | Demo | Data This repository contains the source code for creating and viewing Om

Christian Richardt 144 Dec 30, 2022
Official PyTorch implementation of "Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble" (NeurIPS'21)

Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble This is the code for reproducing the results of the paper Uncertainty-Bas

43 Nov 23, 2022
PyTorch implementation for paper "Full-Body Visual Self-Modeling of Robot Morphologies".

Full-Body Visual Self-Modeling of Robot Morphologies Boyuan Chen, Robert Kwiatkowskig, Carl Vondrick, Hod Lipson Columbia University Project Website |

Boyuan Chen 32 Jan 02, 2023