Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with a counting-sort step, building a dictionary that maps each unique number to the number of times it occurs in the list:

d = {1: 1, 2: 3, 3: 1}
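A minimal sketch of this counting step, assuming the input is a non-empty list of non-negative integers. The function name count_occurrences is made up for illustration and is not taken from the project's code:

```python
from collections import Counter

def count_occurrences(values):
    """Return {number: occurrences} with the keys in ascending order."""
    counts = Counter(values)
    # Sorting the keys here is what makes this the "sort" half of the idea.
    return {key: counts[key] for key in sorted(counts)}

l = [1, 2, 2, 2, 3]
d = count_occurrences(l)
print(d)  # {1: 1, 2: 3, 3: 1}
```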

To decrease the space needed to store the numbers in memory, we keep the keys in ascending order and store only the first key as it is; every later key is stored as the difference between it and the previous key. The counts are left unchanged:

d2 = [(1, 1), (1, 3), (1, 1)]
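The delta step could be sketched like this. The name delta_encode is hypothetical, and the code assumes the dictionary holds non-negative integer keys:

```python
def delta_encode(d):
    """Turn {key: count} into [(key_delta, count), ...] with keys ascending."""
    pairs = []
    previous = 0
    for index, key in enumerate(sorted(d)):
        # The first key is stored as it is; later keys are stored as the
        # gap to the key before them.
        delta = key if index == 0 else key - previous
        pairs.append((delta, d[key]))
        previous = key
    return pairs

d2 = delta_encode({1: 1, 2: 3, 3: 1})
print(d2)  # [(1, 1), (1, 3), (1, 1)]
```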

Every key in d2 now fits in 1 bit, since the largest stored key (the first key or any difference between consecutive keys) is 1. The same reasoning applies to the values, except that each value needs 2 bits, since the largest value is 3 (11 in binary). So we can store this list as a sequence of 3-bit elements, each made of 1 key bit followed by 2 value bits, like this:

d2_bin = ["101", "111", "101"]
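One way to derive the two bit widths and the fixed-width elements is sketched below. The helper name to_bit_strings is an assumption for illustration, and the key width is clamped to at least 1 bit so that a list whose only stored key is 0 still gets a key field:

```python
def to_bit_strings(pairs):
    """Encode [(key_delta, count), ...] as fixed-width binary strings."""
    # Width of the widest key delta and of the widest count.
    key_bits = max(max(delta for delta, _ in pairs).bit_length(), 1)
    value_bits = max(count for _, count in pairs).bit_length()
    chunks = [
        format(delta, f"0{key_bits}b") + format(count, f"0{value_bits}b")
        for delta, count in pairs
    ]
    return chunks, key_bits, value_bits

d2_bin, key_bits, value_bits = to_bit_strings([(1, 1), (1, 3), (1, 1)])
print(d2_bin, key_bits, value_bits)  # ['101', '111', '101'] 1 2
```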

We can now return the list as a single number, along with a pair of integers containing the number of bits in each key and the number of bits in each value, so that the number can be decompressed back into the sorted list.
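Packing and unpacking might look like the sketch below. The names pack and unpack are hypothetical, and the decoder reads the elements from the low end of the number, which is one possible choice rather than the project's confirmed method. It works because every count is at least 1, so no element is all zeros and the loop can stop when the number reaches 0:

```python
def pack(pairs, key_bits, value_bits):
    """Concatenate fixed-width (key_delta, count) elements into one int."""
    packed = 0
    for delta, count in pairs:
        packed = (packed << key_bits) | delta
        packed = (packed << value_bits) | count
    return packed

def unpack(packed, key_bits, value_bits):
    """Rebuild the sorted list from the packed int and the two bit widths."""
    pairs = []
    while packed:
        # Elements come out last-to-first when read from the low bits.
        count = packed & ((1 << value_bits) - 1)
        packed >>= value_bits
        delta = packed & ((1 << key_bits) - 1)
        packed >>= key_bits
        pairs.append((delta, count))
    pairs.reverse()
    result, key = [], 0
    for delta, count in pairs:
        key += delta  # undo the delta encoding
        result.extend([key] * count)
    return result

packed = pack([(1, 1), (1, 3), (1, 1)], 1, 2)
print(bin(packed))           # 0b101111101
print(unpack(packed, 1, 2))  # [1, 2, 2, 2, 3]
```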

Memory efficiency

Each line below compares the sum of the bit lengths of all numbers in a list with the number of bits in the resulting compressed integer (assuming all the numbers in the array, duplicates included, are actually stored in contiguous memory). First, for lists of 100 elements with random values in the range 0 to 50, generated 20 times:

And for lists of 1000 numbers from 0 to 50, also generated 20 times:

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359
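A comparison like this could be reproduced with the sketch below. It is an assumption-laden reconstruction, not the script that produced the figures above: compress and measure are made-up names, the range 0 to 50 is taken as inclusive, and bit lengths come from Python's int.bit_length(), so the exact numbers may differ from the ones listed.

```python
import random
from collections import Counter

def compress(values):
    """Count, delta-encode the sorted keys, and pack everything into one int."""
    counts = Counter(values)
    keys = sorted(counts)
    deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
    key_bits = max(max(deltas).bit_length(), 1)
    value_bits = max(counts.values()).bit_length()
    packed = 0
    for delta, key in zip(deltas, keys):
        packed = (packed << key_bits) | delta
        packed = (packed << value_bits) | counts[key]
    return packed, key_bits, value_bits

def measure(n, low, high, rng):
    """Return (sum of raw bit lengths, bit length of the compressed int)."""
    values = [rng.randint(low, high) for _ in range(n)]
    raw_bits = sum(v.bit_length() for v in values)
    packed, _, _ = compress(values)
    return raw_bits, packed.bit_length()

rng = random.Random(0)  # fixed seed so that runs are repeatable
for _ in range(20):
    raw_bits, compressed_bits = measure(1000, 0, 50, rng)
    print(f"{raw_bits} => {compressed_bits}")
```

The compressed size excludes the two small integers holding the key and value bit widths, which would also have to be stored.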

Owner

Douglas
