CaskDB is a disk-based, embedded, persistent key-value store based on Riak's Bitcask paper, written in Python.

Overview

CaskDB - Disk based Log Structured Hash Table Store

CaskDB is a disk-based, embedded, persistent key-value store based on Riak's Bitcask paper, written in Python. It is aimed at education rather than production use. The file format is independent of platform, machine, and programming language. For example, a database file created from Python on macOS should be readable by a Rust implementation on Windows.

This project aims to help anyone, even a beginner in databases, build a persistent database in a few hours. There are no external dependencies; only the Python standard library is enough.

If you are interested in writing the database yourself, head to the workshop section.

Features

  • Low latency for reads and writes
  • High throughput
  • Easy to back up / restore
  • Simple and easy to understand
  • Store data much larger than the RAM

Limitations

Most of the following limitations are specific to CaskDB. Some, however, follow from design constraints in the Bitcask paper.

  • A single file stores all data, and deleted keys still take up space
  • CaskDB does not offer range scans
  • CaskDB requires keeping all the keys in memory. With a lot of keys, RAM usage will be high
  • Startup is slow since it needs to load all the keys into memory

Dependencies

CaskDB does not require any external libraries to run. For local development, install the packages from requirements_dev.txt:

pip install -r requirements_dev.txt

Installation

CaskDB is not published on PyPI yet (issue #5), so you have to install it by cloning the repository.

Usage

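# assumes disk_store.py from the cloned repository is on the import path
from disk_store import DiskStorage

disk: DiskStorage = DiskStorage(file_name="books.db")
disk.set(key="othello", value="shakespeare")
author: str = disk.get("othello")
# it also supports a dictionary-style API:
disk["hamlet"] = "shakespeare"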

Prerequisites

The workshop is for intermediate to advanced programmers. Knowing Python is not a requirement, and you can build the database in any language you wish.

Not sure where you stand? You are ready if you have done the following in any language:

  • Used a dictionary or hash table data structure
  • Converted an object (class, struct, or dict) to JSON and converted the JSON back to the object
  • Opened a file to write or read anything. A common task is dumping a dictionary's contents to disk and reading them back

Workshop

NOTE: I don't have any workshops scheduled in the near future. Follow me on Twitter for updates. Drop me an email if you wish to arrange a workshop for your team/company.

CaskDB comes with a full test suite and a wide range of tools to help you write a database quickly. A GitHub Action is present with an automated test runner, code formatter, linter, type checker and static analyser. Fork the repo, push the code, and pass the tests!

Throughout the workshop, you will implement the following:

  • Serialiser methods that take a bunch of objects and serialise them into bytes, and deserialiser methods that take a bunch of bytes and turn them back into the objects.
  • Come up with a data format with a header and data to store the bytes on the disk. The header would contain metadata like timestamp, key size, and value size.
  • Store and retrieve data from the disk
  • Read an existing CaskDB file to load all keys
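
The serialisation step above can be sketched roughly as follows. This is a minimal illustration, not CaskDB's actual code: the little-endian byte order, the names encode_kv and decode_kv, and the UTF-8 encoding of strings are assumptions made for the example. The header fields (timestamp, key size, value size, each a 4-byte unsigned integer) come from the task list in this README.

```python
import struct
from typing import Tuple

# Assumed layout: little-endian, three unsigned 4-byte integers
# (timestamp, key size, value size), followed by the key and value bytes.
HEADER_FORMAT = "<LLL"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 12 bytes


def encode_kv(timestamp: int, key: str, value: str) -> bytes:
    """Serialise one key-value pair into header + key + value bytes."""
    key_bytes = key.encode("utf-8")
    value_bytes = value.encode("utf-8")
    header = struct.pack(HEADER_FORMAT, timestamp, len(key_bytes), len(value_bytes))
    return header + key_bytes + value_bytes


def decode_kv(data: bytes) -> Tuple[int, str, str]:
    """Deserialise bytes produced by encode_kv back into (timestamp, key, value)."""
    timestamp, key_size, value_size = struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])
    key_end = HEADER_SIZE + key_size
    key = data[HEADER_SIZE:key_end].decode("utf-8")
    value = data[key_end:key_end + value_size].decode("utf-8")
    return timestamp, key, value
```

Fixing the byte order explicitly (the "<" prefix) is what would keep such a file readable across machines and languages.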

Tasks

  1. Read the paper. Fork this repo and check out the start-here branch
  2. Implement the fixed-size header, which encodes timestamp (uint, 4 bytes), key size (uint, 4 bytes), and value size (uint, 4 bytes) together
  3. Implement the key, value serialisers, and pass the tests from test_format.py
  4. Figure out how to store the data on disk and the row pointer in memory. Implement the get/set operations. Tests for these are in test_disk_store.py
  5. The code from tasks #2 and #3 should be enough to read an existing CaskDB file and load the keys into memory
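
To make tasks 4 and 5 more concrete, here is one possible shape for an append-only store with an in-memory key directory. It is a sketch under assumptions, not the reference solution: TinyStore, key_dir, and the choice to return an empty string for a missing key are all hypothetical, and the header layout assumes little-endian byte order.

```python
import os
import struct
import time
from typing import Dict, Tuple

HEADER_FORMAT = "<LLL"  # assumed: timestamp, key size, value size
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


class TinyStore:
    """Hypothetical append-only store; not CaskDB's actual DiskStorage."""

    def __init__(self, file_name: str) -> None:
        # key -> (offset of the record in the file, total record size)
        self.key_dir: Dict[str, Tuple[int, int]] = {}
        self.file = open(file_name, "a+b")  # creates the file if it is missing
        self._load_keys()

    def _load_keys(self) -> None:
        # Walk the file record by record; later records overwrite earlier ones.
        self.file.seek(0)
        offset = 0
        while True:
            header = self.file.read(HEADER_SIZE)
            if len(header) < HEADER_SIZE:
                break
            _, key_size, value_size = struct.unpack(HEADER_FORMAT, header)
            key = self.file.read(key_size).decode("utf-8")
            self.file.seek(value_size, os.SEEK_CUR)  # skip the value bytes
            size = HEADER_SIZE + key_size + value_size
            self.key_dir[key] = (offset, size)
            offset += size

    def set(self, key: str, value: str) -> None:
        k, v = key.encode("utf-8"), value.encode("utf-8")
        record = struct.pack(HEADER_FORMAT, int(time.time()), len(k), len(v)) + k + v
        self.file.seek(0, os.SEEK_END)
        offset = self.file.tell()
        self.file.write(record)
        self.file.flush()
        self.key_dir[key] = (offset, len(record))

    def get(self, key: str) -> str:
        if key not in self.key_dir:
            return ""  # sketch-only choice for missing keys
        offset, size = self.key_dir[key]
        self.file.seek(offset)
        record = self.file.read(size)
        _, key_size, value_size = struct.unpack(HEADER_FORMAT, record[:HEADER_SIZE])
        start = HEADER_SIZE + key_size
        return record[start:start + value_size].decode("utf-8")

    def close(self) -> None:
        self.file.close()
```

Only the keys and their file positions live in memory; values are read from disk on demand, which is how a store of this kind can hold more data than fits in RAM.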

Use make lint to run mypy, black, and the pytype static analyser. Run make test to run the tests locally. Push the code to GitHub, and the tests will run on different operating systems: Ubuntu, macOS, and Windows.

Not sure how to proceed? Then check the hints file, which contains more details on the tasks.

Hints

  • Check out the documentation of struct.pack for serialisation methods in Python
  • Not sure how to come up with a file format? Read the comment in the format module

What next?

I often get questions about what is next after the basic implementation. Here are some challenges, at different levels of difficulty:

Level 1:

  • Crash safety: the Bitcask paper stores a CRC in each row and verifies the data against it when fetching the row back
  • Key deletion: CaskDB does not have a delete API. Read the paper and implement it
  • Instead of using a hash table, use a data structure like the red-black tree to support range scans
  • CaskDB accepts only strings as keys and values. Make it generic and take other data structures like int or bytes.
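
For the crash-safety challenge, a checksum can be computed with zlib.crc32 from the standard library. The sketch below is only one way to do it: the helper names add_crc and verify_crc are hypothetical, and placing a 4-byte little-endian checksum in front of the record is an assumption for this example.

```python
import struct
import zlib

CRC_FORMAT = "<L"  # assumed: 4-byte unsigned checksum stored before the record
CRC_SIZE = struct.calcsize(CRC_FORMAT)


def add_crc(record: bytes) -> bytes:
    """Prefix the record with the CRC32 of its bytes."""
    return struct.pack(CRC_FORMAT, zlib.crc32(record)) + record


def verify_crc(stored: bytes) -> bytes:
    """Return the record if its checksum matches, otherwise raise ValueError."""
    (expected,) = struct.unpack(CRC_FORMAT, stored[:CRC_SIZE])
    record = stored[CRC_SIZE:]
    if zlib.crc32(record) != expected:
        raise ValueError("checksum mismatch: record is corrupted")
    return record
```

A store would call add_crc before writing a row and verify_crc after reading it back, so a partially written or damaged row is detected instead of being returned as data.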

Level 2:

  • Hint file to improve the startup time. The paper has more details on it
  • Implement an internal cache which stores some of the key-value pairs. You may explore and experiment with different cache eviction strategies like LRU, LFU, FIFO etc.
  • Split the data into multiple files when the files hit a specific capacity
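
As a starting point for the cache challenge, an LRU policy can be built on collections.OrderedDict. This is a minimal sketch; the class name LRUCache and its get/put methods are made up for the example and are not part of CaskDB.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Hypothetical least-recently-used cache for key-value pairs."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self.items:
            return None
        # Mark the key as most recently used.
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key: str, value: str) -> None:
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            # Drop the least recently used entry (the oldest in the ordering).
            self.items.popitem(last=False)
```

A store could consult such a cache in get before touching the disk and update it in set. Swapping the eviction rule is then a matter of changing how entries are ordered and removed.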

Level 3:

  • Support for multiple processes
  • Garbage collector: keys that were updated or deleted remain in the file and take up space. Write a garbage collector to remove such stale data
  • Add a SQL query engine layer
  • Store JSON in values and explore turning CaskDB into a document database like MongoDB
  • Make CaskDB distributed by exploring algorithms like Raft, Paxos, or consistent hashing
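
For the garbage collector challenge, one simple approach is to copy only the latest record of each key into a fresh file. The sketch below assumes the 12-byte little-endian header (timestamp, key size, value size) followed by key and value bytes; the function name compact is hypothetical, and deleted keys are not handled because a delete marker would first have to be defined.

```python
import os
import struct
from typing import Dict, Tuple

HEADER_FORMAT = "<LLL"  # assumed: timestamp, key size, value size
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def compact(src_path: str, dst_path: str) -> None:
    """Copy only the most recent record of every key from src to dst."""
    live: Dict[bytes, Tuple[int, int]] = {}  # key -> (offset, record size)
    with open(src_path, "rb") as src:
        # First pass: remember where the newest record of each key lives.
        offset = 0
        while True:
            header = src.read(HEADER_SIZE)
            if len(header) < HEADER_SIZE:
                break
            _, key_size, value_size = struct.unpack(HEADER_FORMAT, header)
            key = src.read(key_size)
            src.seek(value_size, os.SEEK_CUR)  # values are not needed yet
            size = HEADER_SIZE + key_size + value_size
            live[key] = (offset, size)
            offset += size
        # Second pass: copy the live records, one at a time, into the new file.
        with open(dst_path, "wb") as dst:
            for record_offset, record_size in live.values():
                src.seek(record_offset)
                dst.write(src.read(record_size))
```

After compaction, the store would switch to the new file and rebuild its in-memory key directory, since record offsets change.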

Name

This project was previously named cdb and has since been renamed to CaskDB.

Line Count

$ tokei -f format.py disk_store.py
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Python                  2          391          261          103           27
-------------------------------------------------------------------------------
 disk_store.py                      204          120           70           14
 format.py                          187          141           33           13
===============================================================================
 Total                   2          391          261          103           27
===============================================================================

License

The MIT license. Please check LICENSE for more details.
