An Indexer that works out-of-the-box when you have less than 100K stored Documents

Last update: Mar 15, 2022

Overview

U100KIndexer

An Indexer that works out of the box when you have fewer than 100K stored Documents. U100K stands for "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect about 300 ms for a single query, or roughly 20 to 120 QPS for batched queries. Results are full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as its storage backend and .match() to run the nearest-neighbour search. It returns the full Documents as-is, so there is no need to pair it with a separate key-value indexer to retrieve Documents.
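To make "exhaustive search" concrete, here is a minimal sketch of the idea in plain Python: every stored embedding is scored against the query, and the closest ones are returned. It is not the code behind .match(); the function names, the document ids, and the choice of cosine distance are assumptions made for illustration only.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_search(query, stored, top_k=3):
    """Score the query against every stored embedding and keep the top_k closest.

    `stored` maps a (hypothetical) document id to its embedding.
    The cost grows linearly with len(stored), which is why this approach
    only stays fast for small collections.
    """
    scored = [(cosine_distance(query, emb), doc_id) for doc_id, emb in stored.items()]
    scored.sort()  # smallest distance first
    return [(doc_id, dist) for dist, doc_id in scored[:top_k]]


# Toy 3-dim embeddings standing in for real 768-dim ones.
stored = {
    'doc-a': [1.0, 0.0, 0.0],
    'doc-b': [0.9, 0.1, 0.0],
    'doc-c': [0.0, 1.0, 0.0],
}
results = exhaustive_search([1.0, 0.0, 0.0], stored, top_k=2)
for doc_id, dist in results:
    print(doc_id, round(dist, 4))
```

Because no candidate is skipped, the result is exact, which is what the "highest recall" claim refers to; the price is that each query touches every stored Document.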

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K
Always returns full Documents
No extra dependencies

Cons

Slow query time: the search is exhaustive, so query time grows with the number of stored Documents

Performance

The indexing and query performance on 768-dim embeddings is as follows (all times in seconds):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
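The QPS range quoted in the overview follows from the 100000 row of the table: throughput is the batch size divided by the time the batch took. A short sketch of that arithmetic, using only the numbers in the table (the variable names are illustrative):

```python
# Query times in seconds at 100,000 stored Documents, copied from the table above.
# Keys are the query batch sizes (1, 8, 64).
query_seconds = {1: 0.297, 8: 0.332, 64: 0.536}

# Throughput = number of queries in the batch / time taken for the batch.
qps = {batch: batch / seconds for batch, seconds in query_seconds.items()}

for batch, value in qps.items():
    print(f'batch size {batch:>2}: {value:6.1f} QPS')
```

A single query runs at about 3.4 QPS (roughly 300 ms each), while batches of 8 and 64 reach about 24 and 119 QPS, which matches the 20 to 120 QPS range for batched queries.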

Tips

To change the workspace, pass it through metas:

U100KIndexer(metas={'workspace': './my'})

Or pass uses_metas to .add(..., uses_metas={'workspace': './my'}) when you use it in a Flow.

Owner

Jina AI
