Links that briefly explain semantic search and how search systems work with the retrieve-and-rerank approach.

Last update: Jan 28, 2022

Related tags

Text Data & NLP information_retrieval

Overview

Main Idea

The following links briefly explain the idea behind semantic search and how search systems work using the retrieve-and-rerank approach.

Setup

Download trained models

Two models have been trained for Spanish: a bi-encoder and a cross-encoder. Together they form the retrieval system, following the retrieve-and-rerank approach. To download them and install the dependencies:

make setup
pip install -r requirements.txt

Basic usage

Set up the Elasticsearch index with semantic vectors. This step assumes that a set of JSON files is stored in a folder. Each JSON file can contain several optional fields but must contain the id and text fields.

from information_retrieval import SemanticEmbedder, CrossEncoder, Prepare, Search

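data_folder = 'data/'                   # folder containing the JSON files
text_field = "texto_parrafo"            # JSON field holding the text to embed
id_field = "id_parrafo"                 # JSON field holding the document id
elastic_index_name = "sentencias_2.0"   # name of the Elasticsearch index to fill

# Read the files, compute embeddings and upload them to Elasticsearch
P = Prepare(data_folder, text_field, id_field, elastic_index_name)
P.prepare()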
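
Before indexing, it can help to check that every file in the data folder carries the two required fields. The following is a minimal sketch, not part of the package. It assumes each file holds a single JSON object (the source does not say whether a file may hold a list instead). The helper names `find_invalid_files` and `write_sample_folder` and the optional field `fuente` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Field names taken from the example above.
# Assumption: one JSON object per file.
TEXT_FIELD = "texto_parrafo"
ID_FIELD = "id_parrafo"


def find_invalid_files(folder, text_field, id_field):
    """Return the names of JSON files that lack a required field."""
    invalid = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            record = json.load(handle)
        is_valid = (
            isinstance(record, dict)
            and id_field in record
            and text_field in record
        )
        if not is_valid:
            invalid.append(path.name)
    return invalid


def write_sample_folder(folder):
    """Create two sample files: one valid, one missing the text field."""
    good = {ID_FIELD: "p-001", TEXT_FIELD: "La vida es bella.", "fuente": "ejemplo"}
    bad = {ID_FIELD: "p-002"}  # no text field, so it should be reported
    Path(folder, "good.json").write_text(
        json.dumps(good, ensure_ascii=False), encoding="utf-8"
    )
    Path(folder, "bad.json").write_text(json.dumps(bad), encoding="utf-8")


with tempfile.TemporaryDirectory() as sample_dir:
    write_sample_folder(sample_dir)
    invalid_files = find_invalid_files(sample_dir, TEXT_FIELD, ID_FIELD)
    print(invalid_files)
```

Running the same check on the real `data/` folder before calling `P.prepare()` would surface malformed files early, under the same one-object-per-file assumption.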

Make queries to retrieve documents:

from information_retrieval import SearchEngine

query = "la vida es bella"
S = SearchEngine(elastic_index_name)
S.retrieve(query) # Only semantic search

S.rerank(query) # Retrieve and rerank
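
The difference between the two calls can be illustrated with a toy sketch of the retrieve-and-rerank idea. This is not the package's implementation. The functions `embed` and `pair_score` are hypothetical stand-ins based on word counts, used in place of the trained Spanish bi-encoder and cross-encoder, and an in-memory list replaces the Elasticsearch index. The first stage scores each document from a vector computed independently of the query. The second stage rescores only the retrieved candidates by looking at the query and document together.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def embed(text):
    """Bi-encoder stand-in: encode one text on its own as a bag-of-words vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_score(query, doc):
    """Cross-encoder stand-in: score the (query, document) pair jointly."""
    q, d = tokenize(query), tokenize(doc)
    # Count query bigrams that also occur in the document (word order matters),
    # plus a small bonus for shared words.
    shared_bigrams = set(zip(q, q[1:])) & set(zip(d, d[1:]))
    return len(shared_bigrams) + 0.1 * len(set(q) & set(d))


def retrieve(query, doc_vectors, top_k):
    """Stage 1: rank all documents by cosine similarity, keep the top_k indices."""
    q_vec = embed(query)
    ranked = sorted(
        range(len(doc_vectors)),
        key=lambda i: cosine(q_vec, doc_vectors[i]),
        reverse=True,
    )
    return ranked[:top_k]


def rerank(query, docs, candidates):
    """Stage 2: reorder only the retrieved candidates with the pairwise scorer."""
    return sorted(candidates, key=lambda i: pair_score(query, docs[i]), reverse=True)


docs = [
    "bella es la vida",            # same words as the query, different order
    "la vida es bella y corta",
    "el contrato entra en vigor hoy",
]
doc_vectors = [embed(d) for d in docs]  # computed once, ahead of query time

query = "la vida es bella"
candidates = retrieve(query, doc_vectors, top_k=2)
reranked = rerank(query, docs, candidates)
print([docs[i] for i in candidates])
print([docs[i] for i in reranked])
```

In this toy setting the first stage cannot tell word order apart, while the pairwise scorer can, which is why the second stage may reorder the candidates. The design trade-off is the usual one: document vectors can be precomputed and searched cheaply, whereas pairwise scoring is more expensive and is therefore applied only to a short candidate list.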

Owner

Sergio Arnaud Gomez
