The Easy-to-use Dialogue Response Selection Toolkit for Researchers

Last update: Nov 13, 2022

Related tags

Text Data & NLP SimpleReDial-v1

Overview

An easy-to-use toolkit for retrieval-based chatbots.

Recent Activity

Our released RRS corpus can be found here.
Our released BERT-FP post-training checkpoint for the RRS corpus can be found here.
Our related work (Exploring Dense Retrieval for Dialogue Response Selection) can be found here.

How to Use

Init the repo

Before using the repo, run the following commands to initialize it:

# create the necessary folders
python init.py

# prepare the environment
# if a package cannot be installed this way, look up an alternative installation method for it
pip install -r requirements.txt

Train the model

./scripts/train.sh <dataset_name> <model_name> <cuda_ids>
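
As an illustration of how the placeholders are filled in, the sketch below builds the training command from Python. The dataset name douban and the model name dual-bert both appear elsewhere in this README; the comma-separated form of <cuda_ids> and the helper name build_train_command are assumptions, not something this README documents.

```python
import shlex

def build_train_command(dataset_name, model_name, cuda_ids):
    """Return the argv list for ./scripts/train.sh (hypothetical helper)."""
    # Assumption: several GPU ids are passed as one comma-separated argument.
    ids = ",".join(str(i) for i in cuda_ids)
    return ["./scripts/train.sh", dataset_name, model_name, ids]

cmd = build_train_command("douban", "dual-bert", [0, 1])
# Print the command as a shell-safe string; pass cmd to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if it is later passed to subprocess.run.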

Test the model [rerank]

./scripts/test_rerank.sh <dataset_name> <model_name> <cuda_id>
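
Rerank quality for response selection is commonly summarized with recall at k and mean reciprocal rank over each context's candidate list. The sketch below shows how such metrics can be computed from model scores. The function names and the toy scores are illustrative assumptions, and the metrics actually reported by test_rerank.sh may differ.

```python
def rank_candidates(scores, labels):
    """Sort (score, label) pairs from the highest score to the lowest."""
    return sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

def recall_at_k(scores, labels, k):
    """1.0 if a positive candidate (label 1) is ranked in the top k, else 0.0."""
    top = rank_candidates(scores, labels)[:k]
    return 1.0 if any(label == 1 for _, label in top) else 0.0

def reciprocal_rank(scores, labels):
    """1 / rank of the first positive candidate, or 0.0 if there is none."""
    for rank, (_, label) in enumerate(rank_candidates(scores, labels), start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

# Two toy sessions: (model scores, labels), one positive candidate each.
sessions = [
    ([0.9, 0.2, 0.1], [1, 0, 0]),  # positive ranked first
    ([0.3, 0.8, 0.5], [1, 0, 0]),  # positive ranked third
]
r_at_1 = sum(recall_at_k(s, l, 1) for s, l in sessions) / len(sessions)
mrr = sum(reciprocal_rank(s, l) for s, l in sessions) / len(sessions)
```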

Test the model [recall]

# different recall_modes are available: q-q, q-r
./scripts/test_recall.sh <dataset_name> <model_name> <cuda_id>
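
The two recall modes differ in what the query is matched against. Here q-q is read as matching the query against stored contexts and returning their paired responses, and q-r as matching the query against the responses directly; this reading is inferred from the work_mode comments in the inference section. The sketch below uses a toy word-overlap score in place of the model, and the corpus and function names are made up for illustration.

```python
def overlap(a, b):
    """Toy similarity: number of shared lowercase words (stand-in for a model score)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def recall(query, pairs, mode, topk=1):
    """pairs is a list of (context, response); return the top-k responses."""
    if mode == "q-q":
        def score(pair):
            return overlap(query, pair[0])  # compare the query with contexts
    elif mode == "q-r":
        def score(pair):
            return overlap(query, pair[1])  # compare the query with responses
    else:
        raise ValueError(f"unknown recall mode: {mode}")
    ranked = sorted(pairs, key=score, reverse=True)
    return [response for _, response in ranked[:topk]]

pairs = [
    ("how do i install faiss", "use pip or conda"),
    ("what is the weather today", "it is sunny today"),
]
query = "how to install faiss today"
qq_result = recall(query, pairs, "q-q")
qr_result = recall(query, pairs, "q-r")
```

On this toy corpus the two modes return different responses, which is why it can be worth evaluating both.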

Run inference on the responses and save them into the faiss index

Inference sometimes drops data samples; if that happens, use a single GPU (faiss-gpu search is fast on one GPU).

Note:
1. For the writer dataset, use the extract_inference.py script to generate inference.txt.
2. For the other datasets (douban, ecommerce, ubuntu), just copy train.txt to inference.txt (cp train.txt inference.txt). The dataloader automatically reads test.txt to supplement the corpus.
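
For the non-writer datasets, the copy step can also be scripted. The following is a minimal sketch; the helper name and the assumption that train.txt and inference.txt live in the same dataset directory are illustrative, so adjust the path to the folders created by init.py.

```python
import shutil
from pathlib import Path

def prepare_inference_file(dataset_dir):
    """Copy train.txt to inference.txt inside dataset_dir and return the new path."""
    dataset_dir = Path(dataset_dir)
    src = dataset_dir / "train.txt"
    dst = dataset_dir / "inference.txt"
    if not src.is_file():
        # Fail early instead of silently producing an empty inference set.
        raise FileNotFoundError(f"missing {src}")
    shutil.copyfile(src, dst)
    return dst
```

A call would look like prepare_inference_file("data/douban"), where the data/douban path is an assumption about the directory layout.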

# work_mode=response: run inference on the responses and save them into faiss (for q-r matching) [dual-bert/dual-bert-fusion]
# work_mode=context: run inference on the contexts for q-q matching
# work_mode=gray: run inference on the contexts, read the faiss index (work_mode=response must already have been run), and search for the top-k hard negative samples; remember to set the BERTDualInferenceContextDataloader in config/base.yaml
./scripts/inference.sh <dataset_name> <model_name> <cuda_ids>

To generate the gray (hard negative) dataset for a dataset:

# 1. set the mode to **response** to generate the response faiss index; corresponding dataset name: BERTDualInferenceDataset
./scripts/inference.sh <dataset_name> response <cuda_ids>

# 2. set the mode to **gray** to run inference on the contexts in train.txt and search for the top-k candidates as the gray (hard negative) samples; corresponding dataset name: BERTDualInferenceContextDataset
./scripts/inference.sh <dataset_name> gray <cuda_ids>

# 3. set the mode to **gray-one2many** to generate extra positive samples for each context in the train set; the requirements of this mode are the same as for the **gray** work mode
./scripts/inference.sh <dataset_name> gray-one2many <cuda_ids>
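
Conceptually, the response and gray steps amount to: embed every response, index the embeddings, embed each training context, search the index for the top-k nearest responses, and keep those that are not the ground-truth response as hard negatives. The sketch below illustrates this with a brute-force inner-product search over tiny hand-written vectors instead of faiss and a trained dual-bert encoder. All names and vectors are assumptions for illustration, not the toolkit's implementation.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mine_hard_negatives(context_vec, gold_response, response_index, topk=2):
    """Return up to topk responses closest to the context, excluding the gold one.

    response_index maps response text -> embedding (a stand-in for a faiss index).
    """
    ranked = sorted(
        response_index,
        key=lambda r: inner_product(context_vec, response_index[r]),
        reverse=True,
    )
    return [r for r in ranked if r != gold_response][:topk]

# Hand-written 2-d "embeddings" for four candidate responses.
response_index = {
    "use pip or conda": [0.9, 0.1],
    "try conda install": [0.8, 0.2],
    "it is sunny today": [0.1, 0.9],
    "reboot the machine": [0.5, 0.5],
}
context_vec = [1.0, 0.0]  # embedding of one training context
negatives = mine_hard_negatives(context_vec, "use pip or conda", response_index)
```

The responses that score highest without being the gold response are the ones a model is most likely to confuse with it, which is what makes them useful as gray samples.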

To generate the pseudo positive pairs, run the following command:

# make sure the dual-bert inference dataset name is BERTDualInferenceDataset
./scripts/inference.sh <dataset_name> unparallel <cuda_ids>

Deploy the rerank and recall model

# load the model on cuda:0 (can be changed in the deploy.sh script)
./scripts/deploy.sh <cuda_id>

While the model is deployed, you can test it with:

# test_mode: recall, rerank, pipeline
./scripts/test_api.sh <test_mode> <dataset>

Test the recall performance of Elasticsearch

Before testing Elasticsearch (es) recall, make sure the es index has been built:

# recall_mode: q-q/q-r
./scripts/build_es_index.sh <dataset_name> <recall_mode>

# recall_mode: q-q/q-r
./scripts/test_es_recall.sh <dataset_name> <recall_mode> 0

Generate the gray responses with SimCSE

# train the simcse model
./scripts/train.sh <dataset_name> simcse <cuda_ids>

# generate the faiss index, dataset name: BERTSimCSEInferenceDataset
./scripts/inference_response.sh <dataset_name> simcse <cuda_ids>

# generate the context index
./scripts/inference_simcse_response.sh <dataset_name> simcse <cuda_ids>
# generate the test set for unlikelyhood-gen dataset
./scripts/inference_simcse_unlikelyhood_response.sh <dataset_name> simcse <cuda_ids>

# generate the gray response
./scripts/inference_gray_simcse.sh <dataset_name> simcse <cuda_ids>
# generate the test set for unlikelyhood-gen dataset
./scripts/inference_gray_simcse_unlikelyhood.sh <dataset_name> simcse <cuda_ids>

Owner

GMFTBY
