LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Last update: Jan 12, 2022

Overview

We propose a cross-encoder model (LTR_CrossEncoder) for information retrieval. It re-ranks the candidate articles returned by Elasticsearch to identify the ones that are relevant to a query.

The model achieved an F2 score of 0.747 on the public test set (Legal Text Retrieval, Zalo AI Challenge 2021).
With Elasticsearch alone, our F2 score is 0.54.
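
For readers unfamiliar with the metric, the sketch below computes F2 under the standard F-beta definition with beta = 2, which favours recall over precision. Averaging the per-question scores is an assumption about how the challenge aggregates results, and the function names are illustrative rather than taken from this repository.

```python
def f_beta(predicted, relevant, beta=2.0):
    """F-beta for a single question, given predicted and gold article ids."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f2(predictions, gold):
    """Average per-question F2 over every question id in `gold` (assumed aggregation)."""
    scores = [f_beta(predictions.get(qid, []), rel) for qid, rel in gold.items()]
    return sum(scores) / len(scores)
```

With this definition, returning one extra wrong article costs less than missing a relevant one, which is why a recall-oriented candidate stage followed by a classifier is a natural fit.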

Algorithm design

Our algorithm includes two key components:

Elasticsearch
Cross Encoder Model

Elasticsearch

Elasticsearch is used to filter the top-k most relevant articles by BM25 score.
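
To make the BM25 filtering step concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring and top-k selection. It is not the repository's code and does not call Elasticsearch. The values k1 = 1.2 and b = 0.75 are Elasticsearch's default BM25 parameters, while the index settings, analyzer, and query actually used by the authors are not described in this README, so everything else here (function names, whitespace tokenization, toy documents) is an assumption.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query_tokens, docs_tokens, k):
    """Return the indices of the k highest-scoring documents, best first."""
    scores = bm25_scores(query_tokens, docs_tokens)
    ranked = sorted(range(len(docs_tokens)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy corpus (made up for illustration), tokenized by whitespace.
docs = [
    "road traffic law".split(),
    "personal income tax".split(),
    "road traffic safety on highways".split(),
]
candidates = top_k("road traffic".split(), docs, k=2)
```

In the pipeline described here, the indices in `candidates` would correspond to the articles handed to the cross-encoder.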

Cross Encoder Model

Our model takes a query, the article text (passage), and the article title as inputs and outputs a relevance score for that query and article. The higher the score, the more relevant the article. We use the pretrained vinai/phobert-base model, with CrossEntropyLoss or BCELoss as the loss function.
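
The README names two loss options without saying how the output head is shaped. As a hedged illustration, the sketch below assumes a two-logit head (non-relevant, relevant) for cross-entropy and a single logit for binary cross-entropy, and shows that the two losses coincide when the single logit equals the difference of the two. It uses plain Python rather than PyTorch; note that PyTorch's `BCELoss` expects probabilities (after a sigmoid), whereas `BCEWithLogitsLoss` takes raw logits. All function names are illustrative.

```python
import math


def cross_entropy_two_class(logits, label):
    """Softmax cross-entropy for logits [non_relevant, relevant] and a 0/1 label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def bce_from_logit(logit, label):
    """Binary cross-entropy on sigmoid(logit), in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def relevance_score(logits):
    """Probability of the 'relevant' class, equal to sigmoid(z1 - z0)."""
    return 1.0 / (1.0 + math.exp(logits[0] - logits[1]))


logits = [0.3, 1.7]  # hypothetical model output for one (query, article) pair
ce = cross_entropy_two_class(logits, label=1)
bce = bce_from_logit(logits[1] - logits[0], label=1)
score = relevance_score(logits)
```

Under these assumptions `ce` and `bce` agree up to floating-point error, so the choice between the two mostly affects the shape of the classification head rather than what the model learns.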

Train dataset

Non-relevant samples in the dataset are taken from the top-10 results of Elasticsearch. The training data (train_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ],
        "non_relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
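
A record in this format can be flattened into labeled (question, title, text) samples for a binary classifier. The sketch below is one way to do that and is not taken from the repository's training script; the function name and the sample values are made up for illustration.

```python
def build_training_pairs(records):
    """Turn each question record into (question, title, text, label) tuples."""
    pairs = []
    for record in records:
        question = record["question"]
        # Label 1 for relevant articles, 0 for the non-relevant ones.
        for label, key in ((1, "relevant_articles"), (0, "non_relevant_articles")):
            for article in record.get(key, []):
                pairs.append((question, article["title"], article["text"], label))
    return pairs


# Hypothetical record that follows the format above.
sample_records = [
    {
        "question_id": "q1",
        "question": "example question",
        "relevant_articles": [
            {"law_id": "law-a", "article_id": "1", "title": "t1", "text": "x1"}
        ],
        "non_relevant_articles": [
            {"law_id": "law-a", "article_id": "2", "title": "t2", "text": "x2"},
            {"law_id": "law-b", "article_id": "7", "title": "t3", "text": "x3"},
        ],
    }
]
pairs = build_training_pairs(sample_records)
```

For the real file, `records` would come from `json.load` on train_data_model.json opened with UTF-8 encoding.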

Test dataset

First, we use Elasticsearch to obtain k candidate articles (the top-50 Elasticsearch results); LTR_CrossEncoder then classifies which of them are actually relevant. The test data (test_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
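
The second stage can be sketched as scoring every candidate and keeping those above a cut-off. In the sketch below, `score_fn` stands in for the trained cross-encoder; the 0.5 threshold, the fallback to the single best candidate, and the toy word-overlap scorer are all assumptions for illustration and are not confirmed by this README.

```python
def rerank(record, score_fn, threshold=0.5, fallback_top=1):
    """Keep candidates whose score passes the threshold, best first."""
    question = record["question"]
    scored = [
        (score_fn(question, a["title"], a["text"]), a["law_id"], a["article_id"])
        for a in record["articles"]
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = [item for item in scored if item[0] >= threshold]
    if not kept:
        # Assumed fallback: always return at least the best-scoring candidate.
        kept = scored[:fallback_top]
    return [{"law_id": law, "article_id": art} for _, law, art in kept]


def toy_score(question, title, text):
    """Stand-in scorer: fraction of question words found in the article."""
    q_words = set(question.lower().split())
    d_words = set((title + " " + text).lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Hypothetical test record that follows the format above.
record = {
    "question_id": "q1",
    "question": "speed limit in town",
    "articles": [
        {"law_id": "L1", "article_id": "1", "title": "Speed limit",
         "text": "the speed limit in town is 50"},
        {"law_id": "L2", "article_id": "9", "title": "Income tax",
         "text": "tax rates in 2021"},
    ],
}
predicted = rerank(record, toy_score)
```

Replacing `toy_score` with a function that runs the fine-tuned model on the (question, title, text) triple would give the behaviour described above.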

Training

Run the following bash script to train the model:

bash run_phobert.sh

Inference

We also provide model checkpoints. Download them if you want to run inference on a new text file without training the models from scratch. Create a new checkpoint folder, unzip the model file, and place it in that folder: https://drive.google.com/file/d/1oT8nlDIAatx3XONN1n5eOgYTT6Lx_h_C/view?usp=sharing

Run the following bash script to run inference on the test dataset:

bash run_predict.sh

Owner

Xuan Hieu Duong

