LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Last update: Jan 12, 2022

Overview

We propose a cross-encoder model (LTR_CrossEncoder) for information retrieval. It re-retrieves the relevant text from the candidate results returned by Elasticsearch.

The model achieved an F2 score of 0.747 on the public test set (Legal Text Retrieval, Zalo AI Challenge 2021).
Using Elasticsearch alone, our F2 score is 0.54.
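
For reference, the F2 score is the F-beta measure with beta = 2, which weights recall more heavily than precision. A minimal Python sketch for a single question follows. The function name `f2_score` and the set-based representation are illustrative assumptions, and how the challenge aggregates the score across questions is not stated here.

```python
def f2_score(predicted, relevant):
    """F-beta score with beta = 2 for one question.

    `predicted` and `relevant` are collections of article identifiers.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 2
    return 5 * precision * recall / (4 * precision + recall)
```

Because recall dominates, returning one extra wrong article costs less than missing a correct one.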

Algorithm design

Our algorithm includes two key components:

Elasticsearch
Cross Encoder Model

Elasticsearch

Elasticsearch is used for filtering top-k most relevant articles based on BM25 score.
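
To illustrate the kind of ranking this stage performs, here is a minimal pure-Python sketch of BM25 scoring followed by top-k selection. It is not the Elasticsearch implementation: the whitespace tokenizer, the function name `bm25_top_k`, and the toy documents in the usage note are assumptions for illustration. The values k1 = 1.2 and b = 0.75 are Elasticsearch's default BM25 parameters.

```python
import math
from collections import Counter

def bm25_top_k(query, docs, k=2, k1=1.2, b=0.75):
    """Return the indices of the k documents with the highest BM25 score."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)

    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

For example, `bm25_top_k("land transfer", ["land use rights transfer", "traffic fine for speeding", "transfer of land ownership"])` keeps the two land-related documents and drops the traffic one.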

Cross Encoder Model

Our model takes a query, an article's text (passage), and the article's title as inputs, and outputs a relevance score for that query and article. The higher the score, the more relevant the article. We use the pretrained vinai/phobert-base model, with CrossEntropyLoss or BCELoss as the loss function.
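
As an illustration of how a relevance score and a binary loss fit together, the sketch below maps a raw model output (logit) to a score in (0, 1) and computes binary cross-entropy against a 0/1 relevance label. It is plain Python rather than the repository's training code, and both function names are hypothetical.

```python
import math

def relevance_score(logit):
    """Sigmoid: map a raw logit to a score in (0, 1); higher means more relevant."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)

def binary_cross_entropy(logit, label):
    """Numerically stable BCE for one (query, article) pair; label is 0 or 1."""
    # Equivalent to -[y * log(s) + (1 - y) * log(1 - s)] with s = sigmoid(logit).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
```

A confident, correct prediction (large positive logit with label 1) gives a loss near zero, while the same logit with label 0 is penalized heavily.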

Train dataset

Non-relevant samples in the dataset are drawn from the top-10 Elasticsearch results. The training data (train_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ],
        "non_relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
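
One way to turn this file into labeled training pairs is sketched below. The function names `build_training_pairs` and `load_training_pairs` and the 1/0 labels are illustrative assumptions, not the repository's actual preprocessing.

```python
import json

def build_training_pairs(records):
    """Flatten records into (question, title, text, label) tuples."""
    pairs = []
    for record in records:
        question = record["question"]
        # Label 1: articles annotated as relevant to the question.
        for article in record.get("relevant_articles", []):
            pairs.append((question, article["title"], article["text"], 1))
        # Label 0: negatives taken from the Elasticsearch results.
        for article in record.get("non_relevant_articles", []):
            pairs.append((question, article["title"], article["text"], 0))
    return pairs

def load_training_pairs(path="train_data_model.json"):
    """Read the training file and return the flattened pairs."""
    with open(path, encoding="utf-8") as f:
        return build_training_pairs(json.load(f))
```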

Test dataset

First, we use Elasticsearch to obtain k candidate articles (the top-50 Elasticsearch results, so k = 50). LTR_CrossEncoder then classifies which of these candidates are actually relevant. The test data (test_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
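
The final classification step can be sketched as thresholding the cross-encoder scores of the candidates. The 0.5 threshold, the fallback to the single best candidate, and the names below are assumptions for illustration; the decision rule actually used is not specified here.

```python
def select_relevant(scored_articles, threshold=0.5):
    """Pick the candidates predicted relevant for one question.

    `scored_articles` is a list of (law_id, article_id, score) tuples,
    where `score` is the cross-encoder relevance score.
    """
    if not scored_articles:
        return []
    ranked = sorted(scored_articles, key=lambda item: item[2], reverse=True)
    selected = [(law_id, article_id)
                for law_id, article_id, score in ranked
                if score >= threshold]
    # Assumed fallback: always return at least the best-scoring candidate.
    return selected or [ranked[0][:2]]
```

Since F2 favors recall, a lower threshold may be worth trying when tuning on held-out data.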

Training

Run the following bash script to train the model:

bash run_phobert.sh

Inference

We also provide model checkpoints. Download them if you want to run inference on a new text file without training the model from scratch. Create a new checkpoint folder, unzip the model file, and place it in that folder. https://drive.google.com/file/d/1oT8nlDIAatx3XONN1n5eOgYTT6Lx_h_C/view?usp=sharing

Run the following bash script to run inference on the test dataset:

bash run_predict.sh

Owner

Hieu Duong
