Legal text retrieval for python

Last update: Dec 06, 2022

Related tags

Text Data & NLP legal_text_retrieval

Overview

legal-text-retrieval

Overview

This system contains 2 steps:

generate training data containing negative sample found by mixture score of cosine(tfidf) + bm25 (using top 150 law articles most similarity)
fine-tune PhoBERT model (+NlpHUST model - optional) on generated data

Environments

git clone https://github.com/vncorenlp/VnCoreNLP.git vncorenlp_data # for vncorebnlp tokenize lib

conda create -n legal_retrieval_env python=3.8
conda activate legal_retrieval_env
pip install -r requirements.txt

Run

Generate data from folder data/zac2021-ltr-data/ containing public_test_question.json and train_question_answer.json
```
python3 src/data_generator.py --path_folder_base data/zac2021-ltr-data/ --test_file public_test_question.json --topk 150  --tok --path_output_dir data/zalo-tfidfbm25150-full
```
Note:
- --test_file public_test_question.json is optional, if this parameter is not used, test set will be random 33% in file train_question_answer.json
- --path_output_dir is the folder save 3 output file (train.csv, dev.csv, test.csv) and tfidf classifier (tfidf_classifier.pkl) for top k best relevant documents.

Train model

bash scripts/run_finetune_bert.sh "magic"  vinai/phobert-base  ../  data/zalo-tfidfbm25150-full Tfbm150E5-full 5

Predict
```
python3 src/infer.py 
```
Note: This script will load model and run prediction, pls check the variable model_configs in file src/infer.py to modify.

License

MIT-licensed.

Citation

Please cite as:

@article{DBLP:journals/corr/abs-2106-13405,
  author    = {Ha{-}Thanh Nguyen and
               Phuong Minh Nguyen and
               Thi{-}Hai{-}Yen Vuong and
               Quan Minh Bui and
               Chau Minh Nguyen and
               Tran Binh Dang and
               Vu Tran and
               Minh Le Nguyen and
               Ken Satoh},
  title     = {{JNLP} Team: Deep Learning Approaches for Legal Processing Tasks in
               {COLIEE} 2021},
  journal   = {CoRR},
  volume    = {abs/2106.13405},
  year      = {2021},
  url       = {https://arxiv.org/abs/2106.13405},
  eprinttype = {arXiv},
  eprint    = {2106.13405},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2106-13405.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{DBLP:journals/corr/abs-2011-08071,
  author    = {Ha{-}Thanh Nguyen and
               Hai{-}Yen Thi Vuong and
               Phuong Minh Nguyen and
               Tran Binh Dang and
               Quan Minh Bui and
               Vu Trong Sinh and
               Chau Minh Nguyen and
               Vu D. Tran and
               Ken Satoh and
               Minh Le Nguyen},
  title     = {{JNLP} Team: Deep Learning for Legal Processing in {COLIEE} 2020},
  journal   = {CoRR},
  volume    = {abs/2011.08071},
  year      = {2020},
  url       = {https://arxiv.org/abs/2011.08071},
  eprinttype = {arXiv},
  eprint    = {2011.08071},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2011-08071.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Legal text retrieval for python

Related tags

Overview

legal-text-retrieval

Overview

Environments

Run

License

Citation

Owner

Nguyễn Minh Phương

Use Tensorflow2.7.0 Build OpenAI'GPT-2

A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Code for lyric-section-to-comment generation based on huggingface transformers.

TruthfulQA: Measuring How Models Imitate Human Falsehoods

DeLighT: Very Deep and Light-Weight Transformers

Pytorch version of BERT-whitening

Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

multi-label，classifier，text classification，多标签文本分类，文本分类，BERT，ALBERT，multi-label-classification，seq2seq，attention，beam search

Simple GUI where you can enter an article and get a crisp summarized version.

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

A BERT-based reverse dictionary of Korean proverbs

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

Rich Prosody Diversity Modelling with Phone-level Mixture Density Network

Задания КЕГЭ по информатике 2021 на Python

SASE : Self-Adaptive noise distribution network for Speech Enhancement with heterogeneous data of Cross-Silo Federated learning

The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

ADCS cert template modification and ACL enumeration