TFIDF-based QA system for AIO2 competition

Last update: Feb 19, 2022

Overview

AIO2 TF-IDF Baseline

This is a very simple question answering system, developed as a lightweight baseline for the AIO2 competition.

In the training stage, the model builds a sparse matrix of TF-IDF features from the questions in the training dataset. In the inference stage, it computes dot-product scores between the TF-IDF features of the input question and those of every training question, then returns the answer of the most similar training question.

Therefore, in principle, the model cannot predict answers unseen in the training data.
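
The retrieval idea described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the code in train.py or predict.py: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, the toy English questions, and the helper names (`fit_tfidf`, `vectorize`, `predict`) are all hypothetical stand-ins.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization stands in for the morphological
    # analysis that Japanese questions would actually need.
    return text.lower().split()


def vectorize(counts, idf):
    # Terms that were never seen during training are dropped.
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit_tfidf(questions):
    """Return IDF weights and one sparse, L2-normalized vector per question."""
    docs = [Counter(tokenize(q)) for q in questions]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption), so common terms keep a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, [vectorize(doc, idf) for doc in docs]


def dot(a, b):
    # Sparse dot product: iterate over the shorter vector.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def predict(question, idf, vectors, answers):
    # Score the query against every training question and return the
    # answer attached to the best-scoring one.
    query = vectorize(Counter(tokenize(question)), idf)
    scores = [dot(query, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best], scores[best]


train_questions = [
    "what is the capital of japan",
    "what is the highest mountain in japan",
    "who wrote the tale of genji",
]
train_answers = ["Tokyo", "Mount Fuji", "Murasaki Shikibu"]

idf, vectors = fit_tfidf(train_questions)
answer, score = predict("which mountain is the highest in japan", idf, vectors, train_answers)
unseen_answer, _ = predict("who discovered penicillin", idf, vectors, train_answers)
```

Here `answer` is "Mount Fuji", because the second training question shares the most weighted terms with the query. `unseen_answer` is "Murasaki Shikibu": the query overlaps with the training data only on "who", so the sketch still returns an answer from the training set, which illustrates the limitation stated above.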

Steps to experiment with the model

Install requirements

$ pip install -r requirements.txt

Train

$ python train.py \
--train_file <data dir>/aio_02_train.jsonl \
--output_dir model \
--pos_list 名詞 \
--stop_words でしょ う \
--max_features 10000
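
The README does not say how train.py interprets these options. One plausible reading, sketched below purely as an assumption, is that `--pos_list` keeps only tokens with the listed parts of speech (名詞 means noun), `--stop_words` drops the listed surface forms, and `--max_features` caps the vocabulary at the most frequent remaining terms. The helper `build_vocabulary` and the hand-tagged tokens are hypothetical; a real pipeline would obtain the tags from a morphological analyzer.

```python
from collections import Counter


def build_vocabulary(tagged_questions, pos_list, stop_words, max_features):
    """Count tokens that pass both filters and keep the most frequent ones."""
    counts = Counter()
    for tokens in tagged_questions:
        for surface, pos in tokens:
            if pos in pos_list and surface not in stop_words:
                counts[surface] += 1
    return [term for term, _ in counts.most_common(max_features)]


# Hand-tagged (surface, part-of-speech) pairs, written out for illustration only.
tagged = [
    [("日本", "名詞"), ("の", "助詞"), ("首都", "名詞"), ("は", "助詞"),
     ("どこ", "名詞"), ("でしょ", "助動詞"), ("う", "助動詞")],
    [("日本", "名詞"), ("で", "助詞"), ("一番", "名詞"), ("高い", "形容詞"),
     ("山", "名詞"), ("は", "助詞"), ("何", "名詞"), ("でしょ", "助動詞"), ("う", "助動詞")],
]

vocab = build_vocabulary(tagged, pos_list={"名詞"}, stop_words={"でしょ", "う"}, max_features=3)
```

With these toy inputs, 日本 is the only term that occurs in both questions, so it comes first in `vocab`; particles, auxiliaries, and adjectives are filtered out before the cap is applied.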

Predict

$ python predict.py \
--model_dir model \
--test_file <data dir>/aio_02_dev_unlabeled_v1.0.jsonl \
--prediction_file <output dir>/predictions.jsonl

Building Docker image

$ docker build -t aio2-tfidf-baseline .

Test locally:

$ docker run --rm -v ":/app/input" -v ":/app/output" aio2-tfidf-baseline bash ./submission.sh input/aio_02_dev_unlabeled_v1.0.jsonl output/predictions.jsonl

Save the Docker image to a file:

$ docker save aio2-tfidf-baseline | gzip > aio2-tfidf-baseline.tar.gz

License

The code in this repository is open-sourced under the MIT License.

Owner

Masatoshi Suzuki
