Automated question generation and question answering from Turkish texts using text-to-text transformers

Overview
citation

If you use this software in your work, please cite as:

@article{akyon2021automated,
  title={Automated question generation and question answering from Turkish texts using text-to-text transformers},
  author={Akyon, Fatih Cagatay and Cavusoglu, Devrim and Cengiz, Cemil and Altinuc, Sinan Onur and Temizel, Alptekin},
  journal={arXiv preprint arXiv:2111.06476},
  year={2021}
}
install
git clone https://github.com/obss/turkish-question-generation.git
cd turkish-question-generation
pip install -r requirements.txt
train
  • start a training using args:
python run.py --model_name_or_path google/mt5-small  --output_dir runs/exp1 --do_train --do_eval --tokenizer_name_or_path mt5_qg_tokenizer --per_device_train_batch_size 4 --gradient_accumulation_steps 2 --learning_rate 1e-4 --seed 42 --save_total_limit 1
python run.py config.json
python run.py config.yaml
evaluate
  • arrange related params in config:
do_train: false
do_eval: true
eval_dataset_list: ["tquad2-valid", "xquad.tr"]
prepare_data: true
mt5_task_list: ["qa", "qg", "ans_ext"]
mt5_qg_format: "both"
no_cuda: false
  • start an evaluation:
python run.py config.yaml
neptune
  • install neptune:
pip install neptune-client
  • download config file and arrange neptune params:
run_name: 'exp1'
neptune_project: 'name/project'
neptune_api_token: 'YOUR_API_TOKEN'
  • start a training:
python train.py config.yaml
wandb
  • install wandb:
pip install wandb
  • download config file and arrange wandb params:
run_name: 'exp1'
wandb_project: 'turque'
  • start a training:
python train.py config.yaml
finetuned checkpoints
Name Model data
train
params
(M)
model size
(GB)
turque-s1 mt5-small tquad2-train+tquad2-valid+xquad.tr 60M 1.2GB
mt5-small-3task-both-tquad2 mt5-small tquad2-train 60M 1.2GB
mt5-small-3task-prepend-tquad2 mt5-small tquad2-train 60M 1.2GB
mt5-base-3task-both-tquad2 mt5-base tquad2-train 220M 2.3GB
format
  • answer extraction:

input:

Osman Bey 1258 yılında Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi." ">
"
      
        Osman Bey 1258 yılında Söğüt’te doğdu. 
       
         Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

       
      

target:


    
      1258 
     
       Söğüt’te 
      

      
     
    
  • question answering:

input:

"question: Osman Bey nerede doğmuştur? context: Osman Bey 1258 yılında Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"Söğüt’te"
  • question generation (prepend):

input:

"answer: Söğüt’te context: Osman Bey 1258 yılında Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"Osman Bey nerede doğmuştur?"
  • question generation (highlight):

input:

Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi." ">
"generate question: Osman Bey 1258 yılında 
     
       Söğüt’te 
      
        doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

      
     

target:

"Osman Bey nerede doğmuştur?"
  • question generation (both):

input:

Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi." ">
"answer: Söğüt’te context: Osman Bey 1258 yılında 
     
       Söğüt’te 
      
        doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

      
     

target:

"Osman Bey nerede doğmuştur?"
paper results
BERTurk-base and mT5-base QA evaluation results for TQuADv2 fine-tuning.

mT5-base QG evaluation results for single-task (ST) and multi-task (MT) for TQuADv2 fine-tuning.

TQuADv1 and TQuADv2 fine-tuning QG evaluation results for multi-task mT5 variants. MT-Both means, mT5 model is fine-tuned with ’Both’ input format and in a multi-task setting.

paper configs

You can find the config files used in the paper under configs/paper.

contributing

Before opening a PR:

  • Install required development packages:
pip install "black==21.7b0" "flake8==3.9.2" "isort==5.9.2"
  • Reformat with black and isort:
black . --config pyproject.toml
isort .
You might also like...
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT

NeuralQA: A Usable Library for (Extractive) Question Answering on Large Datasets with BERT Still in alpha, lots of changes anticipated. View demo on n

Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统
Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

Baseline code for Korean open domain question answering(ODQA)
Baseline code for Korean open domain question answering(ODQA)

Open-Domain Question Answering(ODQA)는 다양한 주제에 대한 문서 집합으로부터 자연어 질의에 대한 답변을 찾아오는 task입니다. 이때 사용자 질의에 답변하기 위해 주어지는 지문이 따로 존재하지 않습니다. 따라서 사전에 구축되어있는 Knowl

Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2018) dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.

CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

chaii - hindi & tamil question answering

chaii - hindi & tamil question answering This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The comp

Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

BERT-based Financial Question Answering System
BERT-based Financial Question Answering System

BERT-based Financial Question Answering System In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-b

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Owner
Open Business Software Solutions
Open Source for Open Business
Open Business Software Solutions
تولید اسم های رندوم فینگیلیش

karafs کرفس تولید اسم های رندوم فینگیلیش installation ➜ pip install karafs usage دو زبانه ➜ karafs -n 10 توت فرنگی بی ناموس toot farangi-ye bi_namoos

Vaheed NÆINI (9E) 36 Nov 24, 2022
A cross platform OCR Library based on PaddleOCR & OnnxRuntime

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

RapidOCR Team 767 Jan 09, 2023
A sentence aligner for comparable corpora

About Yalign is a tool for extracting parallel sentences from comparable corpora. Statistical Machine Translation relies on parallel corpora (eg.. eur

Machinalis 128 Aug 24, 2022
jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

jel: Japanese Entity Linker jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese. Usage Currently, link and question methods

izuna385 10 Jan 06, 2023
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022
基于pytorch+bert的中文事件抽取

pytorch_bert_event_extraction 基于pytorch+bert的中文事件抽取,主要思想是QA(问答)。 要预先下载好chinese-roberta-wwm-ext模型,并在运行时指定模型的位置。

西西嘛呦 31 Nov 30, 2022
A repository to run gpt-j-6b on low vram machines (4.2 gb minimum vram for 2000 token context, 3.5 gb for 1000 token context). Model loading takes 12gb free ram.

Basic-UI-for-GPT-J-6B-with-low-vram A repository to run GPT-J-6B on low vram systems by using both ram, vram and pinned memory. There seem to be some

90 Dec 25, 2022
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

Pretrain and Fine-tune a T5 model with Flax on GCP This tutorial details how pretrain and fine-tune a FlaxT5 model from HuggingFace using a TPU VM ava

Gabriele Sarti 41 Nov 18, 2022
lightweight, fast and robust columnar dataframe for data analytics with online update

streamdf Streamdf is a lightweight data frame library built on top of the dictionary of numpy array, developed for Kaggle's time-series code competiti

23 May 19, 2022
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

Megagon Labs 160 Dec 23, 2022
Official PyTorch implementation of Time-aware Large Kernel (TaLK) Convolutions (ICML 2020)

Time-aware Large Kernel (TaLK) Convolutions (Lioutas et al., 2020) This repository contains the source code, pre-trained models, as well as instructio

Vasileios Lioutas 28 Dec 07, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
Binary LSTM model for text classification

Text Classification The purpose of this repository is to create a neural network model of NLP with deep learning for binary classification of texts re

Nikita Elenberger 1 Mar 11, 2022
Meta learning algorithms to train cross-lingual NLI (multi-task) models

Meta learning algorithms to train cross-lingual NLI (multi-task) models

M.Hassan Mojab 4 Nov 20, 2022
OceanScript is an Esoteric language used to encode and decode text into a formulation of characters

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters - where the final result looks like waves in the ocean.

🧪 Cutting-edge experimental spaCy components and features

spacy-experimental: Cutting-edge experimental spaCy components and features This package includes experimental components and features for spaCy v3.x,

Explosion 65 Dec 30, 2022
Backend for the Autocomplete platform. An AI assisted coding platform.

Introduction A custom predictor allows you to deploy your own prediction implementation, useful when the existing serving implementations don't fit yo

Tatenda Christopher Chinyamakobvu 1 Jan 31, 2022
Codename generator using WordNet parts of speech database

codenames Codename generator using WordNet parts of speech database References: https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/ ht

possiblywrong 27 Oct 30, 2022
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)

ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python) 日本語は以下に続きます (Japanese follows) English: This book is written in Japanese and primaril

Ryuichi Yamamoto 189 Dec 29, 2022
Text editor on python to convert english text to malayalam(Romanization/Transiteration).

Manglish Text Editor This is a simple transiteration (romanization ) program which is used to convert manglish to malayalam (converts njaan to ഞാൻ ).

Merin Rose Tom 1 May 11, 2022