Code related to "Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity" paper

Overview

DataTuner

You have just found the DataTuner. This repository provides tools for fine-tuning language models for a task.

Installation

Environment Creation

Assuming you have an existing conda setup, you can set up the environment with the following script. To activate the conda environment within the bash script, you need to pass the location of the conda.sh file:

bash setup.sh  ~/miniconda3/etc/profile.d/conda.sh

You can update your existing environment:

conda env update -f=environment.yml

To start development, activate your environment:

conda activate finetune

Alternatively, you can always use the python binary with the absolute path, e.g.: ~/miniconda3/envs/finetune/bin/python.

Data

For any task you want to fine-tune on, the data must be a JSON file containing a list of JSON objects, one per data point. For example:

[
  {
    "question": "question text 1",
    "query": "query 1"
  },
  {
    "question": "question text 2",
    "query": "query 2 with [SpecialToken example]"
  }
]

The library assumes that you have placed your data in a single directory with three files: train.json, validation.json, and test.json.
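
As a minimal sketch of preparing such a directory (not part of the library), the following Python writes the three files and checks that every data point carries the expected keys. The helper names write_splits and check_splits, the temporary my_dataset directory, and the way the examples are split are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# File names the library expects inside the dataset directory.
SPLITS = ("train", "validation", "test")


def write_splits(data_dir, splits):
    """Write one JSON file per split; each file holds a list of JSON objects."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name in SPLITS:
        with open(data_dir / f"{name}.json", "w", encoding="utf-8") as f:
            json.dump(splits[name], f, indent=2, ensure_ascii=False)


def check_splits(data_dir, required_keys):
    """Return the number of data points per split, raising on malformed data."""
    counts = {}
    for name in SPLITS:
        with open(Path(data_dir) / f"{name}.json", encoding="utf-8") as f:
            items = json.load(f)
        if not isinstance(items, list):
            raise ValueError(f"{name}.json must contain a JSON list")
        for i, item in enumerate(items):
            missing = [k for k in required_keys if k not in item]
            if missing:
                raise ValueError(f"{name}.json item {i} is missing {missing}")
        counts[name] = len(items)
    return counts


examples = [
    {"question": "question text 1", "query": "query 1"},
    {"question": "question text 2", "query": "query 2 with [SpecialToken example]"},
]

# Hypothetical dataset location; a temporary directory keeps the sketch side-effect free.
demo_dir = Path(tempfile.mkdtemp()) / "my_dataset"
write_splits(demo_dir, {"train": examples, "validation": examples[:1], "test": examples[1:]})
counts = check_splits(demo_dir, required_keys=("question", "query"))
print(counts)
```

Running a check like this before training catches a missing key or a non-list file early, rather than partway through a run.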

Configuration

Now that we have the data in shape, we need to create a new task configuration file that specifies how we want the data to be formatted and what fields should be considered. You can create new config files in the folder src/datatuner/lm/task_configs.

A typical config file would look as follows:

{
"name": "dataset_name",
"data_shape": [
        {
            "id": "<question>",
            "type": "special",
            "learn": false
        },
        {
            "id": "question",
            "type": "text",
            "learn": false
        },
        {
            "id": "<query>",
            "type": "special",
            "learn": false
        },
        {
            "id": "query",
            "type": "text",
            "learn": true,
            "metrics": [
                "match"
            ]
        }
    ],
"extra_special_tokens": ["[SpecialToken"],
"extra_fields": []
}

For each item in the data shape:

  • type (required): special for a special token, text for normal text.
  • id (required): the special token ID if type is special; the key for the text in the JSON data if type is text.
  • learn (required): whether to allow the model to learn this part of the text. If false, the model masks that part during fine-tuning.
  • metrics (optional): the list of metrics that the model should compute upon evaluation. Each metric should have a corresponding function with the same name in metrics.py.
  • converter (optional): the name of the converter function in converters.py to apply on that text field after reading the text from the file.
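
To make the role of these fields concrete, here is a rough Python sketch of how a data shape could be applied to one data point. It is an illustration under assumptions, not the library's code: the linearize function is hypothetical, and joining the segments with spaces is a guess at the final layout (the actual tokenization and masking happen inside the library).

```python
def linearize(item, data_shape):
    """Return (text, learn) segments in the order given by data_shape."""
    segments = []
    for part in data_shape:
        # The three keys documented as required.
        for key in ("id", "type", "learn"):
            if key not in part:
                raise ValueError(f"data_shape entry is missing required key {key!r}")
        if part["type"] == "special":
            text = part["id"]        # the special token itself
        elif part["type"] == "text":
            text = item[part["id"]]  # look the field up in the data point
        else:
            raise ValueError(f"unknown type {part['type']!r}")
        segments.append((text, part["learn"]))
    return segments


data_shape = [
    {"id": "<question>", "type": "special", "learn": False},
    {"id": "question", "type": "text", "learn": False},
    {"id": "<query>", "type": "special", "learn": False},
    {"id": "query", "type": "text", "learn": True, "metrics": ["match"]},
]
item = {"question": "question text 1", "query": "query 1"}

segments = linearize(item, data_shape)
sequence = " ".join(text for text, _ in segments)
learned = [text for text, learn in segments if learn]
print(sequence)  # <question> question text 1 <query> query 1
print(learned)   # ['query 1']
```

In this reading, only the segments with learn set to true (here the query) are the parts the model is trained to produce; the rest serve as context.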

The value of extra_special_tokens is a list of special tokens to be added to the vocabulary. Alternatively (especially if the list is too long or is generated automatically), you can create a text file with one special token per line and pass that as an argument during training via the --special_tokens_file argument.
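
If the token list is generated automatically, a small script can produce the text file. The sketch below assumes nothing beyond "one special token per line"; the helper name write_special_tokens, the file name special_tokens.txt, the example tokens, and the choice to drop duplicates and blank entries are all illustrative assumptions.

```python
import tempfile
from pathlib import Path


def write_special_tokens(path, tokens):
    """Write one token per line, dropping blanks and duplicates (order preserved)."""
    unique = list(dict.fromkeys(t.strip() for t in tokens if t.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Hypothetical tokens and output location.
tokens_path = Path(tempfile.mkdtemp()) / "special_tokens.txt"
written = write_special_tokens(tokens_path, ["[TokenA]", "[TokenB]", "[TokenA]", " "])
print(tokens_path.read_text(encoding="utf-8"))
```

The resulting path is what you would pass to --special_tokens_file during training.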

The value of extra_fields is a list of additional fields from the input JSON files to include in the evaluation output, aside from the main fields used as inputs/outputs.

Training

The training script train.py can be used in single-GPU or multi-GPU settings.

cd src/datatuner/lm

# single gpu
python train.py --model_checkpoint ~/data/openai-gpt/  --dataset_path ../../../data/my_dataset/  --task_config ./task_configs/my_task_config.json --n_epoch 3 --lr 1e-5

# multi gpu
python -m torch.distributed.launch --nproc_per_node=4 train.py --model_checkpoint ~/data/openai-gpt/  --dataset_path ../../../data/my_dataset/  --task_config ./task_configs/my_task_config.json --n_epoch 3 --lr 1e-5

Evaluating the Model

You can run the following to evaluate the model on any test set. The data format is the same as the training data. Note that you currently have to specify the model_type parameter to match the model you're loading:

cd src/datatuner/lm

python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/2020-01-01_01-01-01  --filename ../../../data/my_dataset/test.json --max_length 200 --model_type gpt --top_k 1

# or if you just want to evaluate the latest model you trained 
RUN=$(ls -t ./runs | head -1) && python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/$RUN  --filename ../../../data/my_dataset/test.json --max_length 200 --model_type gpt  --top_k 1

# or if you want to use the latest intermediate checkpoint while the model is training, first copy it to pytorch_model.bin in the run folder:
RUN=$(ls -t ./runs | head -1) && CHECKPOINT=$(ls -t ./runs/$RUN/checkpoint* | head -1) && cp $CHECKPOINT runs/$RUN/pytorch_model.bin

During evaluation, outputs that do not exactly match the expected outputs are printed, along with the metrics (a dictionary with keys of the form <metric_name>_<field_name>). At the end of evaluation, all generated outputs are saved to eval_results/<run_folder_name>/<task_name>_<test_file_name>_<model_type>_generated.json.
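
The key naming can be illustrated with a small Python sketch. This is not the code in metrics.py: the signature of match (expected string, generated string), the averaging over data points, and the evaluate helper are assumptions made for illustration. Only the <metric_name>_<field_name> key pattern comes from the description above.

```python
def match(expected, generated):
    """Hypothetical exact-match metric: 1.0 if the two strings are identical."""
    return float(expected.strip() == generated.strip())


def evaluate(pairs, field_name, metrics):
    """Average each metric over (expected, generated) pairs for one field."""
    results = {}
    for metric in metrics:
        scores = [metric(expected, generated) for expected, generated in pairs]
        # Key follows the <metric_name>_<field_name> pattern.
        results[f"{metric.__name__}_{field_name}"] = sum(scores) / len(scores)
    return results


pairs = [("query 1", "query 1"), ("query 2", "query two")]
results = evaluate(pairs, "query", [match])
mismatches = [(e, g) for e, g in pairs if e != g]
print(results)     # {'match_query': 0.5}
print(mismatches)  # [('query 2', 'query two')]
```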

Interacting with the model

You can also interact with the models. The client will ask you to input the fields required, and it will generate the fields it learnt.

cd src/datatuner/lm

python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/2020-01-01_01-01-01  --max_length 200 --model_type gpt  --top_k 1 --input

# or if you just want to interact with the latest model you trained
RUN=$(ls -t ./runs | head -1) && python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/$RUN  --max_length 200 --model_type gpt  --top_k 1 --input