UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Overview

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

This repository contains UA-GEC data and an accompanying Python library.

Data

All corpus data and metadata stay under the ./data. It has two subfolders for train and test splits

Each split (train and test) has further subfolders for different data representations:

./data/{train,test}/annotated stores documents in the annotated format

./data/{train,test}/source and ./data/{train,test}/target store the original and the corrected versions of documents. Text files in these directories are plain text with no annotation markup. These files were produced from the annotated data and are, in some way, redundant. We keep them because this format is convenient in some use cases.

Metadata

./data/metadata.csv stores per-document metadata. It's a CSV file with the following fields:

  • id (str): document identifier.
  • author_id (str): document author identifier.
  • is_native (int): 1 if the author is native-speaker, 0 otherwise
  • region (str): the author's region of birth. A special value "Інше" is used both for authors who were born outside Ukraine and authors who preferred not to specify their region.
  • gender (str): could be "Жіноча" (female), "Чоловіча" (male), or "Інша" (other).
  • occupation (str): one of "Технічна", "Гуманітарна", "Природнича", "Інша"
  • submission_type (str): one of "essay", "translation", or "text_donation"
  • source_language (str): for submissions of the "translation" type, this field indicates the source language of the translated text. Possible values are "de", "en", "fr", "ru", and "pl".
  • annotator_id (int): ID of the annotator who corrected the document.
  • partition (str): one of "test" or "train"
  • is_sensitive (int): 1 if the document contains profanity or offensive language

Annotation format

Annotated files are text files that use the following in-text annotation format: {error=>edit:::error_type=Tag}, where error and edit stand for the text item before and after correction respectively, and Tag denotes an error category (Grammar, Spelling, Punctuation, or Fluency).

Example of an annotated sentence:

    I {likes=>like:::error_type=Grammar} turtles.

An accompanying Python package, ua_gec, provides many tools for working with annotated texts. See its documentation for details.

Train-test split

We expect users of the corpus to train and tune their models on the train split only. Feel free to further split it into train-dev (or use cross-validation).

Please use the test split only for reporting scores of your final model. In particular, never optimize on the test set. Do not tune hyperparameters on it. Do not use it for model selection in any way.

Next section lists the per-split statistics.

Statistics

UA-GEC contains:

Split Documents Sentences Tokens Authors
train 851 18,225 285,247 416
test 160 2,490 43,432 76
TOTAL 1,011 20,715 328,779 492

See stats.txt for detailed statistics generated by the following command (ua-gec must be installed first):

$ make stats

Python library

Alternatively to operating on data files directly, you may use a Python package called ua_gec. This package includes the data and has classes to iterate over documents, read metadata, work with annotations, etc.

Getting started

The package can be easily installed by pip:

    $ pip install ua_gec==1.1

Alternatively, you can install it from the source code:

    $ cd python
    $ python setup.py develop

Iterating through corpus

Once installed, you may get annotated documents from the Python code:

    
    >>> from ua_gec import Corpus
    >>> corpus = Corpus(partition="train")
    >>> for doc in corpus:
    ...     print(doc.source)         # "I likes it."
    ...     print(doc.target)         # "I like it."
    ...     print(doc.annotated)      # <AnnotatedText("I {likes=>like} it.")
    ...     print(doc.meta.region)    # "Київська"

Note that the doc.annotated property is of type AnnotatedText. This class is described in the next section

Working with annotations

ua_gec.AnnotatedText is a class that provides tools for processing annotated texts. It can iterate over annotations, get annotation error type, remove some of the annotations, and more.

While we're working on a detailed documentation, here is an example to get you started. It will remove all Fluency annotations from a text:

    >>> from ua_gec import AnnotatedText
    >>> text = AnnotatedText("I {likes=>like:::error_type=Grammar} it.")
    >>> for ann in text.iter_annotations():
    ...     print(ann.source_text)       # likes
    ...     print(ann.top_suggestion)    # like
    ...     print(ann.meta)              # {'error_type': 'Grammar'}
    ...     if ann.meta["error_type"] == "Fluency":
    ...         text.remove(ann)         # or `text.apply(ann)`

Contributing

  • The data collection is an ongoing activity. You can always contribute your Ukrainian writings or complete one of the writing tasks at https://ua-gec-dataset.grammarly.ai/

  • Code improvements and document are welcomed. Please submit a pull request.

Contacts

Owner
Grammarly
Millions of users rely on Grammarly's AI-powered products to make their messages, documents, and social media posts clear, mistake-free, and impactful.
Grammarly
Minecraft Hack Detection With Python

Minecraft Hack Detection An attempt to try and use crowd sourced replays to find

Kuleen Sasse 3 Mar 26, 2022
Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks This repository contains a TensorFlow implementation of "

Jingwei Zheng 5 Jan 08, 2023
Problem-943.-ACMP - Problem 943. ACMP

Problem-943.-ACMP В "main.py" расположен вариант моего решения задачи 943 с серв

Konstantin Dyomshin 2 Aug 19, 2022
Official implementation for "Symbolic Learning to Optimize: Towards Interpretability and Scalability"

Symbolic Learning to Optimize This is the official implementation for ICLR-2022 paper "Symbolic Learning to Optimize: Towards Interpretability and Sca

VITA 8 Dec 19, 2022
The Pytorch code of "Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification", CVPR 2022 (Oral).

DeepBDC for few-shot learning        Introduction In this repo, we provide the implementation of the following paper: "Joint Distribution Matters: Dee

FeiLong 116 Dec 19, 2022
Repository for MuSiQue: Multi-hop Questions via Single-hop Question Composition

🎵 MuSiQue: Multi-hop Questions via Single-hop Question Composition This is the repository for our paper "MuSiQue: Multi-hop Questions via Single-hop

21 Jan 02, 2023
PyTorch Implement for Path Attention Graph Network

SPAGAN in PyTorch This is a PyTorch implementation of the paper "SPAGAN: Shortest Path Graph Attention Network" Prerequisites We prefer to create a ne

Yang Yiding 38 Dec 28, 2022
Inference code for "StylePeople: A Generative Model of Fullbody Human Avatars" paper. This code is for the part of the paper describing video-based avatars.

NeuralTextures This is repository with inference code for paper "StylePeople: A Generative Model of Fullbody Human Avatars" (CVPR21). This code is for

Visual Understanding Lab @ Samsung AI Center Moscow 18 Oct 06, 2022
Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions

torch-imle Concise and self-contained PyTorch library implementing the I-MLE gradient estimator proposed in our NeurIPS 2021 paper Implicit MLE: Backp

UCL Natural Language Processing 249 Jan 03, 2023
A PaddlePaddle implementation of Time Interval Aware Self-Attentive Sequential Recommendation.

TiSASRec.paddle A PaddlePaddle implementation of Time Interval Aware Self-Attentive Sequential Recommendation. Introduction 论文:Time Interval Aware Sel

Paddorch 2 Nov 28, 2021
JupyterLite demo deployed to GitHub Pages 🚀

JupyterLite Demo JupyterLite deployed as a static site to GitHub Pages, for demo purposes. ✨ Try it in your browser ✨ ➡️ https://jupyterlite.github.io

JupyterLite 223 Jan 04, 2023
Beginner-friendly repository for Hacktober Fest 2021. Start your contribution to open source through baby steps. 💜

Hacktober Fest 2021 🎉 Open source is changing the world – one contribution at a time! 🎉 This repository is made for beginners who are unfamiliar wit

Abhilash M Nair 32 Dec 11, 2022
A denoising diffusion probabilistic model synthesises galaxies that are qualitatively and physically indistinguishable from the real thing.

Realistic galaxy simulation via score-based generative models Official code for 'Realistic galaxy simulation via score-based generative models'. We us

Michael Smith 32 Dec 20, 2022
How to Train a GAN? Tips and tricks to make GANs work

(this list is no longer maintained, and I am not sure how relevant it is in 2020) How to Train a GAN? Tips and tricks to make GANs work While research

Soumith Chintala 10.8k Dec 31, 2022
Multi Agent Reinforcement Learning for ROS in 2D Simulation Environments

IROS21 information To test the code and reproduce the experiments, follow the installation steps in Installation.md. Afterwards, follow the steps in E

11 Oct 29, 2022
Implementation for Simple Spectral Graph Convolution in ICLR 2021

Simple Spectral Graph Convolutional Overview This repo contains an example implementation of the Simple Spectral Graph Convolutional (S^2GC) model. Th

allenhaozhu 64 Dec 31, 2022
Luminaire is a python package that provides ML driven solutions for monitoring time series data.

A hands-off Anomaly Detection Library Table of contents What is Luminaire Quick Start Time Series Outlier Detection Workflow Anomaly Detection for Hig

Zillow 670 Jan 02, 2023
An abstraction layer for mathematical optimization solvers.

MathOptInterface Documentation Build Status Social An abstraction layer for mathematical optimization solvers. Replaces MathProgBase. Citing MathOptIn

JuMP-dev 284 Jan 04, 2023
End-To-End Crowdsourcing

End-To-End Crowdsourcing Comparison of traditional crowdsourcing approaches to a state-of-the-art end-to-end crowdsourcing approach LTNet on sentiment

Andreas Koch 1 Mar 06, 2022
Python code for loading the Aschaffenburg Pose Dataset.

Aschaffenburg Pose Dataset (APD) This repository contains Python code for loading and filtering the Aschaffenburg Pose Dataset. The dataset itself and

1 Nov 26, 2021