YACLC - Yet Another Chinese Learner Corpus

Related tags

Text Data & NLPYACLC
Overview

汉语学习者文本多维标注数据集YACLC V1.0

中文 | English

汉语学习者文本多维标注数据集(Yet Another Chinese Learner Corpus,YACLC)由北京语言大学、清华大学、北京师范大学、云南师范大学、东北大学、上海财经大学等高校组成的团队共同发布。主要项目负责人有杨麟儿、杨尔弘、孙茂松、张宝林、胡韧奋、何姗、岳岩、饶高琦、刘正皓、陈云等。

简介

汉语学习者文本多维标注数据集(Yet Another Chinese Learner Corpus,YACLC)是一个大规模的、提供偏误多维标注的汉语学习者文本数据集。我们招募了百余位汉语国际教育、语言学及应用语言学等专业背景的研究生组成标注团队,并采用众包策略分组标注。每个句子由10位标注员进行标注,每位标注员需要给出0或1的句子可接受度评分,以及纠偏标注(Grammatical Error Correction)和流利标注(Fluency based Correction)两个维度的标注结果。纠偏标注是从语法层面对偏误句进行修改,遵循忠实原意、最小改动的原则,将偏误句修改为符合汉语语法规范的句子;流利标注是将句子修改得更为流利和地道,符合母语者的表达习惯。在标注时,若句子可接受度评分为0,则标注员至少需要完成一条纠偏标注,同时可以进行流利标注。若句子可接受度评分为1,则标注员只需给出流利标注。本数据集可用于语法纠错、文本校对等自然语言处理任务,也可为汉语二语教学与习得、语料库语言学等研究领域提供数据支持。

数据规模

训练集规模为8,000条,每条数据包括原始句子及其多种纠偏标注与流利标注。验证集和测试集规模都为1,000条,每条数据包括原始句子及其全部纠偏标注与流利标注。

数据格式

每条数据中包含汉语学习者所写的待标注句子及其id、所属篇章id、所属篇章标题、标注员数量以及多维标注信息。其中,多维度标注信息包括:

  • 标注维度,"1"表示纠偏标注,"0"表示流利标注;
  • 标注后的正确文本;
  • 标注中的修改操作数量;
  • 提供该标注的标注员数量。

注意:测试集数据无标注者和多维标注信息。

数据样例如下:

{
  "sentence_id": 4308, // 句子id
  "sentence_text": "我只可以是坐飞机去的,因为巴西离英国到远极了。", // 学习者原句文本
  "article_id": 7267, // 该句所属的篇章id
  "article_name": "我放假的打算", // 篇章标题
  "total_annotators": 10, // 共多少个标注者参与了该句的标注
  "sentence_annos": [ // 多维标注信息
    {
      "is_grammatical": 1, // 标注维度:1表示纠偏标注,0表示流利标注
      "correction": "我只能坐飞机去,因为巴西离英国远极了。", // 修改后的正确文本
      "edits_count": 3, // 共有几处修改操作
      "annotator_count": 6 // 共有几个标注者修改为了这一结果
    },
    { // 下同
      "is_grammatical": 1,
      "correction": "我只能是坐飞机去的,因为巴西离英国远极了。",
      "edits_count": 2,
      "annotator_count": 1
    },
    {
      "is_grammatical": 1,
      "correction": "我只可以坐飞机去,因为巴西离英国远极了。",
      "edits_count": 3,
      "annotator_count": 2
    },
    {
      "is_grammatical": 0,
      "correction": "我只能坐飞机去,因为巴西离英国太远了。",
      "edits_count": 6,
      "annotator_count": 2
    }
  ]
}

评测代码使用

提交结果为文本文件,每行为一个修改后的句子,并与测试集中的数据逐条对应。

  • 每条测试集中的数据仅需给出一条修改结果;
  • 修改结果需使用THULAC工具包分词,请提交分词后的结果。

对于提交的结果文件output_file,后台将调用eval.py将其同标准答案文件test_gold_m2进行比较:

python eval.py output_file test_gold_m2

评测指标为F_0.5,输出结果示例:

{
  "Precision": 70.63,	
  "Recall": 37.04,
  "F_0.5": 59.79
}

引用

如果您使用了本数据集,请引用以下技术报告:

@article{wang-etal-2021-yaclc,
  title={YACLC: A Chinese Learner Corpus with Multidimensional Annotation}, 
  author={Yingying Wang, Cunliang Kong, Liner Yang, Yijun Wang, Xiaorong Lu, Renfen Hu, Shan He, Zhenghao Liu, Yun Chen, Erhong Yang, Maosong Sun},
  journal={arXiv preprint arXiv:2112.15043},
  year = {2021}
}

相关资源

“文心·写作”演示系统:https://writer.wenmind.net/

语法改错论文列表: https://github.com/blcuicall/GEC-Reading-List


Introduction

YACLC is a large-scale Chinese learner text dataset, providing multi-dimensional annotations, jointly released by a team composed of Beijing Language and Culture University, Tsinghua University, Beijing Normal University, Yunnan Normal University, Northeastern University, Shanghai University of Finance and Economics.

We recruited more than 100 students majoring in Chinese International Education, Linguistics, and Applied Linguistics for crowdsourcing annotation. Each sentence is annotated by 10 annotators, and each annotator needs to give a sentence acceptability score of 0 or 1, as well as grammatical error correction and fluency-based correction. The grammatical corrections follows the principle of minimum modification to make the learners' text meet the Chinese grammatical standards. The fluency-based correction modifies the sentence to be more fluent and authentic, in line with the expression habits of native speakers. This dataset can be used for multiple NLP tasks such as grammatical error correction and spell checking. It can also provide data support for research fields such as Chinese second language teaching and acquisition, corpus linguistics, etc.

Size

YACLC V1.0 contains the train (8,000 instances), validation (1,000 instances) and test (1,000 instances) sets. Each instance of train set includes 1 sentence written by Chinese learner, various grammatical error corrections and fluent error corrections of the sentence. While, each instance in the validation and test set includes all the corrections provided by the annotators.

Format

Each instance is composed of the informations of the sentence, annotators and the multi-dimensional annotations. The multi-dimensional annotation includes:

  • dimensioning, "1" indicates grammatical error correction, and "0" indicates fluency-based correction;
  • correct sentence after annotation;
  • number of edit operations in this annotation;
  • number of annotators for this annotation.

Here is an example:

{
  "sentence_id": 4308, // 
  "sentence_text": "我只可以是坐飞机去的,因为巴西离英国到远极了。",
  "article_id": 7267, // the article id that this sentence belongs to
  "article_name": "我放假的打算", // the title of this sentence
  "total_annotators": 10, // the number of annotators for this sentence
  "sentence_annos": [ // multi-dimensional annotations
    {
      "is_grammatical": 1, // 1: grammatical, 0: fluent
      "correction": "我只能坐飞机去,因为巴西离英国远极了。", // corrected sentence 
      "edits_count": 3, // the number of edits of this annotation
      "annotator_count": 6 // the number of annotators for this annotation
    },
    { 
      "is_grammatical": 1,
      "correction": "我只能是坐飞机去的,因为巴西离英国远极了。",
      "edits_count": 2,
      "annotator_count": 1
    },
    {
      "is_grammatical": 1,
      "correction": "我只可以坐飞机去,因为巴西离英国远极了。",
      "edits_count": 3,
      "annotator_count": 2
    },
    {
      "is_grammatical": 0,
      "correction": "我只能坐飞机去,因为巴西离英国太远了。",
      "edits_count": 6,
      "annotator_count": 2
    }
  ]
}

Usage of the Evaluation Code

The submission result is a text file, where each line is a corrected sentence and corresponds to the instance in the test set one by one.

  • Only one correction needs to be given for the each instance in the test set.
  • Please submit results after word segmentation using the THULAC toolkit.

For a submitted result file output_file,we will call the script eval.py to compare it with the golden standard test_gold_m2

python eval.py output_file test_gold_m2

The Evaluation Metric is F_0.5:

{
  "Precision": 70.63,
  "Recall": 37.04,
  "F_0.5": 59.79
}

Citation

Please cite our technical report if you use this dataset:

@article{wang-etal-2021-yaclc,
  title={YACLC: A Chinese Learner Corpus with Multidimensional Annotation},
  author={Yingying Wang, Cunliang Kong, Liner Yang, Yijun Wang, Xiaorong Lu, Renfen Hu, Shan He, Zhenghao Liu, Yun Chen, Erhong Yang, Maosong Sun},
  journal={arXiv preprint arXiv:2112.15043},
  year = {2021}
}
Owner
BLCU-ICALL
ICALL Research Group at Beijing Language and Culture University
BLCU-ICALL
Community and sentiment analysis based on tweets

The project has set itself the goal of analyzing the thoughts and interaction of Italian users through the social posts expressed through the Twitter platform on the day of the entry into force of th

3 Nov 17, 2022
A method to generate speech across multiple speakers

VoiceLoop PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. VoiceLoop is a n

Facebook Archive 873 Dec 15, 2022
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipB

Jie Lei 雷杰 612 Jan 04, 2023
This is Assignment1 code for the Web Data Processing System.

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).

3 Dec 04, 2022
KR-FinBert And KR-FinBert-SC

KR-FinBert & KR-FinBert-SC Much progress has been made in the NLP (Natural Language Processing) field, with numerous studies showing that domain adapt

5 Jul 29, 2022
NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles

NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles NewsMTSC is a dataset for target-dependent sentiment classification (TSC)

Felix Hamborg 79 Dec 30, 2022
A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

THUDM 42 Dec 27, 2022
Levenshtein and Hamming distance computation

distance - Utilities for comparing sequences This package provides helpers for computing similarities between arbitrary sequences. Included metrics ar

112 Dec 22, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
customer care chatbot made with Rasa Open Source.

Customer Care Bot Customer care bot for ecomm company which can solve faq and chitchat with users, can contact directly to team. 🛠 Features Basic E-c

Dishant Gandhi 23 Oct 27, 2022
👄 The most accurate natural language detection library for Python, suitable for long and short text alike

1. What does this library do? Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a prepr

Peter M. Stahl 334 Dec 30, 2022
IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models

IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models. Everything is pure Python and PyTorch based to keep it as simple and beginner-friendly, yet powerful as possible.

Digital Phonetics at the University of Stuttgart 247 Jan 05, 2023
Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Google 6.4k Jan 01, 2023
iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform

iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform This repo try to implement iSTFTNet : Fast

Rishikesh (ऋषिकेश) 126 Jan 02, 2023
A Flask Sentiment Analysis API, with visual implementation

The Sentiment Analysis Api was created using python flask module,it allows users to parse a text or sentence throught the (?text) arguement, then view the sentiment analysis of that sentence. It can

Ifechukwudeni Oweh 10 Jul 17, 2022
Sequence-to-Sequence learning using PyTorch

Seq2Seq in PyTorch This is a complete suite for training sequence-to-sequence models in PyTorch. It consists of several models and code to both train

Elad Hoffer 514 Nov 17, 2022
硕士期间自学的NLP子任务,供学习参考

NLP_Chinese_down_stream_task 自学的NLP子任务,供学习参考 任务1 :短文本分类 (1).数据集:THUCNews中文文本数据集(10分类) (2).模型:BERT+FC/LSTM,Pytorch实现 (3).使用方法: 预训练模型使用的是中文BERT-WWM, 下载地

12 May 31, 2022
The Sudachi synonym dictionary in Solar format.

solr-sudachi-synonyms The Sudachi synonym dictionary in Solar format. Summary Run a script that checks for updates to the Sudachi dictionary every hou

Karibash 3 Aug 19, 2022
Codename generator using WordNet parts of speech database

codenames Codename generator using WordNet parts of speech database References: https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/ ht

possiblywrong 27 Oct 30, 2022
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks,

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-popu

TextFlint 587 Dec 20, 2022