Hungarian Preverb Corpus

Overview

A gold standard corpus manually annotated with verb-preverb connections for Hungarian.

corpus

The corpus consists of the following 4 files:

filename                   # sentences   # preverbs
difficult_validate1.txt            310          357
difficult_validate2.txt            840          935
difficult_test.txt                 327          376
general_test.txt                   503          500

Preverbs in the general dataset appear in the same distribution as in normal Hungarian text. The difficult dataset is specially crafted: the most common and easiest-to-handle pattern, i.e. when a verb is directly followed by its preverb (e.g. megy ki 'go out'), is omitted. validate is for development/validation, test is for testing. Note that a general_validate dataset would not be useful, because the trivial pattern would be in the vast majority, overwhelming the more interesting, less frequent patterns.

Accordingly, the emPreverb tool, which connects preverbs to their corresponding verbs, was developed based only on the interesting difficult examples, and was tested on both the difficult and the general data.

(Remark. The difficult_validate dataset is divided into two parts for historical reasons, but you can simply use them together: they comprise a total of 1150 sentences and 1292 preverbs.)
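The counts in the table above can be reproduced with a few lines of Python. The following is only a hedged sketch, assuming one sentence per line and the backslash+ID notation described in the guidelines below; zero-ID (ellipsis) preverbs are counted as well, which may or may not match the published figures exactly.

import re
from pathlib import Path

# Hypothetical counting sketch; not part of the corpus tooling.
PREVERB_MARK = re.compile(r'\\\d')   # a separated preverb: backslash + single-digit ID

for name in ('difficult_validate1.txt', 'difficult_validate2.txt',
             'difficult_test.txt', 'general_test.txt'):
    sentences = [line for line in Path(name).read_text(encoding='utf-8').splitlines()
                 if line.strip()]
    n_preverbs = sum(len(PREVERB_MARK.findall(line)) for line in sentences)
    print(f'{name}\t{len(sentences)}\t{n_preverbs}')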

corpus annotation guidelines

  • A preverb is marked by a suffixed backslash followed by a (single-digit!) ID number: meg\1.
  • The word from which the preverb was separated is marked by a pipe followed by the same ID number: főzve|1.
  • Within the same line, different verb-preverb pairs must (obviously) receive different ID numbers.
  • A preverb that does not belong to any word in the sentence (ellipsis etc.) is marked with a zero ID: "Hazakísérhetlek?" "Meg\0 hát." Any number of preverbs can have the 0 ID within the same line.
  • In the difficult dataset, a verb directly followed by its preverb is not annotated: főzte meg, but: főzte|1 volna meg\1.
  • In the general dataset, the first pattern is annotated as well: főzte|1 meg\1.
  • Normally there is a 1:1 correspondence between preverbs and verbs. However, there are exceptions, and these are annotated accordingly, e.g. Se ki\1, se be\1 nem lehetett menni|1 Budakesziről; át-\1 meg átjárták|1.
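To make the notation concrete, a minimal sketch of how the ID-based pairing could be recovered from an annotated line is given below. It only illustrates the guidelines above (the function name and the simple punctuation handling are assumptions); it is not the code used in evaluate.ipynb.

import re
from collections import defaultdict

def extract_pairs(line):
    """Pair separated preverbs (token + backslash + ID) with verbs (token + pipe + ID) by ID.

    ID 0 marks a preverb that belongs to no word in the sentence (ellipsis),
    so it gets an empty verb list. n:1 and 1:n cases simply yield longer lists.
    """
    preverbs = defaultdict(list)   # ID -> preverb tokens
    verbs = defaultdict(list)      # ID -> verb tokens
    for token in line.split():
        token = token.rstrip('.,;?!"')
        m = re.fullmatch(r'(.+)\\(\d)', token)
        if m:
            preverbs[m.group(2)].append(m.group(1))
        m = re.fullmatch(r'(.+)\|(\d)', token)
        if m:
            verbs[m.group(2)].append(m.group(1))
    return {i: (pvs, verbs.get(i, [])) for i, pvs in preverbs.items()}

print(extract_pairs('Se ki\\1, se be\\1 nem lehetett menni|1 Budakesziről.'))
# -> {'1': (['ki', 'be'], ['menni'])}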

Check (see Steps 1 to 4 in evaluate.ipynb) whether the tokens annotated as separated preverbs are also analysed as preverbs by the e-magyar morph,pos modules. If not (e.g. if the preverb meg is tagged by emtsv as a [/Conj]), remove this annotation (or the whole item if no annotation is left) from the dataset, because preverb will necessarily fail due to the incorrect emtsv annotation, which is extraneous to its performance evaluation. Exception: person-inflected preverb-like postpositions such as in utánam\1 dobják|1, which are tagged by emtsv as [/Post], and case-inflected personal pronouns such as in hozzá\1 voltam szokva|1, which are tagged as [/N|Pro], should not be removed from the dataset, since preverb should be able to handle these.

If a token is annotated as the verb stem counterpart of a separated preverb but is not tagged by emtsv as a verb, check whether the preverb annotation is correct; if it is, do not remove this annotation from the dataset. preverb is supposed to be able to handle the connection of such separated preverbs.
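The two rules above could be sketched roughly as follows. This is only an assumed illustration (the real steps are in evaluate.ipynb); it presupposes that per-token emtsv analyses are available with an xpostag field and that separated preverbs carry an emMorph tag starting with [/Prev].

# Hypothetical sketch of the filtering rules above, not the code in evaluate.ipynb.
# Assumed input: the emtsv xpostag of a token annotated as a separated preverb.

KEEP_ANYWAY = ('[/Post]', '[/N|Pro]')  # inflected postpositions / personal pronouns

def keep_preverb_annotation(xpostag):
    """Decide whether the annotation of a separated preverb stays in the dataset."""
    if xpostag.startswith('[/Prev]'):
        return True      # emtsv agrees that the token is a preverb
    if xpostag.startswith(KEEP_ANYWAY):
        return True      # preverb is expected to handle these cases itself
    return False         # e.g. meg tagged as [/Conj]: drop the annotation (or item)

# Verb-side mismatches are treated differently: if the annotated verb stem is not
# tagged as a verb by emtsv, the preverb annotation is only double-checked, but
# the item is kept in the dataset.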

evaluation

An environment for reproducing the evaluation of emPreverb as published in the paper below.

git clone https://github.com/ril-lexknowrep/emPreverb
cd emPreverb
make evaluate

Note that make evaluate clones this repo inside emPreverb and runs the evaluation.

The results are written to general_test_results.txt and difficult_test_results.txt. They should be exactly the same as the results in Table 3 of the paper below.

development

An environment used for developing emPreverb. It is primarily intended "for us", but if you insist on using it:

git clone https://github.com/ril-lexknowrep/emPreverb
cd emPreverb
git clone https://github.com/ril-lexknowrep/hungarian-preverb-corpus
cd hungarian-preverb-corpus/development
jupyter notebook evaluate.ipynb

(Remark. Yes, please clone this repo inside emPreverb.)

citation

If you use the corpus, please cite the following paper.

Pethő, Gergely and Sass, Bálint and Kalivoda, Ágnes and Simon, László and Lipp, Veronika: Igekötő-kapcsolás. In: MSZNY 2022.

Owner
RIL Lexical Knowledge Representation Research Group