PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

Last update: Dec 02, 2021

Overview

PASTRIE

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English. Proceedings of the 14th Linguistic Annotation Workshop.

Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

English
French
German
Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.
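As a minimal sketch of the notation, assuming a target's scene role and function are available as two plain strings: the helper name `format_construal` is hypothetical, and printing a single label when role and function coincide is an assumption about the display convention rather than something this page states. The labels are taken from the data snippets quoted further down this page.

```python
def format_construal(scene_role: str, function: str) -> str:
    """Render a construal as SceneRole↝Function.

    Assumption: when both labels are identical, a single label is shown.
    """
    if scene_role == function:
        return scene_role
    return f"{scene_role}↝{function}"


# "My" in "My grandma" carries SocialRel (scene role) and Gestalt (function).
print(format_construal("SocialRel", "Gestalt"))
# "over" in "over the accelerator" carries Locus in both slots.
print(format_construal("Locus", "Locus"))
```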

Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

.conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
.json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
.govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.
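A rough sketch of reading the .json format follows. The top-level layout used here (a list of sentences with `toks`, `swes`, and `smwes` fields) is an assumption based on the STREUSLE JSON layout and should be checked against the released files. The per-expression keys (`lexlemma`, `lexcat`, `ss`, `ss2`, `toknums`) and token keys (`#`, `word`) are the ones visible in the error output quoted in the Comments section. The sample sentence and the function name are made up for illustration.

```python
import json

# Tiny inline sample standing in for a released .json file (hypothetical content).
sample = json.loads("""
[
  {
    "sent_id": "example-01",
    "text": "mats over the accelerator",
    "toks": [
      {"#": 1, "word": "mats", "lemma": "mat", "upos": "NOUN"},
      {"#": 2, "word": "over", "lemma": "over", "upos": "ADP"},
      {"#": 3, "word": "the", "lemma": "the", "upos": "DET"},
      {"#": 4, "word": "accelerator", "lemma": "accelerator", "upos": "NOUN"}
    ],
    "swes": {
      "2": {"lexlemma": "over", "lexcat": "P", "ss": "p.Locus",
            "ss2": "p.Locus", "toknums": [2]}
    },
    "smwes": {}
  }
]
""")


def supersense_targets(sentences):
    """Yield (sent_id, surface form, scene role, function) for annotated expressions."""
    for sent in sentences:
        words = {tok["#"]: tok["word"] for tok in sent["toks"]}
        for group in ("swes", "smwes"):  # single-word and strong multiword expressions
            for expr in sent.get(group, {}).values():
                if expr.get("ss"):  # skip expressions without a supersense
                    surface = " ".join(words[n] for n in expr["toknums"])
                    yield sent["sent_id"], surface, expr["ss"], expr["ss2"]


for row in supersense_targets(sample):
    print(row)
```

For a real file, the inline sample would be replaced by `json.load` on the opened file.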

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
- PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (multiword tokens—where a single orthographic word contains multiple syntactic words).
- PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
- In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
- PASTRIE lacks enhanced dependencies.
Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
- Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
Noun and verb expressions in PASTRIE do not have supersense labels.

Comments

Misc. annotation errors and/or conversion script bugs

There are some annotations which I'm fairly sure are incorrect and are choking up the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STREUSLE.) One or two might also be indicative of a bug in the conllulex2json.py file.

vs mistagged as a noun--should be prep

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

ditto

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:

13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

Relevant span of code:

            if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                ('ADP','P'),('ADV','P'),('SCONJ','P'),
                ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                ('PART','POSS')}:
                # most often, the single-word lexcat should match its upos
                # check a list of exceptions
                mismatchOK = False
                if xpos=='TO' and lc.startswith('INF'):
                    mismatchOK = True
                elif (xpos=='TO')!=lc.startswith('INF'):
                    assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                    mismatchOK = True
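Tracing the excerpt with the values from the error above suggests why it fires: the token has xpos IN but lexcat INF, so the `elif` branch is taken and its assertion requires the lexlemma to be "for". The following is a simplified re-implementation of only the quoted lines, for illustration. The function name and the returned strings are hypothetical, and whatever the real script does outside this excerpt is not modeled.

```python
# Pairs the quoted condition accepts without further checks.
ALLOWED_PAIRS = {
    ("NOUN", "N"), ("PROPN", "N"), ("VERB", "V"),
    ("ADP", "P"), ("ADV", "P"), ("SCONJ", "P"),
    ("ADP", "DISC"), ("ADV", "DISC"), ("SCONJ", "DISC"),
    ("PART", "POSS"),
}


def trace_lexcat_check(upos, xpos, lexcat, lexlemma):
    """Return a short verdict describing which branch of the excerpt applies."""
    if upos == lexcat or (upos, lexcat) in ALLOWED_PAIRS:
        return "ok: upos and lexcat compatible"
    if xpos == "TO" and lexcat.startswith("INF"):
        return "ok: xpos TO with INF lexcat"
    if (xpos == "TO") != lexcat.startswith("INF"):
        # The quoted assert only lets this pass for lexlemma 'for'.
        if upos in ("SCONJ", "ADP") and lexlemma == "for":
            return "ok: 'for' exception"
        return "assertion fails"
    return "not decided by this excerpt"


# Token 23 as it appears in the data: ADP/IN with lexcat INF.
print(trace_lexcat_check("ADP", "IN", "INF", "to"))
# The same token if its xpos were TO, as suggested above.
print(trace_lexcat_check("ADP", "TO", "INF", "to"))
```

Under this reading, changing the xpos to TO would be enough to get past this particular check, whatever the right upos turns out to be.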

Originator as function:

(in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

lexcat DISC with ADJ:

AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

"her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:

1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

"NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})

opened by lgessler 6

Prepositional supersense annotations on non-preposition targets
Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_
opened by lgessler 5

Prepositions unannotated for supersense

Token 6:

# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
# text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_

I assumed that all preps were supposed to be annotated, but perhaps not?
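One way to list such cases is to scan the .conllulex rows for ADP tokens with an empty supersense column. This is only a sketch: the column positions (UPOS in column 4, SMWE in column 11, SS in column 14) are read off the 19-column snippets on this page, the function name is hypothetical, and a flagged token may well be intentionally unannotated. Tokens inside a strong MWE are skipped because only one row of the expression carries the label.

```python
def unannotated_adpositions(conllulex_lines):
    """Return (token id, form) for ADP tokens outside MWEs that lack a supersense."""
    hits = []
    for line in conllulex_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines such as '# sent_id = ...'
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 19:
            continue  # not a well-formed token row
        tok_id, form, upos, smwe, ss = cols[0], cols[1], cols[3], cols[10], cols[13]
        if upos == "ADP" and smwe == "_" and ss == "_":
            hits.append((tok_id, form))
    return hits


# Three rows adapted from the sentence above, built with explicit tabs.
rows = [
    ["5", "hang", "hang", "NOUN", "NN", "_", "3", "obj"] + ["_"] * 11,
    ["6", "of", "of", "ADP", "IN", "_", "7", "case"] + ["_"] * 11,
    ["17", "your", "you", "PRON", "PRP$", "_", "18", "nmod:poss",
     "_", "_", "_", "_", "_", "Possessor", "Possessor", "_", "_", "_", "_"],
]
lines = ["\t".join(row) for row in rows]
print(unannotated_adpositions(lines))
```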

opened by lgessler 3

Apostrophes removed in preprocessing?

Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?

opened by nschneid 2
Dataset requested

Hi all,

I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English".

Thanks for your reply.

opened by fj-morales 2
SNACS supersense tags should start with "p."

For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

Special labels like `i `d `c `$ ?? should not carry the "p." prefix. In fact, the backtick labels from annotation are not represented as such in STREUSLE; they are reflected in the LEXCAT column of the data.
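The prefixing rule described here could be sketched as follows. The function name is hypothetical, treating `_` and `None` as "no label" is an assumption, and the sketch does not attempt the separate step of moving backtick labels into the LEXCAT column.

```python
def normalize_supersense(label):
    """Add the 'p.' prefix to a SNACS label, leaving special labels untouched."""
    if label is None or label == "_":
        return label  # no annotation on this token
    if label.startswith("`") or label == "??":
        return label  # special annotation labels are not supersenses
    if label.startswith("p."):
        return label  # already in STREUSLE style
    return "p." + label


for raw in ["Locus", "p.Theme", "`i", "??", "_"]:
    print(raw, "->", normalize_supersense(raw))
```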

opened by nschneid 0
Questionable adpositional MWEs
in_male_term — from "in male terms"; should be in_term (at most)

in_the_first_place

in_my_hand — from "in my hands"; should be in_hand (at most)

for_quite_some_time — just Duration for, weak MWE?

at_all_time — from what should have been "at all times". OK?

on_a_smaller_scale — omit adjective?

withouth — typo

see_as — "seeing as" (deverbal MWE acting like a preposition)
opened by nschneid 0
Some undersegmentation of sentences

Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.
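As a starting point for such a script, the search for consecutive spaces can be sketched with a regular expression. The function name and the sample text are made up for illustration. This only locates candidate split points in the raw text; it does not touch the hard part, namely renumbering tokens and updating offsets in each parse.

```python
import re

# Two or more consecutive space characters in the raw sentence text.
MULTISPACE = re.compile(r" {2,}")


def candidate_split_offsets(text):
    """Return (start, end) character offsets of each run of consecutive spaces."""
    return [(m.start(), m.end()) for m in MULTISPACE.finditer(text)]


# Hypothetical raw text: a blockquoted sentence followed by a response.
text = "> quoted claim .  my reply to it"
for start, end in candidate_split_offsets(text):
    print(repr(text[:start]), "|", repr(text[end:]))
```

Each hit still needs a manual look, since emoticons and similar appendages are also set off by extra spaces and should probably stay in the same sentence.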

opened by nschneid 0

Releases(v2.0.1)

v2.0.1(Nov 21, 2021)
Fixes 3 erroneous sentence IDs (along with beefed up sentence ID validation in scripts). (#16)

v2.0(Oct 22, 2021)
Switch to full .conllulex format following STREUSLE

add lexcats (#3), morphological features, newdoc directives

Scripts for validation and format conversion

Clean up various annotation issues, including:

restore apostrophes and fix other conversion problems (#6, #9)

include pretokenized raw text (#12)

v1.0.1(Dec 14, 2020)
Added .json file format

Switched lemmatization and pos tagging from StanfordNLP 0.2.0 to Stanza 1.1.1

Corrected rare encoding issue from v1.0

v1.0(Dec 12, 2020)


Owner

NERT @ Georgetown
