PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

Related tags

Deep Learningpastrie
Overview

PASTRIE

CC BY-SA 4.0

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English [link]. Proceedings of the 14th Linguistic Annotation Workshop.


Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

  • English
  • French
  • German
  • Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.


Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

  • .conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
  • .json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
  • .govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

  • Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
    • PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (multiword tokens—where a single orthographic word contains multiple syntactic words).
    • PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
    • In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
    • PASTRIE lacks enhanced dependencies.
  • Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
    • Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
  • Noun and verb expressions in PASTRIE do not have supersense labels.
Comments
  • Misc. annotation errors and/or conversion script bugs

    Misc. annotation errors and/or conversion script bugs

    There are some annotations which I'm fairly sure are incorrect and are choking up the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STRUESLE.) One or two might also be indicative of a bug in the conllulex2json.py file.

    1. vs mistagged as a noun--should be prep

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    1. ditto

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    1. Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:
    13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
    14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
    15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
    16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
    17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
    18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
    19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
    20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
    21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
    22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
    23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
    24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

    Relevant span of code:

                if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                    ('ADP','P'),('ADV','P'),('SCONJ','P'),
                    ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                    ('PART','POSS')}:
                    # most often, the single-word lexcat should match its upos
                    # check a list of exceptions
                    mismatchOK = False
                    if xpos=='TO' and lc.startswith('INF'):
                        mismatchOK = True
                    elif (xpos=='TO')!=lc.startswith('INF'):
                        assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                        mismatchOK = True
    
    1. Originator as function:

    (in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

    1. lexcat DISC with ADJ:

    AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

    1. "her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:
    1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
    2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
    3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
    4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
    5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
    6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
    7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
    8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
    9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
    10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
    11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
    12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
    13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
    14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

    1. "NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

    AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})

    opened by lgessler 6
  • Prepositional supersense annotations on non-preposition targets

    Prepositional supersense annotations on non-preposition targets

    Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

    21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
    22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
    23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_
    
    opened by lgessler 5
  • Prepositions unannotated for supersense

    Prepositions unannotated for supersense

    Token 6:

    # sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
    # text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
    1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
    2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
    3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
    4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
    5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
    6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
    7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
    8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
    9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
    10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
    11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
    12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
    13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
    14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
    15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
    16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
    17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
    18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
    19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    

    I assumed that all preps were supposed to be annotated, but perhaps not?

    opened by lgessler 3
  • Apostrophes removed in preprocessing?

    Apostrophes removed in preprocessing?

    Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?

    opened by nschneid 2
  • Dataset requested

    Dataset requested

    Hi all,

    I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supsersense Tags in Reddit International English".

    Thanks for reply.

    opened by fj-morales 2
  • SNACS supersense tags should start with

    SNACS supersense tags should start with "p."

    For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

    Special labels like `i `d `c `$ ?? should not start with p.. In fact, the backtick labels from annotation are not represented as such in STREUSLE—they are reflected in the LEXCAT column of the data.

    opened by nschneid 0
  • Questionable adpositional MWEs

    Questionable adpositional MWEs

    • in_male_term — from "in male terms"; should be in_term (at most)
    • in_the_first_place
    • in_my_hand — from "in my hands"; should be in_hand (at most)
    • for_quite_some_time — just Duration for, weak MWE?
    • at_all_time — from what should have been "at all times". OK?
    • on_a_smaller_scale — omit adjective?
    • withouth — typo
    • see_as — "seeing as" (deverbal MWE acting like a preposition)
    opened by nschneid 0
  • Some undersegmentation of sentences

    Some undersegmentation of sentences

    Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

    It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.

    opened by nschneid 0
Releases(v2.0.1)
  • v2.0.1(Nov 21, 2021)

  • v2.0(Oct 22, 2021)

    • Switch to full .conllulex format following STREUSLE
      • add lexcats (#3), morphological features, newdoc directives
    • Scripts for validation and format conversion
    • Clean up various annotation issues, including:
      • restore apostrophes and fixing other conversion problems (#6, #9)
      • include pretokenized raw text (#12)
    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Dec 14, 2020)

    • Added .json file format
    • Switched lemmatization and pos tagging from StanfordNLP 0.2.0 to Stanza 1.1.1
    • Corrected rare encoding issue from v1.0
    Source code(tar.gz)
    Source code(zip)
Owner
NERT @ Georgetown
NERT @ Georgetown
aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

Bayesian Methods for Hackers Using Python and PyMC The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chap

Cameron Davidson-Pilon 25.1k Jan 02, 2023
Image-popularity-score - A novel deep regression method for image scoring.

Image-popularity-score - A novel deep regression method for image scoring.

Shoaib ahmed 1 Dec 26, 2021
Supplementary code for the experiments described in the 2021 ISMIR submission: Leveraging Hierarchical Structures for Few Shot Musical Instrument Recognition.

Music Trees Supplementary code for the experiments described in the 2021 ISMIR submission: Leveraging Hierarchical Structures for Few Shot Musical Ins

Hugo Flores García 32 Nov 22, 2022
Official implementation for the paper: Permutation Invariant Graph Generation via Score-Based Generative Modeling

Permutation Invariant Graph Generation via Score-Based Generative Modeling This repo contains the official implementation for the paper Permutation In

64 Dec 29, 2022
Lighthouse: Predicting Lighting Volumes for Spatially-Coherent Illumination

Lighthouse: Predicting Lighting Volumes for Spatially-Coherent Illumination Pratul P. Srinivasan, Ben Mildenhall, Matthew Tancik, Jonathan T. Barron,

Pratul Srinivasan 65 Dec 14, 2022
Streamlit Tutorial (ex: stock price dashboard, cartoon-stylegan, vqgan-clip, stylemixing, styleclip, sefa)

Streamlit Tutorials Install pip install streamlit Run cd [directory] streamlit run app.py --server.address 0.0.0.0 --server.port [your port] # http:/

Jihye Back 30 Jan 06, 2023
OCR-D wrapper for detectron2 based segmentation models

ocrd_detectron2 OCR-D wrapper for detectron2 based segmentation models Introduction Installation Usage OCR-D processor interface ocrd-detectron2-segm

Robert Sachunsky 13 Dec 06, 2022
DeFMO: Deblurring and Shape Recovery of Fast Moving Objects (CVPR 2021)

Evaluation, Training, Demo, and Inference of DeFMO DeFMO: Deblurring and Shape Recovery of Fast Moving Objects (CVPR 2021) Denys Rozumnyi, Martin R. O

Denys Rozumnyi 139 Dec 26, 2022
Sionna: An Open-Source Library for Next-Generation Physical Layer Research

Sionna: An Open-Source Library for Next-Generation Physical Layer Research Sionna™ is an open-source Python library for link-level simulations of digi

NVIDIA Research Projects 313 Dec 22, 2022
Header-only library for using Keras models in C++.

frugally-deep Use Keras models in C++ with ease Table of contents Introduction Usage Performance Requirements and Installation FAQ Introduction Would

Tobias Hermann 927 Jan 05, 2023
Tutorial: Introduction to Graph Machine Learning, with Jupyter notebooks

GraphMLTutorialNLDL22 Tutorial NLDL22: Introduction to Graph Machine Learning, with Jupyter notebooks This tutorial takes place during the conference

UiT Machine Learning Group 3 Jan 10, 2022
High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features

CleanRL (Clean Implementation of RL Algorithms) CleanRL is a Deep Reinforcement Learning library that provides high-quality single-file implementation

Costa Huang 1.8k Jan 01, 2023
Code for "LoRA: Low-Rank Adaptation of Large Language Models"

LoRA: Low-Rank Adaptation of Large Language Models This repo contains the implementation of LoRA in GPT-2 and steps to replicate the results in our re

Microsoft 394 Jan 08, 2023
Example how to deploy deep learning model with aiohttp.

aiohttp-demos Demos for aiohttp project. Contents Imagetagger Deep Learning Image Classifier URL shortener Toxic Comments Classifier Moderator Slack B

aio-libs 661 Jan 04, 2023
Segmentation for medical image.

EfficientSegmentation Introduction EfficientSegmentation is an open source, PyTorch-based segmentation framework for 3D medical image. Features A whol

68 Nov 28, 2022
Semantic Edge Detection with Diverse Deep Supervision

Semantic Edge Detection with Diverse Deep Supervision This repository contains the code for our IJCV paper: "Semantic Edge Detection with Diverse Deep

Yun Liu 12 Dec 31, 2022
Specification language for generating Generalized Linear Models (with or without mixed effects) from conceptual models

tisane Tisane: Authoring Statistical Models via Formal Reasoning from Conceptual and Data Relationships TL;DR: Analysts can use Tisane to author gener

Eunice Jun 11 Nov 15, 2022
Code release for Local Light Field Fusion at SIGGRAPH 2019

Local Light Field Fusion Project | Video | Paper Tensorflow implementation for novel view synthesis from sparse input images. Local Light Field Fusion

1.1k Dec 27, 2022
PyTorch code for Composing Partial Differential Equations with Physics-Aware Neural Networks

FInite volume Neural Network (FINN) This repository contains the PyTorch code for models, training, and testing, and Python code for data generation t

Cognitive Modeling 20 Dec 18, 2022