Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Overview

Polish Wordnet Python library

A simple, easy-to-use and reasonably fast library for working with Słowosieć (also known as PlWordNet), a lexico-semantic database of the Polish language. PlWordNet can also be browsed online.

I created this library because, since version 2.9, PlWordNet cannot be easily loaded into Python (for example with nltk), as it is only provided in a custom plwnxml format.

Usage

Load wordnet from an XML file (this will take about 20 seconds), and print basic statistics.

import plwordnet
wn = plwordnet.load('plwordnet_4_2.xml')
print(wn)

Expected output:

PlWordnet
  lexical units: 513410
  synsets: 353586
  relation types: 306
  synset relations: 1477849
  lexical relations: 393137

Find lexical units named leśny and print all relations in which that unit is in the subject/parent position.

for lu in wn.lemmas('leśny'):
    for s, p, o in wn.lexical_relations_where(subject=lu):
        print(p.format(s, o))

Expected output:

leśny.2 tworzy kolokację z polana.1
leśny.2 jest synonimem mpar. do las.1
leśny.3 przypomina las.1
leśny.4 jest derywatem od las.1
leśny.5 jest derywatem od las.1
leśny.6 przypomina las.1

Print all relation types and their ids:

for id, rel in wn.relation_types.items():
    print(id, rel.name)

Expected output:

10 hiponimia
11 hiperonimia
12 antonimia
13 konwersja
...

Installation

Note: plwordnet requires Python 3.7 or newer.

pip install plwordnet

Version support

This library should be able to read future versions of PlWordNet without modification, even if more relation types are added. Still, if you use this library with a version of PlWordNet that is not listed below, please consider reporting whether it is supported.

  • PlWordNet 4.2
  • PlWordNet 4.0
  • PlWordNet 3.2
  • PlWordNet 3.0
  • PlWordNet 2.3
  • PlWordNet 2.2
  • PlWordNet 2.1

Documentation

See plwordnet/wordnet.py for RelationType, Synset and LexicalUnit class definitions.

Package functions

  • load(source): Reads PlWordNet, where source is a path to the wordnet XML file, or a path to the pickled wordnet object. Passed paths can point to files compressed with gzip or lzma.

Wordnet instance properties

  • lexical_relations: List of (subject, predicate, object) triples
  • synset_relations: List of (subject, predicate, object) triples
  • relation_types: Mapping from relation type id to object
  • lexical_units: Mapping from lexical unit id to unit object
  • synsets: Mapping from synset id to object
  • (lexical|synset)_relations_(s|o|p): Mapping from id of subject/object/predicate to a set of matching lexical unit/synset relation ids
  • lexical_units_by_name: Mapping from lexical unit name to a set of matching lexical unit ids
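
The per-position mappings listed above suggest how triple queries can avoid scanning the whole relation list. The following is a minimal sketch on toy data, not the library's code: the triples, the index names (by_s, by_p, by_o) and relations_where are all illustrative assumptions.

```python
from collections import defaultdict

# Toy (subject id, predicate id, object id) triples; the ids are made up.
relations = [(1, 11, 2), (1, 12, 3), (4, 11, 2), (2, 11, 5)]

# One index per position: id -> set of positions in `relations`.
by_s, by_p, by_o = defaultdict(set), defaultdict(set), defaultdict(set)
for i, (s, p, o) in enumerate(relations):
    by_s[s].add(i)
    by_p[p].add(i)
    by_o[o].add(i)


def relations_where(subject=None, predicate=None, obj=None):
    """Return the triples that match every argument that is not None."""
    candidates = None
    for index, key in ((by_s, subject), (by_p, predicate), (by_o, obj)):
        if key is None:
            continue
        hits = index.get(key, set())
        candidates = hits if candidates is None else candidates & hits
    if candidates is None:  # no filter given: return everything
        return list(relations)
    return [relations[i] for i in sorted(candidates)]


print(relations_where(predicate=11, obj=2))  # [(1, 11, 2), (4, 11, 2)]
```

Intersecting small sets keeps the cost proportional to the number of matches for the most selective argument rather than to the total number of relations.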

Wordnet methods

  • lemmas(value): Returns a list of LexicalUnit objects whose name equals value.
  • lexical_relations_where(subject, predicate, object): Returns lexical relation triples with a matching subject, predicate and/or object. The subject, predicate and object arguments can be integer ids or LexicalUnit and RelationType objects.
  • synset_relations_where(subject, predicate, object): Returns synset relation triples with a matching subject, predicate and/or object. The subject, predicate and object arguments can be integer ids or Synset and RelationType objects.
  • dump(dst): Pickles the Wordnet object to opened file dst or to a new file with path dst.

RelationType methods

  • format(x, y, short=False): Substitutes x and y into the relation type's display format. If short is set, x and y are instead separated by the relation's short name (shortcut).
Comments
  • Fix for abstract attribute bug, MAJOR speedup of synset_relations_where

    Hi Max.

    I've fixed the bug related to the abstract attribute of the synset (it was always True, because bool("non-empty-string") is always True).

    I've also sped up synset_relations_where by 3-4 orders of magnitude.
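
The truthiness pitfall mentioned in this comment is easy to reproduce. A minimal sketch, assuming the attribute arrives as the string "true" or "false" (the exact values in the XML are an assumption, and parse_bool is a hypothetical helper rather than the library's code):

```python
def parse_bool(text):
    """Interpret an attribute string as a boolean instead of relying on truthiness."""
    return text.strip().lower() == "true"


print(bool("false"))        # True: any non-empty string is truthy
print(parse_bool("false"))  # False
print(parse_bool("true"))   # True
```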

    opened by dchaplinsky 7
  • Exposing relations in Wordnet class

    This might be a bit of overkill, but it has two advantages.

    The first is: [image]

    The other is that you can rewrite code like this:

    def path_to_top(synset):
        spo = []
        for rel in [11, 107, 171, 172, 199, 212, 213]:
    

    with meaningful names, not numbers
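
One way to get names instead of numbers is to invert the id-to-type mapping. This is only a sketch on stand-in data: Rel is a hypothetical substitute for RelationType, the four entries are the ones printed in the README, and it assumes relation names are unique.

```python
from collections import namedtuple

# Stand-in for the library's RelationType; only `name` is used here.
Rel = namedtuple("Rel", "name")
relation_types = {
    10: Rel("hiponimia"),
    11: Rel("hiperonimia"),
    12: Rel("antonimia"),
    13: Rel("konwersja"),
}

# Invert id -> type into name -> id (assumes names are unique).
ids_by_name = {rel.name: id_ for id_, rel in relation_types.items()}

upward = [ids_by_name[name] for name in ("hiperonimia",)]
print(upward)  # [11]
```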

    opened by dchaplinsky 3
  • Domains dict

    I've used Wikipedia (https://en.wikipedia.org/wiki/PlWordNet) to decipher 45 of the 54 domains listed on Słowosieć.

    There might be more: [image] for example, zwz

    Can you try to decipher the rest? My Polish isn't too good (yet ))

    opened by dchaplinsky 3
  • WIP: hypernyms/hyponyms/hypernym_paths routines for WordNet class

    So, here is my attempt. I've used the standard Python stack for now; I'll let you know if it causes any problems.

    I've tested it on Africa/Afryka with different combinations, and all looked sane to me:

    for lu in wn.find("Afryka"):
        for i, pth in enumerate(wn.hypernym_paths(lu.synset, full_searh=True, interlingual=True)):
            print(f"{i + 1}: " + "->".join(str(s) for s in pth))
    

    gave me

    1: {kontynent.2}->{ląd.1 ziemia.4}->{obszar.1 rejon.3 obręb.1}->{przestrzeń.1}
    2: {kontynent.2}->{ląd.1 ziemia.4}->{obszar.1 rejon.3 obręb.1}->{location.1}->{object.1 physical object.1}->{physical entity.1}->{entity.1}
    3: {kontynent.2}->{ląd.1 ziemia.4}->{land.4 dry land.1 earth.3 ground.1 solid ground.1 terra firma.1}->{object.1 physical object.1}->{physical entity.1}->{entity.1}
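
For readers unfamiliar with the idea, enumerating hypernym paths amounts to a depth-first walk upward that records one path per root reached. The sketch below uses an explicit stack and a toy graph; it is not the code from this pull request, and the graph, labels and function body are illustrative assumptions.

```python
# Toy graph: synset label -> direct hypernyms (labels are illustrative).
hypernyms = {
    "Afryka": ["kontynent"],
    "kontynent": ["ląd"],
    "ląd": ["obszar", "land"],
    "obszar": ["przestrzeń"],
    "land": ["object"],
    "object": ["entity"],
}


def hypernym_paths(start):
    """Return every path from a direct hypernym of `start` up to a root."""
    paths = []
    stack = [(parent, [parent]) for parent in hypernyms.get(start, [])]
    while stack:
        node, path = stack.pop()
        # Ignore parents already on the path so a cycle cannot loop forever.
        parents = [p for p in hypernyms.get(node, []) if p not in path]
        if not parents:
            paths.append(path)  # nothing above this node: the path is complete
        for parent in parents:
            stack.append((parent, path + [parent]))
    return paths


for i, path in enumerate(sorted(hypernym_paths("Afryka")), start=1):
    print(f"{i}: " + "->".join(path))
```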
    

    Sorry, I accidentally ran black on your file, so now it has more changes than expected. The important one, though, is this:

    +        # For cases like Instance_Hypernym/Instance_Hyponym
    +        for rel in self.relation_types.values():
    +            if rel.inverse is not None and rel.inverse.inverse is None:
    +                rel.inverse.inverse = rel
    
    opened by dchaplinsky 1
  • Question: hypernym/hyponym tree traversal and export

    Hello.

    The next logical step for me is to implement tree traversal and data export. For tree traversal I'd try to stick to the following algorithm:

    • Find the true top-level hypernyms for English and Polish (no interlingual hypernymy)
    • Calculate number of leaves under each top level hypernym (and/or number of LUs under it)
    • For each node calculate the distance from top-level hypernym
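
The three steps above can be sketched on a toy tree as follows. This is only an illustration: the graph and function names are assumptions, and count_leaves treats the data as a tree, so a node with several parents would be counted once per parent.

```python
from collections import deque

# Toy tree: node -> direct hyponyms (labels are illustrative).
hyponyms = {
    "entity": ["object", "abstraction"],
    "object": ["land", "artifact"],
    "land": [],
    "artifact": [],
    "abstraction": [],
}

# Step 1: top-level nodes never appear as anyone's hyponym.
children = {kid for kids in hyponyms.values() for kid in kids}
roots = [node for node in hyponyms if node not in children]


# Step 2: number of leaves under a node (a leaf counts itself).
def count_leaves(node):
    kids = hyponyms.get(node, [])
    return 1 if not kids else sum(count_leaves(kid) for kid in kids)


# Step 3: breadth-first distance of every node from a root.
def depths(root):
    dist, queue = {root: 0}, deque([root])
    while queue:
        node = queue.popleft()
        for kid in hyponyms.get(node, []):
            if kid not in dist:
                dist[kid] = dist[node] + 1
                queue.append(kid)
    return dist


print(roots)                   # ['entity']
print(count_leaves("entity"))  # 3
print(depths("entity")["land"])  # 2
```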

    To export, I'd like to use the information above and pass some callables for filtering, so that only particular nodes/rels are exported. For example, I only need the first 3-4 levels of the noun trees that have more than X leaves. This way I'll be able to export and visualize only the parts of the trees I need.

    Speaking of export, I'm looking into graphviz (basically to lay the top-level ontology out on paper) and TTL, in a format similar to the original PWN TTL export.

    I'd like to have your opinion on two things:

    • General approach
    • How to incorporate that into the code. It might be part of the Wordnet class, a separate file (maybe under a contrib section), a usage example, or a separate script which I/we do or don't publish at all
    opened by dchaplinsky 1
  • Separate file and classes for domains, support for bz2 in load helper

    Hi Max. I've slightly cleaned up your spreadsheet on domains (replaced TODO and dashes with None and made the POS tags compatible with the UD POS tagset) and wrapped everything into classes. I've also made two rows out of cwytw / cwyt and moved the pl description of adj/adv into the English one. I made the en fields the defaults for the str method.

    It's up to you whether to replace str domains in LexicalUnit with instances of the Domain class, as it's still OK to compare a Domain to a str.

    I've also added support for bz2 in the loader helper.

    opened by dchaplinsky 1
  • Include sentiment annotations

    PlWordNet 4.2 comes with a supplementary file (słownik_anotacji_emocjonalnej.csv) containing sentiment annotations for lexical units. Users should be able to load and access sentiment data.

    enhancement 
    opened by maxadamski 1
  • Parse the description format

    Currently, nothing is done with the description field in Synset and LexicalUnit. Information about the description format comes in PlWordNet's readme.

    Parsing should be done lazily to avoid slowing down the initial loading of PlWordNet into memory.

    Example description:

    ##K: og. ##D: owoc (wielopestkowiec) jabłoni. [##P: Jabłka są kształtem zbliżone do kuli, z zagłębieniem na szczycie, z którego wystaje ogonek.] {##L: http://pl.wikipedia.org/wiki/Jab%C5%82ko}
    

    Desired behavior:

    A new (memoized) method rich_description returns the following dict:

    dict(
      qualifier='og.',
      definition='owoc (wielopestkowiec) jabłoni.',
      examples=['Jabłka są kształtem zbliżone do kuli, z zagłębieniem na szczycie, z którego wystaje ogonek'],
      sources=['http://pl.wikipedia.org/wiki/Jab%C5%82ko'])
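
A possible starting point for such a method is sketched below. It handles only the four markers visible in the example above (##K, ##D, [##P: ...], {##L: ...}); the full format described in PlWordNet's readme may contain more, and the regular expressions are assumptions. The sketch keeps matched text verbatim, so the example sentence retains its final period. The function is written as a plain memoized function rather than a method.

```python
import re
from functools import lru_cache

# Patterns for the markers seen in the example; other markers are not handled.
_QUALIFIER = re.compile(r"##K:\s*(.*?)\s*(?=##D:|\[##|\{##|$)")
_DEFINITION = re.compile(r"##D:\s*(.*?)\s*(?=\[##|\{##|$)")
_EXAMPLE = re.compile(r"\[##P:\s*(.*?)\s*\]")
_SOURCE = re.compile(r"\{##L:\s*(.*?)\s*\}")


def _first(pattern, text):
    """Return the first captured group, or None when the marker is absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


@lru_cache(maxsize=None)  # parse lazily, at most once per distinct description
def rich_description(description):
    # Note: the cached dict is shared between calls, so callers should not mutate it.
    return dict(
        qualifier=_first(_QUALIFIER, description),
        definition=_first(_DEFINITION, description),
        examples=_EXAMPLE.findall(description),
        sources=_SOURCE.findall(description),
    )


example = (
    "##K: og. ##D: owoc (wielopestkowiec) jabłoni. "
    "[##P: Jabłka są kształtem zbliżone do kuli, z zagłębieniem na szczycie, "
    "z którego wystaje ogonek.] {##L: http://pl.wikipedia.org/wiki/Jab%C5%82ko}"
)
print(rich_description(example)["definition"])  # owoc (wielopestkowiec) jabłoni.
```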
    
    enhancement 
    opened by maxadamski 1
Releases(0.1.5)
Owner
Max Adamski
Student of AI @ PUT