spaCy plugin for Transformers, Udify, ELMo, etc.

Overview

Camphr - spaCy plugin for Transformers, Udify, ELMo, etc.


Camphr is a Natural Language Processing library that provides seamless integration of a wide variety of techniques, from state-of-the-art methods to conventional ones. You can use Transformers, Udify, ELMo, etc. on spaCy.

Check the documentation for more information.

(A Japanese introduction is available at https://qiita.com/tamurahey/items/53a1902625ccaac1bb2f.)

Features

  • A spaCy plugin - easy integration of a wide variety of methods
  • Transformers with spaCy - fine-tune pretrained models with Hydra; extract embedding vectors
  • Udify - a BERT-based multitask model covering 75 languages
  • ELMo - deep contextualized word representations
  • Rule-based matching with Aho-Corasick and regex
  • KNP (for Japanese)
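
For example, loading the pretrained Udify pipeline looks like this (a minimal sketch; it assumes the en_udify model package from the documentation is installed):

    import spacy

    # Load the pretrained Udify pipeline (the en_udify package from the
    # camphr_models releases).
    nlp = spacy.load("en_udify")
    doc = nlp("Mother Teresa devoted her entire life to helping others")
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_)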

License

Camphr is licensed under Apache 2.0.

Comments
  • NER Problem

    NER Problem

    Hello!

    First of all, thank you for the great work on Camphr - it has been very useful to me! Could you help me with a question? I used the library to train a named entity recognition (NER) model, but when I load the model with nlp = spacy.load("~/outputs/2020-04-30/22-28-36/models/9") and pass it a text (doc = nlp("I live in Brazil")), no entities are recognized (doc.ents is ()). Could you tell me why this is happening?
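
    For reference, a cleaned-up version of the reporter's snippet (the model path is the reporter's local Hydra output directory):

    import spacy

    # Load the fine-tuned NER model from the local Hydra output directory.
    nlp = spacy.load("~/outputs/2020-04-30/22-28-36/models/9")
    doc = nlp("I live in Brazil")
    print(doc.ents)  # () - no entities are recognized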

    opened by gabrielluz07 9
  • Gender and number subtags generation

    Gender and number subtags generation

    I was comparing the default morpho-syntactic tags generated by camphr-udify and https://github.com/Hyperparticle/udify.

    import spacy
    from spacy_conll import ConllFormatter

    nlp = spacy.load("en_udify")
    # Append a CoNLL formatter so the pipeline sets doc._.conll_str.
    conllformatter = ConllFormatter(nlp)
    nlp.add_pipe(conllformatter, last=True)

    doc = nlp("Mother Teresa devoted her entire life to helping others")
    print(doc._.conll_str)
    
    
    1	Mother	Mother	PROPN		_	2	compound	_	_
    2	Teresa	Teresa	PROPN		_	3	nsubj	_	_
    3	devoted	devote	VERB		_	0	root	_	_
    4	her	her	PRON		_	6	nmod:poss	_	_
    5	entire	entire	ADJ		_	6	amod	_	_
    6	life	life	NOUN		_	3	obj	_	_
    7	to	to	SCONJ		_	8	mark	_	_
    8	helping	help	VERB		_	3	advcl	_	_
    9	others	other	NOUN		_	8	obj	_	SpaceAfter=No
    
    

    Tags returned by https://github.com/Hyperparticle/udify, for the same input.

    1	Mother	Mother	PROPN	_	Number=Sing	2	compound	_	_
    2	Teresa	Teresa	PROPN	_	Number=Sing	3	nsubj	_	_
    3	devoted	devote	VERB	_	Mood=Ind|Tense=Past|VerbForm=Fin	0	root	_	_
    4	her	her	PRON	_	Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs	6	nmod:poss	_	_
    5	entire	entire	ADJ	_	Degree=Pos	6	amod	_	_
    6	life	life	NOUN	_	Number=Sing	3	obj	_	_
    7	to	to	SCONJ	_	_	8	mark	_	_
    8	helping	help	VERB	_	VerbForm=Ger	3	advcl	_	_
    9	others	other	NOUN	_	Number=Plur	8	obj	_	_

    Gender and number subtags are missing in camphr-udify. Could we have those included by default please?

    thanks, Ranjita

    enhancement 
    opened by ranjita-naik 6
  • Camphr+KNP returns an incorrect dependency tag when using a specific adposition.

    Camphr+KNP returns an incorrect dependency tag when using a specific adposition.

    Hello. I am reporting a problem that occurs when analyzing Universal Dependencies in Japanese text with KNP. When I use the adposition "から", Camphr returns the following wrong result (a conj dependency NOUN→VERB, where the expected result is an obl dependency VERB→NOUN).

    [Example 1] [Example 2]

    (Note that "再結晶", "留去" are the words I added manually, but other VERB words that existed in the original dictionary such as "除去", "撹拌" generates similarly incorrect results.) Same problems sometimes occur when using an adposition "と".

    With other adpositions, such as "より" and "にて", Camphr returns correct results.

    [Example 3] [Example 4]
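
    A minimal reproduction sketch (assuming the knp pipeline name from the Camphr documentation, with Juman++ and KNP installed; the sentence is a hypothetical example combining "から" with "除去"):

    import spacy

    # Assumes camphr's KNP pipeline as described in the docs
    # (requires Juman++ and KNP to be installed).
    nlp = spacy.load("knp")
    doc = nlp("混合物から溶媒を除去する")  # "remove the solvent from the mixture"
    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)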

    Environment:

    • Docker(python:3.7-buster)
    • spacy = 2.3.2
    • camphr = 0.6.5
    • pyknp = 0.4.5
    • Juman++ ver.1.02
    • KNP ver.4.19
    opened by undermakingbook 6
  • Python 3.8

    Python 3.8

    Camphr is currently pinned at python < 3.8. Is there a specific reason for this, and if so, what can we do to help?

    Edit: sorry, I just saw #19 - still, what can we do to help?

    opened by Evpok 5
  • Support multi labels textcat pipe for transformers

    Support multi labels textcat pipe for transformers

    closes #9

    • Add TrfForMultiLabelSequenceClassification for multi-label text classification.
      • pipe name: transformers_multilabel_sequence_classifier
    • Add docs for fine-tuning the multi-label textcat pipe
      • https://github.com/PKSHATechnology-Research/camphr/blob/feature%2Fmulti-textcat/docs/source/notes/finetune_transformers.rst#multilabel-text-classification
    enhancement 
    opened by tamuhey 5
  • unofficial-udify, allennlp,  and transformers  conflicting dependencies

    unofficial-udify, allennlp, and transformers conflicting dependencies

    I'm trying to install udify on WSL as shown below.

    $ pip install unofficial-udify==0.3.0 en_udify@https://github.com/PKSHATechnology-Research/camphr_models/releases/download/0.7.0/en_udify-0.7.tar.gz

    ERROR: Cannot install unofficial-udify and unofficial-udify==0.3.0 because these package versions have conflicting dependencies.

    The conflict is caused by:

        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.3.0 depends on transformers<4.1 and >=4.0
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.2 depends on transformers<3.6 and >=3.4
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.1 depends on transformers<3.5 and >=3.1
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.0 depends on transformers<3.5 and >=3.1
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.1.0 depends on transformers<3.1 and >=3.0

    Is this a known issue? Could you suggest a workaround, please?

    bug 
    opened by ranjita-naik 3
  • Missing tag information

    Missing tag information

    I noticed that the spaCy tag field is empty. Is this a known issue? It looks like Udify supports some level of UFeats tagging (https://universaldependencies.org/u/feat/index.html). I wonder if I'm supposed to be getting any of this in spaCy and have a bug in my setup, or if it just isn't implemented yet? Would it be sourced in token.tag, as I'm thinking (if it does exist)?
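
    A minimal check of the two attributes (assuming the en_udify model from the docs):

    import spacy

    nlp = spacy.load("en_udify")
    doc = nlp("Mother Teresa devoted her entire life to helping others")
    for token in doc:
        # pos_ holds the coarse UPOS tag; tag_ is the fine-grained tag
        # that appears to be empty here.
        print(token.text, token.pos_, repr(token.tag_))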

    I also noticed that displacy doesn't render the POS info. I am wondering if that is related?

    BTW, just have to say that this is awesome.

    opened by tslater 3
  • ImportError: cannot import name 'load_udify' from 'camphr.pipelines' following the example

    ImportError: cannot import name 'load_udify' from 'camphr.pipelines' following the example

    I followed the example here: https://camphr.readthedocs.io/en/latest/notes/udify.html

    I only saw the 0.7.0 model, so I went with that instead. Anyway, the German and English examples work great, but the Japanese one gives me this error:

    >>> from camphr.pipelines import load_udify
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: cannot import name 'load_udify' from 'camphr.pipelines' (/home/tyler/camphr/env/lib/python3.8/site-packages/camphr/pipelines/__init__.py)
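
    For comparison, the working English and German examples load the packaged models directly rather than importing a helper (a sketch, assuming en_udify is installed):

    import spacy

    # The packaged Udify models load via spacy.load; no load_udify import is needed.
    nlp = spacy.load("en_udify")
    doc = nlp("Mother Teresa devoted her entire life to helping others")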
    
    opened by tslater 3
  • doc.ents empty, doc.is_nered == False

    doc.ents empty, doc.is_nered == False

    I followed the documentation to fine-tune the bert-base-cased (en) NER model and then created a spaCy doc with the text "Bob Jones and Barack Obama went up the hill in Wisconsin.", but the resulting doc has doc.ents == () and doc.is_nered == False.

    Am I missing something?

    Thank you!

    opened by jack-rory-staunton 3
  • Improvement for サ変 of KNP

    Improvement for サ変 of KNP

    Inside _get_child_dep(c), the POS for 名詞,サ変名詞 is changed to VERB when it is followed by AUX. I therefore think that _get_dep(tag[0]) should be called after _get_child_dep(c).

    opened by KoichiYasuoka 3
  • Bump transformers from 3.0.2 to 4.1.1

    Bump transformers from 3.0.2 to 4.1.1

    Bumps transformers from 3.0.2 to 4.1.1.

    Release notes

    Sourced from transformers's releases.

    Patch release: better error message & invalid trainer attribute

    This patch release introduces:

    • A better error message when trying to instantiate a SentencePiece-based tokenizer without having SentencePiece installed. #8881
    • Fixes an incorrect attribute in the trainer. #8996

    Transformers v4.0.0: Fast tokenizers, model outputs, file reorganization

    Transformers v4.0.0-rc-1: Fast tokenizers, model outputs, file reorganization

    Breaking changes since v3.x

    Version v4.0.0 introduces several breaking changes that were necessary.

    1. AutoTokenizers and pipelines now use fast (Rust) tokenizers by default.

    The Python and Rust tokenizers have roughly the same API, but the Rust tokenizers have a more complete feature set. The main breaking change is the handling of overflowing tokens between the Python and Rust tokenizers.

    How to obtain the same behavior as v3.x in v4.x

    In version v3.x:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xxx")

    to obtain the same in version v4.x:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xxx", use_fast=False)

    2. SentencePiece is removed from the required dependencies

    The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.

    This includes the slow versions of:

    • XLNetTokenizer
    • AlbertTokenizer
    • CamembertTokenizer
    • MBartTokenizer
    • PegasusTokenizer
    • T5Tokenizer
    • ReformerTokenizer
    • XLMRobertaTokenizer
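
    In practice this means SentencePiece must now be installed separately before instantiating one of these slow tokenizers, e.g.:

    # pip install transformers sentencepiece
    from transformers import XLNetTokenizer

    # Without the sentencepiece package installed, this raises the
    # improved error message mentioned above.
    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")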

    How to obtain the same behavior as v3.x in v4.x

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 2
  • Bump certifi from 2021.5.30 to 2022.12.7 in /packages/camphr_pattern_search

    Bump certifi from 2021.5.30 to 2022.12.7 in /packages/camphr_pattern_search

    Bumps certifi from 2021.5.30 to 2022.12.7.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump numpy from 1.21.0 to 1.22.0 in /packages/camphr_pattern_search

    Bump numpy from 1.21.0 to 1.22.0 in /packages/camphr_pattern_search

    Bumps numpy from 1.21.0 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements; the highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.
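
    A quick illustration of the new quantile methods (the method keyword selects among the estimation methods added in 1.22):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])
    # NumPy 1.22 exposes the full set of quantile estimation methods
    # from the literature via the `method` keyword.
    print(np.quantile(a, 0.5, method="median_unbiased"))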

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)
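
    For instance:

    import numpy as np

    # Since 1.22 this raises TypeError; use the lowercase "uint32" instead.
    np.dtype("Uint32")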

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
Releases: 0.7.0