spaCy plugin for Transformers, Udify, ELMo, etc.

Overview

Camphr - spaCy plugin for Transformers, Udify, ELMo, etc.


Camphr is a Natural Language Processing library that provides seamless integration of a wide variety of techniques, from state-of-the-art to conventional ones. You can use Transformers, Udify, ELMo, etc. on spaCy.
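
Loading a pretrained pipeline works like loading any ordinary spaCy model. A minimal sketch, assuming the en_udify pipeline package (installed separately; it appears in the issues below) is available:

    import spacy

    # camphr pipelines load like regular spaCy models once installed
    nlp = spacy.load("en_udify")
    doc = nlp("Camphr connects state-of-the-art NLP models to spaCy.")
    for token in doc:
        print(token.text, token.pos_, token.dep_)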

Check the documentation for more information.

(For Japanese: https://qiita.com/tamurahey/items/53a1902625ccaac1bb2f)

Features

  • A spaCy plugin - Easy integration of a wide variety of methods
  • Transformers with spaCy - Fine-tuning pretrained models with Hydra; embedding vectors
  • Udify - BERT-based multitask model in 75 languages
  • ELMo - Deep contextualized word representations
  • Rule-based matching with Aho-Corasick and regex (a minimal flavor is sketched after this list)
  • (for Japanese) KNP
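
Camphr's Aho-Corasick matcher has its own pipe and API (see the documentation); as a minimal flavor of rule-based matching on spaCy, the sketch below uses spaCy's built-in PhraseMatcher rather than camphr's pipe:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank("en")
    # Match a fixed vocabulary case-insensitively
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("LIB", [nlp("camphr"), nlp("spaCy")])

    doc = nlp("Camphr is a spaCy plugin")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)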

License

Camphr is licensed under Apache 2.0.

Comments
  • NER Problem

    NER Problem

    Hello!

    First of all, I would like to thank you for the great work on the Camphr lib. It's been very useful to me! Can you help me with this doubt? I used the lib to train a named entity recognition (NER) model, but when I load the model using nlp = spacy.load("~/outputs/2020-04-30/22-28-36/models/9") and pass it a text (doc = nlp("I live in Brazil")), I don't get any recognized entities (doc.ents returns ()). Could you tell me why this is happening?
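
    A minimal diagnostic sketch for this situation (the path is the reporter's Hydra output directory; note that spacy.load does not expand "~", so it must be expanded explicitly):

    import os

    import spacy

    path = os.path.expanduser("~/outputs/2020-04-30/22-28-36/models/9")
    nlp = spacy.load(path)
    # If no NER pipe appears in this list, doc.ents can never be filled
    print(nlp.pipe_names)
    doc = nlp("I live in Brazil")
    print(doc.ents)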

    opened by gabrielluz07 9
  • Gender and number subtags generation

    Gender and number subtags generation

    I was comparing the default morpho-syntactic tags generated by camphr-udify and https://github.com/Hyperparticle/udify.

    import spacy
    from spacy_conll import ConllFormatter

    nlp = spacy.load("en_udify")
    conllformatter = ConllFormatter(nlp)
    # Append the formatter so doc._.conll_str gets populated
    nlp.add_pipe(conllformatter, last=True)

    doc = nlp("Mother Teresa devoted her entire life to helping others")
    print(doc._.conll_str)
    
    
    1	Mother	Mother	PROPN		_	2	compound	_	_
    2	Teresa	Teresa	PROPN		_	3	nsubj	_	_
    3	devoted	devote	VERB		_	0	root	_	_
    4	her	her	PRON		_	6	nmod:poss	_	_
    5	entire	entire	ADJ		_	6	amod	_	_
    6	life	life	NOUN		_	3	obj	_	_
    7	to	to	SCONJ		_	8	mark	_	_
    8	helping	help	VERB		_	3	advcl	_	_
    9	others	other	NOUN		_	8	obj	_	SpaceAfter=No
    
    

    Tags returned by https://github.com/Hyperparticle/udify, for the same input.

    prediction:  1  Mother  Mother  PROPN   _       Number=Sing     2       compound        _       _
    2       Teresa  Teresa  PROPN   _       Number=Sing     3       nsubj   _       _
    3       devoted devote  VERB    _       Mood=Ind|Tense=Past|VerbForm=Fin        0       root    _       _
    4       her     her     PRON    _       Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs   6       nmod:poss       _       _
    5       entire  entire  ADJ     _       Degree=Pos      6       amod    _       _
    6       life    life    NOUN    _       Number=Sing     3       obj     _       _
    7       to      to      SCONJ   _       _       8       mark    _       _
    8       helping help    VERB    _       VerbForm=Ger    3       advcl   _       _
    9       others  other   NOUN    _       Number=Plur     8       obj     _       _
    

    Gender and number subtags (the FEATS column) are missing in camphr-udify's output. Could those be included by default, please?

    thanks, Ranjita

    enhancement 
    opened by ranjita-naik 6
  • Camphr+KNP returns an incorrect dependency tag when using a specific adposition.

    Camphr+KNP returns an incorrect dependency tag when using a specific adposition.

    Hello. I am reporting a problem that occurs when analyzing Universal Dependencies in Japanese text using KNP. When I use the adposition “から”, Camphr returns the following wrong result (it shows the conj dependency tag on NOUN→VERB, where the expected result is the obl dependency tag on VERB→NOUN).

    [Example 1, Example 2: screenshots of the incorrect parses]

    (Note that "再結晶" and "留去" are words I added manually, but other VERB words that existed in the original dictionary, such as "除去" and "撹拌", generate similarly incorrect results.) The same problem sometimes occurs with the adposition "と".

    With other adpositions, such as “より” and “にて”, Camphr returns correct results.

    [Example 3, Example 4: screenshots of the correct parses]

    Environment:

    • Docker(python:3.7-buster)
    • spacy = 2.3.2
    • camphr = 0.6.5
    • pyknp = 0.4.5
    • Juman++ ver.1.02
    • KNP ver.4.19
    opened by undermakingbook 6
  • Python 3.8

    Python 3.8

    Camphr is currently pinned at Python < 3.8. Is there a specific reason for this, and if so, what can we do to help?

    Edit: sorry, I just saw #19. Still, what can we do to help?

    opened by Evpok 5
  • Support multi-label textcat pipe for transformers

    Support multi-label textcat pipe for transformers

    closes #9

    • Add TrfForMultiLabelSequenceClassification for multi-label text classification (usage sketched below)
      • pipe name: transformers_multilabel_sequence_classifier
    • Add docs for fine-tuning the multi-label textcat pipe
      • https://github.com/PKSHATechnology-Research/camphr/blob/feature%2Fmulti-textcat/docs/source/notes/finetune_transformers.rst#multilabel-text-classification
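
    A minimal usage sketch (the model path is hypothetical; doc.cats is spaCy's standard container for text-classification scores):

    import spacy

    # Load a pipeline fine-tuned with the multilabel sequence classifier
    nlp = spacy.load("./output/model")
    doc = nlp("some text to classify")
    # Unlike single-label textcat, several categories may score highly at once
    print(doc.cats)
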
    enhancement 
    opened by tamuhey 5
  • unofficial-udify, allennlp, and transformers conflicting dependencies

    unofficial-udify, allennlp, and transformers conflicting dependencies

    I'm trying to install udify on WSL as shown below.

    $ pip install unofficial-udify==0.3.0 en_udify@https://github.com/PKSHATechnology-Research/camphr_models/releases/download/0.7.0/en_udify-0.7.tar.gz

    ERROR: Cannot install unofficial-udify and unofficial-udify==0.3.0 because these package versions have conflicting dependencies.

    The conflict is caused by:

        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.3.0 depends on transformers<4.1 and >=4.0
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.2 depends on transformers<3.6 and >=3.4
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.1 depends on transformers<3.5 and >=3.1
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.0 depends on transformers<3.5 and >=3.1
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.1.0 depends on transformers<3.1 and >=3.0

    Is this a known issue? Could you suggest a workaround, please?

    bug 
    opened by ranjita-naik 3
  • Missing tag information

    Missing tag information

    I noticed that the spaCy tag field is empty. Is this a known issue? It looks like Udify supports some level of UFeats tagging (https://universaldependencies.org/u/feat/index.html)? I wonder if I'm supposed to be getting any of this in spaCy and have a bug in my setup, or if it just isn't implemented yet? Would it be sourced in token.tag like I'm thinking (if it does exist)?
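
    A quick sketch of the check in question (token.tag_ is spaCy's fine-grained tag attribute, which stays empty unless some pipe sets it):

    import spacy

    nlp = spacy.load("en_udify")
    doc = nlp("Mother Teresa devoted her entire life to helping others")
    for token in doc:
        # token.pos_ is the universal POS; token.tag_ the fine-grained tag
        print(token.text, token.pos_, token.tag_)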

    I also noticed that displacy doesn't render the POS info. I am wondering if that is related?

    BTW, just have to say that this is awesome.

    opened by tslater 3
  • ImportError: cannot import name 'load_udify' from 'camphr.pipelines' following the example

    ImportError: cannot import name 'load_udify' from 'camphr.pipelines' following the example

    I followed the example here: https://camphr.readthedocs.io/en/latest/notes/udify.html

    I only saw the 0.7.0 model, so I went with that instead. Anyway, the German and English examples work great, but the Japanese one gives me this error:

    >>> from camphr.pipelines import load_udify
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: cannot import name 'load_udify' from 'camphr.pipelines' (/home/tyler/camphr/env/lib/python3.8/site-packages/camphr/pipelines/__init__.py)
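
    A quick diagnostic sketch (not a fix) to see what the installed camphr version actually exports:

    import camphr.pipelines

    # List anything udify-related that this camphr version exposes
    print([name for name in dir(camphr.pipelines) if "udify" in name.lower()])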
    
    opened by tslater 3
  • doc.ents empty, doc.is_nered == False

    doc.ents empty, doc.is_nered == False

    I followed the documentation to fine-tune the bert-base-cased (en) NER model and then made a spaCy doc with the text "Bob Jones and Barack Obama went up the hill in Wisconsin.", but the resulting doc has doc.ents == () and doc.is_nered == False.

    Am I missing something?

    Thank you!

    opened by jack-rory-staunton 3
  • Improvement for サ変 of KNP

    Improvement for サ変 of KNP

    Inside _get_child_dep(c), the POS for 名詞,サ変名詞 is changed to VERB when it is followed by AUX. So I think that _get_dep(tag[0]) should be done after _get_child_dep(c), as sketched below.
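
    A minimal sketch of the proposed reordering (the two helpers are camphr's KNP-pipeline internals; the wrapper function and its signature are hypothetical, for illustration only):

    def resolve_dependencies(c, tag):
        # _get_child_dep may rewrite the POS of a サ変名詞 (verbal noun)
        # from NOUN to VERB when it is followed by AUX ...
        child_dep = _get_child_dep(c)
        # ... so the head dependency should be derived afterwards, against
        # the possibly-updated POS
        dep = _get_dep(tag[0])
        return dep, child_dep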

    opened by KoichiYasuoka 3
  • Bump transformers from 3.0.2 to 4.1.1

    Bump transformers from 3.0.2 to 4.1.1

    Bumps transformers from 3.0.2 to 4.1.1.

    Release notes

    Sourced from transformers's releases.

    Patch release: better error message & invalid trainer attribute

    This patch release introduces:

    • A better error message when trying to instantiate a SentencePiece-based tokenizer without having SentencePiece installed. #8881
    • Fixes an incorrect attribute in the trainer. #8996

    Transformers v4.0.0: Fast tokenizers, model outputs, file reorganization

    Transformers v4.0.0-rc-1: Fast tokenizers, model outputs, file reorganization

    Breaking changes since v3.x

    Version v4.0.0 introduces several breaking changes that were necessary.

    1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.

    The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set. The main breaking change is the handling of overflowing tokens between the python and rust tokenizers.

    How to obtain the same behavior as v3.x in v4.x

    In version v3.x:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xxx")

    To obtain the same in version v4.x:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xxx", use_fast=False)

    2. SentencePiece is removed from the required dependencies

    The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.

    This includes the slow versions of:

    • XLNetTokenizer
    • AlbertTokenizer
    • CamembertTokenizer
    • MBartTokenizer
    • PegasusTokenizer
    • T5Tokenizer
    • ReformerTokenizer
    • XLMRobertaTokenizer

    How to obtain the same behavior as v3.x in v4.x

    ... (truncated; in short, the SentencePiece dependency must be installed separately so that these slow tokenizers remain available)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 2
  • Bump certifi from 2021.5.30 to 2022.12.7 in /packages/camphr_pattern_search

    Bump certifi from 2021.5.30 to 2022.12.7 in /packages/camphr_pattern_search

    Bumps certifi from 2021.5.30 to 2022.12.7.


    dependencies 
    opened by dependabot[bot] 0
  • Bump numpy from 1.21.0 to 1.22.0 in /packages/camphr_pattern_search

    Bump numpy from 1.21.0 to 1.22.0 in /packages/camphr_pattern_search

    Bumps numpy from 1.21.0 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements; highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user-visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature (a small sketch follows below).
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.
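
    As a small illustration of the new quantile methods mentioned above (a sketch; "median_unbiased" is one of the estimators added in this release, selected via the method keyword):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])
    # NumPy 1.22 selects quantile estimators with `method` (formerly `interpolation`)
    print(np.quantile(a, 0.5, method="median_unbiased"))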

    The Python versions supported in this release are 3.8-3.10; Python 3.7 has been dropped. Note that 32-bit wheels are only provided for Python 3.8 and 3.9 on Windows; all other wheels are 64-bit, on account of Ubuntu, Fedora, and other Linux distributions dropping 32-bit support. All 64-bit wheels are also linked with 64-bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)
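
    For instance (a minimal sketch):

    import numpy as np

    # Deprecated numeric-style dtype strings now fail outright in 1.22
    np.dtype("Uint64")  # raises TypeError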

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)
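
    A sketch of the recommended replacement for the removed mafromtxt (usemask=True makes genfromtxt return a masked array):

    from io import StringIO

    import numpy as np

    # Missing fields come back as masked entries instead of raising
    data = np.genfromtxt(StringIO("1,2\n3,"), delimiter=",", usemask=True)
    print(data)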

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 0
Releases: 0.7.0