🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

Last update: Jan 08, 2023

Overview

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

This package provides spaCy components and architectures to use transformer models via Hugging Face's transformers in spaCy. The result is convenient access to state-of-the-art transformer architectures, such as BERT, GPT-2, XLNet, etc.

This release requires spaCy v3. For the previous version of this library, see the v0.6.x branch.

Features

Use pretrained transformer models like BERT, RoBERTa and XLNet to power your spaCy pipeline.
Easy multi-task learning: backprop to one transformer model from several pipeline components.
Train using spaCy v3's powerful and extensible config system.
Automatic alignment of transformer output to spaCy's tokenization.
Easily customize what transformer data is saved in the Doc object.
Easily customize how long documents are processed.
Out-of-the-box serialization and model packaging.

🚀 Installation

Installing the package from pip will automatically install all dependencies, including PyTorch and spaCy. Make sure you install this package before you install the models. Also note that this package requires Python 3.6+, PyTorch v1.5+ and spaCy v3.0+.

pip install spacy[transformers]

For GPU installation, find your CUDA version using nvcc --version and add the version in brackets, e.g. spacy[transformers,cuda92] for CUDA 9.2 or spacy[transformers,cuda100] for CUDA 10.0.
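The naming pattern in the two examples above can be sketched as a small helper. This is only an illustration: the function name spacy_cuda_extra is made up here, and the rule "cuda" plus major and minor digits is inferred from the two examples in the text. It does not check whether an extra actually exists for a given CUDA version.

```python
def spacy_cuda_extra(cuda_version: str) -> str:
    """Build a pip requirement such as 'spacy[transformers,cuda92]'.

    Assumption: the extra is called 'cuda' followed by the major and minor
    version digits, as in the two examples from the text (9.2 -> cuda92,
    10.0 -> cuda100). Nothing here verifies that such an extra is published.
    """
    parts = cuda_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"expected a version like '10.0', got {cuda_version!r}")
    major, minor = parts[0], parts[1]
    return f"spacy[transformers,cuda{major}{minor}]"


print(spacy_cuda_extra("9.2"))   # spacy[transformers,cuda92]
print(spacy_cuda_extra("10.0"))  # spacy[transformers,cuda100]
```

The resulting string is what you would pass to pip install, quoted if your shell treats square brackets specially.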

If you are having trouble installing PyTorch, follow the instructions on the official website for your specific operating system and requirements, or try the following:

pip install spacy-transformers -f https://download.pytorch.org/whl/torch_stable.html

📖 Documentation

⚠️ Important note: This package has been extensively refactored to take advantage of spaCy v3.0. Previous versions that were built for spaCy v2.x worked considerably differently. Please see previous tagged versions of this README for documentation on prior versions.

📘 Embeddings, Transformers and Transfer Learning: How to use transformers in spaCy
📘 Training Pipelines and Models: Train and update components on your own data and integrate custom models
📘 Layers and Model Architectures: Power spaCy components with custom neural networks
📗 Transformer: Pipeline component API reference
📗 Transformer architectures: Architectures and registered functions

Comments

Use ModelOutput instead of tuples
Save model output as ModelOutput instead of a list of tensors in TransformerData.model_output and FullTransformerBatch.model_output.

For backwards compatibility with transformers v3 set return_dict = True in the transformer config.

TransformerData.tensors and FullTransformerBatch.tensors return ModelOutput.to_tuple().

Store any additional model output as ModelOutput in TransformerData.model_output.

Save all torch.Tensor and tuple(torch.Tensor) values in TransformerData.model_output for cases where tensor.shape[0] is the batch size so that it's possible to slice the output for individual docs.

Includes: pooler_output, hidden_states, attentions, and cross_attentions
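The per-doc slicing idea described above can be illustrated without PyTorch. In this rough sketch, nested Python lists stand in for torch.Tensor, len(value) plays the role of tensor.shape[0], and the function name slice_model_output is hypothetical. It is not the library's implementation.

```python
from typing import Any, Dict


def slice_model_output(
    output: Dict[str, Any], batch_size: int, start: int, end: int
) -> Dict[str, Any]:
    """Keep only values batched along axis 0 and slice rows start:end.

    Lists stand in for tensors, tuples of lists for tuple(torch.Tensor).
    """

    def is_batched(value: Any) -> bool:
        return isinstance(value, list) and len(value) == batch_size

    sliced: Dict[str, Any] = {}
    for key, value in output.items():
        if isinstance(value, tuple):
            # e.g. hidden_states: one entry per layer, each batched on axis 0
            if all(is_batched(layer) for layer in value):
                sliced[key] = tuple(layer[start:end] for layer in value)
        elif is_batched(value):
            sliced[key] = value[start:end]
        # Anything else (for example a scalar) cannot be split per doc
        # and is left out.
    return sliced


batch = {
    "last_hidden_state": [[0.1], [0.2], [0.3]],           # 3 rows in the batch
    "hidden_states": ([[1], [2], [3]], [[4], [5], [6]]),  # 2 layers, 3 rows each
    "loss": 0.5,                                          # not batched
}
# Suppose the last row of the batch belongs to the second doc.
print(slice_model_output(batch, batch_size=3, start=2, end=3))
```

Values whose first dimension is not the batch size are dropped rather than guessed at, which mirrors the condition stated above.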

Re-enable tests for gpt2 and xlnet in the CI.

Following #285, include some minor modifications and bug fixes for HFShim and HFObjects

Rename the temporary init-only configs in HFObjects and don't serialize them in HFShim once the model is initialized

enhancement v1.1
opened by adrianeboyd 14
Add support for mixed-precision training
This change makes it possible to use and configure the support for mixed-precision training that was added to thinc.

Example configuration:

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = true
enhancement perf / memory perf / speed
opened by danieldk 7
HFShim: support MPS device
Before this change, two devices and map locations were supported:

CUDA: cuda:N

CPU: cpu

This change adds support for other devices like Metal Performance Shader (MPS) devices by mapping to the active Torch device.
opened by danieldk 6

added transformers_config for passing arguments to the transformer

Added transformers_config to allow the user to pass arguments to the transformer's forward pass, most notably output_attentions.

For convenience, I used this example to test the code:

import spacy

nlp = spacy.blank("en")

# Construction via add_pipe with custom config
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v1",
        "name": "bert-base-uncased",
        "transformers_config": {"output_attentions": True},
    }
}
transformer = nlp.add_pipe(
    "transformer", config=config)
transformer.model.initialize()


doc = nlp("This is a sentence.")

which gives you:

len(doc._.trf_data.attention) # 12
doc._.trf_data.attention[-1].shape  # (1, 12, 7, 7) last layer of attention 
len(doc._.trf_data.tensors) # 2 
doc._.trf_data.tensors[0].shape # (1, 7, 768) <-- wordpiece embedding
doc._.trf_data.tensors[1].shape  # (1, 768) <-- assuming this is the pooled embedding?

Sidenote: it took me quite a while to find the default config str. It might be ideal to make this into a standalone file and load it in?

enhancement

opened by KennethEnevoldsen 6

Add kwargs features of `from_pretrained()` in HFShim.from_bytes()

We're trying to pass some parameters, such as use_fast and trust_remote_code, to AutoTokenizer.from_pretrained() via kwargs in order to use custom tokenizers in spacy-transformers. In the current implementation of spacy-transformers, HFShim.from_bytes() does not apply kwargs to either AutoConfig.from_pretrained() or AutoTokenizer.from_pretrained(), while huggingface_from_pretrained() applies kwargs to both of them to create a transformer model.

https://github.com/huggingface/transformers/blob/dcb08b99f44919425f8ba9be9ddcc041af8ec25e/src/transformers/models/auto/tokenization_auto.py
https://github.com/huggingface/transformers/blob/dcb08b99f44919425f8ba9be9ddcc041af8ec25e/src/transformers/models/auto/configuration_auto.py#L679
https://github.com/explosion/spacy-transformers/blob/cab060701fe2bad0c9ae23a822249c9bebb56da7/spacy_transformers/layers/hf_shim.py#L98
https://github.com/explosion/spacy-transformers/blob/cab060701fe2bad0c9ae23a822249c9bebb56da7/spacy_transformers/layers/transformer_model.py#L256

In this pull request, we take _init_tokenizer_config and _init_transformer_config from the msg passed by the deserializer and use them as the kwargs of from_pretrained().

Another possible approach is to write these kwargs in the transformer/cfg.

opened by hiroshi-matsuda-rit 4
Convert token identifiers to Long for torch < 1.8.0

Since PyTorch 1.8.0 token identifiers can be both Int and Long for embedding lookups. Prior versions only support Long. Since we still support older versions, convert token identifiers to Long for compatibility.
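The version boundary described above can be expressed as a small check. This is a sketch only: the helper name embedding_accepts_int_ids is hypothetical, and the text does not say whether the actual change gates on the version or converts unconditionally. With PyTorch installed, the conversion itself would be a call such as token_ids.long().

```python
import re


def embedding_accepts_int_ids(torch_version: str) -> bool:
    """Return True if this PyTorch version accepts Int as well as Long ids.

    Per the text, that is the case from 1.8.0 onwards. Version strings such
    as '1.7.1+cu110' are handled by reading only the leading major.minor.
    """
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognised version string: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 8)


for version in ("1.7.1+cu110", "1.8.0", "1.13.1"):
    needs_long = not embedding_accepts_int_ids(version)
    print(version, "-> convert ids to Long" if needs_long else "-> Int or Long both work")
```

Converting to Long unconditionally is also safe on newer versions, which may be why a blanket conversion is the simpler choice.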

This fixes the incompatibility with older versions introduced in #289.

opened by danieldk 4
Fixing Transformer IO

Note: some tests are adjusted in this PR. This was done first, before implementing the code changes, as we were aware that the initialize statements shouldn't be there, cf https://github.com/explosion/spaCy/issues/8319 and https://github.com/explosion/spaCy/issues/8566.

Description

Before this PR, the HF Transformer model was loaded through set_pytorch_transformer (stored in model.attrs["set_transformer"]), but this happened in the initialize call of TransformerModel. Unfortunately, this meant that saving/loading a transformer-based pipeline was kind of broken, as you needed to call initialize on a previously trained pipeline, which isn't the normal spaCy API. This also broke the from_bytes / to_bytes typical API.

Furthermore, the from_disk / to_disk functionality worked with a "listener" transformer, because the transformer pipeline component saved out the PyTorch files directly. However, this solution did not work for the "inline" Transformer, because the TransformerModel would be used directly and not via the pipeline component.

I've been looking at various different solutions, but the proposal in this PR is the only one that I got working for all use-cases: basically we need to load/define the transformer model in the constructor as we do for any other spaCy component.

Unfortunately, with this proposal, old code that still contains the initialize calls will crash, complaining that the transformer model can't be set twice. I think this is actually the correct behaviour in the end, but it might break some people's code. The fix is obvious and easy, though.

But we'll have to discuss which version bump we want to do when releasing this.

Fixes https://github.com/explosion/spaCy/issues/8319 Fixes https://github.com/explosion/spaCy/issues/8566
bug feat / serialize

opened by svlandeg 4
IO for transformer component
IO

Been going in circles a bit with this, trying to puzzle it into the IO mechanisms we decided on for the config refactor for spaCy 3 ...

This PR: Transformer(Pipe) knows how to do to_disk and from_disk and stores the internal tokenizer and transformer objects using the standard Hugging Face transformers IO mechanisms. In the nlp/transformer output directory, this results in a folder named model containing the following files:

config.json

pytorch_model.bin

special_tokens_map.json

tokenizer_config.json

vocab.txt

This folder can be read using the spacy.TransformerFromFile.v1 architecture for the model, and then calling from_disk on the pipeline component (which happens automatically when reading the nlp object from a config).
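A quick way to sanity-check such a folder is to compare it against the file names listed above. The sketch below uses only the standard library. The helper name missing_model_files and the example path are hypothetical, and the file list is taken from this PR description only: vocab.txt is specific to BERT-style tokenizers, and other tokenizers or later versions may write different files.

```python
from pathlib import Path
from typing import List

# File names as listed in the PR description above (an assumption for any
# other model type or library version).
EXPECTED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.txt",
)


def missing_model_files(model_dir) -> List[str]:
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]


# Hypothetical location of the saved component; adjust to your output path.
print(missing_model_files("output/transformer/model"))
```

An empty list means every listed file was found. A non-existent directory simply reports all names as missing.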

If users want to download a model by using architecture spacy.TransformerByName.v2, then when calling nlp.to_disk, we need to do a little hack rewriting that architecture to the one from file. This is done by directly modifying nlp.config when the component is created with from_nlp. This feels hacky, but not sure how else to prevent multiple downloads.

Other fixes

fixed the config files to be up-to-date with the latest version of the v3 branch

moved install_extensions to the init of the transformer pipe, where I think it makes more sense. Added force=True to prevent warnings/errors when calling it multiple times (I don't think that matters?)

enhancement feat / serialize
opened by svlandeg 4
WIP: Fix IO after init_model

To create a distilbert-base-german-cased model with the init_model.py script, I had to add two additional fields to the serialization code.

Fixes #117
bug

opened by svlandeg 4
Transformer: add update_listeners_in_predict option
Draft: still needs docs, but I first wanted to discuss this proposal.

The output of a transformer is passed through in two different ways:

Prediction: the data is passed through the Doc._.trf_data attribute.

Training: the data is broadcast directly to the transformer's listeners.

However, the Transformer.predict method breaks the strict separation between training and prediction by also broadcasting transformer outputs to its listeners. This was added (I think) to support training with a frozen transformer.

However, this breaks down when we are training a model with an unfrozen transformer when the transformer is also in annotating_components. The transformer will first (as part of its update step) broadcast the tensors and backprop function to its listeners. However, then when acting as an annotating component, it would immediately override its own output and clear the backprop function. As a result, gradients will not flow into the transformer.

This change fixes this issue by adding the update_listeners_in_predict option, which is enabled by default. When this option is disabled, the tensors will not be broadcast to listeners in predict.
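The problem and the effect of the option can be reproduced with a toy model. The classes below (ToyListener, ToyTransformer) are stand-ins written for this illustration and are not the library's classes. They only mimic the broadcast behaviour described above, with strings in place of tensors.

```python
class ToyListener:
    """Stand-in for a listener: remembers the last batch it was sent."""

    def __init__(self):
        self.outputs = None
        self.backprop = None

    def receive(self, outputs, backprop):
        self.outputs = outputs
        self.backprop = backprop


class ToyTransformer:
    """Stand-in for the transformer component's update/predict broadcast."""

    def __init__(self, listeners, update_listeners_in_predict=True):
        self.listeners = listeners
        self.update_listeners_in_predict = update_listeners_in_predict

    def _forward(self, docs):
        return [f"trf({doc})" for doc in docs]

    def update(self, docs):
        outputs = self._forward(docs)
        backprop = lambda grads: grads  # placeholder for the real backward pass
        for listener in self.listeners:
            listener.receive(outputs, backprop)

    def predict(self, docs):
        outputs = self._forward(docs)
        if self.update_listeners_in_predict:
            for listener in self.listeners:
                # Prediction has no backward pass, so this overwrites the
                # backprop callback stored during update().
                listener.receive(outputs, None)
        return outputs


def gradients_can_flow(update_listeners_in_predict: bool) -> bool:
    listener = ToyListener()
    trf = ToyTransformer([listener], update_listeners_in_predict)
    docs = ["doc one", "doc two"]
    trf.update(docs)   # training step
    trf.predict(docs)  # same component acting as an annotating component
    return listener.backprop is not None


print(gradients_can_flow(True))   # False: predict cleared the backprop callback
print(gradients_can_flow(False))  # True: the callback from update() survives
```

With the option enabled, the second broadcast replaces the callback with nothing, which is the "gradients will not flow" situation. With it disabled, the listener keeps what it received during the update step.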

Alternatives considered:

Yanking the listener code from predict: breaks our current semantics, would make it harder to train with a frozen transformer.

Checking in the listener whether the tensors that we are receiving have the same batch ID as the ones we already have, and not updating if we already have the same batch with a backprop function. I thought this was a bit fragile, because it breaks when batching differs between training and prediction (?).

bug feat / pipeline
opened by danieldk 3
Support offset mapping alignment for fast tokenizers
Switch to offset mapping-based alignment for fast tokenizers. With this change, slow vs. fast tokenizers will not give identical results with spacy-transformers.
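To make the idea of offset-based alignment concrete: Hugging Face fast tokenizers can return character offsets for each wordpiece (return_offsets_mapping=True), and spaCy tokens also carry character offsets, so the two can be matched by overlap. The sketch below is a plain-Python illustration with a made-up wordpiece split. The function name align_by_offsets is hypothetical, and this is not the package's compiled implementation.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def align_by_offsets(
    token_spans: List[Span], wordpiece_spans: List[Span]
) -> List[List[int]]:
    """For each token, list the indices of the wordpieces that overlap it.

    Zero-length spans such as (0, 0), which fast tokenizers typically report
    for special tokens, never overlap anything and are skipped automatically.
    """
    alignment = []
    for tok_start, tok_end in token_spans:
        overlapping = [
            i
            for i, (wp_start, wp_end) in enumerate(wordpiece_spans)
            if wp_start < tok_end and tok_start < wp_end
        ]
        alignment.append(overlapping)
    return alignment


# Text: "unbelievable!"
tokens = [(0, 12), (12, 13)]  # "unbelievable", "!"
# Illustrative split: [CLS], "un", "##believ", "##able", "!", [SEP]
wordpieces = [(0, 0), (0, 2), (2, 8), (8, 12), (12, 13), (0, 0)]

print(align_by_offsets(tokens, wordpieces))  # [[1, 2, 3], [4]]
```

Because slow tokenizers do not provide these offsets, they need a different alignment strategy, which is one reason the two kinds of tokenizer may no longer give identical results.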

Additional modifications:

Update package setup for cython

Update CI for compiled package

feat / alignment
opened by adrianeboyd 3
Add test for textcat CNN issue
This is a test demonstrating the issue in https://github.com/explosion/spaCy/issues/11968. A potential fix is being worked on in https://github.com/explosion/thinc/pull/820.

In its current condition, the test just creates a pipeline with textcat and transformer, creates a minimal doc and calls nlp.initialize. As described in the spaCy issue, that fails with this error:

ValueError: Cannot get dimension 'nI' for model 'linear': value unset

This will be left in draft until the fix is clarified.
tests
opened by polm 7
Transformer.predict: do not broadcast to listeners
The output of a transformer is passed through in two different ways:

Prediction: the data is passed through the Doc._.trf_data attribute.

Training: the data is broadcast directly to the transformer's listeners.

However, the Transformer.predict method breaks the strict separation between training and prediction by also broadcasting transformer outputs to its listeners.

However, this breaks down when we are training a model with an unfrozen transformer when the transformer is also in annotating_components. The transformer will first (as part of its update step) broadcast the tensors and backprop function to its listeners. However, then when acting as an annotating component, it would immediately override its own output and clear the backprop function. As a result, gradients will not flow into the transformer.

This change removes the broadcast from the predict method. If a listener does not receive a batch, attempt to get the transformer output from the Doc instances. This makes it possible to train a pipeline with a frozen transformer.

This ports https://github.com/explosion/spaCy/pull/11385 to spacy-transformers. Alternative to #342.
bug feat / pipeline
opened by danieldk 0

Releases(v1.1.9)

v1.1.9(Dec 19, 2022)
Extend support for transformers up to v4.25.x.

Add support for Python 3.11 (currently limited to Linux due to supported platforms for PyTorch v1.13.x).

v1.1.8(Aug 12, 2022)
Extend support for transformers up to v4.21.x.

Support MPS device in HFShim (#328).

Track seen docs during alignment to improve speed (#337).

Don't require examples in Transformer.initialize (#341).

v1.1.7(Aug 25, 2022)
Extend support for transformers up to v4.20.x.

Convert all transformer outputs to XP arrays at once (#330).

Support alternate model loaders in HFShim and HFWrapper (#332).

v1.1.6(Jun 2, 2022)
Extend support for transformers up to v4.19.x.

Fix issue #324: Skip backprop for transformer if not available, for example if the transformer is frozen.

v1.1.5(Mar 15, 2022)
✨ New features and improvements

Extend support for transformers up to v4.17.x.

👥 Contributors

@adrianeboyd
v1.1.4(Jan 14, 2022)
✨ New features and improvements

Extend support for transformers up to v4.15.x.

👥 Contributors

@adrianeboyd, @danieldk
v1.1.3(Dec 7, 2021)
✨ New features and improvements

Extend support for transformers up to v4.12.x.

👥 Contributors

@adrianeboyd
v1.1.2(Oct 28, 2021)
🔴 Bug fixes

Fix #315: Enable loading of v1.0.x pipelines on Windows.

👥 Contributors

@adrianeboyd, @ryndaniels, @svlandeg
v1.1.1(Oct 19, 2021)
🔴 Bug fixes

Fix #309: Fix parameter ordering and defaults for new parameters in TransformerModel architectures.

Fix #310: Fix config and model issues when replacing listeners.

👥 Contributors

@adrianeboyd, @svlandeg
v1.1.0(Oct 18, 2021)
✨ New features and improvements

Refactor and improve transformer serialization for better support of inline transformer components and replacing listeners.

Provide the transformer model output as ModelOutput instead of tuples in TransformerData.model_output and FullTransformerBatch.model_output. For backwards compatibility, the tuple format remains available under TransformerData.tensors and FullTransformerBatch.tensors. See more details in the transformer API docs.

Add support for transformer_config settings such as output_attentions. Additional output is stored under TransformerData.model_output. More details in the TransformerModel docs.

Add support for mixed-precision training.

Improve training speed by streamlining allocations for tokenizer output.

Extend support for transformers up to v4.11.x.

🔴 Bug fixes

Fix support for GPT2 models.

⚠️ Backwards incompatibilities

The serialization format for transformer components has changed in v1.1 and is not compatible with spacy-transformers v1.0.x. Pipelines trained with v1.0.x can be loaded with v1.1.x, but pipelines saved with v1.1.x cannot be loaded with v1.0.x.

TransformerData.tensors and FullTransformerBatch.tensors return a tuple instead of a list.

👥 Contributors

@adrianeboyd, @bryant1410, @danieldk, @honnibal, @ines, @KennethEnevoldsen, @svlandeg
v1.0.6(Sep 2, 2021)
Fix copying of grad_factor when replacing listeners.

v1.0.5(Aug 26, 2021)
Fix replacing listeners: #277

Require spaCy 3.1.0 or higher

v1.0.4(Aug 12, 2021)
Extend transformers support to <4.10.0

Enable pickling of span getters and annotation setters, which is required for multiprocessing with spawn

v1.0.3(Jul 20, 2021)
Allow spaCy 3.1

Extend transformers to <4.7.0

v1.0.2(Apr 21, 2021)
✨ New features and improvements

Add support for transformers v4.3-v4.5

Add extra for CUDA 11.2

🔴 Bug fixes

Fix #264, #265: Improve handling of empty docs

Fix #269: Add trf_data extension in Transformer.__call__ and Transformer.pipe to support distributed processing

👥 Contributors

Thanks to @bryant1410 for the pull requests and contributions!
v1.0.1(Feb 2, 2021)
🔴 Bug fixes

Fix listener initialization when model width is unset.

v1.0.0(Feb 1, 2021)
This release requires spaCy v3.

✨ New features and improvements

Rewrite library from scratch for spaCy v3.0.

Transformer component for easy pipeline integration.

TransformerListener to share transformer weights between components.

Built-in registered functions that are available in spaCy if spacy-transformers is installed in the same environment.

📖 Documentation

Embeddings, Transformers and Transfer Learning: How to use transformers in spaCy

Training Pipelines and Models: Train and update components on your own data and integrate custom models

Layers and Model Architectures: Power spaCy components with custom neural networks

Transformer: Pipeline component API reference

Transformer architectures: Architectures and registered functions

v1.0.0rc2(Jan 19, 2021)
🌙 This release is a pre-release and requires spaCy v3 (nightly).

✨ New features and improvements

Add support for Python 3.9

Add support for transformers v4

🔴 Bug fixes

Fix #230: Add upstream argument to TransformerListener.v1

Fix #238: Skip special tokens during alignment

Fix #246: Raise error if model max length exceeded

v1.0.0rc0(Oct 15, 2020)
🌙 This release is a pre-release and requires spaCy v3 (nightly).

✨ New features and improvements

Rewrite library from scratch for spaCy v3.0.

Transformer component for easy pipeline integration.

TransformerListener to share transformer weights between components.

Built-in registered functions that are available in spaCy if spacy-transformers is installed in the same environment.

📖 Documentation

Embeddings, Transformers and Transfer Learning: How to use transformers in spaCy

Training Pipelines and Models: Train and update components on your own data and integrate custom models

Layers and Model Architectures: Power spaCy components with custom neural networks

Transformer: Pipeline component API reference

Transformer architectures: Architectures and registered functions

v0.6.2(Jun 29, 2020)
Fix issue #204: Store model_type in tok2vec config to fix speed degradation

v0.6.1(Jun 18, 2020)
⚠️ This release requires downloading new models.

Update spacy-transformers for spaCy v2.3

Update and extend supported transformers versions to >=2.4.0,<2.9.0

Use transformers.AutoConfig to support loading pretrained models from https://huggingface.co/models

#123: Fix alignment algorithm using pytokenizations

Thanks to @tamuhey for the pull request!
v0.5.3(Jun 18, 2020)
Bug fixes related to alignment and truncation:

#191: Reset max_len in case of alignment error

#196: Fix wordpiecer truncation to be per sentence

Enhancement:

#162: Let nlp.update handle Doc type inputs

Thanks to @ZhuoruLin for the pull requests and helping us debug issues related to batching and truncation!
v0.6.0(May 24, 2020)

Update to newer version of transformers.

This library is being rewritten for spaCy v3, in order to improve its flexibility and performance and to make it easier to stay up to date with new transformer models. See here for details: https://github.com/explosion/spacy-transformers/pull/173
v0.5.2(May 24, 2020)

Fix various alignment and preprocessing bugs.
v0.5.1(Oct 28, 2019)
Downgrade version pin of importlib_metadata to prevent conflict.

Fix issue #92: Fix index error when calculating doc.tensor.

Thanks to @ssavvi for the pull request!
v0.5.0(Oct 8, 2019)
⚠️ This release requires downloading new models. Also note the new model names that specify trf (transformers) instead of pytt (PyTorch transformers).

Rename package from spacy-pytorch-transformers to spacy-transformers.

Update to spacy>=2.2.0.

Upgrade to latest transformers.

Improve code and repo organization.

v0.4.0(Sep 4, 2019)
Fix serialization.

Update pytorch-transformers to support DistilBERT.

Add pre-packaged DistilBERT model.

v0.3.0(Aug 27, 2019)
Add out-of-the-box support for RoBERTa.

Add pre-packaged RoBERTa model.

Update to pytorch-transformers v1.1.

Fix serialization when model was saved from GPU.

v0.2.0(Aug 12, 2019)
Add support for GLUE benchmark tasks.

Support text-pair classification. The specifics of this are likely to change, but you can see run_glue.py for current usage.

Improve reliability of tokenization and alignment.

Add support for segment IDs to the PyTT_Wrapper class. These can now be passed in as a second column of the RaggedArray input. See the model_registry.get_word_pieces function for example usage.

Set default maximum sequence length to 128.

Fix bug that caused settings not to be passed into PyTT_TextCategorizer on model initialization.

Fix serialization of XLNet model.

v0.1.1(Aug 10, 2019)
Handle unaligned tokens in extension attributes.


Owner

Explosion

A software company specializing in developer tools for Artificial Intelligence and Natural Language Processing

GitHub Repository https://spacy.io/usage/embeddings-transformers

