This repository contains implementations of data augmentation methods for Japanese NLP.

Last update: Nov 11, 2022

Overview

daaja

daaja implements the data augmentation methods described in the following papers for Japanese NLP:

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
An Analysis of Simple Data Augmentation for Named Entity Recognition

Install

pip install daaja

How to use

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Command

python -m daaja.eda.run --input input.tsv --output data_augmentor.tsv

The format of input.tsv is as follows:

1	この映画はとてもおもしろい
0	つまらない映画だった
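
The two-column layout above (a label, a tab, then the sentence) can be written and read back with Python's standard csv module. This is only a sketch: the helper names write_tsv and read_tsv are hypothetical and are not part of daaja, and the example uses an in-memory buffer instead of a real input.tsv.

```python
import csv
import io


def write_tsv(rows, fh):
    """Write (label, text) pairs as tab-separated lines."""
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    for label, text in rows:
        writer.writerow([label, text])


def read_tsv(fh):
    """Read tab-separated lines back into (label, text) pairs, skipping empty rows."""
    reader = csv.reader(fh, delimiter="\t")
    return [(row[0], row[1]) for row in reader if row]


rows = [("1", "この映画はとてもおもしろい"), ("0", "つまらない映画だった")]

buffer = io.StringIO()  # stands in for open("input.tsv", "w", encoding="utf-8")
write_tsv(rows, buffer)
buffer.seek(0)
loaded = read_tsv(buffer)
```

Swapping the buffer for a file opened with encoding="utf-8" should give a file in the same shape as the example above.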

In Python

from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor
augmentor = EasyDataAugmentor(alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
text = "日本語でデータ拡張を行う"
aug_texts = augmentor.augments(text)
print(aug_texts)
# ['日本語でを拡張データ行う', '日本語でデータ押広げるを行う', '日本語でデータ拡張を行う', '日本語で智見拡張を行う', '日本語でデータ拡張を行う']
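
To give a feel for what the alpha_rs and p_rd parameters control, here is a minimal, standalone sketch of two of the four EDA operations (random swap and random deletion) on a hand-tokenized sentence. It is not daaja's implementation: the function names, the tokenization, and the max(1, ...) rounding are assumptions made for illustration. Synonym replacement and random insertion are left out because they need a synonym dictionary.

```python
import random


def random_swap(tokens, n, rng):
    """Swap two randomly chosen positions, n times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # seeded so that runs are repeatable
tokens = ["日本語", "で", "データ", "拡張", "を", "行う"]  # hand-tokenized (assumption)

# Number of swaps scales with sentence length (assumed rounding rule).
n_swaps = max(1, int(0.1 * len(tokens)))
swapped = random_swap(tokens, n_swaps, rng)
deleted = random_deletion(tokens, 0.1, rng)

print("".join(swapped))
print("".join(deleted))
```

Random swap keeps every token and only changes the order, while random deletion returns a shorter sentence whose remaining tokens stay in their original order.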

An Analysis of Simple Data Augmentation for Named Entity Recognition

Command

python -m daaja.ner_sda.run --input input.tsv --output data_augmentor.tsv

The format of input.tsv is as follows:

私	O
は	O
田中	B-PER
と	O
いい	O
ます	O
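
The token-per-line layout above can be turned into the parallel token and label lists used in the Python example with a few lines of code. The sketch below is hypothetical and not part of daaja. It also assumes that sentences are separated by blank lines, as in CoNLL-style files; the example above shows only a single sentence, so that separator is an assumption.

```python
def parse_token_label_lines(lines):
    """Group "token<TAB>label" lines into (tokens, labels) pairs, one per sentence."""
    sentences = []
    tokens, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence (assumed separator).
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    if tokens:  # last sentence when the input has no trailing blank line
        sentences.append((tokens, labels))
    return sentences


sample = "私\tO\nは\tO\n田中\tB-PER\nと\tO\nいい\tO\nます\tO\n"
sentences = parse_token_label_lines(sample.splitlines())
tokens_list = [tokens for tokens, _ in sentences]
labels_list = [labels for _, labels in sentences]
```

The resulting tokens_list and labels_list have the same shape as the lists passed to the augmentor in the Python example.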

In Python

from daaja.ner_sda import SimpleDataAugmentationforNER
tokens_list = [
    ["私", "は", "田中", "と", "いい", "ます"],
    ["筑波", "大学", "に", "所属", "して", "ます"],
    ["今日", "から", "筑波", "大学", "に", "通う"],
    ["茨城", "大学"],
]
labels_list = [
    ["O", "O", "B-PER", "O", "O", "O"],
    ["B-ORG", "I-ORG", "O", "O", "O", "O"],
    ["B-DATE", "O", "B-ORG", "I-ORG", "O", "O"],
    ["B-ORG", "I-ORG"],
]
augmentor = SimpleDataAugmentationforNER(tokens_list=tokens_list, labels_list=labels_list,
                                            p_power=1, p_lwtr=1, p_mr=1, p_sis=1, p_sr=1, num_aug=4)
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
augmented_tokens_list, augmented_labels_list = augmentor.augments(tokens, labels)
print(augmented_tokens_list)
# [['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '志す', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '大学', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '筑波', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ']]
print(augmented_labels_list)
# [['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O']]
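
The fourth augmented sentence above, where 株式 会社 A becomes 筑波 大学, looks like mention replacement: an entity mention is swapped for another mention of the same type taken from the training data, and the BIO labels are regenerated to fit the new length. The following is a minimal standalone sketch of that idea, not daaja's implementation. The function names and the mention_pool dictionary are hypothetical.

```python
import random


def extract_mentions(labels):
    """Return (start, end, type) spans for BIO-labelled mentions; end is exclusive."""
    spans, start, ent_type = [], None, None
    for i, label in enumerate(labels + ["O"]):  # trailing "O" closes a final mention
        if label.startswith("I-") and start is not None and label[2:] == ent_type:
            continue  # still inside the current mention
        if start is not None:
            spans.append((start, i, ent_type))
            start, ent_type = None, None
        if label.startswith("B-"):
            start, ent_type = i, label[2:]
    return spans


def mention_replacement(tokens, labels, mention_pool, p, rng):
    """Replace each mention, with probability p, by a same-type mention from the pool."""
    new_tokens, new_labels, cursor = [], [], 0
    for start, end, ent_type in extract_mentions(labels):
        new_tokens += tokens[cursor:start]
        new_labels += labels[cursor:start]
        candidates = mention_pool.get(ent_type, [])
        if candidates and rng.random() < p:
            mention = rng.choice(candidates)
            new_tokens += mention
            new_labels += ["B-" + ent_type] + ["I-" + ent_type] * (len(mention) - 1)
        else:
            new_tokens += tokens[start:end]
            new_labels += labels[start:end]
        cursor = end
    new_tokens += tokens[cursor:]
    new_labels += labels[cursor:]
    return new_tokens, new_labels


# Hypothetical pool; in practice it would be collected from the training sentences.
mention_pool = {"ORG": [["筑波", "大学"]]}
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]

aug_tokens, aug_labels = mention_replacement(
    tokens, labels, mention_pool, p=1.0, rng=random.Random(0)
)
print(aug_tokens)
# ['吉田', 'さん', 'は', '筑波', '大学', 'に', '出張', '予定', 'だ']
print(aug_labels)
# ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O']
```

Because the pool here holds no PER mentions, 吉田 is left untouched, and the ORG mention shrinks from three tokens to two along with its labels.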

Comments

Too many progress bars

When I use EasyDataAugmentor during training, too many progress bars are printed to the console.

Could the tqdm call on line 19 of sequential_flow.py be made switchable on and off when EasyDataAugmentor is constructed? https://github.com/kajyuuen/daaja/blob/12835943868d43f5c248cf1ea87ab60f67a6e03d/daaja/flows/sequential_flow.py#L19

opened by Yongtae723 6
Error on from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor

After installing daaja with pip, running from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor raises the following error: ConnectionError: HTTPConnectionPool(host='compling.hss.ntu.edu.sg', port=80): Max retries exceeded with url: /wnja/data/1.1/wnjpn.db.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3b6a6cced0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

opened by naoki1213mj 5
Is it possible to use this on a GPU device?

Hi!

Thank you for the great library. When I train with this augmentation, it takes much more time than the forward and backward passes.

Would it be possible to run this augmentation on a GPU to save time?

Thank you

opened by Yongtae723 3
Bump joblib from 1.1.0 to 1.2.0
Bumps joblib from 1.1.0 to 1.2.0.
dependencies
opened by dependabot[bot] 0
Implement Data Augmentation using Pre-trained Transformer Models
Paper

Data Augmentation using Pre-trained Transformer Models

Code

https://github.com/varunkumar-dev/TransformersDataAugmentation

Ref

https://www.ai-shift.co.jp/techblog/1939

add-new-technique
opened by kajyuuen 0
Implement Contextual Augmentation
Paper

Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

Code

https://github.com/pfnet-research/contextual_augmentation

add-new-technique
opened by kajyuuen 0
Implement MixText
Paper

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Code

https://github.com/GT-SALT/MixText

add-new-technique
opened by kajyuuen 0

Releases(v0.0.7)

v0.0.7(Oct 24, 2022)
Changes

Change pytest @kajyuuen (#35 #37 #38)

Change WORDNER_URL @kajyuuen (#34)

daaja-0.0.7-py3-none-any.whl(18.19 KB)
v0.0.6(Mar 3, 2022)
Changes

Update version @kajyuuen (#27)

Add verbose option @kajyuuen (#25)

📖 Documentation

Add README_ja.md and Update README.md @kajyuuen (#26)

v0.0.5(Feb 27, 2022)
Changes

💪 Enhancement

Add ContextualAugmentor @kajyuuen (#23)

Add BackTranslationAugmentor @kajyuuen (#21 , #22)

📖 Documentation

Add quick_example @kajyuuen (#17)

v0.0.4(Feb 21, 2022)
Changes

Release v0.0.4 @kajyuuen (#16)

Chore add release drafter @kajyuuen (#6)

💪 Enhancement

Add tqdm @kajyuuen (#8)

📖 Documentation

Refactoring @kajyuuen (#15)

Add SDA example @kajyuuen (#9)

Add EDA example @kajyuuen (#7)

v0.0.3(Feb 13, 2022)

daaja-0.0.3-py3-none-any.whl(14.80 KB)
v0.0.2(Feb 13, 2022)

daaja-0.0.2-py3-none-any.whl(14.97 KB)

Owner

Koga Kobayashi
