Weakly-supervised Text Classification Based on Keyword Graph

Overview

Weakly-supervised Text Classification Based on Keyword Graph

How to run?

Download data

Our dataset follows previous works. For long texts, we follow Conwea. For short texts, we follow LOTClass.
We transform all their data into unified json format.

  1. Download datasets from: https://drive.google.com/drive/folders/1D8E9T-vuBE-YdAd9OBy-yS4UW4AptA58?usp=sharing

    • Long text datasets(follow Conwea):

      • 20Newsgroup Fine(20NF)
      • 20Newsgroup Coarse(20NC)
      • NYT Fine(NYT_25)
      • NYT Coarse(NYT_5)
    • Short text datasets(follow LOTClass)

      • Agnews
      • dbpedia
      • imdb
      • amazon
  2. Unzip data into './data/processed'

Another way to obtain data (Not recommended):
You can download long text data from Conwea and short text data from LOTClass and transform data into json format using our code. The code is located at 'preprocess_data/process_long.py (process_short.py) You need to edit the preprocess code to change the dataset path to your downloaded path and change the taskname. The processed data is located in 'data/processed'. We alse provide preprocess code for X-class, which is 'process_x_class.py'.

Requirements

This project is based on python==3.8. The dependencies are as follow:

pytorch
DGL
yacs
visdom
transformers
scikit-learn
numpy
scipy

Train and Eval

  • Recommend to start visdom to show the results.
visdom -p 8888

Open the browser to the server_ip:8888 to show visdom panel.

  • Train:
    • First edit 'task/pipeline.py' to specify to config file and CUDA devices you used.
      Some configuration files are provided in the config folder.

    • Start training:

      python task/pipeline.py
      
    • Our code is based on multi GPUs, may be unable to run on single GPU currently.

Run on your custom dataset.

  1. provide datasets to dir data/processed.

    • keywords.json
      keywords for each class. type: dict. key: class_index. value: list containing all keywords for this class. See provided datasets for details.

    • unlabeled.json
      unlabeled sentences in our paper. type: list. item: list with 2 items([sentence_i,label_i]).
      In order to facilitate the evaluation, we are similar to Conwea's settings, where labels of sentences are provided. The labels are only used for evaluation.

  2. provide config to dir config. You can copy one of the existing config files and change some fields, like number_classes, classifier.type, data_dir_name etc.

  3. Specify the config file name in pipeline.py and run the pipeline code.

Citation

Please cite the following paper if you find our code helpful! Thank you very much.

Lu Zhang, Jiandong Ding, Yi Xu, Yingyao Liu and Shuigeng Zhou. "Weakly-supervised Text Classification Based on Keyword Graph". EMNLP 2021.

Owner
Hello_World
Computer Science at Fudan University.
Hello_World
Implementation of TF-IDF algorithm to find documents similarity with cosine similarity

NLP learning Trying to learn NLP to use in my projects! Table of Contents About The Project Built With Getting Started Requirements Run Usage License

Faraz Farangizadeh 3 Aug 25, 2022
A BERT-based reverse dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end / back-end 임용

94 Dec 08, 2022
Contains descriptions and code of the mini-projects developed in various programming languages

TexttoSpeechAndLanguageTranslator-project introduction A pleasant application where the client will be given buttons like play,reset and exit. The cli

Adarsh Reddy 1 Dec 22, 2021
This repo stores the codes for topic modeling on palliative care journals.

This repo stores the codes for topic modeling on palliative care journals. Data Preparation You first need to download the journal papers. bash 1_down

3 Dec 20, 2022
Text to speech for Vietnamese, ez to use, ez to update

Chào mọi người, đây là dự án mở nhằm giúp việc đọc được trở nên dễ dàng hơn. Rất cảm ơn đội ngũ Zalo đã cung cấp hạ tầng để mình có thể tạo ra app này

Trần Cao Minh Bách 32 Jul 29, 2022
Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

730 Jan 09, 2023
Training code of Spatial Time Memory Network. Semi-supervised video object segmentation.

Training-code-of-STM This repository fully reproduces Space-Time Memory Networks Performance on Davis17 val set&Weights backbone training stage traini

haochen wang 128 Dec 11, 2022
DAGAN - Dual Attention GANs for Semantic Image Synthesis

Contents Semantic Image Synthesis with DAGAN Installation Dataset Preparation Generating Images Using Pretrained Model Train and Test New Models Evalu

Hao Tang 104 Oct 08, 2022
Meta learning algorithms to train cross-lingual NLI (multi-task) models

Meta learning algorithms to train cross-lingual NLI (multi-task) models

M.Hassan Mojab 4 Nov 20, 2022
Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products.

Leah Pathan Khan 2 Jan 12, 2022
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 🐦 🇮🇩 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 40 Nov 30, 2022
Implementation of Multistream Transformers in Pytorch

Multistream Transformers Implementation of Multistream Transformers in Pytorch. This repository deviates slightly from the paper, where instead of usi

Phil Wang 47 Jul 26, 2022
Nested Named Entity Recognition for Chinese Biomedical Text

CBio-NAMER CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is our method used in CBLUE (Chinese Biomedical Language Understand

8 Dec 25, 2022
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 06, 2022
Stanford CoreNLP provides a set of natural language analysis tools written in Java

Stanford CoreNLP Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and giv

Stanford NLP 8.8k Jan 07, 2023
Adversarial Examples for Extreme Multilabel Text Classification

Adversarial Examples for Extreme Multilabel Text Classification The code is adapted from the source codes of BERT-ATTACK [1], APLC_XLNet [2], and Atte

1 May 14, 2022
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye

Yi-Chang Chen 5 Dec 15, 2022
A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework.

Unpacker Karton Service A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework. This project is

c3rb3ru5 45 Jan 05, 2023
A list of NLP(Natural Language Processing) tutorials

NLP Tutorial A list of NLP(Natural Language Processing) tutorials built on PyTorch. Table of Contents A step-by-step tutorial on how to implement and

Allen Lee 1.3k Dec 25, 2022
A python package for deep multilingual punctuation prediction.

This python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

Oliver Guhr 27 Dec 22, 2022