Chinese Pre-Trained Language Models (CPM-LM) Version-I

Overview

CPM-Generate

为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告]

若您想使用CPM-1进行推理,我们建议使用高效推理工具BMInf,支持1060以上显卡单卡推理。

安装

首先安装pytorch等基础依赖,再安装APEX以支持fp16:

pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

考虑apex的安装容易发生问题,我们构建了对应的Docker容器,可以进行快速环境搭建。安装方式如下:

docker pull dmye/cpm:v0

参考运行指令如下:

:/CPM --name=cpm cpm:v0 ">
sudo docker run --gpus '"device=0,1"' -it -v 
   
    :/CPM  --name=cpm  cpm:v0

   

其中 为代码所在目录,-v进行文件目录挂载

注:感谢qhduan同学提供了基于TensorFlow的使用代码,用作Pytorch之外的备选。

模型

模型下载后文件夹的目录结构需设置如下:

.
├── 80000
│   ├── mp_rank_00_model_states.pt
│   └── mp_rank_01_model_states.pt
└── latest_checkpointed_iteration.txt

为保证下载文件的正确性,文件的checksum如下:

SHA1
71d6b6ad4f47b46724eb82c05da8fb9175e62a7d  80000/mp_rank_00_model_states.pt
42aa247a262e2011fa5e276f1a8389fad6d80edc  80000/mp_rank_01_model_states.pt
MD5
f3f6d2f7d84c6a45290a31dabf79ddac  80000/mp_rank_00_model_states.pt
b0e960be4b5226e759ae6fc5246f9160  80000/mp_rank_01_model_states.pt

使用

提供了命令行交互式生成:

bash scripts/generate_text.sh /path/to/CPM

如不使用交互式输入,可增加第二个参数,告知输入文本的位置

bash scripts/generate_text.sh /path/to/CPM example.txt

运行该脚本需要两块GPU,每张卡的GPU内存占用约为7GB。该项目主要基于 Megatron-LM 进行修改。模型的主体架构与GPT-2一致。

默认的模型并行参数为2,如果需要修改,可以使用change_mp.py,并调整generate_text.sh中的MPSIZEchange_mp.py的使用示例如下:

python change_mp.py /path/to/CPM MPSIZE

这里的/path/to/CPM为模型路径,MPSIZE为一个整数,可以为1或者2的倍数,结果会生成一个新的模型,存储路径为/path/to/CPM_MPSIZE

Tokenization

Tokenization实现主要在data_util/tokenization_gpt2.py,先对于文本进行分词,再使用 SentencePiece 得到 BPE 的结果。由于 SentencePiece 不能有效编码空格和换行符,在 BPE 之前,我们将文本中的空格和换行符替换为\u2582\u2583。生成文本的时候也会对应的把生成的\u2582\u2583替换回空格和换行符。

对应问题已解决。

分类任务零次学习(Zero-shot Learning)

提供了三个任务的零次学习任务脚本以供参考,包括OCNLI、TNEWS和IFLYTEK,数据下载链接。脚本使用方法如下:

# OCNLI
bash scripts/zero-shot-ocnli.sh /path/to/CPM /path/to/dataset
# TNEWS
bash scripts/zero-shot-tnews.sh /path/to/CPM /path/to/dataset
# IFLYTEK
bash scripts/zero-shot-iflytek.sh /path/to/CPM /path/to/dataset

TODO

  • 实验环境的docker镜像
  • 提供各个任务具体的使用模板
  • 公开技术报告
  • 模型并行数可动态调整
  • Fine-tune代码
  • 开源实验中使用的小规模模型参数

引用

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu, and Zhou, Hao, and Ke, Pei, and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}
Owner
Tsinghua AI
Tsinghua AI
Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Pulkit Kathuria 173 Jan 04, 2023
Abhijith Neil Abraham 2 Nov 05, 2021
multi-label,classifier,text classification,多标签文本分类,文本分类,BERT,ALBERT,multi-label-classification,seq2seq,attention,beam search

multi-label,classifier,text classification,多标签文本分类,文本分类,BERT,ALBERT,multi-label-classification,seq2seq,attention,beam search

hellonlp 30 Dec 12, 2022
Repositório do trabalho de introdução a NLP

Trabalho da disciplina de BI NLP Repositório do trabalho da disciplina Introdução a Processamento de Linguagem Natural da pós BI-Master da PUC-RIO. Eq

Leonardo Lins 1 Jan 18, 2022
:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 6.4k Jan 09, 2023
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

Pretrain and Fine-tune a T5 model with Flax on GCP This tutorial details how pretrain and fine-tune a FlaxT5 model from HuggingFace using a TPU VM ava

Gabriele Sarti 41 Nov 18, 2022
Unsupervised Abstract Reasoning for Raven’s Problem Matrices

Unsupervised Abstract Reasoning for Raven’s Problem Matrices This code is the implementation of our TIP paper. This is the first unsupervised abstract

Tao Zhuo 9 Dec 17, 2022
Write Python in Urdu - اردو میں کوڈ لکھیں

UrduPython Write simple Python in Urdu. How to Use Write Urdu code in سامپل۔پے The mappings are as following: "۔": ".", "،":

Saad A. Bazaz 26 Nov 27, 2022
Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

wangle 823 Dec 28, 2022
State of the art faster Natural Language Processing in Tensorflow 2.0 .

tf-transformers: faster and easier state-of-the-art NLP in TensorFlow 2.0 ****************************************************************************

74 Dec 05, 2022
NLP and Text Generation Experiments in TensorFlow 2.x / 1.x

Code has been run on Google Colab, thanks Google for providing computational resources Contents Natural Language Processing(自然语言处理) Text Classificati

1.5k Nov 14, 2022
使用Mask LM预训练任务来预训练Bert模型。训练垂直领域语料的模型表征,提升下游任务的表现。

Pretrain_Bert_with_MaskLM Info 使用Mask LM预训练任务来预训练Bert模型。 基于pytorch框架,训练关于垂直领域语料的预训练语言模型,目的是提升下游任务的表现。 Pretraining Task Mask Language Model,简称Mask LM,即

Desmond Ng 24 Dec 10, 2022
HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools

HuggingSound HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools. I have no intention of building a very complex tool here.

Jonatas Grosman 247 Dec 26, 2022
PyTorch implementation of the NIPS-17 paper "Poincaré Embeddings for Learning Hierarchical Representations"

Poincaré Embeddings for Learning Hierarchical Representations PyTorch implementation of Poincaré Embeddings for Learning Hierarchical Representations

Facebook Research 1.6k Dec 29, 2022
A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

RITA DSL This is a language, loosely based on language Apache UIMA RUTA, focused on writing manual language rules, which compiles into either spaCy co

Šarūnas Navickas 60 Sep 26, 2022
code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

AttentiveNAS: Improving Neural Architecture Search via Attentive Sampling This repository contains PyTorch evaluation code, training code and pretrain

Facebook Research 94 Oct 26, 2022
Simple bots or Simbots is a library designed to create simple bots using the power of python. This library utilises Intent, Entity, Relation and Context model to create bots .

Simple bots or Simbots is a library designed to create simple chat bots using the power of python. This library utilises Intent, Entity, Relation and

14 Dec 15, 2021
T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets (product titles, images, comments, etc.).

55 Nov 22, 2022
jiant is an NLP toolkit

jiant is an NLP toolkit The multitask and transfer learning toolkit for natural language processing research Why should I use jiant? jiant supports mu

ML² AT CILVR 1.5k Jan 04, 2023
KR-FinBert And KR-FinBert-SC

KR-FinBert & KR-FinBert-SC Much progress has been made in the NLP (Natural Language Processing) field, with numerous studies showing that domain adapt

5 Jul 29, 2022