Code for the paper "Language Models are Unsupervised Multitask Learners"

Last update: Jan 08, 2023

Overview

Status: Archive (code is provided as-is, no updates expected)

gpt-2

Code and models from the paper "Language Models are Unsupervised Multitask Learners".

You can read about GPT-2 and its staged release in our original blog post, 6-month follow-up post, and final post.

We have also released a dataset for researchers to study the models' behaviors.

* Note that the parameter counts in our previous blog posts and paper were wrong due to an error, so you may have seen the small model referred to as 117M and the medium model as 345M.

Usage

This repository is meant to be a starting point for researchers and engineers to experiment with GPT-2.

For basic information, see our model card.

Some caveats

The robustness and worst-case behaviors of GPT-2 models are not well understood. As with any machine-learned model, carefully evaluate GPT-2 for your use case, especially if it is used without fine-tuning or in safety-critical applications where reliability is important.
The dataset our GPT-2 models were trained on contains many texts with biases and factual inaccuracies, and thus GPT-2 models are likely to be biased and inaccurate as well.
To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.
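
As an illustration of the labeling recommendation above, here is a minimal Python sketch. It is not part of this repository: the function name `label_synthetic` and the wording of the notice are hypothetical and chosen only for this example. It prepends a disclosure line to a generated sample before the sample is shared.

```python
# Hypothetical helper, not part of the gpt-2 codebase: marks a generated
# sample as synthetic before it is published or shared.
SYNTHETIC_NOTICE = "[Synthetic text generated by a language model]"


def label_synthetic(sample: str, notice: str = SYNTHETIC_NOTICE) -> str:
    """Return the sample with a disclosure line prepended.

    If the sample already starts with the notice, it is returned unchanged,
    so applying the function twice does not duplicate the label.
    """
    if sample.startswith(notice):
        return sample
    return f"{notice}\n{sample}"


if __name__ == "__main__":
    generated = "In a shocking finding, scientists discovered a herd of unicorns."
    print(label_synthetic(generated))
```

A plain-text prefix like this is easy to strip, so treat it as a baseline for clear labeling rather than a defense against misuse.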

Work with us

Please let us know if you’re doing interesting research with or working on applications of GPT-2! We’re especially interested in hearing from and potentially working with those who are studying:

Potential malicious use cases and defenses against them (e.g. the detectability of synthetic text)
The extent of problematic content (e.g. bias) being baked into the models and effective mitigations

Development

See DEVELOPERS.md

Contributors

See CONTRIBUTORS.md

Citation

Please use the following bibtex entry:

@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}

Future work

We may release code for evaluating the models on various benchmarks.

We are still considering release of the larger models.

License

Modified MIT

Owner

OpenAI
