Train GPT-3 model on V100(16GB Mem) Using improved Transformer.

Last update: Sep 11, 2022

Overview

Pytorch GPT-X

My Own Pytorch GPT-X

1. Abstract

Train a GPT-3 model on a V100 (16GB memory) using an improved Transformer.

2. Model

Transformer

Additional Module

① ReZero

Paper: ReZero is All You Need
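As a rough illustration of the idea, ReZero replaces the usual residual connection with `x + alpha * F(x)`, where `alpha` is a learnable scalar initialized to zero, so every layer starts out as the identity. The sketch below is framework-free and is not this repository's code; `rezero_residual` and `toy_sublayer` are hypothetical names used only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def rezero_residual(x: Vector, sublayer: Callable[[Vector], Vector], alpha: float = 0.0) -> Vector:
    """Return x + alpha * sublayer(x), the ReZero-style residual connection."""
    fx = sublayer(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


def toy_sublayer(x: Vector) -> Vector:
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [2.0 * xi + 1.0 for xi in x]


x = [1.0, -2.0, 0.5]
at_init = rezero_residual(x, toy_sublayer, alpha=0.0)         # alpha = 0: the layer is the identity
after_training = rezero_residual(x, toy_sublayer, alpha=0.1)  # alpha has moved away from zero
print(at_init, after_training)
```

In a real model `alpha` would be a trainable parameter per layer rather than a plain float.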

② Explicit Sparse Transformer

Paper: Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
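The core of explicit sparse attention can be sketched as a top-k selection on each row of attention scores before the softmax: the k largest scores are kept and the rest are masked out. This is a minimal sketch for a single row of scores, not the repository's implementation; `sparse_softmax` is a hypothetical helper name.

```python
import math
from typing import List


def sparse_softmax(scores: List[float], k: int) -> List[float]:
    """Softmax over the k largest scores; all other positions get probability 0."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # The k-th largest score is the selection threshold.
    # Scores tied with the threshold are all kept, so more than k may survive.
    threshold = sorted(scores, reverse=True)[k - 1]
    masked = [s if s >= threshold else -math.inf for s in scores]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


probs = sparse_softmax([2.0, 0.5, 1.0, -1.0], k=2)
print(probs)  # only positions 0 and 2 receive non-zero weight
```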

③ Macaron Architecture

Paper: Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
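A Macaron layer sandwiches the attention sublayer between two feed-forward sublayers, each of which contributes only half of a residual step (the "layer scale 0.5" in the TODO list). The sketch below shows only that ordering and scaling; layer normalization is omitted, and `macaron_layer` and the toy sublayers are hypothetical names, not the repository's code.

```python
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def residual(x: Vector, update: Vector, scale: float) -> Vector:
    """Return x + scale * update."""
    return [xi + scale * ui for xi, ui in zip(x, update)]


def macaron_layer(x: Vector, ffn_first: Sublayer, attention: Sublayer,
                  ffn_second: Sublayer, ffn_scale: float = 0.5) -> Vector:
    x = residual(x, ffn_first(x), ffn_scale)   # half-step feed-forward
    x = residual(x, attention(x), 1.0)         # full-step attention
    x = residual(x, ffn_second(x), ffn_scale)  # half-step feed-forward
    return x


def toy_ffn(v: Vector) -> Vector:
    # Hypothetical stand-in: a constant update of 2.0 per position.
    return [2.0 for _ in v]


def toy_attention(v: Vector) -> Vector:
    # Hypothetical stand-in: a constant update of 1.0 per position.
    return [1.0 for _ in v]


macaron_out = macaron_layer([0.0, 1.0], toy_ffn, toy_attention, toy_ffn)
print(macaron_out)
```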

④ RealFormer, Residual Attention

Paper: RealFormer
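Residual attention, as described in RealFormer, adds the previous layer's raw (pre-softmax) attention scores to the current layer's scores before applying the softmax, and passes the sum on to the next layer. The following is a minimal single-row sketch under that reading; `residual_attention_weights` is a hypothetical name, and this module is still listed as open in the TODO list below.

```python
import math
from typing import List, Optional, Tuple


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_attention_weights(
    scores: List[float], prev_scores: Optional[List[float]] = None
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, raw scores to hand to the next layer)."""
    if prev_scores is not None:
        scores = [s + p for s, p in zip(scores, prev_scores)]
    return softmax(scores), scores


# Two stacked layers: the second layer reuses the first layer's raw scores.
weights_1, carried_1 = residual_attention_weights([1.0, 0.0])
weights_2, carried_2 = residual_attention_weights([0.0, 2.0], prev_scores=carried_1)
print(weights_2, carried_2)
```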

Train

DeepSpeed
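This README does not include the DeepSpeed settings used, so the following is only a sketch of what a memory-saving configuration for a 16GB V100 might look like. The keys are standard DeepSpeed JSON config options (ZeRO stage 2 with optimizer offload to CPU, fp16 because the V100 has no bf16 support); every value is an illustrative assumption, not the repository's actual configuration.

```python
import json

# Assumed values for illustration only; tune them for the actual model and GPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # small micro-batch to fit in 16GB
    "gradient_accumulation_steps": 8,      # recover a larger effective batch
    "fp16": {"enabled": True},             # half precision on V100
    "zero_optimization": {
        "stage": 2,                        # partition optimizer state and gradients
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)  # save this as a JSON file and pass it to DeepSpeed
```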

TODO

~~ReZero~~
RealFormer, Residual Attention
~~Macaron architectures~~
~~Macaron architectures - layer Scale 0.5~~
~~Explicit Sparse Transformer~~
PyTorch Lightning
DeepSpeed training on a single GPU
DeepSpeed parallel training on 2 V100 GPUs (16GB memory each)

Parameter For Few-shot

The 175B-parameter model is very large, but few-shot learning needs a model of that scale. This repository therefore tries to use DeepSpeed to train extremely large models.

GPT-3 Config

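| model_name | n_params | n_layer | d_model | n_heads | d_head | batch_size (tokens) | learning_rate |
|---|---|---|---|---|---|---|---|
| GPT-3 175B | 175B | 96 | 12288 | 96 | 128 | 3.2M | 0.6 x 10^-4 |
| GPT-3 13B | 13B | 40 | 5140 | 40 | 128 | 2M | 1.0 x 10^-4 |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M | 1.2 x 10^-4 |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M | 1.6 x 10^-4 |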
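To see why these configurations do not fit on a 16GB GPU without help, the parameter counts can be roughly reproduced with the common estimate `12 * n_layer * d_model^2` (about `4 * d_model^2` for attention plus `8 * d_model^2` for a feed-forward block of width `4 * d_model`, ignoring embeddings and biases). The sketch below applies that approximation to three rows of the table and also reports the memory needed for fp16 weights alone; the function names are hypothetical.

```python
def approx_params(n_layer: int, d_model: int) -> int:
    """Rough non-embedding parameter count: 12 * n_layer * d_model^2."""
    return 12 * n_layer * d_model ** 2


def fp16_weight_gib(n_params: int) -> float:
    """Memory for the weights alone at 2 bytes per parameter, in GiB."""
    return n_params * 2 / 2 ** 30


configs = {
    "GPT-3 175B": (96, 12288),
    "GPT-3 13B": (40, 5140),
    "GPT-3 6.7B": (32, 4096),
}

estimates = {name: approx_params(n_layer, d_model) for name, (n_layer, d_model) in configs.items()}
for name, n_params in estimates.items():
    print(f"{name}: ~{n_params / 1e9:.1f}B params, ~{fp16_weight_gib(n_params):.0f} GiB in fp16")
```

Under this estimate, even the fp16 weights of the 6.7B configuration come close to a V100's 16GB before gradients, optimizer state, and activations are counted.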

References

Transformer

lucidrains/x-transformers

DeepSpeed

ReZero

majumderb/rezero

Explicit Sparse Transformer

x-transformers: explicit_sparse_transformer

Macaron Architecture

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

Owner

Seonghwan Kim
