ICE Tokenizer

Token ids in [0, 20000) are image tokens.
Token ids in [20000, 20100) are common tokens, mainly punctuation. E.g., icetk[20000] == ' ', icetk[20003] == ' ', icetk[20006] == ','.
Token ids in [20100, 83823) are English tokens.
Token ids in [83823, 145653) are Chinese tokens.
Token ids in [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.
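
The id ranges above can be turned into a small lookup. The following is a minimal sketch, not part of the icetk package: the function name token_category and the category labels are hypothetical, and only the range boundaries are taken from the list above.

```python
from bisect import bisect_right

# Boundaries copied from the id ranges listed above.
# The labels are illustrative names, not identifiers from icetk.
_BOUNDS = [0, 20000, 20100, 83823, 145653, 150000]
_LABELS = ["image", "common", "english", "chinese", "rare"]


def token_category(token_id: int) -> str:
    """Return the vocabulary segment that a token id falls into."""
    if not 0 <= token_id < _BOUNDS[-1]:
        raise ValueError(f"token id {token_id} is outside [0, {_BOUNDS[-1]})")
    # bisect_right finds the first boundary strictly greater than token_id;
    # the segment index is one less than that position.
    return _LABELS[bisect_right(_BOUNDS, token_id) - 1]


if __name__ == "__main__":
    for tid in (12738, 20006, 39316, 94874, 145803):
        print(tid, token_category(tid))
```

A lookup like this may be handy for checking which modality a generated id belongs to before decoding it.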

You can install the package via

pip install icetk

Tokenization

from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.'  # decoding always recovers the input exactly (unless it contains unknown tokens)

ids = icetk.encode('你好世界！这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]

ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')
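
In the image example above, a 256x256 image with compress_rate=8 yields 1024 ids, which matches (256 / 8)^2. The sketch below works through that arithmetic. It assumes that each compress_rate x compress_rate patch maps to one token, which is inferred from the shapes shown and not confirmed by the source; image_token_count is a hypothetical helper, not an icetk function.

```python
def image_token_count(image_size: int, compress_rate: int) -> int:
    """Estimate how many token ids a square image produces.

    Assumption (inferred from the example shapes, not from icetk itself):
    the image is split into a grid with image_size // compress_rate cells
    per side, and each cell becomes one token.
    """
    if image_size <= 0 or compress_rate <= 0:
        raise ValueError("image_size and compress_rate must be positive")
    if image_size % compress_rate != 0:
        raise ValueError("image_size must be divisible by compress_rate")
    side = image_size // compress_rate
    return side * side


if __name__ == "__main__":
    # Matches ids.shape == torch.Size([1, 1024]) in the example above.
    print(image_token_count(256, 8))
```

Under the same assumption, a higher compress_rate gives fewer tokens per image at the cost of reconstruction detail.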

A unified tokenization tool for Images, Chinese and English.

Owner: THUDM