Trex is a tool to match semantically similar functions based on transfer learning.

Last update: Dec 28, 2022


Overview

Introduction

Trex is a tool to match semantically similar functions based on transfer learning.

Installation

We recommend using conda to set up the environment and install the required packages.

First, create the conda environment,

conda create -n trex python=3.8 numpy scipy scikit-learn requests

and activate the conda environment:

conda activate trex

Then, install PyTorch (this command assumes you have a GPU):

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

Enter the trex root directory (e.g., path/to/trex) and install trex:

pip install --editable .

For large datasets, install PyArrow:

pip install pyarrow

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
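
Once the steps above are done, a quick check from inside the activated environment can confirm which packages are importable. The following is a minimal sketch that uses only the Python standard library. The names are the usual import names (for example `sklearn` for scikit-learn and `torch` for PyTorch), and the split into required and optional groups is an assumption made for illustration, not something trex itself defines.

```python
import importlib.util

# Import names for the packages installed above (the grouping is an assumption).
REQUIRED = ["numpy", "scipy", "sklearn", "requests", "torch"]
OPTIONAL = ["pyarrow", "apex"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def report():
    """Print one status line per package and return the missing required ones."""
    missing = []
    for name, found in check_modules(REQUIRED).items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
        if not found:
            missing.append(name)
    for name, found in check_modules(OPTIONAL).items():
        print(f"{name:10s} {'ok' if found else 'not installed (optional)'}")
    return missing


if __name__ == "__main__":
    report()
```

This only checks that the modules can be located; it does not verify versions or that PyTorch can actually see a GPU.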

Preparation

Pretrained models:

Create the checkpoints and checkpoints/pretrain subdirectories in path/to/trex:

mkdir checkpoints
mkdir checkpoints/pretrain

Download our pretrained weight parameters and put them in checkpoints/pretrain
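
The same directory setup can be done from Python, which is convenient when it is part of a larger script. This is a minimal sketch: the function names `prepare_pretrain_dir` and `list_downloaded_files` are hypothetical helpers, and no assumption is made about the file name of the downloaded weights, so the second helper simply lists whatever files are present.

```python
from pathlib import Path


def prepare_pretrain_dir(trex_root="."):
    """Create checkpoints/pretrain under trex_root (no error if it exists)."""
    pretrain_dir = Path(trex_root) / "checkpoints" / "pretrain"
    pretrain_dir.mkdir(parents=True, exist_ok=True)
    return pretrain_dir


def list_downloaded_files(pretrain_dir):
    """Return the sorted names of regular files found in pretrain_dir."""
    return sorted(p.name for p in Path(pretrain_dir).iterdir() if p.is_file())


# Example usage, run from path/to/trex:
#   target = prepare_pretrain_dir(".")
#   print(list_downloaded_files(target))  # empty until the weights are downloaded
```

An empty list after the download step suggests the weights were saved somewhere other than checkpoints/pretrain.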

Sample data for finetuning similarity

We provide sample training/testing files for finetuning in data-src/similarity. If you want to prepare the finetuning data yourself, make sure you follow the format shown in data-src/similarity (coming soon: tokenization script).

The data must be binarized before it can be used for training. To binarize the training data for finetuning, run:

python command/finetune/preprocess.py

The binarized training data ready for finetuning (for detecting similarity) will be stored at data-bin/similarity

Training

To finetune the model, run:

./command/finetune/finetune.sh

The script loads the pretrained weight parameters from checkpoints/pretrain/ and finetunes the model.
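
Finetuning depends on two earlier steps: the binarized data in data-bin/similarity and the pretrained weights in checkpoints/pretrain. A small pre-flight check can catch a skipped step before the training script starts. The sketch below is illustrative only; `missing_inputs` is a hypothetical helper, and it treats an empty directory the same as a missing one.

```python
from pathlib import Path

# Directories named in this README, relative to the trex root.
FINETUNE_INPUTS = ["data-bin/similarity", "checkpoints/pretrain"]


def missing_inputs(trex_root=".", required=FINETUNE_INPUTS):
    """Return the required directories that are absent or empty."""
    problems = []
    for rel in required:
        path = Path(trex_root) / rel
        # A directory with no entries means the producing step has not run.
        if not path.is_dir() or not any(path.iterdir()):
            problems.append(rel)
    return problems


# Example usage, run from path/to/trex:
#   problems = missing_inputs(".")
#   if problems:
#       raise SystemExit("Missing or empty: " + ", ".join(problems))
```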

Sample data for pretraining on micro-traces

We also provide 10K samples and scripts to demonstrate how to pretrain the model. To binarize the training data for pretraining, run:

python command/pretrain/preprocess_pretrain_10k.py

The binarized training data ready for pretraining will be stored at data-bin/pretrain_10k

To pretrain the model, run:

./command/pretrain/pretrain_10k.sh

The pretrained model will be checkpointed at checkpoints/pretrain_10k
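
The two pretraining commands must run in order, since the training script reads the output of the preprocessing script. One way to chain them is a small driver that stops at the first failure. This is a sketch under assumptions: `run_steps` and `PRETRAIN_STEPS` are hypothetical names, the commands are expected to be launched from the trex root, and `sys.executable` is used in place of `python` so the preprocessing step runs in the same interpreter as the driver.

```python
import subprocess
import sys

# Commands from this README, in the order they must run (from the trex root).
PRETRAIN_STEPS = [
    [sys.executable, "command/pretrain/preprocess_pretrain_10k.py"],
    ["./command/pretrain/pretrain_10k.sh"],
]


def run_steps(steps, cwd=".", dry_run=False):
    """Run each command in order, stopping at the first failure.

    With dry_run=True the commands are only printed. Returns the list of
    commands that were run (or would have been run).
    """
    done = []
    for cmd in steps:
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, cwd=cwd, check=True)
        done.append(cmd)
    return done


if __name__ == "__main__":
    # Print the plan only; pass dry_run=False to actually execute it.
    run_steps(PRETRAIN_STEPS, dry_run=True)
```

Running with `dry_run=True` first is a cheap way to confirm the paths before starting a long training job.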

Dataset

We put our dataset here.

