Code for our ACL 2021 (Findings) paper, Fingerprinting Fine-tuned Language Models in the wild.

Overview

🌳 Fingerprinting Fine-tuned Language Models in the wild

This is the code and dataset for our ACL 2021 (Findings) paper, Fingerprinting Fine-tuned Language Models in the wild.

Clone the repo

git clone https://github.com/LCS2-IIITD/ACL-FFLM.git
cd ACL-FFLM
pip3 install -r requirements.txt

Dataset

The dataset includes both organic and synthetic text.

  • Synthetic -

    Collected from posts on r/SubSimulatorGPT2. Each user on the subreddit is a GPT2 small (345 MB) bot fine-tuned on 500k posts and comments from a particular subreddit (e.g., r/askmen, r/askreddit, r/askwomen). The bots generate posts on r/SubSimulatorGPT2, starting with the main post, followed by comments (and replies) from other bots. The bots also interact with each other by using the synthetic text in the preceding comment or reply as their prompt. In total, the subreddit contains 401,214 comments posted between June 2019 and January 2020 by 108 fine-tuned GPT2 LMs (or classes).

  • Organic -

    Collected from comments on the 108 subreddits that the GPT2 bots were fine-tuned on. We randomly collected about 2000 comments posted between June 2019 and January 2020.

The complete dataset is available here. Download the dataset as follows -

  1. Download the 2 folders organic and synthetic, containing the comments from individual classes.
  2. Store them in the data folder in the following format.
data
├── organic
├── synthetic
└── authors.csv
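
A quick way to confirm this layout before running anything is a small check like the one below. This is only a sketch and is not part of the repository: the helper name `missing_entries` is hypothetical, and the expected entry names are taken from the tree above.

```python
from pathlib import Path

# Entries expected directly under the data folder, per the tree above.
EXPECTED = ("organic", "synthetic", "authors.csv")


def missing_entries(root):
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    print("Layout looks complete." if not missing else f"Missing: {missing}")
```

Pass a different root if your copy of the data lives elsewhere.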

Note -
For the TL;DR run below, you also need to download the dataset.json and dataset.pkl files, which contain the pre-processed data.
Organize them in the dataset/synthetic folder as follows -

dataset
├── organic
├── synthetic
│   ├── splits (folder already present)
│   │   └── 6 (folder already present)
│   │       └── 108_800_100_200_dataset.json (file already present)
│   ├── dataset.json (to be added via drive link)
│   └── dataset.pkl (to be added via drive link)
└── authors.csv (file already present)

108_800_100_200_dataset.json is a custom dataset that contains the comment IDs, their labels, and their assignment to the train, validation and test splits.
When the models are run, the comments for each split are fetched from dataset.json using the comment IDs in the 108_800_100_200_dataset.json file.
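
The lookup described above can be sketched roughly as follows. The file schemas are not documented in this README, so the sketch assumes that the split file maps each split name to a list of `[comment_id, label]` pairs and that dataset.json maps comment IDs to comment text. Both layouts, and the names `fetch_split` and `load_json`, are assumptions made for illustration.

```python
import json


def fetch_split(split_index, comments_by_id, split_name):
    """Resolve one split's comment IDs to (text, label) pairs.

    Assumed (hypothetical) schemas:
      split_index    -> {"train": [[comment_id, label], ...], "val": ..., "test": ...}
      comments_by_id -> {comment_id: comment_text}
    """
    return [(comments_by_id[cid], label) for cid, label in split_index[split_name]]


def load_json(path):
    """Read a JSON file into Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Tiny in-memory example standing in for the two files.
split_index = {"train": [["c1", 0], ["c2", 1]], "test": [["c3", 0]]}
comments_by_id = {"c1": "first comment", "c2": "second comment", "c3": "third comment"}
train_pairs = fetch_split(split_index, comments_by_id, "train")
```

With the real files, `load_json` would supply both arguments, adapted to whatever structure the files actually use.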

Running the code

TL;DR

You can skip the Pre-process and Create Splits steps if you want to run the code on one of the custom datasets available in the dataset/synthetic....splits folder. Make sure to follow the instructions in the Note of the Dataset section for setting up the dataset folders.

Pre-process the dataset

First, we pre-process the complete dataset using the data present in the folder create-splits. Select the type of data (organic/synthetic) you want to pre-process using the parameter synthetic in the file. By default, the parameter is set to True, i.e. synthetic data. This creates the pre-processed dataset.json and dataset.pkl files in the dataset/[organic OR synthetic] folder.

Create Train, Test and Validation Splits

We create train, validation and test splits. The following parameters can be set in create_splits.py in the splits folder, under the commented PARAMETERS section: minimum sentence length (default 6), lowercasing of sentences, train size (max and default 800/class), validation size (max and default 100/class), test size (max and default 200/class), and number of classes (max and default 108).

cd create-splits
python3 create_splits.py

This creates a folder at dataset/synthetic/splits/[min_len_of_sentence/min_nf_tokens = 6]/. The train, validation and test datasets are all stored in a single file named [#CLASSES]_[#TRAIN_SET_SIZE]_[#VAL_SET_SIZE]_[#TEST_SET_SIZE]_dataset.json, for example 108_800_100_200_dataset.json.
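
To make the parameters concrete, here is a minimal sketch of per-class splitting and of the output filename pattern. It is not the code in create_splits.py: the function names, the whitespace tokenization, the rule of skipping classes with too few comments, and the fixed random seed are all assumptions.

```python
import random


def make_splits(comments_by_class, n_train=800, n_val=100, n_test=200,
                min_tokens=6, lowercase=True, seed=0):
    """Split each class's comments into train/val/test lists (illustrative only)."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    needed = n_train + n_val + n_test
    for label, comments in sorted(comments_by_class.items()):
        # Keep comments with at least `min_tokens` whitespace-separated tokens.
        kept = [c.lower() if lowercase else c
                for c in comments if len(c.split()) >= min_tokens]
        if len(kept) < needed:
            continue  # assumption: skip classes that cannot fill every split
        rng.shuffle(kept)
        splits["train"] += [(c, label) for c in kept[:n_train]]
        splits["val"] += [(c, label) for c in kept[n_train:n_train + n_val]]
        splits["test"] += [(c, label) for c in kept[n_train + n_val:needed]]
    return splits


def split_filename(n_classes=108, n_train=800, n_val=100, n_test=200):
    """Build the filename following the pattern described above."""
    return f"{n_classes}_{n_train}_{n_val}_{n_test}_dataset.json"
```

Calling `split_filename()` with its defaults yields 108_800_100_200_dataset.json, matching the file shipped in the splits folder.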

Running the model

Now set the same parameters in the seq_classification.py file. To train and test the best model (fine-tuned GPT2/RoBERTa) -

cd models/generate-embed/ft/
python3 seq_classification.py 

A results folder will be generated which will contain the results of each epoch.

Note -
For the other models (pre-trained and writeprints), first generate the embeddings using the files in the folders models/generate-embed/[pre-trained or writeprints]. The generated embeddings are stored in the results/generate-embed folder. Then use the script in models/classifiers/[pre-trained or writeprints] to train sklearn classifiers on the generated embeddings. The final results will be in the results/classifiers/[pre-trained or writeprints] folder.

👪 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For detailed clarifications or issues, please email nirav17072[at]iiitd[dot]ac[dot]in.

⚖️ License

MIT

Owner
LCS2-IIITDelhi
Laboratory for Computational Social Systems (LCS2) is a research group led by Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar at IIIT-D