Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Overview

Neural Scam Artist

TL;DR
A dataset of scam emails is scraped from an anti-fraud website. The dataset is then deduplicated using MinHash and LSH. The deduplicated dataset is used for fine-tuning GPT-2.

Comic stolen from Agent-X Comics.

☁️ Project Description

Objective

The goal of this project is to create a new dataset of fraudulent emails that can advance research on intelligent email assistants.

Web Scraper

Data is scraped from the website https://antifraudintl.org/. First, a set of thread URLs is collected and stored. Then, each thread is searched for emails. At most one email is kept per thread, as the rest are duplicates. Metadata (Subject, Date, etc.) is removed. The resulting dataset is stored in a CSV file.
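The metadata removal and CSV export steps could look roughly like the sketch below. It is only an illustration: the header names in `HEADER_RE`, the function names and the single `text` column are assumptions, not the actual behavior of `create_dataset.py`.

```python
import csv
import io
import re

# Hypothetical list of header fields; the real scraper may remove a different set.
HEADER_RE = re.compile(r"^(subject|date|from|to|reply-to)\s*:", re.IGNORECASE)


def strip_metadata(raw_email: str) -> str:
    """Drop header-like lines (e.g. 'Subject: ...') and trim surrounding whitespace."""
    body_lines = [
        line for line in raw_email.splitlines()
        if not HEADER_RE.match(line.strip())
    ]
    return "\n".join(body_lines).strip()


def write_dataset(emails, handle) -> int:
    """Write one cleaned email per row under a 'text' column; return the row count."""
    writer = csv.writer(handle)
    writer.writerow(["text"])  # assumed column name
    count = 0
    for raw in emails:
        cleaned = strip_metadata(raw)
        if cleaned:  # skip entries that contained nothing but metadata
            writer.writerow([cleaned])
            count += 1
    return count


sample = "Subject: Urgent business proposal\nDate: 01 Jan 2020\n\nDear friend,\nI am a prince."
buffer = io.StringIO()  # stands in for an open CSV file
rows = write_dataset([sample, "Date: 02 Jan 2020"], buffer)
```

Here the second input consists only of a header line, so it is dropped and a single row is written.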

Deduplication

Comparing every pair of documents has quadratic complexity. To avoid this, a cheaper alternative is selected: MinHash and LSH using the datasketch library. For each document, this method efficiently locates its nearest neighbors. Because this leads to a large number of false negatives (i.e. duplicate documents that are classified as non-duplicates), the approach is extended by creating a duplicate graph. Nodes in this graph represent documents and are connected by an edge if their respective documents have been classified as duplicates. To deduplicate the dataset, the connected components of the graph are located and, for each component, only a single node is kept. A readability criterion is used for this selection.
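The duplicate-graph step can be sketched with a small union-find structure. This is a minimal illustration, not the code in `deduplicate_dataset.py`: the duplicate pairs are hard-coded here instead of coming from LSH queries, and `toy_readability` is a hypothetical stand-in because the actual readability criterion is not specified above.

```python
from collections import defaultdict


def connected_components(num_docs, duplicate_pairs):
    """Group document indices into connected components using union-find."""
    parent = list(range(num_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for doc in range(num_docs):
        groups[find(doc)].append(doc)
    return sorted(groups.values())


def deduplicate(docs, duplicate_pairs, score):
    """Keep only the highest-scoring document from each component."""
    kept = []
    for component in connected_components(len(docs), duplicate_pairs):
        best = max(component, key=lambda i: score(docs[i]))
        kept.append(docs[best])
    return kept


def toy_readability(text):
    # Hypothetical criterion: share of characters that are letters or whitespace.
    return sum(ch.isalpha() or ch.isspace() for ch in text) / max(len(text), 1)


docs = [
    "Dear friend, I need your help",    # 0
    "Dear friend, I need your he1p!!",  # 1: flagged as a duplicate of 0
    "D3ar fr1end, I need y0ur help",    # 2: flagged as a duplicate of 1 only
    "You have won the lottery",         # 3: no duplicates
]
pairs = [(0, 1), (1, 2)]  # assumed output of the nearest-neighbor search
unique_docs = deduplicate(docs, pairs, toy_readability)
```

Documents 0 and 2 were never matched directly, yet they end up in the same component through document 1, which is how the graph recovers pairs that the nearest-neighbor search missed.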

GPT-2

A small pretrained GPT-2 model from the Hugging Face library is fine-tuned on the deduplicated dataset. A collection of randomly selected generated samples can be found here.

📁 Shared Files

| Resource | Size | #Samples | Link |
|---|---|---|---|
| Full dataset | 128.5 MB | 85,160 | Link |
| Deduplicated dataset | 74.2 MB | 58,227 | Link |
| Thread urls | 6.4 MB | 95,324 | Link |
| GPT-2 Checkpoints | ~1.5 GB | | Link |

🧰 Requirements

See requirements.txt.

⚙️ Installation

$ git clone https://github.com/davidsvy/Neural-Scam-Artist
$ cd Neural-Scam-Artist
$ pip install -r requirements.txt

🧻 Usage

To generate the dataset (~3 hours on Colab):

$ python create_dataset.py [-c configs/create_dataset.yaml]

To deduplicate the dataset (~30 minutes on Colab):

$ python deduplicate_dataset.py [-c configs/deduplicate_dataset.yaml]

To train GPT-2 (~3 hours/epoch on Colab with a K80 GPU):

$ python gpt2_train.py [-c configs/gpt2_train.yaml]

To generate text with GPT-2:

$ python gpt2_sample.py [-c configs/gpt2_sample.yaml]