Tools to download and clean up Common Crawl data

cc_net

Tools to download and clean Common Crawl as introduced in our paper CCNet.

If you found these resources useful, please consider citing:

@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}

Installation

We have only tried this on Linux, but installation should be possible on macOS too.

  1. Create or symlink a data folder to the location where you want to download the corpus.

  2. Run make install. This will download some resources and install required packages.

  3. If you have a C++17 compiler you can also run pip install .[getpy], which provides a more memory-efficient hash set.

  4. If make install failed, install the missing external tools manually (see the sketch after this list).
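
A minimal sketch of the install sequence on a Linux machine, assuming the corpus should live on a large disk mounted at /large/disk (the path is illustrative):

ln -s /large/disk/cc_net_data data
make install
pip install ".[getpy]"  # optional, requires a C++17 compiler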

Training Language Models

The Makefile is used to train Sentence Piece models and language models (LMs) on Wikipedia data (see the example after the list below).

  • make help shows help
  • make lang=de lm trains a Sentence Piece model and an LM on German Wikipedia
  • make all_lm trains the same models as in the paper
  • make lang=de dl_lm downloads the LM trained for the paper
  • make dl_all_lm downloads all of them
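
For example, to fetch the pretrained models from the paper for two languages instead of training them locally (a sketch; it assumes German and French are among the languages covered by the paper's models):

make lang=de dl_lm
make lang=fr dl_lm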

Pipeline overview

The full mining pipeline is divided into 3 steps:

  • hashes downloads one Common Crawl snapshot and computes hashes for each paragraph
  • mine removes duplicates, detects language, runs the LM and splits by lang/perplexity buckets
  • regroup regroups the files created by mine into chunks of 4GB

Each step needs the previous one to be finished before starting. You can launch the full pipeline using python -m cc_net; most of the flags below can be combined (see the example after the list).

  • python -m cc_net --help shows help
  • python -m cc_net --dump 2019-13 processes a specific snapshot
  • python -m cc_net -l my -l gu restricts to specific languages
  • python -m cc_net --lm_dir my_lms/ uses custom LMs
  • python -m cc_net --lang_threshold 0.3 sets a specific field in mine.Config
  • python -m cc_net --config test runs on a tiny subset of a snapshot
  • python -m cc_net --config config/my_config.json uses configuration from the given config file
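
For instance, the following sketch (assuming the flags compose as documented above) mines the 2019-13 snapshot for German and French only, using custom LMs:

python -m cc_net --dump 2019-13 -l de -l fr --lm_dir my_lms/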

Reproducing our work

Given the CPU time required to run the full pipeline on such a big corpus, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper by running:

python -m cc_net --conf reproduce --dump 2019-09
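
If you only need part of the corpus, the language filter shown earlier should also apply here (an assumption; the exact behaviour depends on the reproduce configuration), e.g.:

python -m cc_net --conf reproduce --dump 2019-09 -l en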

Extract XLM-R data

The model from the Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa) paper was trained on data extracted by an internal version of cc_net.

Because the format is a little different, please use the following commands instead:

python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8

If you use this version of the data please also consider citing:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

Adapting to your infrastructure

Given the computation cost of running the full pipeline, we distributed the computation on a Slurm cluster using submitit. submitit will default to spawning processes on your local machine if no Slurm cluster is found. You should tweak --task_parallelism to something adapted to your machine. Defaults are 512 for mining and 20 for reproducing.

To run the tasks in-process use --execution debug.
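
For example, on a single machine without Slurm you might lower the parallelism, or run a small test entirely in-process (a sketch combining the flags above; values are illustrative):

python -m cc_net --dump 2019-13 --task_parallelism 8
python -m cc_net --config test --execution debug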

Output format

Generated files are compressed JSON files. There is one JSON object per line.

List of fields:

  • url: webpage URL (part of CC)
  • date_download: date of download (part of CC)
  • digest: sha1 digest of the webpage (part of CC)
  • length: number of chars
  • nlines: number of lines
  • source_domain: web domain of the webpage
  • title: page title (part of CC)
  • raw_content: webpage content after deduplication
  • original_nlines: number of lines before deduplication
  • original_length: number of chars before deduplication
  • language: language detected by FastText LID
  • language_score: language score
  • perplexity: perplexity of a LM trained on Wikipedia

Sample JSON object:

{
  "url": "http://www.pikespeakhospice.org/members/1420",
  "date_download": "2019-02-15T18:40:25Z",
  "digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
  "length": 752,
  "nlines": 5,
  "source_domain": "www.pikespeakhospice.org",
  "title": "LeeRoy Aragon",
  "raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
  "original_nlines": 7,
  "original_length": 754,
  "language": "en",
  "language_score": 0.99,
  "perplexity": 255.11,
}

You can peek at these files using the UNIX tools zcat and jq, e.g.: zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .

jq can do some complicated filtering (see the example below). jsonql.py provides a Python API with multiprocessing support for more complicated operations such as LM scoring of the documents.
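
For example, a small jq sketch that keeps only the URLs of documents with a confident language prediction and low perplexity (field names are taken from the list above; the thresholds are illustrative):

zcat data/mined/2019-09/en_head_0000.json.gz | jq -r 'select(.language_score > 0.95 and .perplexity < 300) | .url'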

License

By contributing to cc_net, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.
