Tools to download and clean up Common Crawl data

cc_net

Tools to download and clean Common Crawl as introduced in our paper CCNet.

If you found these resources useful, please consider citing:

@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}

Installation

We have only tried this on Linux, but installation should be possible on macOS too.

  1. Create a data folder, or symlink one, where you want to download the corpus.

  2. Run make install. This will download some resources and install required packages.

  3. If you have a C++17 compiler, you can also run pip install .[getpy], which provides a more memory-efficient hashset.

  4. Install the following tools manually if make install failed:

Training Language Models

The Makefile is used to train SentencePiece models and language models (LMs) on Wikipedia data.

  • make help shows help
  • make lang=de lm trains a SentencePiece model and an LM on German Wikipedia
  • make all_lm trains the same models as in the paper
  • make lang=de dl_lm downloads the LM trained for the paper
  • make dl_all_lm downloads all of them

Pipeline overview

The full mining pipeline is divided into 3 steps:

  • hashes downloads one Common Crawl snapshot and computes hashes for each paragraph
  • mine removes duplicates, detects the language, runs the LM, and splits the output into language/perplexity buckets
  • regroup regroups the files created by mine into chunks of 4 GB
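
The hashing and deduplication idea behind the first two steps can be sketched in a few lines. This is a minimal illustration, not the cc_net implementation: the normalization rule, the truncated SHA-1, and the function names paragraph_hash and dedup_paragraphs are assumptions made for the example.

```python
import hashlib
from typing import Iterable, List, Set


def paragraph_hash(paragraph: str) -> bytes:
    """Hash a normalized paragraph (illustrative rule: lowercase, collapse whitespace)."""
    normalized = " ".join(paragraph.lower().split())
    # Keeping only the first 8 bytes saves memory (an assumption for this sketch).
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:8]


def dedup_paragraphs(documents: Iterable[str]) -> List[str]:
    """Drop every paragraph (line) whose hash was already seen earlier."""
    seen: Set[bytes] = set()
    cleaned = []
    for doc in documents:
        kept = []
        for line in doc.split("\n"):
            h = paragraph_hash(line)
            if h in seen:
                continue
            seen.add(h)
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned


# Hypothetical input: two pages sharing a boilerplate line.
docs = ["Cookie notice\nFirst article body.", "cookie  notice\nSecond article body."]
result = dedup_paragraphs(docs)
```

In this sketch the repeated "cookie notice" line is kept only where it first appears, which is why the hashes for a snapshot have to be available before duplicates can be removed.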

Each step needs the previous step to be over before starting. You can launch the full pipeline using python -m cc_net.

  • python -m cc_net --help shows help
  • python -m cc_net --dump 2019-13 processes a specific snapshot
  • python -m cc_net -l my -l gu restricts to specific languages
  • python -m cc_net --lm_dir my_lms/ uses custom LMs
  • python -m cc_net --lang_threshold 0.3 sets a specific field in mine.Config
  • python -m cc_net --config test runs on a tiny subset of a snapshot
  • python -m cc_net --config config/my_config.json uses configuration from the given config file
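
A configuration file for the last form could be produced as below. This is a sketch under assumptions: it supposes that the JSON keys mirror the command-line flag names used in this document (dump, lang_threshold, execution), which should be checked against mine.Config before use. The path config/my_config.json is the placeholder from the list above.

```python
import json
from pathlib import Path

# Assumed keys: they mirror the CLI flags shown in this document;
# verify them against the fields of mine.Config.
config = {
    "dump": "2019-13",
    "lang_threshold": 0.3,
    "execution": "debug",
}

config_path = Path("config") / "my_config.json"
config_path.parent.mkdir(parents=True, exist_ok=True)
config_path.write_text(json.dumps(config, indent=2))

# Read the file back to confirm it is valid JSON.
loaded = json.loads(config_path.read_text())
```

The resulting file would then be passed with python -m cc_net --config config/my_config.json.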

Reproducing our work

Given the CPU time required to run the full pipeline on such a large corpus, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper with:

python -m cc_net --conf reproduce --dump 2019-09

Extract XLM-R data

The model from the paper Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa) was trained on data extracted by an internal version of cc_net.

Because the format is slightly different, please use the following commands instead:

python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8

If you use this version of the data please also consider citing:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

Adapting to your infrastructure

Given the computational cost of running the full pipeline, we distributed the computation on a Slurm cluster using submitit. submitit will default to spawning processes on your machine if no Slurm cluster is found. You should tweak --task_parallelism to something adapted to your machine. Defaults are 512 for mining and 20 for reproducing.

To run the tasks in-process use --execution debug.

Output format

Generated files are compressed JSON files. There is one JSON object per line.

List of fields:

  • url: webpage URL (part of CC)
  • date_download: date of download (part of CC)
  • digest: sha1 digest of the webpage (part of CC)
  • length: number of chars
  • nlines: number of lines
  • source_domain: web domain of the webpage
  • title: page title (part of CC)
  • raw_content: webpage content after deduplication
  • original_nlines: number of lines before deduplication
  • original_length: number of chars before deduplication
  • language: language detected by FastText LID
  • language_score: language score
  • perplexity: perplexity of a LM trained on Wikipedia
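
For reference, perplexity relates to a language model's log-probabilities as sketched below. The example assumes per-token base-10 log-probabilities, as n-gram toolkits commonly report; the function name and the sample scores are hypothetical and do not come from cc_net.

```python
from typing import Sequence


def perplexity(log10_probs: Sequence[float]) -> float:
    """Perplexity = 10 ** (-(sum of per-token log10 probabilities) / token count)."""
    if not log10_probs:
        raise ValueError("need at least one token")
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))


# Hypothetical scores: four tokens, each with probability 0.01 (log10 = -2).
pp = perplexity([-2.0, -2.0, -2.0, -2.0])
```

A lower value means the LM finds the text more predictable, that is, closer to the Wikipedia text it was trained on.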

Sample JSON object:

{
  "url": "http://www.pikespeakhospice.org/members/1420",
  "date_download": "2019-02-15T18:40:25Z",
  "digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
  "length": 752,
  "nlines": 5,
  "source_domain": "www.pikespeakhospice.org",
  "title": "LeeRoy Aragon",
  "raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
  "original_nlines": 7,
  "original_length": 754,
  "language": "en",
  "language_score": 0.99,
  "perplexity": 255.11
}

You can peek at those files using the UNIX tools zcat and jq, e.g.: zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .

jq can do some complicated filtering. jsonql.py provides a Python API with multiprocess support to do more complicated operations like LM scoring of the document.
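
For simple filtering, the Python standard library is enough. The sketch below does not use jsonql.py: it reads a gzipped file with one JSON object per line and keeps documents by language and perplexity. The function name iter_documents, the threshold of 500, and the tiny sample file written for the demonstration are hypothetical.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, Iterator


def iter_documents(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Write a two-document sample file so the example runs without mined data.
sample = [
    {"url": "http://example.com/a", "language": "en", "perplexity": 255.11},
    {"url": "http://example.com/b", "language": "en", "perplexity": 1200.0},
]
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for doc in sample:
        f.write(json.dumps(doc) + "\n")

# Keep English documents whose perplexity is below an arbitrary threshold.
kept = [d["url"] for d in iter_documents(path)
        if d["language"] == "en" and d["perplexity"] < 500]
```

To run this on real output, replace the sample path with a mined file such as data/mined/2019-09/en_head_0000.json.gz.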

License

By contributing to cc_net, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.
