MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

Overview

MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

This repository contains links to data and code to fetch and reproduce the data described in our EMNLP 2021 paper titled "MassiveSumm: a very large-scale, very multilingual, news summarisation dataset". A (massive) multilingual dataset consisting of 92 diverse languages, across 35 writing scripts. With this work we attempt to take the first steps towards providing a diverse data foundation for in summarisation in many languages.

Disclaimer: The data is noisy and recall-oriented. In fact, we highly recommend reading our analysis on the efficacy of this type of methods for data collection.

Get the Data

Redistributing data from web is a tricky matter. We are working on providing efficient access to the entire dataset, as well as expanding it even further. For the time being we only provide links to reproduce subsets of the entire dataset through either common crawl and the wayback machine. The dataset is also available upon request ([email protected]).

In the table below is a listing of files containing URLs and metadata required to fetch data from common crawl.

lang wayback cc
afr link -
amh link link
ara link link
asm link -
aym link -
aze link link
bam link link
ben link link
bod link link
bos link link
bul link link
cat link -
ces link link
cym link link
dan link link
deu link link
ell link link
eng link link
epo link -
fas link link
fil link -
fra link link
ful link link
gle link link
guj link link
hat link link
hau link link
heb link -
hin link link
hrv link -
hun link link
hye link link
ibo link link
ind link link
isl link link
ita link link
jpn link link
kan link link
kat link link
khm link link
kin link -
kir link link
kor link link
kur link link
lao link link
lav link link
lin link link
lit link link
mal link link
mar link link
mkd link link
mlg link link
mon link link
mya link link
nde link link
nep link link
nld link -
ori link link
orm link link
pan link link
pol link link
por link link
prs link link
pus link link
ron link -
run link link
rus link link
sin link link
slk link link
slv link link
sna link link
som link link
spa link link
sqi link link
srp link link
swa link link
swe link -
tam link link
tel link link
tet link -
tgk link -
tha link link
tir link link
tur link link
ukr link link
urd link link
uzb link link
vie link link
xho link link
yor link link
yue link link
zho link link
bis - link
gla - link

Cite Us!

Please cite us if you use our data or methodology

@inproceedings{varab-schluter-2021-massivesumm,
    title = "{M}assive{S}umm: a very large-scale, very multilingual, news summarisation dataset",
    author = "Varab, Daniel  and
      Schluter, Natalie",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.797",
    pages = "10150--10161",
    abstract = "Current research in automatic summarisation is unapologetically anglo-centered{--}a persistent state-of-affairs, which also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful methodology application, large-scale datasets can now be simply gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing articles in 92 languages, spread across 28.8 million articles, in more than 35 writing scripts. This is both the largest, most inclusive, existing automatic summarisation dataset, as well as one of the largest, most inclusive, ever published datasets for any NLP task. We present the first investigation on the efficacy of resource building from news platforms in the low-resource language setting. Finally, we provide some first insight on how low-resource language settings impact state-of-the-art automatic summarisation system performance.",
}
Owner
Daniel Varab
🐦: @danielvarab
Daniel Varab
This is the official PyTorch implementation of our paper: "Artistic Style Transfer with Internal-external Learning and Contrastive Learning".

Artistic Style Transfer with Internal-external Learning and Contrastive Learning This is the official PyTorch implementation of our paper: "Artistic S

51 Dec 20, 2022
Code for MSc Quantitative Finance Dissertation

MSc Dissertation Code ReadMe Sector Volatility Prediction Performance Using GARCH Models and Artificial Neural Networks Curtis Nybo MSc Quantitative F

2 Dec 01, 2022
This is code to fit per-pixel environment map with spherical Gaussian lobes, using LBFGS optimization

Spherical Gaussian Optimization This is code to fit per-pixel environment map with spherical Gaussian lobes, using LBFGS optimization. This code has b

41 Dec 14, 2022
Deep learning library for solving differential equations and more

DeepXDE Voting on whether we should have a Slack channel for discussion. DeepXDE is a library for scientific machine learning. Use DeepXDE if you need

Lu Lu 1.4k Dec 29, 2022
Eth brownie struct encoding example

eth-brownie struct encoding example Overview This repository contains an example of encoding a struct, so that it can be used in a function call, usin

Ittai Svidler 2 Mar 04, 2022
Preparation material for Dropbox interviews

Dropbox-Onsite-Interviews A guide for the Dropbox onsite interview! The Dropbox interview question bank is very small. The bank has been in a Chinese

386 Dec 31, 2022
VACA: Designing Variational Graph Autoencoders for Interventional and Counterfactual Queries

VACA Code repository for the paper "VACA: Designing Variational Graph Autoencoders for Interventional and Counterfactual Queries (arXiv)". The impleme

Pablo Sánchez-Martín 16 Oct 10, 2022
商品推荐系统

商品top50推荐系统 问题建模 本项目的数据集给出了15万左右的用户以及12万左右的商品, 以及对应的经过脱敏处理的用户特征和经过预处理的商品特征,旨在为用户推荐50个其可能购买的商品。 推荐系统架构方案 本项目采用传统的召回+排序的方案。

107 Dec 29, 2022
Gym-TORCS is the reinforcement learning (RL) environment in TORCS domain with OpenAI-gym-like interface.

Gym-TORCS Gym-TORCS is the reinforcement learning (RL) environment in TORCS domain with OpenAI-gym-like interface. TORCS is the open-rource realistic

naoto yoshida 400 Dec 27, 2022
Code and data for the paper "Hearing What You Cannot See"

Hearing What You Cannot See: Acoustic Vehicle Detection Around Corners Public repository of the paper "Hearing What You Cannot See: Acoustic Vehicle D

TU Delft Intelligent Vehicles 26 Jul 13, 2022
Pretrained models for Jax/Haiku; MobileNet, ResNet, VGG, Xception.

Pre-trained image classification models for Jax/Haiku Jax/Haiku Applications are deep learning models that are made available alongside pre-trained we

Alper Baris CELIK 14 Dec 20, 2022
Official code repository for the publication "Latent Equilibrium: A unified learning theory for arbitrarily fast computation with arbitrarily slow neurons"

Latent Equilibrium: A unified learning theory for arbitrarily fast computation with arbitrarily slow neurons This repository contains the code to repr

Computational Neuroscience, University of Bern 3 Aug 04, 2022
A large-image collection explorer and fast classification tool

IMAX: Interactive Multi-image Analysis eXplorer This is an interactive tool for visualize and classify multiple images at a time. It written in Python

Matias Carrasco Kind 23 Dec 16, 2022
A mini-course offered to Undergrad chemistry students

The best way to use this material is by forking it by click the Fork button at the top, right corner. Then you will get your own copy to play with! Th

Raghu 19 Dec 19, 2022
An easy-to-use app to visualise attentions of various VQA models.

Ask Me Anything: A tool for visualising Visual Question Answering (AMA) An easy-to-use app to visualise attentions of various VQA models. Please click

Apoorve 37 Nov 13, 2022
Implementation of "Fast and Flexible Temporal Point Processes with Triangular Maps" (Oral @ NeurIPS 2020)

Fast and Flexible Temporal Point Processes with Triangular Maps This repository includes a reference implementation of the algorithms described in "Fa

Oleksandr Shchur 20 Dec 02, 2022
you can add any codes in any language by creating its respective folder (if already not available).

HACKTOBERFEST-2021-WEB-DEV Beginner-Hacktoberfest Need Your first pr for hacktoberfest 2k21 ? come on in About This is repository of Responsive Portfo

Suman Sharma 8 Oct 17, 2022
Deep Dual Consecutive Network for Human Pose Estimation (CVPR2021)

Beanie - is an asynchronous ODM for MongoDB, based on Motor and Pydantic. It uses an abstraction over Pydantic models and Motor collections to work wi

295 Dec 29, 2022
A tight inclusion function for continuous collision detection

Tight-Inclusion Continuous Collision Detection A conservative Continuous Collision Detection (CCD) method with support for minimum separation. You can

Continuous Collision Detection 89 Jan 01, 2023
D2LV: A Data-Driven and Local-Verification Approach for Image Copy Detection

Facebook AI Image Similarity Challenge: Matching Track —— Team: imgFp This is the source code of our 3rd place solution to matching track of Image Sim

16 Dec 25, 2022