IndoNLI: A Natural Language Inference Dataset for Indonesian
This is a repository for data and code accompanying our EMNLP 2021 paper "IndoNLI: A Natural Language Inference Dataset for Indonesian". The datasets used for our experiments can be found under the data directory:
-
indonli: human-annotated NLI data, split intotrain,val, andtest(test_layandtest_expert)diagnostic: subset of examples from test_expert that are annotated with linguistic and logical phenomena -
translate_train.tar.gz: MNLI dataset translated to Indonesian (train and dev) -
translate_train_small.tar.gz: sampled oftranslate_trainused for thetranslate_train_smallexperiment.
The experiment code can be found under experiment directory, please check the related README file.
License
We use premises taken from the Indonesian Wikipedia, news, and Web articles.
Wikipedia is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA) and the GNU Free Documentation License (GFDL).
For the news genre, we use premise text from Indonesian PUD and GSD treebanks provided by the Universal Dependencies 2.5 (Zeman et al., 2019) and IndoSum (Kurniawan and Louvan, 2018). Indonesian PUD and GSD treebanks are licensed under Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA) and Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA). IndoSum is licensed under Apache License, Version 2.0.
Citation
If you use our corpus in your work, please consider citing our paper:
@inproceedings{indonli,
title = "IndoNLI: A Natural Language Inference Dataset for Indonesian",
author = "Mahendra, Rahmad and Aji, Alham Fikri and Louvan, Samuel and Rahman, Fahrurrozi and Vania, Clara",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
publisher = "Association for Computational Linguistics",
}