English loanwords in the world's languages

Overview

Badges: CLDF validation, DOI

Wiktionary as CLDF

Content

  • cldf1 and cldf2 contain CLDF-conformant datasets with a total of 2,377,756 entries covering the vocabulary of all 1,403 languages on the English Wiktionary.
  • raw1 and raw2 together contain 1,403 CSV files, 125 MB in total. File names are the languages as they appear on the English Wiktionary. Each file consists of 4 columns: 'L2_orth', the orthographic form of the word; 'L2_ipa', its IPA transcription; 'L2_gloss', its English explanation; and 'L2_etym', its etymology, recorded if and only if the word is borrowed from English (a minimal loading example follows after this list).
  • lgs contains text files with wordlists for every language that appears on the English Wiktionary. The files were created with WiktionaryParser.java.
  • WiktionaryParser.java was kindly provided by Tomasz Jastrząb and was used to retrieve the wordlists found in the folder lgs.
  • lglist.txt is a complete list of the languages that appear on the English Wiktionary.
  • lglist_full.txt is a copy of lglist.txt - since the latter serves as input for makedfs.py, it can be modified as needed without losing the full list.
  • LICENSE: MIT
  • makedfs.py - the parser with which the CSV files were obtained. At a download speed of 144 Mbps it took 58 hours to parse all the languages from aari to zuni.
  • makedfs.ipynb - notes documenting the development of the parser.
  • parser.log - documents corrupted file names and the handling of errors that occurred while squeezing the parsed data into data frames.
  • dfs is an empty folder into which the parser writes its results. The generated output was migrated to raw1 and raw2 due to GitHub's limit of 1,000 files per directory.
  • changelog.txt - documents the manual deletion of false-positive and insertion of false-negative English loanwords.
  • cldf is an empty folder to which dfs2cldf.py writes its output. The generated output was migrated to the folders cldf1 and cldf2 due to GitHub's limit of 1,000 files per directory.
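
As a quick sanity check of the four-column layout described above, the raw files can be read with pandas. This is only a minimal sketch: the file name raw1/swahili.csv is an illustrative guess, since the CSVs are simply named after their languages.

```python
import pandas as pd

# Hypothetical file name; the raw CSVs are named after the language they cover.
df = pd.read_csv("raw1/swahili.csv")

# The four columns described above
print(df[["L2_orth", "L2_ipa", "L2_gloss", "L2_etym"]].head())

# Rows with a non-empty L2_etym are the entries marked as borrowed from English
loans = df[df["L2_etym"].notna()]
print(f"{len(loans)} English loanword candidates")
```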

Remarks

  • Sometimes the column "L2_etym" is not displayed by GitHub's CSV viewer. This is likely the case whenever the first 100 lines of the column are empty. Clicking on "Raw" makes the column visible again.
  • The columns carry the "L2_" prefix because this data was first used for baseline tests, where the words served as pseudo-donor words (hence "L2" ~ second language ~ donor language), even though in the current setting they represent the recipient language (L1). The L1-L2 distinction is purely internal.

Todo

  • remove middle_english.csv and old_english.csv, and generally any file with middle_ or old_ in its name.
  • add missing IPA transcriptions using epitran, copius_api, espeak-ng and potentially other software (a sketch follows below this list).
  • try to contribute those new IPA transcriptions back to Wiktionary.
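
A possible starting point for the IPA item above, as a hedged sketch: it assumes epitran and the espeak-ng binary are installed, and the language-code mappings and file path below are hypothetical examples that would need to be filled in per language.

```python
"""Sketch: fill in missing IPA transcriptions for one raw CSV file."""
import subprocess

import epitran
import pandas as pd

# Hypothetical mappings from language/file name to transcription backends.
LANG2EPITRAN = {"german": "deu-Latn"}  # epitran language-script codes
LANG2ESPEAK = {"german": "de"}         # espeak-ng voice names


def espeak_ipa(word: str, voice: str) -> str:
    """Ask espeak-ng for an IPA transcription (-q: no audio, --ipa: IPA output)."""
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, word],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def fill_missing_ipa(csv_path: str, lang: str) -> pd.DataFrame:
    """Return the data frame with empty L2_ipa cells filled where a backend exists."""
    df = pd.read_csv(csv_path)
    epi = epitran.Epitran(LANG2EPITRAN[lang]) if lang in LANG2EPITRAN else None
    for i in df.index[df["L2_ipa"].isna()]:
        word = str(df.at[i, "L2_orth"])
        if epi is not None:
            df.at[i, "L2_ipa"] = epi.transliterate(word)
        elif lang in LANG2ESPEAK:
            df.at[i, "L2_ipa"] = espeak_ipa(word, LANG2ESPEAK[lang])
    return df


# e.g. fill_missing_ipa("raw1/german.csv", "german")  # hypothetical path
```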