This corpus is now a GitHub project, which makes it much easier to submit corrections to the data or add new translations to the corpus.


Here you can find a multilingual parallel corpus created from translations of the Bible. This is an effort to create a parallel corpus containing as many languages as possible that could be used for a number of NLP tasks. Using the Book, Chapter and Verse indices, the corpus is aligned (almost) at the sentence level; there are cases where two verses in one language are translated as one in another.


Following a similar effort by Philip Resnik and Mari Broman Olsen at the University of Maryland, I have encoded the text of each language in XML files using the Corpus Encoding Standard. Refer to the following paper for more details about the creation of the corpus:

    A massively parallel corpus: the Bible in 100 languages, Christos Christodoulopoulos and Mark Steedman, Language Resources and Evaluation, 49 (2)
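
As a rough illustration of the verse-level alignment, the sketch below reads two tiny inline XML samples and pairs the verses that share an identifier. The element and attribute names (`seg`, `type="verse"`, ids such as `b.GEN.1.1`), the helper names `read_verses` and `align`, and the sample strings are all assumptions made for this example; check them against the actual corpus files before relying on them.

```python
import xml.etree.ElementTree as ET

# Toy stand-ins for two corpus files. The markup shown here is an
# assumed layout for illustration, not a copy of the real files.
ENGLISH = """<cesDoc><text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
<seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
<seg id="b.GEN.1.3" type="verse">And God said, Let there be light: and there was light.</seg>
</div></div></body></text></cesDoc>"""

# Verse 2 is deliberately absent to mimic a verse with no counterpart.
LATIN = """<cesDoc><text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In principio creavit Deus caelum et terram.</seg>
<seg id="b.GEN.1.3" type="verse">Dixitque Deus: Fiat lux. Et facta est lux.</seg>
</div></div></body></text></cesDoc>"""


def read_verses(xml_text):
    """Return a dict mapping verse id to whitespace-normalised verse text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for seg in root.iter("seg"):
        verse_id = seg.get("id")
        if seg.get("type") == "verse" and verse_id:
            text = " ".join("".join(seg.itertext()).split())
            if text:  # skip empty verses
                verses[verse_id] = text
    return verses


def align(src, tgt):
    """Pair verses whose ids occur on both sides, keeping source order."""
    return [(vid, text, tgt[vid]) for vid, text in src.items() if vid in tgt]


pairs = align(read_verses(ENGLISH), read_verses(LATIN))
for verse_id, english, latin in pairs:
    print(verse_id, "|", english, "|", latin)
```

Verses present on only one side are simply dropped here, which is one crude way to handle the cases where two verses in one language correspond to a single verse in another.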


The following table lists the XML Bibles in 100 languages (all the languages for which an electronic version was freely available online), along with information about each language from Ethnologue.


Armin Hoenen from the Text Technology Lab at the Goethe Universität has created tokenised versions of four languages (Chinese, Japanese, Thai, Vietnamese). They can be found at https://www.hucompute.org/ressourcen/corpora or under the original languages in the table below.



| ISO 639-3 | Language | Family | Genus | Subgenus | Speakers | Script | Parts |
|---|---|---|---|---|---|---|---|
| acu | Achuar-Shiwiar | Jivaroan | | | 5,000 | Latin | New Testament |
| afr | Afrikaans | Indo-European | Germanic | West | 5,000,000 | Latin | COMPLETE |
| agr | Aguaruna | Jivaroan | | | 38,300 | Latin | New Testament |
| ake | Akawaio | Carib | Northern | East-West Guiana | 4,500 | Latin | New Testament |
| als | Albanian | Indo-European | Albanian | Tosk | 3,000,000 | Latin | COMPLETE |
| amh | Amharic | Afro-Asiatic | Semitic | South | 17,500,000 | Ethiopic | COMPLETE |
| amu | Amuzgo | Oto-Manguean | Amuzgoan | | 23,000 | Latin | New Testament |
| arb | Arabic | Afro-Asiatic | Semitic | Central | 206,000,000 | Arabic | COMPLETE |
| hye | Armenian | Indo-European | Armenian | | 6,400,000 | Armenian | Gen. Exod. Gosp. |
| djk | Aukan | Creole | English based | Atlantic | 15,500 | Latin | New Testament |
| bsn | Barasana-Eduria | Tucanoan | Eastern Tucanoan | Central | 1,890 | Latin | New Testament |
| eus | Basque | Basque | | | 700,000 | Latin | New Testament |
| bul | Bulgarian | Indo-European | Slavic | South | 9,000,000 | Cyrillic | COMPLETE |
| cjp | Cabécar | Chibchan | Talamanca | | 8,840 | Latin | New Testament |
| cak | Cakchiquel | Mayan | Quichean | Greater Quichean | 132,000 | Latin | New Testament |
| cni | Campa (Asháninka) | Arawakan | Maipuran | Southern Maipuran | 26,100 | Latin | New Testament |
| kbh | Camsá | Equatorial (?) | | | 4,770 | Latin | New Testament |
| ceb | Cebuano | Austronesian | Malayo-Polynesian | Philippine | 15,800,000 | Latin | COMPLETE |
| cha | Chamorro | Austronesian | Malayo-Polynesian | Chamorro | 92,000 | Latin | Psalm Gosp. Acts |
| chr | Cherokee | Iroquoian | Southern Iroquoian | | 16,400 | Cherokee | New Testament |
| chq | Chinantec (Quiotepec) | Oto-Manguean | Chinantecan | | 8,000 | Latin | New Testament |
| cmn | Chinese | Sino-Tibetan | Sinitic | Chinese | 840,000,000 | Chinese | COMPLETE |
| | ↳ Chinese (tokenised) [Tokenisation by Armin Hoenen, using Stanford Word Segmenter 3.5.2] | | | | | | |
| cop | Coptic | Afro-Asiatic | Egyptian | | Extinct | Coptic | New Testament |
| hrv | Croatian | Indo-European | Slavic | South | 5,500,000 | Latin | COMPLETE |
| ces | Czech | Indo-European | Slavic | West | 9,500,000 | Latin | COMPLETE |
| dan | Danish | Indo-European | Germanic | North | 5,500,000 | Latin | COMPLETE |
| dik | Dinka | Nilo-Saharan | Eastern Sudanic | Nilotic | 450,000 | Latin | New Testament |
| nld | Dutch | Indo-European | Germanic | West | 15,700,000 | Latin | COMPLETE |
| eng | English | Indo-European | Germanic | West | 328,000,000 | Latin | COMPLETE |
| | ↳ English (WEB translation) [Added by Stephen Mayhew] | | | | | | |
| epo | Esperanto | Constructed | | | 1,000 | Latin | COMPLETE |
| est | Estonian | Uralic | Finno-Ugric | Finno-Permic | 1,000,000 | Latin | Gen. + New Testament |
| ewe | Ewe | Niger-Congo | Atlantic-Congo | Volta-Congo | 2,250,000 | Latin | New Testament |
| pes | Farsi (Persian) | Indo-European | Indo-Iranian | Iranian | 22,000,000 | Arabic | COMPLETE |
| fin | Finnish | Uralic | Finno-Ugric | Finno-Permic | 5,000,000 | Latin | COMPLETE |
| fra | French | Indo-European | Italic | Romance | 58,000,000 | Latin | COMPLETE |
| gla | Gaelic (Scottish) | Indo-European | Celtic | Insular | 67,000 | Latin | Gospel of Mark |
| gbi | Galela | West Papuan | North Halmahera | Galela-Loloda | 79,000 | Latin | New Testament |
| deu | German | Indo-European | Germanic | West | 90,300,000 | Latin | COMPLETE |
| ell | Greek | Indo-European | Greek | Attic | 13,000,000 | Greek | COMPLETE |
| guj | Gujarati | Indo-European | Indo-Iranian | Indo-Aryan | 45,500,000 | Gujarati | New Testament |
| hat | Haitian Creole | Creole | | | 7,700,000 | Latin | COMPLETE |
| heb | Hebrew | Afro-Asiatic | Semitic | Central | 5,300,000 | Hebrew | COMPLETE |
| hin | Hindi | Indo-European | Indo-Iranian | Indo-Aryan | 180,000,000 | Devanagari | COMPLETE |
| hun | Hungarian | Uralic | Finno-Ugric | Ugric | 12,500,000 | Latin | COMPLETE |
| isl | Icelandic | Indo-European | Germanic | North | 230,000 | Latin | COMPLETE |
| ind | Indonesian | Austronesian | Malayo-Polynesian | Malayo-Sumbawan | 23,100,000 | Latin | COMPLETE |
| ita | Italian | Indo-European | Italic | Romance | 61,700,000 | Latin | COMPLETE |
| jai | Jakalteko | Mayan | Kanjobalan-Chujean | Kanjobalan | 77,700 | Latin | New Testament |
| jpn | Japanese | Japonic | | | 122,000,000 | Kanji | COMPLETE |
| | ↳ Japanese (tokenised) [Tokenisation by Armin Hoenen, using kyTea 0.4.7] | | | | | | |
| quc | K'iche' | Mayan | Quichean-Mamean | Greater Quichean | 1,900,000 | Latin | New Testament |
| | ↳ K'iche' (SIL orthography) | | | | | | |
| kab | Kabyle | Afro-Asiatic | Berber | Northern | 3,100,000 | Latin | New Testament |
| kan | Kannada | Dravidian | Southern | Tamil-Kannada | 35,300,000 | Kannada | COMPLETE |
| kor | Korean | Altaic (?) | | | 66,300,000 | Hangul | COMPLETE |
| lat | Latin | Indo-European | Italic | Latino-Faliscan | Extinct | Latin | COMPLETE |
| lav | Latvian | Indo-European | Baltic | Eastern | 1,500,000 | Latin | New Testament |
| lit | Lithuanian | Indo-European | Baltic | Eastern | 3,100,000 | Latin | COMPLETE |
| dop | Lukpa | Niger-Congo | Atlantic-Congo | Volta-Congo | 50,000 | Latin | New Testament |
| plt | Malagasy | Austronesian | Malayo-Polynesian | Greater Barito | 7,520,000 | Latin | COMPLETE |
| mal | Malayalam | Dravidian | Southern | Tamil-Kannada | 35,400,000 | Malayalam | COMPLETE |
| mam | Mam | Mayan | Quichean-Mamean | Greater Mamean | 200,000 | Latin | New Testament |
| glv | Manx | Indo-European | Celtic | Insular | 77,000 | Latin | Esth. Jonah Gosp. |
| mri | Maori | Austronesian | Malayo-Polynesian | Central-Eastern | 60,000 | Latin | COMPLETE |
| mar | Marathi | Indo-European | Indo-Iranian | Indo-Aryan | 68,000,000 | Devanagari | COMPLETE |
| mya | Myanmar (Burmese) | Sino-Tibetan | Tibeto-Burman | Lolo-Burmese | 32,300,000 | Myanmar | COMPLETE |
| nhg | Nahuatl (Tetelcingo) | Uto-Aztecan | Southern Uto-Aztecan | Aztecan | 3,500 | Latin | New Testament |
| nep | Nepali | Indo-European | Indo-Iranian | Indo-Aryan | 11,100,000 | Devanagari | COMPLETE |
| nor | Norwegian | Indo-European | Germanic | North | 4,600,000 | Latin | COMPLETE |
| ojb | Ojibwa | Algic | Algonquian | Central | 20,000 | Aboriginal Syllabics | New Testament |
| pck | Paite (Chin) | Sino-Tibetan | Tibeto-Burman | Kuki-Chin-Naga | 78,800 | Latin | COMPLETE |
| pol | Polish | Indo-European | Slavic | West | 36,600,000 | Latin | COMPLETE |
| por | Portuguese | Indo-European | Italic | Romance | 178,000,000 | Latin | COMPLETE |
| pot | Potawatomi | Algic | Algonquian | Central | 1,300,000 | Latin | Matthew Acts |
| kek | Q'eqchi' | Mayan | Quichean-Mamean | Greater Quichean | 400,000 | Latin | COMPLETE |
| quw | Quichua | Quechuan | Quechua II | B | 20,000 | Latin | New Testament |
| rmn | Romani | Indo-European | Indo-Iranian | Indo-Aryan | 710,000 | Latin | New Testament |
| ron | Romanian | Indo-European | Italic | Romance | 23,400,000 | Latin | COMPLETE |
| rus | Russian | Indo-European | Slavic | East | 143,000,000 | Cyrillic | COMPLETE |
| srp | Serbian | Indo-European | Slavic | South | 7,000,000 | Latin | COMPLETE |
| shn | Shona | Niger-Congo | Atlantic-Congo | Volta-Congo | 12,563,100 | Latin | COMPLETE |
| jiv | Shuar (Jivaro) | Jivaroan | | | 46,700 | Latin | New Testament |
| slk | Slovak | Indo-European | Slavic | West | 4,610,000 | Latin | COMPLETE |
| slv | Slovene | Indo-European | Slavic | South | 1,730,000 | Latin | COMPLETE |
| som | Somali | Afro-Asiatic | Cushitic | East | 8,340,000 | Latin | COMPLETE |
| spa | Spanish | Indo-European | Italic | Romance | 328,000,000 | Latin | COMPLETE |
| swh | Swahili | Niger-Congo | Atlantic-Congo | Volta-Congo | 788,000 | Latin | New Testament |
| swe | Swedish | Indo-European | Germanic | North | 8,300,000 | Latin | COMPLETE |
| arc | Syriac | Afro-Asiatic | Semitic | Central | Extinct | Syriac | New Testament |
| shi | Tachelhit | Afro-Asiatic | Berber | Northern | 3,000,000 | Latin | New Testament |
| tgl | Tagalog | Austronesian | Malayo-Polynesian | Philippine | 23,900,000 | Latin | COMPLETE |
| ttq | Tamajaq (Tuareg) | Afro-Asiatic | Berber | Tamasheq | 640,000 | Latin | Portions |
| tel | Telugu | Dravidian | South-Central | Telugu | 69,600,000 | Telugu | COMPLETE |
| tha | Thai | Tai-Kadai | Kam-Tai | Be-Tai | 20,300,000 | Thai | COMPLETE |
| | ↳ Thai (tokenised) [Tokenisation by Armin Hoenen, using java.text.BreakIterator with locale TH] | | | | | | |
| tur | Turkish | Altaic | Turkic | Southern | 50,000,000 | Latin | COMPLETE |
| ukr | Ukrainian | Indo-European | Slavic | East | 37,000,000 | Cyrillic | New Testament |
| ppk | Uma | Austronesian | Malayo-Polynesian | Celebic | 20,000 | Latin | New Testament |
| usp | Uspanteco | Mayan | Quichean-Mamean | Greater Quichean | 3,000 | Latin | New Testament |
| vie | Vietnamese | Austro-Asiatic | Mon-Khmer | Viet-Muong | 68,600,000 | Latin | COMPLETE |
| | ↳ Vietnamese (tokenised) [Tokenisation by Armin Hoenen, using vnTokenizer 4.1.1] | | | | | | |
| wal | Wolaytta | Afro-Asiatic | Omotic | North | 1,230,000 | Latin | New Testament |
| wol | Wolof | Niger-Congo | Atlantic-Congo | Atlantic | 4,000,000 | Latin | New Testament |
| xho | Xhosa | Niger-Congo | Atlantic-Congo | Volta-Congo | 7,800,000 | Latin | COMPLETE |
| dje | Zarma | Nilo-Saharan | Songhai | Southern | 2,350,000 | Latin | COMPLETE |
| zul | Zulu | Niger-Congo | Atlantic-Congo | Volta-Congo | 9,980,000 | Latin | New Testament |

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.