The "Mafoko: South African Terminology, Lexicon, and Glossary Project" collects, cleans, and processes South African language terminology lists, lexicons, and glossaries. It forms part of the broader mission of the Data Science for Social Impact (DSFSI) group to liberate and openly share as many language resources as possible. The original authors maintain the quality and accuracy of each resource, so questions or clarifications about the content should be directed to them. By making these linguistic assets readily accessible, the project seeks to support language preservation, linguistic research, and educational opportunities across South Africa's diverse linguistic landscape.
Database | Description | Documentation | CSV | JSONL |
---|---|---|---|---|
DSAC | The Department of Sports, Arts and Culture (DSAC) project supports the collaborative development and dissemination of terminological resources, thereby promoting the use of African languages in teaching and learning at higher education institutions. | README | data/dsac/combined_dsac.csv, view on datasette | data/dsac/combined_dsac.jsonl, view on datasette |
StatsSA | The Multilingual Statistical Terminology Project by Stats SA develops statistical terminology in South Africa's 11 official languages to enhance access to vital data for all citizens, ensuring a deeper understanding and connection to the information that affects their lives. | README | data/statssa/statssa_multilingual_statistical_terminology.csv, view on datasette | data/statssa/statssa_multilingual_statistical_terminology.jsonl, view on datasette |
UNISA Multilingual | The South African Multilingual Linguistic Terminology (SAMLT) Project is a multilingual termbank containing 500 linguistic terms translated across nine South African languages. Each term includes translations by field experts, accompanied by concise definitions and usage examples to clarify technical linguistic concepts for classroom and academic use. This resource addresses the need for standardized linguistic terminology in African languages, supporting linguistics education and research across South Africa's diverse linguistic landscape. | README | data/unisa_multilingual/unisa_multilingual_linguistic_terminology.csv, view on datasette | data/unisa_multilingual/unisa_multilingual_linguistic_terminology.jsonl, view on datasette |
UNISA Robotics | The UNISA Multilingual Robotics Glossary is a collection of approximately 100 robotics and engineering terminology entries translated across South Africa's 11 official languages. This glossary was developed by the University of South Africa (UNISA) through its Inspired towards Science, Engineering and Technology (I-SET) program, in collaboration with the Department of Linguistics and Modern Languages and the Department of African Languages. This resource aims to make robotics education accessible in mother-tongue languages throughout South Africa, supporting STEM education and bridging the gap between technical terminology and linguistic diversity. | README | data/unisa_robotics/unisa_robotics_multilingual_glossary.csv, view on datasette | data/unisa_robotics/unisa_robotics_multilingual_glossary.jsonl, view on datasette |
UP Glossary | The University of Pretoria Multilingual Academic Glossaries project promotes access to academic terminology in Afrikaans, English, and Northern Sotho to support multilingual teaching and learning, fostering inclusivity and linguistic diversity in higher education. | README | data/up_glossary/combined/combined_up_glossary.csv, view on datasette | data/up_glossary/combined/combined_up_glossary.jsonl, view on datasette |
OERTB | The Open Resource Term Bank (OERTB) project supports the collaborative development and dissemination of terminological resources, thereby promoting the use of African languages in teaching and learning at higher education institutions. | TBA | TBA | TBA |
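
The CSV and JSONL files listed in the table can be read with the Python standard library alone. The sketch below is one possible way to do that, not an official loader for this project: `load_records` and `find_entries` are hypothetical helper names, and the code assumes the files are UTF-8 encoded, that each CSV has a header row, and that each JSONL file holds one JSON object per line.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load a terminology file into a list of dicts, choosing the parser by extension."""
    path = Path(path)
    if path.suffix == ".jsonl":
        # JSON Lines: one JSON object per non-empty line (assumed layout).
        with path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
    if path.suffix == ".csv":
        # CSV: the first row is assumed to hold the column names.
        with path.open(encoding="utf-8", newline="") as handle:
            return list(csv.DictReader(handle))
    raise ValueError(f"Unsupported file type: {path.suffix}")


def find_entries(records, column, query):
    """Return records whose `column` value contains `query`, case-insensitively."""
    needle = query.casefold()
    return [r for r in records if needle in str(r.get(column, "")).casefold()]
```

For example, `records = load_records("data/dsac/combined_dsac.jsonl")` loads the DSAC file from a local checkout. The column names differ between databases and are not documented here, so inspect `records[0].keys()` before passing a column name to `find_entries`.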
@dataset{dsfsi-mafoko,
date = {2025},
title = {Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP},
url = {https://github.com/dsfsi/za-mafoko/},
author = {Vukosi Marivate and Isheanesu Dzingirai and Fiskani Banda and Richard Lastrucci and Thapelo Sindane and Keabetswe Madumo and Kayode Olaleye and Abiodun Modupe and Unarine Netshifhefhe and Herkulaas Combrink and Mohlatlego Nakeng and Matome Ledwaba}
}
@article{marivate2025mafokostructuringbuildingopen,
title={Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP},
author={Vukosi Marivate and Isheanesu Dzingirai and Fiskani Banda and Richard Lastrucci and Thapelo Sindane and Keabetswe Madumo and Kayode Olaleye and Abiodun Modupe and Unarine Netshifhefhe and Herkulaas Combrink and Mohlatlego Nakeng and Matome Ledwaba},
year={2025},
eprint={2508.03529},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.03529},
}