Project Description

The "Mafoko: South African Terminology, Lexicon, and Glossary Project" is dedicated to the comprehensive collection, meticulous cleaning, and transformative processing of South African language terminology lists, lexicons, and glossaries. This initiative is an integral part of the broader mission of the Data Science for Social Impact (DSFSI) lab/group, which aims to liberate and openly share as many language resources as possible. The quality and accuracy of each resource are maintained by the original authors, ensuring the integrity and authenticity of the linguistic data. For any questions or clarifications regarding the content, users are encouraged to directly contact the original authors. By making these linguistic assets readily accessible, the project seeks to enhance language preservation, support linguistic research, and foster educational opportunities across South Africa's diverse linguistic landscape.

| Database | Description | Documentation | CSV | JSONL |
| --- | --- | --- | --- | --- |
| DSAC | The Department of Sports, Arts and Culture (DSAC) project supports the collaborative development and dissemination of terminological resources, thereby promoting the use of African languages in teaching and learning at higher education institutions. | README | data/dsac/combined_dsac.csv, view on datasette | data/dsac/combined_dsac.jsonl, view on datasette |
| StatsSA | The Multilingual Statistical Terminology Project by Stats SA develops statistical terminology in South Africa's 11 official languages to enhance access to vital data for all citizens, ensuring a deeper understanding and connection to the information that affects their lives. | README | data/statssa/statssa_multilingual_statistical_terminology.csv, view on datasette | data/statssa/statssa_multilingual_statistical_terminology.jsonl, view on datasette |
| UNISA Multilingual | The South African Multilingual Linguistic Terminology (SAMLT) Project is a multilingual termbank containing 500 linguistic terms translated across nine South African languages. Each term includes translations by field experts, accompanied by concise definitions and usage examples that clarify technical linguistic concepts for classroom and academic use. The resource addresses the need for standardized linguistic terminology in African languages, supporting linguistics education and research across South Africa. | README | data/unisa_multilingual/unisa_multilingual_linguistic_terminology.csv, view on datasette | data/unisa_multilingual/unisa_multilingual_linguistic_terminology.csv, view on datasette |
| UNISA Robotics | The UNISA Multilingual Robotics Glossary is a collection of approximately 100 robotics and engineering terminology entries translated across South Africa's 11 official languages. The glossary was developed by the University of South Africa (UNISA) through its Inspired towards Science, Engineering and Technology (I-SET) program, in collaboration with the Department of Linguistics and Modern Languages and the Department of African Languages. It aims to make robotics education accessible in mother-tongue languages throughout South Africa, supporting STEM education and bridging the gap between technical terminology and linguistic diversity. | README | data/unisa_robotics/unisa_robotics_multilingual_glossary.csv, view on datasette | data/unisa_robotics/unisa_robotics_multilingual_glossary.jsonl, view on datasette |
| UP Glossary | The University of Pretoria Multilingual Academic Glossaries project promotes access to academic terminology in Afrikaans, English, and Northern Sotho to support multilingual teaching and learning, fostering inclusivity and linguistic diversity in higher education. | README | data/up_glossary/combined/combined_up_glossary.csv, view on datasette | data/up_glossary/combined/combined_up_glossary.jsonl, view on datasette |
| OERTB | The Open Educational Resource Term Bank (OERTB) project supports the collaborative development and dissemination of terminological resources, thereby promoting the use of African languages in teaching and learning at higher education institutions. | TBA | TBA | TBA |
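
The CSV and JSONL files listed above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: it assumes a local checkout of the repository, UTF-8 encoded files, a header row in each CSV, and one JSON object per line in each JSONL file. The function names (`read_jsonl`, `read_csv`, `load_terms`) are hypothetical, and because the column names are not documented in this section the code does not assume any.

```python
from __future__ import annotations

import csv
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def read_csv(path: Path) -> Iterator[dict]:
    """Yield one dict per CSV row, keyed by the header row."""
    with path.open(encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def load_terms(path: str | Path) -> list[dict]:
    """Load a terminology file, choosing the parser from the file extension.

    Assumption: files end in .jsonl or .csv, as in the table above.
    """
    path = Path(path)
    if path.suffix == ".jsonl":
        return list(read_jsonl(path))
    if path.suffix == ".csv":
        return list(read_csv(path))
    raise ValueError(f"Unsupported file type: {path.suffix!r}")
```

From the root of a checkout, `load_terms("data/dsac/combined_dsac.jsonl")` would return the DSAC entries as a list of dictionaries. Inspecting the keys of the first record is a reasonable way to discover the actual fields before relying on any of them.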

Attributions

DSAC Attribution

| Attribution Name | Dataset Link |
| --- | --- |
| DSAC Election Terminology Attribution | CSV Dataset |
| DSAC Life Orientation Terminology Attribution | CSV Dataset |
| DSAC Arts & Culture Terminology – Intermediate Phase Attribution | CSV Dataset |
| DSAC Engineering & Construction Terminology Attribution | CSV Dataset |
| DSAC Financial Terminology Attribution | CSV Dataset |
| DSAC HIV and AIDS Terminology Attribution | CSV Dataset |
| DSAC Human, Social, Economic & Management Sciences Terminology Attribution | CSV Dataset |
| DSAC ICT Dictionary Attribution | CSV Dataset |
| DSAC Mathematics Dictionary (Grades R–6) Attribution | CSV Dataset |
| DSAC Parliamentary Dictionary Attribution | CSV Dataset |
| DSAC Soccer Terminology Attribution | CSV Dataset |
| DSAC Natural Science & Technology Term List – Nguni Languages Attribution | CSV Dataset |
| DSAC Natural Sciences & Technology Dictionary – Sotho Attribution | CSV Dataset |
| DSAC Natural Sciences & Technology (Grades 4–6) Attribution | CSV Dataset |
| DSAC Pharmacy Terminology – First Edition Attribution | CSV Dataset |
| DSAC Pharmacy Terminology – Second Edition Attribution | CSV Dataset |

BibTeX Citation


      @dataset{dsfsi-mafoko,
        date = {2025},
        title = {Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP},
        url = {https://github.com/dsfsi/za-mafoko/},
        author = {Vukosi Marivate and Isheanesu Dzingirai and Fiskani Banda and Richard Lastrucci and Thapelo Sindane and Keabetswe Madumo and Kayode Olaleye and Abiodun Modupe and Unarine Netshifhefhe and Herkulaas Combrink and Mohlatlego Nakeng and Matome Ledwaba}
      }

      @article{marivate2025mafokostructuringbuildingopen,
        title = {Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP},
        author = {Vukosi Marivate and Isheanesu Dzingirai and Fiskani Banda and Richard Lastrucci and Thapelo Sindane and Keabetswe Madumo and Kayode Olaleye and Abiodun Modupe and Unarine Netshifhefhe and Herkulaas Combrink and Mohlatlego Nakeng and Matome Ledwaba},
        year = {2025},
        eprint = {2508.03529},
        archivePrefix = {arXiv},
        primaryClass = {cs.CL},
        url = {https://arxiv.org/abs/2508.03529}
      }