Mafoko: Open multilingual terminologies for South African NLP — preprint & RAIL 2025 presentation
We’ve released Mafoko — an open, structured terminology resource for South African languages. Preprint live on arXiv and presented at RAIL 2025. Explore, reuse, and contribute.

🎉 We are excited to announce that the final preprint of Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP is live on arXiv, and we presented the work at the Sixth Workshop on Resources for African Indigenous Languages (RAIL 2025).
- Preprint (arXiv): https://arxiv.org/abs/2508.03529
- GitHub (data + code): https://github.com/dsfsi/za-mafoko/
- Hugging Face collection: https://huggingface.co/collections/dsfsi/za-mafoko
What is Mafoko?
Mafoko is a curated, open collection of multilingual terminologies gathered from government and academic sources across South Africa. The project converts scattered glossaries, many of them not machine-readable, into interoperable, reusable formats so they can be used directly in NLP pipelines, research, and educational tools.
Key highlights
- 25,000+ term entries aggregated across multiple sources.
- 11 official South African languages covered in the initial release.
- Data released in CSV and JSON (TBX planned), with rich provenance metadata (source, date, contributor, ISO codes).
- Released under NOODL — an Africa-centered, equitable data license that supports local benefit-sharing.
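To make the CSV release and its provenance fields more concrete, here is a minimal sketch of how such a file might be loaded and turned into bilingual term pairs. The column names (`concept_id`, `lang`, `term`, `source`, `date`) and the sample rows are assumptions made up for illustration; they are not the actual Mafoko schema or data, so check the files in the repository before relying on them.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only: column names and values are assumptions,
# not the actual Mafoko schema or data.
SAMPLE_CSV = """concept_id,lang,term,source,date
c1,eng,water,example-glossary,2020
c1,zul,amanzi,example-glossary,2020
c1,ven,maḓi,example-glossary,2020
c2,eng,school,example-glossary,2020
c2,zul,isikole,example-glossary,2020
"""


def load_terms(csv_text):
    """Group rows by concept so each concept maps a language code to a term."""
    concepts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        concepts[row["concept_id"]][row["lang"]] = row["term"]
    return dict(concepts)


def bilingual_pairs(concepts, src, tgt):
    """Return (source, target) term pairs for concepts that have both languages."""
    return [(t[src], t[tgt]) for t in concepts.values() if src in t and tgt in t]


concepts = load_terms(SAMPLE_CSV)
print(bilingual_pairs(concepts, "eng", "zul"))
print(bilingual_pairs(concepts, "eng", "ven"))
```

Grouping by a shared concept identifier is one possible layout; if the real files store one row per concept with a column per language, the pairing step becomes a simple column selection instead.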
Why this matters
Terminologies are core to domain-aware language technology. By structuring and opening these resources we:
- Improve domain consistency in machine translation and retrieval-augmented generation (RAG) pipelines.
- Provide benchmarks and gold-standard targets for low-resource language evaluation.
- Enable linguists, educators, and developers to build culturally and linguistically appropriate tools — from spellcheckers to voice assistants.
Quick demo result (proof-of-concept)
We integrated Mafoko into a RAG pipeline and tested English→Tshivenda translation with GPT-4o-mini and LLaMA3-8B. Adding Mafoko terminologies to prompts substantially improved BLEU and chrF metrics — a clear signal that terminology-aware retrieval helps LLMs translate domain-specific and rare vocabulary more accurately.
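The idea of adding terminology to a prompt can be sketched as follows. This is not the pipeline from the preprint: it swaps in a plain whole-word lookup for the retrieval step, uses a hypothetical two-entry English to Tshivenda glossary, and stops at building the prompt text rather than calling a model. The function names and the prompt wording are assumptions.

```python
import re

# Hypothetical glossary for illustration; not taken from the Mafoko data.
GLOSSARY = {"water": "maḓi", "school": "tshikolo"}


def retrieve_terms(sentence, glossary):
    """Return glossary entries whose source term occurs as a whole word."""
    hits = []
    for src, tgt in glossary.items():
        pattern = r"\b" + re.escape(src) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            hits.append((src, tgt))
    return hits


def build_prompt(sentence, glossary, target_language="Tshivenda"):
    """Build a translation prompt that lists any matching glossary terms."""
    hits = retrieve_terms(sentence, glossary)
    lines = [f"Translate the following English sentence into {target_language}."]
    if hits:
        lines.append("Use these glossary terms where they apply:")
        lines.extend(f"- {src} = {tgt}" for src, tgt in hits)
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)


print(build_prompt("The school opens on Monday.", GLOSSARY))
```

Exact string matching misses inflected or multi-word variants, so a real setup would likely need lemmatisation or embedding-based retrieval on top of this.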
How to get started
- Browse the repo and collection for datasets and usage notes:
  - GitHub: https://github.com/dsfsi/za-mafoko/
  - Hugging Face: https://huggingface.co/collections/dsfsi/za-mafoko
- Download CSV/JSON, inspect provenance metadata, and try simple retrieval-augmented prompts.
- Use Mafoko for evaluation (translation, NER, embeddings) or to seed dictionaries for apps and curricula.
- Contribute: report issues, suggest additional glossaries, or help with digitisation and enrichment.
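For the evaluation use mentioned above, one simple option alongside BLEU and chrF is a term match rate: of the glossary terms expected in a translation, how many actually appear in the system output. The sketch below is an assumption about how this could be done, not a metric defined by the Mafoko authors; the glossary entries and the stub outputs are made up and are not real translations.

```python
import re


def contains_term(text, term):
    """True if the term occurs in the text as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def term_match_rate(examples, glossary):
    """Fraction of expected target terms found in the outputs.

    examples: list of (source_sentence, system_output) pairs.
    glossary: dict mapping source term -> target term.
    Returns None when no glossary term occurs in any source sentence.
    """
    expected = matched = 0
    for source, output in examples:
        for src_term, tgt_term in glossary.items():
            if contains_term(source, src_term):
                expected += 1
                if contains_term(output, tgt_term):
                    matched += 1
    return matched / expected if expected else None


# Hypothetical glossary and stub outputs, for illustration only.
glossary = {"water": "maḓi", "school": "tshikolo"}
examples = [
    ("The school is closed.", "... tshikolo ..."),
    ("Water is scarce.", "... (expected term missing) ..."),
]
print(term_match_rate(examples, glossary))
```

Because many South African languages are morphologically rich, exact matching will undercount inflected forms; treat the score as a rough lower bound rather than a precise measure.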
Governance & licensing
We intentionally adopt the Nwulite Obodo Open Data License (NOODL) — a governance framework focused on equitable reuse and benefit-sharing with communities of origin. Please read the license notes in the repository before commercial reuse.
Call to the community
Mafoko is a foundation — it grows stronger with community input. If you are a researcher, translator, educator, or developer working with South African languages, please explore the data, cite the preprint, and contribute improvements or new sources.