24 Nov 2025

DSFSI 2025 Retrospective (Day 1): African NLP on the World Stage

Day 1 of our 7-day retrospective: Celebrating DSFSI's research excellence and global recognition in 2025, from ACL papers to groundbreaking multilingual resources.

Welcome to Day 1 of our 7-day retrospective celebrating DSFSI’s remarkable journey through 2025. Over the next week, we’ll be sharing the milestones, achievements, and moments that defined our year.

African NLP on the World Stage

2025 was a landmark year for DSFSI’s research impact. Our work reached global audiences, advanced the state of the art in African language processing, and demonstrated that locally-grounded research can drive international excellence.

🏆 ACL 2025: Three Papers, One Vision

In July, the Data Science for Social Impact (DSFSI) research group presented three papers at ACL 2025 in Vienna, Austria, each pushing the boundaries of African Natural Language Processing.

1. AfroCS-xs: Quality Over Quantity in Code-Switching

AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages

Code-switching, the natural mixing of languages within a conversation, is how many Africans actually speak. Yet most NLP systems struggle with this reality. Our team created AfroCS-xs, a human-validated synthetic code-switched dataset covering four African languages (Afrikaans, Sesotho, Yoruba, and isiZulu) and English in the agricultural domain.

The breakthrough? Even small, high-quality resources can dramatically improve translation models when they reflect real language use.

2. HausaNLP: Mapping the Landscape

HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing

Hausa, spoken by over 120 million people across West Africa, remains underrepresented in NLP research. This comprehensive survey maps the current landscape of Hausa NLP resources, identifies critical gaps, and proposes concrete directions for dataset development, modeling strategies, and collaborative research.

It’s more than a survey—it’s a roadmap for a linguistic community that deserves technological equity.

3. BRIGHTER: Emotions Across 28 Languages

BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages

Emotions are universal, but their expression is culturally specific. BRIGHTER delivers emotion-labeled datasets in 28 languages from Africa, Asia, Latin America, and Eastern Europe—all annotated by fluent speakers.

This work helps make NLP systems more emotionally aware and globally inclusive, moving beyond the English-centric models that dominate the field.

It’s not just about building datasets—it’s about building them right.

📚 Mafoko: Opening Terminologies for African Languages

In November, we released Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP, presented at the Sixth Workshop on Resources for African Indigenous Languages (RAIL 2025).

Key achievements:

  • 25,000+ term entries across 11 official South African languages
  • Structured, machine-readable formats (CSV and JSON)
  • Released under NOODL (Nwulite Obodo Open Data License)—an Africa-centered, equitable data license

Terminologies are the backbone of domain-aware language technology. By opening these resources, we’re enabling:

  • Improved machine translation and RAG pipelines
  • Benchmarks for low-resource language evaluation
  • Tools for spellcheckers, voice assistants, and educational platforms

🔗 GitHub Repository 🤗 Hugging Face Collection
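
To make the "machine-readable" point concrete, here is a minimal sketch of how a terminology file in CSV form could back a simple glossary lookup, the kind of step a translation or RAG pipeline might build on. The column names, the sample rows, and the helper functions `build_glossary` and `lookup` are assumptions for illustration only; they are not the actual Mafoko schema or an official API.

```python
import csv
import io

# Hypothetical sample in an assumed column layout. The real Mafoko files
# may use different column names and contain different entries.
SAMPLE_CSV = """english,afrikaans,isizulu,sesotho
water,water,amanzi,metsi
school,skool,isikole,sekolo
"""


def build_glossary(csv_text, source="english"):
    """Map each lower-cased source-language term to its row of translations."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[source].strip().lower(): row for row in reader}


def lookup(glossary, term, target):
    """Return the target-language term, or None if the term is not listed."""
    row = glossary.get(term.strip().lower())
    return row[target] if row else None


glossary = build_glossary(SAMPLE_CSV)
print(lookup(glossary, "Water", "isizulu"))   # case-insensitive match
print(lookup(glossary, "tractor", "sesotho"))  # unknown term gives None
```

With a real terminology file, the same pattern would read from disk instead of an inline string, and the lookup could feed a spellchecker word list or constrain terminology in machine translation output.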

🌟 Recognition and Media Impact

Our research excellence was recognized beyond academic circles:

  • Nature Magazine’s ZA African Next Voices featured our work advancing African language technologies
  • SAICSIT Emerging Pioneer Award recognized leadership in advancing South African computer science research
  • Nature Careers Podcast featured Prof. Vukosi Marivate discussing African AI and career pathways in research

📊 By the Numbers: 2025 Research Impact

  • 3 papers at ACL 2025 (main conference + AfricaNLP workshop)
  • 25,000+ terms released in Mafoko
  • 28 languages supported in BRIGHTER emotion datasets
  • 11 South African languages with structured terminologies
  • Multiple workshops organized and co-organized at international conferences

🔬 Common Threads: Toward Inclusive, Responsible African NLP

Across all our research, several themes emerge:

  1. Human-centered datasets, even small ones, drive significant NLP improvements
  2. Ethical licensing and governance (NOODL) are essential to sustainable ecosystems
  3. Cross-institutional collaboration shapes inclusive, African-led research
  4. Tooling and evaluation for African languages continue to mature

What This Means

2025 proved that African NLP research is not just catching up—it’s leading. We’re not adapting Western models; we’re reimagining what language technology should be: adaptive, ethical, community-grounded, and technically excellent.

From Vienna to arXiv, from Hausa to isiXhosa, from emotion recognition to data governance—DSFSI is putting African languages and African values at the center of global AI conversations.


Tomorrow (Day 2): We celebrate visionary leadership—including Prof. Vukosi Marivate’s landmark Inaugural Lecture: “Beyond the Symbols.”


This is Day 1 of our 7-day retrospective celebrating DSFSI’s 2025 journey. Follow along all week as we share the milestones, people, and vision that defined our year.