26 Jul 2025

DSFSI at ACL 2025: Advancing African Language NLP, Tools, and Licensing

We’re excited to share that the Data Science for Social Impact (DSFSI) research group contributed four papers on African Natural Language Processing (NLP) to the ACL 2025 Conference in Vienna, Austria. These papers reflect our continued commitment to equitable, sustainable, and technically robust research for low-resource languages, spanning datasets, evaluation frameworks, language tools, and community-grounded licensing.

🔁 AfroCS-xs: Compact, High-Quality Code-Switched Data

Paper: AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages

This work introduces AfroCS-xs, a human-validated, synthetic code-switched dataset in Afrikaans, Sesotho, Yoruba, isiZulu, and English, targeting the agricultural domain. Built with LLMs and native-speaker validation, the dataset demonstrates that even small, high-quality resources can significantly improve translation models that are fine-tuned on code-switched text.

📚 HausaNLP: Mapping the Landscape of Hausa Language Processing

Paper: HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing

Despite having over 120 million first-language speakers, Hausa remains underrepresented in NLP research. This paper offers a comprehensive overview of current Hausa NLP resources, model gaps, and community challenges. The authors propose concrete directions to enhance dataset development, modeling strategies, and collaborative research.

🌈 BRIGHTER: Emotion Recognition Across 28 Languages

Paper: BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages

BRIGHTER delivers emotion-labeled datasets in 28 languages from Africa, Asia, Latin America, and Eastern Europe. Annotated by fluent speakers, these datasets support emotion classification and intensity recognition, helping to make NLP systems more emotionally aware and globally inclusive.

🗣️ The Esethu Framework: Sustainable Dataset Licensing for African Languages

Paper: The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

The Esethu Framework introduces a new model for community-centered data licensing and curation. With the Esethu License and the release of the Vuk’uzenzele isiXhosa Speech Dataset (ViXSD), the project highlights how ethical, locally led processes can power automatic speech recognition for African languages while protecting contributor rights.

🌍 Common Threads: Toward Inclusive, Responsible African NLP

Across these projects, we see strong unifying themes:

Human-centered datasets, even small ones, drive significant NLP improvements.
Ethical licensing and governance are essential to sustainable NLP ecosystems.
Cross-institutional collaboration continues to shape inclusive, African-led research.
Tooling and evaluation for African languages are maturing, with more to come.

A heartfelt thank you to all collaborators, contributors, and community partners who made this work possible. These projects are milestones not only in African NLP research, but also in building a more inclusive and impactful future for global language technologies.

📢 Stay engaged with our work by visiting dsfsi.co.za and following us on Twitter/X and LinkedIn.