16 Mar 2026

CommonLID: A New Benchmark for Cross-Lingual Speech Identification – and Our Contribution from the University of Pretoria

The landscape of cross-lingual speech identification (CLSI) is rapidly evolving, and a new preprint, CommonLID, promises to significantly advance the field. This initiative represents a collaborative effort involving a vast network of researchers and institutions, and the University of Pretoria (UP) played a key role in its development.

What is CommonLID?

CommonLID aims to provide a comprehensive, standardized benchmark for evaluating CLSI systems. It addresses a critical need: existing datasets often cover only a limited number of languages, or lack the diversity needed to reflect real-world scenarios. CommonLID tackles this by encompassing a far broader range of languages, with a particular emphasis on under-resourced African languages. The dataset includes both “core” languages (those already well represented in existing research) and a large number of African languages.

The Challenge of African Languages

As highlighted in the CommonLID work, existing large language models (LLMs) like GPT-5 struggle significantly when applied to African languages in a CLSI setting. The performance gap is substantial, with F1 scores up to 30% lower than those achieved on core languages. This underscores the importance of datasets like CommonLID that prioritize these languages and drive the development of models better attuned to their characteristics.
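For context, F1 comparisons of this kind are typically computed per language and then macro-averaged, so that each language counts equally regardless of how much data it has. A minimal sketch of that metric in plain Python (the language labels and predictions below are synthetic, purely for illustration, not data from the paper):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per language, then average with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lang in labels:
        tp = sum(t == lang and p == lang for t, p in zip(y_true, y_pred))
        fp = sum(t != lang and p == lang for t, p in zip(y_true, y_pred))
        fn = sum(t == lang and p != lang for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Synthetic ground-truth and predicted language labels (illustrative only)
y_true = ["eng", "eng", "hau", "hau", "zul", "zul"]
y_pred = ["eng", "eng", "hau", "zul", "zul", "hau"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.667
```

Because macro-averaging weights every language equally, errors on under-resourced languages are not drowned out by strong performance on high-resource ones, which is exactly why the reported gap on African languages is visible in this metric.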

University of Pretoria’s Contribution

The University of Pretoria, under the leadership of Dr. Idris Abdulmumin at DSFSI (Data Science for Social Impact), spearheaded the coordination of the project and the establishment of the associated shared tasks. This involved significant logistical and organizational effort, ensuring a smooth and collaborative process for the diverse team of researchers involved. The work aligns with UP’s commitment to leveraging data science to address critical social challenges within Africa and beyond. A large consortium of institutions contributed to the effort.

Key Highlights from the CommonLID Paper

  • Scale and Diversity: The dataset represents a significant expansion in the number of languages included in CLSI benchmarks.
  • Focus on African Languages: A dedicated focus on African languages addresses a critical gap in existing resources.
  • Performance Disparity: The paper reveals a significant performance gap between LLMs and specialized models (like GlotLID) when dealing with African languages. GlotLID, despite requiring more resources, outperforms GPT-5 by a considerable margin in this context.
  • Shared Tasks: The creation of shared tasks fosters collaboration and accelerates progress in CLSI research.

Looking Ahead

CommonLID represents a pivotal step towards building more robust and inclusive CLSI systems. The work from the University of Pretoria and the broader collaborative effort will serve as a foundation for future research, particularly in developing models that can effectively handle the challenges presented by under-resourced languages. We anticipate that CommonLID will become an essential resource for researchers in NLP, data science, and related fields.