⚠️ IMPORTANT: Work in Progress
This dataset is not final. Releases and updates will continue until the end of September 2025.
We appreciate your patience while we expand and improve this dataset. Stay updated by watching this repo.
Swivuriso is a large-scale multilingual speech dataset targeting over 3,000 hours of audio across 7 South African languages. The dataset is developed to support Automatic Speech Recognition (ASR) and inclusive speech technologies for low-resource African languages. It combines scripted and unscripted speech, collected through ethical, community-centered processes.
Dataset Paper: arXiv - Work in Progress
Way With Words (Partner), Data Science Law Lab @ UP (Partner), Gates Foundation (Funder), Meta (Funder)
Vukuzenzele Newspaper [Website][Data Repo], Wikipedia, African Wordnet, GrainSA, Agricultural Research Council, SADILAR, Masakhane