Swivuriso: ZA-African Next Voices

⚠️ IMPORTANT: Work in Progress

This dataset is not final. Releases and updates will continue until the end of September 2025.

  • Users must regularly check this repository for new versions, corrections, and updates.
  • For attribution, benchmarking, and publications, always reference the latest available information (as of September 2025 or later).
  • Do not cite or benchmark against old/incomplete versions. Use only the most recent release and documentation.

We appreciate your patience while we expand and improve this dataset. Stay updated by watching this repo.

About the project

Swivuriso is a large-scale multilingual speech dataset targeting over 3,000 hours of audio across seven South African languages. It is being developed to support Automatic Speech Recognition (ASR) and inclusive speech technologies for low-resource African languages, and it combines scripted and unscripted speech collected through ethical, community-centered processes.

Dataset Paper: arXiv - Work in Progress

Partners, Funders & Supporters

Data Sources

Vukuzenzele Newspaper [Website][Data Repo], Wikipedia, African Wordnet, GrainSA, Agricultural Research Council, SADILAR, Masakhane