Swivuriso is unique in that it combines both scripted and unscripted speech, reflecting how people actually use language in daily life. All recordings are collected through ethical, community-centered processes, ensuring that participants are fairly engaged and that the data benefits the wider community. This approach strengthens both the quality of the dataset and its long-term impact.

The dataset covers the following seven languages, with the goal of building a balanced resource that reflects South Africa’s linguistic diversity:
- isiZulu – 500h
- isiXhosa – 500h
- Sesotho – 500h
- Sepedi – 500h
- Setswana – 500h
- isiNdebele – 250h
- Tshivenda – 250h

In total, the dataset will reach 3,000 hours of high-quality, multilingual audio. These recordings will form the foundation for robust ASR models, helping to break literacy barriers, make digital content locally relevant, and accelerate innovation in South African language technologies.
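
As a quick sanity check on the figures above, the per-language targets can be summed to confirm the stated total. The sketch below is illustrative only: the names `TARGET_HOURS`, `total_hours`, and `share_of_total` are hypothetical and not part of any Swivuriso tooling. The hour values are copied from the language list.

```python
# Target hours per language, copied from the language list in this document.
# The dictionary name and helper functions are hypothetical illustrations.
TARGET_HOURS = {
    "isiZulu": 500,
    "isiXhosa": 500,
    "Sesotho": 500,
    "Sepedi": 500,
    "Setswana": 500,
    "isiNdebele": 250,
    "Tshivenda": 250,
}


def total_hours(targets: dict[str, int]) -> int:
    """Return the sum of all per-language target hours."""
    return sum(targets.values())


def share_of_total(targets: dict[str, int]) -> dict[str, float]:
    """Return each language's fraction of the overall target."""
    total = total_hours(targets)
    return {lang: hours / total for lang, hours in targets.items()}


if __name__ == "__main__":
    print(f"Languages: {len(TARGET_HOURS)}")
    print(f"Total target hours: {total_hours(TARGET_HOURS)}")
    for lang, share in share_of_total(TARGET_HOURS).items():
        print(f"{lang}: {share:.1%}")
```

Five languages at 500 hours plus two at 250 hours gives 2,500 + 500 = 3,000 hours, which matches the stated total.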