Fair Forward - Open Data & Use Cases

SDG 13

Combatting Air Pollution and GHG Emissions in India through hyperlocal AI-powered mapping

India

Under this initiative, citizen scientists and IoT-based low-cost sensors are used to collect hyperlocal air quality data. This data is used to identify pollution sources and risk zones, facilitating targeted actions by regulatory authorities. To showcase data outreach, the project features the VAYU Android-based application and the VAYU citizen portal digital stack, which support targeted interventions and customized solutions backed by AI/ML algorithms. These tools could enable new approaches to air pollution management while reducing public investment costs.

See Details

cc-by-4.0

SDG 10

SDG 8

Facilitating access to financial applications in informal settings in four Ghanaian dialects: Akuapem Twi, Asante Twi, Fante and Ga.

Ghana

Dennis Asamoah Owusu

This speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga includes 104,000 utterances (speech) across the four dialects/languages with approximately 200 speakers per dialect/language. This amounts to about 148 hours of speech in total. The dataset was developed to support the development of financial applications in native Ghanaian languages to allow illiterate and semi-literate people to fully benefit from digital financial services. Secondly, it aims to answer research questions related to domain-specific vs. general-purpose dataset development, dialects, as well as NLP system development in low resource settings. Overall, a total of 83,829 audios were recorded from which the datasets were published and made publicly accessible.

See Details

cc-by-4.0

SDG 10

SDG 5

SDG 2

Enabling machine translation from Kiswahili into the indigenous Kenyan languages Kidaw'ida, Kalenjin, and Dholuo, preserving these languages & supporting crowd-sourced voice recognition via Mozilla Common Voice for these languages

Kenya

Audrey Mbogho

The dataset was created to enable translation from Kiswahili, which is the national language in Kenya, into three indigenous languages, namely, Kidaw'ida, Kalenjin, and Dholuo. All three languages are low resource, especially Kidaw'ida, which has only around 400,000 speakers and is at immediate risk of loss. By collecting the text and speech data, the project has taken a step towards preservation of these languages. The dataset consists of three parallel corpora: Kidaw'ida-Kiswahili; Kalenjin-Kiswahili; Dholuo-Kiswahili. On average, each corpus has thirty thousand sentence pairs. In addition, the dataset has helped to preserve, revitalise and elevate the languages Kidaw'ida, Kalenjin, and Dholuo by making them work on Mozilla's Common Voice platform. The text and speech datasets thereby support crowd-sourced voice recognition, promoting linguistic diversity and empowering local communities by enabling Natural Language Processing applications tailored to their needs. Specifically, the resulting voice recognition datasets can be used to build AI voice applications, such as advisory systems in local languages for farmers, citizens, etc., that "understand" speech input (instead of written input). As of August 2025, 120 hours of speech data have been recorded on Common Voice in Dholuo (with 14,692 sentences available), 92 hours for Kalenjin (with 29,900 sentences available) and 56 hours for Kidaw'ida (with 11,773 sentences available). To download the crowd-sourced speech datasets in Kidaw'ida, Kalenjin, and Dholuo, visit https://datacollective.mozillafoundation.org/datasets?q=common+voice

See Details

Apache License 2.0

SDG 13

Helping to measure solar energy adoption across Madagascar via AI - Labelled Open solar panel data for Madagascar

Madagascar

Fabienne Rafidiharinirina, Association Maidi

This dataset will help data scientists, government and users to measure solar energy adoption across Madagascar. It laid the groundwork needed to develop a solar panel detection algorithm that works in Madagascar. Notably, this project represented all regions of the country; instead of focusing only on big cities, it also covered medium-sized and small villages as well as coasts and mountains. The team annotated 2,125 Google Earth satellite images and 9,202 drone images, forming a combination of low and high-definition solar panel views in Madagascar. The Madagascar Initiatives for Digital Innovation (MAIDI) team performed field checks for up to 25% of satellite images and, in total, annotated 22,488 polygons.

See Details

cc-by-4.0

SDG 10

Detecting sentiments and combatting hate speech in Hausa, Igbo, Nigerian-Pidgin and Yorùbá - NaijaSenti: a Nigerian Corpus for Multilingual Sentiment Analysis

Nigeria

NaijaSenti

The NaijaSenti dataset is the first large-scale human-annotated Twitter sentiment dataset for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, the four most widely spoken languages in Nigeria. It consists of around 30,000 annotated tweets per language (except for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. These datasets are useful not only for sentiment analysis but also for hate speech detection. The open-source package consists of datasets, trained models, sentiment lexicons, and code. With more than 200 million people and 522 native languages, Nigeria is the most populous and linguistically diverse country in Africa, as well as the third most multilingual country in the world. The majority of the population speaks either Hausa, Igbo, Yorùbá, or Nigerian-Pidgin.

See Details

cc-by-4.0

SDG 13

SDG 2

Enable Cashew, Cocoa and Coffee farmers to make good business decisions - Drone-based Agricultural Dataset for Crop Yield Estimation in Ghana and Uganda

Ghana, Uganda

Darlington Akogo, KaraAgro

This dataset supports yield estimation, crop type detection and classification, fruit detection and counting, and fruit maturity stage detection (unripe, ripe, and spoiled) for three products that are important sources of livelihood for millions of households in Sub-Saharan Africa. It contains 14,870 drone images with bounding box annotations of cashew, cocoa, and coffee trees collected across multiple farms in Ghana and Uganda. Conventional methods of yield estimation are expensive, require a lot of labor and time, and are prone to error due to incomplete ground observations. This results in poor crop yield estimations and hinders farmers’ ability to appropriately plan and manage their fields and production pipelines. This dataset will help transform African agriculture into agribusiness by allowing for the development of yield estimation solutions that enable farmers to make good business decisions. Having key details about agricultural production readily accessible enables a timely harvest, helping farmers ensure healthy, fresh produce and, in addition, better sales.

See Details

cc-by-4.0

SDG 13

SDG 10

SDG 5

Preserving privacy and avoiding gender bias of AI systems in Luganda, Lumasaba, Hausa, and Kanuri - The Lacuna personally identifiable information Text Dataset

Uganda, Kenya, Nigeria

Andrew Katumba - Makerere University

The Lacuna PII Multilingual Text Dataset contains annotated sentences with personally identifiable information (PII) in Luganda, Lumasaba, Hausa, and Kanuri. These four languages span Central and Eastern Uganda, Nigeria, Ghana, and Northern Cameroon. The dataset can help to anonymize and pseudonymize personally identifiable information in AI tasks. This can help organizations comply with legal requirements while still being able to analyze, use, and share their data effectively. It can also be used to improve machine translation systems for low-resource languages, improve the performance of NLP applications in these languages, and support the extraction of specific information from text, such as automated form filling and information retrieval systems. It comprises 4,000 Luganda sentences, 5,000 Lumasaba sentences, 3,000 Kanuri sentences, and 3,000 Hausa sentences. The team aimed to curate a dataset that is gender inclusive. It was created by Makerere Artificial Intelligence Lab in collaboration with Marconi Lab and Clear Global.

See Details

cc-by-4.0

SDG 7

Powering Rural Futures in West Africa: AI-Driven Demand Data for Smarter Electrification

Benin, Ghana, Niger, Togo, Nigeria

Reiner Lemoine Institut gGmbH (RLI)

The project provides two openly accessible datasets that were developed through a complete, reproducible data pipeline combining machine learning with stochastic energy-system simulation. The first dataset contains predicted appliance ownership and household counts for all 1,209 administrative level 2 regions (adm2) across Nigeria, Ghana, Togo, Benin, and Niger, derived from satellite-based features and socio-economic indicators using models trained on more than 3,500 household surveys. The second dataset consists of high-resolution synthetic electricity demand profiles generated with the RAMP tool, offering minute-by-minute load curves for an entire year for each adm2 region. Together, these datasets provide a unique, representative, and scalable foundation for understanding residential electricity demand in regions where measured data is scarce or entirely unavailable.
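For a rough sense of scale, the size of the minute-by-minute demand profiles can be estimated from the figures above. This is only a back-of-the-envelope sketch: it assumes a 365-day year, exactly one value per minute, and a single profile per adm2 region, none of which is stated in the description, and the published files may be organised differently. The function names are illustrative.

```python
# Back-of-the-envelope sizing for minute-resolution annual load profiles.
# Assumptions (not stated in the source): a 365-day year, exactly one
# value per minute, and one profile per adm2 region.

ADM2_REGIONS = 1_209          # number of adm2 regions quoted in the description
MINUTES_PER_DAY = 24 * 60


def samples_per_profile(days_in_year: int = 365) -> int:
    """Number of minute-level values in one annual load curve."""
    return days_in_year * MINUTES_PER_DAY


def total_samples(regions: int = ADM2_REGIONS, days_in_year: int = 365) -> int:
    """Number of values across all regions, one profile per region."""
    return regions * samples_per_profile(days_in_year)


if __name__ == "__main__":
    print(f"values per profile: {samples_per_profile():,}")
    print(f"values overall:     {total_samples():,}")
```

Under these assumptions each profile holds 525,600 values, so anyone planning to load all regions at once may want to process them region by region.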

See Details

cc-by-4.0

SDG 13

SDG 7

Promoting energy conservation and market analysis in Pakistan through Residential Energy and Weather Data (REWD)

Pakistan

Lahore University of Management Sciences (LUMS)

This dataset helps to understand energy consumption patterns in relation to weather conditions in Pakistan. This can guide policymaking on energy and energy conservation, sustainable energy initiatives and private sector use (market assessment and strategic planning as new entrants in the restructured energy market). The project produced an energy dataset of residential consumption data from buildings across six climatic zones. The energy dataset is recorded at 1-minute granularity for entire household usage as well as for individual appliances. With over a year of detailed energy consumption data and 18 months of weather data collected from a diverse array of households across six distinct climatic zones, it is one of the most comprehensive datasets of its kind. The meteorological dataset has been accumulated over a period of 18 months for the six urban centers. This weather data comprises up to 10 essential variables such as temperature, atmospheric pressure, humidity and precipitation, thus providing an all-around perspective of the environmental elements impacting energy consumption.

See Details

cc-by-4.0

SDG 10

SDG 2

Datasets for transportation impact evaluation

Colombia

Fundación Despacio, World Resources Institute

The team developed a labeled training dataset, derived from satellite imagery at 50 cm resolution or better, based on a novel, pre-defined road space classification taxonomy appropriate for training and deployment of large-scale deep-learning models.

See Details

cc-by-4.0

SDG 13

SDG 15

Monitoring the impact of palm oil monoculture, shrimp aquaculture & mining in continental Ecuador and the Galapagos using AI

Ecuador

Fundacion Ecociencia

The dataset can help to build systems that can monitor the impact of palm oil monoculture, shrimp aquaculture, mining and other land transformations in continental Ecuador and Galapagos. The project created a 20,000-point land use/cover classification training dataset from existing data, with labels that can be used to train multi-spectral Earth observation (EO) data machine learning (ML) models covering continental Ecuador and the Galapagos islands.

See Details

cc-by-4.0

Indigenous Knowledge Meets AI: Monitoring Elephants and Rodents in Kenya and Ecuadorian Amazon with Biodiversity AI-Datasets

Ecuador, Kenya

Space4Innovation, Diana Mastracci diana@space4innovation.com

The Ltome-Katip datasets are the first Indigenous-labelled bioacoustic datasets designed specifically to support the development of ethical AI for biodiversity monitoring. Co-created by Indigenous data stewards from the Samburu tribe in northern Kenya and the Shuar Nation in the Ecuadorian Amazon, the recordings focus on two sentinel species: Ltome (elephant) and Katip (rodent), both of which signal ecological shifts under climate stress. All data were collected, annotated, and governed by the Indigenous communities, following locally defined protocols rooted in Indigenous data sovereignty. These datasets are not only scientifically valuable — they establish a precedent for how Indigenous communities can lead in setting standards for responsible, consent-based AI development.

See Details

cc-by-4.0

AI for Mangrove Carbon Credits: Turning Forest Data into Climate Action in Côte d’Ivoire

Côte d'Ivoire

data354

cc-by-4.0

SDG 15

Mapping Cocoa Landscapes in Ghana: Reference Data for Tracking Land Use Change

Ghana

Centre for Remote Sensing and Geographic Information Services (CERSGIS)

This dataset was produced by the Centre for Remote Sensing and Geographic Information Services (CERSGIS) as part of the project Reference Data Collection for Improving Land Use Change Mapping in Ghana. The primary objective was to develop high-quality reference data to enhance the accuracy of remote sensing-based land use and land cover (LULC) change mapping using machine learning methods in Ghana’s cocoa production landscapes.

See Details

cc-by-4.0

SDG 15

Miti360: A Comprehensive Dataset for AI-Powered Forest Monitoring

Kenya

Prof. Ciira Maina, Centre for Data Science and Artificial Intelligence, Dedan Kimathi University of Technology

Miti360 is an integrated, machine-learning ready dataset for individual-tree and stand-level reforestation monitoring that fuses high-resolution drone orthophotos, terrestrial stereo and single images, precise ground measurements (tree height, crown diameter, basal diameter and GPS locations), species labels, and multi-year weather time series for nearby stations. The collection is designed to support detection, segmentation, species classification, and biophysical parameter estimation (e.g., crown diameter, height, biomass proxies), and to enable linking short-term growth dynamics to weather. Data are provided in standard GIS and ML formats (GeoTIFF, JPEG, JSON, SHP, time-series API) for immediate integration into research and operational pipelines.

See Details

cc-by-4.0

SDG 15

African Trees for Climate Resilience: A Comprehensive Database

Angola, DRC, Kenya, Mozambique, Nigeria, South Africa, Tanzania, Zambia

Professor Guy F Midgley gfmidgley@sun.ac.za University of Stellenbosch

This is an extensive bioinformatics resource that leverages tree species' distribution, medicinal, food provision, and other trait data, together with southern African trees' climate relationships and growth characteristics, for climate adaptation and mitigation planning. The data can serve to promote the use of indigenous trees for reforestation, regenerative agriculture, ecological restoration, human health and livelihood support, and urban afforestation programs to adapt to and mitigate the impacts of climate change.

See Details

cc-by-4.0

SDG 13

SDG 15

Inclusive MRV for India's Eastern Himalayas

India

Vertify.earth, Michael Anthony michael@vertify.earth, Alsisar Impact, Saurabh Singhavi saurabh@alsisarimpact.com

An AI-Driven Dataset for Nature-Positive Livelihoods and Forest Restoration in Eastern Himalayas. This open-access dataset and digital MRV (Monitoring, Reporting, and Verification) toolkit was developed to help communities, NGOs and policymakers in the Eastern Himalayas measure and manage the links between forests, livelihoods and climate resilience. The project was created in response to a persistent data gap in the Himalayan region where limited open data and difficult terrain have hindered effective monitoring of forest degradation and the impact of restoration programs. Using locally calibrated AI models trained on Indian biomass data, the system combines satellite observations with ground level socio-economic surveys to measure forest dependency, identify degradation hotspots and track restoration outcomes. All datasets ranging from community level livelihood surveys in Sikkim and Mizoram to forest biomass and land use layers in Assam are openly available. Together they form a replicable, open-source “digital public good” that can be used to train local AI models for forest monitoring, biodiversity accounting or impact assessment. By providing a scalable, context-aware dataset, this initiative enables indigenous start-ups, environmental NGOs and impact investors to quantify nature positive outcomes, align with IRIS+ indicators, and guide conservation investments. The goal is to make environmental monitoring inclusive and data driven, empowering local actors to track changes in forest carbon, identify erosion risks and measure the socio-economic benefits of restoration. Ultimately, this approach helps governments, researchers and entrepreneurs build locally adapted AI systems that support both ecosystem regeneration and sustainable livelihoods in fragile mountain regions.

See Details

cc-by-4.0

SDG 2

Data-enabled climate shock absorbance through agroforestry (Agrof4resilience)

Kenya

International Center of Insect Physiology and Ecology (ICIPE)

The Agrof4Resilience geospatial datasets are open-access resources for artificial intelligence (AI) and machine learning (ML) algorithms aimed at creating more robust, productive, resilient and diverse agro-ecological systems whose resilience can improve over time. This is especially crucial in drylands, which are plagued by land degradation and other effects of climate change. The data demonstrate that agroforestry adoption in Kenya is fairly low, while models developed from the data indicate high agroforestry potential in Kenya. This is affected by socioeconomic factors such as education levels, occupation, age, land size, income, gender, marital status, cultural beliefs, and family size. These datasets could be used to demonstrate which regions and practices are environmentally and socio-economically sound. This could provide insights into practices that could enhance livelihoods and promote environmental resilience in target communities across Kenya. The data was collected from 35 out of 47 counties in Kenya, spread across four predetermined transects that cover four main agro-ecological zones, and stored in a freely accessible database within icipe’s servers.

See Details

cc-by-4.0

SDG 15

Quantifying Colombian mangroves aboveground biomass and carbon content

Colombia

María Cuevas (mcuevas@cttc.es / Centre Tecnològic de Telecomunicacions de Catalunya, CTTC, Spain), Cristian Montes (cristian.montes@invemar.org.co / Instituto de Investigaciones Marinas y Costeras José Benito Vives de Andreis, INVEMAR, Colombia)

This open-access dataset supports machine learning (ML) applications for mangrove forest monitoring, addressing the need for more openly available and well-annotated datasets to calibrate and validate ML models. It focuses on improving the estimation of above-ground biomass (AGB) and above-ground carbon (AGC) in Colombian Caribbean mangroves. The pilot area, Via Parque Isla de Salamanca National Natural Park (VIPIS) in the Magdalena department, is a Ramsar and UNESCO Biosphere Reserve. Existing global AGB and AGC models often lack the precision and cost-effectiveness needed for regional monitoring. The dataset includes both plot-level and tree-level field measurements from 20 newly surveyed plots, providing detailed attributes such as DBH, height, species, AGB (from allometric equations), and AGC. Using these field data, high-resolution satellite imagery, and machine learning models, satellite-based AGB and AGC maps covering ~6,000 ha of mangroves at 10 m spatial resolution have also been generated.

See Details

cc-by-4.0

SDG 15

Phenological Dataset for Ecological Forecasting (PheDEF Project)

Ghana

Bismark Ofosu-Bamfo (bismark.ofosu-bamfo@uenr.edu.gh; bofosubamfo@gmail.com) and Daniel Yawson, School of Science, University of Energy and Natural Resources, Sunyani, Ghana / Raul Zurita-Milla, Faculty ITC, University of Twente, The Netherlands. Primary contact/Maintenance of datasets: Bismark Ofosu-Bamfo

The health of tropical forest ecosystems faces pressures from climate change, threatening the sustainable supply of leaves, flowers and fruits which provide important resources for wildlife, domestic animals and human settlements. Monitoring the timing of plant life cycle events (phenology) is one effective way to track the availability of plant resources and the impact of climate change and weather variability on their sustainable supply. This dataset covers 48 weeks of liana and tree phenology from ground observations, traditional ecological knowledge and camera traps in the canopy in two tropical forest ecosystems (a moist semi-deciduous and a dry semi-deciduous forest). The dataset also includes land surface phenology from satellite images and in situ weather data. Phenology data from multiple sources and climate data could be combined in a machine learning model that predicts phenology at community and landscape scales. This data will enhance the representation of tropical African forests in phenology research and contribute meaningful data from tropical African forests for machine learning applications in climate, forests and biodiversity conservation. The images and phenology labels could also be used to train models that automatically identify phenology events in forest canopy images.

See Details

cc-by-4.0

SDG 13

SDG 15

Acquisition and Analysis of Ground-Truth Data and Remote Sensing Imagery to develop an Artificial Intelligence solution for Predicting Land Use and Land Cover changes in Uganda

Uganda

1. Makerere University, Makerere AI Lab: https://mak.ac.ug, https://air.ug/ 2. National Forestry Authority (NFA): https://nfa.go.ug/

This project provides Uganda’s first openly accessible AI-ready satellite imagery dataset designed to predict land-use and land-cover change. It was created to address persistent deforestation and the lack of reliable, context-specific data needed to detect, forecast, and manage ecosystem degradation. Through an open-source replication kit including annotated Sentinel-2 and Hansen datasets, protocols, and inventory data, this product enables local innovators, forest agencies, and policymakers to build AI tools for monitoring forests, planning restoration, and supporting NDC and SDG15 actions. Its goal is to improve early detection of land-cover change, strengthen climate-smart decision-making, and empower local institutions with high-quality geospatial data.

See Details

cc-by-4.0

Towards Reliable Detection of Inefficient Residential Electricity Usage for Policy Intervention

Sri Lanka

Merl Chandana - Team Lead, Data Algorithms & Policy, LIRNEasia

This project aims to develop a longitudinal household-level electricity dataset by integrating repeated survey data with high-frequency smart meter & non-smart meter data. The survey component captures detailed information about household demographics, dwelling characteristics, appliance ownership, and energy-related behaviors, enabling these attributes to be linked to observed electricity consumption patterns. The resulting dataset is meant to facilitate analysis around experimentation with behavioral nudges, tariff reforms, and other demand-side management solutions relevant to energy affordability, efficiency, and policy in Sri Lanka.

See Details

The dataset is publicly available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license

SDG 2

Eyes on the Ground Image Data

Kenya

Lilian Waithaka, Koen Hufkens, Berber Kramer and Benson Njuguna

This machine learning dataset of smallholder farmers' fields includes georeferenced crop images along with labels on input use, crop management, phenology, crop damage, and yields, collected across 8 counties in Kenya. This dataset enables the development of AI systems to monitor crop conditions, predict yields, and support agricultural decision-making for smallholder farmers.

See Details

cc-by-4.0

SDG 2

High-Accuracy Maize Plot Location and Yield Dataset in East Africa

Kenya, Rwanda, Tanzania

One Acre Fund

This dataset includes corrected geolocations of fields, improving the usability of the most expansive crop-cut yield estimation dataset for Eastern Africa. Collected by the non-profit One Acre Fund from 2015 to 2019, this dataset covers major crop producing regions in Kenya, Rwanda, and Tanzania. It enables accurate yield mapping and agricultural monitoring across East Africa.

See Details

cc-by-4.0

SDG 2

SDG 6

Sensor Based Aquaponics Fish Pond Datasets: IoT Fish Pond Monitoring Datasets

Nigeria

Udanor Collins, Blessing Ogbuokiri, and Nweke Onyiny

This project built a remotely monitored and controlled Internet of Things (IoT) fish pond water quality management system to generate labeled datasets for both conventional ponds and aquaponic pond systems. It enables monitoring and optimization of water quality parameters for sustainable aquaculture.

See Details

cc-by-4.0

SDG 2

Machine Learning Datasets for Crop Pest and Disease Diagnosis: Crop Imagery and Spectrometry Data

Uganda, Tanzania, Ghana

Joyce Nakatumba-Nabende, Andrew Katumba, Claire Babirye, Jeremy Francis Tusubira, Godliver Owomugisha, Neema Mduma, Darlington Akogo, Blessing Sibanda

This dataset contains a repository of image and spectrometry datasets for five main food security crops in Sub-Saharan Africa: cassava, maize, beans, bananas, and cocoa. Collected and curated in collaboration with in-country agricultural experts, the datasets support a wide range of machine learning applications, including classification, object detection, early crop disease detection, and spatial analysis. The team collected and annotated 127,046 images and 39,300 spectral data points.

See Details

cc-by-4.0

SDG 2

SDG 16

A Decision-Supporting Tool for Developing Community-led Land Use Plans

Tanzania

Gladness Mwanga, Divine Ekwem

This dataset focuses on locations with predominantly pastoral communities in northern Tanzania to identify fine and broad-scale movements of livestock and land use patterns and to understand how these relate to communal conflicts. It is a high-quality, accurate and labeled dataset containing detailed information on ~2,000 communal resources (e.g., rangelands, water points, and dips) and their use patterns for over 220 villages across four large districts in northern Tanzania, representative of pastoral systems of livestock production in East Africa.

See Details

cc-by-4.0

SDG 2

SDG 1

Enhanced Agriculture Datasets for Remote Crop Monitoring to Provide Access to Essential Social and Financial Services to Smallholder Farmers in Zimbabwe

Zimbabwe

Seth Odhiambo

The project created labeled yield estimates from 3,000 farmers, which were used to train yield prediction models across the country and, in turn, to generate high-resolution crop mask layers for the different value chains. The yield prediction models were enhanced with other biophysical datasets, ranging from soil properties to climate-related indicators. The datasets served as a proof of concept for scalable training of machine learning models, which may be able to respond more appropriately and cost-effectively to agricultural stressors.

See Details

cc-by-4.0

SDG 2

A region-wide, multi-year set of crop field boundary labels for Africa

Multi-country Africa

Mary Dziedzorm Afenyo, Lyndon Estes, Primož Kovačič

This dataset provides continent-wide crop field labels for Africa, improving the availability and use of crop field boundary (parcel) maps. It contains 42,403 annotated geospatial polygons indicating the boundaries of individual crop fields spanning the years 2017-2023. This enables the development of automated field boundary detection systems for agricultural monitoring across Africa.

See Details

cc-by-4.0

SDG 2

CropHarvest: Informing decision-making around agricultural development, early warning systems, and trade in Sub-Saharan Africa

Kenya, Mali, Togo, Rwanda, Uganda, Ethiopia, Malawi, Zambia, Tanzania, Namibia, Sudan, Nigeria

Catherine Nakalembe

CropHarvest increases the understanding of the main types of food production in Sub-Saharan Africa and can help inform decision-making around agricultural development, early warning systems, and regional trade. It is a global, open-source remote sensing dataset for crop-type classification in Sub-Saharan Africa – specifically in Kenya, Mali, Togo, Rwanda, Uganda, Ethiopia, Malawi, Zambia, Tanzania, Namibia, Sudan, and Nigeria. The team expanded on an existing dataset to include new labeled data points, ground data for crop type mapping, street-level images, crowdsourced labeled images, and price data.

See Details

cc-by-4.0

SDG 2

SDG 1

Improving livelihoods in Ghana and Uganda: Drone-based Agricultural Dataset for Crop Yield Estimation of cashew, cocoa, and coffee

Ghana, Uganda

Darlington Akogo

This dataset supports yield estimation, crop type detection and classification, fruit detection and counting, and fruit maturity stage detection (unripe, ripe, and spoiled) for three products that are important sources of livelihood for millions of households in Sub-Saharan Africa. It contains 14,870 drone images with bounding box annotations of cashew, cocoa, and coffee trees collected across multiple farms in Ghana and Uganda. This dataset will help transform African agriculture into agribusiness by allowing for the development of yield estimation solutions that enable farmers to make good business decisions.

See Details

cc-by-4.0

Machine Learning Dataset for Rabies Diagnosis and Outbreak Prediction

Global

Asa Emmanuel: asakalonga@gmail.com, Kennedy Lushasi: klushasi@ihi.or.tz

This dataset will help in the real-time and remote diagnosis of rabies disease for humans and animals in low-resource settings. A time series approach can be applied to the outbreak dataset to predict the number of rabies cases likely to occur within an area after a given time interval. This approach can help with resource mobilization, too, such as identifying the number of vaccines required in a specific area at a given time. The total number of observations across the three datasets is 12,684. There are two datasets for rabies diagnosis, one for animals and one for humans, with 7,081 and 4,585 observations, respectively. The outbreak prediction dataset accounts for a further 1,018 observations.
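The observation counts quoted in this description can be cross-checked with a few lines of arithmetic: the two diagnosis counts plus the outbreak prediction count should add up to the stated total. The sketch below uses only the numbers from the description; the constant and function names are illustrative.

```python
# Observation counts as quoted in the dataset description.
DIAGNOSIS_ANIMALS = 7_081
DIAGNOSIS_HUMANS = 4_585
OUTBREAK_PREDICTION = 1_018
STATED_TOTAL = 12_684


def combined_observations() -> int:
    """Sum of the diagnosis and outbreak prediction observation counts."""
    return DIAGNOSIS_ANIMALS + DIAGNOSIS_HUMANS + OUTBREAK_PREDICTION


def totals_agree() -> bool:
    """True when the per-dataset counts add up to the stated total."""
    return combined_observations() == STATED_TOTAL


if __name__ == "__main__":
    print(f"sum of parts: {combined_observations():,}")
    print(f"matches stated total: {totals_agree()}")
```

The parts do sum to 12,684, which indicates that the stated total covers the animal diagnosis, human diagnosis, and outbreak prediction data together.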

See Details

cc-by-4.0

Childhood Malnutrition in Chile

Chile

Maria Paz Hermosilla: paz.hermosilla@uai.cl

This data repository will evaluate factors that contribute to child malnutrition in Chile and children's nutritional status, as well as the associated costs. The focus at this stage is on estimating health costs associated with child malnutrition and identifying biopsychosocial determinants that lead to it.

See Details

cc-by-4.0

Lacuna Malaria Datasets

Uganda, Ghana

Rose Nakasi: g.nakasi.rose@gmail.com or rose.nakasi@mak.ac.ug

This dataset will aid in the diagnosis of malaria. The dataset contains annotated images of blood samples collected in Uganda and Ghana with objects of interest, including parasites and white blood cells. It significantly increases the number of available microscopy images, including metadata, by 6,000 thick blood slides and 2,000 thin blood slides for use in object detection research and other areas of inquiry.

See Details

cc-by-4.0

Intraoperative Anesthesia and Outcomes Dataset: Improving patient outcomes by predicting risk of mortality and post-operative recovery

Sub-Saharan Africa

Bhiken Naik: bin4n@uvahealth.org

This dataset can be used to identify patterns of intraoperative anesthesia practice and predict postoperative length of stay and risk of mortality based on intraoperative variables. It includes 2,066 intraoperative anesthesia records from two academic centers in sub-Saharan Africa. The team photographed completed intraoperative anesthesia records using a smartphone, de-identified the images, and securely uploaded them to a HIPAA-compliant server. Using a combination of computer vision AI and manual extraction techniques, the team collected comprehensive intraoperative data: demographic data, medication data, hemodynamic data, physiological data, anesthesia type, surgery type, postoperative length of stay, and 30-day postoperative mortality.

See Details

cc-by-4.0

Brain Tumor Segmentation Africa (BraTS-Africa) Dataset

Nigeria

Udunna Anazodo: udunna.anazodo@mcgill.ca

The BraTS-Africa dataset is an aggregation of magnetic resonance imaging (MRI) scans from six centers in Nigeria aimed at providing a public dataset for the development of machine-learning solutions for the management of brain tumors in African patients. This dataset serves as a starting framework for future expansion in other regions of Africa. The team processed and annotated a total of 584 images from 146 patient scans. Ninety-five of these scans are presumed to have diffuse glioma, and 51 of them have other types of central nervous system (CNS) neoplasms. Expert radiologists annotated three distinct tumor sub-regions to delineate the enhancing tumor (ET), the necrotic tumor core (NCR), and the peritumoral oedematous/infiltrated tissue (ED) sub-regions.

See Details

cc-by-4.0

AI-Assisted Smartphone Microscopy for detection of Diarrhea Parasites

Nepal

Bishesh Khanal: bishesh.khanal@naamii.org.np

This dataset helps to detect diarrhea-causing parasites in resource-limited rural areas, particularly across the Global South, where access to expensive diagnostic tools is limited. It contains approximately 400,000 microscopic slide images from water, vegetable, and stool samples from four different provinces across Nepal, making it one of the largest datasets of its kind. The team collected water samples from different sources (i.e., tap water, bottled water, lake, river, pond, stream, spring water, wetland, well, and borewell) and used seven different types of vegetables. Using the dataset and annotations available, this team trained different deep-learning models to automatically detect parasites, specifically Giardia and Cryptosporidium cysts.

See Details

cc-by-4.0

SDG 13

Project Climate Change, Health, and Artificial Intelligence (CCHAIN): Public Health Data Insights for the Philippines

Philippines

Thinking Machines Data Science | data-for-development@thinkingmachin.es

The Project CCHAIN dataset is an open, linked, analysis-ready dataset of validated health, climate, environmental, and socioeconomic variables collected at the village ("barangay") level in 12 Philippine cities spanning 20 years (2003-2022). This dataset includes observations of about 17 diseases collected through field visits to the Philippine Department of Health (DOH) and the Philippine Statistics Authority (PSA). Focusing on the barangay, the smallest administrative unit in the Philippines, also helps disaggregate health risks for vulnerable communities, particularly those in informal settlements, and provides actionable insights for local governments.

See Details

cc-by-4.0

SDG 13

Air quality dataset of abattoir centers in Southern Nigeria

Nigeria

Emmanuel Chukwuma | emmanuel.chukwuma@apse-ngo.org

This air quality dataset is the first of its kind in the country from abattoir centers. The localized dataset is crucial for air quality monitoring and prediction, as well as for accurate modeling of the air quality index for early warning signals and modeling of health risk. The data was obtained from abattoir centers in Southern Nigeria. The team collected data from representative samples of various states (i.e., Anambra, Enugu, Abia, Imo, Ebonyi, and Delta) within the research area. The team visited 27 stations and conducted on-site investigations, collecting over 200,000 numerical values of particulate matter (PM) concentrations using 10 air quality sensors for PM1, PM2.5, and PM10. Additionally, aerial view images were captured using a drone at varying heights (10 m, 20 m, 30 m) during operational hours; the images will be used together with satellite imagery to train models for the prediction of PM values. Previous studies have shown that exposure to particulate matter and black carbon released in abattoirs has detrimental health outcomes, with elevated morbidity and mortality. This project was undertaken by the Alliance for Progressive and Sustainable Environment (APSE), a local NGO focused on environmental sustainability (see more details here: www.apse-ngo.org).

See Details

cc-by-4.0

SDG 13

Global Horizontal Irradiance Dataset for Mauritius, Rodrigues, and Agalega Islands

Mauritius, Rodrigues, and Agalega Islands

Not specified

This dataset includes 146,025 real-time solar irradiance data lines from different locations around Mauritius, Rodrigues, and Agaléga. The solar irradiance data (GHI in W/m²) spans 2017 to 2021, at an interval of one hour, and covers the hours of 07:00-18:00 each day. This dataset allows for the real-time visualization of the solar irradiance profile at the specified locations, helping with better assessment and planning of solar-generated power. The team is now collecting data (from 2023 on) at an interval of 15 minutes and plans to update this data repository to reflect that in the future. The targeted beneficiaries for this project include the Government of Mauritius, which has a goal of 60% electricity generation from renewable energy sources by the year 2030. Similarly, the Mauritius Renewable Energy Agency, which is tasked with ensuring the country's energy demand is increasingly met by renewable energy and keeping up with international commitments, can use this data on solar irradiance and forecasting mechanisms to better manage the utility's power plants, minimize carbon emissions, ensure no loss of load (blackouts), and allow higher penetration of photovoltaic (PV) projects in the country. With free online solar maps and accuracy-enhanced solar energy data, local PV plant operators will also have accurate information for PV performance appraisal. Additionally, the public at large can benefit from a free online solar energy platform, improving acceptance of solar PV technology and increasing penetration of clean technologies in the country to further reduce greenhouse gas emissions. Finally, machine learning models can be trained for intra-day, daily, and even weekly predictions of the solar irradiance profiles.

See Details

cc-by-4.0

SDG 13

Labelled Open Solar Panel Data to measure solar energy adoption in Madagascar

Madagascar

Fabienne Rafidiharinirina | f.rafidiharinirina@association-maidi.mg or assomaidi@gmail.com

This team annotated 2,125 Google Earth satellite images and 9,202 drone images, forming a combination of low and high-definition solar panel views in Madagascar. The Madagascar Initiatives for Digital Innovation (MAIDI) team performed field checks for up to 25% of satellite images and, in total, annotated 22,488 polygons. This dataset will help data scientists and users develop a solar panel detection algorithm to measure solar energy adoption across Madagascar. Notably, this project represented all regions of the country; instead of focusing only on big cities, it also covered average and small villages as well as coasts and mountains.

See Details

cc-by-4.0

SDG 13

Climate Energy Dataset for Off-Grid Electricity Infrastructure

Pakistan

Dr. Zeeshan Shafiq | zeeshanshafiq@uetpeshawar.edu.pk

This dataset comprises real-time electrical measurements of a specific climate zone in Pakistan, the Kalam Region, showcasing the energy generation and demand within an off-grid electricity infrastructure. It can be used for research in energy systems analysis, climate change studies, electrical engineering, and artificial intelligence applications. It includes voltages, currents, and power factors for three-phase and single-phase systems across generation, distribution, and consumption stages. Additionally, the dataset incorporates seven different climate parameters from the ERA5 dataset (provided by the Copernicus Climate Change Service), generating a total of 85,596 data points in areas such as temperature, dew point, wind components, precipitation, snowfall, and snow cover. Collected every five minutes from June 3, 2023, to October 24, 2024, it includes over 45 million instances covering data from four micro-hydropower generators, 26 transformers (in addition to four data acquisition systems installed at micro-hydropower plants (MHPs)), and 585 end users. With local support, the team will continue monitoring the data until June 2025.

See Details

cc-by-4.0

SDG 4

A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Nigeria

Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, and Pavel Brazdil

This dataset is the first large-scale human-annotated Twitter sentiment dataset for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, the four most widely spoken languages in Nigeria.

See Details

cc-by-4.0

SDG 4

Machine Translation Benchmark Dataset for Languages in the Horn of Africa

Horn of Africa (Ethiopia, Eritrea)

Asmelash Teka Hadgu, Gebrekirstos G. Gebremeskel, Abel Aregawi

This evaluation dataset automatically quantifies the quality of machine translation systems for Afar, Amharic, Oromo, Somali and Tigrinya.

See Details

cc-by-4.0

SDG 4

Kencorpus: Kenyan Languages Corpus for Machine Learning and Natural Language Processing

Kenya

Owen McOnyango (Maseno University), Florence Indede (Maseno University), Lilian D.A. Wanzare (Maseno University), Barack Wanjawa (University of Nairobi), Edward Ombui (Africa Nazarene University), Lawrence Muchemi (University of Nairobi)

This project collected text and speech corpora for three languages in Kenya: Kiswahili, Dholuo, and three Luhya dialects (Lumarachi, Logooli and Lubukusu). In total, it collected 4,442 texts and 1,152 files of spontaneous speech data totaling 176 hours.

See Details

cc-by-4.0

SDG 4

KenPos: Kenyan Languages Part of Speech Tagged dataset

Kenya

Florence Indede (Maseno University), Owen McOnyango (Maseno University), Lilian D.A. Wanzare (Maseno University), Barack Wanjawa (University of Nairobi), Edward Ombui (Africa Nazarene University), Lawrence Muchemi (University of Nairobi)

This project developed a part-of-speech (POS) tagged dataset for two languages in Kenya: Dholuo and three Luhya dialects (Lumarachi, Lulogooli, and Lubukusu). The project tagged approximately 143,000 words.

See Details

cc-by-4.0

SDG 4

KenSpeech: Swahili Speech Transcriptions

Kenya

Dorcas Awino (University of Nairobi), Lawrence Muchemi (University of Nairobi), Lilian D.A. Wanzare (Maseno University), Edward Ombui (Africa Nazarene University), Barack Wanjawa (Maseno University), Owen McOnyango (Maseno University), Florence Indede (Maseno University)

This project produced a speech dataset that includes both read and spontaneous speech recordings, recorded in Kenya with native Swahili speakers, and corresponding transcripts. In total, the dataset includes 27 hours, 31 minutes, 50 seconds of speech data from 26 speakers.

See Details

cc-by-4.0

SDG 4

KenTrans: A Parallel Corpora for Swahili and local Kenyan Languages

Kenya

Lilian D.A. Wanzare (Maseno University), Florence Indede (Maseno University), Owen McOnyango (Maseno University), Edward Ombui (Africa Nazarene University), Barack Wanjawa (University of Nairobi), Lawrence Muchemi (University of Nairobi)

This project produced a parallel corpus between Swahili and two other Kenyan languages: Dholuo and three Luhya dialects (Lumarachi, Logooli and Lubukusu). A total of about 12,400 sentences were translated to Kiswahili.

See Details

cc-by-4.0

SDG 4

KenSwQuAD – A Question Answering Dataset for Swahili Low Resource Language

Kenya

Barack Wanjawa (University of Nairobi), Lilian D.A. Wanzare (Maseno University), Florence Indede (Maseno University), Owen McOnyango (Maseno University), Lawrence Muchemi (University of Nairobi), Edward Ombui (Africa Nazarene University)

This project produced a large Machine Reading Comprehension dataset for the Kiswahili Language. A total of 7,526 Question-Answer (QA) pairs were developed based on 1,445 Swahili story texts.

See Details

cc-by-4.0

SDG 4

MasakhaNER 2.0: Named Entity Recognition datasets for 20 African languages

Africa (Multi-country)

David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK

MasakhaNER 2.0 is the largest human-annotated named entity recognition dataset for 20 African languages. Each language has between 4,800 and 11,000 parallel sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa.

See Details

cc-by-4.0

SDG 4

MAFAND-MT: Masakhane Anglo & Franco African News Corpus for Machine Translation

Africa (Multi-country)

David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK

The MAFAND-MT dataset comprises a few thousand high-quality, human-translated parallel sentences for 16 African languages in the news domain. Each language has between 1,466 and 7,838 parallel sentences for training and/or evaluation.

See Details

cc-by-4.0

SDG 4

MasakhaPOS: Part-of-Speech Tagging Dataset for 20 African Languages

Africa (Multi-country)

David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK

MasakhaPOS is the largest human-annotated part-of-speech tagging dataset for 20 African languages. Each language has between 1,200 and 1,500 sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa.

See Details

cc-by-4.0

SDG 4

Financial Inclusion Speech Dataset for some Ghanaian Languages

Ghana

Dennis Asamoah Owusu, DOWUSU@ASHESI.EDU.GH

This speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga includes 104,000 utterances (speech) across the four dialects/languages with approximately 200 speakers per dialect/language. This amounts to about 148 hours of speech in total.

See Details

cc-by-4.0

SDG 4

IgboSynCorp: Dataset for Igbo Natural Language Processing Tasks

Nigeria

Gerald Nweya

This dataset is the first spoken corpus of labelled and unlabeled datasets for Igbo Natural Language Processing (NLP) tasks. It consists of approximately 40 hours of naturally occurring Igbo speech that is representative of all the dialects of Igbo.

See Details

cc-by-4.0

SDG 4

Bayelemabaga Aligned Bambara-French Corpus for Machine Translation

Mali/France

Christopher Homan, christopher.m.homan.phd@gmail.com

The Bayelemabaga dataset consists of 46,976 parallel machine translation-ready Bambara-French sentence pairs, originating from the Bambara Reference Corpus from INALCO's LLACAN Lab. The text is extracted from 264 text files.

See Details

cc-by-4.0

SDG 4

Makerere University NLP Datasets

Uganda, Tanzania, Kenya

Andrew Katumba | andrew.katumba@mak.ac.ug

Makerere University has created text and speech datasets for low-resourced East African Languages in Uganda, Tanzania, and Kenya. This dataset contains 10,000 parallel sentiment-tagged sentences, 100,000 Kiswahili sentences, 100,000 Luganda sentences, 40,037 Acoli sentences, and 39,999 Lumasaaba sentences.

See Details

cc-by-4.0

SDG 4

BIG-C: A Multimodal Multi-Purpose Dataset for Bemba

Zambia

Claytone Sikasote | claytonsikasote@gmail.com

The BIG-C (Bemba Image Grounded Conversations) dataset is comprised of multi-turn dialogues between Bemba speakers grounded on images, transcribed and translated to English. There are over 92,000 sentences, amounting to over 180 hours of speech data with corresponding Bemba transcriptions and English translations.

See Details

cc-by-4.0

SDG 4

KALLAAMA

Senegal

Aminata Ndiaye | amina.ndiaye@jokalante.com and Elodie Gauthier | elodie.gauthier@orange.com

This dataset will strengthen natural language processing resources for Wolof, Pulaar, and Serer, the three most widely spoken languages in Senegal. This dataset's repository of transcribed speech includes over 55 hours in Wolof, 38 hours in Serer, and 31 hours in Pulaar.

See Details

cc-by-4.0

SDG 4

NaijaVoices: Our Language is Our Strength

Nigeria

info@naijavoices.com

The NaijaVoices project has curated 1,867 hours of speech and text data featuring over 5,000 speakers in the three major Nigerian languages — Hausa, Igbo, and Yoruba. As of its release, it is the largest ever multi-speaker African speech dataset. The dataset consists of circa 1,917,686 instances.

See Details

cc-by-4.0

SDG 4

AFRIDOC-MT: Document-level MT Corpus for African Languages

Multiple African Countries

Jesujoba O. Alabi | jalabi@lsv.uni-saarland.de

AFRIDOC-MT is a document-level and multi-way translation dataset from English into five African languages — Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all of which were human-translated from English to these languages. Each domain has at least 10,000 parallel sentences per language pair and supports multiway translation, allowing translation not only between English and the African languages but also among the African languages themselves. This dataset can be used to evaluate the ability of existing neural machine translation (NMT) models and large language models (LLMs) to translate at the document level.

See Details

cc-by-4.0

SDG 4

Masakhane-NLU: Conversational AI & Benchmark datasets for African languages

Multiple African Countries

David Adelani | david.adelani@mila.quebec

This team has developed five conversational AI and benchmark datasets for 16 languages across the African continent: AfriXNLI (natural language inference), AfriMMLU (knowledge-based multi-choice questions), AfriMGSM (grade school mathematics), AfriIntent (intent classification), and AfriSlot (slot classification). These datasets cover languages including Amharic, Ewe, Hausa, Igbo, Lingala, Luganda, Oromo, Kinyarwanda, Shona, Sesotho, Swahili, Twi, Wolof, Xhosa, Yoruba and Zulu. These text-only datasets are useful for conversational chatbots in real-life applications such as banking, restaurants, travel agencies, and more.

See Details

cc-by-4.0

SDG 4

Lacuna PII Multilingual Dataset

Multiple African Countries

Andrew Katumba | katumba@mak.ac.ug, Milena Haykowska | milena.haykowska@clearglobal.org, Peter Nabende | nabende@gmail.com

This dataset contains annotated sentences with personally identifiable information (PII) in Luganda, Lumasaba, Hausa, and Kanuri. These four languages span Central and Eastern Uganda, Nigeria, Ghana, and Northern Cameroon. The team collected 3,000 sentences for both Kanuri and Hausa, 5,000 for Lumasaba, and 4,000 for Luganda. Potential use cases include named entity recognition (NER), text classification, privacy-preserving data analysis and research, language modeling, machine translation, and linguistic research. The team aimed to curate a dataset that is gender inclusive, and their work highlighted the need for standardized guidelines for annotating low-resourced languages.

See Details

cc-by-4.0

SDG 4

Building Parallel Corpora for Kenya's Indigenous Languages and Kiswahili

Multiple African Countries

Audrey Mbogho | ambogho@usiu.ac.ke

This dataset was created to enable translation from Kiswahili, Kenya's national language, into three indigenous languages: Kidaw'ida, Kalenjin, and Dholuo. All three are low-resource languages, with Kidaw'ida having only around 400,000 speakers and being at immediate risk of loss. The dataset consists of three parallel corpora: Kidaw'ida-Kiswahili, Kalenjin-Kiswahili, and Dholuo-Kiswahili, with each corpus averaging thirty thousand sentence pairs. The project has also contributed 120 hours in Dholuo, 92 hours in Kalenjin, and 56 hours in Kidaw'ida to the Mozilla Common Voice platform for crowd-sourced voice recognition.

See Details

cc-by-4.0

SDG 4

Expanding a parallel corpus of Portuguese and the Bantu language Emakhuwa

Multiple African Countries

Felermino D. M. A. Ali | felermino.ali@unilurio.ac.mz or felerminoali@gmail.com

This dataset includes the translation of 1,897 news articles comprising 660,242 words from Portuguese to Emakhuwa, an indigenous language of Mozambique. Each article includes a news headline, content, and topic classification labels. Articles are divided into training (1,337), development (185), and testing (375) sets and are categorized by topic: politics, economy, culture, sports, health, society, and world news. Use cases include topic classification, translation, and loanword recognition. The dataset has shown promising outcomes when fine-tuning multilingual models like ByT5, M2M100, and NLLB200, with improvements in translation quality using loanword information.

See Details

cc-by-4.0

