Resources & Tools
Below is a collection of datasets, tools, software, and guides that I have developed or contributed to, all publicly available for the research community.
Quick Navigation
Jump to any section:
- Datasets — Research datasets for African languages
- Tools & Software — NLP tools and platforms
- Writing NLP Papers — Guides and templates
- Other Resources — Tutorials, and more
Datasets
BRIGHTER: Multilingual Emotion Recognition Dataset
- Description: Human-annotated textual emotion recognition datasets for 28 languages
- Languages: 28 languages including African, Asian, and European languages
- Task: Emotion classification
- Paper: ACL 2025 (Best Resource Paper Award)
- Access: Website • HuggingFace
AfriSenti: African Sentiment Analysis Dataset
- Description: Twitter sentiment analysis benchmark for African languages
- Languages: 14 African languages (Hausa, Yoruba, Igbo, Amharic, Swahili, and more)
- Task: Sentiment classification (positive, negative, neutral)
- Paper: EMNLP 2023 (Best Non-archival Paper Award)
- Access: GitHub • HuggingFace • Leaderboard
NaijaSenti: Nigerian Sentiment Analysis Dataset
- Description: Sentiment analysis dataset specifically for Nigerian languages and Nigerian Pidgin
- Languages: Hausa, Yoruba, Igbo, Nigerian Pidgin
- Task: Sentiment classification (positive, negative, neutral)
- Paper: AfricaNLP Workshop
- Access: GitHub • HuggingFace
AfriHate: Hate Speech & Abusive Language Dataset
- Description: Multilingual collection of hate speech and abusive language datasets for African languages
- Recognition: IRCAI Global Top 100 AI Projects Award 2025
- Languages: Multiple African languages
- Task: Hate speech detection, abusive language classification
- Paper: NAACL 2025
- Access: GitHub • HuggingFace
MasakhaNER: Named Entity Recognition Dataset
- Description: Named entity recognition datasets for 21 African languages
- Languages: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, Yoruba, and more
- Task: Named entity recognition (PER, LOC, ORG, DATE)
- Paper: TACL 2021
- Access: GitHub • HuggingFace
BLEnD: Everyday Knowledge Dataset
- Description: Everyday knowledge in diverse cultures and languages for evaluating cultural understanding in LLMs
- Languages: Multiple languages representing diverse cultures
- Task: Cultural knowledge understanding and reasoning
- Paper: NeurIPS 2024 (Best Non-archival Paper Award)
- Access: GitHub • HuggingFace
SemRel: Semantic Textual Relatedness Dataset
- Description: Semantic textual relatedness datasets for African and Asian languages
- Languages: 14 languages from Africa and Asia
- Task: Semantic similarity and relatedness measurement
- Paper: SemEval 2024 (Honourable Mention)
- Access: GitHub • arXiv
AFRIDOC-MT: African Document-level Machine Translation
- Description: Document-level machine translation dataset for African languages with document context
- Languages: Multiple African languages
- Task: Document-level machine translation
- Paper: EMNLP 2025
- Access: HuggingFace
INJONGO: Multimodal Dataset for African Languages
- Description: Multimodal benchmark evaluating vision-language models on African cultural contexts
- Languages: Multiple African languages
- Task: Vision-language understanding and reasoning
- Paper: ACL 2025
- Access: GitHub
POLAR: Persuasive & Offensive Language in African Regional Languages
- Description: Dataset for detecting persuasive and offensive language in African regional contexts
- Languages: African regional languages
- Task: Persuasive language detection, offensive language identification
- Paper: arXiv
- Access: GitHub
UHURA: Unified Dataset for African Languages
- Description: Comprehensive unified dataset for multiple NLP tasks in African languages
- Languages: Multiple African languages
- Tasks: Multiple NLP tasks
- Paper: arXiv
- Access: GitHub
AfroXLMR-Social: African Social Media Dataset
- Description: Social media dataset for African languages covering diverse social contexts
- Languages: Multiple African languages
- Task: Social media text processing and understanding
- Access: HuggingFace
Who Wrote This: Hausa Authorship Attribution Dataset
- Description: Dataset for authorship attribution and text classification in Hausa
- Language: Hausa
- Task: Authorship attribution, text classification
- Paper: arXiv
- Access: GitHub
HausaHate: Hausa Hate Speech Dataset
- Description: Hate speech detection dataset specifically for Hausa language
- Language: Hausa
- Task: Hate speech detection and classification
- Paper: WOAH 2024
- Access: GitHub
AfriGSM: African Grade School Math Dataset
- Description: Grade school mathematics reasoning dataset for African languages
- Languages: Multiple African languages
- Task: Mathematical reasoning and problem solving
- Access: HuggingFace
AfriMMLU: African Massive Multitask Language Understanding
- Description: Massive multitask language understanding benchmark for African languages
- Languages: Multiple African languages
- Tasks: Multiple choice question answering across various domains
- Access: HuggingFace
AfriXNLI: African Cross-lingual Natural Language Inference
- Description: Cross-lingual natural language inference dataset for African languages
- Languages: Multiple African languages
- Task: Natural language inference (entailment, contradiction, neutral)
- Access: HuggingFace
AfriCOMET: African Machine Translation Evaluation Metric
- Description: Evaluation metric specifically designed for assessing machine translation quality in African languages
- Languages: Multiple African languages
- Task: Machine translation evaluation
- Access: GitHub
MasakhaNEWS: African Language News Classification
- Description: News topic classification dataset for African languages
- Languages: Multiple African languages
- Task: News topic classification
- Access: HuggingFace
MasakhaPOS: African Language Part-of-Speech Tagging
- Description: Part-of-speech tagging dataset for African languages
- Languages: Multiple African languages
- Task: Part-of-speech tagging
- Access: HuggingFace
HaVQA: Hausa Visual Question Answering
- Description: Visual question answering dataset for Hausa language
- Language: Hausa
- Task: Visual question answering
- Access: GitHub
Hausa Visual Genome
- Description: Visual genome dataset with Hausa language annotations for image understanding
- Language: Hausa
- Task: Image captioning, visual understanding
- Access: HuggingFace
HERDPhobia: Hate Speech and Extreme Religious Discourse Dataset
- Description: Dataset focusing on hate speech and extreme religious discourse in Hausa
- Language: Hausa
- Task: Hate speech detection, religious discourse analysis
- Access: GitHub
MAFAND-MT: Machine Translation for African Languages
- Description: Machine translation dataset for African languages
- Languages: Multiple African languages
- Task: Machine translation
- Access: GitHub
FLORES Fix for Africa
- Description: Corrected and improved version of FLORES evaluation dataset for African languages
- Languages: African languages in FLORES
- Task: Machine translation evaluation
- Access: GitHub
Nigerian Speech Corpus
- Description: Large-scale speech corpus for major Nigerian languages
- Funding: Lacuna Fund ($120,000)
- Languages: Hausa, Yoruba, Igbo, and Nigerian Pidgin
- Tasks: Automatic speech recognition, text-to-speech synthesis
- Status: In development
- Access: Project Website
Tools & Software
Hausa Dictionary (Kamusun Hausa)
- Description: An online lexicon providing standardized Hausa definitions, examples, and usage
- Features: Comprehensive Hausa-English dictionary, example sentences, pronunciation guide, standardized definitions
- Access: https://kamusunhausa.hausanlp.org/
Hausa Catalogue
- Description: A centralized repository indexing datasets, benchmarks, and resources for Hausa NLP research
- Features: Searchable database of datasets, research papers, models, and tools for Hausa language
- Access: https://catalog.hausanlp.org
Annotate Platform
- Description: A lightweight web-based annotation platform for text classification, translation, and sequence labeling
- Features: Multi-task annotation support, collaborative annotation, quality control, export to standard formats
- Use Cases: Text classification, machine translation, named entity recognition, sentiment labeling
- Access: https://annotate.hausanlp.org
NLPQuiz
- Description: Interactive quiz platform for learning and testing NLP concepts
- Features: Educational quizzes, progress tracking, concept reinforcement
- Access: https://nlpquiz.hausanlp.org/auth/login
Speech Annotation Bot
- Description: A Telegram-based tool for collecting speech data in low-connectivity environments; optimized for mobile users
- Features: Mobile-first design, works in low-bandwidth areas, Telegram integration, easy audio collection
- Use Cases: Speech corpus collection, crowdsourced pronunciation data, dialect documentation
- Platform: Telegram Bot
- Languages: Hausa and other low-resource languages
Writing NLP Papers
Paper Templates & Examples
- ACL Conference Templates:
- Example Papers from My Work:
- Dataset Paper Example — BRIGHTER (ACL 2025)
- Shared Task Paper Example — SemEval Task 11
- Benchmark Paper Example — IrokoBench (NAACL 2025)
- System Description Example — AfriSenti Shared Task
Writing Guides & Best Practices
- Essential Reading:
- My Personal Recommendations:
- Start with a clear research question and contribution statement
- Use active voice and concise language
- Include comprehensive error analysis and ablation studies
- Make your code and data publicly available
- Get feedback early and often from colleagues
- Address limitations and ethical considerations explicitly
LaTeX Tips & Resources
- Useful LaTeX Packages:
-
booktabs— Professional-looking tables -
algorithm2e— Algorithm formatting -
tikz— Diagrams and figures -
natbiborbiblatex— Bibliography management -
hyperref— Cross-references and URLs
-
- Helpful Tools:
- Overleaf — Online collaborative LaTeX editor
- Tables Generator — Quick table creation
- Detexify — Find LaTeX symbols by drawing
Understanding the Peer Review Process
- Typical Timeline:
- Submission → Review assignment (2-3 weeks)
- Review period (6-8 weeks)
- Author response period (1 week)
- Final decision (2-4 weeks after response)
- Tips for Strong Author Responses:
- Be respectful and professional in all responses
- Address each reviewer concern point-by-point
- Provide specific line numbers for changes made
- Acknowledge valid criticisms graciously
- Clarify misunderstandings politely with evidence
- Additional Resources:
Tipsa and Resources
Tutorials & Workshop Materials
- Video Tutorials:
Community & Collaboration
- Research Organizations:
- HausaNLP — Hausa language NLP research group
- Arewa Data Science Academy — Free AI/ML training for underserved students
- Masakhane — Grassroots African NLP community
- Tips for Successful Grant Applications:
- Clearly articulate the problem and potential impact
- Demonstrate strong community involvement and partnerships
- Include realistic budgets with detailed justifications
- Highlight sustainability and scalability plans
- Show evidence of preliminary work or pilot studies
Acknowledgments
These resources were made possible through awosome collaborators and generous funding support from:
- Google (DeepMind Academic Fellowship)
- Lacuna Fund
- Oracle Cloud Infrastructure (OCI)
- Nigerian TETFund
- Wikimedia Foundation
- University of Porto
- Imperial College London
- And countless volunteers and community members who contributed their time and expertise
Last updated: December 2025