Resources & Tools

Below is a collection of datasets, tools, software, and guides that I have developed or contributed to, all publicly available for the research community.



Datasets

BRIGHTER: Multilingual Emotion Recognition Dataset

AfriSenti: African Sentiment Analysis Dataset

NaijaSenti: Nigerian Sentiment Analysis Dataset

  • Description: Sentiment analysis dataset specifically for Nigerian languages and Nigerian Pidgin
  • Languages: Hausa, Yoruba, Igbo, Nigerian Pidgin
  • Task: Sentiment classification (positive, negative, neutral)
  • Paper: AfricaNLP Workshop
  • Access: GitHub, HuggingFace

AfriHate: Hate Speech & Abusive Language Dataset

  • Description: Multilingual collection of hate speech and abusive language datasets for African languages
  • Recognition: IRCAI Global Top 100 AI Projects Award 2025
  • Languages: Multiple African languages
  • Task: Hate speech detection, abusive language classification
  • Paper: NAACL 2025
  • Access: GitHub, HuggingFace

MasakhaNER: Named Entity Recognition Dataset

  • Description: Named entity recognition datasets for 21 African languages
  • Languages: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, Yoruba, and more
  • Task: Named entity recognition (PER, LOC, ORG, DATE)
  • Paper: TACL 2021
  • Access: GitHub, HuggingFace
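
To illustrate the entity types listed above, here is a minimal sketch of turning token-level tags into entity spans. It assumes the common BIO tagging scheme (B-PER, I-PER, O, and so on); the function name `bio_to_spans` and the example sentence are hypothetical and are not taken from the dataset or its official tooling.

```python
from typing import List, Tuple

# Entity types named in the MasakhaNER task description above.
ENTITY_TYPES = {"PER", "LOC", "ORG", "DATE"}


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs.

    Hypothetical helper: assumes tags look like "B-PER", "I-PER", or "O".
    Stray "I-" tags that do not continue an open entity are ignored.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label in ENTITY_TYPES:
            flush()  # a new entity starts; close the previous one
            current_type, current_tokens = label, [token]
        elif prefix == "I" and label == current_type:
            current_tokens.append(token)  # continue the open entity
        else:
            flush()  # "O" or an inconsistent tag ends the entity
    flush()
    return spans


# Made-up Hausa example sentence, for illustration only.
tokens = ["Aisha", "Bello", "ta", "ziyarci", "Kano", "ranar", "Litinin"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "B-DATE"]
print(bio_to_spans(tokens, tags))
```

Check the dataset card for the actual tag inventory and encoding before relying on this; some releases store tags as integer IDs that must first be mapped to label strings.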

BLEnD: Everyday Knowledge Dataset

  • Description: Everyday knowledge in diverse cultures and languages for evaluating cultural understanding in LLMs
  • Languages: Multiple languages representing diverse cultures
  • Task: Cultural knowledge understanding and reasoning
  • Paper: NeurIPS 2024 (Best Non-archival Paper Award)
  • Access: GitHub, HuggingFace

SemRel: Semantic Textual Relatedness Dataset

  • Description: Semantic textual relatedness datasets for African and Asian languages
  • Languages: 14 languages from Africa and Asia
  • Task: Semantic similarity and relatedness measurement
  • Paper: SemEval 2024 (Honourable Mention)
  • Access: GitHub, arXiv

AFRIDOC-MT: African Document-level Machine Translation

  • Description: Document-level machine translation dataset for African languages with document context
  • Languages: Multiple African languages
  • Task: Document-level machine translation
  • Paper: EMNLP 2025
  • Access: HuggingFace

INJONGO: Multimodal Dataset for African Languages

  • Description: Multimodal benchmark evaluating vision-language models on African cultural contexts
  • Languages: Multiple African languages
  • Task: Vision-language understanding and reasoning
  • Paper: ACL 2025
  • Access: GitHub

POLAR: Persuasive & Offensive Language in African Regional Languages

  • Description: Dataset for detecting persuasive and offensive language in African regional contexts
  • Languages: African regional languages
  • Task: Persuasive language detection, offensive language identification
  • Paper: arXiv
  • Access: GitHub

UHURA: Unified Dataset for African Languages

  • Description: Comprehensive unified dataset for multiple NLP tasks in African languages
  • Languages: Multiple African languages
  • Tasks: Multiple NLP tasks
  • Paper: arXiv
  • Access: GitHub

AfroXLMR-Social: African Social Media Dataset

  • Description: Social media dataset for African languages covering diverse social contexts
  • Languages: Multiple African languages
  • Task: Social media text processing and understanding
  • Access: HuggingFace

Who Wrote This: Hausa Authorship Attribution Dataset

  • Description: Dataset for authorship attribution and text classification in Hausa
  • Language: Hausa
  • Task: Authorship attribution, text classification
  • Paper: arXiv
  • Access: GitHub

HausaHate: Hausa Hate Speech Dataset

  • Description: Hate speech detection dataset for the Hausa language
  • Language: Hausa
  • Task: Hate speech detection and classification
  • Paper: WOAH 2024
  • Access: GitHub

AfriGSM: African Grade School Math Dataset

  • Description: Grade school mathematics reasoning dataset for African languages
  • Languages: Multiple African languages
  • Task: Mathematical reasoning and problem solving
  • Access: HuggingFace

AfriMMLU: African Massive Multitask Language Understanding

  • Description: Massive multitask language understanding benchmark for African languages
  • Languages: Multiple African languages
  • Tasks: Multiple choice question answering across various domains
  • Access: HuggingFace

AfriXNLI: African Cross-lingual Natural Language Inference

  • Description: Cross-lingual natural language inference dataset for African languages
  • Languages: Multiple African languages
  • Task: Natural language inference (entailment, contradiction, neutral)
  • Access: HuggingFace

AfriCOMET: African Machine Translation Evaluation Metric

  • Description: Evaluation metric specifically designed for assessing machine translation quality in African languages
  • Languages: Multiple African languages
  • Task: Machine translation evaluation
  • Access: GitHub

MasakhaNEWS: African Language News Classification

  • Description: News topic classification dataset for African languages
  • Languages: Multiple African languages
  • Task: News topic classification
  • Access: HuggingFace

MasakhaPOS: African Language Part-of-Speech Tagging

  • Description: Part-of-speech tagging dataset for African languages
  • Languages: Multiple African languages
  • Task: Part-of-speech tagging
  • Access: HuggingFace

HaVQA: Hausa Visual Question Answering

  • Description: Visual question answering dataset for the Hausa language
  • Language: Hausa
  • Task: Visual question answering
  • Access: GitHub

Hausa Visual Genome

  • Description: Visual genome dataset with Hausa language annotations for image understanding
  • Language: Hausa
  • Task: Image captioning, visual understanding
  • Access: HuggingFace

HERDPhobia: Hate Speech and Extreme Religious Discourse Dataset

  • Description: Dataset focusing on hate speech and extreme religious discourse in Hausa
  • Language: Hausa
  • Task: Hate speech detection, religious discourse analysis
  • Access: GitHub

MAFAND-MT: Machine Translation for African Languages

  • Description: Machine translation dataset for African languages
  • Languages: Multiple African languages
  • Task: Machine translation
  • Access: GitHub

FLORES Fix for Africa

  • Description: Corrected and improved version of FLORES evaluation dataset for African languages
  • Languages: African languages in FLORES
  • Task: Machine translation evaluation
  • Access: GitHub

Nigerian Speech Corpus

  • Description: Large-scale speech corpus for major Nigerian languages
  • Funding: Lacuna Fund ($120,000)
  • Languages: Hausa, Yoruba, Igbo, and Nigerian Pidgin
  • Tasks: Automatic speech recognition, text-to-speech synthesis
  • Status: In development
  • Access: Project Website

Tools & Software

Hausa Dictionary (Kamusun Hausa)

  • Description: An online lexicon providing standardized Hausa definitions, examples, and usage
  • Features: Comprehensive Hausa-English dictionary, example sentences, pronunciation guide, standardized definitions
  • Access: https://kamusunhausa.hausanlp.org/

Hausa Catalogue

  • Description: A centralized repository indexing datasets, benchmarks, and resources for Hausa NLP research
  • Features: Searchable database of datasets, research papers, models, and tools for the Hausa language
  • Access: https://catalog.hausanlp.org

Annotate Platform

  • Description: A lightweight web-based annotation platform for text classification, translation, and sequence labeling
  • Features: Multi-task annotation support, collaborative annotation, quality control, export to standard formats
  • Use Cases: Text classification, machine translation, named entity recognition, sentiment labeling
  • Access: https://annotate.hausanlp.org

NLPQuiz

  • Description: Interactive quiz platform for learning and testing NLP concepts
  • Features: Educational quizzes, progress tracking, concept reinforcement
  • Access: https://nlpquiz.hausanlp.org/auth/login

Speech Annotation Bot

  • Description: A Telegram-based tool for collecting speech data in low-connectivity environments; optimized for mobile users
  • Features: Mobile-first design, works in low-bandwidth areas, Telegram integration, easy audio collection
  • Use Cases: Speech corpus collection, crowdsourced pronunciation data, dialect documentation
  • Platform: Telegram Bot
  • Languages: Hausa and other low-resource languages

Writing NLP Papers

Paper Templates & Examples

Writing Guides & Best Practices

LaTeX Tips & Resources

  • Useful LaTeX Packages:
    • booktabs — Professional-looking tables
    • algorithm2e — Algorithm formatting
    • tikz — Diagrams and figures
    • natbib or biblatex — Bibliography management
    • hyperref — Cross-references and URLs
  • Helpful Tools:

Understanding the Peer Review Process


Tips and Resources

Tutorials & Workshop Materials

  • Video Tutorials:

Community & Collaboration

  • Research Organizations:
  • Tips for Successful Grant Applications:
    • Clearly articulate the problem and potential impact
    • Demonstrate strong community involvement and partnerships
    • Include realistic budgets with detailed justifications
    • Highlight sustainability and scalability plans
    • Show evidence of preliminary work or pilot studies

Acknowledgments

These resources were made possible through awesome collaborators and generous funding support from:

  • Google (DeepMind Academic Fellowship)
  • Lacuna Fund
  • Oracle Cloud Infrastructure (OCI)
  • Nigerian TETFund
  • Wikimedia Foundation
  • University of Porto
  • Imperial College London
  • And countless volunteers and community members who contributed their time and expertise

Last updated: December 2025