Corpus Creation and Analysis

Large-scale Kurdish text corpus development with linguistic annotation and analysis

Active
Started: June 2021
3 Team Members

Project Overview

Systematic creation and analysis of comprehensive Kurdish text corpora covering multiple domains and dialects. Our 50-million-word corpus is built with automated quality assessment and linguistic annotation, and serves as the foundation for numerous Kurdish NLP applications.
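
The page does not describe how the automated quality assessment works, so the following is only a minimal sketch of what such a step might look like, not the project's actual pipeline. It assumes three simple heuristics: a minimum word count, a minimum share of letter characters, and exact-duplicate removal after whitespace normalization. The function names (`letter_ratio`, `filter_documents`) and the thresholds are hypothetical and chosen for illustration.

```python
import hashlib
import unicodedata


def letter_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Unicode letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    letters = sum(1 for c in chars if unicodedata.category(c).startswith("L"))
    return letters / len(chars)


def filter_documents(docs, min_words=5, min_letter_ratio=0.8):
    """Keep documents that pass basic quality checks (assumed heuristics).

    A document is dropped if it is an exact duplicate of an earlier one,
    has fewer than `min_words` words, or has too few letter characters.
    """
    seen = set()
    kept = []
    for doc in docs:
        # Normalize Unicode form and collapse runs of whitespace.
        normalized = " ".join(unicodedata.normalize("NFC", doc).split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        if len(normalized.split()) < min_words:
            continue  # too short to be useful
        if letter_ratio(normalized) < min_letter_ratio:
            continue  # mostly digits, punctuation, or markup debris
        kept.append(normalized)
    return kept


if __name__ == "__main__":
    sample = [
        "Ev nivîseke kurdî ya sade ye",
        "Ev  nivîseke kurdî ya sade ye ",   # duplicate after normalization
        "12345 !!! ??? 678 ### 000",        # no letters
        "kurt",                             # too short
    ]
    print(filter_documents(sample))
```

Because the letter check relies on Unicode categories rather than a fixed alphabet, the same sketch applies to Kurdish written in either Latin or Arabic script. Texts that contain format characters such as the zero-width non-joiner would score slightly lower on the letter ratio, so the threshold would need tuning on real data.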

Technologies & Methods

  • Corpus Linguistics
  • Text Mining
  • Quality Assessment
  • Linguistic Annotation

Applications

  • Language Research
  • NLP Model Training
  • Linguistic Analysis

Related Publications

Large-Scale Kurdish Text Corpus Creation and Analysis

Dr. Zainab Hussein, Dr. Mohammad Ali, Dr. Sara Ahmed (2023)

Comprehensive methodology for creating and analyzing a 50-million-word Kurdish corpus covering multiple domains and dialects, with automated quality assessment and linguistic annotation.

Related Datasets

Kurdish Modern Text Corpus

Published: January 15, 2023 | Size: 1.8 GB

A large-scale corpus of modern Kurdish texts containing 50 million words from diverse sources including news articles, literature, academic papers, and web content. Includes linguistic annotations and …

Project Statistics

Publications: 1
Datasets: 3
Team Size: 3

Research Team

Funding

Kurdistan Academy of Sciences Research Grant