Dataset Description
A large-scale corpus of modern Kurdish texts containing 50 million words from diverse sources, including news articles, literature, academic papers, and web content. The corpus includes linguistic annotations and quality metadata.
Data Structure
Plain text files organized by source and dialect, XML annotations for POS tagging, and metadata files with source information.
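A minimal sketch of how files arranged this way might be read, using only the Python standard library. The directory layout (root/<source>/<dialect>/<name>.txt), the <w pos="..."> annotation element, and all function names are assumptions made for illustration; the corpus's actual layout and XML schema are not specified here and may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def index_text_files(root):
    """Group .txt files by (source, dialect).

    Assumes a hypothetical layout: root/<source>/<dialect>/<name>.txt
    """
    index = {}
    for path in sorted(Path(root).glob("*/*/*.txt")):
        source, dialect = path.parts[-3], path.parts[-2]
        index.setdefault((source, dialect), []).append(path)
    return index


def count_words(path):
    """Count whitespace-separated tokens in one file.

    This is a rough proxy and may not match the corpus's own word counts.
    """
    return len(Path(path).read_text(encoding="utf-8").split())


def pos_counts(xml_text):
    """Count POS tags in an annotation document.

    Assumes hypothetical <w pos="...">token</w> elements.
    """
    root = ET.fromstring(xml_text)
    return Counter(w.get("pos") for w in root.iter("w") if w.get("pos"))
```

Calling index_text_files on the corpus root would return a dictionary keyed by (source, dialect) pairs, which can then be passed file by file to count_words. The glob pattern and the element name would need to be adjusted to the real layout once it is known.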
Primary Publication
Large-Scale Kurdish Text Corpus Creation and Analysis
Dr. Zainab Hussein, Dr. Mohammad Ali, Dr. Sara Ahmed (2023)
A comprehensive methodology for creating and analyzing a 50-million-word Kurdish corpus that covers multiple domains and dialects, with automated quality assessment and linguistic annotation.
Related Publications
Deep Learning Approaches for Kurdish Optical Character Recognition
Dr. Ahmad Kurdish, Dr. Sara Ahmed, Dr. Fatima Hassan (2023)
This paper presents a comprehensive study on applying deep learning techniques to Kurdish OCR, addressing unique challenges in Kurdish script …
How to Cite
Omar, Z., Hassan, K., & Ali, A. (2023). Kurdish Modern Text Corpus v2.0. KaiLab Research Data Repository. https://doi.org/10.5281/kurd-modern-corpus.v2