Kurdish Speech Recognition Audio Corpus

Published: August 2, 2023
Size: 12.5 GB
License: CC BY-NC 4.0
Domain: Speech Recognition
Languages: Kurdish (Sorani), Kurdish (Kurmanji), Kurdish (Pehlewani)
Formats: WAV, TXT, JSON, TextGrid
Contributing Organizations:

Dataset Description

A large-scale audio corpus containing 1,000 hours of Kurdish speech from more than 500 speakers across different dialects, ages, and regions. It includes high-quality transcriptions and speaker metadata.

Data Structure

WAV audio files (16 kHz, mono), transcript files with timestamps, speaker metadata (demographics, dialect), and phonetic alignments.
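Because the audio is described as 16 kHz mono WAV, a quick header check after download can catch files that were resampled or corrupted in transit. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` and the constants are illustrative and not part of the corpus tooling, and the corpus's actual directory layout is not documented here, so the caller supplies the file path.

```python
import wave

# Values taken from the data structure description above.
EXPECTED_RATE = 16_000   # 16 kHz
EXPECTED_CHANNELS = 1    # mono


def check_wav(path):
    """Read a WAV header and compare it with the documented format.

    Returns (ok, sample_rate, channels, duration_seconds).
    `check_wav` is a hypothetical helper, not a function shipped with the corpus.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    # Guard against a malformed header that reports a zero sample rate.
    duration = frames / rate if rate else 0.0
    ok = rate == EXPECTED_RATE and channels == EXPECTED_CHANNELS
    return ok, rate, channels, duration
```

Summing the returned durations over all files gives a rough total that can be compared with the stated 1,000 hours.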

Primary Publication

Automatic Speech Recognition for Kurdish Dialects

Dr. Mohammad Ali, Dr. John Doe, Dr. Sara Ahmed (2023)

This research develops the first comprehensive ASR system for Kurdish, supporting both the Sorani and Kurmanji dialects with domain adaptation techniques. The system achieved 89.3% word accuracy on conversational …
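Word accuracy is commonly reported as one minus the word error rate (WER), where WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below assumes that definition; the abstract is truncated and does not say how the paper computes its figure, and the function names and the placeholder tokens in the usage note are illustrative only.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) over whitespace-split words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def word_accuracy(pairs):
    """Corpus-level word accuracy (1 - WER) over (reference, hypothesis) pairs."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += words
    if total_words == 0:
        raise ValueError("no reference words to score")
    return 1.0 - total_errors / total_words
```

For example, `word_accuracy([("a b c d", "a x c d")])` scores one substitution against four reference words. Note that with many insertions this measure can drop below zero, which is one reason some toolkits report WER instead.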

Related Publications

How to Cite

Kareem, H., Ahmed, R., & Jamal, S. (2023). Kurdish Speech Recognition Audio Corpus. KaiLab Research Data Repository. https://doi.org/10.5281/kurd-speech-corpus.v1

Dataset Information

Total Size: 12.5 GB
Languages: 3
Formats: 4
Related Papers: 1 publication

Data Access

Licensed under CC BY-NC 4.0