Kurdish lives across Turkey, Iraq, Iran and Syria — in many dialects, scripts and accents, with no single authority. KorpusaKurdî doesn't invent a new standard. It listens to real usage, tags it by region, and democratically surfaces the most common patterns — while preserving the rare ones.
Kurmanji, Sorani and Southern dialects use different scripts and grammar. With no central authority, neighbouring villages can disagree on a single word. Existing corpora are large but rarely fine-grained or geography-tagged.
Turkey, Iraq, Iran, Syria — each with different policies, scripts and exposure.
Kurmanji (Latin), Sorani (Arabic-based), Southern — plus countless local varieties.
No academy, no ministry — usage is decided by communities, not committees.
A descriptive — not prescriptive — workflow. The crowd contributes; the data speaks.
Type, correct or record short snippets — tagged by region, dialect and source.
Profile metadata + community votes + light expert moderation.
LLMs and classical NLP find common forms, variants and gaps across regions.
Open dashboards, an API, and a "living reference" of what people actually say.
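As a rough illustration of the first two steps, a tagged contribution and a vote-based validation check might look like the sketch below. Everything here is an assumption made for illustration: the `Contribution` fields, the `is_validated` helper and the vote thresholds are hypothetical and do not describe KorpusaKurdî's actual schema or moderation rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    """One typed, corrected or recorded snippet (hypothetical schema)."""
    text: str
    region: str    # e.g. a province or city label chosen by the contributor
    dialect: str   # e.g. "Kurmanji", "Sorani", "Southern"
    source: str    # e.g. "typed", "corrected", "recorded"
    upvotes: int = 0
    downvotes: int = 0


def is_validated(c: Contribution, min_votes: int = 3, min_ratio: float = 0.6) -> bool:
    """Accept a snippet once enough community votes agree with it.

    The thresholds are placeholders, not project policy.
    """
    total = c.upvotes + c.downvotes
    if total < min_votes:
        return False  # too few votes to decide; leave for moderators
    return c.upvotes / total >= min_ratio


# Placeholder text and region labels, not real corpus data.
snippet = Contribution("example sentence", "Region A", "Kurmanji", "typed", upvotes=4, downvotes=1)
pending = Contribution("example sentence", "Region B", "Sorani", "typed", upvotes=1)
```

A snippet that fails the check is not discarded in this sketch; it simply stays unvalidated, which fits the goal of preserving rare forms.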
Sample dashboard: distribution of a verb form across regions. Majority highlighted, minority preserved.
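The numbers behind such a dashboard could be computed roughly as in the sketch below: count each variant per region, mark the most frequent one as the majority, and keep every minority variant with its share. The function name `variant_distribution` and the input shape are assumptions, and the sample records are placeholders rather than real corpus data.

```python
from collections import Counter, defaultdict


def variant_distribution(records):
    """Summarise (region, variant) pairs per region.

    Returns {region: {"majority": variant, "shares": {variant: fraction}}}.
    Minority variants stay in "shares"; nothing is filtered out.
    """
    counts = defaultdict(Counter)
    for region, variant in records:
        counts[region][variant] += 1

    summary = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        ranked = counter.most_common()  # ties keep first-seen order
        summary[region] = {
            "majority": ranked[0][0],
            "shares": {variant: n / total for variant, n in ranked},
        }
    return summary


# Placeholder region and variant labels.
sample = [
    ("Region A", "variant_1"),
    ("Region A", "variant_1"),
    ("Region A", "variant_2"),
    ("Region B", "variant_2"),
]
dist = variant_distribution(sample)
```

In this sketch the majority is only a label for highlighting. The full `shares` mapping is what keeps the rare forms visible.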
An honest, evolving guide based on what your neighbours actually say — not what a committee decided.
Open data feeds better TTS, translation, voice assistants and AI education tools for Kurdish.
A lean MVP today, a research-grade corpus tomorrow.
Filters, contribute & correct, basic regional stats — text only.
Short recordings, region-tagged, opt-in — feeds future Kurdish TTS / ASR.
For researchers, educators and builders of Kurdish-language tools.
Initial scoping, feasibility analysis and pattern brainstorming were stress-tested against several frontier models. We compared their answers, kept the overlap, and challenged the disagreements — so no single model frames the project alone.
Different models surface different gaps in low-resource language tooling. Cross-referencing reduces blind spots and biases in early planning.
Every contribution — a corrected sentence, a recorded snippet, a tagged dialect — strengthens the corpus.