KorpusaKurdî — A Living Corpus for the Kurdish Language

The Problem

One language, four borders, many voices.

Kurmanji, Sorani and the Southern dialects differ in script and grammar. With no central authority, even neighbouring villages can disagree on a single word. Existing corpora are large but rarely fine-grained or geography-tagged.

Four countries

Turkey, Iraq, Iran, Syria — each with different policies, scripts and exposure.

Three main dialects

Kurmanji (Latin), Sorani (Arabic-based), Southern — plus countless local varieties.

No central authority

No academy, no ministry — usage is decided by communities, not committees.

How it works

Listen first. Standardize never.

A descriptive — not prescriptive — workflow. The crowd contributes; the data speaks.

STEP 01

Contribute

Type, correct or record short snippets — tagged by region, dialect and source.

STEP 02

Tag & verify

Profile metadata + community votes + light expert moderation.

STEP 03

Analyse

LLMs and classical NLP methods identify common forms, variants and gaps across regions.

STEP 04

Surface

Open dashboards, an API, and a "living reference" of what people actually say.
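
The first two steps can be sketched as a small data model. This is only an illustration under stated assumptions, not the project's actual schema: the `Contribution` fields, the `is_verified` rule and the vote threshold of 3 are all hypothetical, and the sample snippet and its tags are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contribution:
    """One short snippet plus the tags named in Step 01 (hypothetical schema)."""
    text: str
    region: str      # e.g. "North", "Central", "Southern"
    dialect: str     # e.g. "Kurmanji", "Sorani", "Southern"
    source: str      # e.g. "typed", "corrected", "recorded"
    upvotes: int = 0
    downvotes: int = 0
    moderator_decision: Optional[bool] = None  # None = no expert has reviewed it


def is_verified(c: Contribution, min_net_votes: int = 3) -> bool:
    """Step 02 sketch: an expert decision wins; otherwise community votes decide."""
    if c.moderator_decision is not None:
        return c.moderator_decision
    return (c.upvotes - c.downvotes) >= min_net_votes


# Illustrative snippet; the tags and vote counts are invented for the example.
snippet = Contribution(text="ez diçim", region="North", dialect="Kurmanji",
                       source="typed", upvotes=4, downvotes=1)
print(is_verified(snippet))
```

Keeping the moderator field optional reflects the "light" moderation described above: most snippets would be settled by votes alone, and an expert only steps in where needed.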

A glimpse of the data

Patterns, not prescriptions.

Sample dashboard: distribution of a verb form across regions. Majority highlighted, minority preserved.

"to go" — present 1sg · regional usage

Regions: North · Central · Southern

ez diçim: 62%
min diçim: 21%
eçim: 12%
other / rare

Sample data for visualization. Real distributions emerge from contributors.
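
A distribution like the one in the sample dashboard could be computed roughly as follows. This is a minimal sketch, assuming contributions arrive as (region, form) pairs; the function name `regional_distribution` and the records are invented for illustration and are not real corpus data.

```python
from collections import Counter, defaultdict


def regional_distribution(records):
    """Return {region: [(form, share), ...]} with the largest share first.

    `records` is an iterable of (region, form) pairs. Every form is kept,
    so minority variants stay visible next to the majority one.
    """
    counts = defaultdict(Counter)
    for region, form in records:
        counts[region][form] += 1

    result = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        result[region] = [(form, n / total) for form, n in counter.most_common()]
    return result


# Made-up records for illustration only.
records = [
    ("North", "ez diçim"), ("North", "ez diçim"), ("North", "ez diçim"),
    ("North", "min diçim"),
    ("Central", "eçim"), ("Central", "min diçim"),
]

for region, forms in regional_distribution(records).items():
    majority, share = forms[0]
    print(f"{region}: majority = {majority} ({share:.0%}), all forms = {forms}")
```

Nothing is filtered out: the majority form is simply the first entry, and rarer forms remain in the list for the dashboard to show.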

For speakers

An honest, evolving guide based on what your neighbours actually say — not what a committee decided.

For builders

Open data feeds better TTS, translation, voice assistants and AI education tools for Kurdish.

Under the hood

Web-first. Open-source. Ethically sourced.

A lean MVP today, a research-grade corpus tomorrow.

React Native / Flutter
FastAPI · Node.js
PostgreSQL + PostGIS
LLM analysis (OSS / API)
Tesseract OCR
Vector DB (Pinecone / Weaviate)
CC-BY corpus license
GDPR-aligned consent

MVP

Filtering, contributing and correcting, plus basic regional stats. Text only.

Voice

Short recordings, region-tagged, opt-in — feeds future Kurdish TTS / ASR.

Open API

For researchers, educators and builders of Kurdish-language tools.

Research collaborators

Cross-checked across multiple LLMs.

Initial scoping, feasibility analysis and pattern brainstorming were stress-tested against several frontier models. We compared their answers, kept the overlap, and challenged the disagreements — so no single model frames the project alone.
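
The "keep the overlap, challenge the disagreements" step amounts to simple set operations. The sketch below assumes each model's answer has already been reduced by hand to a set of short findings; the function name `compare_answers`, the model labels and the findings are placeholders, not actual outputs.

```python
def compare_answers(answers):
    """Split findings into those every model raised and those only some raised.

    `answers` maps a model label to the set of findings it produced.
    Returns (overlap, disputed).
    """
    all_sets = list(answers.values())
    if not all_sets:
        return set(), set()
    overlap = set.intersection(*all_sets)      # raised by every model
    disputed = set.union(*all_sets) - overlap  # raised by some, not all
    return overlap, disputed


# Placeholder labels and findings, for illustration only.
answers = {
    "model_a": {"finding-1", "finding-2"},
    "model_b": {"finding-1", "finding-3"},
    "model_c": {"finding-1", "finding-2"},
}

overlap, disputed = compare_answers(answers)
print("kept:", sorted(overlap))
print("to challenge:", sorted(disputed))
```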

Why multi-LLM?

Different models surface different gaps in low-resource language tooling. Cross-referencing reduces blind spots and biases in early planning.

Sources consulted

Claude · Grok · Perplexity · Gemini

Dialect continuum review · Existing corpora gaps · Crowdsourcing precedents · Tech-stack feasibility

Join us

Help digitize and preserve Kurdish for the AI era.

Every contribution — a corrected sentence, a recorded snippet, a tagged dialect — strengthens the corpus.

Open the app · Partner with us