Crowdsourced · Descriptive · Open

A living corpus for the Kurdish language.

Kurdish lives across Turkey, Iraq, Iran and Syria — in many dialects, scripts and accents, with no single authority. KorpusaKurdî doesn't invent a new standard. It listens to real usage, tags it by region, and democratically surfaces the most common patterns — while preserving the rare ones.

App preview (Kurmancî · Amed)
Welcome back, Armanc
Daily contribution · 2 of 5
Which form do you use? "I am going to the market."
Options: Ez diçim bazarê / Ez bazarê diçim / Other…
Record a sentence: tap to record · 5s
Profile: 128 contributions · level L4 · 7 badges

The Problem

One language, four borders, many voices.

Kurmanji, Sorani and Southern dialects use different scripts and grammar. With no central authority, neighbouring villages can disagree on a single word. Existing corpora are large but rarely fine-grained or geography-tagged.

04

Four countries

Turkey, Iraq, Iran, Syria — each with different policies, scripts and exposure.

03

Three main dialects

Kurmanji (Latin), Sorani (Arabic-based), Southern — plus countless local varieties.

00

No central authority

No academy, no ministry — usage is decided by communities, not committees.

How it works

Listen first. Standardize never.

A descriptive — not prescriptive — workflow. The crowd contributes; the data speaks.

STEP 01

Contribute

Type, correct or record short snippets — tagged by region, dialect and source.

STEP 02

Tag & verify

Profile metadata + community votes + light expert moderation.

STEP 03

Analyse

LLM + classical NLP find common forms, variants and gaps across regions.

STEP 04

Surface

Open dashboards, an API, and a "living reference" of what people actually say.
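The first three steps can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code: the `Contribution` fields, the `is_verified` rule and its net-vote threshold are all assumptions made for the example, and expert moderation is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Contribution:
    """One contributed snippet with its tags (field names are hypothetical)."""
    text: str
    region: str    # e.g. "Amed"
    dialect: str   # e.g. "Kurmancî"
    source: str    # e.g. "typed", "corrected", "recorded"
    upvotes: int = 0
    downvotes: int = 0


def is_verified(c: Contribution, min_net_votes: int = 2) -> bool:
    """Assumed rule: accept once net community votes reach a threshold."""
    return c.upvotes - c.downvotes >= min_net_votes


def forms_by_region(items: list[Contribution]) -> dict[str, Counter]:
    """Count verified forms per region; no form is merged or 'corrected'."""
    counts: dict[str, Counter] = {}
    for c in items:
        if is_verified(c):
            counts.setdefault(c.region, Counter())[c.text] += 1
    return counts


samples = [
    Contribution("Ez diçim bazarê", "Amed", "Kurmancî", "typed", upvotes=3),
    Contribution("Ez bazarê diçim", "Amed", "Kurmancî", "typed", upvotes=2),
    Contribution("Ez diçim bazarê", "Amed", "Kurmancî", "recorded", upvotes=4, downvotes=1),
    Contribution("Ez diçim bazarê", "Amed", "Kurmancî", "typed", upvotes=1),  # below threshold
]
result = forms_by_region(samples)
```

Both word orders stay in the tally side by side; the count only says which one verified contributors in a region use more often.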

A glimpse of the data

Patterns, not prescriptions.

Sample dashboard: distribution of a verb form across regions. Majority highlighted, minority preserved.

"to go" (present, 1sg) · regional usage
North · Central · Southern

ez diçim        62%
min diçim       21%
eçim            12%
other / rare     5%
Sample data for visualization. Real distributions emerge from contributors.
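A distribution like this could be derived from raw counts roughly as follows. The sketch is illustrative only: the function name `regional_shares`, the labels and the 10% "rare" cut-off are assumptions, and the counts simply mirror the sample percentages above rather than real contributions.

```python
def regional_shares(counts: dict[str, int], rare_below: float = 0.10):
    """Return (form, share, label) rows sorted by share, keeping every form."""
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    for i, (form, n) in enumerate(ranked):
        share = n / total
        if i == 0:
            label = "most common"   # highlighted on the dashboard
        elif share < rare_below:
            label = "rare"          # still listed, never dropped
        else:
            label = "variant"
        rows.append((form, share, label))
    return rows


# Counts chosen to mirror the sample percentages; not real data.
sample_counts = {"ez diçim": 62, "min diçim": 21, "eçim": 12, "other / rare": 5}
rows = regional_shares(sample_counts)
for form, share, label in rows:
    print(f"{form:<14}{share:>5.0%}  {label}")
```

Labelling rare forms instead of filtering them out is what keeps the minority visible next to the majority.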

For speakers

An honest, evolving guide based on what your neighbours actually say — not what a committee decided.

For builders

Open data feeds better TTS, translation, voice assistants and AI education tools for Kurdish.

Under the hood

Mobile-first. Open-source. Ethically sourced.

A lean MVP today, a research-grade corpus tomorrow.

React Native / Flutter, FastAPI · Node.js, PostgreSQL + PostGIS, LLM analysis (OSS / API), Tesseract OCR, Vector DB (Pinecone / Weaviate), CC-BY corpus license, GDPR-aligned consent


MVP

Text only: filtering, contributing and correcting, plus basic regional stats.

Voice

Short recordings, region-tagged, opt-in — feeds future Kurdish TTS / ASR.

Open API

For researchers, educators and builders of Kurdish-language tools.

Research collaborators

Cross-checked across multiple LLMs.

Initial scoping, feasibility analysis and pattern brainstorming were stress-tested against several frontier models. We compared their answers, kept the overlap, and challenged the disagreements — so no single model frames the project alone.
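The compare-and-keep-the-overlap step can be pictured with simple set operations. This is only a sketch of the idea: the model names and findings below are placeholders invented for illustration, not the answers that were actually collected.

```python
def cross_check(answers: dict[str, set[str]]):
    """Split findings into those every model raised and those only some raised."""
    all_sets = list(answers.values())
    overlap = set.intersection(*all_sets) if all_sets else set()
    raised = set.union(*all_sets) if all_sets else set()
    disputed = {
        item: sorted(m for m, found in answers.items() if item in found)
        for item in raised - overlap
    }
    return overlap, disputed


# Placeholder model names and findings, for illustration only.
answers = {
    "model_a": {"few geography-tagged corpora", "script diversity"},
    "model_b": {"few geography-tagged corpora", "consent handling"},
    "model_c": {"few geography-tagged corpora", "script diversity"},
}
overlap, disputed = cross_check(answers)
```

Here `overlap` holds the points every model agreed on, while `disputed` maps each remaining point to the models that raised it, which is the list a human reviewer would then challenge.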

Why multi-LLM?

Different models surface different gaps in low-resource language tooling. Cross-referencing reduces blind spots and biases in early planning.

Sources consulted
Claude, Grok, Perplexity, Gemini
Dialect continuum review, existing corpora gaps, crowdsourcing precedents, tech-stack feasibility

Join us

Help digitize and preserve Kurdish for the AI era.

Every contribution — a corrected sentence, a recorded snippet, a tagged dialect — strengthens the corpus.