- Org: github.com/KorpusaKurdi
- Repo: github.com/KorpusaKurdi/korpusa-kurdi
- Kanban: GitHub Project · Board
- PROD env: korpusa-kurdi.pages.dev
1 · Read this first
Welcome. KorpusaKurdî is a Kurdish language variation tracking and documentation platform — not a standardisation project. The technical north star is to collect real usage from speakers, attach rigorous geographic metadata to every contribution, and let statistical and LLM-based analysis surface the most popular forms while preserving minority variants.
This document is a working manual. Treat it as living: open a PR if you spot anything stale.
How to read this doc
- Sections 2–4 are required reading before touching any code.
- Sections 5–12 are per-discipline: jump to your area, then skim the others.
- Sections 13–17 are process: workflow, tickets, onboarding checklist.
- Sections 18–20 are roadmap, onboarding, and the reference appendix (papers, datasets, prior work).
2 · Linguistic context engineers must know
You don't need to be a linguist, but you need to know enough not to design naïve schemas. Three facts will shape almost every technical decision.
Dialect taxonomy
| Dialect | Primary regions | Speakers | Structure | Script |
|---|---|---|---|---|
| Kurmanji (North) | Turkey, Syria, N. Iraq (Badinan), Armenia | 15–20M | Synthetic / inflected; gender, case, ergativity | Latin (Hawar) |
| Sorani (Central) | Iraqi Kurdistan, W. Iran (Rojhelat) | 6–12M | Analytic; no gender/case; word order | Perso-Arabic |
| Southern | Kermanshah, Ilam, Khanaqin | 3–5M | Transitions Sorani ↔ Laki/Pehlewani | Perso-Arabic |
| Zaza-Gorani | Tunceli, Bingöl, Hawraman | 2–3M | Distinct branch; high complexity | Varied |
Source: cross-checked against multi-LLM research (Gemini deep-research output) and Wikipedia / Translators without Borders factsheet.
Scripts & the bi-directional problem
| Feature | Hawar (Latin) | Sorani (Arabic) | Engineering impact |
|---|---|---|---|
| Direction | LTR | RTL | UI must support bi-directional layout, mirror icons |
| Vowels | 8 distinct (A, E, Ê, I, Î, O, U, Û) | Markers (و, ێ, ە) | Diacritic-aware tokenisation; never strip marks |
| Consonants | Velar / alveolar stops | Pharyngeals, uvulars (/ʕ/, /ħ/) | Phonological mapping needed for TTS / ASR |
Three-tier geographic hierarchy
Every contribution must be tagged at three levels — this is non-negotiable and shapes the database schema.
- Country level — TR / IQ / IR / SY (loanword influence: Turkish, Arabic, Persian).
- City level — major hubs (Erbil, Duhok, Diyarbakir, Mahabad…) act as dialect anchors.
- Region / village level — preserves hyper-local accents and lexical variants that broader classifications drop.
Never collapse this hierarchy into a single region string field. Always model country / city / locality as separate normalised columns + a PostGIS point.
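As a minimal sketch of what "separate normalised columns" implies at the API boundary, the validation step could look like this (hypothetical helper — `RegionTag` and `parse_region` are illustrative names, not actual kk-api code):

```python
from dataclasses import dataclass
from typing import Optional

ISO_COUNTRIES = {"TR", "IQ", "IR", "SY"}  # countries in scope today


@dataclass(frozen=True)
class RegionTag:
    country: str             # ISO 3166-1 alpha-2, normalised to upper case
    city: str                # dialect anchor hub (Erbil, Diyarbakir, ...)
    locality: Optional[str]  # hyper-local level; optional but encouraged


def parse_region(raw: dict) -> RegionTag:
    """Validate a contribution's region payload into the three tiers."""
    country = raw["country"].strip().upper()
    if country not in ISO_COUNTRIES:
        raise ValueError(f"unsupported country: {country}")
    city = raw["city"].strip()
    if not city:
        raise ValueError("city is required")
    return RegionTag(country, city, raw.get("locality") or None)
```

The point is that each tier fails or succeeds independently — a contribution with a valid country but a missing city is rejected before it can pollute the regional statistics.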
3 · Scientific approach: algorithmic democracy
"Algorithmic democracy" is the working name for the four-stage pipeline that turns raw contributions into usable, regionally-tagged variant statistics — without imposing a top-down standard.
1. Collect & label
Every datum tagged with country/city/locality + dialect self-ID + user profile metadata.
2. Pattern extraction
LLMs + classical NLP (clustering, frequency mining) find variants for any concept.
3. Democratic selection
For each context, compute most-popular form per region; never delete minorities.
4. Living reference
Surface results via dashboards + open API. Re-runs as new data arrives.
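The "democratic selection" stage is, at its core, a grouped frequency count that ranks forms without discarding any. A minimal in-memory sketch (the production version is the SQL materialized view in §10; this is illustrative only):

```python
from collections import Counter, defaultdict


def popular_per_region(contributions):
    """contributions: iterable of (concept, region, form) triples.

    Returns {(concept, region): [(form, count), ...]} with the most
    popular form first. Minority forms stay in the list with their
    support counts — they are ranked, never deleted.
    """
    buckets = defaultdict(Counter)
    for concept, region, form in contributions:
        buckets[(concept, region)][form] += 1
    return {key: counter.most_common() for key, counter in buckets.items()}
```

Example: three contributions of "ez diçim bazarê" and one of "ez bazarê diçim" from TR/Diyarbakir rank the first form on top while keeping the second visible with its count.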
Corpus vs. agent: a hybrid
We deliberately build both:
- Static corpus — the ground-truth dataset, exported under CC-BY 4.0, suitable for training third-party models.
- Dynamic agent — an interactive layer that does active learning: it identifies under-represented regions and prompts users to fill those gaps, plus real-time validation hints.
Sample tasks the agent surfaces
- "Translate word X in your dialect."
- "Correct this sentence from a Rudaw article."
- "Record this 5-second daily snippet."
- "Confirm whether you'd say ez diçim bazarê or ez bazarê diçim."
4 · Data analysis & outputs
Data analysis is the bridge between contributions and insight. Every analysis we run is descriptive: we surface what speakers actually do, never decide what they should do. The result is a corpus that reveals variation rather than flattening it.
Pipeline (raw → output)
Six stages, each one auditable. Any contribution can be traced from any output back to the original speaker (anonymised, with consent), so any claim the platform makes is reproducible.
1. Speakers submit text, audio or labels with regional, dialect, age and register metadata.
2. Script unification (Hawar / Sorani / Cyrillic), token cleaning, metadata validation.
3. Bucket contributions by region, dialect, cohort, register and time window.
4. Statistical: which forms exist, at what frequency per region. Clustering: which regions group together.
5. Semantic clustering, novel-form detection, lemma proposals — always reviewed by humans before publication.
6. Outputs are versioned, signed and shipped: datasets, dashboards, APIs, papers.
What the analysis measures
| Metric | What it shows |
|---|---|
| `form_distribution` | Frequency of a form across regions and dialects. |
| `dialect_cohesion` | How much speakers within a region agree on a single form. |
| `variant_rarity` | How rare a minority form is — drives the preservation index. |
| `geographic_spread` | How wide the geographic footprint of a form is. |
| `temporal_drift` | How a form's frequency changes over time (years, generations). |
| `cross_script_map` | The same word across Hawar, Sorani and Cyrillic spellings. |
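Two of these metrics reduce to simple ratios over a region's accepted forms. A hedged sketch of plausible definitions (function names and exact formulas are illustrative, not a spec — the real definitions live with the analysis code):

```python
from collections import Counter


def dialect_cohesion(forms: list) -> float:
    """Share of contributions in a region that use the single most
    common form. 1.0 = total agreement; approaches 0 as variation grows."""
    if not forms:
        return 0.0
    counts = Counter(forms)
    return counts.most_common(1)[0][1] / len(forms)


def variant_rarity(form, forms: list) -> float:
    """1 minus relative frequency: rarer forms score closer to 1.0,
    which is what feeds the preservation index."""
    return 1.0 - forms.count(form) / len(forms)
```

For a region sample `["a", "a", "b", "a"]`, cohesion is 0.75 and the rarity of `"b"` is 0.75 — the same numbers, read from opposite ends.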
What the corpus enables
Each output below is something the analysis stage can produce or feed. Some are live dashboards, some are versioned datasets, some are training corpora consumed by other projects.
Public datasets
Versioned Parquet / JSON-L releases with full metadata, CC-BY licensed, citable.
Variation map
Interactive geographic map: pick a word, see how it is said across regions.
Open API
Query: "most common form for X in region Y" — JSON in, JSON out.
Living dictionary
A reference where every entry carries regional usage frequencies — not a single "standard" form.
Academic papers
Replicable methodology + dataset versioning lets researchers cite the exact slice they used.
Training corpora
Cleaned, region-tagged data for TTS, STT and Kurdish-aware LLMs.
Educational materials
Variation-aware schoolbooks: students see the range and where each form is used, not just one.
Preservation index
A live measure of how endangered a rare form is — a signal for archivists and educators.
Journalism dashboards
Reporters can verify "is this word really common?" and find where to interview real speakers.
5 · Resources & prior art
Before building anything, audit what already exists. The Kurdish NLP space has scattered but real resources — datasets, toolkits, voice corpora, comparable platforms. Reuse where possible; complement rather than duplicate.
Direct links below are deliberately conservative — only canonical URLs we are confident about. For everything else, names are leads — search HuggingFace Datasets and GitHub by name to find the current home.
Datasets (text)
kurdish-ai / kurdish-corpus
Mixed-source Kurdish text corpus on HuggingFace Datasets — fast starting point for tokenisation, embedding and language-model work.
huggingface.co →
OSCAR
Massively multilingual web corpus with Kurdish slices (Kurmanji + Sorani). Useful for LM pre-training. Search HuggingFace for oscar-corpus/oscar.
CC-100
Filtered Common Crawl per-language corpora with Kurdish splits. Search HuggingFace Datasets for cc100.
Wikipedia (Kurdish)
Periodic dumps of Kurmanji + Sorani Wikipedia. Clean encyclopedic register — small, but high quality. Search the Wikimedia dumps site for kuwiki / ckbwiki.
Tatoeba (Kurdish)
Crowdsourced sentence pairs with translations — useful for parallel data and sanity checks across scripts.
AgaCKNER
Annotated NER dataset for Sorani. Smaller scale, but rare for being supervised. Search HuggingFace / GitHub by name.
Voice & speech
Mozilla Common Voice
Open multilingual voice corpus with growing Kurdish coverage. Aligned MP3 + transcript pairs, contributor pipeline we can study.
commonvoice.mozilla.org →
kurdishtts.com
External TTS API used for synthesis. Possible STT integration. Already wired into the planned audio pipeline.
RHVoice
Open-source TTS engine with experimental Kurdish voices. A solid base for offline / low-resource synthesis.
NLP libraries & tools
KLPT
Kurdish Language Processing Toolkit (Sina Ahmadi). Tokenisation, lemmatisation, transliteration, normalisation across scripts. The de-facto baseline.
github.com/sinaahmadi/KLPT →
Tesseract 5
OCR engine with Kurdish-trained models for both Hawar (Latin) and Sorani (Arabic) scripts.
spaCy + custom
Tokenisation / POS pipelines extended for Kurdish. No first-party model — community-trained, project-specific.
awesome-kurdish
Community-maintained index of Kurdish NLP work — datasets, papers, tools. Search GitHub for awesome-kurdish.
Comparable platforms (prior art)
Common Voice (Mozilla)
The reference for crowdsourced multilingual voice collection. Mature contribution UX, validation flow, donor experience to study.
Dia-Lingle
ETH Zürich's gamified Swiss German dialect collection. Inspiration for a playful, low-friction contribution flow.
CorCenCC
National Corpus of Contemporary Welsh — mobile recording, register tagging, public-facing dashboards. Closest analogue in scope and intent.
standwithkurds.org
Action-focused single-page UX — pattern for clarity over feature density on the public-facing side.
6 · System architecture
Five logical stages, each independently scalable. Read left-to-right.
- Contribute UI
- Voice capture
- Offline queue
- Dashboards
- Moderation console
- Rate limit
- WAF
- TLS
- JWT issuance
- OAuth
- Contribution CRUD
- Voting
- Moderation queue
- OCR jobs
- LLM jobs
- Stats refresh
- Pattern extraction
- Variant clustering
- Tokenise / lemmatise
- POS tagging
Stages are deployed independently. The intelligence layer is async — clients never block on LLM calls.
7 · Repositories & code organisation
The target setup is polyrepo under the KorpusaKurdi GitHub org. A monorepo was rejected because mobile and ML pipelines have very different CI shapes.
korpusa-kurdi exists right now — it serves as project homepage, development sandbox, and this engineering doc (deployed to korpusa-kurdi.pages.dev). The other repos in the table below are the target shape and will be spun up as each workstream actually starts.
| Repo | Status | Purpose | Stack | CI target |
|---|---|---|---|---|
korpusa-kurdi | live | Homepage + this doc + ADRs (today also dev sandbox) | HTML / static | Cloudflare Pages |
kk-mobile | planned | Cross-platform contribution app | React Native + TS | EAS / Fastlane |
kk-web | planned | Dashboards & moderation console | Next.js + TS | Vercel |
kk-api | planned | Core REST + GraphQL API | Python · FastAPI | Docker / k8s |
kk-workers | planned | OCR, LLM, ETL jobs | Python · Celery | Docker / k8s |
kk-nlp | planned | Tokenisation, lemmatisation, POS | Python · KLPT | PyPI publish |
kk-corpus | planned | Public dataset releases (CC-BY) | Parquet, JSON-L | HF datasets |
kk-infra | planned | Terraform, k8s manifests, Helm | Terraform · Helm | Atlantis |
The originally-planned kk-docs role is currently filled by korpusa-kurdi; a dedicated docs repo may split off later if scope grows.
Reference code layout (kk-api)
```
kk-api/
├── app/
│   ├── api/        # FastAPI routers (v1, v2)
│   ├── domain/     # business logic, no I/O
│   ├── infra/      # db, queues, object storage
│   ├── workers/    # celery tasks
│   └── schemas/    # pydantic models
├── migrations/     # alembic
├── tests/          # pytest, >80% coverage gate
├── pyproject.toml
└── Dockerfile
```
Branching model: main; short-lived feature branches; squash-merge with semantic commit messages (feat:, fix:, chore:, docs:).
8 · Frontend (mobile)
Mobile-first because contributors are overwhelmingly on phones, often in the diaspora or in regions with patchy connectivity.
Stack decision: React Native (TypeScript)
- Cross-platform, large hiring pool, web team can review.
- i18next for UI strings; ICU pluralisation.
- react-native-mmkv for offline contribution queue.
- expo-av for voice capture; Opus codec for compact uploads.
Required UX rules
- RTL toggle when the user picks a Sorani-script keyboard. Test both directions.
- Never auto-correct contributions — that violates the descriptive principle.
- Always show the user's region tag at the top of the contribute screen so they can correct it.
- Voice capture is opt-in per session; explicit consent banner before mic access.
Screen inventory (MVP)
- Onboarding — country/city/region picker, dialect self-ID (or "guess for me").
- Contribute — quick type, correct-a-source, record snippet.
- Profile / Stats — contributions, badges, level, regional impact.
- Map — visual contributions per region (uses PostGIS aggregates).
9 · Backend & API
FastAPI for the synchronous edges, Celery (Redis broker) for any task that touches an LLM, OCR, or batch stats.
Sample endpoints
```
# Submit a free-form contribution
POST /v1/contributions
{
  "type": "text",
  "content": "Ez diçim bazarê",
  "context_concept": "go-to-market.1sg.present",
  "region": { "country": "TR", "city": "Diyarbakir", "locality": "Sur" },
  "dialect_self_id": "kurmanji",
  "source": { "type": "freetype" }
}

# Vote on a candidate variant
POST /v1/contributions/{id}/votes
{ "value": 1 }   # +1 / -1

# Read the live "most popular" form for a concept + region
GET /v1/concepts/{concept_id}/popular?country=TR&city=Diyarbakir
```
Moderation queue
Every contribution lands in a pending state. A row promotes to accepted when:
- ≥ 3 community up-votes and agent-side spam check passes, or
- An expert moderator (role: linguist) confirms it.
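The promotion rule above is a plain boolean: either the community path (votes plus spam check) or the expert path fires. A sketch of that gate (illustrative function, not the actual moderation service code):

```python
def should_promote(net_upvotes: int,
                   spam_check_passed: bool,
                   confirmed_by_linguist: bool) -> bool:
    """Promote a pending contribution to accepted when either
    the community path or the expert path is satisfied."""
    community_path = net_upvotes >= 3 and spam_check_passed
    return community_path or confirmed_by_linguist
```

Note that a linguist confirmation bypasses the vote threshold entirely, but three up-votes without a passing spam check promote nothing — the two conditions of the community path are conjunctive.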
10 · Database design
PostgreSQL 16 + PostGIS + pgvector. The schema centres on three things: what was contributed, where it was contributed from, and how confident we are.
Core tables (DDL excerpt)
```sql
CREATE TABLE users (
  id               uuid PRIMARY KEY,
  display          text,
  origin_country   char(2),
  origin_city      text,
  origin_locality  text,
  geo              geography(Point, 4326),
  dialect_self_id  text,   -- kurmanji|sorani|southern|zaza-gorani|other
  role             text DEFAULT 'contributor',
  created_at       timestamptz DEFAULT now()
);

CREATE TABLE concepts (
  id        uuid PRIMARY KEY,
  key       text UNIQUE,   -- e.g. go-to-market.1sg.present
  gloss_en  text,
  pos       text
);

CREATE TABLE contributions (
  id               uuid PRIMARY KEY,
  user_id          uuid REFERENCES users(id),
  concept_id       uuid REFERENCES concepts(id),
  type             text CHECK (type IN ('text','audio','correction')),
  content          text,
  audio_uri        text,
  source_type      text,   -- freetype|rudaw|book|other
  region_country   char(2),
  region_city      text,
  region_locality  text,
  geo              geography(Point, 4326),
  dialect_self_id  text,
  status           text DEFAULT 'pending',
  embedding        vector(384),   -- pgvector for semantic clustering
  created_at       timestamptz DEFAULT now()
);

CREATE INDEX ON contributions USING GIST (geo);
CREATE INDEX ON contributions USING ivfflat (embedding vector_cosine_ops);

CREATE TABLE votes (
  contribution_id  uuid REFERENCES contributions(id),
  user_id          uuid REFERENCES users(id),
  value            smallint CHECK (value IN (-1, 1)),
  PRIMARY KEY (contribution_id, user_id)
);

CREATE MATERIALIZED VIEW popular_per_region AS
SELECT concept_id, region_country, region_city, content,
       COUNT(*) AS support,
       RANK() OVER (PARTITION BY concept_id, region_country, region_city
                    ORDER BY COUNT(*) DESC) AS rk
FROM contributions
WHERE status = 'accepted'
GROUP BY 1, 2, 3, 4;
```
Why geography + vector together? The geography column powers regional dashboards and "near me" tasks; the embedding powers semantic deduplication ("Ez bazarê diçim" vs "Ez diçim bazarê" cluster as variants of one concept).
11 · AI / LLM pipeline
AI is not the product. It is a processing layer that turns raw, messy crowdsourced data into structured, queryable patterns.
Stages
- Pre-processing — KLPT (Kurdish Language Processing Toolkit) for tokenisation, stemming, lemmatisation. Critical for Sorani clitics and Kurmanji ergative alignment.
- Embedding — multilingual sentence transformer; store vectors in pgvector.
- Variant clustering — semantic dedupe; same concept across surface variants.
- Pattern extraction (LLM) — given a cluster, the LLM names the morphological/lexical variation pattern.
- Active learning — agent identifies low-coverage (concept × region) cells and surfaces tasks to fill them.
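The active-learning step amounts to counting contributions per (concept × region) cell and surfacing the emptiest cells first. A stdlib-only sketch (function name and the `min_count` threshold are illustrative assumptions):

```python
from collections import Counter
from itertools import product


def coverage_gaps(contributions, concepts, regions, min_count=5):
    """Return ((concept, region), count) cells with fewer than
    min_count contributions, worst-covered first — these become
    the tasks the agent prompts users to fill."""
    counts = Counter((c, r) for c, r, *_ in contributions)
    gaps = [(cell, counts.get(cell, 0))
            for cell in product(concepts, regions)
            if counts.get(cell, 0) < min_count]
    return sorted(gaps, key=lambda item: item[1])
```

Running this over the full corpus each night would be enough to keep the agent's task queue biased toward under-represented regions, which is the stated goal of the hybrid design in §3.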
Sample LLM prompt template (variant naming)
```
# system
You are a Kurdish-language dialectology assistant.
You do NOT propose a "correct" form. You name the linguistic
pattern that distinguishes the given variants.

# user
Concept: go-to-market.1sg.present
Variants observed:
- "ez diçim bazarê"    (n=312, region=TR/Diyarbakir)
- "ez bazarê diçim"    (n=104, region=TR/Mardin)
- "min diçim bo bazař" (n=88,  region=IQ/Erbil)

Return JSON:
{
  "pattern": "...",           // e.g. SOV vs SVO order
  "axes": ["word_order", "case_marking"],
  "confidence": 0.0-1.0,
  "minority_preserved": true
}
```
LLMs we use (and why we cross-check)
We treat any single model as a biased lens. Pattern-extraction prompts are run against a small ensemble (e.g. Claude, Grok, an OSS model like Llama or Qwen). Disagreement is logged and surfaced to a linguist for review.
Existing Kurdish NLP we leverage
- KLPT — tokenisation, lemmatisation, transliteration.
- AgaCKNER — ~64,563 annotated tokens, Sorani NER.
- KurdishMT, Hugging Face Kurdish corpora — seed data.
- Mozilla Common Voice (Kurdish) — bootstrap voice data.
- awesome-kurdish GitHub list — staying-current index.
12 · OCR & source ingestion
A meaningful portion of "good" Kurdish text is locked in scanned books, Rudaw archives, government documents. We unlock it via a "Correct-a-Source" module.
Pipeline
- Operator uploads a PDF / image batch into kk-workers.
- Worker runs Tesseract 5 (fine-tuned for Kurdish scripts; we maintain our own model artefacts).
- The MRWL (Max Rightmost White Line) segmentation algorithm handles cursive Arabic-script Kurdish, with reported CER ≈ 0.755% on print, much higher on handwriting.
- Output is staged as candidate contributions with source_type=book.
- App users see a side-by-side: scan + AI transcription, with three correction granularities — word, sentence, "quick confirm".
Provenance and licensing are tracked via contributions.source_type + a separate source_licenses table.
13 · Quality & metrics
We measure data and model quality with standard error-rate metrics so the corpus can be benchmarked alongside other low-resource language datasets.
Character / Word Error Rate
Given S substitutions, D deletions, I insertions and N reference characters (or words):
ER = (S + D + I) / N
Reported separately as CER (character) and WER (word). Track per-script (Latin Hawar vs Perso-Arabic Sorani) — they don't behave the same.
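The formula above can be computed with a standard Levenshtein dynamic program; the minimal-edit counts of S, D and I fall out of the distance. A stdlib sketch that works for both CER (pass strings) and WER (pass token lists):

```python
def error_rate(reference, hypothesis) -> float:
    """(S + D + I) / N over the minimal edit sequence.
    Pass strings for CER, token lists (e.g. s.split()) for WER."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Because the same function handles both granularities, per-script CER and WER dashboards can share one implementation — only the tokenisation differs.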
Acceptance thresholds
| Stage | Metric | Threshold |
|---|---|---|
| OCR (print) | CER | < 2% |
| OCR (handwritten) | CER | < 8%; above that, flagged for human review |
| Crowd contributions | community votes | ≥ 3 net upvotes |
| LLM pattern extraction | cross-model agreement | ≥ 2 / 3 ensemble agreement |
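The ensemble-agreement threshold in the last row is a straight majority check over the models' pattern labels. A sketch of how that gate might look (illustrative helper, not the actual pipeline code):

```python
from collections import Counter


def ensemble_agrees(model_outputs: list, threshold: int = 2) -> bool:
    """True when at least `threshold` models in the ensemble returned
    the same pattern label; otherwise the cluster is routed to a
    linguist for manual review."""
    if not model_outputs:
        return False
    _, top_count = Counter(model_outputs).most_common(1)[0]
    return top_count >= threshold
```

With the default 2-of-3 threshold, `["SOV", "SOV", "SVO"]` passes while a three-way split does not — exactly the disagreement case §11 says gets logged for a linguist.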
Gamification ↔ quality
Levels gate sensitive tasks (e.g. only Validators+ vote on others' contributions). This is both engagement and quality control.
| Level | Tasks unlocked | Reward |
|---|---|---|
| Newcomer | Profile + region setup | XP |
| Contributor | Free-type, correct-a-word | Badges, micro-tokens |
| Validator | Vote on others' data | Higher token multiplier |
| Expert (Linguist) | Manage OCR datasets, localise UI | Staking rights, governance votes |
14 · DevOps & infrastructure
Environments
- local — docker-compose, mock LLM (Ollama), seeded Postgres.
- dev — auto-deployed from main on every merge.
- staging — promoted by tag v*-rc.*; full PII-safe seed data.
- prod — promoted by tag v*; behind change-management.
CI/CD
- GitHub Actions per repo. Reusable workflows in kk-infra/.github.
- Required checks before merge: lint, unit, integration, security scan (Trivy), license scan.
- Coverage gate: 80% minimum per repo, 90% for kk-api/domain.
- Container images pushed to GHCR; signed with cosign.
Observability
- OpenTelemetry traces → Tempo / Grafana.
- Structured JSON logs → Loki.
- Metrics → Prometheus; dashboards in Grafana (per-region contribution rate is a first-class SLI).
- Sentry for client + server errors.
Secrets
Never commit. Local: .env via direnv. Cloud: 1Password Connect → external-secrets-operator → k8s. Rotation policy: 90 days; LLM API keys: 30 days.
15 · GitHub workflow
Repo conventions
- Default branch: main. Protected. PRs only.
- One reviewer required (two for kk-api and kk-infra).
- Conventional Commits — release notes are auto-generated.
- CODEOWNERS per directory. Linguistic-sensitive code (kk-nlp) requires sign-off from a linguist reviewer.
Issue templates
- bug.yml — repro, expected, actual, region/dialect impacted.
- feature.yml — problem, proposal, descriptive-principle check.
- data.yml — corpus / source request with licensing detail.
- research.yml — references a paper / ADR / RFC.
Label taxonomy
- Type: type/bug, type/feat, type/chore, type/docs, type/research.
- Area: area/mobile, area/api, area/nlp, area/ocr, area/infra.
- Priority: p0 … p3.
- Linguistic: dialect/kurmanji, dialect/sorani, script/latin, script/arabic.
RFC & ADR process
Non-trivial design changes go to kk-docs/rfc first. Architecture decisions captured as ADRs (one decision, one file, immutable once accepted).
16 · Tickets & sprints
Project tracking lives in GitHub Projects (organisation-level). Two boards:
- Engineering — sprint board, two-week cadence, columns: Backlog → Ready → In progress → Review → Done.
- Roadmap — quarterly view, milestones grouped by phase (see §18).
Definition of Ready
- Acceptance criteria written.
- Linked to a milestone.
- Estimated (S / M / L).
- Has area + type labels.
Definition of Done
- Tests green; coverage threshold respected.
- Docs updated where behaviour changed (this file or a sibling).
- Telemetry / metric added if a new flow.
- Linguistic sanity-check signed off if touching kk-nlp or contribution flow.
- Migrations applied to dev; rollback documented.
17 · Standards & compliance
Code style
- Python: ruff + black; type-checked with mypy --strict in domain/.
- TS: ESLint + Prettier; strict: true in tsconfig.
- SQL: migrations only via alembic; no autogenerate straight to prod.
Data licensing
- Corpus releases: CC-BY 4.0.
- Code: Apache 2.0 unless a repo says otherwise.
- Models we fine-tune: same licence as the base model unless we own all the training data.
Privacy & ethics
- GDPR-aligned consent flow on first launch and before any voice capture.
- Right to deletion: hard-delete a user → soft-anonymise their contributions (keep the data for the corpus, drop the link).
- No political bias or content moderation by the team — descriptive only. Hate-speech / spam handled by community + automated filters, never by editorial choice on linguistic content.
- Underrepresented regions are actively targeted by outreach, never algorithmically penalised.
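The right-to-deletion rule above (hard-delete the user, soft-anonymise their contributions) is a simple transformation over contribution records. A sketch of the idea (illustrative function over plain dicts; the real implementation would be a SQL UPDATE inside the deletion transaction):

```python
def anonymise_contributions(contributions, deleted_user_id):
    """Right-to-deletion: keep the linguistic data for the corpus,
    drop the link to the deleted user by nulling user_id."""
    return [
        {**c, "user_id": None} if c["user_id"] == deleted_user_id else c
        for c in contributions
    ]
```

The corpus keeps its counts and regional tags intact; only the join back to the deleted account disappears, which is what makes deletion compatible with reproducible published statistics.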
18 · Roadmap & milestones
| Phase | Engineering deliverable | Exit criteria |
|---|---|---|
| P0 · Foundations | Repos, CI, infra terraformed, schema v1, auth | Hello-world contribution lands in DB end-to-end |
| P1 · MVP (text only) | Mobile app: contribute / correct, basic stats | ≥ 100 active testers, 10k contributions across 1 country |
| P2 · Seed data | Import existing corpora; manual cleanup | Bootstrapped corpus ≥ 1M tokens, geo-tagged |
| P3 · Voice | Audio capture, opt-in, ASR ingestion | ≥ 100 hours regionally-tagged audio |
| P4 · LLM analysis | Pattern extraction, dashboards, agent prompts | Live "popular per region" view, < 1h latency |
| P5 · Open API | Public read API, dataset releases | First external integration (TTS / translation) |
19 · Onboarding checklist
Tick as you go. The whole list should be done by end of week 2.
Day 1 — accounts & orientation
Week 1 — first commits
Month 1 — own a slice
20 · Appendix & references
Research sources behind this doc
This document was distilled from a multi-LLM research pass — Claude, Grok, Perplexity and Gemini — cross-checked to reduce single-model bias. Where they disagreed we kept the most cautious option and flagged the disagreement here for future review.
External datasets, tools and comparable platforms have moved to §5 (Resources & prior art) so they are easier to find before diving into the codebase.
Glossary
- CER / WER — character / word error rate.
- KLPT — Kurdish Language Processing Toolkit.
- NER — named entity recognition.
- POS — part-of-speech tagging.
- MRWL — Max Rightmost White Line segmentation (Kurdish OCR).
- Algorithmic democracy — our four-stage pipeline (collect → extract → select → surface) for finding popular forms without imposing a standard.