- Org: github.com/KorpusaKurdi
- Repo: github.com/KorpusaKurdi/korpusa-kurdi
- Kanban: GitHub Project · Board
- PROD env: korpusa-kurdi.pages.dev
1 · Read this first
Welcome. KorpusaKurdî is a Kurdish language variation tracking and documentation platform — not a standardisation project. The technical north star is to collect real usage from speakers, attach rigorous geographic metadata to every contribution, and let statistical and LLM-based analysis surface the most popular forms while preserving minority variants.
This document is a working manual. Treat it as living: open a PR if you spot anything stale.
How to read this doc
- Sections 2–4 are required reading before touching any code.
- Sections 5–12 are per-discipline: jump to your area, then skim the others.
- Sections 13–17 are process: workflow, tickets, onboarding checklist.
- Sections 18–20 are roadmap, onboarding, and the reference appendix (papers, datasets, prior work).
2 · Linguistic context engineers must know
You don't need to be a linguist, but you need to know enough not to design naïve schemas. Three facts will shape almost every technical decision.
Dialect taxonomy
| Dialect | Primary regions | Speakers | Structure | Script |
|---|---|---|---|---|
| Kurmanji (North) | Turkey, Syria, N. Iraq (Badinan), Armenia | 15–20M | Synthetic / inflected; gender, case, ergativity | Latin (Hawar) |
| Sorani (Central) | Iraqi Kurdistan, W. Iran (Rojhelat) | 6–12M | Analytic; no gender/case; word order | Perso-Arabic |
| Southern | Kermanshah, Ilam, Khanaqin | 3–5M | Transitions Sorani ↔ Laki/Pehlewani | Perso-Arabic |
| Zaza-Gorani | Tunceli, Bingöl, Hawraman | 2–3M | Distinct branch; high complexity | Varied |
Source: cross-checked against multi-LLM research (Gemini deep-research output) and Wikipedia / Translators without Borders factsheet.
Scripts & the bi-directional problem
| Feature | Hawar (Latin) | Sorani (Arabic) | Engineering impact |
|---|---|---|---|
| Direction | LTR | RTL | UI must support bi-directional layout, mirror icons |
| Vowels | 8 distinct (A, E, Ê, I, Î, O, U, Û) | Markers (و, ێ, ە) | Diacritic-aware tokenisation; never strip marks |
| Consonants | Velar / alveolar stops | Pharyngeals, uvulars (/ʕ/, /ħ/) | Phonological mapping needed for TTS / ASR |
Three-tier geographic hierarchy
Every contribution must be tagged at three levels — this is non-negotiable and shapes the database schema.
- Country level — TR / IQ / IR / SY (loanword influence: Turkish, Arabic, Persian).
- City level — major hubs (Erbil, Duhok, Diyarbakir, Mahabad…) act as dialect anchors.
- Region / village level — preserves hyper-local accents and lexical variants that broader classifications drop.
Never collapse this hierarchy into a single region string field. Always model country / city / locality as separate normalised columns + a PostGIS point.
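As a minimal sketch of what "separate normalised columns" implies at the API boundary, the validation step could look like this (hypothetical helper — `RegionTag` and `parse_region` are illustrative names, not actual kk-api code):

```python
from dataclasses import dataclass
from typing import Optional

ISO_COUNTRIES = {"TR", "IQ", "IR", "SY"}  # countries in scope today


@dataclass(frozen=True)
class RegionTag:
    country: str             # ISO 3166-1 alpha-2, normalised to upper case
    city: str                # dialect anchor hub (Erbil, Diyarbakir, ...)
    locality: Optional[str]  # hyper-local level; optional but encouraged


def parse_region(raw: dict) -> RegionTag:
    """Validate a contribution's region payload into the three tiers."""
    country = raw["country"].strip().upper()
    if country not in ISO_COUNTRIES:
        raise ValueError(f"unsupported country: {country}")
    city = raw["city"].strip()
    if not city:
        raise ValueError("city is required")
    return RegionTag(country, city, raw.get("locality") or None)
```

The point is that each tier fails or succeeds independently — a contribution with a valid country but a missing city is rejected before it can pollute the regional statistics.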
3 · Scientific approach: algorithmic democracy
"Algorithmic democracy" is the working name for the four-stage pipeline that turns raw contributions into usable, regionally-tagged variant statistics — without imposing a top-down standard.
1. Collect & label
Every datum tagged with country/city/locality + dialect self-ID + user profile metadata.
2. Pattern extraction
LLMs + classical NLP (clustering, frequency mining) find variants for any concept.
3. Democratic selection
For each context, compute most-popular form per region; never delete minorities.
4. Living reference
Surface results via dashboards + open API. Re-runs as new data arrives.
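The "democratic selection" stage is, at its core, a grouped frequency count that ranks forms without discarding any. A minimal in-memory sketch (the production version is the SQL materialized view in §10; this is illustrative only):

```python
from collections import Counter, defaultdict


def popular_per_region(contributions):
    """contributions: iterable of (concept, region, form) triples.

    Returns {(concept, region): [(form, count), ...]} with the most
    popular form first. Minority forms stay in the list with their
    support counts — they are ranked, never deleted.
    """
    buckets = defaultdict(Counter)
    for concept, region, form in contributions:
        buckets[(concept, region)][form] += 1
    return {key: counter.most_common() for key, counter in buckets.items()}
```

Example: three contributions of "ez diçim bazarê" and one of "ez bazarê diçim" from TR/Diyarbakir rank the first form on top while keeping the second visible with its count.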
Corpus vs. agent: a hybrid
We deliberately build both:
- Static corpus — the ground-truth dataset, exported under CC-BY 4.0, suitable for training third-party models.
- Dynamic agent — an interactive layer that does active learning: it identifies under-represented regions and prompts users to fill those gaps, plus real-time validation hints.
Sample tasks the agent surfaces
- "Translate word X in your dialect."
- "Correct this sentence from a Rudaw article."
- "Record this 5-second daily snippet."
- "Confirm whether you'd say ez diçim bazarê or ez bazarê diçim."
4 · Data analysis & outputs
Data analysis is the bridge between contributions and insight. Every analysis we run is descriptive: we surface what speakers actually do, never decide what they should do. The result is a corpus that reveals variation rather than flattening it.
Pipeline (raw → output)
Six stages, each one auditable. Any contribution can be traced from any output back to the original speaker (anonymised, with consent), so any claim the platform makes is reproducible.
1. Speakers submit text, audio or labels with regional, dialect, age and register metadata.
2. Script unification (Hawar / Sorani / Cyrillic), token cleaning, metadata validation.
3. Bucket contributions by region, dialect, cohort, register and time window.
4. Statistical: which forms exist, at what frequency per region. Clustering: which regions group together.
5. Semantic clustering, novel-form detection, lemma proposals — always reviewed by humans before publication.
6. Outputs are versioned, signed and shipped: datasets, dashboards, APIs, papers.
What the analysis measures
| Metric | What it shows |
|---|---|
| `form_distribution` | Frequency of a form across regions and dialects. |
| `dialect_cohesion` | How much speakers within a region agree on a single form. |
| `variant_rarity` | How rare a minority form is — drives the preservation index. |
| `geographic_spread` | How wide the geographic footprint of a form is. |
| `temporal_drift` | How a form's frequency changes over time (years, generations). |
| `cross_script_map` | The same word across Hawar, Sorani and Cyrillic spellings. |
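Two of these metrics reduce to simple ratios over a region's accepted forms. A hedged sketch of plausible definitions (function names and exact formulas are illustrative, not a spec — the real definitions live with the analysis code):

```python
from collections import Counter


def dialect_cohesion(forms: list) -> float:
    """Share of contributions in a region that use the single most
    common form. 1.0 = total agreement; approaches 0 as variation grows."""
    if not forms:
        return 0.0
    counts = Counter(forms)
    return counts.most_common(1)[0][1] / len(forms)


def variant_rarity(form, forms: list) -> float:
    """1 minus relative frequency: rarer forms score closer to 1.0,
    which is what feeds the preservation index."""
    return 1.0 - forms.count(form) / len(forms)
```

For a region sample `["a", "a", "b", "a"]`, cohesion is 0.75 and the rarity of `"b"` is 0.75 — the same numbers, read from opposite ends.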
What the corpus enables
Each output below is something the analysis stage can produce or feed. Some are live dashboards, some are versioned datasets, some are training corpora consumed by other projects.
Public datasets
Versioned Parquet / JSON-L releases with full metadata, CC-BY licensed, citable.
Variation map
Interactive geographic map: pick a word, see how it is said across regions.
Open API
Query: "most common form for X in region Y" — JSON in, JSON out.
Living dictionary
A reference where every entry carries regional usage frequencies — not a single "standard" form.
Academic papers
Replicable methodology + dataset versioning lets researchers cite the exact slice they used.
Training corpora
Cleaned, region-tagged data for TTS, STT and Kurdish-aware LLMs.
Educational materials
Variation-aware schoolbooks: students see the range and where each form is used, not just one.
Preservation index
A live measure of how endangered a rare form is — a signal for archivists and educators.
Journalism dashboards
Reporters can verify "is this word really common?" and find where to interview real speakers.
5 · Resources & prior art
Before building anything, audit what already exists. The Kurdish NLP space has scattered but real resources — datasets, toolkits, voice corpora, comparable platforms. Reuse where possible; complement rather than duplicate.
Direct links below are deliberately conservative — only canonical URLs we are confident about. For everything else, names are leads — search HuggingFace Datasets and GitHub by name to find the current home.
Datasets (text)
kurdish-ai / kurdish-corpus
Mixed-source Kurdish text corpus on HuggingFace Datasets — fast starting point for tokenisation, embedding and language-model work.
huggingface.co →
OSCAR
Massively multilingual web corpus with Kurdish slices (Kurmanji + Sorani). Useful for LM pre-training. Search HuggingFace for oscar-corpus/oscar.
CC-100
Filtered Common Crawl per-language corpora with Kurdish splits. Search HuggingFace Datasets for cc100.
Wikipedia (Kurdish)
Periodic dumps of Kurmanji + Sorani Wikipedia. Clean encyclopedic register — small, but high quality. Search the Wikimedia dumps site for kuwiki / ckbwiki.
Tatoeba (Kurdish)
Crowdsourced sentence pairs with translations — useful for parallel data and sanity checks across scripts.
AgaCKNER
Annotated NER dataset for Sorani. Smaller scale, but rare for being supervised. Search HuggingFace / GitHub by name.
Voice & speech
Mozilla Common Voice
Open multilingual voice corpus with growing Kurdish coverage. Aligned MP3 + transcript pairs, contributor pipeline we can study.
commonvoice.mozilla.org →
kurdishtts.com
External TTS API used for synthesis. Possible STT integration. Already wired into the planned audio pipeline.
RHVoice
Open-source TTS engine with experimental Kurdish voices. A solid base for offline / low-resource synthesis.
NLP libraries & tools
KLPT
Kurdish Language Processing Toolkit (Sina Ahmadi). Tokenisation, lemmatisation, transliteration, normalisation across scripts. The de-facto baseline.
github.com/sinaahmadi/KLPT →
Tesseract 5
OCR engine with Kurdish-trained models for both Hawar (Latin) and Sorani (Arabic) scripts.
spaCy + custom
Tokenisation / POS pipelines extended for Kurdish. No first-party model — community-trained, project-specific.
awesome-kurdish
Community-maintained index of Kurdish NLP work — datasets, papers, tools. Search GitHub for awesome-kurdish.
Comparable platforms (prior art)
Common Voice (Mozilla)
The reference for crowdsourced multilingual voice collection. Mature contribution UX, validation flow, donor experience to study.
Dia-Lingle
ETH Zürich's gamified Swiss German dialect collection. Inspiration for a playful, low-friction contribution flow.
CorCenCC
National Corpus of Contemporary Welsh — mobile recording, register tagging, public-facing dashboards. Closest analogue in scope and intent.
standwithkurds.org
Action-focused single-page UX — pattern for clarity over feature density on the public-facing side.
6 · System architecture
Five logical stages, each independently scalable. Read left-to-right.
- Contribute UI
- Voice capture
- Offline queue
- Dashboards
- Moderation console
- Rate limit
- WAF
- TLS
- JWT issuance
- OAuth
- Contribution CRUD
- Voting
- Moderation queue
- OCR jobs
- LLM jobs
- Stats refresh
- Pattern extraction
- Variant clustering
- Tokenise / lemmatise
- POS tagging
Stages are deployed independently. The intelligence layer is async — clients never block on LLM calls.
7 · Repositories & code organisation
The target setup is polyrepo under the KorpusaKurdi GitHub org. A monorepo was rejected because mobile and ML pipelines have very different CI shapes.
korpusa-kurdi exists right now — it serves as project homepage, development sandbox, and this engineering doc (deployed to korpusa-kurdi.pages.dev). The other repos in the table below are the target shape and will be spun up as each workstream actually starts.
| Repo | Status | Purpose | Stack | CI target |
|---|---|---|---|---|
korpusa-kurdi | live | Homepage + this doc + ADRs (today also dev sandbox) | HTML / static | Cloudflare Pages |
kk-mobile | planned | Cross-platform contribution app | React Native + TS | EAS / Fastlane |
kk-web | planned | Dashboards & moderation console | Next.js + TS | Vercel |
kk-api | planned | Core REST + GraphQL API | Python · FastAPI | Docker / k8s |
kk-workers | planned | OCR, LLM, ETL jobs | Python · Celery | Docker / k8s |
kk-nlp | planned | Tokenisation, lemmatisation, POS | Python · KLPT | PyPI publish |
kk-corpus | planned | Public dataset releases (CC-BY) | Parquet, JSON-L | HF datasets |
kk-infra | planned | Terraform, k8s manifests, Helm | Terraform · Helm | Atlantis |
The originally-planned kk-docs role is currently filled by korpusa-kurdi; a dedicated docs repo may split off later if scope grows.
Reference code layout (kk-api)
```
kk-api/
├── app/
│   ├── api/        # FastAPI routers (v1, v2)
│   ├── domain/     # business logic, no I/O
│   ├── infra/      # db, queues, object storage
│   ├── workers/    # celery tasks
│   └── schemas/    # pydantic models
├── migrations/     # alembic
├── tests/          # pytest, >80% coverage gate
├── pyproject.toml
└── Dockerfile
```
Branching model: main; short-lived feature branches; squash-merge with semantic commit messages (feat:, fix:, chore:, docs:).
8 · Frontend (mobile)
Mobile-first because contributors are overwhelmingly on phones, often in the diaspora or in regions with patchy connectivity.
Stack decision: React Native (TypeScript)
- Cross-platform, large hiring pool, web team can review.
- i18next for UI strings; ICU pluralisation.
- react-native-mmkv for offline contribution queue.
- expo-av for voice capture; Opus codec for compact uploads.
Required UX rules
- RTL toggle when the user picks a Sorani-script keyboard. Test both directions.
- Never auto-correct contributions — that violates the descriptive principle.
- Always show the user's region tag at the top of the contribute screen so they can correct it.
- Voice capture is opt-in per session; explicit consent banner before mic access.
Screen inventory (MVP)
- Onboarding — country/city/region picker, dialect self-ID (or "guess for me").
- Contribute — quick type, correct-a-source, record snippet.
- Profile / Stats — contributions, badges, level, regional impact.
- Map — visual contributions per region (uses PostGIS aggregates).
9 · Backend & API
FastAPI for the synchronous edges, Celery (Redis broker) for any task that touches an LLM, OCR, or batch stats.
Sample endpoints
```
# Submit a free-form contribution
POST /v1/contributions
{
  "type": "text",
  "content": "Ez diçim bazarê",
  "context_concept": "go-to-market.1sg.present",
  "region": { "country": "TR", "city": "Diyarbakir", "locality": "Sur" },
  "dialect_self_id": "kurmanji",
  "source": { "type": "freetype" }
}

# Vote on a candidate variant
POST /v1/contributions/{id}/votes
{ "value": 1 }   # +1 / -1

# Read the live "most popular" form for a concept + region
GET /v1/concepts/{concept_id}/popular?country=TR&city=Diyarbakir
```
Moderation queue
Every contribution lands in a pending state. A row promotes to accepted when:
- ≥ 3 community up-votes and agent-side spam check passes, or
- An expert moderator (role: linguist) confirms it.
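The promotion rule above is a plain boolean: either the community path (votes plus spam check) or the expert path fires. A sketch of that gate (illustrative function, not the actual moderation service code):

```python
def should_promote(net_upvotes: int,
                   spam_check_passed: bool,
                   confirmed_by_linguist: bool) -> bool:
    """Promote a pending contribution to accepted when either
    the community path or the expert path is satisfied."""
    community_path = net_upvotes >= 3 and spam_check_passed
    return community_path or confirmed_by_linguist
```

Note that a linguist confirmation bypasses the vote threshold entirely, but three up-votes without a passing spam check promote nothing — the two conditions of the community path are conjunctive.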
10 · Database design
PostgreSQL 16 + PostGIS + pgvector. The schema centres on three things: what was contributed, where it was contributed from, and how confident we are.
Core tables (DDL excerpt)
```sql
CREATE TABLE users (
  id               uuid PRIMARY KEY,
  display          text,
  origin_country   char(2),
  origin_city      text,
  origin_locality  text,
  geo              geography(Point, 4326),
  dialect_self_id  text,   -- kurmanji|sorani|southern|zaza-gorani|other
  role             text DEFAULT 'contributor',
  created_at       timestamptz DEFAULT now()
);

CREATE TABLE concepts (
  id        uuid PRIMARY KEY,
  key       text UNIQUE,   -- e.g. go-to-market.1sg.present
  gloss_en  text,
  pos       text
);

CREATE TABLE contributions (
  id               uuid PRIMARY KEY,
  user_id          uuid REFERENCES users(id),
  concept_id       uuid REFERENCES concepts(id),
  type             text CHECK (type IN ('text','audio','correction')),
  content          text,
  audio_uri        text,
  source_type      text,   -- freetype|rudaw|book|other
  region_country   char(2),
  region_city      text,
  region_locality  text,
  geo              geography(Point, 4326),
  dialect_self_id  text,
  status           text DEFAULT 'pending',
  embedding        vector(384),   -- pgvector for semantic clustering
  created_at       timestamptz DEFAULT now()
);

CREATE INDEX ON contributions USING GIST (geo);
CREATE INDEX ON contributions USING ivfflat (embedding vector_cosine_ops);

CREATE TABLE votes (
  contribution_id  uuid REFERENCES contributions(id),
  user_id          uuid REFERENCES users(id),
  value            smallint CHECK (value IN (-1, 1)),
  PRIMARY KEY (contribution_id, user_id)
);

CREATE MATERIALIZED VIEW popular_per_region AS
SELECT concept_id, region_country, region_city, content,
       COUNT(*) AS support,
       RANK() OVER (PARTITION BY concept_id, region_country, region_city
                    ORDER BY COUNT(*) DESC) AS rk
FROM contributions
WHERE status = 'accepted'
GROUP BY 1, 2, 3, 4;
```
Why geography + vector together? The geography column powers regional dashboards and "near me" tasks; the embedding powers semantic deduplication ("Ez bazarê diçim" vs "Ez diçim bazarê" cluster as variants of one concept).
11 · AI / LLM pipeline
AI is not the product. It is a processing layer that turns raw, messy crowdsourced data into structured, queryable patterns.
Stages
- Pre-processing — KLPT (Kurdish Language Processing Toolkit) for tokenisation, stemming, lemmatisation. Critical for Sorani clitics and Kurmanji ergative alignment.
- Embedding — multilingual sentence transformer; store vectors in pgvector.
- Variant clustering — semantic dedupe; same concept across surface variants.
- Pattern extraction (LLM) — given a cluster, the LLM names the morphological/lexical variation pattern.
- Active learning — agent identifies low-coverage (concept × region) cells and surfaces tasks to fill them.
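The active-learning step amounts to counting contributions per (concept × region) cell and surfacing the emptiest cells first. A stdlib-only sketch (function name and the `min_count` threshold are illustrative assumptions):

```python
from collections import Counter
from itertools import product


def coverage_gaps(contributions, concepts, regions, min_count=5):
    """Return ((concept, region), count) cells with fewer than
    min_count contributions, worst-covered first — these become
    the tasks the agent prompts users to fill."""
    counts = Counter((c, r) for c, r, *_ in contributions)
    gaps = [(cell, counts.get(cell, 0))
            for cell in product(concepts, regions)
            if counts.get(cell, 0) < min_count]
    return sorted(gaps, key=lambda item: item[1])
```

Running this over the full corpus each night would be enough to keep the agent's task queue biased toward under-represented regions, which is the stated goal of the hybrid design in §3.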
Sample LLM prompt template (variant naming)
```
# system
You are a Kurdish-language dialectology assistant.
You do NOT propose a "correct" form. You name the linguistic
pattern that distinguishes the given variants.

# user
Concept: go-to-market.1sg.present
Variants observed:
- "ez diçim bazarê"    (n=312, region=TR/Diyarbakir)
- "ez bazarê diçim"    (n=104, region=TR/Mardin)
- "min diçim bo bazař" (n=88,  region=IQ/Erbil)

Return JSON:
{
  "pattern": "...",           // e.g. SOV vs SVO order
  "axes": ["word_order", "case_marking"],
  "confidence": 0.0-1.0,
  "minority_preserved": true
}
```
LLMs we use (and why we cross-check)
We treat any single model as a biased lens. Pattern-extraction prompts are run against a small ensemble (e.g. Claude, Grok, an OSS model like Llama or Qwen). Disagreement is logged and surfaced to a linguist for review.
Existing Kurdish NLP we leverage
- KLPT — tokenisation, lemmatisation, transliteration.
- AgaCKNER — ~64,563 annotated tokens, Sorani NER.
- KurdishMT, Hugging Face Kurdish corpora — seed data.
- Mozilla Common Voice (Kurdish) — bootstrap voice data.
- awesome-kurdish GitHub list — staying-current index.
12 · OCR & source ingestion
A meaningful portion of "good" Kurdish text is locked in scanned books, Rudaw archives, government documents. We unlock it via a "Correct-a-Source" module.
Pipeline
- Operator uploads a PDF / image batch into kk-workers.
- Worker runs Tesseract 5 (fine-tuned for Kurdish scripts; we maintain our own model artefacts).
- The MRWL (Max Rightmost White Line) segmentation algorithm handles cursive Arabic-script Kurdish, with reported CER ≈ 0.755% on print, much higher on handwriting.
- Output is staged as candidate contributions with source_type=book.
- App users see a side-by-side: scan + AI transcription, with three correction granularities — word, sentence, "quick confirm".
Provenance and licensing are tracked via contributions.source_type + a separate source_licenses table.
13 · Quality & metrics
We measure data and model quality with standard error-rate metrics so the corpus can be benchmarked alongside other low-resource language datasets.
Character / Word Error Rate
Given S substitutions, D deletions, I insertions and N reference characters (or words):
ER = (S + D + I) / N
Reported separately as CER (character) and WER (word). Track per-script (Latin Hawar vs Perso-Arabic Sorani) — they don't behave the same.
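The formula above can be computed with a standard Levenshtein dynamic program; the minimal-edit counts of S, D and I fall out of the distance. A stdlib sketch that works for both CER (pass strings) and WER (pass token lists):

```python
def error_rate(reference, hypothesis) -> float:
    """(S + D + I) / N over the minimal edit sequence.
    Pass strings for CER, token lists (e.g. s.split()) for WER."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Because the same function handles both granularities, per-script CER and WER dashboards can share one implementation — only the tokenisation differs.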
Acceptance thresholds
| Stage | Metric | Threshold |
|---|---|---|
| OCR (print) | CER | < 2% |
| OCR (handwritten) | CER | < 8%; above that, flagged for human review |
| Crowd contributions | community votes | ≥ 3 net upvotes |
| LLM pattern extraction | cross-model agreement | ≥ 2 / 3 ensemble agreement |
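The ensemble-agreement threshold in the last row is a straight majority check over the models' pattern labels. A sketch of how that gate might look (illustrative helper, not the actual pipeline code):

```python
from collections import Counter


def ensemble_agrees(model_outputs: list, threshold: int = 2) -> bool:
    """True when at least `threshold` models in the ensemble returned
    the same pattern label; otherwise the cluster is routed to a
    linguist for manual review."""
    if not model_outputs:
        return False
    _, top_count = Counter(model_outputs).most_common(1)[0]
    return top_count >= threshold
```

With the default 2-of-3 threshold, `["SOV", "SOV", "SVO"]` passes while a three-way split does not — exactly the disagreement case §11 says gets logged for a linguist.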
Gamification ↔ quality
Levels gate sensitive tasks (e.g. only Validators+ vote on others' contributions). This is both engagement and quality control.
| Level | Tasks unlocked | Reward |
|---|---|---|
| Newcomer | Profile + region setup | XP |
| Contributor | Free-type, correct-a-word | Badges, micro-tokens |
| Validator | Vote on others' data | Higher token multiplier |
| Expert (Linguist) | Manage OCR datasets, localise UI | Staking rights, governance votes |
14 · DevOps & infrastructure
Environments
- local — docker-compose, mock LLM (Ollama), seeded Postgres.
- dev — auto-deployed from main on every merge.
- staging — promoted by tag v*-rc.*; full PII-safe seed data.
- prod — promoted by tag v*; behind change-management.
CI/CD
- GitHub Actions per repo. Reusable workflows in kk-infra/.github.
- Required checks before merge: lint, unit, integration, security scan (Trivy), license scan.
- Coverage gate: 80% minimum per repo, 90% for kk-api/domain.
- Container images pushed to GHCR; signed with cosign.
Observability
- OpenTelemetry traces → Tempo / Grafana.
- Structured JSON logs → Loki.
- Metrics → Prometheus; dashboards in Grafana (per-region contribution rate is a first-class SLI).
- Sentry for client + server errors.
Secrets
Never commit. Local: .env via direnv. Cloud: 1Password Connect → external-secrets-operator → k8s. Rotation policy: 90 days; LLM API keys: 30 days.
15 · GitHub workflow
Repo conventions
- Default branch: main. Protected. PRs only.
- One reviewer required (two for kk-api and kk-infra).
- Conventional Commits — release notes are auto-generated.
- CODEOWNERS per directory. Linguistic-sensitive code (kk-nlp) requires sign-off from a linguist reviewer.
Issue templates
- bug.yml — repro, expected, actual, region/dialect impacted.
- feature.yml — problem, proposal, descriptive-principle check.
- data.yml — corpus / source request with licensing detail.
- research.yml — references a paper / ADR / RFC.
Label taxonomy
- Type: type/bug, type/feat, type/chore, type/docs, type/research.
- Area: area/mobile, area/api, area/nlp, area/ocr, area/infra.
- Priority: p0 … p3.
- Linguistic: dialect/kurmanji, dialect/sorani, script/latin, script/arabic.
RFC & ADR process
Non-trivial design changes go to kk-docs/rfc first. Architecture decisions captured as ADRs (one decision, one file, immutable once accepted).
16 · Tickets & sprints
Project tracking lives in GitHub Projects (organisation-level). Two boards:
- Engineering — sprint board, two-week cadence, columns: Backlog → Ready → In progress → Review → Done.
- Roadmap — quarterly view, milestones grouped by phase (see §18).
Definition of Ready
- Acceptance criteria written.
- Linked to a milestone.
- Estimated (S / M / L).
- Has area + type labels.
Definition of Done
- Tests green; coverage threshold respected.
- Docs updated where behaviour changed (this file or a sibling).
- Telemetry / metric added if a new flow.
- Linguistic sanity-check signed off if touching kk-nlp or contribution flow.
- Migrations applied to dev; rollback documented.
17 · Standards & compliance
Code style
- Python: ruff + black; type-checked with mypy --strict in domain/.
- TS: ESLint + Prettier; strict: true in tsconfig.
- SQL: migrations only via alembic; no autogenerate straight to prod.
Data licensing
- Corpus releases: CC-BY 4.0.
- Code: Apache 2.0 unless a repo says otherwise.
- Models we fine-tune: same licence as the base model unless we own all the training data.
Privacy & ethics
- GDPR-aligned consent flow on first launch and before any voice capture.
- Right to deletion: hard-delete a user → soft-anonymise their contributions (keep the data for the corpus, drop the link).
- No political bias or content moderation by the team — descriptive only. Hate-speech / spam handled by community + automated filters, never by editorial choice on linguistic content.
- Underrepresented regions are actively targeted by outreach, never algorithmically penalised.
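The right-to-deletion rule above (hard-delete the user, soft-anonymise their contributions) is a simple transformation over contribution records. A sketch of the idea (illustrative function over plain dicts; the real implementation would be a SQL UPDATE inside the deletion transaction):

```python
def anonymise_contributions(contributions, deleted_user_id):
    """Right-to-deletion: keep the linguistic data for the corpus,
    drop the link to the deleted user by nulling user_id."""
    return [
        {**c, "user_id": None} if c["user_id"] == deleted_user_id else c
        for c in contributions
    ]
```

The corpus keeps its counts and regional tags intact; only the join back to the deleted account disappears, which is what makes deletion compatible with reproducible published statistics.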
18 · Roadmap & milestones
| Phase | Engineering deliverable | Exit criteria |
|---|---|---|
| P0 · Foundations | Repos, CI, infra terraformed, schema v1, auth | Hello-world contribution lands in DB end-to-end |
| P1 · MVP (text only) | Mobile app: contribute / correct, basic stats | ≥ 100 active testers, 10k contributions across 1 country |
| P2 · Seed data | Import existing corpora; manual cleanup | Bootstrapped corpus ≥ 1M tokens, geo-tagged |
| P3 · Voice | Audio capture, opt-in, ASR ingestion | ≥ 100 hours regionally-tagged audio |
| P4 · LLM analysis | Pattern extraction, dashboards, agent prompts | Live "popular per region" view, < 1h latency |
| P5 · Open API | Public read API, dataset releases | First external integration (TTS / translation) |
19 · Onboarding checklist
Tick as you go. The whole list should be done by end of week 2.
Day 1 — accounts & orientation
Week 1 — first commits
Month 1 — own a slice
20 · Appendix & references
Research sources behind this doc
This document was distilled from a multi-LLM research pass — Claude, Grok, Perplexity and Gemini — cross-checked to reduce single-model bias. Where they disagreed we kept the most cautious option and flagged the disagreement here for future review.
External datasets, tools and comparable platforms have moved to §5 (Resources & prior art) so they are easier to find before diving into the codebase.
Glossary
- CER / WER — character / word error rate.
- KLPT — Kurdish Language Processing Toolkit.
- NER — named entity recognition.
- POS — part-of-speech tagging.
- MRWL — Max Rightmost White Line segmentation (Kurdish OCR).
- Algorithmic democracy — our four-stage pipeline (collect → extract → select → surface) for finding popular forms without imposing a standard.