Engineering · Onboarding · v0.1

Engineering Documentation

The build manual: scientific approach, system architecture, code organisation, AI/LLM pipeline, DevOps and the Day 1 / Week 1 / Month 1 path for new contributors. This document complements — and does not repeat — the project pitch.

Audience: new dev team members · Status: living doc · Last review: 2026-05 · Owner: core engineering

1 · Read this first

Welcome. KorpusaKurdî is a Kurdish language variation tracking and documentation platform — not a standardisation project. The technical north star is to collect real usage from speakers, attach rigorous geographic metadata to every contribution, and let statistical and LLM-based analysis surface the most popular forms while preserving minority variants.

This document is a working manual. Treat it as living: open a PR if you spot anything stale.

Engineering principle: Descriptive, not prescriptive. The platform never invents grammar — it observes, counts and surfaces. Any feature that risks "deciding for users" needs a design review before merging.

How to read this doc

  • Sections 2–4 are required reading before touching any code.
  • Sections 5–12 are per-discipline: jump to your area, then skim the others.
  • Sections 13–17 are process: quality gates, DevOps, GitHub workflow, tickets, standards.
  • Sections 18–20 are the roadmap, the onboarding checklist and the appendix; the resource index (papers, datasets, prior work) lives in §5.

2 · Linguistic context engineers must know

You don't need to be a linguist, but you need to know enough not to design naïve schemas. Three facts will shape almost every technical decision.

Dialect taxonomy

Dialect | Primary regions | Speakers | Structure | Script
Kurmanji (North) | Turkey, Syria, N. Iraq (Badinan), Armenia | 15–20M | Synthetic / inflected; gender, case, ergativity | Latin (Hawar)
Sorani (Central) | Iraqi Kurdistan, W. Iran (Rojhelat) | 6–12M | Analytic; no gender/case; relations carried by word order | Perso-Arabic
Southern | Kermanshah, Ilam, Khanaqin | 3–5M | Transitional between Sorani and Laki / Pehlewani | Perso-Arabic
Zaza-Gorani | Tunceli, Bingöl, Hawraman | 2–3M | Distinct branch; high complexity | Varied

Source: cross-checked against multi-LLM research (Gemini deep-research output) and Wikipedia / Translators without Borders factsheet.

Scripts & the bi-directional problem

Feature | Hawar (Latin) | Sorani (Arabic) | Engineering impact
Direction | LTR | RTL | UI must support bi-directional layout, mirror icons
Vowels | 8 distinct (A, E, Ê, I, Î, O, U, Û) | Markers (و, ێ, ە) | Diacritic-aware tokenisation; never strip marks
Consonants | Velar / alveolar stops | Pharyngeals, uvulars (/ʕ/, /ħ/) | Phonological mapping needed for TTS / ASR
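
A concrete reason the "never strip marks" rule exists: generic accent-folding, common in off-the-shelf pipelines, silently merges Hawar vowel pairs that carry meaning. A minimal Python demonstration, standard library only:

import unicodedata

def strip_marks(text: str) -> str:
    """Generic 'accent folding' as seen in off-the-shelf pipelines. Never do this here."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Hawar treats E/Ê, I/Î, U/Û as distinct vowels, so folding destroys data:
print(strip_marks("Ez diçim bazarê"))  # "Ez dicim bazare": ç and ê are gone
# Safe baseline: canonical composition only, with all marks preserved.
print(unicodedata.normalize("NFC", "Ez diçim bazarê"))  # unchanged, marks intact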

Three-tier geographic hierarchy

Every contribution must be tagged at three levels — this is non-negotiable and shapes the database schema.

  • Country level — TR / IQ / IR / SY (loanword influence: Turkish, Arabic, Persian).
  • City level — major hubs (Erbil, Duhok, Diyarbakir, Mahabad…) act as dialect anchors.
  • Region / village level — preserves hyper-local accents and lexical variants that broader classifications drop.

Anti-pattern: a single region string field. Always model country / city / locality as separate normalised columns + PostGIS point.
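
A sketch of that rule at the API boundary: the three tiers as an explicit pydantic model (pydantic is the planned kk-api schema layer, §7; field names mirror the POST /v1/contributions payload in §9). Illustrative, not final; the PostGIS point is derived server-side from these fields (see the geo column in §10).

from pydantic import BaseModel, Field

class Region(BaseModel):
    """Three-tier geographic tag: one field per tier, never one free-form string."""
    country: str = Field(min_length=2, max_length=2)  # ISO 3166-1 alpha-2, e.g. "TR"
    city: str                    # dialect anchor, e.g. "Diyarbakir"
    locality: str | None = None  # hyper-local tier; optional when genuinely unknown

class ContributionIn(BaseModel):
    content: str
    region: Region               # mirrors the POST /v1/contributions payload (section 9)
    dialect_self_id: str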

3 · Scientific approach: algorithmic democracy

"Algorithmic democracy" is the working name for the four-stage pipeline that turns raw contributions into usable, regionally-tagged variant statistics — without imposing a top-down standard.

1. Collect & label

Every datum is tagged with country/city/locality, dialect self-ID and user-profile metadata.

2. Pattern extraction

LLMs + classical NLP (clustering, frequency mining) find variants for any concept.

3. Democratic selection

For each context, compute the most popular form per region; never delete minorities (sketched in code below).

4. Living reference

Surface results via dashboards + open API. Re-runs as new data arrives.
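
Stage 3 in miniature (illustrative, standard library only): selection is a ranking over the full per-region distribution, so minority variants survive in the output by construction.

from collections import Counter

def regional_distribution(contributions: list[dict]) -> dict[tuple, list[tuple[str, int]]]:
    """Rank forms per (concept, country, city) bucket while keeping every variant."""
    buckets: dict[tuple, Counter] = {}
    for c in contributions:
        key = (c["concept"], c["country"], c["city"])
        buckets.setdefault(key, Counter())[c["content"]] += 1
    # most_common() returns the full ranking: index 0 is the popular form,
    # the tail is the minority variants that are deliberately preserved.
    return {key: counter.most_common() for key, counter in buckets.items()}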

Corpus vs. agent: a hybrid

We deliberately build both:

  • Static corpus — the ground-truth dataset, exported under CC-BY 4.0, suitable for training third-party models.
  • Dynamic agent — an interactive layer that does active learning: it identifies under-represented regions and prompts users to fill those gaps, plus real-time validation hints.

Sample tasks the agent surfaces

  • "Translate word X in your dialect."
  • "Correct this sentence from a Rudaw article."
  • "Record this 5-second daily snippet."
  • "Confirm whether you'd say ez diçim bazarê or ez bazarê diçim."

4 · Data analysis & outputs

Data analysis is the bridge between contributions and insight. Every analysis we run is descriptive: we surface what speakers actually do, never decide what they should do. The result is a corpus that reveals variation rather than flattening it.

Pipeline (raw → output)

Six stages, each one auditable. Every output can be traced back to the original contribution and its speaker (anonymised, with consent), so any claim the platform makes is reproducible.

01 Contribute

Speakers submit text, audio or labels with regional, dialect, age and register metadata.

02 Normalise

Script unification (Hawar / Sorani / Cyrillic), token cleaning, metadata validation.

03 Aggregate

Bucket contributions by region, dialect, cohort, register and time window.

04 Detect

Statistical: which forms exist, at what frequency per region. Clustering: which regions group together.

05 LLM-assist

Semantic clustering, novel-form detection, lemma proposals — always reviewed by humans before publication.

06 Publish

Outputs are versioned, signed and shipped: datasets, dashboards, APIs, papers.

What the analysis measures

Metric | What it shows
form_distribution | Frequency of a form across regions and dialects.
dialect_cohesion | How much speakers within a region agree on a single form.
variant_rarity | How rare a minority form is — drives the preservation index.
geographic_spread | How wide the geographic footprint of a form is.
temporal_drift | How a form's frequency changes over time (years, generations).
cross_script_map | The same word across Hawar, Sorani and Cyrillic spellings.

Descriptive guarantee. The platform never declares a "correct" form. Every output carries the underlying distribution — so a reader sees both the popular form and the minority alternatives. Standardisation, if it ever happens, belongs to communities and academies, not to the corpus.
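
Two of the table's metrics in code, to pin down what "agreement" means. These definitions are plausible formalisations, not settled ones: here dialect_cohesion is taken as the modal form's share of one region bucket.

from collections import Counter

def form_distribution(forms: list[str]) -> dict[str, float]:
    """Relative frequency of each observed form within one region bucket."""
    counts = Counter(forms)
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}

def dialect_cohesion(forms: list[str]) -> float:
    """Assumed definition: share held by the modal form.
    1.0 means total agreement; values near 1/k mean k evenly split variants."""
    dist = form_distribution(forms)
    return max(dist.values(), default=0.0)

# Example: if 312 of 416 Diyarbakir contributions use one form, cohesion is 0.75.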

What the corpus enables

Each output below is something the analysis stage can produce or feed. Some are live dashboards, some are versioned datasets, some are training corpora consumed by other projects.

Public datasets

Versioned Parquet / JSON-L releases with full metadata, CC-BY licensed, citable.

Variation map

Interactive geographic map: pick a word, see how it is said across regions.

Open API

Query: "most common form for X in region Y" — JSON in, JSON out.

Living dictionary

A reference where every entry carries regional usage frequencies — not a single "standard" form.

Academic papers

Replicable methodology + dataset versioning lets researchers cite the exact slice they used.

Training corpora

Cleaned, region-tagged data for TTS, STT and Kurdish-aware LLMs.

Educational materials

Variation-aware schoolbooks: students see the range and where each form is used, not just one.

Preservation index

A live measure of how endangered a rare form is — a signal for archivists and educators.

Journalism dashboards

Reporters can verify "is this word really common?" and find where to interview real speakers.

5 · Resources & prior art

Before building anything, audit what already exists. The Kurdish NLP space has scattered but real resources — datasets, toolkits, voice corpora, comparable platforms. Reuse where possible; complement rather than duplicate.

Direct links below are deliberately conservative — only canonical URLs we are confident about. For everything else, names are leads — search HuggingFace Datasets and GitHub by name to find the current home.

Datasets (text)

kurdish-ai / kurdish-corpus

Mixed-source Kurdish text corpus on HuggingFace Datasets — fast starting point for tokenisation, embedding and language-model work.

huggingface.co →

OSCAR

Massively multilingual web corpus with Kurdish slices (Kurmanji + Sorani). Useful for LM pre-training. Search HuggingFace for oscar-corpus/oscar.

CC-100

Filtered Common Crawl per-language corpora with Kurdish splits. Search HuggingFace Datasets for cc100.

Wikipedia (Kurdish)

Periodic dumps of Kurmanji + Sorani Wikipedia. Clean encyclopedic register — small, but high quality. Search the Wikimedia dumps site for kuwiki / ckbwiki.

Tatoeba (Kurdish)

Crowdsourced sentence pairs with translations — useful for parallel data and sanity checks across scripts.

AgaCKNER

Annotated NER dataset for Sorani. Small in scale, but one of the few supervised Kurdish datasets. Search HuggingFace / GitHub by name.

Voice & speech

Mozilla Common Voice

Open multilingual voice corpus with growing Kurdish coverage. Aligned MP3 + transcript pairs, contributor pipeline we can study.

commonvoice.mozilla.org →

kurdishtts.com

External TTS API used for synthesis; possible STT integration. Slated for the planned audio pipeline.

RHVoice

Open-source TTS engine with experimental Kurdish voices. A solid base for offline / low-resource synthesis.

NLP libraries & tools

KLPT

Kurdish Language Processing Toolkit (Sina Ahmadi). Tokenisation, lemmatisation, transliteration, normalisation across scripts. The de-facto baseline.

github.com/sinaahmadi/KLPT →
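
Typical usage, transcribed from the KLPT README; treat the exact signatures as an assumption and verify them against the installed release.

# Based on the KLPT README; verify against the current release.
from klpt.preprocess import Preprocess
from klpt.tokenize import Tokenize

# Script-aware normalisation for Sorani text in Arabic script
preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin")
clean = preprocessor.normalize("لە ساڵی ٢٠١٨")

# Word tokenisation for Kurmanji in Hawar (Latin) script
tokenizer = Tokenize("Kurmanji", "Latin")
tokens = tokenizer.word_tokenize("ez diçim bazarê")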

Tesseract 5

OCR engine with Kurdish-trained models for both Hawar (Latin) and Sorani (Arabic) scripts.

spaCy + custom

Tokenisation / POS pipelines extended for Kurdish. No first-party model — community-trained, project-specific.

awesome-kurdish

Community-maintained index of Kurdish NLP work — datasets, papers, tools. Search GitHub for awesome-kurdish.

Comparable platforms (prior art)

Common Voice (Mozilla)

The reference for crowdsourced multilingual voice collection. Mature contribution UX, validation flow, donor experience to study.

Dia-Lingle

ETH Zürich's gamified Swiss German dialect collection. Inspiration for a playful, low-friction contribution flow.

CorCenCC

National Corpus of Contemporary Welsh — mobile recording, register tagging, public-facing dashboards. Closest analogue in scope and intent.

standwithkurds.org

Action-focused single-page UX — pattern for clarity over feature density on the public-facing side.

How we relate to these. KorpusaKurdî is not a competitor to existing Kurdish corpora or toolkits — it consumes and complements them. Where data overlaps, we deduplicate and credit. Where toolkits already work, we extend rather than rewrite. Our distinguishing focus is variation tagging at the contribution level, which most existing corpora do not capture.

6 · System architecture

Five logical stages, each independently scalable. Read left-to-right.

1 · Clients
  • Mobile · React Native / Flutter: contribute UI, voice capture, offline queue.
  • Web · Next.js: dashboards, moderation console.

2 · Edge
  • API gateway · Cloudflare / Nginx: rate limiting, WAF, TLS.
  • Auth · Auth0 / Firebase: JWT issuance, OAuth.

3 · Services
  • Core API · FastAPI (Python) or Node: contribution CRUD, voting, moderation queue.
  • Workers · Celery / BullMQ: OCR jobs, LLM jobs, stats refresh.

4 · Data
  • Primary · PostgreSQL + PostGIS.
  • Vector · Pinecone / Weaviate.
  • Object · S3-compatible storage (audio, scans).

5 · Intelligence
  • LLM layer · hosted (OpenAI, Anthropic, xAI) with OSS fallback: pattern extraction, variant clustering.
  • NLP toolkit · KLPT, spaCy, custom: tokenise / lemmatise, POS tagging.

Stages are deployed independently. The intelligence layer is async — clients never block on LLM calls.
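
What "clients never block on LLM calls" means in practice: the API enqueues a job and returns immediately, and only a worker ever talks to a model. A minimal Celery sketch; broker URL, queue and task names are illustrative.

from celery import Celery

app = Celery("kk", broker="redis://localhost:6379/0")

@app.task(max_retries=3, default_retry_delay=30)
def extract_pattern(cluster_id: str) -> None:
    """Run the section-11 pattern-extraction prompt for one variant cluster."""
    ...  # fetch the cluster, call the LLM ensemble, write results back to Postgres

# In an API handler: enqueue and answer 202 immediately, never await the model.
# extract_pattern.apply_async(args=[cluster_id], queue="llm")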

7 · Repositories & code organisation

The target setup is polyrepo under the KorpusaKurdi GitHub org. A monorepo was rejected because mobile and ML pipelines have very different CI shapes.

Today vs. target. Only korpusa-kurdi exists right now — it serves as project homepage, development sandbox, and this engineering doc (deployed to korpusa-kurdi.pages.dev). The other repos in the table below are the target shape and will be spun up as each workstream actually starts.

Repo | Status | Purpose | Stack | CI target
korpusa-kurdi | live | Homepage + this doc + ADRs (today also dev sandbox) | HTML / static | Cloudflare Pages
kk-mobile | planned | Cross-platform contribution app | React Native + TS | EAS / Fastlane
kk-web | planned | Dashboards & moderation console | Next.js + TS | Vercel
kk-api | planned | Core REST + GraphQL API | Python · FastAPI | Docker / k8s
kk-workers | planned | OCR, LLM, ETL jobs | Python · Celery | Docker / k8s
kk-nlp | planned | Tokenisation, lemmatisation, POS | Python · KLPT | PyPI publish
kk-corpus | planned | Public dataset releases (CC-BY) | Parquet, JSON-L | HF datasets
kk-infra | planned | Terraform, k8s manifests, Helm | Terraform · Helm | Atlantis

The originally-planned kk-docs role is currently filled by korpusa-kurdi; a dedicated docs repo may split off later if scope grows.

Reference code layout (kk-api)

kk-api/
├── app/
│   ├── api/           # FastAPI routers (v1, v2)
│   ├── domain/        # business logic, no I/O
│   ├── infra/         # db, queues, object storage
│   ├── workers/       # celery tasks
│   └── schemas/       # pydantic models
├── migrations/       # alembic
├── tests/            # pytest, >80% coverage gate
├── pyproject.toml
└── Dockerfile

Branching: trunk-based on main; short-lived feature branches; squash-merge with semantic commit messages (feat:, fix:, chore:, docs:).

8 · Frontend (mobile)

Mobile-first because contributors are overwhelmingly on phones, often in the diaspora or in regions with patchy connectivity.

Stack decision: React Native (TypeScript)

  • Cross-platform, large hiring pool, web team can review.
  • i18next for UI strings; ICU pluralisation.
  • react-native-mmkv for offline contribution queue.
  • expo-av for voice capture; opus codec for compact uploads.

Required UX rules

  • RTL toggle when the user picks a Sorani-script keyboard. Test both directions.
  • Never auto-correct contributions — that violates the descriptive principle.
  • Always show the user's region tag at the top of the contribute screen so they can correct it.
  • Voice capture is opt-in per session; explicit consent banner before mic access.

Screen inventory (MVP)

  • Onboarding — country/city/region picker, dialect self-ID (or "guess for me").
  • Contribute — quick type, correct-a-source, record snippet.
  • Profile / Stats — contributions, badges, level, regional impact.
  • Map — visual contributions per region (uses PostGIS aggregates).

9 · Backend & API

FastAPI for the synchronous edges, Celery (Redis broker) for any task that touches an LLM, OCR, or batch stats.

Sample endpoints

# Submit a free-form contribution
POST /v1/contributions
{
  "type": "text",
  "content": "Ez diçim bazarê",
  "context_concept": "go-to-market.1sg.present",
  "region": { "country": "TR", "city": "Diyarbakir", "locality": "Sur" },
  "dialect_self_id": "kurmanji",
  "source": { "type": "freetype" }
}

# Vote on a candidate variant
POST /v1/contributions/{id}/votes
{ "value": 1 }   # +1 / -1

# Read the live "most popular" form for a concept + region
GET /v1/concepts/{concept_id}/popular?country=TR&city=Diyarbakir
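
The same contribution submitted from Python, e.g. in an integration test. Host and auth header are placeholders, and the response shape is an assumption.

import requests

payload = {
    "type": "text",
    "content": "Ez diçim bazarê",
    "context_concept": "go-to-market.1sg.present",
    "region": {"country": "TR", "city": "Diyarbakir", "locality": "Sur"},
    "dialect_self_id": "kurmanji",
    "source": {"type": "freetype"},
}
resp = requests.post(
    "https://api.example.org/v1/contributions",  # placeholder host
    json=payload,
    headers={"Authorization": "Bearer <jwt>"},
    timeout=10,
)
resp.raise_for_status()
contribution_id = resp.json()["id"]  # response shape assumed; lands as status=pending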

Moderation queue

Every contribution lands in a pending state. A row promotes to accepted when:

  • ≥ 3 community up-votes and agent-side spam check passes, or
  • An expert moderator (role: linguist) confirms it.
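
The promotion rule as a pure function, useful as a reference for tests (illustrative; the real check lives in the moderation worker):

def promote(net_upvotes: int, spam_check_passed: bool, linguist_confirmed: bool) -> str:
    """pending -> accepted, per the two paths above."""
    if linguist_confirmed:                       # expert path: a linguist's call suffices
        return "accepted"
    if net_upvotes >= 3 and spam_check_passed:   # community path
        return "accepted"
    return "pending"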

10 · Database design

PostgreSQL 16 + PostGIS + pgvector. The schema centres on three things: what was contributed, where it was contributed from, and how confident we are.

Core tables (DDL excerpt)

CREATE TABLE users (
  id          uuid PRIMARY KEY,
  display     text,
  origin_country  char(2),
  origin_city     text,
  origin_locality text,
  geo         geography(Point, 4326),
  dialect_self_id  text,         -- kurmanji|sorani|southern|zaza-gorani|other
  role        text DEFAULT 'contributor',
  created_at  timestamptz DEFAULT now()
);

CREATE TABLE concepts (
  id          uuid PRIMARY KEY,
  key         text UNIQUE,    -- e.g. go-to-market.1sg.present
  gloss_en    text,
  pos         text
);

CREATE TABLE contributions (
  id          uuid PRIMARY KEY,
  user_id     uuid REFERENCES users(id),
  concept_id  uuid REFERENCES concepts(id),
  type        text CHECK (type IN ('text','audio','correction')),
  content     text,
  audio_uri   text,
  source_type text,           -- freetype|rudaw|book|other
  region_country  char(2),
  region_city     text,
  region_locality text,
  geo         geography(Point, 4326),
  dialect_self_id text,
  status      text DEFAULT 'pending',
  embedding   vector(384),    -- pgvector for semantic clustering
  created_at  timestamptz DEFAULT now()
);

CREATE INDEX ON contributions USING GIST (geo);
CREATE INDEX ON contributions USING ivfflat (embedding vector_cosine_ops);

CREATE TABLE votes (
  contribution_id uuid REFERENCES contributions(id),
  user_id     uuid REFERENCES users(id),
  value       smallint CHECK (value IN (-1, 1)),
  PRIMARY KEY (contribution_id, user_id)
);

CREATE MATERIALIZED VIEW popular_per_region AS
SELECT concept_id, region_country, region_city,
       content,
       COUNT(*) AS support,
       RANK() OVER (PARTITION BY concept_id, region_country, region_city
                    ORDER BY COUNT(*) DESC) AS rk
FROM contributions
WHERE status = 'accepted'
GROUP BY 1,2,3,4;

Why geography + vector together? The geography column powers regional dashboards and "near me" tasks; the embedding powers semantic deduplication ("Ez bazarê diçim" vs "Ez diçim bazarê" cluster as variants of one concept).
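
Reading the view from application code; a sketch assuming psycopg 3 as the driver and an illustrative connection string.

import psycopg  # assumption: psycopg 3

SQL = """
SELECT content, support
FROM popular_per_region
WHERE concept_id = %(concept)s
  AND region_country = %(country)s
  AND region_city = %(city)s
ORDER BY rk
"""

concept_id = "00000000-0000-0000-0000-000000000000"  # placeholder uuid
with psycopg.connect("dbname=korpusa") as conn:      # connection string illustrative
    rows = conn.execute(
        SQL, {"concept": concept_id, "country": "TR", "city": "Diyarbakir"}
    ).fetchall()
# rows[0] is the popular form (rk = 1); the tail is the preserved minority variants.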

11 · AI / LLM pipeline

AI is not the product. It is a processing layer that turns raw, messy crowdsourced data into structured, queryable patterns.

Stages

  1. Pre-processing — KLPT (Kurdish Language Processing Toolkit) for tokenisation, stemming, lemmatisation. Critical for Sorani clitics and Kurmanji ergative alignment.
  2. Embedding — multilingual sentence transformer; store vectors in pgvector.
  3. Variant clustering — semantic dedupe; same concept across surface variants.
  4. Pattern extraction (LLM) — given a cluster, the LLM names the morphological/lexical variation pattern.
  5. Active learning — agent identifies low-coverage (concept × region) cells and surfaces tasks to fill them.
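
Stage 5 in miniature: find the under-filled (concept × region) cells that the agent should turn into user-facing tasks. Illustrative; the floor value is an assumption, not a tuned threshold.

from collections import Counter

def coverage_gaps(counts: Counter, concepts: set[str], regions: set[str],
                  floor: int = 25) -> list[tuple[str, str]]:
    """Return (concept, region) cells with fewer than `floor` contributions."""
    return [
        (concept, region)
        for concept in concepts
        for region in regions
        if counts[(concept, region)] < floor
    ]

# Each gap becomes a targeted task from section 3, e.g. asking Mahabad users to
# "Translate word X in your dialect".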

Sample LLM prompt template (variant naming)

# system
You are a Kurdish-language dialectology assistant. You do NOT propose
a "correct" form. You name the linguistic pattern that distinguishes
the given variants.

# user
Concept: go-to-market.1sg.present
Variants observed:
  - "ez diçim bazarê"   (n=312, region=TR/Diyarbakir)
  - "ez bazarê diçim"   (n=104, region=TR/Mardin)
  - "min diçim bo bazař" (n=88,  region=IQ/Erbil)

Return JSON:
{
  "pattern": "...",          // e.g. SOV vs SVO order
  "axes": ["word_order", "case_marking"],
  "confidence": 0.0-1.0,
  "minority_preserved": true
}

LLMs we use (and why we cross-check)

We treat any single model as a biased lens. Pattern-extraction prompts are run against a small ensemble (e.g. Claude, Grok, an OSS model like Llama or Qwen). Disagreement is logged and surfaced to a linguist for review.
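
A sketch of the cross-check loop: each caller wraps one provider's API and returns the JSON produced by the template above. The 2-of-3 threshold comes from §13; the logging call is a stand-in.

import json
from collections import Counter
from typing import Callable

def ensemble_pattern(prompt: str, callers: list[Callable[[str], str]]) -> dict:
    """Run the variant-naming prompt through each model and compare the answers."""
    answers = [json.loads(call(prompt))["pattern"] for call in callers]
    winner, support = Counter(answers).most_common(1)[0]
    accepted = support >= 2  # 2-of-3 ensemble agreement (section 13)
    if not accepted:
        print("disagreement, route to linguist:", answers)  # stand-in for real logging
    return {"pattern": winner, "agreement": support / len(callers), "accepted": accepted}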

Existing Kurdish NLP we leverage

  • KLPT — tokenisation, lemmatisation, transliteration.
  • AgaCKNER — ~64,563 annotated tokens, Sorani NER.
  • KurdishMT, Hugging Face Kurdish corpora — seed data.
  • Mozilla Common Voice (Kurdish) — bootstrap voice data.
  • awesome-kurdish GitHub list — staying-current index.

12 · OCR & source ingestion

A meaningful portion of "good" Kurdish text is locked in scanned books, Rudaw archives, government documents. We unlock it via a "Correct-a-Source" module.

Pipeline

  1. Operator uploads a PDF / image batch into kk-workers.
  2. Worker runs Tesseract 5 (fine-tuned for Kurdish scripts; we maintain our own model artefacts).
  3. The MRWL (Max Rightmost White Line) segmentation algorithm handles cursive Arabic-script Kurdish, with reported CER ≈ 0.755% on print, much higher on handwriting.
  4. Output is staged as candidate contributions with source_type=book.
  5. App users see a side-by-side: scan + AI transcription, with three correction granularities — word, sentence, "quick confirm".

Legal: only ingest sources we have rights to (public domain, CC, partner agreements with publishers like Rudaw). Track provenance per row in contributions.source_type + a separate source_licenses table.
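
Stage 2 as code: a minimal worker step assuming pytesseract as the Tesseract 5 binding. "ckb_kk" stands in for the project-maintained traineddata mentioned above and is not a stock language code.

from pathlib import Path
from PIL import Image
import pytesseract  # assumption: pytesseract as the Tesseract 5 binding

def ocr_page(path: Path, lang: str = "ckb_kk") -> str:
    """One scanned page to raw candidate text; 'ckb_kk' is a placeholder model name."""
    return pytesseract.image_to_string(Image.open(path), lang=lang)

text = ocr_page(Path("scan_001.png"))
# Downstream (stages 4-5): split into sentences, attach source_type="book",
# stage as pending contributions for side-by-side human correction in the app.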

13 · Quality & metrics

We measure data and model quality with standard error-rate metrics so the corpus can be benchmarked alongside other low-resource language datasets.

Character / Word Error Rate

Given S substitutions, D deletions, I insertions and N reference characters (or words):

ER = (S + D + I) / N

Reported separately as CER (character) and WER (word). Track per-script (Latin Hawar vs Perso-Arabic Sorani) — they don't behave the same.
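
The same formula as executable reference code: S, D and I fall out of the Levenshtein alignment, so one function serves both metrics. Pass characters for CER and whitespace tokens for WER.

def error_rate(ref: list[str], hyp: list[str]) -> float:
    """ER = (S + D + I) / N, computed via Levenshtein distance over the reference."""
    n, m = len(ref), len(hyp)
    dist = list(range(m + 1))                 # distances for the previous row
    for i in range(1, n + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, m + 1):
            cur = min(dist[j] + 1,            # deletion
                      dist[j - 1] + 1,        # insertion
                      prev + (ref[i - 1] != hyp[j - 1]))  # substitution or match
            prev, dist[j] = dist[j], cur
    return dist[m] / n if n else 0.0

cer = error_rate(list("bazarê"), list("bazare"))                        # 1/6 ≈ 0.167
wer = error_rate("ez diçim bazarê".split(), "ez dicim bazare".split())  # 2/3 ≈ 0.667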

Acceptance thresholds

Stage | Metric | Threshold
OCR (print) | CER | < 2%
OCR (handwritten) | CER | < 8%; above that, flagged for human review
Crowd contributions | community votes | ≥ 3 net upvotes
LLM pattern extraction | cross-model agreement | ≥ 2 / 3 ensemble agreement

Gamification ↔ quality

Levels gate sensitive tasks (e.g. only Validators+ vote on others' contributions). This is both engagement and quality control.

Level | Tasks unlocked | Reward
Newcomer | Profile + region setup | XP
Contributor | Free-type, correct-a-word | Badges, micro-tokens
Validator | Vote on others' data | Higher token multiplier
Expert (Linguist) | Manage OCR datasets, localise UI | Staking rights, governance votes

14 · DevOps & infrastructure

Environments

  • local — docker-compose, mock LLM (Ollama), seeded Postgres.
  • dev — auto-deployed from main on every merge.
  • staging — promoted by tag v*-rc.*; full PII-safe seed data.
  • prod — promoted by tag v*; behind change-management.

CI/CD

  • GitHub Actions per repo. Reusable workflows in kk-infra/.github.
  • Required checks before merge: lint, unit, integration, security scan (Trivy), license scan.
  • Coverage gate: 80% minimum per repo, 90% for kk-api/domain.
  • Container images pushed to GHCR; signed with cosign.

Observability

  • OpenTelemetry traces → Tempo / Grafana.
  • Structured JSON logs → Loki.
  • Metrics → Prometheus; dashboards in Grafana (per-region contribution rate is a first-class SLI).
  • Sentry for client + server errors.

Secrets

Never commit. Local: .env via direnv. Cloud: 1Password Connect → external-secrets-operator → k8s. Rotation policy: 90 days; LLM API keys: 30 days.

15 · GitHub workflow

Repo conventions

  • Default branch: main. Protected. PRs only.
  • One reviewer required (two for kk-api and kk-infra).
  • Conventional Commits — release notes are auto-generated.
  • CODEOWNERS per directory. Linguistic-sensitive code (kk-nlp) requires sign-off from a linguist reviewer.

Issue templates

  • bug.yml — repro, expected, actual, region/dialect impacted.
  • feature.yml — problem, proposal, descriptive-principle check.
  • data.yml — corpus / source request with licensing detail.
  • research.yml — references a paper / ADR / RFC.

Label taxonomy

  • Type: type/bug, type/feat, type/chore, type/docs, type/research.
  • Area: area/mobile, area/api, area/nlp, area/ocr, area/infra.
  • Priority: p0–p3.
  • Linguistic: dialect/kurmanji, dialect/sorani, script/latin, script/arabic.

RFC & ADR process

Non-trivial design changes go to an RFC first (currently housed in korpusa-kurdi until a dedicated docs repo splits off; see §7). Architecture decisions are captured as ADRs (one decision, one file, immutable once accepted).

16 · Tickets & sprints

Project tracking lives in GitHub Projects (organisation-level). Two boards:

  • Engineering — sprint board, two-week cadence, columns: Backlog → Ready → In progress → Review → Done.
  • Roadmap — quarterly view, milestones grouped by phase (see §18).

Definition of Ready

  • Acceptance criteria written.
  • Linked to a milestone.
  • Estimated (S / M / L).
  • Has area + type labels.

Definition of Done

  • Tests green; coverage threshold respected.
  • Docs updated where behaviour changed (this file or a sibling).
  • Telemetry / metric added if a new flow.
  • Linguistic sanity-check signed off if touching kk-nlp or contribution flow.
  • Migrations applied to dev; rollback documented.

17 · Standards & compliance

Code style

  • Python: ruff + black; type-checked with mypy --strict in domain/.
  • TS: ESLint + Prettier; strict: true in tsconfig.
  • SQL: migrations only via alembic; no autogenerate straight to prod.

Data licensing

  • Corpus releases: CC-BY 4.0.
  • Code: Apache 2.0 unless a repo says otherwise.
  • Models we fine-tune: same licence as the base model unless we own all the training data.

Privacy & ethics

  • GDPR-aligned consent flow on first launch and before any voice capture.
  • Right to deletion: hard-delete a user → soft-anonymise their contributions (keep the data for the corpus, drop the link).
  • No political bias or content moderation by the team — descriptive only. Hate-speech / spam handled by community + automated filters, never by editorial choice on linguistic content.
  • Underrepresented regions are actively targeted by outreach, never algorithmically penalised.

What we never do: "fix" a user's regional grammar. Every variant is data.

18 · Roadmap & milestones

Phase | Engineering deliverable | Exit criteria
P0 · Foundations | Repos, CI, infra terraformed, schema v1, auth | Hello-world contribution lands in DB end-to-end
P1 · MVP (text only) | Mobile app: contribute / correct, basic stats | ≥ 100 active testers, 10k contributions across 1 country
P2 · Seed data | Import existing corpora; manual cleanup | Bootstrapped corpus ≥ 1M tokens, geo-tagged
P3 · Voice | Audio capture, opt-in, ASR ingestion | ≥ 100 hours regionally-tagged audio
P4 · LLM analysis | Pattern extraction, dashboards, agent prompts | Live "popular per region" view, < 1h latency
P5 · Open API | Public read API, dataset releases | First external integration (TTS / translation)

19 · Onboarding checklist

Tick as you go. The Day 1 and Week 1 items should be done by the end of week 2; the Month 1 goals run through your first month.

Day 1 — accounts & orientation

Week 1 — first commits

Month 1 — own a slice

20 · Appendix & references

Research sources behind this doc

This document was distilled from a multi-LLM research pass — Claude, Grok, Perplexity and Gemini — cross-checked to reduce single-model bias. Where they disagreed we kept the most cautious option and flagged the disagreement here for future review.

External datasets, tools and comparable platforms have moved to §5 (Resources & prior art) so they are easier to find before diving into the codebase.

Glossary

  • CER / WER — character / word error rate.
  • KLPT — Kurdish Language Processing Toolkit.
  • NER — named entity recognition.
  • POS — part-of-speech tagging.
  • MRWL — Max Rightmost White Line segmentation (Kurdish OCR).
  • Algorithmic democracy — our four-stage pipeline (collect → extract → select → surface) for finding popular forms without imposing a standard.