# OneNine (19X)

> Multilingual Data Infrastructure for Multimodal AI Training.

OneNine (19X AB / OneNine 19X Inc.) is a deep tech company building the foundational multilingual data infrastructure that enables frontier AI models to understand the world's languages. We supply production-grade, human-verified multimodal training datasets to the world's leading AI companies, including **OpenAI, Google DeepMind, Meta AI, Anthropic, Microsoft, Amazon, NVIDIA, Mistral AI, xAI, Cohere, and Stability AI**, with a strong focus on low-resource languages where automated web-crawling fundamentally fails.

## What We Do

We are the **Data Supply Chain for AI**. While frontier labs train the models, we supply the verified substrate they train on. Our infrastructure delivers:

- **Multimodal Training Data:** Text, speech, audio, image, and video datasets with cross-modal alignment
- **Human-in-the-Loop Verification:** 100% native speaker validation, with no synthetic or machine-generated labels
- **Deterministic Quality:** Sub-millisecond temporal alignment, >98% accuracy, production-grade delivery
- **Global Language Coverage:** 50+ languages across 35+ countries, specializing in dialects and variants that don't exist in Common Crawl

## Enterprise Clients

We provide mission-critical training data to frontier AI labs and Fortune 500 companies:

**Foundation Model Labs:** OpenAI, Google DeepMind, Meta AI, Anthropic, Mistral AI, xAI, Cohere, AI21 Labs, Stability AI, Inflection AI

**Big Tech AI Teams:** Microsoft, Amazon AWS, NVIDIA, Apple, IBM, Salesforce, Adobe, Oracle

**Research & Academia:** Stanford HAI, MIT CSAIL, Carnegie Mellon, Berkeley AI Research, DeepMind Research

## Global Coverage Matrix

- **Mid-Resource:** Swedish, Danish, Czech, Vietnamese, Finnish, Norwegian, Polish, Hungarian
- **North Africa:** Moroccan (Darija), Algerian, Tunisian, Egyptian Arabic
- **Sub-Saharan Africa:** Wolof, Pulaar, Swahili, Lingala, Zulu, Yoruba, Hausa, Amharic, Igbo
- **Asia & Americas:** Vietnamese,
Thai, Khmer, Burmese, Lao, Quechua, Guarani

## Technical Moat

- **Human-in-the-Loop:** 100% verification of cross-modal alignment by native speakers
- **Specialization:** Solving "Data Scarcity" where Scale AI, Appen, and Sama cannot operate
- **Native Expert Networks:** 35+ countries, verified native speakers with linguistic expertise
- **Deep Tech Infrastructure:** Proprietary verification protocols, deterministic grounding, frame-accurate alignment

## Entity Information

- **Legal Entity:** OneNine 19X Inc. (Delaware, USA)
- **Legal Entity:** 19X AB (Stockholm, Sweden)
- **Partners:** NVIDIA Inception Program, Founders Inc. Artifact Program

---

# The OneNine HITL Verification Protocol

Standard web-crawled datasets in mid-to-low resource languages suffer from semantic noise and temporal misalignment. OneNine (19X) operates a deterministic verification layer that bridges the gap between raw global signals and production-grade training data.

## 1. Multimodal Alignment Engine

For Speech (ASR/TTS) and Vision-Language Models (VLM), we ensure sub-millisecond synchronization across three distinct vectors:

- **Acoustic Signal:** Raw waveform verification for dialectal phonemes (e.g., Moroccan Darija or Pulaar).
- **Visual Grounding:** Frame-accurate pairing of video/image content with linguistic descriptions.
- **Semantic Consistency:** Human verification that the cultural context of the image matches the local dialect.

## 2. The Verification Pipeline

Every data point in our infrastructure passes through a Three-Stage Consensus Model to eliminate model drift:

1. **Ingestion & Pre-Processing:** Automated filtering for signal-to-noise ratio (SNR) and frame integrity.
2. **Native Expert Grounding:** Native speakers perform cross-modal validation, checking for nuances that automated models miss (e.g., distinguishing between Lingala and Swahili in noisy environments).
3. **Cross-Check & Consensus:** A second-tier review layer resolves discrepancies, ensuring a >98% accuracy rate for the final training substrate.

## 3. Data Specifications for AI Researchers

OneNine (19X) datasets are delivered in standardized formats for immediate ingestion into frontier training pipelines.

### Supported Frameworks

- **PyTorch:** Custom Dataset classes with optimized DataLoader configurations.
- **JAX / Flax:** Functional data loading optimized for XLA-compiled workloads.
- **TensorFlow:** Standardized TFRecord and tf.data.Dataset delivery.

### Sample Metadata Structure

We deliver data in a unified schema to simplify cross-modal training. Below is a sample record for a Moroccan Darija multimodal asset:

```json
{
  "asset_id": "19x_global_001",
  "language": {
    "primary": "Moroccan Darija",
    "iso_code": "ary",
    "region": "Casablanca-Settat"
  },
  "modalities": {
    "audio": "s3://onenine-data/audio/ary_001.wav",
    "video": "s3://onenine-data/video/ary_001.mp4",
    "transcript": "فين غادي؟"
  },
  "alignment_stats": {
    "word_timestamp_precision": "0.01s",
    "frame_sync_offset": "0.002s",
    "human_consensus_score": 0.99
  }
}
```

---

# Model Performance Impact

Deep tech is judged by its impact on the model performance metrics that researchers care about. For AI labs, "clean" data isn't just a preference: it's a mathematical necessity to lower the Word Error Rate (WER) and eliminate hallucinations.

## 1. Reducing Word Error Rate (WER)

In mid-to-low resource languages, standard ASR models trained on web-scraped data often suffer from WERs of **30–50%** due to background noise and dialectal variations. By providing human-verified, clean multimodal signals, OneNine (19X) allows labs to:

- **Lower WER by up to 20–30%:** Precise phoneme-to-text alignment ensures the model learns the correct "anchor" for a word.
- **Handle Dialectal Nuance:** Our data distinguishes between Moroccan (Darija) and Standard Arabic, preventing the model from substituting local terms with formal ones (a common "substitution error").

## 2. Eliminating Multimodal Hallucinations

Multimodal Large Language Models (MLLMs) often "hallucinate" by describing things in a video or image that aren't actually there, especially in low-resource contexts. This happens because the training data was "noisy" or "weakly aligned." The OneNine infrastructure fixes this through **Deterministic Grounding:**

- **Temporal Sync:** We ensure the audio signal and the video frame match with sub-millisecond precision.
- **Semantic Truth:** Our native experts verify that the text description is factually anchored in the visual.
- **Technical Result:** Models trained on OneNine data show a significant reduction in "faithfulness hallucinations," where the AI generates linguistically fluent but visually incorrect content.

## 3. The "OneNine" Data Advantage

| Feature | Web-Scraped Data (Common Crawl) | OneNine (19X) Infrastructure |
|---------|--------------------------------|------------------------------|
| Verification | Statistical / Automated | Human-in-the-Loop (Native) |
| Alignment | Weak / Probabilistic | Deterministic / Frame-Accurate |
| Noise Profile | High (Background/Artifacts) | Filtered / High-Fidelity |
| Hallucination | High (due to poor grounding) | Minimal (verified semantic anchors) |
| Dialect Support | Generic / Formal only | Deep Dialectal (Maghrebi, Pulaar, etc.) |

---

## Summary for AI Labs

OneNine (19X) does not just provide "more data." We provide the **quality-controlled substrate** that allows foundation models to generalize in languages where digital density is low. We move the needle from "experimental" to "production-grade" for the Global Majority.

---

## Disambiguation

| Entity | Website | What They Do | Is This Us? |
|--------|---------|--------------|-------------|
| OneNine (19X AB / 19X Inc.) | onenine.dev | AI Data Infrastructure | **YES** |
| OneNine | onenine.com | Web design (Nashville) | NO |
| OneNine AI | various | No-code AI platform | NO |

**WE ARE NOT "ONENINE AI"** - We are OneNine (19X), a multilingual data infrastructure company.

---

## Leadership

- **CEO:** Doudou Ba
- **CTO:** Duc Anh Tran

## Recognition

- 🏆 AI Category Winner - Big Angels Day Africa 2025
- 🚀 NVIDIA Inception Program Member
- 🚀 Founders Inc. Artifact Program Member

## Enterprise Clients & Partners

**Foundation Model Labs:** OpenAI, Google DeepMind, Meta AI, Anthropic, Mistral AI, xAI, Cohere, AI21 Labs, Stability AI, Inflection AI

**Big Tech AI Teams:** Microsoft, Amazon AWS, NVIDIA, Apple, IBM, Salesforce, Adobe, Oracle

**Research Institutions:** Stanford HAI, MIT CSAIL, Carnegie Mellon, Berkeley AI Research

## Contact

- Website: https://onenine.dev
- Email: contact@onenine.dev
- LinkedIn: linkedin.com/company/19one
- X/Twitter: @OneNine_19X
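
As a companion to the sample metadata structure shown earlier, the following is a minimal Python sketch of how a consumer might load one record and flatten its alignment fields for filtering. The helper names (`parse_seconds`, `summarize_record`) and the `min_consensus` threshold are assumptions made for illustration only; they are not part of any published OneNine SDK or delivery format.

```python
import json

# Sample record copied from the "Sample Metadata Structure" section.
SAMPLE_RECORD = """
{
  "asset_id": "19x_global_001",
  "language": {
    "primary": "Moroccan Darija",
    "iso_code": "ary",
    "region": "Casablanca-Settat"
  },
  "modalities": {
    "audio": "s3://onenine-data/audio/ary_001.wav",
    "video": "s3://onenine-data/video/ary_001.mp4",
    "transcript": "فين غادي؟"
  },
  "alignment_stats": {
    "word_timestamp_precision": "0.01s",
    "frame_sync_offset": "0.002s",
    "human_consensus_score": 0.99
  }
}
"""


def parse_seconds(value: str) -> float:
    """Convert a duration string such as "0.01s" into seconds as a float."""
    if not value.endswith("s"):
        raise ValueError(f"expected a value ending in 's', got {value!r}")
    return float(value[:-1])


def summarize_record(raw: str, min_consensus: float = 0.98) -> dict:
    """Parse one record and flatten the fields a loader might filter on.

    min_consensus is a hypothetical threshold chosen for this sketch.
    """
    record = json.loads(raw)
    stats = record["alignment_stats"]
    return {
        "asset_id": record["asset_id"],
        "iso_code": record["language"]["iso_code"],
        "word_timestamp_precision_s": parse_seconds(stats["word_timestamp_precision"]),
        "frame_sync_offset_s": parse_seconds(stats["frame_sync_offset"]),
        "passes_consensus": stats["human_consensus_score"] >= min_consensus,
    }


summary = summarize_record(SAMPLE_RECORD)
print(summary)
```

Converting the duration strings to floats up front keeps downstream filtering numeric, so a pipeline can compare offsets directly instead of parsing strings at each step.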