DW Lab — DataWorkshop
LAB STATUS — ACTIVE — KRAKÓW, EU

We don't
teach theory.
We run experiments.

DW Lab is our internal research engine: 100+ ML and AI models running in production across real business environments. Every course, every challenge, and every recommendation we make is grounded in what actually works here.

LIVE
100+
Models in production
10+
Years of experiments
6+
Industries
0
Borrowed knowledge
LLM FINE-TUNING · RETRIEVAL-AUGMENTED GENERATION · AGENTIC WORKFLOWS · PRODUCTION DEPLOYMENT · MULTI-AGENT SYSTEMS · CUSTOM EMBEDDINGS · REAL-TIME INFERENCE · PROMPT ENGINEERING AT SCALE · EVALUATION FRAMEWORKS · TELCO ML · E-COMMERCE RECOMMENDATIONS · LOGISTICS FORECASTING

What is DW Lab

Production before
the classroom.

dw-lab — experiment_runner.py
$ lab status --verbose
Active experiments: 12
Models in prod: 103
Industries: telco, retail, logistics, edtech, fintech, auto

$ lab run --model rag_v4 --env prod
Loading baseline...
Eval: precision@5 = 0.847
Eval: recall@5 = 0.791
Latency p95: 180ms
Status: PRODUCTION_READY

$ lab teach --source this_experiment
# → becomes Module 3 of LLM course
$

Constant experimentation

We don't wait for the next conference to hear what works. We run our own experiments, benchmark our own models, and build our own evaluation frameworks.
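
To show what "our own evaluation frameworks" means in practice, here is a minimal sketch of a retrieval eval like the precision@5 run in the terminal above. The document IDs and labels are made up for illustration; the real harness replays production query logs.

dw-lab · eval_sketch.py

# Minimal sketch of the precision@k / recall@k eval shown in the
# terminal above. Data is hypothetical; the real harness replays
# production query logs against human-labelled ground truth.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

# One query from a made-up gold set:
retrieved = ["d7", "d2", "d9", "d4", "d1"]   # ranked pipeline output
relevant  = {"d2", "d4", "d5", "d7"}         # labelled ground truth

print(precision_at_k(retrieved, relevant, k=5))  # 0.6
print(recall_at_k(retrieved, relevant, k=5))     # 0.75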

Real business data, real KPIs

Every model in our lab is solving an actual business problem with real data and real success metrics — not Kaggle, not papers, not demos.

Lab → classroom pipeline

When an experiment works in production, it becomes a lesson. When it fails, it becomes an even better lesson. Our curriculum is a direct output of this lab.

Current focus — 2025

Where we're putting
our compute right now.

Focus Area 01
LLM in Production

Getting large language models to actually work reliably at scale — not just in demos. We're testing every major model and architecture against real business requirements.

  • RAG system design and evaluation
  • Fine-tuning vs prompting — when each wins
  • Cost/quality tradeoffs at production scale (see the cost sketch just below)
  • Hallucination detection and mitigation
  • LLM observability and monitoring
Active — 7 running experiments
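
To make the cost/quality bullet concrete, here is the kind of back-of-envelope model we run before any deployment. All prices, traffic numbers, and quality scores below are hypothetical placeholders, not benchmark results.

dw-lab · cost_model.py

# Illustrative cost/quality comparison for an LLM at production scale.
# Every number here is a hypothetical placeholder.

REQUESTS_PER_DAY = 50_000
TOKENS_IN, TOKENS_OUT = 1_200, 300   # average per request

models = {
    # name: ($ per 1M input tokens, $ per 1M output tokens, eval score)
    "big_frontier_model": (5.00, 15.00, 0.92),
    "small_tuned_model":  (0.25,  1.25, 0.88),
}

for name, (in_price, out_price, quality) in models.items():
    daily = REQUESTS_PER_DAY * (
        TOKENS_IN / 1e6 * in_price + TOKENS_OUT / 1e6 * out_price
    )
    print(f"{name}: ${daily:,.2f}/day at eval quality {quality:.2f}")

# Typical shape of the result: the small fine-tuned model comes out
# roughly 15x cheaper for a few points of quality. Whether those
# points matter is exactly what the experiment has to answer.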
Focus Area 02
Agentic AI Systems

Building AI systems that don't just answer — they act. We're designing and deploying multi-agent workflows for complex business automation tasks.

  • Orchestration patterns for multi-agent systems
  • Reliable tool use and API integration (see the loop sketch just below)
  • Human-in-the-loop design for enterprise
  • Failure modes and recovery strategies
  • Evaluation frameworks for agent behavior
Active — 5 running experiments
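
The sketch referenced in the list above: a stripped-down, hypothetical version of the tool-use loop behind these experiments. The tools are stubs and the planner stands in for the LLM call; real deployments add retries, guardrails, and full observability.

dw-lab · agent_loop_sketch.py

# Hypothetical skeleton of a tool-using support agent with an
# explicit escalation path and a hard step budget.

MAX_STEPS = 5

def crm_lookup(customer_id):          # stub for the real CRM API
    return {"tier": 1, "open_orders": ["A-1042"]}

def order_status(order_id):           # stub for the real order API
    return {"id": order_id, "state": "in_transit"}

TOOLS = {"crm_lookup": crm_lookup, "order_status": order_status}

def run_agent(plan_next_step, task):
    """plan_next_step stands in for the LLM: any callable returning
    ("tool", name, arg), ("answer", text), or ("escalate", reason)."""
    context = [("task", task)]
    for _ in range(MAX_STEPS):
        action = plan_next_step(context)
        if action[0] == "answer":
            return action[1]
        if action[0] == "escalate":
            return "→ human agent: " + action[1]
        _, tool_name, arg = action
        context.append((tool_name, TOOLS[tool_name](arg)))  # feed back
    return "→ human agent: step budget exhausted"           # fail safe

def demo_planner(context):
    # Stand-in policy: look up the order once, then answer.
    if len(context) == 1:
        return ("tool", "order_status", "A-1042")
    return ("answer", f"Your order is {context[-1][1]['state']}.")

print(run_agent(demo_planner, "Where is my order A-1042?"))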

Lab inventory — partial snapshot

A sample of what's
in the lab right now.

LLM · live
RAG pipeline v4
Retrieval-augmented generation over an enterprise knowledge base. Hybrid vector + keyword retrieval with custom reranking.
GPT-4o · Qdrant · precision@5: 0.85
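
A peek at the "hybrid" part: one common way to fuse a vector ranking with a keyword ranking is reciprocal rank fusion. The sketch below is illustrative only; v4's custom reranker is more involved, and the document IDs are made up.

dw-lab · hybrid_fusion_sketch.py

# Minimal reciprocal rank fusion (RRF) over two ranked lists.
# Illustrative only; the production reranker is custom.

def rrf(rankings, k=60):
    """Score each doc by the sum of 1/(k + rank) across rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["d3", "d1", "d8"]   # from the embedding index
keyword_hits = ["d1", "d9", "d3"]   # from BM25 / keyword search

print(rrf([vector_hits, keyword_hits]))  # ['d1', 'd3', 'd9', 'd8']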
Agentic · live
Customer service agent
Multi-step agent handling Tier 1 support with tool use: CRM lookup, order status, escalation routing.
Claude 3.5 · LangGraph · e-commerce
ML · live
Churn prediction v7
Gradient boosting ensemble for telco customer churn. Real-time scoring at 50k events/day.
LightGBM · AUC: 0.91 · telco
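
The real-time scoring path, reduced to its core. This sketch assumes a LightGBM model exported to a file; the model path, feature names, and alert threshold are illustrative placeholders, not the production schema.

dw-lab · churn_scoring_sketch.py

# Shape of the real-time churn scoring path. The model file and
# feature names below are hypothetical placeholders.

import numpy as np
import lightgbm as lgb

booster = lgb.Booster(model_file="churn_v7.txt")  # trained offline

def score_event(features: dict) -> float:
    """Score one customer event; column order must match training."""
    row = np.array([[features["tenure_months"],
                     features["avg_monthly_spend"],
                     features["support_tickets_90d"]]])
    return float(booster.predict(row)[0])   # churn probability

event = {"tenure_months": 3, "avg_monthly_spend": 42.0,
         "support_tickets_90d": 4}
if score_event(event) > 0.8:                # illustrative threshold
    print("flag for retention outreach")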
Agentic · testing
Document processing agent
Autonomous extraction, classification, and routing of complex multi-page business documents, with a human review loop.
Gemini 1.5 · logistics · v0.3
LLM · testing
Internal LLM fine-tune
Domain-adapted model for technical documentation generation. Testing instruction tuning vs few-shot prompting on proprietary datasets.
Mistral 7B · fine-tune · edtech
ML · live
Demand forecasting
Time-series ensemble (XGBoost + Prophet + LSTM) for logistics demand planning across 1,200+ SKUs.
MAPE: 8.2% · logistics · 14-day horizon
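
In miniature, here is how a weighted ensemble forecast and its MAPE come together. The per-model numbers below stand in for XGBoost, Prophet, and LSTM outputs for a single SKU; the weights and values are made up.

dw-lab · forecast_ensemble_sketch.py

# Weighted ensemble forecast plus MAPE, in miniature. All numbers
# are made-up stand-ins for per-model forecasts on one SKU.

xgb_f     = [100, 104, 98]
prophet_f = [ 97, 101, 99]
lstm_f    = [103, 107, 95]
actual    = [101, 105, 97]

w_xgb, w_prophet, w_lstm = 0.5, 0.3, 0.2   # tuned on a validation window

ensemble = [w_xgb * x + w_prophet * p + w_lstm * l
            for x, p, l in zip(xgb_f, prophet_f, lstm_f)]

# MAPE = mean(|actual - forecast| / actual) * 100
mape = 100 * sum(abs(a - f) / a for a, f in zip(actual, ensemble)) / len(actual)
print(f"ensemble MAPE: {mape:.1f}%")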

Showing 6 of 100+ models. Full access available in DW Universe.

Experiment log

What we've been
testing lately.

Date      Experiment                                         Domain        Method            Result
Mar 2025  GPT-4o vs Claude 3.5 for structured extraction     logistics     eval framework    published
Mar 2025  Agentic loop stability under ambiguous inputs      e-commerce    stress testing    running
Feb 2025  RAG vs full-context for long documents             fintech       production A/B    deployed
Feb 2025  Mistral 7B fine-tune on domain vocabulary          edtech        instruction-tune  ongoing
Jan 2025  Embedding model comparison — 8 models              cross-domain  benchmark         published
Jan 2025  Human-in-the-loop thresholds for agent escalation  telco         live pilot        deployed
Dec 2024  Prompt caching cost reduction at scale             SaaS          infra experiment  −43% cost

Full experiment reports available in DW Universe →

Why the lab exists

What the lab means
for you.

100+
If we teach it, we've run it

Every topic in our courses has been stress-tested in production environments. No speculation, no copy-pasted textbook knowledge.

0
Months of hype lag

We don't wait for the industry to settle on an answer. We run the experiment now, get real data, and update our curriculum based on what we find — not what's trending on X.

Practical judgment, not credentials

Our benchmark isn't publications or citations. It's whether the model makes it to production and generates value. That's the only metric that matters in real ML work.

// NEXT STEP

Want lab-grade AI
in your business?

Bring us your problem. We'll put it through the lab, build a production solution, and make sure your team understands every decision we made.