Open Source · Apache 2.0

Data contracts that execute, not just document.

Define your data product in YAML. LakeLogic enforces it at pipeline time — bad rows are quarantined the moment they appear, not three dashboards later. Same contract runs on Polars, Spark, or DuckDB.

14K+ installs · Apache 2.0

lakelogicoss.ipynb
Python 3.12
1
Pipeline summary

Top-of-file docstring: describes what this contract does end to end. Picked up by Sphinx, MkDocs, and IDE hovers; also surfaced inside LakeLogic Cloud as the contract description.

2
Contract metadata

Version, dataset name, owner, and target layer (Bronze/Silver/Gold) — the audit trail every contract carries.

3
Domain & system

Highly recommended (often required by the LakeLogic Orchestrator) — drives lineage tracking, automated storage routing, and data-catalog population.

4
PII flag

Marking a field PII makes it discoverable to lineage, governance, and the masking engine — even if no strategy is set yet.

5
Built-in masking

Five strategies ship out of the box: nullify · hash · redact · partial · encrypt. Applied at runtime per row.
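As a rough illustration of what per-row, field-scoped masking can look like, here is a minimal sketch in plain Python. It is not LakeLogic's masking engine: the function names, the `mask_row` helper, and the sample input email are assumptions made for this example. Only four of the five strategies are sketched; `encrypt` is left out because it needs a key and a cipher library.

```python
import hashlib


def nullify(value):
    """Drop the value entirely."""
    return None


def hash_value(value, salt=""):
    """One-way SHA-256 digest; equal inputs give equal outputs, so joins still work."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def redact(value):
    """Replace the value with a fixed placeholder."""
    return "[REDACTED]"


def partial(value):
    """Keep the first character and the domain of an email-shaped string."""
    local, sep, domain = value.partition("@")
    if not sep or not local:
        return "***"  # not email-shaped: hide everything
    return f"{local[0]}***@{domain}"


# Hypothetical registry mapping strategy names to functions.
STRATEGIES = {"nullify": nullify, "hash": hash_value, "redact": redact, "partial": partial}


def mask_row(row, field_strategies):
    """Return a copy of row with each configured field masked; None values pass through."""
    masked = dict(row)
    for field, strategy in field_strategies.items():
        if masked.get(field) is not None:
            masked[field] = STRATEGIES[strategy](masked[field])
    return masked


# Sample row; the unmasked value of customer_email_encrypted is invented for the demo.
row = {
    "order_id": 2024,
    "customer_email": "maysu@gftzqqyu.co",
    "customer_email_encrypted": "rsmith@latezuab.co",
}
masked = mask_row(row, {"customer_email_encrypted": "partial"})
print(masked["customer_email_encrypted"])  # r***@latezuab.co
print(masked["customer_email"])            # maysu@gftzqqyu.co (no strategy configured)
```

Only the field that names a strategy is touched, which is the behaviour a field-scoped masking config implies.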

6
Inline transformations

SQL transformations run inside the contract — the same DuckDB / Spark / Polars engine validates and shapes the data.

7
Row-level quality

Row-rules quarantine bad rows. Each failed rule is preserved with its name so triage knows exactly why a row was rejected.
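The quarantine idea can be sketched without LakeLogic at all. The example below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` instead of DuckDB, Spark, or Polars, and the `split_rows` helper and sample rows are invented for illustration. Each rule is a SQL predicate; a row that fails any predicate is routed to `bad` together with the names of the rules it failed.

```python
import sqlite3

# Rule names and predicates copied from the contract on this page.
ROW_RULES = {
    "positive_amount": "amount > 0",
    "valid_status": "status IN ('pending','shipped','delivered')",
}


def split_rows(rows, rules):
    """Return (good, bad); each bad row carries the names of the rules it failed."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE source (idx INTEGER, order_id INTEGER, amount REAL, status TEXT)"
    )
    con.executemany(
        "INSERT INTO source VALUES (?, ?, ?, ?)",
        [(i, *row) for i, row in enumerate(rows)],
    )
    failures = {}  # row index -> list of failed rule names
    for name, predicate in rules.items():
        # A predicate that evaluates to NULL (e.g. amount is missing) counts as a failure.
        query = f"SELECT idx FROM source WHERE COALESCE(({predicate}), 0) = 0"
        for (idx,) in con.execute(query):
            failures.setdefault(idx, []).append(name)
    con.close()
    good = [row for i, row in enumerate(rows) if i not in failures]
    bad = [(*row, failures[i]) for i, row in enumerate(rows) if i in failures]
    return good, bad


# Illustrative rows: (order_id, amount, status).
rows = [
    (2024, 25.46, "shipped"),
    (4282, -622.57, "pending"),
    (-341, None, "INVALID_PIFG"),
]
good, bad = split_rows(rows, ROW_RULES)
print(good)
for row in bad:
    print(row)
```

Keeping every failed rule name on the row, rather than stopping at the first failure, is what lets triage see that the third row broke two rules at once.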

8
Dataset-level quality

Aggregate guards on the entire batch — minimum revenue, pending ratio, freshness windows. Run after row rules; failures fail the run.
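A dataset-level guard boils down to running one aggregate query and comparing the scalar to a threshold. The sketch below assumes SQLite (via the standard-library `sqlite3`) as a stand-in engine, so the pending-ratio query is written with `CASE WHEN` rather than `FILTER`. The `run_dataset_rules` helper, the table name, and the sample rows are hypothetical.

```python
import sqlite3

# Thresholds mirror the contract on this page; the SQL is adapted for SQLite.
DATASET_RULES = [
    {
        "name": "minimum_total_revenue",
        "sql": "SELECT SUM(amount) FROM {dataset}",
        "must_be_greater_than": 1000,
    },
    {
        "name": "max_pending_ratio",
        "sql": (
            "SELECT SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) * 1.0 "
            "/ COUNT(*) FROM {dataset}"
        ),
        "must_be_less_than": 0.5,
    },
]


def run_dataset_rules(con, table, rules):
    """Evaluate each aggregate rule; return {rule name: (value, passed)}."""
    results = {}
    for rule in rules:
        (value,) = con.execute(rule["sql"].format(dataset=table)).fetchone()
        passed = value is not None  # an empty batch yields NULL and fails
        if passed and "must_be_greater_than" in rule:
            passed = value > rule["must_be_greater_than"]
        if passed and "must_be_less_than" in rule:
            passed = value < rule["must_be_less_than"]
        results[rule["name"]] = (value, passed)
    return results


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE good_rows (amount REAL, status TEXT)")
con.executemany(
    "INSERT INTO good_rows VALUES (?, ?)",
    [(600.0, "shipped"), (450.0, "delivered"), (200.0, "pending"), (50.0, "shipped")],
)
results = run_dataset_rules(con, "good_rows", DATASET_RULES)
for name, (value, passed) in results.items():
    print(f"{name}: {value} -> {'PASS' if passed else 'FAIL'}")

# A failing aggregate rule fails the whole run rather than quarantining rows.
if not all(passed for _, passed in results.values()):
    raise RuntimeError("dataset rule failed; failing the run")
```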

9
Optional physical exec

Wire the same contract directly to Delta, Parquet, or Iceberg storage with merge / append / SCD2 strategies.

10
OSS → Cloud telemetry

Flip `enabled: true` and the OSS engine streams run metadata to LakeLogic Cloud — Observatory grades health, Zeus diagnoses incidents, the Watchdog flags missing deliveries. Metadata only; row data never leaves your lakehouse.

11
Auto-generated test data

DataGenerator inspects the contract (model + quality rules) and synthesises realistic invalid rows for stress testing. Use it in CI to validate the contract itself — no staging of real bad data needed.
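To make the idea of fault injection concrete, here is a small standard-library sketch. It is not how DataGenerator is implemented: the `generate` function, the fault table, and the seeded random source are assumptions for illustration. The `_test_case_types` field name follows the column mentioned in the run output on this page. Each invalid row receives one or more faults, so the number of injections can exceed the number of invalid rows.

```python
import random

# Hypothetical fault catalogue: each fault corrupts a different field.
FAULTS = {
    "NOT_NULL_VIOLATION": lambda row: row.update(customer_email=None),
    "EMPTY_STRING": lambda row: row.update(status=""),
    "RANGE_VIOLATION": lambda row: row.update(amount=-row["amount"]),
}


def generate(rows, invalid_ratio, seed=0):
    """Build `rows` records; the first `rows * invalid_ratio` get 1-3 faults each."""
    rng = random.Random(seed)
    n_invalid = int(rows * invalid_ratio)
    data = []
    for i in range(rows):
        row = {
            "order_id": i + 1,
            "customer_email": f"user{i}@example.com",
            "amount": round(rng.uniform(1, 100), 2),
            "status": rng.choice(["pending", "shipped", "delivered"]),
            "_test_case_types": "",
        }
        if i < n_invalid:
            chosen = rng.sample(sorted(FAULTS), k=rng.randint(1, len(FAULTS)))
            for fault in chosen:
                FAULTS[fault](row)
            row["_test_case_types"] = ",".join(chosen)
        data.append(row)
    rng.shuffle(data)  # mix valid and invalid rows
    return data


data = generate(rows=1000, invalid_ratio=0.5)
invalid = [r for r in data if r["_test_case_types"]]
injections = sum(len(r["_test_case_types"].split(",")) for r in invalid)
print(f"{len(data)} rows, {len(invalid)} invalid, {injections} injections")
```

In a CI job, a generator like this would feed the processor, and the test would assert that the faulty rows land in quarantine.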

12
Engine selection

Same contract runs unchanged on DuckDB (local notebook), Spark (cluster), or Polars. Swap one string; the rules, masking, lineage, and quarantine logic don't change.

"""
E-Commerce Orders → silver layer.
Executes the contract end-to-end: schema + PII + row & dataset quality rules.
Bad rows are quarantined with reasons; good rows promoted to silver.
Same contract runs on DuckDB, Spark, or Polars.
"""
# pip install lakelogic
import lakelogic as ll
contract = """
version: 1.0.0
dataset: orders
info:
  title: E-Commerce Orders
  owner: data-team@company.com
  target_layer: silver
  domain: sales
  system: shopify
model:
  fields:
    - name: order_id
      type: integer
      description: "Unique numeric ID for each order"
    - name: customer_email
      type: string
      pii: true
      description: "Customer's raw email, flagged PII for governance"
    # - name: customer_email_formatted
    #   type: string
    #   pii: true
    - name: customer_email_encrypted
      type: string
      pii: true
      masking: partial
      description: "Same email, partially masked at runtime (r***@latezuab.co)"
    - name: amount
      type: float
      required: true
      description: "Order total in USD; row-rules guard against zero / negative"
    - name: status
      type: string
      description: "Order lifecycle: pending · shipped · delivered"
transformations:
  - sql: "SELECT *, UPPER(LTRIM(customer_email)) as customer_email_formatted
      FROM source"
quality:
  row_rules:
    - name: positive_amount
      sql: "amount > 0"
    - name: valid_status
      sql: "status IN ('pending','shipped','delivered')"
  dataset_rules:
    - name: minimum_total_revenue
      description: "Batch must carry at least $1,000 in valid sales"
      sql: "SELECT SUM(amount) FROM {dataset}"
      must_be_greater_than: 1000
    - name: max_pending_ratio
      description: "Pending orders capped at 50% of the batch"
      sql: "SELECT COUNT(*) FILTER(WHERE status='pending')*1.0
        / COUNT(*) FROM {dataset}"
      must_be_less_than: 0.5
# ── Optional: Physical Execution Config ────────────────────────────
# source:
#   type: delta
#   path: "abfss://landing@myaccount.dfs.core.windows.net/orders"
#   load_mode: incremental
#   watermark_strategy: pipeline_log
#
# materialization:
#   path: "abfss://silver@myaccount.dfs.core.windows.net/orders"
#   strategy: merge
#   format: delta
#   merge_dedup_guard: true
#   partition_by:
#     - status
# ── Observatory Telemetry: OSS → LakeLogic Cloud bridge ──────────
# observatory:
#   enabled: false
#   environments: ["dev", "prod", "staging", "local"]
#   endpoint: "${LAKELOGIC_OBSERVATORY_ENDPOINT}"
#   emit_on: [success, partial, failed]
#   expected_delivery:            # Telemetry Delivery Watchdog
#     cadence_minutes: 15
#     missing_after_minutes: 45
"""
# Generate test data — 50% intentionally bad
source_df = ll.DataGenerator(contract).generate(rows=1000, invalid_ratio=0.5)
# Run the pipeline
proc = ll.DataProcessor(contract, engine="duckdb")
good, bad = proc.run(source_df)
2026-05-09 07:40:22.538|INFO |generator:generate:3422-📋 Generating data for: E-Commerce Orders
2026-05-09 07:40:22.539|INFO |generator:generate:3423- Records : 500 valid + 500 invalid = 1,000 total
2026-05-09 07:40:22.541|INFO |generator:generate:3439- Source : Faker + heuristic generation
2026-05-09 07:40:22.779|INFO |generator:generate:3488- Row generation complete: 1,000 records built
2026-05-09 07:40:22.781|INFO |generator:generate:3510- Test cases : 1023 across 6 categories
2026-05-09 07:40:22.786|INFO |generator:generate:3512- NOT_NULL_VIOLATION 414 injections
2026-05-09 07:40:22.786|INFO |generator:generate:3512- EMPTY_STRING 221 injections
2026-05-09 07:40:22.789|INFO |generator:generate:3512- RANGE_VIOLATION 135 injections
2026-05-09 07:40:22.791|INFO |generator:generate:3512- EDGE_CASE_BUILTIN 121 injections
2026-05-09 07:40:23.034|INFO |duckdb:_run_dataset_rules:775-Dataset Rule: minimum_total_revenue | Result: 13382.63 (expected > 1000.0) | PASS
2026-05-09 07:40:23.051|INFO |duckdb:_run_dataset_rules:775-Dataset Rule: max_pending_ratio | Result: 0.276 (expected < 0.5) | PASS
2026-05-09 07:40:23.067|INFO |masking_engine:apply:398-PII masking: 1 field(s) for user_groups=(none): customer_email_encrypted→partial
2026-05-09 07:40:23.073|INFO |processor:run:839-Run complete [layer=silver] | Source: 1000 | Good: 678 | Quarantine: 322 | Ratio: 32.20%

13
Synthetic test data

DataGenerator emits 500 valid + 500 intentionally invalid rows so quality rules have something to catch. Ideal for CI tests of the contract itself.

14
Why 1,023 > 1,000?

Multi-fault injection. A single invalid row can carry 3–4 faults at once (e.g. EMPTY_STRING + NOT_NULL_VIOLATION + RANGE_VIOLATION). 1,023 distinct injections were densely packed into the 500 invalid rows; see the comma-separated _test_case_types in the bad DataFrame.

15
Aggregate guard fired

minimum_total_revenue is a dataset_rule. The engine ran SUM(amount) over the batch, compared the result to the threshold, and stamped PASS.

16
Ratio gate

Counts pending orders as a share of the batch. Cheap to evaluate, hard to miss when something upstream stalls.

17
Masking applied

masking: partial fired only for `customer_email_encrypted`. Plain `customer_email` stays untouched: strategies are field-scoped.

18
Run summary

678 rows passed all checks and are promoted to Silver. 322 rows are isolated in `bad` with reasons attached, never silently dropped.
Good: 678 · Quarantined: 322 · Ratio: 32.2%
good.limit(3)
shape: (3, 8)

order_id | customer_email | encrypted | amount | status
2024 | "maysu@gftzqqyu.co" | "r***@latezuab.co" | 25.46 | "shipped"
6658 | "imqyyezpgw@jeijriy.net" | "a***@fhyaul.org" | 52.33 | "delivered"
632 | "bmxgcelu@nrfv.co" | "w***@gvpavpnn.net" | 100.9 | "shipped"
bad.select([…]).limit(3)
shape: (3, 7)

Columns: order_id · amount · status · _lakelogic_errors · _lakelogic_categories

40652.28 "INVALID_LCSL"
    "Rule failed: valid_status (status IN ('pending','shipped','delivered'))"
    ["correctness"]

-341 null "INVALID_PIFG"
    "Rule failed: amount_required ("amount" IS NOT NULL)"
    "Rule failed: positive_amount (amount > 0)"
    "Rule failed: valid_status (status IN ('pending','shipped','delivered'))"
    ["completeness", "correctness"]

4282 -622.57 "pending"
    "Rule failed: positive_amount (amount > 0)"
    ["correctness"]

19
_lakelogic_errors

Every failed rule is appended verbatim. Triage tools group by rule name; observability dashboards bucket by category.

20
Multi-failure row

Row failed three independent rules. None are silently merged; each is preserved so engineers and Zeus can reason about the cause graph.

21
_lakelogic_categories

Categorisation lets governance dashboards roll up by completeness / correctness / freshness, not just rule names.
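The roll-ups described for the quarantine output can be sketched with a few lines of standard-library Python. This assumes `_lakelogic_errors` and `_lakelogic_categories` arrive as lists of strings in the `Rule failed: <name> (<expression>)` shape shown in the sample output; the `summarise` helper and the regular expression are illustrative, not part of LakeLogic.

```python
import re
from collections import Counter

# Extracts the rule name from messages shaped like "Rule failed: <name> (<expr>)".
RULE_PATTERN = re.compile(r"^Rule failed: (\w+) \(")

# The three quarantined rows from the sample output, reduced to the two audit columns.
bad_rows = [
    {
        "_lakelogic_errors": [
            "Rule failed: valid_status (status IN ('pending','shipped','delivered'))"
        ],
        "_lakelogic_categories": ["correctness"],
    },
    {
        "_lakelogic_errors": [
            'Rule failed: amount_required ("amount" IS NOT NULL)',
            "Rule failed: positive_amount (amount > 0)",
            "Rule failed: valid_status (status IN ('pending','shipped','delivered'))",
        ],
        "_lakelogic_categories": ["completeness", "correctness"],
    },
    {
        "_lakelogic_errors": ["Rule failed: positive_amount (amount > 0)"],
        "_lakelogic_categories": ["correctness"],
    },
]


def summarise(rows):
    """Count failures per rule name and quarantined rows per category."""
    by_rule, by_category = Counter(), Counter()
    for row in rows:
        for message in row["_lakelogic_errors"]:
            match = RULE_PATTERN.match(message)
            by_rule[match.group(1) if match else "unparsed"] += 1
        by_category.update(row["_lakelogic_categories"])
    return by_rule, by_category


by_rule, by_category = summarise(bad_rows)
print(dict(by_rule))
print(dict(by_category))
```

Because each failed rule is kept as its own entry, the per-rule counts add up to more than the number of quarantined rows whenever a row fails several rules.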

What's in the box

One contract. Three engines.

Auto-discovers the best engine for your environment. SQL-first quality gates from Polars to petabytes.

Runtime Data Contracts

Enforce schema, quality rules, and business logic during data movement — not after. YAML, version-controlled, engine-agnostic.

Quarantine Bad Data

Automatically detour invalid rows in real time. Bad data is isolated, tagged, and triaged, never silently dropped.

Medallion Architecture

Quality gate between Bronze → Silver → Gold. Clean the front door of your lakehouse with shift-left data quality.

Materialization Strategies

Append, merge, SCD2, or overwrite. Native Spark operations at petabyte scale. Same contract, different strategies.

Polars · Local speed
Spark · Distributed scale
DuckDB · Analytical SQL

Two Products · One Vision

Outgrow the OSS? Bring it to Cloud.

LakeLogic Open Source is free forever. When you need Zeus AI, visual lineage, governance, and team workflows — drop your YAML straight into LakeLogic Cloud.

Join Cloud waitlist
Stay on OSS

Apache 2.0 · runs on Polars, Spark, or DuckDB · 90+ runnable examples · zero vendor lock-in.

Browse the repo
Migration Path

Already using the OSS reference architecture? Drag and drop your docs/contracts/*.yaml folder straight into LakeLogic Cloud.

Bulk import contracts