Concepts
Frank has a small vocabulary. The product is easiest to understand when the extract/load side, transform side, and ontology side stay separate.
The mental model in one paragraph
You create sources from reusable source patterns. Each source discovers streams and syncs selected streams into tenant-scoped Bronze Iceberg tables. You create transforms from field mappings, SQL, dbt-style templates, or Python-runner artifacts. Transforms consume Bronze tables or other transform outputs and materialize Silver or Gold tables. You compose transforms into versioned pipelines. You can then register a curated table as a backing dataset for an ontology entity type so semantic applications can consume it.
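To make the flow concrete, here is a minimal sketch against a hypothetical REST surface. The base URL, endpoint paths, and payload fields below are illustrative assumptions, not Frank's documented API.

```python
# Illustrative only: the base URL, endpoint paths, and payload fields are
# assumptions, not Frank's documented API surface.
import requests

BASE = "http://localhost:8080/api/v1"  # hypothetical local endpoint
TENANT = "acme"

# 1. Create a source from a reusable source pattern.
source = requests.post(f"{BASE}/tenants/{TENANT}/sources", json={
    "pattern": "postgresql",
    "config": {"host": "db.internal", "database": "shop"},
}).json()

# 2. Select a discovered stream; it syncs into a tenant-scoped Bronze table.
requests.post(f"{BASE}/tenants/{TENANT}/sources/{source['id']}/streams/orders",
              json={"selected": True,
                    "sync_mode": "incremental",
                    "cursor_field": "updated_at"})

# 3. Create a transform that reads the Bronze table and materializes Silver.
requests.post(f"{BASE}/tenants/{TENANT}/transforms", json={
    "pattern": "dedupe",
    "inputs": ["bronze.orders"],
    "target": "silver.orders_clean",
})

# 4. Register the curated table as a backing dataset for an entity type.
requests.post(f"{BASE}/tenants/{TENANT}/backing-datasets", json={
    "table": "silver.orders_clean",
    "entity_type": "Order",
    "mappings": {"order_id": "id", "total_amount": "amount"},
})
```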
Primitives
| Concept | What it is | Use it for |
|---|---|---|
| Tenant | The isolation boundary for sources, streams, transforms, pipelines, datasets, runs, and ontology mappings. | Separate customers, environments, or domains without mixing metadata or data paths. |
| Source pattern | A declarative connector template. Patterns describe config fields, defaults, examples, auth hints, and the extraction engine (airbyte or dlt). | Add PostgreSQL, Salesforce, REST APIs, GraphQL, files, S3, SFTP, Kafka, Stripe, Slack, and more without hardcoding the UI. |
| Source | A configured extract/load connection created from a source pattern. | Own connection config, discovery schema, status, schedule, and sync metrics. |
| Stream | A table, API resource, file glob, or event stream inside a source. | Select what to sync, set full-refresh or incremental mode, cursor fields, primary keys, and destination table names. |
| Dataset | An Iceberg table exposed through the datasets API. | Browse Bronze, Silver, or Gold data, preview rows, and inspect snapshots. |
| Transform pattern | A reusable transform recipe with a parameter schema and runtime renderer. | Filter, dedupe, join, aggregate, validate, convert, geospatially enrich, or run Python containers. |
| Transform | A materialized transformation definition. | Map source fields to target fields, apply patterns, generate SQL or Python artifacts, run manually, schedule, and inspect history. |
| Artifact | The hydrated executable output for a transform version. | Keep the runnable SQL/dbt/Python representation separate from the editable transform spec. |
| Run | A source sync, transform execution, pipeline sandbox, or ontology sync execution. | Track status, logs, metrics, workflow IDs, row counts, and failures. |
| Pipeline | A versioned DAG of transform steps. | Compose multi-step EL/T flows with edges, fan-in, terminal steps, sandbox validation, activation, and pause/resume. |
| Schema library | The target schema catalog exposed to the UI and API. | Pick FIWARE Smart Data Models or custom schemas for transforms and backing datasets. |
| Entity type | An ontology schema object served by ontology-core-v2. | Define the semantic object that a curated table backs, including fields and relationships. |
| Backing dataset | A registration that maps an Iceberg table to an ontology entity type. | Push table rows into ontology entities with column-to-property mappings and sync history. |
| Identity policy | A reusable rule for deriving stable entity identifiers. | Normalize source fields, build passthrough/composite/hash/UUID keys, and dry-run identity resolution (sketched after this table). |
| AI workflow | A Martha workflow owned by Frank and called by the API. | Suggest schemas, mappings, pattern params, SQL reviews, code generation, CI fixes, publishing, and pipeline composition. |
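To ground one of the less familiar primitives, here is a sketch of an identity policy and a toy dry run of its hash strategy. Every key name and the resolver function are illustrative assumptions, not Frank's schema.

```python
import hashlib

# Hypothetical identity policy shape; every key below is illustrative,
# not Frank's documented schema.
identity_policy = {
    "name": "order-identity",
    "strategy": "hash",  # passthrough | composite | hash | uuid
    "inputs": [
        {"column": "source_system", "normalize": ["trim", "lowercase"]},
        {"column": "order_number", "normalize": ["trim"]},
    ],
    "algorithm": "sha256",
}

def resolve_identity(row: dict, policy: dict) -> str:
    """Toy dry run of the hash strategy: normalize each input, then digest."""
    parts = []
    for spec in policy["inputs"]:
        value = str(row[spec["column"]])
        if "trim" in spec["normalize"]:
            value = value.strip()
        if "lowercase" in spec["normalize"]:
            value = value.lower()
        parts.append(value)
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

print(resolve_identity({"source_system": " Shopify ", "order_number": "A-1001"},
                       identity_policy))
```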
How they fit together
Tenant
|
+-- Source patterns --> Sources --> Streams --> Bronze Iceberg tables
|
+-- Transform patterns --> Transforms --> Artifacts --> Runs
| |
| +--> Silver / Gold Iceberg tables
|
+-- Pipelines --> Versions --> Steps + Edges --> Transform runs
|
+-- Schema libraries / Ontology entity types
|
+--> Backing datasets --> Ontology sync runs
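As a concrete illustration of the pipeline branch above, a hypothetical two-step fan-in version could be declared like this; all field names are assumptions, not Frank's actual payload.

```python
# Hypothetical pipeline version payload: two cleanup transforms fan in to a
# terminal join step. Field names are illustrative, not Frank's actual schema.
pipeline_version = {
    "pipeline": "orders-to-gold",
    "steps": [
        {"id": "clean_orders", "transform": "dedupe_orders"},
        {"id": "clean_customers", "transform": "dedupe_customers"},
        {"id": "join_orders", "transform": "orders_with_customers", "terminal": True},
    ],
    "edges": [
        {"from": "clean_orders", "to": "join_orders"},
        {"from": "clean_customers", "to": "join_orders"},
    ],
}

# A toy version of the DAG check an activation gate might run:
step_ids = {s["id"] for s in pipeline_version["steps"]}
assert all(e["from"] in step_ids and e["to"] in step_ids
           for e in pipeline_version["edges"])
```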
Two lifecycles, not one
Frank deliberately separates EL from T:
Source: draft -> ready -> syncing -> active <-> paused
|
+-> error
* -> decommissioned (reachable from any state)
Transform: draft -> ready -> retired
Runtime: none -> running -> succeeded | failed
A source can be active with no transform. A transform can remain ready while its upstream source is stale. A pipeline can be drafted and sandboxed before it is activated. That separation is what makes multi-source joins, transform chaining, and partial recovery practical.
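Sketched in code, the separation is just three independent state machines. The status values mirror the diagrams; the enum names are ours, not Frank's API.

```python
from enum import Enum

# The three state machines from the diagrams above, sketched as enums.
# Values mirror the diagrams; the class names are illustrative.

class SourceStatus(Enum):
    DRAFT = "draft"
    READY = "ready"
    SYNCING = "syncing"
    ACTIVE = "active"
    PAUSED = "paused"
    ERROR = "error"
    DECOMMISSIONED = "decommissioned"

class TransformStatus(Enum):
    DRAFT = "draft"
    READY = "ready"
    RETIRED = "retired"

class RunStatus(Enum):
    NONE = "none"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

# The lifecycles never couple: pausing a source changes nothing about a
# transform that reads its Bronze output.
source, transform = SourceStatus.PAUSED, TransformStatus.READY
```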
What this gets you
You do not have to build:
- Connector UX -- dynamic forms, validation, discovery, stream selection, and sync scheduling.
- Lakehouse write plumbing -- Iceberg naming, tenant namespaces, envelopes, cursor state, idempotent snapshots, and write retries (a naming sketch follows this list).
- Transform runtime plumbing -- hydration, artifacts, renderers, run records, logs, cancellation, and retry policies.
- Pipeline safety -- DAG validation, version hashes, sandbox execution, step classification, and activation gates.
- Semantic publication -- ontology proxying, entity type versioning, backing dataset mappings, identity policy support, and sync history.
- AI orchestration -- prompt workflows, model calls, trace IDs, and graceful degradation when Martha is offline.
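As one illustration of the naming plumbing you skip, a tenant-scoped Bronze identifier might be derived along these lines; the convention shown is an assumption for illustration, not Frank's documented scheme.

```python
def bronze_table(tenant: str, source: str, stream: str) -> str:
    """Hypothetical tenant-scoped Iceberg identifier. The convention shown
    here is an assumption for illustration; Frank owns the real scheme."""
    return f"{tenant}.bronze.{source}__{stream}"

print(bronze_table("acme", "shop_db", "orders"))  # acme.bronze.shop_db__orders
```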
You still own:
- Data intent -- which data matters, what it means, and how it should be joined, cleaned, modeled, and published.
What is next
- Quickstart starts the local stack.
- Sources explains extract/load setup.
- Transforms explains transform authoring and execution.
- Pipelines explains DAG composition.
- Ontology integration explains publication into ontology-core-v2.