Sources
Sources are the extract/load side of Frank. A source connects to one system, discovers its streams, and lands selected data in tenant-scoped Bronze Iceberg tables.
Source lifecycle
```
draft -> ready -> syncing -> active <-> paused
                     |
                     +-> error

  * -> decommissioned
```

This lifecycle only describes EL. It does not say whether a transform exists or whether downstream data is modeled.
Source patterns
A source starts from a pattern in backend/config/patterns. Patterns define:
- Display metadata: `id`, `name`, `description`, `category`, `complexity`, `icon`.
- Engine: `airbyte` or `dlt`.
- Connector config: Airbyte source image or dlt source type.
- Field definitions: required and optional form fields.
- Defaults and templates.
- Examples, supported formats, auth methods, and transformation hints.
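To make this concrete, a minimal pattern file might look like the sketch below. Every key name here is an assumption modeled on the field list above, not the actual schema used in backend/config/patterns:

```yaml
# Hypothetical pattern definition; key names mirror the list above, not a confirmed schema.
id: postgres
name: PostgreSQL
description: Replicate tables from a PostgreSQL database.
category: databases
complexity: low
icon: postgres.svg
engine: airbyte
connector:
  source_image: airbyte/source-postgres   # Airbyte connector image (assumed key name)
fields:
  required: [host, port, database, username, password]
  optional: [ssl_mode]
defaults:
  port: 5432
```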
Current pattern coverage includes:
| Category | Examples | Engine mix |
|---|---|---|
| Databases | PostgreSQL, MySQL, SQL Server, MongoDB | Airbyte |
| Warehouses | BigQuery, Snowflake, Redshift, Databricks | Airbyte |
| CRM and finance | Salesforce, HubSpot, Stripe | Airbyte and dlt |
| APIs | REST, GraphQL, GitHub, Jira, Notion, Slack, Airtable, RSS | dlt and Airbyte |
| Files | S3, SFTP bulk, Google Sheets, filesystem, archive/ZIP | Airbyte and dlt |
| Streams | Kafka | dlt |
Patterns are synced to the database at API startup, so the UI can render dynamic forms without hardcoded connector fields.
Choosing an engine
| Use case | Recommended engine | Why |
|---|---|---|
| Mature SaaS or database connector | Airbyte | Existing connector behavior, schema discovery, and Docker isolation. |
| Custom REST or GraphQL API | dlt | Python-native source construction, pagination/auth templates, nested JSON normalization. |
| Filesystem or lightweight custom extraction | dlt | In-process readers and multi-table normalization. |
| High-volume production replication | Airbyte | Better fit for source-defined sync behavior and connector ecosystem. |
Frank hides most engine differences behind the `ExtractionEngine` interface. Both engines produce data envelopes, cursor state, batch progress, and Iceberg writes.
Streams
Each source owns streams. A stream is configured independently:
| Field | Meaning |
|---|---|
| `name` | Source-side stream/table/resource name from discovery. |
| `namespace` | Optional source namespace or schema. |
| `dest_table_name` | Optional Bronze table override. Child tables inherit the override prefix. |
| `sync_mode` | `full_refresh` or `incremental`. |
| `cursor_field` | Cursor column/resource field for incremental syncs. |
| `write_disposition` | `append`, `replace`, or `merge`. |
| `primary_key_path` | Primary key fields for merge/upsert behavior. |
| `is_enabled` | Whether this stream participates in sync runs. |
| `schema` | JSON schema saved from discovery. |
For incremental streams, Frank persists cursor state and uses overlap windows to avoid missing late-arriving rows.
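As an illustration, a streams.yaml passed to `frankctl sources streams set` could configure one incremental and one full-refresh stream. The top-level `streams:` wrapper is an assumption; the per-stream keys come from the table above:

```yaml
# Hypothetical streams.yaml; the top-level wrapper is assumed, per-stream keys follow the table above.
streams:
  - name: orders
    namespace: public
    sync_mode: incremental
    cursor_field: updated_at        # incremental cursor; overlap windows guard late-arriving rows
    write_disposition: merge
    primary_key_path: [id]
    is_enabled: true
  - name: customers
    namespace: public
    sync_mode: full_refresh
    write_disposition: replace
    is_enabled: true
```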
Data envelope
Every extracted record is wrapped with platform metadata before it lands in Iceberg. The envelope provides stable operational fields for dedupe, lineage, sync timing, stream identity, and later transform cursors.
The default transform cursor is `_extracted_at`; the default tiebreaker is `_record_id`.
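For illustration, an enveloped record might look like the sketch below. Only `_extracted_at` and `_record_id` are documented on this page; the remaining field names are invented here to show the dedupe, lineage, timing, and stream-identity roles:

```yaml
# Illustrative envelope; all field names except _extracted_at and _record_id are assumptions.
_record_id: 7f3c9a1e-2b44-4a57-9c0d-5e8f1a6b2c3d   # stable ID for dedupe; default tiebreaker
_extracted_at: "2024-05-01T06:00:12Z"              # sync timing; default transform cursor
_source_id: src-billing-db                         # lineage to the owning source (assumed)
_stream: public.orders                             # stream identity (assumed)
payload:                                           # the extracted record itself (assumed shape)
  id: 42
  status: paid
```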
Iceberg naming
All source writes go through Frank's ADL-006 naming helpers. The important rule: do not hand-build Iceberg paths in connector or transform code. Let the naming service combine tenant ID, source name, stream namespace, stream name, and target config.
Typical layers:
- Bronze: raw/source data from source syncs.
- Silver: cleaned and conformed transform outputs.
- Gold: curated business outputs and semantic backing datasets.
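As a sketch only, the inputs the naming service combines might be expressed like this. Every key below is illustrative, and the resulting table identifier is decided by the ADL-006 helpers, never hand-built:

```yaml
# Illustrative naming inputs (key names assumed); the ADL-006 helpers, not this file,
# decide how these combine into a tenant-scoped Bronze table identifier.
tenant_id: acme
source_name: billing-db
stream_namespace: public
stream_name: invoices
target: bronze
```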
Discovery flow
- Create a source from a pattern.
- Trigger discovery.
- Source worker executes discovery through Airbyte or dlt.
- Frank stores discovered schema and candidate streams.
- User selects streams and sets sync config.
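Discovery stores a schema per candidate stream. A hedged sketch of what that stored result could look like (every key name assumed):

```yaml
# Hypothetical shape of stored discovery output; every key name is an assumption.
streams:
  - name: orders
    namespace: public
    schema:
      type: object
      properties:
        id: { type: integer }
        status: { type: string }
        updated_at: { type: string, format: date-time }
```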
CLI:
```
frankctl sources create -f source.yaml
frankctl sources discover <source-id> --timeout 300
frankctl sources streams list <source-id>
frankctl sources streams set <source-id> -f streams.yaml
```
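The create command reads a YAML source definition. A sketch of source.yaml, with assumed top-level keys and config fields drawn from a hypothetical postgres pattern:

```yaml
# Hypothetical source.yaml; top-level keys are assumed, config fields would follow the
# chosen pattern's field definitions.
name: billing-db
pattern: postgres
config:
  host: db.internal.example.com
  port: 5432
  database: billing
  username: frank_reader
  password: ${POSTGRES_PASSWORD}   # assumed secret-reference convention, not a confirmed feature
```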
API:
```
POST /api/v1/sources
POST /api/v1/sources/{source_id}/discover
GET  /api/v1/sources/{source_id}/streams
POST /api/v1/sources/{source_id}/streams/bulk
```
Sync flow
- User triggers a sync manually or through a schedule.
- Source status moves to `syncing`.
- Temporal dispatches work to the source worker.
- Engine extracts selected streams and writes batches to Iceberg.
- Sync run records row counts, logs, cursor values, snapshots, and failures.
- Source status becomes `active` or `error`.
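Per the list above, a sync run captures counts, cursors, and failures. One hedged sketch of such a record (every field name assumed):

```yaml
# Illustrative sync-run record; every field name is an assumption.
run_id: run-20240501-0600
status: succeeded
streams:
  orders:
    rows_synced: 12480
    cursor_value: "2024-05-01T05:59:58Z"   # persisted for the next incremental window
  customers:
    rows_synced: 3051
error: null
```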
CLI:
```
frankctl sources sync <source-id> --streams customers,orders --timeout 600
frankctl sources history <source-id>
frankctl sources logs <source-id> <run-id>
```
Schedules
Sources support manual, cron, and interval schedules. The CLI exposes source-scoped schedules:
```
frankctl schedules set <source-id> -f schedule.yaml
frankctl schedules pause <source-id>
frankctl schedules resume <source-id>
frankctl schedules trigger <source-id>
```
Example:
```yaml
schedule_type: cron
schedule_value: "0 */6 * * *"
```
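Interval schedules presumably use the same two keys; the value format below (seconds) is an assumption, so confirm the expected format before relying on it:

```yaml
schedule_type: interval
schedule_value: "21600"   # assumed to be seconds; the actual unit/format is not confirmed here
```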
Failure behavior
If a source fails, it moves to `error` with `last_error_message` and `last_error_at`. Downstream transforms are not deleted or invalidated; they may continue to run against stale tables or block based on their own readiness checks.
That separation is intentional. Fix credentials, rerun discovery if the schema changed, refresh stream schemas, and then sync again.