Sources
Sources are the extract/load side of Frank. A source connects to one system, discovers its streams, and lands selected data in tenant-scoped Bronze Iceberg tables.
Source lifecycle
```
draft -> ready -> syncing -> active <-> paused
                     |
                     +-> error

  * -> decommissioned
```

This lifecycle only describes EL. It does not say whether a transform exists or whether downstream data is modeled.
Source patterns
A source starts from a pattern in backend/config/patterns. Patterns define:
- Display metadata: `id`, `name`, `description`, `category`, `complexity`, `icon`.
- Engine: `airbyte` or `dlt`.
- Connector config: Airbyte source image or dlt source type.
- Field definitions: required and optional form fields.
- Defaults and templates.
- Examples, supported formats, auth methods, and transformation hints.
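To make this concrete, a minimal pattern file might look like the sketch below. Every key name here is an assumption modeled on the field list above, not the actual schema used in backend/config/patterns:

```yaml
# Hypothetical pattern definition; key names mirror the list above, not a confirmed schema.
id: postgres
name: PostgreSQL
description: Replicate tables from a PostgreSQL database.
category: databases
complexity: low
icon: postgres.svg
engine: airbyte
connector:
  source_image: airbyte/source-postgres   # Airbyte connector image (assumed key name)
fields:
  required: [host, port, database, username, password]
  optional: [ssl_mode]
defaults:
  port: 5432
```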
Current pattern coverage includes:
| Category | Examples | Engine mix |
|---|---|---|
| Databases | PostgreSQL, MySQL, SQL Server, MongoDB | Airbyte |
| Warehouses | BigQuery, Snowflake, Redshift, Databricks | Airbyte |
| CRM and finance | Salesforce, HubSpot, Stripe | Airbyte and dlt |
| APIs | REST, GraphQL, GitHub, Jira, Notion, Slack, Airtable, RSS | dlt and Airbyte |
| Files | S3, SFTP bulk, Google Sheets, filesystem, archive/ZIP | Airbyte and dlt |
| Streams | Kafka | dlt |
Patterns are synced to the database at API startup, so the UI can render dynamic forms without hardcoded connector fields.
Choosing an engine
| Use case | Recommended engine | Why |
|---|---|---|
| Mature SaaS or database connector | Airbyte | Existing connector behavior, schema discovery, and Docker isolation. |
| Custom REST or GraphQL API | dlt | Python-native source construction, pagination/auth templates, nested JSON normalization. |
| Filesystem or lightweight custom extraction | dlt | In-process readers and multi-table normalization. |
| High-volume production replication | Airbyte | Better fit for source-defined sync behavior and connector ecosystem. |
Frank hides most engine differences behind the `ExtractionEngine` interface. Both engines produce data envelopes, cursor state, batch progress, and Iceberg writes.
Streams
Each source owns streams. A stream is configured independently:
| Field | Meaning |
|---|---|
| `name` | Source-side stream/table/resource name from discovery. |
| `namespace` | Optional source namespace or schema. |
| `dest_table_name` | Optional Bronze table override. Child tables inherit the override prefix. |
| `sync_mode` | `full_refresh` or `incremental`. |
| `cursor_field` | Cursor column/resource field for incremental syncs. |
| `write_disposition` | `append`, `replace`, or `merge`. |
| `primary_key_path` | Primary key fields for merge/upsert behavior. |
| `is_enabled` | Whether this stream participates in sync runs. |
| `schema` | JSON schema saved from discovery. |
For incremental streams, Frank persists cursor state and uses overlap windows to avoid missing late-arriving rows.
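As an illustration, a streams.yaml passed to `frankctl sources streams set` could configure one incremental and one full-refresh stream. The top-level `streams:` wrapper is an assumption; the per-stream keys come from the table above:

```yaml
# Hypothetical streams.yaml; the top-level wrapper is assumed, per-stream keys follow the table above.
streams:
  - name: orders
    namespace: public
    sync_mode: incremental
    cursor_field: updated_at        # incremental cursor; overlap windows guard late-arriving rows
    write_disposition: merge
    primary_key_path: [id]
    is_enabled: true
  - name: customers
    namespace: public
    sync_mode: full_refresh
    write_disposition: replace
    is_enabled: true
```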
Data envelope
Every extracted record is wrapped with platform metadata before it lands in Iceberg. The envelope provides stable operational fields for dedupe, lineage, sync timing, stream identity, and later transform cursors.
The default transform cursor is `_extracted_at`; the default tiebreaker is `_record_id`.
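For illustration, an enveloped record might look like the sketch below. Only `_extracted_at` and `_record_id` are documented on this page; the remaining field names are invented here to show the dedupe, lineage, timing, and stream-identity roles:

```yaml
# Illustrative envelope; all field names except _extracted_at and _record_id are assumptions.
_record_id: 7f3c9a1e-2b44-4a57-9c0d-5e8f1a6b2c3d   # stable ID for dedupe; default tiebreaker
_extracted_at: "2024-05-01T06:00:12Z"              # sync timing; default transform cursor
_source_id: src-billing-db                         # lineage to the owning source (assumed)
_stream: public.orders                             # stream identity (assumed)
payload:                                           # the extracted record itself (assumed shape)
  id: 42
  status: paid
```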
Iceberg naming
All source writes go through Frank's ADL-006 naming helpers. The important rule: do not hand-build Iceberg paths in connector or transform code. Let the naming service combine tenant ID, source name, stream namespace, stream name, and target config.
Typical layers:
- Bronze: raw/source data from source syncs.
- Silver: cleaned and conformed transform outputs.
- Gold: curated business outputs and semantic backing datasets.
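As a sketch only, the inputs the naming service combines might be expressed like this. Every key below is illustrative, and the resulting table identifier is decided by the ADL-006 helpers, never hand-built:

```yaml
# Illustrative naming inputs (key names assumed); the ADL-006 helpers, not this file,
# decide how these combine into a tenant-scoped Bronze table identifier.
tenant_id: acme
source_name: billing-db
stream_namespace: public
stream_name: invoices
target: bronze
```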
Discovery flow
- Create a source from a pattern.
- Trigger discovery.
- Source worker executes discovery through Airbyte or dlt.
- Frank stores discovered schema and candidate streams.
- User selects streams and sets sync config.
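Discovery stores a schema per candidate stream. A hedged sketch of what that stored result could look like (every key name assumed):

```yaml
# Hypothetical shape of stored discovery output; every key name is an assumption.
streams:
  - name: orders
    namespace: public
    schema:
      type: object
      properties:
        id: { type: integer }
        status: { type: string }
        updated_at: { type: string, format: date-time }
```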
CLI:
```
frankctl sources create -f source.yaml
frankctl sources discover <source-id> --timeout 300
frankctl sources streams list <source-id>
frankctl sources streams set <source-id> -f streams.yaml
```
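The create command reads a YAML source definition. A sketch of source.yaml, with assumed top-level keys and config fields drawn from a hypothetical postgres pattern:

```yaml
# Hypothetical source.yaml; top-level keys are assumed, config fields would follow the
# chosen pattern's field definitions.
name: billing-db
pattern: postgres
config:
  host: db.internal.example.com
  port: 5432
  database: billing
  username: frank_reader
  password: ${POSTGRES_PASSWORD}   # assumed secret-reference convention, not a confirmed feature
```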
API:
```
POST /api/v1/sources
POST /api/v1/sources/{source_id}/discover
GET  /api/v1/sources/{source_id}/streams
POST /api/v1/sources/{source_id}/streams/bulk
```
Sync flow
- User triggers a sync manually or through a schedule.
- Source status moves to `syncing`.
- Temporal dispatches work to the source worker.
- Engine extracts selected streams and writes batches to Iceberg.
- Sync run records row counts, logs, cursor values, snapshots, and failures.
- Source status becomes `active` or `error`.
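Per the list above, a sync run captures counts, cursors, and failures. One hedged sketch of such a record (every field name assumed):

```yaml
# Illustrative sync-run record; every field name is an assumption.
run_id: run-20240501-0600
status: succeeded
streams:
  orders:
    rows_synced: 12480
    cursor_value: "2024-05-01T05:59:58Z"   # persisted for the next incremental window
  customers:
    rows_synced: 3051
error: null
```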
CLI:
```
frankctl sources sync <source-id> --streams customers,orders --timeout 600
frankctl sources history <source-id>
frankctl sources logs <source-id> <run-id>
```
Schedules
Sources support manual, cron, and interval schedules. The CLI exposes source-scoped schedules:
```
frankctl schedules set <source-id> -f schedule.yaml
frankctl schedules pause <source-id>
frankctl schedules resume <source-id>
frankctl schedules trigger <source-id>
```
Example:
```yaml
schedule_type: cron
schedule_value: "0 */6 * * *"
```
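Interval schedules presumably use the same two keys; the value format below (seconds) is an assumption, so confirm the expected format before relying on it:

```yaml
schedule_type: interval
schedule_value: "21600"   # assumed to be seconds; the actual unit/format is not confirmed here
```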
Failure behavior
If a source fails, it moves to `error` with `last_error_message` and `last_error_at`. Downstream transforms are not deleted or invalidated; they may continue to run against stale tables or block based on their own readiness checks.
That separation is intentional. Fix credentials, rerun discovery if the schema changed, refresh stream schemas, and then sync again.