Sources

Sources are the extract/load side of Frank. A source connects to one system, discovers its streams, and lands selected data in tenant-scoped Bronze Iceberg tables.

Source lifecycle

text
draft -> ready -> syncing -> active <-> paused
                         |
                         +-> error
* -> decommissioned

This lifecycle describes only extract/load; the * row means a source can be decommissioned from any state. It says nothing about whether a transform exists or whether downstream data is modeled.

Source patterns

A source starts from a pattern in backend/config/patterns. Patterns define:

  • Display metadata: id, name, description, category, complexity, icon.
  • Engine: airbyte or dlt.
  • Connector config: Airbyte source image or dlt source type.
  • Field definitions: required and optional form fields.
  • Defaults and templates.
  • Examples, supported formats, auth methods, and transformation hints.
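
A minimal sketch of what a pattern file might contain, assuming a YAML layout; the keys mirror the bullet list above, but the exact schema used in backend/config/patterns may differ:

yaml
# Hypothetical pattern definition; layout and key names are assumptions.
id: postgres
name: PostgreSQL
description: Replicate tables from a PostgreSQL database.
category: databases
complexity: low
icon: postgres
engine: airbyte                      # airbyte or dlt
connector:
  image: airbyte/source-postgres     # Airbyte source image
fields:
  required: [host, port, database, username, password]
  optional: [ssl_mode, schemas]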

Current pattern coverage includes:

| Category | Examples | Engine mix |
|---|---|---|
| Databases | PostgreSQL, MySQL, SQL Server, MongoDB | Airbyte |
| Warehouses | BigQuery, Snowflake, Redshift, Databricks | Airbyte |
| CRM and finance | Salesforce, HubSpot, Stripe | Airbyte and dlt |
| APIs | REST, GraphQL, GitHub, Jira, Notion, Slack, Airtable, RSS | dlt and Airbyte |
| Files | S3, SFTP bulk, Google Sheets, filesystem, archive/ZIP | Airbyte and dlt |
| Streams | Kafka | dlt |

Patterns are synced to the database at API startup, so the UI can render dynamic forms without hardcoded connector fields.

Choosing an engine

| Use case | Recommended engine | Why |
|---|---|---|
| Mature SaaS or database connector | Airbyte | Existing connector behavior, schema discovery, and Docker isolation. |
| Custom REST or GraphQL API | dlt | Python-native source construction, pagination/auth templates, nested JSON normalization. |
| Filesystem or lightweight custom extraction | dlt | In-process readers and multi-table normalization. |
| High-volume production replication | Airbyte | Better fit for source-defined sync behavior and connector ecosystem. |

Frank hides most engine differences behind the ExtractionEngine interface. Both engines produce data envelopes, cursor state, batch progress, and Iceberg writes.

Streams

Each source owns streams. A stream is configured independently:

| Field | Meaning |
|---|---|
| name | Source-side stream/table/resource name from discovery. |
| namespace | Optional source namespace or schema. |
| dest_table_name | Optional Bronze table override. Child tables inherit the override prefix. |
| sync_mode | full_refresh or incremental. |
| cursor_field | Cursor column/resource field for incremental syncs. |
| write_disposition | append, replace, or merge. |
| primary_key_path | Primary key fields for merge/upsert behavior. |
| is_enabled | Whether this stream participates in sync runs. |
| schema | JSON schema saved from discovery. |

For incremental streams, Frank persists cursor state and uses overlap windows to avoid missing late-arriving rows.
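
As a hedged sketch, a streams.yaml for frankctl sources streams set (shown later) might configure two streams like this; the field names come from the table above, but the top-level file shape is an assumption:

yaml
# Hypothetical streams.yaml; the top-level layout is an assumption,
# field names match the stream configuration table above.
streams:
  - name: orders
    namespace: public
    sync_mode: incremental
    cursor_field: updated_at
    write_disposition: merge
    primary_key_path: [id]
    is_enabled: true
  - name: customers
    sync_mode: full_refresh
    write_disposition: replace
    is_enabled: true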

Data envelope

Every extracted record is wrapped with platform metadata before it lands in Iceberg. The envelope provides stable operational fields for dedupe, lineage, sync timing, stream identity, and later transform cursors.

The default transform cursor is _extracted_at; the default tiebreaker is _record_id.
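
As an illustration, one landed record might carry metadata like this. Only _extracted_at and _record_id are named on this page; the other metadata field names below are assumptions standing in for the dedupe, lineage, and stream-identity fields the envelope provides:

yaml
# Illustrative envelope for a single Bronze record. Only _extracted_at
# and _record_id are documented defaults; the other metadata names
# are assumptions.
_record_id: 7f3c9a2e-1b44-4d8a-9c01-2a6f5e8b3d10   # default dedupe tiebreaker
_extracted_at: "2024-05-01T12:00:00Z"              # default transform cursor
_stream: orders                                    # stream identity (assumed name)
_sync_run_id: 4215                                 # sync timing/lineage (assumed name)
id: 1001                                           # source fields follow unchanged
status: shipped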

Iceberg naming

All source writes go through Frank's ADL-006 naming helpers. The important rule: do not hand-build Iceberg paths in connector or transform code. Let the naming service combine tenant ID, source name, stream namespace, stream name, and target config.

Typical layers:

  • Bronze: raw/source data from source syncs.
  • Silver: cleaned and conformed transform outputs.
  • Gold: curated business outputs and semantic backing datasets.
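
For illustration, these are the inputs the naming helpers combine for one Bronze write; the values are made up, and the resulting Iceberg identifier format is defined by ADL-006 rather than guessed at here:

yaml
# Illustrative inputs to the ADL-006 naming service for a Bronze write.
# The output identifier format belongs to the naming helpers and is
# deliberately not reproduced here.
tenant_id: acme
source_name: prod_postgres
stream_namespace: public
stream_name: orders
target: bronze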

Discovery flow

  1. Create a source from a pattern.
  2. Trigger discovery.
  3. Source worker executes discovery through Airbyte or dlt.
  4. Frank stores discovered schema and candidate streams.
  5. User selects streams and sets sync config.

CLI:

bash
frankctl sources create -f source.yaml
frankctl sources discover <source-id> --timeout 300
frankctl sources streams list <source-id>
frankctl sources streams set <source-id> -f streams.yaml
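
A hypothetical source.yaml for the create command above; the key names are assumptions patterned on the field definitions a pattern declares:

yaml
# Hypothetical source.yaml; key names are assumptions.
pattern: postgres
name: prod_postgres
config:
  host: db.internal.example.com
  port: 5432
  database: app
  username: frank_reader
  password: ${POSTGRES_PASSWORD}   # inject from a secret, not inline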

API:

http
POST /api/v1/sources
POST /api/v1/sources/{source_id}/discover
GET  /api/v1/sources/{source_id}/streams
POST /api/v1/sources/{source_id}/streams/bulk

Sync flow

  1. User triggers a sync manually or through a schedule.
  2. Source status moves to syncing.
  3. Temporal dispatches work to the source worker.
  4. Engine extracts selected streams and writes batches to Iceberg.
  5. Sync run records row counts, logs, cursor values, snapshots, and failures.
  6. Source status becomes active or error.
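
A sync run record (step 5) might look roughly like this; every field name below is an assumption based on what the run is said to capture:

yaml
# Hypothetical sync run record; field names are assumptions based on
# the bookkeeping step 5 describes.
run_id: 4215
status: succeeded
rows:
  customers: 1204
  orders: 98321
cursor_values:
  orders: "2024-05-01T12:00:00Z"
failures: []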

CLI:

bash
frankctl sources sync <source-id> --streams customers,orders --timeout 600
frankctl sources history <source-id>
frankctl sources logs <source-id> <run-id>

Schedules

Sources support manual, cron, and interval schedules. The CLI exposes source-scoped schedules:

bash
frankctl schedules set <source-id> -f schedule.yaml
frankctl schedules pause <source-id>
frankctl schedules resume <source-id>
frankctl schedules trigger <source-id>

Example: a cron schedule that syncs every six hours:

yaml
schedule_type: cron
schedule_value: "0 */6 * * *"
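
An interval schedule would presumably use schedule_type: interval; the value format below is an assumption, since only the cron form is shown here:

yaml
schedule_type: interval
schedule_value: "6h"   # value format is an assumption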

Failure behavior

If a source fails, it moves to error with last_error_message and last_error_at. Downstream transforms are not deleted or invalidated; they may continue to run against stale tables or block based on their own readiness checks.

That separation is intentional. Fix credentials, rerun discovery if the schema changed, refresh stream schemas, and then sync again.