Skip to content

[feat] Cross-DB join key inference (prefix/suffix overlap) #758

@sahrizvi

Description

@sahrizvi

Problem

When altimate-code is connected to multiple warehouses (a real customer pattern: post-acquisition systems, multi-vendor SaaS analytics, ingest-from-many-sources), there's no built-in way to detect that columns across DBs are entity-equivalent despite naming/format mismatches. Today the agent has to discover this manually for each new pair of warehouses.

Concrete examples we hit on a benchmark workload:

  • yelp dataset: business.business_id in MongoDB has values like businessid_42; review.business_ref in DuckDB has values like businessref_42. Same entity, different prefix. The agent loses many turns figuring this out.
  • bookreview dataset: books_info.book_id (bookid_42) ↔ review.purchase_id (purchaseid_42).

This is the canonical "ill-formatted join keys" trap that the DataAgentBench paper highlights, and it shows up regularly in production warehouses (acquired-company ID prefix mismatches, dirty migrations, multi-vendor analytics).

altimate_core.column_lineage traces lineage within a query and schema_inspect describes one table — neither suggests cross-DB joins.

Proposed mechanism

A new tool altimate_core.detect_join_candidates(connections) that:

  1. For each pair of (db, table, string-column), pull ~50 sample values via existing `schema_inspect` plumbing.
  2. Compute longest common prefix of L's sample values; same for R.
  3. If both prefixes are non-empty AND end in `_` / `-` / `:` AND L's prefix ≠ R's prefix AND there's at least one matching suffix between L and R after stripping prefixes → emit a candidate.
  4. Score by suffix overlap count + same-DB-type compatibility.
  5. Return ranked: `[{left_db, left_table, left_col, right_db, right_table, right_col, prefix_rule, suffix_overlap, confidence}]`.

A working reference implementation in Python exists at `dab-bench-internal/dab_bench/preindexer.py:_detect_join_candidates` (~50 LOC) — straightforward to translate.

What it enables

  • Agent calls it once after a multi-warehouse connect; gets cross-DB join recommendations without writing analysis SQL.
  • Foundation for column_lineage to become cross-DB aware.
  • schema_inspect could surface a "see also" footer with cross-DB matches.

Beyond benchmark / general use cases

  • Acquired company's customer_id format ≠ parent's
  • Salesforce AccountId ↔ Hubspot company_id ↔ internal warehouse account_external_id
  • Dirty migrations leaving prefix mismatches (v1_user_123 vs user_123)
  • Multi-vendor analytics where each system has its own ID format

Implementation pointers

  • New file: packages/opencode/src/altimate/tools/altimate-core-detect-join-candidates.ts
  • Existing patterns to follow: packages/opencode/src/altimate/tools/schema-inspect.ts, altimate/native/altimate-core.ts
  • Tool registration: packages/opencode/src/tool/registry.ts (alongside other altimate_core_* entries)
  • Estimated scope: ~200-300 LOC + tests

Tests to add

  • Prefix-overlap unit test
  • Suffix-overlap unit test
  • Integration test against in-memory SQLite fixture with 2 fake "DBs" exhibiting the businessid_Xbusinessref_X pattern

Evidence

Benchmark trace from `dab-bench-internal` showing the agent burning turns on this exact problem available on request.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions