Skip to content

DBT: Diet Trase Coffee Trader Parents

File location: s3://trase-storage/diet-trase/coffee_traders/diet_trase_coffee_trader_parents.parquet

DBT model name: diet_trase_coffee_trader_parents

Explore on Metabase: Full table; summary statistics

Explore dependencies/lineage: link


Description

Helper table linking coffee trader groups (as known in the Trase PostgreSQL database) to their parent companies, based on a manually curated Excel workbook (CompanyWorkbookMappingsFinal_08_09_25.xlsx).

The Excel workbook lists, for each historical trader label, the cleaned company name, group, and parent, along with validity dates, a source for the parent attribution, and free-text notes. This model joins those rows to postgres_traders via the trader label, so the output exposes the canonical DB trader name and node id alongside the proposed parent.

Notes:

  • Rows with trader_label equal to 'UNKNOWN' or 'UNKNOWN AFFILIATION' are excluded.
  • The ambiguous 'M/S TATA COFFEE' parent row is excluded.
  • The join is a LEFT JOIN from the Excel data to postgres_traders, so Excel entries that do not match any DB trader label appear with NULL db_trader_name, trader_node_id, and db_group_name.
  • A trader can appear multiple times if the workbook records more than one parent for it (for example, across different validity periods).

Details

Column Type Description
db_trader_name VARCHAR Canonical trader name from postgres_traders.name, matched to the Excel Old Company Name via the trader's labels array. `NULL`` when the Excel label has no match in the DB.
trader_node_id BIGINT Trase node id of the matched trader in postgres_traders. NULL when the Excel label has no match in the DB.
db_group_name VARCHAR Group name associated with the trader in the DB, extracted from the first element of postgres_traders.groups ($.group). May differ from the workbook's proposed group.
group_parent VARCHAR Parent company proposed for the trader's group, from the workbook's Parent Name column.
group_parent_start_date DATE Start date of the parent attribution, from the workbook's Start date column. Aggregated with ANY_VALUE since the workbook sometimes specifies it on only one of several rows for the same (trader, parent) pair.
group_parent_end_date DATE End date of the parent attribution, from the workbook's End date column. Aggregated with ANY_VALUE (see group_parent_start_date).
group_parent_source VARCHAR Source supporting the parent attribution, from the workbook's Parent Source column. Aggregated with ANY_VALUE.
group_parent_notes VARCHAR Free-text notes about the parent attribution, from the workbook's Notes column. Aggregated with ANY_VALUE.

No data tests defined 🧐

Models

Sources

Macros

No called script or script source not found.

-- Links coffee trader groups (as known in `postgres_traders`) to their parent
-- companies, based on a manually curated Excel workbook.
--
-- The workbook (`CompanyWorkbookMappingsFinal_08_09_25.xlsx`) is keyed on
-- historical trader labels ("Old Company Name") and supplies, for each label,
-- a proposed parent company plus optional validity dates, source, and notes.
-- We join those rows to `postgres_traders` through the trader's `labels`
-- array to expose the canonical DB trader name, node id, and group alongside
-- the parent attribution.
--
-- The `excel_extension` pre-hook installs the DuckDB extension required to
-- read the Excel source.
{{
    config(
        pre_hook='{{ excel_extension() }}',
        materialized='external'
    )
}}

WITH
-- Workbook rows, with placeholder labels and an ambiguous TATA COFFEE parent
-- removed. Column aliases rename the workbook headers into snake_case fields
-- used by the rest of the model.
excel_data AS (
    SELECT DISTINCT
        x."Old Company Name" AS trader_label,
        x."Cleaned Company Name" AS initial_proposed_name,
        x."Group Name" AS initial_proposed_group,
        x."Parent Name" AS group_proposed_parent,
        CAST(x."Start date" AS DATE) AS group_parent_start_date,
        CAST(x."End date" AS DATE) AS group_parent_end_date,
        x."Parent Source" AS group_parent_source,
        x."Notes" AS group_parent_notes
    FROM {{ source('trase-storage-raw', 'diet_trase_coffee_2020_trader_group_parents') }} AS x
    WHERE trader_label NOT IN ('UNKNOWN', 'UNKNOWN AFFILIATION')
    AND group_proposed_parent NOT IN ('M/S TATA COFFEE') -- filtering out ambigous parent for TATA COFFEE
),
-- One row per (trader, label) in `postgres_traders`. The `labels` array is
-- unnested so each historical label can be matched against the workbook's
-- `trader_label`. `db_group_name` is read from the first entry of the
-- `groups` JSON array.
db_traders AS (
    SELECT
        label AS db_label,
        name AS db_trader_name,
        json_extract_string(groups[1], '$.group') AS db_group_name,
        trader_node_id
    FROM  {{ ref('postgres_traders') }}
    CROSS JOIN UNNEST(labels) AS t(label)
),
-- Join workbook entries to DB traders by label. LEFT JOIN: workbook rows that
-- don't match any DB trader still come through, with NULL db_* fields. The
-- workbook may carry several rows per (trader, parent) — for example, only
-- one of them filled in the start/end date — so we group and pick any value.
trader_parent AS (
    SELECT DISTINCT
        db.db_trader_name,
        db.trader_node_id,
        db.db_group_name,
        e.group_proposed_parent AS group_parent,
        -- The following are shared values, sometimes specified only once for each trader
        ANY_VALUE(e.group_parent_start_date) AS group_parent_start_date,
        ANY_VALUE(e.group_parent_end_date) AS group_parent_end_date,
        ANY_VALUE(e.group_parent_source) AS group_parent_source,
        ANY_VALUE(e.group_parent_notes) AS group_parent_notes
    FROM excel_data AS e
    LEFT JOIN db_traders AS db
    ON e.trader_label = db.db_label
    GROUP BY ALL
),
-- Debug-only CTE: traders that end up with more than one row in the output
-- (e.g. multiple parents over different validity periods). Not part of the
-- final SELECT — uncomment the WHERE clause below to inspect them.
duplicated AS (
    SELECT
        db_trader_name,
        COUNT(*) AS cnt
    FROM trader_parent
    GROUP BY 1
    HAVING cnt > 1
)
SELECT *
FROM trader_parent
--WHERE db_trader_name IN (SELECT db_trader_name FROM duplicated)
ORDER BY db_trader_name