Diet Trase


This page is synchronized from trase/models/diet_trase/README.md. Last modified on 2025-12-10 18:30 CET by Trase Admin. Please view or edit the original file there; changes should be reflected here after the midnight build (CET) or after manually triggering a build with a GitHub action.


Initial phase:

Read trade data for each country
Read deforestation data per year/country/commodity (s3://trase-storage/diet-trase/deduce_deforestation_intensities_july_2024.csv)
Combine into one dataset
Summary statistics (each point below is a single graph, mixing traders from all countries):
1. Top traders by volume
2. Top country of import with EU combined into one group
3. Top traders by deforestation exposure (multiply intensities by volume)
4. As above for emissions excl.
5. As above for emissions incl.
6. Sankey
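
The exposure calculation in point 3 (intensity multiplied by volume, summed per trader across all countries) can be sketched as follows. This is a minimal illustration in plain Python, not the model's code: the record layout, the trader names, the hectares-per-tonne unit and all numbers are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical trade records: (year, country, commodity, trader, volume in tonnes).
trade = [
    (2020, "COLOMBIA", "COFFEE", "Trader A", 100.0),
    (2020, "COLOMBIA", "COFFEE", "Trader B", 50.0),
    (2020, "BRAZIL", "COFFEE", "Trader A", 200.0),
]

# Hypothetical deforestation intensities (assumed unit: hectares per tonne),
# keyed by year/country/commodity as in the intensities CSV.
intensity = {
    (2020, "COLOMBIA", "COFFEE"): 0.02,
    (2020, "BRAZIL", "COFFEE"): 0.005,
}


def exposure_by_trader(trade, intensity):
    """Sum volume * intensity per trader, mixing traders from all countries."""
    totals = defaultdict(float)
    for year, country, commodity, trader, volume in trade:
        totals[trader] += volume * intensity[(year, country, commodity)]
    return dict(totals)


def top_traders(totals, n=10):
    """Return the n traders with the largest totals, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


exposure = exposure_by_trader(trade, intensity)
ranking = top_traders(exposure, n=2)
```

The same `top_traders` helper would serve the volume ranking in point 1 if it were given summed volumes instead of exposures.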

The output data from this model is located at:

s3://trase-storage/diet-trase/diet-trase-results-2020.parquet

Usage

You can run the model locally from either main.ipynb or main.py. Here is an example running the model just for a single country (Colombia):

poetry run trase/models/diet_trase/coffee_fullmodel/main.py --only COLOMBIA

However, to upload the results to s3://trase-storage, you should run it through DBT:

poetry run trase/data_pipeline/dbt run --target production --select 1+diet_trase_coffee_2020

When running on SageMaker, a significant amount of memory is required. This may be due to a memory issue on SageMaker that we have not yet been able to resolve. Here are the resources we used for the run on SageMaker:

Wall clock runtime: 66 minutes
Peak memory: 122 GB

Data pipeline

This model has a slightly unusual data pipeline. Unlike other SEI-PCS models, we only select the columns needed for the model from the input trade data: year, country_of_production, port_of_export_name and mass_tonnes_raw_equivalent. The model adds an extra column production_geocode to the data. The model output is then applied to the trade data.

flowchart LR

    %% Inputs
    subgraph Inputs
        A["Input Trade Data"]
        B["Production etc. Data"]
    end

    %% Processing
    A -- "Select Only Columns Needed for Model" --> C{{"Model"}}
    B --> C

    %% Outputs
    subgraph Outputs
        D["diet-trase-results-2020-brazil.parquet"]
        E["diet-trase-results-2020-..."]
        G["diet-trase-results-2020.parquet"]

        %% Combine step
        F{{"Combine Country Outputs"}} --> G
        D --> F
        E --> F
    end

    %% Model output relationships
    C --> D
    C --> E

There are a number of advantages to this approach:

It is clearer to the user what data is actually being used in the model
The model run is faster, as less data is being processed
We can clean columns of the trade data that are not used in the model without having to re-run the model

Furthermore, we output one Parquet file per country, which makes it easier to debug issues with specific countries.
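
The shape of this pipeline can be sketched in plain Python as below. It is an assumption-laden illustration rather than the actual implementation: `run_model` is a hypothetical stand-in that only adds a placeholder `production_geocode`, the sample rows and the `exporter_name` column are invented, and the real pipeline writes Parquet files rather than in-memory lists.

```python
# The four columns the model reads from the trade data.
MODEL_COLUMNS = (
    "year",
    "country_of_production",
    "port_of_export_name",
    "mass_tonnes_raw_equivalent",
)


def select_model_columns(rows):
    """Keep only the columns the model needs from each trade record."""
    return [{col: row[col] for col in MODEL_COLUMNS} for row in rows]


def run_model(rows):
    """Hypothetical stand-in for the model: adds a production_geocode column."""
    return [{**row, "production_geocode": "PLACEHOLDER"} for row in rows]


def run_pipeline(trade_rows):
    """Run the stand-in model per country, then combine the country outputs."""
    by_country = {}
    for row in select_model_columns(trade_rows):
        by_country.setdefault(row["country_of_production"], []).append(row)
    # One output per country, mirroring the one-file-per-country layout.
    country_outputs = {c: run_model(rows) for c, rows in by_country.items()}
    combined = [row for rows in country_outputs.values() for row in rows]
    return country_outputs, combined


# Invented sample rows; exporter_name stands for any column the model ignores.
trade_rows = [
    {"year": 2020, "country_of_production": "COLOMBIA",
     "port_of_export_name": "CARTAGENA", "mass_tonnes_raw_equivalent": 12.5,
     "exporter_name": "Example Exporter"},
    {"year": 2020, "country_of_production": "BRAZIL",
     "port_of_export_name": "SANTOS", "mass_tonnes_raw_equivalent": 30.0,
     "exporter_name": "Another Exporter"},
]
country_outputs, combined = run_pipeline(trade_rows)
```

Keeping the per-country outputs separate until the final combine step is what allows a single country to be inspected or re-run in isolation.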

Per-country documentation

Country	Level Solved	GADM Level	Coding System / Notes
Colombia	Municipality	3
Côte d'Ivoire	District	2	Codes from GADM
Ethiopia	Region	1	Codes from GADM
Brazil	Municipality	6
India	State/Union Territory	1	28 States and 8 Union Territories. Three possible coding systems: ISO 3166-2:IN codes (state-level only); Census of India; Local Government Directory managed by the Ministry of Panchayati Raj. We only solve to the state level, using ISO 3166-2:IN codes. We add an additional region called "Northern Region" grouping some northern states and union territories.
Uganda	Traditional sub-regions	N/A	In Uganda we solve to traditional sub-regions (sometimes called counties of origin or historical kingdoms). In GADM, they don’t correspond to the current administrative districts (Level 2) or the 4 statistical regions (Level 1).
Indonesia	Province	2
Peru	Province	2
Tanzania	Region	1
Vietnam	Region	1

Linear programming in Tanzania

When running the linear programming model for Tanzania we encountered two issues:

Trade to "Unknown" Ports: About 6.8% of reported trade is directed to an "unknown" port. This destination cannot be used as a valid sink in our transportation model, which requires that all demand at each sink be fully met. Nearly all of this trade (6.7% of total reported trade) results from data padding to align with UN Comtrade totals.
Supply–Demand Imbalance: Reported production is approximately 61,000 tonnes, while exports total about 68,000 tonnes.
In other words, there isn’t enough supply to meet total export demand. Tanzania had opening stocks of around 18,500 tonnes, which were presumably used to meet this demand; unfortunately, opening stocks are not accounted for in the Diet Trase model.

To ensure the linear programming model can run successfully for Tanzania, we take the following steps:

Exclude "Unknown" Ports: Remove all trade flows to "unknown" ports from the LP step. The model therefore only considers trade between known production areas and known export ports.
Scale up production: Increase production uniformly across all regions by 3.4% so that exports to known ports can be satisfied.
Reintegrate "unknown" trade: After solving, reintroduce the removed trade flows to "unknown" ports, attributing them to an unspecified production source.

In short, we artificially adjust production in two ways: we add a notional component to cover exports to unknown ports, and we scale up existing production to meet exports through known ports.
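
The three adjustment steps can be sketched as below. This is a minimal illustration under stated assumptions, not the model's code: the function names are hypothetical, the numbers are toy values rather than Tanzanian data, and the linear programming solve itself is omitted and replaced by a hand-written feasible set of flows.

```python
UNKNOWN = "UNKNOWN"


def balance_for_lp(production, exports):
    """Drop exports to unknown ports and scale production to cover the rest."""
    known_exports = {port: t for port, t in exports.items() if port != UNKNOWN}
    unknown_tonnes = exports.get(UNKNOWN, 0.0)
    supply = sum(production.values())
    demand = sum(known_exports.values())
    # Scale up uniformly, and only when supply falls short of known-port demand.
    factor = max(1.0, demand / supply)
    scaled = {region: t * factor for region, t in production.items()}
    return scaled, known_exports, unknown_tonnes, factor


def reintegrate_unknown(flows, unknown_tonnes):
    """Append the removed trade as a flow from an unspecified production source."""
    if unknown_tonnes > 0:
        return flows + [(UNKNOWN, UNKNOWN, unknown_tonnes)]
    return list(flows)


# Toy numbers for illustration only.
production = {"Region A": 60.0, "Region B": 40.0}
exports = {"Port 1": 70.0, "Port 2": 35.0, UNKNOWN: 10.0}

scaled, known_exports, unknown_tonnes, factor = balance_for_lp(production, exports)

# Stand-in for the LP result: (region, port, tonnes) flows that meet all
# known-port demand without exceeding the scaled supply of either region.
lp_flows = [
    ("Region A", "Port 1", 63.0),
    ("Region B", "Port 1", 7.0),
    ("Region B", "Port 2", 35.0),
]
all_flows = reintegrate_unknown(lp_flows, unknown_tonnes)
```

In this toy case, known-port demand is 105 tonnes against 100 tonnes of supply, so production is scaled by a factor of 1.05 before the solve, and the 10 tonnes to the unknown port are added back afterwards.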

TODOs

Need to read in product
Need to use raw equivalent factors to correctly embed the metrics
Rename deforestation to intensity
