Diet Trase
This page is synchronized from trase/models/diet_trase/README.md. Last modified on 2025-12-10 18:30 CET by Trase Admin.
Please view or edit the original file there; changes appear here after the nightly build (midnight CET),
or after manually triggering a build with a GitHub Action.
Initial phase:
- Read trade data for each country
- Read deforestation data per year/country/commodity (s3://trase-storage/diet-trase/deduce_deforestation_intensities_july_2024.csv)
- Combine into one dataset
- Summary statistics. Each point below is a single graph, mixing traders from all countries:
  - Top traders by volume
  - Top countries of import, with the EU combined into one group
  - Top traders by deforestation exposure (intensity multiplied by volume)
  - As above for emissions excl.
  - As above for emissions incl.
  - Sankey
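The combine-and-rank steps above can be sketched with pandas. This is a minimal illustration, not the model's actual code: the column names (`exporter`, `volume_tonnes`, `deforestation_ha_per_tonne`) and all values are assumptions made up for the example, and the real inputs would be read from the trade data and the intensities CSV rather than built inline.

```python
import pandas as pd

# Hypothetical trade records; column names and values are illustrative only.
trade = pd.DataFrame({
    "year": [2020, 2020, 2020, 2020],
    "country_of_production": ["BRAZIL", "BRAZIL", "COLOMBIA", "COLOMBIA"],
    "commodity": ["COFFEE"] * 4,
    "exporter": ["Trader A", "Trader B", "Trader A", "Trader C"],
    "volume_tonnes": [100.0, 50.0, 40.0, 10.0],
})

# Hypothetical deforestation intensities per year/country/commodity.
intensities = pd.DataFrame({
    "year": [2020, 2020],
    "country_of_production": ["BRAZIL", "COLOMBIA"],
    "commodity": ["COFFEE", "COFFEE"],
    "deforestation_ha_per_tonne": [0.25, 0.5],
})

# Combine into one dataset: each trade row picks up the intensity for its
# year/country/commodity. validate= guards against duplicated intensity rows.
keys = ["year", "country_of_production", "commodity"]
combined = trade.merge(intensities, on=keys, how="left", validate="many_to_one")

# Deforestation exposure = intensity x volume.
combined["deforestation_exposure_ha"] = (
    combined["volume_tonnes"] * combined["deforestation_ha_per_tonne"]
)

# Rank traders across all countries, largest first.
top_by_volume = (
    combined.groupby("exporter")["volume_tonnes"].sum().sort_values(ascending=False)
)
top_by_exposure = (
    combined.groupby("exporter")["deforestation_exposure_ha"]
    .sum()
    .sort_values(ascending=False)
)
```

Each ranked series could then be passed to a plotting library to produce one graph per statistic.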
The output data from this model is located at:
s3://trase-storage/diet-trase/diet-trase-results-2020.parquet
Usage
You can run the model locally from either main.ipynb or main.py. Here is an example running the model just for a single country (Colombia):
poetry run trase/models/diet_trase/coffee_fullmodel/main.py --only COLOMBIA
However, to upload the results to s3://trase-storage, you should run it through DBT:
poetry run trase/data_pipeline/dbt run --target production --select 1+diet_trase_coffee_2020
When running on SageMaker, a significant amount of memory is required. This may be due to a memory issue on SageMaker that we have not yet been able to resolve. Here are the resources we used for the run on SageMaker:
- Wall clock runtime: 66 minutes
- Peak memory: 122 GB
Data pipeline
This model has a slightly unusual data pipeline.
Unlike other SEI-PCS models, we only select the columns needed for the model from the input trade data: year, country_of_production, port_of_export_name and mass_tonnes_raw_equivalent.
The model adds an extra column production_geocode to the data.
The model output is then applied to the trade data.
flowchart LR
%% Inputs
subgraph Inputs
A["Input Trade Data"]
B["Production etc. Data"]
end
%% Processing
A -- "Select Only Columns Needed for Model" --> C{{"Model"}}
B --> C
%% Outputs
subgraph Outputs
D["diet-trase-results-2020-brazil.parquet"]
E["diet-trase-results-2020-..."]
G["diet-trase-results-2020.parquet"]
%% Combine step
F{{"Combine Country Outputs"}} --> G
D --> F
E --> F
end
%% Model output relationships
C --> D
C --> E
There are a number of advantages to this approach:
- It is clearer to the user what data is actually being used in the model
- The model run is faster, as less data is being processed
- We can clean columns of the trade data that are not used in the model without having to re-run the model
Furthermore, we output one Parquet file per country, which makes it easier to debug issues with specific countries.
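A rough sketch of this pipeline shape, assuming pandas: select only the model columns, run a per-country step that adds `production_geocode`, then combine the country outputs. The function `run_country_model`, the extra `exporter_label` column, and the placeholder geocode value are hypothetical stand-ins; the real model writes one Parquet file per country, which is replaced here by in-memory DataFrames.

```python
import pandas as pd

# The four columns the model reads from the trade data (from the text above).
MODEL_COLUMNS = [
    "year",
    "country_of_production",
    "port_of_export_name",
    "mass_tonnes_raw_equivalent",
]

# Hypothetical trade data with an extra column the model does not need.
trade = pd.DataFrame({
    "year": [2020, 2020, 2020],
    "country_of_production": ["BRAZIL", "COLOMBIA", "BRAZIL"],
    "port_of_export_name": ["SANTOS", "CARTAGENA", "VITORIA"],
    "mass_tonnes_raw_equivalent": [120.0, 30.0, 45.0],
    "exporter_label": ["Trader A", "Trader B", "Trader C"],  # not a model input
})

# Select only the columns needed for the model.
model_input = trade[MODEL_COLUMNS]


def run_country_model(country_trade: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the per-country model step."""
    result = country_trade.copy()
    # Placeholder value; the real model assigns a production geocode here.
    result["production_geocode"] = "PLACEHOLDER"
    return result


# One output per country, which keeps country-specific debugging simple.
per_country = {
    country: run_country_model(group)
    for country, group in model_input.groupby("country_of_production")
}

# Combine the country outputs into a single results table.
combined = pd.concat(per_country.values(), ignore_index=True)
```

Because `exporter_label` never enters the model, it could be cleaned or corrected in the trade data without triggering a re-run.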
Per-country documentation
| Country | Level Solved | GADM Level | Coding System / Notes |
|---|---|---|---|
| Colombia | Municipality | 3 | |
| Côte d'Ivoire | District | 2 | Codes from GADM |
| Ethiopia | Region | 1 | Codes from GADM |
| Brazil | Municipality | 6 | |
| India | State/Union Territory | 1 | 28 States and 8 Union Territories. There are three possible coding systems: ISO 3166-2:IN codes (state-level only), the Census of India, and the Local Government Directory managed by the Ministry of Panchayati Raj. We only solve to the state level, using ISO 3166-2:IN codes. We add an additional region called "Northern Region" that groups some northern states and union territories. |
| Uganda | Traditional sub-regions | N/A | In Uganda we solve to traditional sub-regions (sometimes called counties of origin or historical kingdoms). In GADM, they don’t correspond to the current administrative districts (Level 2) or the 4 statistical regions (Level 1). |
| Indonesia | Province | 2 | |
| Peru | Province | 2 | |
| Tanzania | Region | 1 | |
| Vietnam | Region | 1 | |
Linear programming in Tanzania
When running the linear programming model for Tanzania we encountered two issues:
- Trade to "Unknown" Ports: About 6.8% of reported trade is directed to an "unknown" port. This destination cannot be used as a valid sink in our transportation model, which requires that all demand at each sink be fully met. Nearly all of this trade (6.7% of total trade) results from data padding to align with UN Comtrade totals.
- Supply–Demand Imbalance: Reported production is approximately 61,000 tonnes, while exports total about 68,000 tonnes.
In other words, there isn’t enough supply to meet total export demand. Tanzania had opening stocks of around 18,500 tonnes, which were presumably used to meet this demand; unfortunately, opening stocks are not accounted for in the Diet Trase model.
To ensure the linear programming model can run successfully for Tanzania, we take the following steps:
- Exclude "Unknown" Ports: Remove all trade flows to "unknown" ports from the LP step. The model therefore only considers trade between known production areas and known export ports.
- Scale up production: Increase production uniformly across all regions by +3.4% so that exports to known ports can be satisfied.
- Reintegrate "unknown" trade: After solving, reintroduce the removed trade flows to "unknown" ports, attributing them to an unspecified production source.
In short, we artificially adjust production in two ways: we add a notional component to cover exports to unknown ports, and we scale up existing production to meet exports through known ports.
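The bookkeeping around the LP step can be sketched in plain Python. This is an illustration under stated assumptions, not the model's implementation: the function names, the `"UNKNOWN"` and `"UNSPECIFIED"` labels, and all tonnages are hypothetical, the LP solve itself is omitted, and never scaling production down (the `max(1.0, ...)`) is an assumption of this sketch.

```python
UNKNOWN_PORT = "UNKNOWN"             # hypothetical label for unknown ports
UNSPECIFIED_SOURCE = "UNSPECIFIED"   # hypothetical label for the notional source


def prepare_lp_inputs(production, exports):
    """Drop unknown-port demand and scale supply to cover known-port demand.

    production: tonnes per production region.
    exports: tonnes per export port, possibly including UNKNOWN_PORT.
    """
    # Step 1: exclude trade to unknown ports from the LP.
    known_exports = {
        port: tonnes for port, tonnes in exports.items() if port != UNKNOWN_PORT
    }
    unknown_tonnes = exports.get(UNKNOWN_PORT, 0.0)

    # Step 2: scale production uniformly so known-port exports can be met.
    # Assumption of this sketch: production is only ever scaled up.
    total_supply = sum(production.values())
    total_demand = sum(known_exports.values())
    scale = max(1.0, total_demand / total_supply)
    scaled_production = {
        region: tonnes * scale for region, tonnes in production.items()
    }
    return scaled_production, known_exports, unknown_tonnes, scale


def reintegrate_unknown(flows, unknown_tonnes):
    """Step 3: add back the removed trade with an unspecified source.

    flows: (region, port, tonnes) tuples produced by the LP solve.
    """
    result = list(flows)
    if unknown_tonnes > 0:
        result.append((UNSPECIFIED_SOURCE, UNKNOWN_PORT, unknown_tonnes))
    return result


# Made-up numbers, chosen so the arithmetic is exact.
production = {"Region A": 500.0, "Region B": 300.0}
exports = {"Port X": 600.0, "Port Y": 400.0, UNKNOWN_PORT: 100.0}

scaled, known, unknown_tonnes, scale = prepare_lp_inputs(production, exports)
# The LP would be solved here with `scaled` as supply and `known` as demand.
flows = reintegrate_unknown([], unknown_tonnes)
```

With these numbers the known-port demand is 1,000 tonnes against 800 tonnes of supply, so production is scaled by 1.25 and the 100 tonnes to the unknown port are reattached afterwards.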
TODOs
- Need to read in product
- Need to use raw equivalent factors to correctly embed the metrics
- Rename deforestation to intensity