Brazil Beef
The 2019 and 2020 models require around 59GB and 50GB of memory respectively, and run in 15-20 minutes end to end.
A note on the "State of production" in the model output
The model output includes two columns: STATE_OF_PRODUCTION and MUNICIPALITY.
Curiously, they don't always match up: it is possible for the municipality to be in a different state than the state of production.
In the model, they are assigned in different ways:
| Aspect | STATE_OF_PRODUCTION | MUNICIPALITY (from origin.geocode) |
|---|---|---|
| Assigned | During decision tree (live_cattle, others) | In join_in_supply_sheds_and_aggregate_small_flows() |
| Source logic | Based on exporter registration or assumptions | Derived from supply shed data (GTA or SIGSIF) |
The difference in interpretation is not entirely clear; surely the model should identify a single location of production for each flow?
When we look at the v2.2.0 data that was published to trase.earth and the database ingestion script that was used, we can see that the STATE_OF_PRODUCTION column was completely ignored.
Instead, the "state of production" displayed on the website was derived from the MUNICIPALITY column.
For now we will continue to use the MUNICIPALITY column as the state of production, but there is some outstanding investigation that needs to be done to clarify the difference between the two columns.
tl;dr: For historical consistency we currently treat MUNICIPALITY as the authoritative "state of production", but more investigation is needed.
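As a concrete check, the mismatch between the two columns can be quantified directly from the model output. The sketch below is illustrative, not part of the pipeline: it assumes the output CSV is loaded with pandas, that MUNICIPALITY holds a 7-digit IBGE geocode (whose first two digits identify the state), and that STATE_OF_PRODUCTION holds UF abbreviations; the UF_BY_CODE lookup is deliberately truncated.

```python
import pandas as pd

# Illustrative (truncated) lookup from the two-digit IBGE state prefix
# to the UF abbreviation; a full mapping covers all 27 UFs.
UF_BY_CODE = {"11": "RO", "12": "AC", "13": "AM", "35": "SP", "51": "MT"}

def mismatch_rate(df: pd.DataFrame) -> float:
    """Share of rows where STATE_OF_PRODUCTION disagrees with the
    state implied by the MUNICIPALITY geocode."""
    implied = df["MUNICIPALITY"].astype(str).str[:2].map(UF_BY_CODE)
    return (implied != df["STATE_OF_PRODUCTION"]).mean()

df = pd.read_csv("SEIPCS_BRAZIL_BEEF_2023.csv")
print(f"{mismatch_rate(df):.1%} of flows have mismatched states")
```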
Ingestion and publication
The ingestion process has been streamlined. Rather than using the old normalized ingestion pipeline, enriched data is streamed directly to PostgreSQL via a Python script.
You don't always need to run all steps; for example, if you are changing only the embedding, you can skip step 1.
1. Run the SEI-PCS model
trase/data_pipeline/dbt run --target production --select seipcs_brazil_beef_202{1,2,3}
Output: CSV files s3://trase-storage/brazil/beef/sei_pcs/v2.2.1/SEIPCS_BRAZIL_BEEF_<year>.csv
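To confirm the model wrote one CSV per year, you can list the outputs with boto3. This is a sketch, not part of the runbook; it assumes AWS credentials are available in the environment.

```python
import boto3

# List the step-1 outputs; bucket and prefix come from the path above.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="trase-storage",
    Prefix="brazil/beef/sei_pcs/v2.2.1/SEIPCS_BRAZIL_BEEF_",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], f"{obj['Size'] / 1e6:,.0f} MB")
```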
2. Run the embedding scripts
Rscript trase/runbook/brazil/beef/indicators/embedding/a_embedding_quants.R
Rscript trase/runbook/brazil/beef/indicators/embedding/b_embedding_f500.R
Rscript trase/runbook/brazil/beef/indicators/embedding/b_zdc_preparation.R
Rscript trase/runbook/brazil/beef/indicators/embedding/c_embeddind_zdc.R
Rscript trase/runbook/brazil/beef/indicators/embedding/d_embedding_tac.R
Output: Parquet files in s3://trase-storage/brazil/beef/sei_pcs/v2.2.1/post_embedding/
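A quick way to spot-check one of the post-embedding files (a sketch: the file name shown is illustrative, and pandas needs the s3fs package installed to read directly from S3):

```python
import pandas as pd

# Read one post-embedding Parquet file straight from S3 and print its
# shape and columns as a sanity check.
path = (
    "s3://trase-storage/brazil/beef/sei_pcs/v2.2.1/"
    "post_embedding/SEIPCS_BRAZIL_BEEF_2023.parquet"  # illustrative name
)
df = pd.read_parquet(path)
print(df.shape)
print(sorted(df.columns))
```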
3. Enrich the data
Enrich the model output with Trase IDs and other information from the database, such as Exporter Groups:
# Using DBT
trase/data_pipeline/dbt run --target production --select brazil_beef_sei_pcs_v2_2_1_for_ingest_2010_2023
# Alternative: as a Python script
python trase/data/brazil/beef/sei_pcs/v2_2_1/post_embedding/brazil_beef_sei_pcs_v2_2_1_for_ingest_2010_2023.py --upload
Output: s3://trase-storage/brazil/beef/sei_pcs/v2.2.1/post_embedding/brazil_beef_sei_pcs_v2_2_1_for_ingest_2010_2023.parquet
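Before ingesting, you can inspect the enriched file's schema without downloading the row data, to confirm the joined-in columns are present. A minimal sketch using pyarrow and s3fs; which column names to look for depends on the enrichment model.

```python
import pyarrow.parquet as pq
import s3fs

# Print the Parquet schema of the enriched output without reading the
# rows themselves.
fs = s3fs.S3FileSystem()
key = (
    "trase-storage/brazil/beef/sei_pcs/v2.2.1/post_embedding/"
    "brazil_beef_sei_pcs_v2_2_1_for_ingest_2010_2023.parquet"
)
with fs.open(key, "rb") as f:
    print(pq.ParquetFile(f).schema_arrow)
```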
4. Ingest into the database
python trase/runbook/brazil/beef/trade/a_ingest_raw_supply_chain.py
The script takes around an hour and uses ~7GB peak memory.
Output: PostgreSQL table s3.brazil_beef_sei_pcs_v2_2_1_post_embedding
When the ingestion script completes successfully, it writes a table-level comment with metadata about the run: when it ran, who ran it, etc. This metadata can be viewed like so:
select obj_description('s3.brazil_beef_sei_pcs_v2_2_1_post_embedding'::regclass);
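For reference, a comment like this can be written with a plain COMMENT ON TABLE statement at the end of the run. A minimal sketch follows; the metadata fields shown are assumptions for illustration, not the ingestion script's actual format.

```python
import getpass
import json
from datetime import datetime, timezone

import psycopg2

# Write a table-level comment carrying run metadata; psycopg2
# interpolates the parameter client-side, so this produces valid SQL.
meta = json.dumps({
    "ingested_at": datetime.now(timezone.utc).isoformat(),  # assumed field
    "ingested_by": getpass.getuser(),                       # assumed field
})
conn = psycopg2.connect()  # connection details from PG* env vars
with conn, conn.cursor() as cur:
    cur.execute(
        "COMMENT ON TABLE s3.brazil_beef_sei_pcs_v2_2_1_post_embedding"
        " IS %s",
        (meta,),
    )
```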
5. Generate table in supply_chain_datasets
This table will be the starting point of the Trase Earth Data Pipeline:
export PGUSER=my_postgres_user
trase/database/trase_earth_data_pipeline/dbt deps
trase/database/trase_earth_data_pipeline/dbt run --select brazil_beef_v2_2_1
Output: PostgreSQL table: supply_chain_datasets.brazil_beef_v2_2_1
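A quick row-count-per-year sanity check on the generated table (a sketch; the year column name is an assumption about the dataset's schema):

```python
import psycopg2

# Count rows per year in the newly generated table.
conn = psycopg2.connect()  # uses PGUSER etc. from the environment
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT year, count(*)"
        " FROM supply_chain_datasets.brazil_beef_v2_2_1"
        " GROUP BY year ORDER BY year"
    )
    for year, n in cur.fetchall():
        print(year, n)
```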
6. Run the Trase Earth Data Pipeline
See instructions in trase_earth_data_pipeline/README.md
Outputs:
- https://review.trase.earth
- https://staging.trase.earth
Model Revisions
This section documents any changes which would be of interest to a consumer of the model output. We do not document backend changes (technical debt, refactors, etc.) which have no impact on the data itself.
2.2.1:
- Use SIGSIF data from 2020 at the latest, since subsequent years do not differentiate between state of slaughter and state of origin (see https://app.asana.com/0/1192763178041434/1202375302492795/f)
- Adjusted the anonymisation threshold to 500kg raw carcass equivalent (previously 250kg), and included the logistics hub and other columns in the anonymisation. This reduces the row count by a further 30% and affects only 0.83% of volume.
- Removed 2010 from the website. We do not have production data for this year, so we cannot add domestic consumption as in other years, which would leave the time series inconsistent (#5002).
- Embedding using MapBiomas Collection 10.
- Altered the way that ZDCs are calculated, in particular using "UNKNOWN" for flows from the unknown biome.