Brazil Beef
The 2019 and 2020 models require around 59GB and 50GB of memory respectively, and run in 15-20 minutes end to end.
A note on the "State of production" in the model output
The model output includes two columns: STATE_OF_PRODUCTION and MUNICIPALITY.
Curiously, they don't always match up: it is possible for the municipality to be in a different state than the state of production.
In the model, they are assigned in different ways:
| Aspect | STATE_OF_PRODUCTION | MUNICIPALITY (from origin.geocode) |
|---|---|---|
| Assigned | During decision tree (live_cattle, others) | In join_in_supply_sheds_and_aggregate_small_flows() |
| Source logic | Based on exporter registration or assumptions | Derived from supply shed data (GTA or SIGSIF) |
The difference in interpretation is not entirely clear; surely the model should identify a single location of production for each flow?
When we look at the v2.2.0 data that was published to trase.earth and the database ingestion script that was used, we can see that the STATE_OF_PRODUCTION column was completely ignored.
Instead, the "state of production" displayed on the website was derived from the MUNICIPALITY column.
For now we will continue to use the MUNICIPALITY column as the state of production, but there is some outstanding investigation that needs to be done to clarify the difference between the two columns.
tl;dr: For historical consistency we currently treat MUNICIPALITY as the authoritative "state of production", but more investigation is needed.
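As a concrete check, the mismatch between the two columns can be quantified directly from the model output. The sketch below is illustrative, not part of the pipeline: it assumes the output CSV is loaded with pandas, that MUNICIPALITY holds a 7-digit IBGE geocode (whose first two digits identify the state), and that STATE_OF_PRODUCTION holds UF abbreviations; the UF_BY_CODE lookup is deliberately truncated.

```python
import pandas as pd

# Illustrative (truncated) lookup from the two-digit IBGE state prefix
# to the UF abbreviation; a full mapping covers all 27 UFs.
UF_BY_CODE = {"11": "RO", "12": "AC", "13": "AM", "35": "SP", "51": "MT"}

def mismatch_rate(df: pd.DataFrame) -> float:
    """Share of rows where STATE_OF_PRODUCTION disagrees with the
    state implied by the MUNICIPALITY geocode."""
    implied = df["MUNICIPALITY"].astype(str).str[:2].map(UF_BY_CODE)
    return (implied != df["STATE_OF_PRODUCTION"]).mean()

df = pd.read_csv("SEIPCS_BRAZIL_BEEF_2023.csv")
print(f"{mismatch_rate(df):.1%} of flows have mismatched states")
```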
Ingestion and publication
The ingestion process has been streamlined. Rather than using the old normalized ingestion pipeline, enriched data is streamed directly to PostgreSQL via a Python script.
You don't always need to run all steps; for example, if you are changing only the embedding, you can skip step 1.
1. Run the SEI-PCS model
trase/data_pipeline/dbt run --target production --select seipcs_brazil_beef_202{1,2,3}
Output: CSV files s3://trase-storage/brazil/beef/sei_pcs/v2.2.1/SEIPCS_BRAZIL_BEEF_<year>.csv
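To confirm the model wrote one CSV per year, you can list the outputs with boto3. This is a sketch, not part of the runbook; it assumes AWS credentials are available in the environment.

```python
import boto3

# List the step-1 outputs; bucket and prefix come from the path above.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="trase-storage",
    Prefix="brazil/beef/sei_pcs/v2.2.1/SEIPCS_BRAZIL_BEEF_",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], f"{obj['Size'] / 1e6:,.0f} MB")
```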
2. Run the embedding scripts
Rscript trase/runbook/brazil/beef/indicators/embedding/a_embedding_quants.R
Rscript trase/runbook/brazil/beef/indicators/embedding/b_embedding_f500.R
Rscript trase/runbook/brazil/beef/indicators/embedding/b_zdc_preparation.R
Rscript trase/runbook/brazil/beef/indicators/embedding/c_embeddind_zdc.R
Rscript trase/runbook/brazil/beef/indicators/embedding/d_embedding_tac.R
Output: Parquet files in s3://trase-storage/brazil/beef/sei_pcs/v2.2.1/post_embedding/
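A quick way to spot-check one of the post-embedding files (a sketch: the file name shown is illustrative, and pandas needs the s3fs package installed to read directly from S3):

```python
import pandas as pd

# Read one post-embedding Parquet file straight from S3 and print its
# shape and columns as a sanity check.
path = (
    "s3://trase-storage/brazil/beef/sei_pcs/v2.2.1/"
    "post_embedding/SEIPCS_BRAZIL_BEEF_2023.parquet"  # illustrative name
)
df = pd.read_parquet(path)
print(df.shape)
print(sorted(df.columns))
```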
3. Enrich the data
Enrich the model output with Trase IDs and other information from the database, such as Exporter Groups:
# Using DBT
trase/data_pipeline/dbt run --target production --select brazil_beef_sei_pcs_v2_2_1_for_ingest_2010_2023
# Alternative: as a Python script
python trase/data/brazil/beef/sei_pcs/v2_2_1/post_embedding/brazil_beef_sei_pcs_v2_2_1_for_ingest_2010_2023.py --upload
Output: s3://trase-storage/brazil/beef/sei_pcs/v2.2.1/post_embedding/brazil_beef_sei_pcs_v2_2_1_for_ingest_2010_2023.parquet
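Before ingesting, you can inspect the enriched file's schema without downloading the row data, to confirm the joined-in columns are present. A minimal sketch using pyarrow and s3fs; which column names to look for depends on the enrichment model.

```python
import pyarrow.parquet as pq
import s3fs

# Print the Parquet schema of the enriched output without reading the
# rows themselves.
fs = s3fs.S3FileSystem()
key = (
    "trase-storage/brazil/beef/sei_pcs/v2.2.1/post_embedding/"
    "brazil_beef_sei_pcs_v2_2_1_for_ingest_2010_2023.parquet"
)
with fs.open(key, "rb") as f:
    print(pq.ParquetFile(f).schema_arrow)
```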
4. Ingest into the database
python trase/runbook/brazil/beef/trade/a_ingest_raw_supply_chain.py
The script takes around an hour and uses ~7GB peak memory.
Output: PostgreSQL table s3.brazil_beef_sei_pcs_v2_2_1_post_embedding
When the ingestion script completes successfully, it writes a table-level comment with metadata about the run: when it ran, who ran it, etc. This metadata can be viewed like so:
select obj_description('s3.brazil_beef_sei_pcs_v2_2_1_post_embedding'::regclass);
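For reference, a comment like this can be written with a plain COMMENT ON TABLE statement at the end of the run. A minimal sketch follows; the metadata fields shown are assumptions for illustration, not the ingestion script's actual format.

```python
import getpass
import json
from datetime import datetime, timezone

import psycopg2

# Write a table-level comment carrying run metadata; psycopg2
# interpolates the parameter client-side, so this produces valid SQL.
meta = json.dumps({
    "ingested_at": datetime.now(timezone.utc).isoformat(),  # assumed field
    "ingested_by": getpass.getuser(),                       # assumed field
})
conn = psycopg2.connect()  # connection details from PG* env vars
with conn, conn.cursor() as cur:
    cur.execute(
        "COMMENT ON TABLE s3.brazil_beef_sei_pcs_v2_2_1_post_embedding"
        " IS %s",
        (meta,),
    )
```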
5. Generate table in supply_chain_datasets
This table will be the starting point of the Trase Earth Data Pipeline:
export PGUSER=my_postgres_user
trase/database/trase_earth_data_pipeline/dbt deps
trase/database/trase_earth_data_pipeline/dbt run --select brazil_beef_v2_2_1
Output: PostgreSQL table: supply_chain_datasets.brazil_beef_v2_2_1
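A quick row-count-per-year sanity check on the generated table (a sketch; the year column name is an assumption about the dataset's schema):

```python
import psycopg2

# Count rows per year in the newly generated table.
conn = psycopg2.connect()  # uses PGUSER etc. from the environment
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT year, count(*)"
        " FROM supply_chain_datasets.brazil_beef_v2_2_1"
        " GROUP BY year ORDER BY year"
    )
    for year, n in cur.fetchall():
        print(year, n)
```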
6. Run the Trase Earth Data Pipeline
See instructions in trase_earth_data_pipeline/README.md
Outputs:
- https://review.trase.earth
- https://staging.trase.earth
Model Revisions
This section documents any changes which would be of interest to a consumer of the model output. We do not document backend changes (technical debt, refactors, etc.) which have no impact on the data itself.
2.2.1:
- Use SIGSIF data from 2020 at the latest, since subsequent years do not differentiate between state of slaughter and state of origin (see https://app.asana.com/0/1192763178041434/1202375302492795/f)
- Adjusted the anonymisation threshold to 500kg raw carcass equivalent (previously 250kg), and included the logistics hub and other columns in the anonymisation. This reduces the row count by a further 30% and affects only 0.83% of volume.
- Removed 2010 from the website. We do not have production data for this year, so we cannot add domestic consumption as in other years, which would leave the time series inconsistent (#5002).
- Embedding using MapBiomas Collection 10.
- Altered the way that ZDCs are calculated, in particular using "UNKNOWN" for flows from the unknown biome.