
Brazilian beef SEIPCS v2.2.0

Library of code for subnational mapping of Brazilian beef SEI-PCS. Lead author: Erasmus zu Ermgassen (erasmus.zuermgassen@uclouvain.be).

Workflow

The workflow to re-run BR-beef SEI-PCS is to:

  1. Update BR-beef logistic hubs
  2. Identify sourcing per slaughterhouse
  3. Implement the DT (decision tree)
  4. Join municipal-supply sheds to each logistic hub
  5. Calculate municipal cattle production

Update BR-beef logistic hubs

In v2.2.0, I exported the CNPJs for Neural Alpha to query (3_LOGISTICS/BRAZIL/GTA/NETWORK/EXPORT_TAX_NUM_TO_QUERY.R) before consolidating all the logistic hub data, because of time constraints - updating the SIE, SISBI data etc. is time consuming.

In future, before processing the GTAs, I would ideally create a new version of the beef logistic hub data (which would then be called when identifying CNPJs to do traversal searches on, in 3_LOGISTICS/BRAZIL/GTA/NETWORK/EXPORT_TAX_NUM_TO_QUERY.R).

In either case, identifying slaughterhouses (whether as part of the logistics map or not) requires cleaning the SISBOV, SISBI, SIF, SIE, SIM, and Monitac datasets, described in turn below.

Clean SIF data

Ran (in order):

  • data/brazil/logistics/sanitary_inspections/animal_products/SIF/scrape_sigsif_estabelecimento.R.
  • data/brazil/logistics/sanitary_inspections/animal_products/SIF/scrape_sigsif_approved_exporters.R.
  • data/brazil/logistics/sanitary_inspections/animal_products/SIF/2020_CLEAN_SIF_EXPORT_PERMISSIONS.R.
  • data/brazil/logistics/sanitary_inspections/animal_products/SIF/2020_CLEAN_SIF.R.

Clean SISBI data

  • data/brazil/logistics/sanitary_inspections/animal_products/SISBI/2020_scrape_clean_sisbi_data.R.

Note that the 2020 SISBI website is completely different from, and a lot more detailed than, the 2018 source (which was only accessible as individual CSVs, saved in trase-storage~brazil/logistics/sanitary_inspections/animal_products/sisbi/downloaded_data/).

When updating these in future, we'll need to check the format of the data (and the number of facilities in the SISBI data), as I suspect the data/website may change again.

Clean SIE data

A list of state websites where slaughterhouse data are available is stored in trase-storage~brazil/logistics/sanitary_inspections/animal_products/sie/SIE_states.xlsx. The raw data files are saved in trase-storage~brazil/logistics/sanitary_inspections/animal_products/sie/raw/2020/ and processed in trase/data/brazil/logistics/sanitary_inspections/animal_products/SIE/clean_brazil_sie_2020.R to produce output files in trase-storage~brazil/logistics/sanitary_inspections/animal_products/sie/out/. The README.txt in S3 explains that there are two versions of the data:

  1. unified: 2020-06-06-SIE_ALL.json
  2. 2020-06-08-SIE_CLEAN.csv (subset for BEEF/PORK/CHICKEN SIE facilities only, with CNPJ, where available)

When updating the data in future, SIE_states.xlsx needs to be updated (manually, a day or two's work), and then the cleaning scripts re-run.

Scrape Monitac

  • Data from: http://monitac.oeco.org.br/
  • Script: trase/tools/scrapers/SCP_Monitac.R
  • Output data: brazil/indicators/performance/environmental/monitac/MONITAC_scrape_2021-02-24.csv

In future we should consider using Boi na Linha data instead.

Identify sourcing per slaughterhouse

Process GTAs

First, clean and standardize the GTAs by running the following scripts. You may need to update HELPERS/GTA_CLEANING_FUNCTIONS.R if the format of the input GTA data has changed at all.

  1. 3_LOGISTICS/BRAZIL/GTA/1_CLEAN_GTAS.R
  2. 3_LOGISTICS/BRAZIL/GTA/2_BUILD_PROPERTY_NAMING_DIC.R
  3. 3_LOGISTICS/BRAZIL/GTA/3_JOIN_TO_PROPERTY_NAMING_DIC.R

These scripts were submitted on the UCLouvain CECI servers using the bash scripts HELPERS/submit_gta_clean_step1.sh (and _step2, _step3).

Second, identify the list of CNPJs for which to trace the supply network - this list is produced in 3_LOGISTICS/BRAZIL/GTA/NETWORK/EXPORT_TAX_NUM_TO_QUERY.R (as explained above, ideally this would be based on the logistic hubs). Share this list with Neural Alpha for a traversal search operation. This list could be improved in future to capture more CNPJs (SIM and SIE slaughterhouses).

Third, export to S3 the GTAs in the supply network of each logistic hub (slaughterhouse or live cattle exporter):

  1. export_network.R (submitted on the CECI server using submit_gta_network_export.sh).
  2. check_export_network.R - double-checks that the export did not silently drop any GTAs.

Fourth, calculate the municipal supply shed of all logistic hubs.

This was done in data/brazil/logistics/gta/supply_shed/MUNICIPAL_LEVEL_FLOWS.R, which was submitted on the UCLouvain CECI servers using the bash scripts submit_supply_shed.sh (and submit_supply_shed_v3.sh). See the README in the folder where these scripts live for more detail.
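For orientation, here is a minimal sketch of the kind of municipal aggregation involved, assuming hypothetical column names (DEST_CNPJ, DEST_GEOCODE, ORIG_GEOCODE, N_ANIMALS) for the cleaned, network-matched GTAs; MUNICIPAL_LEVEL_FLOWS.R is the authoritative implementation:

```r
library(dplyr)

# gtas: cleaned GTA movements in each hub's supply network (hypothetical columns).
# Municipal supply shed per destination hub: how many animals each
# origin municipality contributes, and its share of the hub's total.
municipal_supply_shed <- gtas %>%
  group_by(DEST_CNPJ, DEST_GEOCODE, ORIG_GEOCODE) %>%
  summarise(N_ANIMALS = sum(N_ANIMALS), .groups = "drop_last") %>%
  mutate(SHARE = N_ANIMALS / sum(N_ANIMALS)) %>%
  ungroup()
```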

Note: In v2.2.0 we calculate a single municipal supply shed for each slaughterhouse - i.e. we used the same supply shed in 2010, 2011, 2019, etc.

We did this because we assessed annual variation in 3_LOGISTICS/BRAZIL/GTA/SUPPLY_SHED/ASSESS_ANNUAL_VARIATION.R and 3_LOGISTICS/BRAZIL/GTA/SUPPLY_SHED/ASSESS_ANNUAL_VARIATION.Rmd (the latter script visualises analyses done in the former) and found that at the municipal level these supply sheds are very consistent, with a mean and median multiple R-squared > 0.9. Using a single supply shed also smooths over years where we have incomplete GTA data.
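As a rough illustration of that test (hypothetical column names, two years only; the real scripts compare all years and do considerably more), the idea is to regress one year's municipal shares on another's per hub and collect the multiple R-squared:

```r
library(dplyr)
library(tidyr)

# shed_by_year: DEST_CNPJ, ORIG_GEOCODE, YEAR, SHARE (hypothetical columns).
r_squared_per_hub <- shed_by_year %>%
  filter(YEAR %in% c(2015, 2019)) %>%
  pivot_wider(names_from = YEAR, values_from = SHARE,
              names_prefix = "Y", values_fill = 0) %>%
  group_by(DEST_CNPJ) %>%
  group_map(~ summary(lm(Y2019 ~ Y2015, data = .x))$r.squared) %>%
  unlist()

mean(r_squared_per_hub)
median(r_squared_per_hub)
```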

Fifth, these municipal-level supply sheds per CNPJ need to be aggregated per CNPJ8 (the first eight digits of the CNPJ, which identify the company) and municipality. Slaughter businesses may register multiple CNPJs per GEOCODE (i.e. have multiple plants per location: one for slaughter, one for processing, etc.). When linking logistic hubs to the cattle they buy, we therefore want to do this based on the CNPJ8 and GEOCODE, to avoid missing any movements made to a sister CNPJ of the one identified as the slaughterhouse.

This was done in trase/data/brazil/logistics/gta/supply_shed/CNPJ8_SUPPLY_SHED.R.
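A minimal sketch of that aggregation, reusing the hypothetical column names from the supply-shed sketch above; CNPJ8_SUPPLY_SHED.R is the authoritative version:

```r
library(dplyr)
library(stringr)

cnpj8_supply_shed <- municipal_supply_shed %>%
  # CNPJ8 = the first eight digits of the CNPJ, shared by sister plants.
  mutate(DEST_CNPJ8 = str_sub(DEST_CNPJ, 1, 8)) %>%
  # Aggregate over sister CNPJs at the same location (CNPJ8 + GEOCODE).
  group_by(DEST_CNPJ8, DEST_GEOCODE, ORIG_GEOCODE) %>%
  summarise(N_ANIMALS = sum(N_ANIMALS), .groups = "drop_last") %>%
  mutate(SHARE = N_ANIMALS / sum(N_ANIMALS)) %>%
  ungroup()
```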

Also update the README.txt in brazil/logistics/gta/network/BOV/

SIGSIF

  • First run brazil/production/statistics/sigsif/original/SIGSIF_SLAUGHTER_2022.py to download the most recent SIGSIF data on the municipal origin of animals slaughtered in SIF facilities.
  • The data is exported as CSV to S3 in: brazil/production/statistics/sigsif/in.
  • Then run trase/data/brazil/production/statistics/sigsif/out/archive/SIF_2021_2022.py to preprocess the data and export it (with GEOCODE added; see the sketch after this list) to: brazil/production/statistics/sigsif/out.
  • Then run brazil/production/statistics/sigsif/clean_bovine_sigsif_municipal_slaughter_data.Rmd to produce brazil/production/statistics/sigsif/out/br_bovine_slaughter_2015_2022.csv.
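The preprocessing itself is done in Python, but as a rough illustration of the GEOCODE step, the idea is simply to join the SIGSIF municipality and state names onto an IBGE geocode lookup (the lookup file and column names below are hypothetical):

```r
library(dplyr)
library(stringr)

# Hypothetical IBGE lookup with columns GEOCODE, MUNICIPALITY, STATE.
ibge <- read.csv("ibge_municipality_geocodes.csv") %>%
  mutate(KEY = str_to_upper(
    iconv(paste(MUNICIPALITY, STATE, sep = "-"), to = "ASCII//TRANSLIT")))

# sigsif: hypothetical ORIGIN_MUNICIPALITY and ORIGIN_STATE columns.
sigsif_with_geocode <- sigsif %>%
  mutate(KEY = str_to_upper(
    iconv(paste(ORIGIN_MUNICIPALITY, ORIGIN_STATE, sep = "-"), to = "ASCII//TRANSLIT"))) %>%
  left_join(select(ibge, KEY, GEOCODE), by = "KEY")
```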

Implement the DT

Implemented in deforestationfree.

A tree-diagram can be found HERE.

Specify parent-subsidiary relationships in BEEF_EXPORTER_SUBSIDIARIES.csv (my mistake: I cannot find the script used to generate this - it may even have been made manually).

Note: I think the script trase/data/brazil/specify_beef_company_relationships.R, which produced the similar brazil/dictionaries/parent_company_dict_with_beef.csv, is defunct, as it is not called in the decision tree.

Note: I implement no special rules in this decision tree. The links between Zanchetta Alimentos and Mondelli, and between J.N.J. COMERCIAL and FRIGORIFICO ASTRA DO PARANA, are coded in BEEF_EXPORTER_SUBSIDIARIES.csv, and the origin of JBS' exports from RJ is now marked as unknown, rather than being linked to CNPJ 02916265003266 in TRES RIOS (GEOCODE: 3306008), because on double-checking this is a canned-goods factory (LEVEL 4) and we don't really know the origin of the cattle processed in the facility.

Join municipal-supply sheds to each logistic hub

Implemented in deforestationfree.

QA of these files (comparing versions, looking for peculiarities) was done in trase/data/brazil/beef/sei_pcs/qa_br_beef_seipcs_output.R.

Calculate municipal cattle production

This is technically under the indicators workstream, but Erasmus provided the calculation of 'carcass production' per municipality to Vivi.

These data on 'carcass production' are produced as follows (a rough sketch of the combination follows this list):

  • Run brazil/production/statistics/ibge/cattle/ppm/original/cattle_production_1974_2021.py to download brazil/production/statistics/ibge/cattle/ppm/original/cattle_production_1974_2021.csv to S3.
  • brazil/production/statistics/ibge/cattle/ppm/out/cattle_production_1974_2021.py (producing brazil/production/statistics/ibge/cattle/ppm/out/PPM_BOV_HERD_1974_2021.csv).
  • brazil/beef/production/statistics/anualpec/2021_CLEAN_ANUALPEC_BOV_ABATE_DATA.R (producing brazil/production/statistics/anualpec/out/ANUALPEC_BOV_ABATE_2021.csv).
  • brazil/beef/production/statistics/ibge/abate/CARCASS_WEIGHTS.R (producing 2022-06-02-STATE_CARCASS_WEIGHTS.csv).
  • brazil/production/statistics/sigsif/out/SIGSIF.csv - I'm not actually sure who wrote the script to clean the SIGSIF data; it wasn't me.
  • Finally, trase/data/brazil/beef/production/cattle_production_2010_2021.Rmd knits all the data together to make brazil/production/statistics/ibge/beef/out/2023-03-10-CATTLE_PRODUCTION_5_YR.csv.
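A loose sketch of the kind of combination involved, under the simplifying assumption that municipal carcass production is animals slaughtered times the state average carcass weight (hypothetical object and column names; the .Rmd above is the authoritative calculation and blends the PPM, ANUALPEC and SIGSIF sources):

```r
library(dplyr)

# slaughter: GEOCODE, STATE, YEAR, N_SLAUGHTERED (hypothetical columns).
# carcass_weights: STATE, YEAR, CARCASS_KG (from the state carcass weights file).
carcass_production <- slaughter %>%
  left_join(carcass_weights, by = c("STATE", "YEAR")) %>%
  mutate(CARCASS_TONNES = N_SLAUGHTERED * CARCASS_KG / 1000)
```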

QA of v2.2.0 data

The script used for QA of the v2.2.0 data is trase/data/brazil/beef/sei_pcs/qa_br_beef_seipcs_output.R. This includes comparisons with previous versions of the SEI-PCS data and computes some summary stats reported in the methods doc.
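For example, the version-comparison part of the QA boils down to checks like the following (hypothetical file names and a hypothetical VOLUME column; the real checks live in the script):

```r
library(dplyr)
library(tidyr)

v210 <- read.csv("br_beef_seipcs_v2.1.0.csv")
v220 <- read.csv("br_beef_seipcs_v2.2.0.csv")

# Compare total volume per year between versions and flag years
# where the two differ by more than 5%.
bind_rows(mutate(v210, VERSION = "v2.1.0"),
          mutate(v220, VERSION = "v2.2.0")) %>%
  group_by(VERSION, YEAR) %>%
  summarise(VOLUME = sum(VOLUME), .groups = "drop") %>%
  pivot_wider(names_from = VERSION, values_from = VOLUME) %>%
  mutate(PCT_DIFF = 100 * (`v2.2.0` - `v2.1.0`) / `v2.1.0`) %>%
  filter(abs(PCT_DIFF) > 5)
```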

Changes made to v2.2.0 vs v2.0.0 and v2.1.0

  • Extended the SEI-PCS time series to cover 2010-2019 (the prior version covered 2015-2017 only).
  • The entire workflow now pulls data from S3 only (no local data, phew).
  • The DT is implemented in deforestationfree.
  • The processing of the customs data is now handled by Harry Biddle (living legend) and others in the data team.

Things to improve in v2.2.1 (i.e. future versions)

  • Specify na = 'string' in toJSON(), e.g. in fn_export_gtas_per_year_split(). This prevents columns of NAs from being dropped when the data are exported, and will mean we don't need a custom function to load the processed GTAs. NB these NA columns will by default be imported as type 'logical' and will need to be converted to character to avoid class() conflicts if loading multiple files, e.g. with map_df(). See the sketch after this list.
  • Drop MISSING_INFO column in fn_load_raw_gtas()
  • Fix the DESCRIPTION of bovine animals when cleaning the GTAs. Currently there are a variety of alternatives (as specified in 'age_weight_dictionary') which could be standardized to the same format, e.g. "BOVINO,FEMEA,0 A 12 MESES". Also be sure to correct "BOVINA" -> "BOVINO", and remove ",DE " e.g. in ",DE 0 A 12 MESES". Examples of how to clean these are in the fn_identify_inter_muni_flows() function (see also the sketch after this list).
  • Drop GTAs with INFO_STATUS == "INATIVA" - these are cancelled GTAs. That filtering step can then be removed from fn_load_gtas_filter_for_lifecycle().
  • Do traversal search on a CNPJ8 and GEOCODE, not a CNPJ. This will pick up 'linked CNPJs' and cut out a step in the GTA-processing, but will require adding a CNPJ8 column to the GTAs.
  • Update the version of the CNPJs called in all analyses.
  • Investigate the origin of JBS’ meat products in Lins, SANTO ANTÔNIO DE POSSE, SAO PAULO.
  • I can add 0.2% more matching in the decision tree by adding a special rule for RXM, based on the CD_GEOCODE and the SIFs they report they source from: https://www.rxm.com.br/en/parceiros/
  • Rename GTA_CLEANING_FUNCTIONS.R -> GTA_PROCESSING_FUNCTIONS.R
  • Move USEFUL_SNIPPETS.R (currently in tools/seipcs_old folder)
  • Delete all scripts which do not appear in the current workflow.
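A minimal sketch of several of the GTA-processing improvements listed above (the na = 'string' export, the logical-to-character conversion on load, the INATIVA filter, the DESCRIPTION standardization, and the CNPJ8 column). The function and column names below are hypothetical stand-ins for the real ones in HELPERS/GTA_CLEANING_FUNCTIONS.R:

```r
library(dplyr)
library(purrr)
library(stringr)
library(jsonlite)

# Export with na = "string" so all-NA columns are kept in the JSON output.
export_gtas <- function(gtas, path) {
  write_json(gtas, path, dataframe = "rows", na = "string")
}

# Load several yearly files; all-NA columns may come back as logical,
# so coerce them to character before binding to avoid class() conflicts.
load_gtas <- function(paths) {
  map_df(paths, function(p) {
    fromJSON(p) %>%
      mutate(across(where(is.logical), as.character))
  })
}

clean_gtas <- function(gtas) {
  gtas %>%
    # Drop cancelled GTAs.
    filter(INFO_STATUS != "INATIVA") %>%
    mutate(
      # Standardize the bovine DESCRIPTION strings.
      DESCRIPTION = str_replace_all(DESCRIPTION, "BOVINA", "BOVINO"),
      DESCRIPTION = str_replace_all(DESCRIPTION, ",DE ", ","),
      # CNPJ8 = the eight-digit company root, so traversal searches
      # can be done per CNPJ8 + GEOCODE rather than per CNPJ.
      DEST_CNPJ8 = str_sub(DEST_CNPJ, 1, 8)
    )
}
```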