
GTA


This page is synchronized from trase/data/brazil/logistics/gta/README.md. Last modified on 2025-12-14 23:19 CET by Trase Admin. Please view or edit the original file there; changes should be reflected here after the midnight build (CET), or after manually triggering it with a GitHub Action (link).

Description of the GTA cleaning process

Contact Erasmus zu Ermgassen (erasmus.zuermgassen@uclouvain.be) with queries.

These scripts load GTAs (Guias de Trânsito Animal, Brazilian animal transport permits) from three sources with five different file structures and standardize them.

The scripts should be run in order:

  1. 1_CLEAN_GTAS.R
  2. 2_BUILD_PROPERTY_NAMING_DIC.R
  3. 3_JOIN_TO_PROPERTY_NAMING_DIC.R

These scripts call on a series of helper functions in: HELPERS/GTA_CLEANING_FUNCTIONS.R

And produce files with the following column structure:

standardized_colnames <-
    c(
      "ID",
      "INFO_STATUS",
      "TRANSPORT_DATE",
      "TRANSPORT_YEAR",
      "DESTINATION_CODE",
      "DESTINATION_CITY",
      "DESTINATION_NAME",
      "DESTINATION_FARMER",
      "DESTINATION_TAX_NUMBER",
      "DESTINATION_STATE",
      "DESTINATION_TYPE",
      "DESTINATION_GEOCODE",
      "TIMELINE_OBS",
      "SPECIES_NAME",
      "SPECIES_PURPOSE",
      "SPECIES_GROUP",
      "ORIGIN_CODE",
      "ORIGIN_CITY",
      "ORIGIN_NAME",
      "ORIGIN_FARMER",
      "ORIGIN_TAX_NUMBER",
      "ORIGIN_STATE",
      "ORIGIN_TYPE",
      "ORIGIN_GEOCODE",
      "TRANSPORT",
      "ANIMALS",
      "SLAUGHTER_GTA", 
      "SPECIES",
      "NUM_ANIMALS_ASSUMED",
      "GTA_SOURCE"
    )
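
As a hedged illustration (the function name and logic here are hypothetical, not taken from the scripts), a loaded file can be coerced to this column structure by adding any absent columns as NA_character_ and enforcing the order:

```r
# Hypothetical sketch (not from the scripts): coerce a cleaned GTA data
# frame to the standardized column set defined above.
standardize_gta_columns <- function(df, wanted = standardized_colnames) {
  missing_cols <- setdiff(wanted, names(df))
  if (length(missing_cols) > 0) {
    df[missing_cols] <- NA_character_  # add absent columns as character NA
  }
  df[, wanted, drop = FALSE]           # drop extras and enforce the order
}
```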

Structure on S3

# ├── data
#     ├── GTA
#          ├── ORIGINALS          (raw data)
#          |   ├── data2value     (from Neural Alpha)
#          |       ├── 2017_COLLECTION (stored as .json files)
#          |       ├── 2019_COLLECTION (stored as .json.gz files)
#          |   ├── REPORTER       (from Reporter Brasil)
#          |       ├── 2017_COLLECTION
#          |       ├── 2019_COLLECTION
#          |       ├── 2020_COLLECTION
#          |   ├── RIBEIRO        (from Vivian Ribeiro)
#          |   ├── IMAFLORA
#          |   ├── INDEA_MT (we don't use these - they are the same as the Imaflora ones, but with less info)
#          |  
#          ├── OUT
#              ├── BOV (bovine GTAs)
#                  ├── CLEAN_STEP_1 (output of this script)
#                  |   ├── SP, TO, etc. one file per year and state.
#                  ├── CLEAN_STEP_3 (output of script 3_JOIN_TO_PROPERTY_NAMING_DICTIONARY.R)
#              ├── POU (poultry: chicken, duck, goose, guinea-fowl, capoeira GTAs) 
#              ├── POR (pig GTAs)
#              ├── EQU (horse, mule, donkey GTAs)
#              ├── SHE (sheep GTAs)
#              ├── GOA (goat GTAs)
#              ├── BUF (buffalo GTAs)
#              ├── AQU (fish GTAs)
#              ├── WIL (wild/ornamental animal GTAs)
#              ├── OTH (other species' GTAs)
#              ├── UNK (UNKNOWN species, or a mix of species/conflicting info in different species columns)

Notes:

1_CLEAN_GTAS.R was run on the UCLouvain CECI server as a parallelised 'job array' script; the submission script for this process is in HELPERS/submit_clean.sh. One state, RS, may need more memory to complete processing.

Changes from first version (V2.0.0):

  • GTAs for all species are cleaned and standardized (not just cattle)
  • All data loaded to/from S3, and S3 bucket is restructured for easier use.
  • Where I previously tried to identify the 'correct slaughterhouse' for each slaughter movement, I have now dropped the scripts 4_IDENTIFY_CORRECT_SLAUGHTERHOUSE.R and 5_ASSIGN_CORRECT_SLAUGHTERHOUSE.R. These handled cases where the destination tax number for slaughter was not the slaughterhouse itself (e.g. a 'third-party' buyer), but that step over-complicates the basic Trase use case: SEI-PCS. By skipping it we miss some cattle per slaughterhouse, but we still have ample data to identify their sourcing shed, and we avoid introducing potential errors from mis-allocations.
  • 3_JOIN_TO_PROPERTY_NAMING_DIC uses two dictionaries: one based on the 'property code', the other based on the property/owner names.
  • Load the standardised version of the geocode dictionary from GitHub (rather than my local version).
  • Converted column names to SCREAMING_SNAKE_CASE, in line with Trase coding style guidelines.
  • I've replaced empty cells ("" or " ") with NA_character_.
  • Where before I replaced erroneous or missing TAX_NUMs with the "id", they are now set to NA.
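
The empty-cell replacement described in the list above can be sketched in base R as follows (a sketch only; the actual scripts may implement this differently):

```r
# Sketch: replace "" or " " (any whitespace-only string) with
# NA_character_ in every character column of a data frame.
blank_to_na <- function(df) {
  chr_cols <- vapply(df, is.character, logical(1))
  df[chr_cols] <- lapply(df[chr_cols], function(x) {
    blank <- !is.na(x) & trimws(x) == ""  # guard against NA in the index
    x[blank] <- NA_character_
    x
  })
  df
}
```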

Things to improve in next version/update

  • Update the brazil municipalities tool with missing municipality-GEOCODE matches (stored in tmp/geocodes_to_check/).
  • Specify na = 'string' in toJSON(), e.g. in fn_export_gtas_per_year_split(). This prevents columns of NAs from being dropped when the data are exported, and means we won't need a custom function to load the processed GTAs. NB these NA columns will by default be imported as type 'logical' and will need to be converted to character to avoid a class() conflict when loading multiple files, e.g. with map_df().
  • Drop MISSING_INFO column in fn_load_raw_gtas()
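
To illustrate the toJSON() point above: with jsonlite's default settings, NA fields are omitted from row objects, so an all-NA column disappears from the exported JSON; na = "string" keeps it (the column names below are illustrative, not from the real data):

```r
library(jsonlite)

df <- data.frame(ID = c("1", "2"), TIMELINE_OBS = c(NA, NA))

toJSON(df)                 # default: NA fields omitted, column vanishes
toJSON(df, na = "string")  # NA written as the string "NA", column survives
```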

Changes still to make to current version:

  • Rename GTA_CLEANING_FUNCTIONS.R -> GTA_PROCESSING_FUNCTIONS.R
  • Update the methods doc to correct "We use year and state-specific estimates of the slaughter rate S (Equation 4), calculated as the herd size divided by the number of cattle slaughtered per state".