
SEI-PCS: Spatially Explicit Information on Production to Consumption Systems

The term "SEI-PCS" refers to the modelling technique we use to produce our "core data product" which traces international trade shipments back to subnational regions of production.

[!TIP] See also:

Workflow

The broad strokes of the SEI-PCS workflow are as follows:

graph TD;    
    Download-->Preprocessing-->Modelling-->qa[Quality Assurance]-->Embedding-->Ingestion-->Publication
    qa-->Preprocessing
    qa-->Modelling
Download
Locate raw data from its source (online, via APIs, etc) and download it to Amazon S3.
Pre-processing
Clean the datasets using Python or R scripts and perform quality assurance.
Modelling
Use the clean datasets as an input to the SEI-PCS model, to produce model output.
Quality assurance
Verify the model output is high-quality and go back to the pre-processing or modelling stage if it is not.
Embedding
Add one or more columns containing spatially-explicit metrics such as deforestation into the model output.
Ingestion
Store the embedded model output in the Trase database.
Publication
Make the data available on https://trase.earth.
Wrap-up
Complete any outstanding documentation.

After this process, there is a downstream process across Trase to communicate the new dataset, publish analyses of the key points that the data demonstrate, and so on. That is not covered here!

We generally have one SEI-PCS per "context", which we define as a commodity in a country, for example "Brazil soy".

Work on SEI-PCS models can be organised into three categories:

  • New context/model: creating a brand-new model for a context that Trase has not covered before.
  • Context/model update: taking an existing model and adding one or more new years' worth of data, without significantly altering the model.
  • Context/model revision: as above, but with significant alterations to the model.

[!TIP] See also:

Quality Assurance

Quality assurance of the trade flows is organised into three "rounds":

  • Round 1: Basic checks that input matches output, volume is preserved, etc. (a minimal sketch of such checks follows this list). Here is an example.
  • Round 2: Simple data analysis, involving statistics and charts, checking that yearly trends are preserved, there are no outliers, etc. Here is an example.
  • Round 3: A "sanity check" from an expert on that particular context, often referred to as expert review.
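For illustration, a round-1 check can be as simple as the pandas sketch below. The file names and column names (VOLUME, YEAR) are hypothetical; the real checks live in the per-context runbooks and notebooks.

```python
# Minimal sketch of round-1 checks: total volume in the model output should match
# the cleaned input trade data, and no flow should have a negative or missing volume.
# File paths and column names ("VOLUME", "YEAR") are hypothetical.
import pandas as pd

input_trade = pd.read_csv("cleaned_trade_data.csv")
model_output = pd.read_csv("SEI_PCS_COTE_DIVOIRE_COCOA_2021.csv")

# Volume should be preserved overall...
assert abs(input_trade["VOLUME"].sum() - model_output["VOLUME"].sum()) < 1e-6

# ...and per year, if the files span several years.
in_by_year = input_trade.groupby("YEAR")["VOLUME"].sum()
out_by_year = model_output.groupby("YEAR")["VOLUME"].sum()
pd.testing.assert_series_equal(in_by_year, out_by_year, check_names=False)

# No negative or missing volumes should slip through.
assert (model_output["VOLUME"] >= 0).all()
assert model_output["VOLUME"].notna().all()
```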

The development of a model is an iterative process:

graph TB;

    preprocessing[Pre-processing]
    model[Run model]
    qa1[Quality assurance,\nround 1]
    qa2[Quality assurance,\nround 2]
    qa3[Quality assurance,\nround 3]

    preprocessing-->model-->qa1-->qa2-->qa3
    qa1-->preprocessing
    qa2-->preprocessing
    qa3-->preprocessing

Embedding

Embedding of spatial metrics into the trade flows is done using R scripts, prior to ingestion of the trade flows into the database:

graph LR;    
    a[SEI-PCS Model]-->b-->c[Embedding scripts\nin R]-->d    
    subgraph s3[AWS S3]
    b[(pre-embed\nCSV file)]
    d[(post-embed\nCSV file)]
    end

The embedding process uses files in the following example locations:

# pre-embed file
s3://trase-storage/cote_divoire/cocoa/sei_pcs/v1.1.0/SEI_PCS_COTE_DIVOIRE_COCOA_2021.csv

# post-embed file 
s3://trase-storage/cote_divoire/cocoa/sei_pcs/v1.1.0/SEI_PCS_COTE_DIVOIRE_COCOA_POST_EMBEDDED_2021.csv

The embedding process makes no alterations to the CSV file other than to add extra columns containing the embedded values.
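The production embedding scripts are written in R, but the idea can be sketched in a few lines of pandas; the metrics file and the GEOCODE/DEFORESTATION_HA column names below are hypothetical.

```python
# Illustrative sketch only -- the real embedding scripts are R scripts in the runbook.
# Column names (GEOCODE, DEFORESTATION_HA) and the metrics file are hypothetical.
import pandas as pd

flows = pd.read_csv("SEI_PCS_COTE_DIVOIRE_COCOA_2021.csv")
metrics = pd.read_csv("deforestation_by_geocode.csv")  # e.g. GEOCODE, DEFORESTATION_HA

# Left join so every original flow row is kept; only new columns are added.
embedded = flows.merge(metrics, on="GEOCODE", how="left")

# Sanity check: no rows gained or lost, only columns added.
assert len(embedded) == len(flows)

embedded.to_csv("SEI_PCS_COTE_DIVOIRE_COCOA_POST_EMBEDDED_2021.csv", index=False)
```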

Example embedding scripts can be found in [trase/runbook/cote_divoire/cocoa/embedding/].

Ingestion

The purpose of ingestion is to take data on S3 and insert it into the database. We do this for the following reasons:

  • We ensure the data adheres to consistency standards. This is because the database has a unique entry for every "entity" of interest. For example there is only one "Côte d'Ivoire" entity with a standard spelling, a standard identification, a standard geometry, and so on. When we publish our data, if Côte d'Ivoire is mentioned in any dataset, then it is done consistently across all of our data.
  • We can build a data pipeline for publication to the website.
  • We make the data queryable via an SQL endpoint, which is useful for e.g. Metabase.

Ingestion scripts are located in the following example locations:

  • Validation and ingestion of flows data: [trase/runbook/cote_divoire/cocoa/trade]
  • Ingestion of spatial metrics: [trase/runbook/cote_divoire/cocoa/indicators/a_ingest_full_indicators.py]
  • Ingestion of spatial geometry: [trase/runbook/indonesia/spatial/ingest_kabupaten_boundaries.py]
  • Ingestion of logistics data: [trase/runbook/indonesia/wood_pulp/logistics/a_ingest_raw_dataset.py]

In the past we also ingested Zero Deforestation Commitment data. However, this is no longer done as standard, since the embedding of those data occurs prior to database ingestion.

Runbook:

  1. Identify the ingest metadata file on AWS S3, e.g. s3://trase-storage/country/commodity/sei_pcs/vx.y.z/country_commodity_vx.y.z.py
  2. If you are ingesting flows, this should have been generated by the SEI-PCS tool; if it is a legacy model or was not generated using the tool, you will have to construct it yourself.
  3. If you are ingesting node attributes you will likely have to write the JSON by hand and upload it to S3. See also metadata_migrations.
  4. Create and run an "ingest_raw_dataset.py" file in trase/runbook/.
  5. Using your skill and creativity, deal with any issues that occur 🙂
  6. If you are ingesting flow datasets, bring the new supply chain into Python and perform some basic checks (paths/node roles are correct, total volume, etc.).
  7. If you are ingesting node attribute datasets:
     • Quality check the ingested results, mostly in Python: check that all attributes for all the nodes were ingested.
     • Make sure to set any old node attribute references linked to the same attribute to False; this could otherwise break the release process.
     • If the dataset is to be released, set the ref to latest.
  8. If you want to ingest a new HS code (e.g. HS6) for an existing commodity, you can use insert_commodity_code_value(commodity_id, code_id, value) (see the sketch after this list):
     • code_id can be found with get_commodity_code_id("HS6"), and
     • commodity_id can be found with get_commodity_id(product_name), where one example of product_name is "SOYBEAN CAKE".
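Strung together, the HS-code step looks roughly like the sketch below. The import path is an assumption (these helpers may live elsewhere, or be methods on a database client instance); check the trase codebase before copying it.

```python
# Sketch of ingesting a new HS6 code for an existing commodity, using the helpers
# named above. The import path is hypothetical -- adjust it to wherever these
# helpers are actually defined in the trase codebase.
from trase.database import (  # hypothetical module path
    get_commodity_code_id,
    get_commodity_id,
    insert_commodity_code_value,
)

code_id = get_commodity_code_id("HS6")           # id of the "HS6" code type
commodity_id = get_commodity_id("SOYBEAN CAKE")  # look up the commodity by product name

# 230400 (heading 2304.00, soybean oil-cake) is used here purely as an example value.
insert_commodity_code_value(commodity_id, code_id, "230400")
```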

Unique identifiers for entities (Trase IDs)

See Trase ID.

Levels of spatially-situated entities

For any new countries we recommend following the GADM levels (not boundaries). To find out these levels you can:

  1. Visit https://gadm.org/download_country_v3.html and choose your country
  2. Download the Shapefile and unzip it
  3. Locate each of the .DBF files. You can open these in Excel and there will be a list of human-readable names, which you can cross-reference to see what each level is referring to (or inspect them programmatically, as in the sketch after this list).
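If you prefer not to open the files in Excel, a minimal geopandas sketch to list the names at each level is given below. The gadm36_CIV_<level>.shp file names follow the GADM v3.6 convention for Côte d'Ivoire and are only an example; adjust the ISO3 code and version to the files you actually downloaded.

```python
# Minimal sketch: print a few human-readable names for each GADM level of a country.
# File naming (gadm36_CIV_<level>.shp) is an example and depends on the GADM version.
import geopandas as gpd

for level in range(4):
    try:
        gdf = gpd.read_file(f"gadm36_CIV_{level}.shp")
    except Exception:
        break  # no shapefile for this level -- the country has no further levels
    names = sorted(gdf[f"NAME_{level}"].unique())
    print(f"Level {level}: {names[:5]} ...")
```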

For Brazil, Paraguay, Argentina and several other countries, a level structure different from GADM's was already established. If you are working with one of the countries below, please use the levels described in the corresponding table.

Argentina

Level Name
2 Province
3 Department

Bolivia

Level Name
2 Department
3 Province
4 Municipality

Brazil

Level Name
2 Region
3 State
4 Mesoregion
5 Microregion
6 Municipality

Côte d'Ivoire

Level Name
2 District
3 Region
4 Department

Colombia

Level Name
2 Department
3 Municipality

Ecuador

Level Name
2 Province
3 Canton
4 Parish

Indonesia

Level Name
2 Region
3 Province
4 Kabupaten

Paraguay

Level Name
2 Department
3 District

Detailed workflow diagram

This section goes into more detail regarding the workflow, dependencies, and responsibilities of the whole SEI-PCS process.

The goal of the diagram is to highlight inter-dependencies between steps and teams:

  1. Illustrate the steps in the supply chain modelling process
  2. Illustrate the interdependencies between the steps
  3. Outline which team is responsible for which step

In the diagram there are three teams. The "context team" consists of one of, or a mix of, the supply chain mapping team and the data team, depending on the nature of the work (for example, whether it is a context update or a new context).

graph TB;
    subgraph Legend
        direction TB
        td(Data Team)
        ts(Spatial Team)
        tc(Context Team)
        td-->ts-->tc
    end

    linkStyle 0,1 display:none;
    style td fill:#ED71C8;
    style tc fill:#8EA094;
    style ts fill:#8889F7;

    subgraph preprocessing[Pre-Processing]
        direction LR
        style preprocessing fill:transparent,stroke-dasharray: 4 4,stroke:#000,stroke-width:1px;

        gather_trade(Gather trade data)
        gather_other(Gather other input data)    
        gather_logistics(Gather logistics data)  

        ingest_trader(Ingest trader synonyms)  

        clean_trade(Gather and clean\ntrade data)
        clean_other(Gather and clean\nother input data)    
        clean_logistics(Gather and clean\nlogistics data)              
        generate_geom(Generate geometries as GeoJSON\nUpload to S3)
    end


    ingest_geom(Ingest geometries)        
    database_ingest(Ingest into database)
    python_metadata(Complete Python metadata\nfor website)
    build_data_views(Build data view in supply_chains schema)
    generate_zdc(Extract top traders\nGenerate ZDCs)

    run_model(Run SEI-PCS model\nUpload results to S3)        
    validate_results(Clean and validate\nresults against database \nUpload to S3)
    generate_metrics(Generate spatial metrics)
    validate_metrics(Validate spatial metrics\nagainst database)            
    embed_metrics(Embed spatial metrics\nUpload to S3)
    qa(Quality assurance of flows)
    sign_off_spatial{{<b>Spatial sign-off</b>}}    
    sign_off_modelling{{<b>Modelling sign-off</b>}}                
    sign_off_database{{<b>Database sign-off</b>}}
    publication_on_staging(Publication to staging)


    gather_trade-->ingest_trader-->clean_trade
    gather_other-->ingest_trader-->clean_other
    gather_logistics-->ingest_trader-->clean_logistics

    generate_geom-->generate_metrics 
    clean_trade-->generate_zdc    
    qa-->sign_off_modelling
    clean_trade-->run_model
    clean_other-->run_model
    clean_logistics-->run_model
    run_model-->validate_results    
    validate_results-->qa
    generate_metrics-->validate_metrics
    generate_zdc-->validate_metrics
    validate_metrics-->embed_metrics
    embed_metrics-->sign_off_spatial
    sign_off_spatial-->database_ingest
    generate_geom-->ingest_geom           
    ingest_geom-->sign_off_database
    database_ingest-->build_data_views
    python_metadata-->publication_on_staging
    build_data_views-->sign_off_database
    sign_off_modelling-->embed_metrics
    sign_off_modelling-->database_ingest

    sign_off_database-->publication_on_staging

    style gather_logistics       fill:#8889F7;
    style generate_geom          fill:#8889F7;
    style ingest_geom            fill:#ED71C8;
    style clean_trade            fill:#8EA094;
    style gather_trade           fill:#8EA094;
    style gather_other           fill:#8EA094;
    style clean_other            fill:#8EA094;
    style ingest_trader          fill:#ED71C8;
    style clean_logistics        fill:#8889F7;
    style generate_zdc           fill:#8889F7;
    style run_model              fill:#8EA094;
    style validate_results       fill:#ED71C8;
    style generate_metrics       fill:#8889F7;
    style validate_metrics       fill:#ED71C8;
    style embed_metrics          fill:#8889F7;    
    style qa                     fill:#8EA094;    
    style database_ingest        fill:#ED71C8;
    style publication_on_staging fill:#ED71C8;
    style python_metadata        fill:#8EA094;
    style build_data_views       fill:#ED71C8;

    style sign_off_modelling  fill:transparent,stroke:#8EA094,stroke-width:3;
    style sign_off_spatial    fill:transparent,stroke:#8889F7,stroke-width:3;
    style sign_off_database   fill:transparent,stroke:#ED71C8,stroke-width:3;

Notes:

  • There is a possible circular dependency between ingesting trader synonyms and pre-processing of data.
  • After pre-processing, the trade data should contain cleaned traders, even if the model does not require it. This is so that the ZDC work is not blocked.
  • To "Validate spatial metrics against database" means to check that all geocodes in the file exist in the database and that, if the name of the region appears in the file, it is one of the node labels in the database (a rough sketch of this check is given after these notes).
  • To "clean and validate results against database" means the following:
  • Validate that all database objects (regions, traders, etc.) referenced in the results file exist in the database
  • Validate trader labels, Trase IDs, default trader names and trader groups against the database.
  • Either add parent regions (e.g. state) and biomes or, if they are already in the results file, validate that these are already aligned with the database An example of this process can be found at trase/models/brazil/soy/post_processing.ipynb
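As a rough illustration of the geocode check, validation against the database can be sketched as below. The connection string, table and column names are hypothetical; the real checks live in the per-context validation scripts.

```python
# Sketch of "validate spatial metrics against database": every geocode in the metrics
# file must exist in the database, and any region names present must match the node
# labels stored there. Connection string, table and column names are hypothetical.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:password@host/trase")  # placeholder
db_nodes = pd.read_sql("SELECT geocode, name FROM nodes", engine)  # hypothetical table

metrics = pd.read_csv("spatial_metrics.csv")

unknown = set(metrics["GEOCODE"]) - set(db_nodes["geocode"])
assert not unknown, f"Geocodes missing from the database: {sorted(unknown)}"

if "NAME" in metrics.columns:
    joined = metrics.merge(db_nodes, left_on="GEOCODE", right_on="geocode")
    mismatched = joined[joined["NAME"] != joined["name"]]
    assert mismatched.empty, "Some region names do not match the database node labels"
```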

Country of destination vs. import and re-exports

Trase makes a distinction between two types of destination country:

  • Country of import is where a container ship initially docks and unloads its goods.
  • Country of destination is where the goods end up, possibly after a secondary transportation by land or sea.

If there was no secondary transportation then we would consider the country of destination to be the same as the country of import.

Typically, customs declarations will describe the country of destination whereas bills of lading will contain the country of import.

Re-export is the process of transporting goods that have not yet been used or processed from the initial import country to another country, whether by land or by sea.

Asset Levels

Level 1

This type of asset is tied geographically and tightly to the exact location where the commodity originates in the first place.

Examples of this type of activity include, but are not limited to, the following assets: farms, plantations, mines, quarries, fisheries, aquaculture ponds, production concessions of any kind (e.g. forestry or mining concessions), etc. Soy farms and oil palm plantations are examples of Level 1 assets we have dealt with in current Trase commodities.

Level 2

This type of asset is tied geographically and tightly to the local region where the commodity originates in the first place, usually where it is gathered for the first time and becomes part of an international supply chain, and is not under the control of the direct producers of the commodity. These have traditionally been called "logistic hubs" in Trase, and they constitute the targeted level of sourcing detail we aim for. The acceptable level of geographic resolution needs to be determined by the SEI-PCS developer, after context and data scoping.

Examples of this type of activity include the following assets: silos and other storage facilities, mills, and in general any form of activity that involves the reception, collection, preparation, conditioning, selection and classification of raw materials in the vicinity of the areas where they are produced. These include a variety of facilities for washing, drying, cooling, hulling, husking, milling (depending on the commodity), peeling, slaughtering, fish-cutting, sawing, collecting, smelting, heating, etc., which SEI-PCS developers need to identify for each supply chain. Soy silos, palm oil mills and cattle slaughterhouses are examples of Level 2 assets.

Level 3

This type of asset is generally geographically disentangled from the production area. As a result, relevant flow-constraint data need to be crossed with these Level 3 assets in the appropriate decision tree branch, in order to establish unambiguous or plausible relationships between these assets and the assets in Levels 1 and 2. Generally speaking, these assets add value to the raw material and/or prepare the product to meet internal trade and market requirements.

Examples of these assets include, but are not limited to: oil refineries, meat factories, packaging and canning facilities, timber processing plants, roasters, tanneries, cooking/preparation facilities, chemical facilities, etc. These activities are often concentrated in large industrial poles geographically removed from production areas and closer to national and international markets, but for the more vertically integrated companies they are owned by the same companies as the Level 1 and Level 2 assets. For supply chains with limited vertical integration, these assets will not be useful for determining the Level 1 or 2 assets of origin, leading to records of unknown origin. The quality and detail of the material flow-constraint data is key to avoiding this.

Level 4

This type of asset is completely or partially removed from any physical handling of the commodity of interest, and/or plays a role considerably further downstream in the supply chain of interest; its location is therefore geographically independent of the assets described in Levels 1, 2 and 3.

Examples of these assets include, but are not limited to: cargo handling facilities in ports, wholesale retailing, import/export trading, international transport services, etc. These assets can be important for the development of a SEI-PCS context because they often allow us to circumscribe material flows observed in the detailed trade data to specific locations, by drawing conclusions about the logistics and economic feasibility of potential supply chain configurations. However, as with Level 3 assets, linking these assets to sourcing regions upstream requires relevant flow-constraint data to be included.

Level 5

Nominally, Level 5 is used to refer to the activity type of trade records for which no information is available.

Dataset Versions

Our supply chain datasets have a public version in the format major.minor.patch, for example "Brazil Soy v2.5.0".

The meanings of the three components are as follows:

  • Major: This has only four possible values:
  • 0: National-level,
  • 1: Subnational-level relying heavily on modelling techniques, typically using transportation costs and constrained optimization algorithms to allocate export volumes to individual production regions.
  • 2: Subnational-level relying on a more data-driven approach that identifies the subnational origin of individual material trade flows within the trade and customs information, and triangulates this information with other independent data sets. Modelling techniques are applied in mapping volumes from sub-national processing facilities to production regions.
  • 3: Subnational-level going one step further, mapping exports back to individual production areas (e.g. farms, concessions) rather than regional administrative boundaries like districts.
  • Minor: A version of the model logic: we bump this when there is a significant change in the supply chain modelling approach.
  • Patch: We bump this when there has been a data update but the supply chain model logic stayed the same. This happens, for example, if we have updated production data. We also bump the patch version when we have an updated trader hierarchy, and when we have new embedded indicator data, even if there was a change in methodology for the indicator data. (The sketch after the examples below makes these rules concrete.)

Examples:

  • An initial release of a subnational dataset determined by distance-based LP: 1.0.0
  • An initial release of a subnational dataset determined by LP weighted by trader logistics: 2.0.0
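A hypothetical helper to make these rules concrete; it is not part of the Trase codebase, just an illustration of the bump semantics.

```python
# Hypothetical helper illustrating the versioning rules above (not Trase code).
def bump_version(version: str, change: str) -> str:
    """change is "methodology" (minor bump) or "data" (patch bump).

    The major component encodes the modelling approach (0-3) and is chosen when
    the dataset is first designed, so it is never bumped automatically here.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "methodology":  # significant change to the supply chain model logic
        return f"{major}.{minor + 1}.0"
    if change == "data":  # updated production data, trader hierarchy or embedded indicators
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


assert bump_version("2.5.0", "data") == "2.5.1"
assert bump_version("2.5.1", "methodology") == "2.6.0"
```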

Organisation of Amazon Simple Storage Service (AWS S3)

All input data for SEI-PCS are stored in AWS S3. To gain access to S3, contact Harry Biddle (harry.biddle@sei.org) or Toby Reid (toby.reid@sei.org). You may be granted access to only a subset of the smorgasbord of folders in the trase-storage bucket.

Once you have your credentials you should install the AWS CLI. You can then open a shell/command prompt/terminal and run aws configure, where you will be prompted to enter your credentials.

The file structure for S3 is broadly arranged with data broken down per country:

├── country
|   ├── commodity1 <- commodity specific data
|   ├── commodity2
|   ├── commodity3
|   ├── trade
|   ├── production
|   ├── logistics
|   ├── metadata
|   ├── dictionaries
|   ├── indicators
|   ├── spatial
├── world
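Once aws configure has been run, the bucket can also be accessed from Python with boto3, as in the sketch below (the prefix and key reuse the cocoa example from the Embedding section).

```python
# Minimal sketch of accessing trase-storage from Python once credentials have been
# set up with `aws configure`. Requires the boto3 package.
import boto3

s3 = boto3.client("s3")

# List the contents of a prefix...
response = s3.list_objects_v2(
    Bucket="trase-storage", Prefix="cote_divoire/cocoa/sei_pcs/v1.1.0/"
)
for obj in response.get("Contents", []):
    print(obj["Key"])

# ...and download a single file locally.
s3.download_file(
    "trase-storage",
    "cote_divoire/cocoa/sei_pcs/v1.1.0/SEI_PCS_COTE_DIVOIRE_COCOA_2021.csv",
    "SEI_PCS_COTE_DIVOIRE_COCOA_2021.csv",
)
```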

Development Process

Data visualisation in R

If working in R, install the traseviz package, which includes a set of tools for quickly producing Trase-style visualizations in R. Contact traseviz founder Andrew Feierman (andrew.feierman@sei.org) for details.

We are currently moving a lot of data visualisation work (particularly of the completed SEI-PCS products) to Observable. A webinar about Observable is available on Dropbox and you can interact with the data on Observable. Contact Observable jedi master Ben Ayre (b.ayre@globalcanopy.org) for details.

Coding standards

Some coding standards (such as the use of Black in Python) are automatically enforced using pre-commit. There are instructions on how to set up pre-commit in the root README.md.

Some (hopefully) helpful code for SEI-PCS work in R is available here.

When organizing the scripts, please give the R script a name and save it in a location which matches the data in S3.

At the top of each script please add a few lines of comments which describe the file, what it does, and who wrote it.

Please always aim for legibility! If someone shared the script with you, would you be able to easily understand what the script does, if not given any other information? Examples of best practices include:

  • Load libraries at the top of the script, and only load the libraries you need.
  • Delete all unnecessary code (e.g. don't create unnecessary objects).
  • Remove failed experiments (i.e. all commented-out code lines) before committing the code to GitHub.
  • Standardize strings as far as possible – wherever appropriate, remove special characters and capitalize ("Olam International Ltd." -> "OLAM INTERNATIONAL LTD"). See the str_trans R function in the R helper link.
  • Where possible, simplify the code – avoid creating separate objects (e.g. data_mondelez_olam and data_mondelez_touton) only to apply the same transformations to each, and avoid creating an object (e.g. dist_kumasi) only to rename it later (dist_other); give it the right name in the first place. Simplifications like these can substantially shorten a script.
  • The janitor package is great for data cleaning in general – e.g. clean_names(case = "screaming_snake") for cleaning column names. Check it out. :)
  • Write the code in a way that it is unlikely to break – e.g. avoid slice() if you’re actually filtering by a value.
  • Use clear subheadings! In Rstudio, if you press Ctrl + Shift + R it creates tidy subheadings for you.
  • Always save the data on S3 (not locally!).
  • Wherever you can, please work with ID codes, not strings of names – for example, work with the GEOCODE of each jurisdiction rather than its NAME, and join equivalence factors to trade data using the HS_CODE rather than the PRODUCT name. ID codes are standardized across countries and reduce the likelihood of introducing errors during data processing (e.g. because of spelling errors in product descriptions). Not sure what an HS code is? Please check the Trase glossary.
  • Wherever possible, please load data (e.g. customs records) from the Trase database, not S3. The data in the database have already undergone standardization (e.g. trader name cleaning). Example code to load data from the database is here: https://github.com/sei-international/TRASE/blob/master/doc/r_helper_code.md.

I started jotting down some suggestions on Google Docs, but this requires team input before being widely adopted.