Models Docs
Trase SEI-PCS Tool
Requirements
- Python 3.6+
- Poetry (`pip3 install poetry`)
- A virtual environment with the dependencies installed: `poetry install`
- The GLPK solver for PuLP
Development
If you are going to alter the library code within the "trase" directory you should also set up a virtual environment. This should be done using Poetry, which guarantees that dependencies are installed with precisely the correct versions.
- Create a virtual environment using Poetry. If you get an error about missing Python 3.6 then refer to the Poetry documentation on managing environments.

      pip3 install poetry
      cd path/to/repository
      poetry install

- Check pre-commit is working. Pre-commit enforces formatting and linting checks before you push to GitHub:

      poetry run pre-commit install
      poetry run pre-commit run --all-files

- (Optional) Integrate pre-commit into your editor. It can be annoying if your commits constantly fail linting checks! To avoid this problem apply auto-formatting as soon as you click "Save" in your editor:

  - First install pre-commit outside your virtualenv:

        pip install pre-commit

  - Then, configure your editor to run `pre-commit run --files <files>` on save. In PyCharm this can be done by adding a new file watcher. If you do not have a professional license you may need to install this as a plugin. When it is installed go to Preferences/Settings > File Watchers > + with the following settings:

    | Key | Value |
    |---|---|
    | Name | pre-commit |
    | File Type | Python |
    | Scope | Project Files |
    | Program | output of `which pre-commit` |
    | Arguments | `run --files $FilePath$` |
    | Output paths to refresh | `$FilePath$` |
    | Working directory | `$ProjectFileDir$` |
    | Auto-save edited files to trigger the watcher | No |
    | Show console | Never |

  - Make a formatting change (like re-ordering an import), save the file, and confirm that the file was auto-fixed.
Introduction
How to use this tool
This tool is intended to help researchers implement models according to the SEI-PCS approach. Before using the tool, the researcher should gather enough data to evaluate whether developing a model is feasible.
Once the data has been analyzed and the researcher decides to develop the model, the data is expected (though not required) to be stored in S3. After that, the researcher works through three steps to complete a model:
- Create a model definition file.
- Create a preparation file to bring the original data to the tool.
- Develop the model script using the helper methods provided by the tool.
Description
The Trase SEI-PCS SupplyChain tool is a set of methods and a predefined structure that provides a framework for researchers to develop models in a standardized way.
Workflow
The tool assumes the following workflow:
- The user collects all required data for the model and stores it in S3.
- The user defines the model structure in the definition.py file.
- The user creates a preparation.py script to convert the original data into a toolbox-friendly format according to the definition.py file (see below).
- The user creates the model.py script to run the supply chain.
All files need to be stored following the file structure described below.
Concepts
The SEI-PCS SupplyChain tool assumes that models can be abstracted using the following concepts:
- Nodes: Nodes are a mechanism to store data that will be iterated over during the model run, e.g. assets, municipalities, etc. The tool can load different types of nodes and gives each node a unique identifier. Consider defining a node type if a data entity in the model has more than one attribute. The benefit of defining nodes to handle these entities is that the tool performs several checks and can help to identify issues.
File structure
The toolbox enforces a specific file structure, in which models are stored first by country, then by commodity and finally by year.
Country and commodity names should start with an uppercase letter, e.g. "Brazil" and "Chicken". The tool uses the following folder structure:
- Level 1: Country
- Level 2: Commodity
- Level 3: Year
The folder structure is created automatically by calling the constructor of the tool:
SupplyChain("my_country/my_commodity", 2019)
This should be the first step when starting the development of a model. Once the structure is created the researcher can continue with the next step.
In each level, we can find the following subfolders, which are expected to be used to store data at different stages:
- 'ORI': Original data, which will be used during the preparation step (usually CSV files).
- 'IN': Prepared data, which has been prepared specifically to be compatible with the tool (also CSV files).
- 'PRE': Pre-model-execution data. This data has been loaded into the tool format and is stored as cache data (JSON) in case the researcher wants to skip the preparation step.
- 'OUT': Post-model-execution data. This is basically the resulting flow data in the tool's format (JSON).
- 'POST': This is where the resulting data from the model is stored back in CSV format before being uploaded to S3.
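Putting these pieces together, a Brazil chicken model for 2019 would end up with a layout like the sketch below (folder roles from the list above; file names are illustrative):

```
Brazil/
└── Chicken/
    └── 2019/
        ├── ORI/    # original data (as fetched from S3)
        ├── IN/     # prepared CSVs (e.g. state.csv, municipality.csv, flows.csv)
        ├── PRE/    # cached pre-model data (JSON)
        ├── OUT/    # model output flows (JSON)
        └── POST/   # final CSVs before upload to S3
```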
AWS credentials
As standard, it is assumed that the researcher has stored the data in AWS S3 buckets, so credentials to access the buckets are required to use the tool. For the quickest access to S3, you should also set a region.
To configure the credentials, first install the AWS CLI:
pip install awscli
followed by:
aws configure
You will be prompted for the credentials, the default region and the default output format:
AWS Access Key ID [None]: <fill in your access key ID>
AWS Secret Access Key [None]: <fill in your secret access key>
Default region name [None]: <press enter for None or type a suitable AWS region code>
Default output format [None]: <press enter>
Now you are all set!
Test the connection
The fastest way to check your connection is to run the following code in Python:
import boto3
client = boto3.client('s3')
client.list_buckets()['Buckets']
If you have successfully configured the credentials, you should get a rather messy response containing the names of all available buckets and some related data.
If you get something like this, something is wrong with your configuration:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/site-packages/botocore/client.py", line 276, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python3.8/site-packages/botocore/client.py", line 572, in _make_api_cal
[...]
File "/usr/lib/python3.8/site-packages/botocore/signers.py", line 160, in sign
auth.add_auth(request)
File "/usr/lib/python3.8/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
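Optionally, a slightly friendlier connectivity check (a short sketch using standard boto3/botocore calls) reports the missing-credentials case instead of printing a traceback:

```
import boto3
from botocore.exceptions import NoCredentialsError

try:
    # Same call as above, but catch the most common failure mode.
    buckets = boto3.client("s3").list_buckets()["Buckets"]
    print("Connection OK, buckets visible:", [b["Name"] for b in buckets])
except NoCredentialsError:
    print("No AWS credentials found - run 'aws configure' first")
```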
Building a model
1. Definition file (definition.py)
The definition file is a Python file that describes how the model data will be loaded and stored in Python dictionaries. The researcher can work directly against these dictionaries, or use the available helper methods to handle them more conveniently during the execution of the model.
There are some conceptual data structures that need to be understood in order to properly define the model.
It is important to keep in mind the double functionality of this file: it defines both how the data is loaded and how it is stored. The flows export section also automatically generates the ingestion metadata when upload_results is called. Read below for more details on these mechanisms.
1.1. Metadata:
The definition file allows metadata to be stored in order to identify and track the file; the metadata is made up of variables. The content of the metadata entries is not limited, and it is up to the researcher to include relevant/necessary information. Having said this, there is a set of variables that must be defined:
- The description attribute.
- The version attribute.
- The commodity_equivalence_group_name attribute: the uppercase name of the commodity equivalence factor group in the database (don't worry if it is not exact, the validation occurs later).
- The years attribute: used to limit which years can be run with the model.
Example:
from trase.tools.sps import Dataset, c, e
description = "Brazil Chicken SupplyChain Model"
version = "1.0"
commodity_equivalence_group_name = "CHICKEN"
years = [2015, 2016]
# ...
1.2. Nodes
Nodes are a mechanism to abstract an asset, geolocation or any other entity used inside the model. This abstraction stores the data used inside the model, and helps with iterating over it and detecting possible data issues.
A model definition file specifies what kind of node types will be used, their attributes and how the data will be loaded into them. The model data, once loaded, may contain different numbers of nodes of different types, each with a unique identifier that helps to iterate over them and even reference/link them. Basically, a node is a Python dictionary whose items represent attributes (keys) and their values.
These attributes can contain raw data (option A) or links to other nodes (option B). The difference is that a link is a reference to another node, which in turn may contain its own attributes and further links. This enables two functionalities during the loading and execution of the model:
- Perform checks during the loading of the model data, and
- Iterate over linked nodes' attributes and their own links.
As an example, let's say a node type called 'asset' has an attribute called 'municipality' which refers to the geolocation of this 'asset' node (similar to a postal code). If this attribute is handled as raw data (option A), the value could be anything that respects the defined datatype, let's say a string. Values (including typos and spelling variations) that refer to the same municipality would all be valid, e.g. 'Stockholm', 'stockholm', '53628' or 'Estocolmo'. This would allow two different 'asset' nodes to refer to the same municipality with different values, which could be a problem if the attribute is used in the model for checks or comparisons (e.g. getting all 'asset' nodes with a municipality attribute equal to <some value>). On the other hand, if 'municipality' nodes are defined in the model and the attribute in the 'asset' node is defined as a link to a 'municipality' node, the tool will perform a check while loading the 'asset' nodes, and if the link is not valid because the referenced municipality name is not defined, it will raise an error to help the researcher fix it.
Nodes are defined inside the 'nodes' key as shown in the example below. Notice that all attributes of the 'state' nodes are defined as raw data, using the key of each entry to define the name of the attribute and the value to define the datatype. To load the data into the tool, there should be a CSV file called 'state.csv' in an 'IN' folder after the preparation, which should include a column with the name of each attribute, and where each row represents a 'state' node. The tool will even try to convert the data in the CSV file to the given datatype before setting the attribute values, and will detect any issues.
Meanwhile, the node type 'municipality' defines some raw attributes as well as an attribute called 'state' that is a link to a 'state' node. To define a link attribute, the value should have the following format:
link:<node type to link>.<attribute used to link>
In the example below, the column 'state' of the 'municipality.csv' file is expected to contain valid state 'uf_code' attribute values. During the loading, the tool will create a node for each row in the CSV file and will try to find a state that has the same uf_code as the one stored in the 'state' column. If a valid 'state' node is found, its unique identifier will be set as the value of the 'state' attribute. If a valid state is not found, the tool will raise an error for the researcher.
datasets = {
"state": Dataset([
c("name"),
c("uf_code", int),
c("uf_name"),
]),
"municipality": Dataset([
c("name"),
c("geocode", int),
c("state", link="state.uf_code"),
]),
# ...
To summarize, once the nodes are defined in the definition file, the SupplyChain will use them as part of the loading step in order to find CSV files (with the corresponding node names) inside the 'IN' folders and load them automatically. The files are expected to have a column named after each node attribute. Additional columns in the CSV files will not cause any issues.
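For illustration, the prepared 'IN' files matching the example definition above might look like this (hypothetical contents; the default CSV dialect is semicolon-separated, see "Loading CSVs" below):

```
IN/state.csv:
    name;uf_code;uf_name
    Mato Grosso;51;MT

IN/municipality.csv:
    name;geocode;state
    Sorriso;5107925;51
```

Because 'state' in municipality.csv is defined as a link to state.uf_code, the value 51 must match an existing state row; otherwise loading fails with an error.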
TODO: Describe "list=" option for raw attributes
1.3 Flows:
The flows are the base of the model. They contain the information of an export from a production country to a destination country, including the HS code, volume, FOB and other data. The flows are processed during the model execution to add more information to them and fill in the gaps (e.g. adding a more exact production location or additional attributes such as indicators).
Flows are loaded from a file called 'flows.csv' that is expected to be inside an 'IN' folder.
The flow definition is quite similar to the node definition. Each attribute name is defined as the key of the attribute entry and its value will represent the datatype or a link to a node. This allows the same check and iteration functionality with flows as with nodes.
In the example below, some flow attributes will be loaded directly as raw data, and datatype conversion will be carried out to detect any issues. The linked attributes are handled in the same way as linked attributes of nodes: if a link is not found, it will be pointed out to the researcher.
#...
flows = [
c("year", int),
c("hs6", str),
c("vol", int),
c("fob", int),
c("state", link="state.name"),
c("exporter", link="exporter.cnpj"),
c("port", link="port.name"),
c("country", link="country.name"),
c("importer"),
]
# ...
If additional attributes are required during model execution but aren't loaded from the source data, they can be defined in the definition of the flow in the same way as with raw attributes.
In that case, a default value appropriate to the attribute's data type will be set, e.g. False for booleans, '' (empty string) for strings, 0 for integers, 0.0 for floats, etc.
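For example (a sketch; 'vol_allocated' is a hypothetical attribute name), such an attribute is declared like any other raw attribute and starts out at its type's default value:

```
flows = [
    c("year", int),
    c("vol", int),
    # Not present in flows.csv: starts at 0.0 and is filled in by model.py.
    c("vol_allocated", float),
]
```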
TODO: Describe node_load_order
1.4 Constraints:
Constraints are a different mechanism, used to enable quick access to additional data during the execution of the model, the decision tree or the linear programming (LP), by making use of the power of Python dictionaries.
As an example we could have the common cost matrix that is used to constrain many LP problems. Usually the cost matrix is a Pandas dataframe that includes three columns: the 'origin', the 'destination' and the 'cost'. The tool allows you to easily load this dataset. By providing both keys, the cost attribute can be retrieved from the dictionary:
constraints = {
"cost": Dataset([
c("origin", link="municipality.geocode", key=True),
c("destination", link="municipality.geocode", key=True),
c("cost", int),
]),
# ...
The definition of constraints differs a bit from that of nodes and flows. For each constraint definition ('cost' in the example above), the keys and attributes are defined independently. This is done to keep the keys in the order specified in the definition (using a list), while the attributes are unordered (using a dictionary). The order in the list defines the order of the dictionary keys: first 'origin' and then 'destination' in the example. For each key, the column name of the CSV file and the value need to be defined. As before, keys can be raw data or links to nodes ('municipality' geocodes in the example); this ensures that all loaded 'origin' and 'destination' values are valid. Attributes of the constraint can also be raw data or links.
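As a rough illustration of the idea (not the tool's exact API), the loaded constraint behaves like a dictionary keyed by the key columns in their defined order, so a cost lookup during the decision tree or LP step is a plain dictionary access:

```
# Illustrative analogue only: the real structure is built by the tool during loading.
cost_lookup = {
    (5107925, 5103403): {"cost": 420},  # keys: (origin geocode, destination geocode)
    (5107925, 5102504): {"cost": 310},
}
cost = cost_lookup[(5107925, 5103403)]["cost"]  # -> 420
```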
1.5 Constants:
Working on this...
1.6 Flows export:
All columns in this list will be included in the CSV uploaded to S3. Follow the naming conventions laid out here. Start with the flow nodes in the order of the flow path, as the positions in the flows are generated automatically based on the type of column. This also determines the metadata for the ingest, which is recreated with every call of upload_results().
The first argument is the name of the output column and the second is the name of this column in the flows list. Then there are additional optional arguments.
Example:
# ...
flows_export = [
# Flow nodes in order
e("COUNTRY_OF_PRODUCTION", "country_of_production", role="COUNTRY_OF_PRODUCTION"),
e("MUNICIPALITY_TRASE_ID_PROD", "parish_of_production", role="MUNICIPALITY"),
e("MUNICIPALITY_TRASE_ID_LH", "parish_of_processing"=0, role="LOGISTICS HUB"),
e("PORT_OF_EXPORT", "port_of_export", role="PORT"),
e("EXPORTER_TAX_ID", "exporter_trase_id", ingest=False),
e("EXPORTER", "exporter"),
e("IMPORTER", "importer"),
e("COUNTRY_OF_DESTINATION", "country_of_import", role="COUNTRY"),
# Additional columns
e("YEAR", "year"),
e("HS6", "hs6"),
e("VOLUME_RAW", "volume_raw"),
e("FOB", "fob"),
e("BRANCH", "branch"),
e("EXTRA ATTRIBUTE", "extra", attribute_type="ind"),
]
Optional key arguments:
| Argument | Type | What? | Columns this applies for |
|---|---|---|---|
| ingest | boolean (True or False) | Refers to whether you want to include this column in the database ingest or not. | all |
| role | string | How to label the node in this column in the sankey and on the site. E.g. 'COUNTRY OF PRODUCTION' or 'BIOME'. There are defaults so it is possible to omit this. | flow nodes |
| date_type | string | Defaults to "YEAR", options: "YEAR", "DATE_START", "DATE_END", "YEAR_START", "YEAR_END", "TIME_START", "TIME_END" | "YEAR" |
| code_name | string | Defaults to "HS6"; if there is a need for a different code, contact the data team. | "HS6" |
| unit | string | Defaults to "TONNES"; options: "KG", "TONNES" | "VOLUME_RAW" |
| attribute_type | string | ind, qual or quant | additional indicator columns, can be named anything |
2. Preparation of data (preparation.py)
Pinning Object Versions After Model Development
Once we finish the development of a model, it is useful to be able to run that model in the future and still get the same output. However, if somebody moves, modifies, or deletes a file in S3 that is used by the model, this could unexpectedly break the model or change the output. To avoid this we recommend that you "pin" the versions of the S3 objects that were used to produce the final output.
To do this, first find out the latest version of the S3 objects you reference in preparation.py, either from the AWS web console or with this command:
aws s3api list-object-versions \
--bucket my-bucket \
--prefix my/object/key.csv \
--query 'Versions[?IsLatest].[VersionId][0][0]'
Then, set the version_id attribute on every preprocessor class.
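A sketch of what that could look like in preparation.py (everything here except the version_id attribute itself is hypothetical and should be adapted to your existing preprocessor classes):

```
class ExporterAssetsPreprocessor:          # hypothetical preprocessor class
    bucket = "my-bucket"
    key = "my/object/key.csv"
    # Pin the exact S3 object version (the VersionId returned by the command above),
    # so future runs read the same bytes even if the object is later overwritten.
    version_id = "3HL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY"
```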
3. Modeling (model.py)
Loading Tabular Data
This tool reads many kinds of files that contain tabular data: Excel worksheets, CSV files, JSON files, and so on. This raises questions like:
- How do you convert the string `"0"` to a boolean?
- Is `""` a missing value or is it an empty string?
- How do you convert a JSON `null` to a float: is it `0`, `nan`, or `None`?
In answering these questions, this tool follows these principles:
- Behaviour should be clear and consistent across different file types.
- Where there is ambiguity it is better to error than to make the wrong choice.
- Nullable and `nan` objects are avoided, since they generally cause more confusion than they help.
- Pandas dataframes with columns of mixed types (`object` columns) are avoided.
- It should be possible to load data from a range of sources and formats.
The sections below document how the tool behaves in various scenarios.
Dates and Times
The framework does not support loading dates and times. These should be loaded as strings and then parsed appropriately.
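For example (a sketch; the column name and date format are hypothetical), a date column can be declared as a plain string in the definition file and parsed wherever it is needed in the model script:

```
from datetime import datetime

# The column was declared as c("export_date") in the definition, so it loads as a string.
raw_value = "2019-07-31"
export_date = datetime.strptime(raw_value, "%Y-%m-%d").date()
```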
Loading JSON
We expect the JSON file to be an array of dictionaries:
[
{
"first_column": 1,
"second_column": 1
}
]
All JSON data is loaded using Python's standard library. We defer to this library for constructing Python types (float, int, etc). However, the following values are rejected:
| Value | Type | | Result |
|---|---|---|---|
| None | any type* | → | error |
| nan | float, int | → | error |
| ±inf | float, int | → | error |
* with the exception of object; see below.
Mis-matched types will be rejected: you should load data using the correct type and then cast it later:
| Value | Type | | Result |
|---|---|---|---|
| True | str | → | error |
| "True" | bool | → | error |
There is one exception to this: casting a float to an int. The number will be truncated rather than rounded:
| Value | Type | | Result |
|---|---|---|---|
| 3.8 | int | → | 3 |
Lists are supported.
| Value | Type | | Result |
|---|---|---|---|
| [True, False] | List[bool] | → | [True, False] |
Otherwise the type must match the type in the file.
The object type can be used for a more complex column like a dictionary or a list, columns of mixed types, or a column containing any of the above rejected values.
However, where possible it is recommended to use one of the more explicit types.
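For instance, a column whose values are sometimes null or of mixed types can only be loaded by declaring it as object (hypothetical file contents):

```
[
  {"geocode": 5107925, "notes": null},
  {"geocode": 5103403, "notes": ["checked", 2019]}
]
```

Here 'geocode' can be loaded as int, but 'notes' would be rejected for any type other than object.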
Loading CSVs
CSVs are the most difficult format because they are not typed: data is stored as strings and the framework must interpret them. By default the framework assumes UTF-8, semicolon-separated, minimally-double-quoted CSV files. However, this can be overridden.
Integer (int)
Strings containing numbers behave as expected:
| "99" | → | 99 |
| " 3 " | → | 3 |
| "03017" | → | 3017 |
Floating-point values are truncated rather than rounded:
| "3.8" | → | 3 |
All other strings result in an error:
| "" | → | error |
| "nan" | → | error |
| "NA" | → | error |
Float (float)
Strings containing numbers behave as expected:
| "99" | → | 99.0 |
| " 3 " | → | 3.0 |
| "03017" | → | 3017.0 |
| "3.8" | → | 3.8 |
All other strings result in an error.
In particular, the special types nan and inf are rejected:
| "" | → | error |
| "nan" | → | error |
| "NA" | → | error |
| "inf" | → | error |
Strings (str)
Strings are unchanged:
| "" | → | "" |
| " " | → | " " |
| "nan" | → | "nan" |
| "03017" | → | "03017" |
| "NA" | → | "NA" |
Boolean (bool)
The following values are truthy:
| "True" | → | True |
| "true" | → | True |
| "TRUE" | → | True |
| "1" | → | True |
The following values are falsey:
| "False" | → | False |
| "false" | → | False |
| "FALSE" | → | False |
| "0" | → | False |
All other strings result in an error:
| "" | → | error |
| " false" | → | error |
| "nan" | → | error |
| "NA" | → | error |
Lists
The default delimiter is a comma, but this can be overridden:
| "01, 2" | List[int] | → | [1, 2] |
| " hello " | List[str] | → | [" hello "] |
| ", " | List[str] | → | ["", " "] |
Delimiter-quoting is not supported in lists: you cannot have a list item contain the delimiter.
| "','," | List[str] | → | ["'", "'", ""] |
Interpreting an empty string is ambiguous: it could mean missing data, the empty list, or a list containing the empty string. Therefore, we error:
| "" | List[*] | → | error |