This page is synchronized from doc/Trase-Database.md. Last modified on 2025-12-09 00:30 CET by Trase Admin.
Please view or edit the original file there; changes are reflected here after the midnight build (CET time),
or after manually triggering the build with a GitHub action.
Trase Database
Trase's main database is a PostgreSQL database hosted by Amazon Web Services (AWS). It is accessible to anybody with an internet connection and the right access credentials, and contains a wealth of information including, but not limited to:
- All of our SEI-PCS model results
- All of our map attributes
- All of our node attributes, such as ZDCs and so on
- Information on about half a million trading companies: where they operate, the various uncleaned names they are known by, other companies they are associated with, tax codes, etc.
- Database of country names in various languages and spellings, linked to their ISO code
- Conversion factors between various commodities, specific to the country or region where appropriate, along with the various commodity codes they are known by (NCM, HS6, etc.)
An overview of the database schema can be found at https://sei-international.github.io/.
- Key database concepts
- How to Access the Database
- 1. Not in a scripting environment
- 2. An SEI-PCS model on DeforestationFree
- 3. A Python script
- 4. An R script
- Which tables should I access?
- Supply Chains
- Spatial Metrics
- Traders
- Commodities and Equivalence Factors
- Commodities
- Commodity names
- Commodity tree
- Commodity codes
- Equivalence factors
- Equivalence groups
- Commodity ratios
- Changing equivalence factors
- Runbook
- Adding a new user and/or altering permissions
- Updating schema documentation
- Deleting a user
- Ingesting SEI-PCS results
Key database concepts
There are a few key concepts that are useful to know when working with our database.
- Data is accessed using SQL, a language used to query and manipulate tabular data. It is quite easy to get started with: see, for example, the Learn SQL course at Codecademy.
- The database is normalised. There are many resources on the internet that explain this concept, such as the Basic Concept of Database Normalization video on YouTube.
- Database views provide a convenient way to access the data. See the page on SQL Views at W3 Schools to understand what an SQL view is.
How to Access the Database
1. Not in a scripting environment
I would recommend using DBeaver or pgAdmin to explore and download the data. You will need database credentials: a username and password. For the host, use "trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com", and for the database name, "trase".
2. An SEI-PCS model on DeforestationFree
This is the easiest of all, since the connection credentials you need are already set up! You will need to put the following in your "preparation.py" file:

```python
from trase.tools import CUR
from trase.tools.etl.processors import SQLPreprocessor, Column


class Country(SQLPreprocessor):
    query = """
        select
            unnest(synonyms) as label,
            name
        from splitgraph.regions
        where region_type = 'COUNTRY'
    """
    columns = [Column("label"), Column("name")]
    outname = "countries.csv"
```
3. A Python script
First check whether the data you need is on splitgraph.com/trase. If so you can connect to it without any special credentials.
Otherwise, to connect to the Trase database you will first need credentials: a username and password. Then, there are two ways forwards: the quick way and the best way!
Once you have got a connection going, check out "Which tables should I access?".
The Quick Way
- If you are using PyCharm, create an environment variable called `PGPASS` with your password. If you are not in PyCharm, create a `.pgpass` file and put it there: the instructions are in README.md (search for "pgpass" in the page).
- Ensure you have the "pandas" and "psycopg2" libraries, e.g. `pip install pandas psycopg2`.
- Run the following code:

```python
import pandas as pd
import psycopg2

connection = psycopg2.connect(
    "host=trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com "
    "dbname=trase "
    "user=<myuser> "
)
df = pd.read_sql(
    "select synonyms, name from splitgraph.countries", connection
)
```
The Best Way
This is the best thing to do if you are working within this repository:
- Set up a virtual environment and configuration file
- Within the virtual environment, run:

```python
import pandas as pd

from trase.tools.pcs.connect import CNX

df = pd.read_sql(
    "select synonyms, name from splitgraph.countries", con=CNX.cnx
)
```
4. An R script
First check whether the data you need is on splitgraph.com/trase. If so you can connect to it without any special credentials.
Otherwise, get database credentials: a username and password.
- Create system environment variables `PGUSER` and `PGPASSWORD` to store your credentials.
- Ensure you have the "dplyr" and "RPostgres" libraries: `install.packages("dplyr")` and `install.packages("RPostgres")`.
- Run the following code (this is for the Trase database; adapt as needed if you are using Splitgraph):

```r
library(dplyr)
library(RPostgres)
library(dbplyr)
library(DBI)

connection <- dbConnect(
  Postgres(),
  host = "trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com",
  dbname = "trase"
)

data <- tbl(connection, in_schema("supply_chains_datasets", "brazil_soy_v2_6_0"))
head(data)
```
You can also list tables and views using dbListTables(connection).
Once you have got a connection going, check out "Which tables should I access?".
Which tables should I access?
Trase provides access to various "data views" intended for consumption by internal Trase users. Some of these views also provide the basis for data which is published to https://trase.earth.
In contrast to the normalized tables that are the source-of-truth of the data, these views are wholly- or partially-denormalised and are intended to be user-friendly.
Supply Chains
There is one table for each supply chain context in the PostgreSQL schema called "supply_chains_datasets".
This schema will contain the latest-ingested data for each supply chain version.
The naming format is supply_chains_datasets.country_commodity_vx_y_z.
For example:
```
supply_chains_datasets.brazil_palm_oil_v0_0_2
supply_chains_datasets.paraguay_corn_v1_0_0
supply_chains_datasets.argentina_corn_v0_2_3
supply_chains_datasets.indonesia_cocoa_v0_1_0
supply_chains_datasets.ecuador_shrimp_v1_0_1
# etc...
```
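As an illustration of the naming convention, it can be expressed as a small formatting function (this helper is hypothetical, not part of the trase package):

```python
def dataset_table_name(country: str, commodity: str, version: str) -> str:
    """Build a supply_chains_datasets table name.

    Follows the naming format supply_chains_datasets.country_commodity_vx_y_z.
    Illustrative only; not part of the trase package.
    """
    slug = f"{country}_{commodity}".lower().replace(" ", "_")
    version_slug = "v" + version.replace(".", "_")
    return f"supply_chains_datasets.{slug}_{version_slug}"


print(dataset_table_name("Brazil", "palm oil", "0.0.2"))
# supply_chains_datasets.brazil_palm_oil_v0_0_2
```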
Spatial Metrics
Spatial metrics are only available in the staging_trase_earth.spatial_metrics table, which is used by the website.
The data in this table is what is currently being displayed on https://staging.trase.earth: if you are looking for data which is not on the staging site, there is currently no view/table to easily extract it from the database. This table is also available from BigQuery - get in touch with us if you would prefer to use that.
Traders
Traders data is provided in the following views:
- main.trader_labels
- main.trader_groups
- views.f500
- views.rejected_name_matches
Commodities and Equivalence Factors
The database stores a list of commodities, HS codes, and equivalence factors, both global and region-specific, in the table views.commodities.
Here is an SQL example which selects the equivalence factors for a particular country, here Cocoa factors for Ivory Coast:
```sql
select commodity, eq_factor, product_type, hs_code
from views.commodities c1
where commodity = 'COCOA' and (
    -- select the ivory coast-specific commodity code
    location_trase_id = 'CI' or (
        -- or, if that doesn't exist, select the world code
        location_trase_id = 'WORLD' and not exists (
            select 1 from views.commodities c2
            where c1.hs_code = c2.hs_code and c2.location_trase_id = 'CI'
        )
    )
)
```
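The same country-then-WORLD fallback can be sketched in plain Python. The rows below (HS codes and factor values) are made up purely for illustration:

```python
# Each row: (commodity, hs_code, location_trase_id, eq_factor) - illustrative data
rows = [
    ("COCOA", "180100", "WORLD", 1.0),
    ("COCOA", "180100", "CI", 1.0),
    ("COCOA", "180500", "WORLD", 1.33),
]


def factors_for(rows, commodity, location):
    """Pick the location-specific row per HS code, falling back to WORLD."""
    # HS codes for which a location-specific row exists
    specific = {r[1] for r in rows if r[0] == commodity and r[2] == location}
    selected = []
    for r in rows:
        if r[0] != commodity:
            continue
        if r[2] == location:
            selected.append(r)
        elif r[2] == "WORLD" and r[1] not in specific:
            selected.append(r)
    return selected


for row in factors_for(rows, "COCOA", "CI"):
    print(row)
```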
For a code snippet in R see this example function.
Commodities
Supply chains are organized by context: a context is a country of production and a commodity, e.g. Brazil soy.
A given commodity comprises products derived from the same raw material. E.g. SOYBEANS, SOYBEAN CAKE and SOYBEAN OIL are all products of the SOY supply chain.
Data about commodities is contained in the following tables:
- main.commodities;
- main.commodity_names;
- main.commodity_code_values;
- main.commodity_codes;
- main.commodity_equivalence_factors;
- main.commodity_equivalence_refs;
- main.commodity_equivalence_groups;
- main.commodity_equivalence_group_factors;
- main.commodity_ratios.
The terminology can be a bit confusing, so the term "commodity" will be used to refer to a raw material and all its derived products (e.g. commodity "soy" will be used to mean products SOYBEANS + SOYBEAN CAKE + SOYBEAN OIL + SOY RESIDUE), and the term "product" will be used for specific primary or derived products (e.g. PALM OIL, REFINED PALM OIL).
Each product corresponds to one record in the table main.commodities.
The list of all commodities in the database can be obtained with the "trase CLI" command trase report commodities, and reports about specific commodities (and their products) can be generated with the command trase report commodity {commodity_name}.
Commodity names
Each product has a default name and any number of alternate names defined in table main.commodity_names.
Commodity tree
Each product has a "parent" (defined in main.commodities.parent_id), which is the product it is derived from.
Primary products have themselves as parent, and the name of the primary product is used as the name of the commodity.
For instance PALM OIL is the name of the palm oil product, and is a primary product, so it is also the name of the commodity:
```
[6] PALM OIL
 |
 |-- [105] REFINED PALM OIL

[1] SOY
 |
 |-- [2] SOYBEANS
 |
 |-- [100] SOYBEAN RESIDUE
 |
 |-- [3] SOYBEAN OIL
 |
 |-- [4] SOYBEAN CAKE
```
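The parent link can be followed in a few lines. The in-memory parents mapping below is a made-up stand-in for main.commodities.parent_id, using the palm oil ids from the tree:

```python
# Made-up in-memory stand-in for main.commodities.parent_id:
# PALM OIL (6) is its own parent; REFINED PALM OIL (105) points to PALM OIL.
parents = {6: 6, 105: 6}


def primary_product(commodity_id: int, parents: dict) -> int:
    """Follow parent_id until a product is its own parent (the primary product)."""
    while parents[commodity_id] != commodity_id:
        commodity_id = parents[commodity_id]
    return commodity_id


print(primary_product(105, parents))  # 6 (PALM OIL)
```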
Commodity codes
Trade datasets and production statistics sometimes identify products using Harmonized System (HS) codes. HS codes can have different lengths, with longer codes being more specific. Sometimes longer codes make it possible to differentiate between products (e.g. 1602 corresponds to meat preparations, 16023 is for meat preparations of poultry, 16024 is for meat preparations of pork), but sometimes the longer codes correspond to details that are not relevant to identifying the product, such as the packaging. For instance, HS6 code 160521 is for shrimp not in airtight containers, and code 160529 is for shrimp in airtight containers.
The database contains definitions for different codes (HS or other) in table main.commodity_codes.
Each product can have multiple codes attached in table main.commodity_code_values.
For instance, the following codes all correspond to the same product:
```
SOYBEANS
  FCL: 236
  HS4: 1201
  HS6: 120100, 120110, 120190
```
Because multiple codes can correspond to a single product, there is a loss of information when datasets are ingested into the database. For instance, a dataset of exports may contain shipments with HS6 codes 120100 and 120110, but once they are ingested in the database it is impossible to know which shipment had which code, as the only remaining information is that the product is SOYBEANS. This is a conscious design decision that simplifies the supply chains; if there is ever a need to preserve the distinction in the database, we would need to redefine the products differently.
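The many-codes-to-one-product mapping can be sketched as a plain dictionary (codes taken from the SOYBEANS example above; the shipment volumes and the lookup itself are illustrative):

```python
# Many codes map to one product, so code-level detail is lost on ingestion.
CODE_TO_PRODUCT = {
    "236": "SOYBEANS",     # FCL
    "1201": "SOYBEANS",    # HS4
    "120100": "SOYBEANS",  # HS6
    "120110": "SOYBEANS",
    "120190": "SOYBEANS",
}

shipments = [("120100", 500.0), ("120110", 250.0)]  # (code, tonnes) - made up
ingested = [(CODE_TO_PRODUCT[code], tonnes) for code, tonnes in shipments]
print(ingested)  # both shipments are now just SOYBEANS
```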
Equivalence factors
When products are obtained by transformation of a raw material, the mass of product is altered, due to addition or removal of material, or waste. In order to be able to compare quantities of products in a supply chain, we convert their weights to equivalent weights of primary product.
For instance, crushing of soybeans into oil and cake results in a waste of ~3%, so the weight of soybeans that was used to produce 1 ton of soybean cake was ~1.03 tons.
These equivalence factors are stored in table main.commodity_equivalence_factors; they relate quantities of derived products to quantities of primary product. Note that the primary product is always the "parent that has itself as a parent", and does not have to be specified.
Equivalence factors can be time-specific (with a time range defined with main.commodity_equivalence_factors.time_start and time_end) and/or location-specific (with the node_id of the location they apply to).
Each factor can be linked to a reference in table main.commodity_equivalence_refs, which gives details about the origin of the value.
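Using the soy crushing example above, converting a derived-product volume back to its raw (primary-product) equivalent is a single multiplication. The factor value here is the approximate one quoted above, for illustration only:

```python
# ~3% crushing loss means ~1.03 t of soybeans per 1 t of soybean cake.
EQ_FACTOR_SOYBEAN_CAKE = 1.03  # approximate, illustrative value


def raw_equivalent(volume_tonnes: float, eq_factor: float) -> float:
    """Convert a derived-product volume to its primary-product equivalent."""
    return volume_tonnes * eq_factor


print(raw_equivalent(1.0, EQ_FACTOR_SOYBEAN_CAKE))  # 1.03
```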
Equivalence groups
Different methods can be used to calculate equivalence factors. For instance, the factors can be calculated using processing loss, or calorific content of the product, etc.
In order to group sets of equivalence factors calculated using the same method, we define "commodity equivalence groups" in table main.commodity_equivalence_groups.
The factors that belong to a group are specified in table main.commodity_equivalence_group_factors.
Each group usually contains values for all products of the commodity, but there is a notable exception: the product SOYBEAN RESIDUE is most often not included in the trade data, so its volumes have to be inferred, which is achieved using the equivalence factors in group SOY CRUSHING WITH LOSS. A loss of ~3% is inferred, so derived products have equivalence factors of ~1.03, and the product SOYBEAN RESIDUE receives a NULL value as an equivalence factor in this group.
However, soy exports from Paraguay do include amounts of SOYBEAN RESIDUE, and therefore all soy products have equivalence factor 1.0 in equivalence group SOY LOSSLESS.
Commodity ratios
Some derived products have limits on the quantity that can be made from a given amount of primary product. For instance, when soybeans are crushed into soybean oil and soybean cake, ~20% of the weight is converted into oil, and ~80% into cake. We define the commodity ratios as the maximum fraction of the primary product that can be converted into derived product.
Commodity ratios are stored in table main.commodity_ratios.
They can be time-specific or location-specific when specified in columns time_start and time_end, or node_id, respectively.
Products that do not have such limits must also be present in the table, with NULL as their ratio.
Commodity ratios are attached to commodity equivalence groups, and the sum of ratios within a group must be equal to 1.
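A group's ratios can be sanity-checked in a few lines, assuming (as the NULL convention above suggests) that NULL/None ratios are excluded from the sum. The ratio values here are the approximate soy figures quoted above:

```python
# Ratios for a hypothetical soy crushing group: oil + cake sum to 1,
# and residue has no limit (NULL in the database, None here).
ratios = {"SOYBEAN OIL": 0.2, "SOYBEAN CAKE": 0.8, "SOYBEAN RESIDUE": None}


def ratios_are_valid(ratios: dict, tol: float = 1e-9) -> bool:
    """Check that the non-NULL ratios in a group sum to 1."""
    total = sum(r for r in ratios.values() if r is not None)
    return abs(total - 1.0) < tol


print(ratios_are_valid(ratios))  # True
```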
Changing equivalence factors
> [!CAUTION]
> In the database, the raw volumes and product volumes are recorded in main.flows.raw_vol and main.steps.volume, respectively. When flows datasets are ingested, the equivalence factor is retrieved from the specified equivalence group, used to calculate the raw volume, and recorded in main.steps.com_eq_factor_id. The equivalence group is also recorded in main.flows.com_eq_group_id.
>
> Therefore, it is not safe to directly modify the value of existing factors, as it would lead to mismatching ratios between main.flows.raw_vol and main.steps.volume between the datasets ingested before and after the change. We tried to add database constraints to make sure that flows.raw_vol = steps.volume * eq_factor, but it was a bit complicated so we gave up.
When new equivalence factors need to be added to the database, the following method should be used:
- Insert a new equivalence group (and make it default if needed).
- Insert new factors.
- Add the new factors to the new group.
- If you want to use these new equivalence factors in a flows dataset you will, as the caution note implies, need to re-ingest that dataset.
- Refresh the views.commodities view so that the new equivalence factors are available to users.
The following script gives an example of how a change in defaults factors can be safely performed:
```python
from trase.tools.pcs import (
    get_commodity_id,
    get_commodity_equivalence_group_id,
    insert_commodity_equivalence_group,
    get_commodity_equivalence_factor_id,
    add_factor_to_commodity_equivalence_group,
    insert_commodity_equivalence_reference,
    insert_commodity_equivalent,
)

raw_commodity_id = get_commodity_id("PAPAYA")
old_group_id = get_commodity_equivalence_group_id("OLD PAPAYA FACTORS")

# insert new equivalence group and make it default
new_group_id = insert_commodity_equivalence_group(
    name="NEW PAPAYA FACTORS",
    description="From International Association of Papaya Enthusiasts (IAPE)",
    commodity_id=raw_commodity_id,
    is_default=True,
)

# "reuse" existing factor and add to new group
factor_id = get_commodity_equivalence_factor_id(
    commodity_id=raw_commodity_id,
    commodity_equivalence_group_id=old_group_id,
)
add_factor_to_commodity_equivalence_group(
    commodity_equivalence_factor_id=factor_id,
    commodity_equivalence_group_id=new_group_id,
)

# insert new factor with new reference and add it to new group
commodity_id = get_commodity_id("DRIED PAPAYA")
ref_id = insert_commodity_equivalence_reference(
    title="IAPE DRY MATTER",
    description="Based on dry matter content of papaya products",
)
factor_id = insert_commodity_equivalent(
    commodity_id=commodity_id,
    eq_factor=1.6,
    ref_id=ref_id,
)
add_factor_to_commodity_equivalence_group(
    commodity_equivalence_factor_id=factor_id,
    commodity_equivalence_group_id=new_group_id,
)
```
Runbook
Adding a new user and/or altering permissions
Run the below query, replacing xxxx with a long, secure password:
```sql
create user my_user with encrypted password 'xxxx';
grant connect on database trase to my_user;
```
Depending on what you would like the user to access, you can then grant them certain roles. The most common is to allow the user to select from the de-normalised views and tables:

```sql
grant results_user to my_user;
```
Or, you can also grant specific permissions to database structures like schemas, tables, etc.
Updating schema documentation
Download SchemaSpy and the Java PostgreSQL driver, and run:
```
java -jar {path}/{to}/schemaspy-6.2.4.jar -t pgsql -db trase -u {user_name} -o {path}/{to}/TRASE/doc/db/main/ -host $TRASE_DB_HOST -dp {path}/{to}/postgresql-42.7.2.jar -p {password} -s main
```
This will re-generate the documentation locally, but it will not update https://sei-international.github.io/, since that is sourced from the repository https://github.com/sei-international/sei-international.github.io.
Deleting a user
In the snippet below, replace the_user with the name of the user to delete, and existing_user with a name of a user to transfer anything owned by the original user to.
```sql
REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA main FROM the_user;
REVOKE ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA main FROM the_user;
REVOKE ALL PRIVILEGES ON ALL FUNCTIONS IN SCHEMA main FROM the_user;
REVOKE ALL PRIVILEGES ON SCHEMA main FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA main REVOKE ALL ON SEQUENCES FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA main REVOKE ALL ON TABLES FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA main REVOKE ALL ON FUNCTIONS FROM the_user;
REVOKE USAGE ON SCHEMA main FROM the_user;

REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA spatial FROM the_user;
REVOKE ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA spatial FROM the_user;
REVOKE ALL PRIVILEGES ON ALL FUNCTIONS IN SCHEMA spatial FROM the_user;
REVOKE ALL PRIVILEGES ON SCHEMA spatial FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA spatial REVOKE ALL ON SEQUENCES FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA spatial REVOKE ALL ON TABLES FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA spatial REVOKE ALL ON FUNCTIONS FROM the_user;
REVOKE USAGE ON SCHEMA spatial FROM the_user;

REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA views FROM the_user;
REVOKE ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA views FROM the_user;
REVOKE ALL PRIVILEGES ON ALL FUNCTIONS IN SCHEMA views FROM the_user;
REVOKE ALL PRIVILEGES ON SCHEMA views FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA views REVOKE ALL ON SEQUENCES FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA views REVOKE ALL ON TABLES FROM the_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA views REVOKE ALL ON FUNCTIONS FROM the_user;
REVOKE USAGE ON SCHEMA views FROM the_user;

REASSIGN OWNED BY the_user TO existing_user;
REVOKE ALL ON DATABASE trase FROM the_user;
DROP USER the_user;
```
Ingesting SEI-PCS results
To begin with, you will need to locate the CSV files on S3.
For example, s3://trase-storage/indonesia/palm_oil/sei_pcs/v1.2.2/SEIPCS_INDONESIA_PALM_OIL_2021_KM_0_MGD_30_POST_EMBEDDING.csv.
You will then need Python scripts indicating how the data should be ingested.
These are located in GitHub under trase/runbook/<country>/<commodity>/trade/, for example trase/runbook/indonesia/palm_oil/trade/.
In that folder there should be a script called a_ingest_raw_dataset.py, which indicates how the CSV files from S3 should be ingested into the database.
If one exists, modify it; otherwise, create one by copying a script of the same name from another country/commodity.