Brazil soy sourcing analysis

Can time-lagged Trase data on Brazil soy be used for scope three emissions assessments?

Harry Biddle, 10th April 2024

This notebook explores the possibility of using Trase Brazil soy data to perform scope three emissions assessments for a company when that data is a few years older than the year in which the assessment is being done. For example, using 2022 Trase data to perform an emissions assessment in 2025. A higher-level summary of this notebook is in Google Docs.

We interpret the company to be an "exporter group" in Trase: that is, a collection of companies which act in the role of exporters/shippers of Brazil soy.

The data used in this notebook is as follows:

  • Brazil soy Trase data for 2020 and earlier is taken from the database, specifically the supply_chains_datasets.brazil_soy_v2_6_0 view
  • Brazil soy Trase data for 2021 and 2022 is taken from S3, specifically the files s3://trase-storage/brazil/soy/sei_pcs/v2.6.1/SEIPCS_BRAZIL_SOY_%YEAR%.csv.
  • Municipality-level emissions and production data is from s3://trase-storage/brazil/soy/indicators/out/q4_2023/soy_Net_and_Gross_ghg_deforestation_2013_2022.csv.

When we read the Brazil soy Trase data we exclude:

  • Domestic trade
  • The unknown exporter group
  • Exports for which the municipality of production is unknown

To convert this notebook into HTML output that strips out code cells, use the following command:

jupyter nbconvert --no-input --to html brazil_soy_sourcing_analysis.ipynb
import pandas as pd
from trase.tools import sps
from trase.tools import CNX
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import geopandas as gpd
import requests
from pandas.core.dtypes.common import is_numeric_dtype
from IPython.display import display
import plotly.io as pio
pio.renderers.default = "png"
############################################################
# SEI-PCS model
############################################################

# brazil soy data (2020 and earlier) is from the database
df_earlier = pd.read_sql(
    """
    select
        year,
        municipality_of_production_trase_id,
        exporter_group,
        sum(volume) as volume
    from supply_chains_datasets.brazil_soy_v2_6_0
    where
        exporter_group != 'UNKNOWN' and
        exporter_group != 'PROCESSED DOMESTICALLY' and
        municipality_of_production_trase_id != 'BR-XXXXXXX' and
        exporter_trase_id != 'BR-TRADER-XXXXXXXX'
    group by 1, 2, 3
    """,
    CNX.cnx,
)

# brazil soy data (2021 and later) is from S3
df_new = (
    sps.concat(
        sps.get_pandas_df_once(
            f"brazil/soy/sei_pcs/v2.6.1/SEIPCS_BRAZIL_SOY_{year}.csv",
            dtype=str,
            na_filter=False,
            usecols=["YEAR", "VOLUME_RAW", "LVL6_TRASE_ID_PROD", "EXPORTER_GROUP", "EXPORTER_TRASE_ID"],
        )
        for year in [2021, 2022]
    )
    .rename(
        columns={
            "YEAR": "year",
            "VOLUME_RAW": "volume",
            "LVL6_TRASE_ID_PROD": "municipality_of_production_trase_id",
            "EXPORTER_GROUP": "exporter_group",
            "EXPORTER_TRASE_ID": "exporter_trase_id",
        }
    )
    .astype({"year": int, "volume": float})
)
df_new = df_new[~df_new["exporter_group"].isin(["UNKNOWN", "PROCESSED DOMESTICALLY", "NONE"])]
df_new = df_new[df_new["municipality_of_production_trase_id"] != "BR-XXXXXXX"]
df_new = df_new[df_new["exporter_trase_id"] != "BR-XXXXXXX"]
df_new["volume"] /= 1_000  # convert kilograms to tons
df_new_volumes = sps.consolidate(df_new, ["volume"], ["year", "municipality_of_production_trase_id", "exporter_group"])

# combine
df_sei_pcs = sps.concat([df_earlier, df_new_volumes])

############################################################
# SEI-PCS MDIC model
############################################################

df_sei_pcs_mdic = (
    sps.concat(
        sps.get_pandas_df_once(
            f"brazil/soy_mdic/sei_pcs/v1.0.0/SEIPCS_BRAZIL_SOY_MDIC_{year}.csv",
            dtype=str,
            na_filter=False,
            usecols=["YEAR", "VOLUME_RAW", "LVL6_TRASE_ID_PROD", "EXPORTER_GROUP", "EXPORTER_TRASE_ID"],
        )
        for year in range(2004, 2022 + 1)
    )
    .rename(
        columns={
            "YEAR": "year",
            "VOLUME_RAW": "volume",
            "LVL6_TRASE_ID_PROD": "municipality_of_production_trase_id",
            "EXPORTER_GROUP": "exporter_group",
            "EXPORTER_TRASE_ID": "exporter_trase_id",
        }
    )
    .astype({"year": int, "volume": float})
)
# df_sei_pcs_mdic = df_sei_pcs_mdic[~df_sei_pcs_mdic["exporter_group"].isin(["UNKNOWN", "PROCESSED DOMESTICALLY", "NONE", ""])]
# df_sei_pcs_mdic = df_sei_pcs_mdic[df_sei_pcs_mdic["municipality_of_production_trase_id"] != "BR-XXXXXXX"]
# df_sei_pcs_mdic = df_sei_pcs_mdic[df_sei_pcs_mdic["exporter_trase_id"] != "BR-XXXXXXX"]
# df_sei_pcs_mdic["volume"] /= 1_000  # convert kilograms to tons
# df_sei_pcs_mdic = sps.consolidate(df_sei_pcs_mdic, ["volume"], ["year", "municipality_of_production_trase_id", "exporter_group"])

############################################################
# example exporter groups
############################################################

df_groups = pd.read_sql(
    """
    select distinct
        exporter_group,
        exporter
    from supply_chains_datasets.brazil_soy_v2_6_0
    """,
    CNX.cnx,
)

############################################################
# CNPJs per exporter group
############################################################

df_earlier_cnpjs = pd.read_sql(
    """
    select distinct
        exporter_group,
        exporter_trase_id
    from supply_chains_datasets.brazil_soy_v2_6_0
    where
        exporter_group != 'UNKNOWN' and
        exporter_group != 'PROCESSED DOMESTICALLY' and
        municipality_of_production_trase_id != 'BR-XXXXXXX' and
        exporter_trase_id != 'BR-TRADER-XXXXXXXX'
    """,
    CNX.cnx,
)
df_later_cnpjs = df_new[["exporter_group", "exporter_trase_id"]].drop_duplicates()
df_cnpjs = sps.concat([df_earlier_cnpjs, df_later_cnpjs]).drop_duplicates()
df_cnpjs = df_cnpjs.groupby("exporter_group")["exporter_trase_id"].agg("count").rename("number_cnpjs").reset_index()

############################################################
# emissions and production
############################################################

df_emissions = (
    sps.get_pandas_df_once(
        "brazil/soy/indicators/out/q4_2023/soy_Net_and_Gross_ghg_deforestation_2013_2022.csv",
        dtype=str,
        na_filter=False,
        usecols=["TRASE_ID", "YEAR", "TN", "SOY_GrossDEF_NORM"],
    )
    .rename(
        columns={
            "TRASE_ID": "municipality_of_production_trase_id",
            "YEAR": "year",
            "SOY_GrossDEF_NORM": "emissions",
            "TN": "production",
        },
        errors="raise",
    )
    .astype({"emissions": float})
)
df_emissions["year"] = df_emissions["year"].str.slice(1).astype(int)

# not sure what NA production means but I'm just going to assume it's 0 for now!
df_emissions["production"] = df_emissions["production"].replace("NA", "0").astype(float)

Here are three random rows of the Trase Brazil soy data that we have loaded. We are interested only in exporter groups and their sourcing patterns: that is, which municipalities Trase has identified as producing the soy that gets exported, and in what volumes it is sourced. Volumes are in metric tonnes of soybean equivalent:

df_sei_pcs.sample(3)

Note that we cluster the model results into exporter groups based on the name of the group (e.g. "COFCO"). This means that our grouping is very dependent on how much data cleaning has been done. Larger exporter groups will have been extensively cleaned, so that, for example, all of the exporters associated with Bunge are under the "BUNGE" exporter group. However, smaller exporter groups will not have been cleaned, and will often consist of only one exporter.

Here are a few random examples to illustrate this:

exporters_per_group = df_groups.groupby("exporter_group")["exporter"].agg(list)

random_exporter_groups = set(exporters_per_group.sample(2).index.values)
random_exporter_groups.add("BUNGE")
for exporter_group in random_exporter_groups:
    exporters, = exporters_per_group[exporters_per_group.index == exporter_group].values
    print(f"Exporters associated with exporter group {exporter_group}:")
    for exporter in exporters:
        print(f"  - {exporter}")
    print("")
from trase.tools import (
    find_label,
    get_country_id,
    get_label_trader_id,
    get_node_name,
    get_trader_group_id,
    uses_database,
)
from trase.tools.aws.aws_helpers_cached import get_pandas_df_once
from trase.tools.aws.metadata import write_csv_for_upload
from trase.tools.pandasdb.find import (
    find_default_name_by_node_id,
    find_economic_blocs_by_trase_id,
    find_traders_and_groups_by_label,
    find_traders_and_groups_by_trase_id,
)
from trase.tools.utilities.helpers import clean_string

# `sql` below is assumed to come from psycopg2 (it provides sql.Literal);
# YEAR (the year of trade data being cleaned) must be set before calling
# clean_exporters
from psycopg2 import sql

YEAR = ...  # placeholder: set to the year being cleaned before running

@uses_database
def clean_exporters(df, cur=None, cnx=None):
    """
    TODO: try to do this more concisely / in fewer lines of code
    """
    trase_ids = "BR-TRADER-" + df["exporter.cnpj"].str.slice(0, 8)
    trase_ids = trase_ids.replace({"BR-TRADER-00000000": None})
    df = df.assign(**{"exporter.trase_id": trase_ids})
    df_exporters = df[["exporter.label", "exporter.trase_id"]].drop_duplicates()

    # clean exporter names using trase id
    df_exporters[
        ["exporter.trader_id", "exporter.group", "count"]
    ] = find_traders_and_groups_by_trase_id(
        df_exporters.rename(columns={"exporter.trase_id": "trase_id"})[["trase_id"]],
        returning=["trader_id", "group_name", "count"],
        year=sql.Literal(YEAR),
        cur=cur,
        cnx=cnx,
    )
    counts = df_exporters.pop("count")
    assert all(counts.isin([0, 1]))
    not_found_by_trase_id = counts == 0
    print(
        f"{sum(~not_found_by_trase_id)} exporters were found by Trase ID and "
        f"{sum(not_found_by_trase_id)} were not"
    )
    df_found_by_trase_id = df_exporters[~not_found_by_trase_id]
    df_missing = df_exporters[not_found_by_trase_id].copy()

    # if not found by Trase ID, then look by name
    labels = df_missing["exporter.label"].drop_duplicates()
    df_labels = pd.DataFrame(labels)
    df_labels[
        ["exporter.trader_id", "exporter.group", "count"]
    ] = find_traders_and_groups_by_label(
        df_labels.rename(columns={"exporter.label": "trader_label"}),
        returning=["trader_id", "group_name", "count"],
        year=sql.Literal(YEAR),
    )

    # special case for UNKNOWN CUSTOMER
    is_unknown = (df_labels["count"] != 1) & (
        df_labels["exporter.label"] == "UNKNOWN CUSTOMER"
    )
    if any(is_unknown):
        brazil_id = get_country_id("BRAZIL", cur=cur)
        label_id = find_label("UNKNOWN CUSTOMER", cur=cur)
        trader_id = get_label_trader_id(label_id, brazil_id)
        group_id = get_trader_group_id(trader_id, cur=cur)
        group_name = get_node_name(group_id, cur=cur)
        df_labels.loc[is_unknown, "exporter.trader_id"] = trader_id
        df_labels.loc[is_unknown, "exporter.group"] = group_name
        df_labels.loc[is_unknown, "count"] = 1

    # we should have found one unique node for every exporter
    bad = df_labels.pop("count") != 1
    if any(bad):
        raise ValueError(f"Missing some exporters:\n{df_labels[bad]}")

    # merge exporters found by trase id back into results
    right = df_found_by_trase_id[
        ["exporter.trase_id", "exporter.trader_id", "exporter.group"]
    ].drop_duplicates()
    df1 = pd.merge(
        df,
        right,
        on=["exporter.trase_id"],
        how="left",
        validate="many_to_one",
        indicator=True,
    )
    merge = df1.pop("_merge")
    df_solved1 = df1[merge == "both"]

    # merge exporters found by label back into results
    df_unsolved = df1[merge != "both"]
    df_unsolved = df_unsolved.drop(
        columns=["exporter.trader_id", "exporter.group"], errors="raise"
    )

    right = df_labels[
        ["exporter.label", "exporter.trader_id", "exporter.group"]
    ].drop_duplicates()
    df_solved2 = pd.merge(
        df_unsolved,
        right,
        on=["exporter.label"],
        how="left",
        validate="many_to_one",
        indicator=True,
    )
    merge = df_solved2.pop("_merge")
    assert all(merge == "both")

    # combine the two
    expected_columns = list(set(df.columns) | {"exporter.trader_id", "exporter.group"})
    assert sorted(df_solved2.columns) == sorted(expected_columns)
    assert sorted(df_solved1.columns) == sorted(expected_columns)
    df_final = pd.concat([df_solved1, df_solved2]).reset_index(drop=True)

    # guarantee that we didn't change the original data
    a = df.sort_values(list(df.columns)).reset_index(drop=True)
    b = df_final[df.columns].sort_values(list(df.columns)).reset_index(drop=True)
    b.columns.name = a.columns.name  # align the column index name so assert_frame_equal passes
    pd.testing.assert_frame_equal(a, b)

    # add exporter names
    df_final = df_final.astype({"exporter.trader_id": int})
    df_final[["exporter.name"]] = find_default_name_by_node_id(
        df_final[["exporter.trader_id"]].rename(
            columns={"exporter.trader_id": "node_id"}
        ),
        returning=["name"],
        cnx=cnx,
        cur=cur,
    )

    return df_final

Here are three random rows of the emissions and production data. We have emissions and production figures for each municipality and year:

interesting = (df_emissions["production"] > 0) & (df_emissions["emissions"] > 0)
df_emissions[interesting].sample(3)

How much does Trase data vary over time?

To start with, we exclude the emissions data and focus on exported volume. We would like to investigate whether the sourcing pattern of companies varies a lot between years. By sourcing pattern, we mean the distribution of volume across municipalities of production. To be able to make a better comparison across years, we normalise volumes to the company's total export in that year.

The sourcing pattern of a company X in year Y is the percentage of soy sourced from each municipality in that year, where 100% represents the total soy export of company X in year Y from known municipalities. We are interested in this because it gives us a way to compare a typical ton of a company's soy across different years. We can say that in 2020 Bunge sourced, say, 20% from municipality A and 80% from municipality B, yet in 2022 they might have sourced only 15% from municipality A.
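
To make the definition concrete, here is a minimal sketch with made-up numbers matching the Bunge example above (the municipality IDs and volumes are hypothetical):

# toy trade flows for one exporter group across two years (volumes in tonnes)
toy = pd.DataFrame({
    "year": [2020, 2020, 2022, 2022],
    "exporter_group": ["BUNGE"] * 4,
    "municipality_of_production_trase_id": ["BR-A", "BR-B", "BR-A", "BR-B"],
    "volume": [200.0, 800.0, 150.0, 850.0],
})

# normalise each year's volumes to that year's total export for the group
total = toy.groupby(["year", "exporter_group"])["volume"].transform("sum")
toy["percent"] = 100 * toy["volume"] / total
print(toy)  # 20%/80% in 2020, 15%/85% in 2022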

Below, we print an example sourcing pattern of a random exporter group in a random year.

Also bear in mind the exclusion criteria at the top of this notebook: we excluded trades with unknown municipality of production. So we are looking only at the "known" sourcing pattern of each company.

# add sourcing pattern to SEI-PCS data
df = df_sei_pcs.copy()
df["yearly_exporter_volume"] = df.groupby(["year", "exporter_group"])["volume"].transform("sum")
df["percent"] = 100 * df["volume"] / df["yearly_exporter_volume"]

# pick a random year
year, = df["year"].sample(1)
year = 2014  # override the random pick with a fixed example for reproducibility
data = df[df["year"] == year].copy()

# discard very tiny percentages to avoid 0's
data = data[data["yearly_exporter_volume"] > 100]

# pick a random exporter group with between 3 and 7 municipalities
groups = data["exporter_group"].value_counts()
exporter_group, = groups[(groups > 2) & (groups < 8)].sample(1).index
exporter_group = "GLENCORE"  # override the random pick with a fixed example
data = data[data["exporter_group"] == exporter_group]

with sps.pandas_display_context(float_format=sps.format_float, rows=10):
    df_ = data
    df_ = df_.set_index(["year", "exporter_group", "municipality_of_production_trase_id"])
#     df_ = df_[["volume", "yearly_exporter_volume", "percent"]]
    display(df_)

Now we ask: if we compare a company's sourcing pattern from one year to the next, how much overlap is there? To calculate this, for each exporter group we take two years of data and look at each municipality in turn. We have a value for the first year and a value for the second; we take the minimum of the two. This represents the overlap of the sourcing from that municipality. Adding that up across Brazil gives us a number for the total overlap.

For example, suppose a company only sources from two municipalities, A and B:

Municipality    Sourcing (Year 1)   Sourcing (Year 2)   Overlap
A               80%                 70%                 70% = min(80, 70)
B               20%                 30%                 20% = min(20, 30)
Total overlap   -                   -                   90% = 70 + 20
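
As a quick sanity check, this toy example can be reproduced in a couple of lines (a minimal sketch, separate from the full calculate_overlap function defined below):

# sourcing patterns for the toy example above (%)
year1 = {"A": 80.0, "B": 20.0}
year2 = {"A": 70.0, "B": 30.0}

# total overlap = sum over municipalities of min(percent in year 1, percent in year 2)
overlap = sum(min(year1[m], year2.get(m, 0.0)) for m in year1)
print(overlap)  # 90.0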

We get a single number (90%) representing the overlap in the sourcing pattern of this exporter group between year one and year two. We repeat this procedure for every exporter group and every pair of subsequent years (2010/11, 2011/12, etc), giving us a lot of numbers. The box plot below summarises the distribution of those numbers, giving us a picture of how much companies' sourcing patterns typically overlap over time.

We are interested not only in a year's lag but also what happens when that lag increases. What if we want to do a scope 3 emissions analysis in 2025 but only have data from 2022? Therefore, the chart below also includes lags of two years (2010/12, 2011/13, ...) and three years (2010/13, 2011/14, ...).

Additionally, since there was a change in methodology for Trase's Brazil soy model between 2017 and 2018, we exclude any comparisons that cross that time boundary.

Bear in mind that we are only looking at change at the level of municipalities. Some municipalities cover a large area, some a small one; they are not evenly distributed over Brazil. Additionally, we have no measure of the "distance" that soy moves: moving to a neighbouring municipality counts the same as moving to a municipality on the other side of the country.

# Define a function to calculate the overlap - we'll use this again later!
def calculate_overlap(
    df: pd.DataFrame, source_column: str, lag: int = 1
) -> pd.DataFrame:
    """
    This function expects a dataframe with rows that look like this:

        year                                        2005
        [source_column]                       BR-5005251
        exporter_group                             BUNGE
        percent                                 1.342955

    It will return the following dataframe:

        exporter_group  COOPERATIVA AGRARIA AGROINDUSTRIAL
        year                                          2022
        percent_stayed                           99.976574

    """

    # check the dataframe has expected columns
    expected_columns = {"year", "exporter_group", "percent", source_column}
    missing_columns = expected_columns - set(df.columns)
    assert not missing_columns, f"Missing columns: {missing_columns}"

    # check that percent and year are numeric
    assert is_numeric_dtype(df["year"].dtype)
    assert is_numeric_dtype(df["percent"].dtype)

    # we are going to group the data frame by exporter group and municipality of production
    # in each group, we will compare the value at the current year with the value at the earlier year,
    # by which we mean year - lag.

    # for each row, add the year that we will compare to
    # if the earlier year crosses the 2017/2018 boundary, then set it to null
    df_ = df.copy()
    current_year = df_["year"]
    earlier_year = df_["year"] - lag
    bad_comparison = (earlier_year <= 2017) & (current_year >= 2018)
    df_["earlier_year"] = np.where(bad_comparison, np.nan, earlier_year)

    # merge in the percent value - if any - from the previous year
    # we do an inner merge, which means that rows which don't have a comparison are dropped
    group_by = ["exporter_group", source_column]
    df_ = pd.merge(
        df_,
        df_[[*group_by, "year", "percent"]].rename(
            columns={"year": "earlier_year", "percent": "earlier_percent"}
        ),
        on=[*group_by, "earlier_year"],
    )

    # take the minimum, representing how much percent stayed in the municipality
    df_["percent_stayed"] = np.minimum(df_["percent"], df_["earlier_percent"])

    # add up this value for every exporter_group and year
    df_ = sps.consolidate(df_, ["percent_stayed"], ["exporter_group", "year"])

    return df_

# do the calculation for different lags
dfs = {
    lag: calculate_overlap(df_sei_pcs, "municipality_of_production_trase_id", lag)
    for lag in [1, 2, 3]
}

# plot it
df_overlaps = sps.concat(df.assign(lag=n) for (n, df) in dfs.items())
px.box(
    df_overlaps,
    y="percent_stayed",
    color="lag",
    labels={"lag": "Time lag in years", "percent_stayed": "Overlap (%)"},
    title="Typical overlap in an exporter group's sourcing pattern for different lags<br /><sub>Comparing each exporter group's sourcing pattern for each year with the sourcing pattern in the 1/2/3 years before</sub>",
)

Here are the median values expressed in a table:

with sps.pandas_display_context(float_format=sps.format_float):
    series_ = df_overlaps.groupby("lag")["percent_stayed"].median()
    series_ = series_.rename("Median overlap (%)")
    series_.index = series_.index.rename("Time lag in years")
    df_ = series_.reset_index()
    df_.index = [""] * len(df_)  # hide index
    display(df_)

The above results demonstrate that if you take a typical exporter group, you would expect 82% of their known sourcing pattern to be the same one year to the next. If you increase the gap to three years, then 77% of their sourcing pattern would be unchanged.

# # This cell is used to test the correctness of the above calculation by choosing a random row
# # from the results, and printing the source data. By manually doing the calculation on the
# # source data we can see if the results are reproduced.

# random_row = data.sample(1)
# exporter_group = random_row["exporter_group"].values[0]
# year = random_row["year"].values[0]
# lag = random_row["lag"].values[0]

# print("A random row from our results:")
# display(random_row)

# print("The source data:")
# is_previous = (df_top["exporter_group"] == exporter_group) & (df_top["year"] == year)
# is_current = (df_top["exporter_group"] == exporter_group) & (df_top["year"] == year - lag)
# pd.merge(
#     df_top[is_previous][["municipality_of_production_trase_id", "percent"]],
#     df_top[is_current][["municipality_of_production_trase_id", "percent"]],
#     on=["municipality_of_production_trase_id"],
#     suffixes=("_previous", "_current"),
#     how="outer",
# )

The effect of exporter group size

The above analysis is run for all exporter groups which appear in subsequent years: about 500 in total.

However, we hypothesise that the size of the exporter group makes a big difference to how well the sourcing pattern overlaps. The CNPJ that an exporter is listed with in the customs/shipment data is very often used to locate processing assets, and if an exporter group changes CNPJs often, the sourcing pattern will change too. This may happen more or less with bigger exporter groups, usually because data cleaning efforts to associate CNPJs with exporter groups have been focussed on the larger ones.

We use two heuristics for size of an exporter group:

  1. Yearly export volume, split into quartiles, where the 1st Quartile is the smallest 25% of exporters and the 4th Quartile the largest 25%.
  2. The number of CNPJs that the exporter group has listed in customs/shipment records over the whole time series.

In the graphs below we show the same data as the previous box plot, except broken down by these two heuristics. We only show a time lag of one year (i.e. comparing a sourcing pattern to that of the previous year).

# add export volume for each year
data = pd.merge(
    df_overlaps,
    sps.consolidate(df_sei_pcs, ["volume"], ["exporter_group", "year"]),
    on=["exporter_group", "year"],
    validate="many_to_one",
    how="left",
    indicator=True
)
assert all(data.pop("_merge") == "both")

# divide up the data points into four equal groups based on
# the quartiles of the volume
v = data["volume"]
bins = [v.quantile(0), v.quantile(0.25), v.quantile(0.5), v.quantile(0.75), v.quantile(1)]
data["volume_bin"] = pd.cut(v, bins, labels=["1st Quartile", "2nd Quartile", "3rd Quartile", "4th Quartile"])

# merge in CNPJ count
data = pd.merge(
    data,
    df_cnpjs,
    on="exporter_group",
    validate="many_to_one",
    how="left",
    indicator=True,
)
assert all(data.pop("_merge") == "both")

# focus only on a time lag of 1 year
data = data[data["lag"] == 1]

# -------------------------------------------------------- #
# Overlap broken down by CNPJ count
# -------------------------------------------------------- #

# As box plot
fig = px.box(
    data,
    x="number_cnpjs",
    y="percent_stayed",
    labels={
        "number_cnpjs": "Number of CNPJs in exporter group",
        "percent_stayed": "Overlap (%)",
    },
    title="Typical overlap in an exporter group's sourcing pattern<br /><sub>Comparing each exporter group's sourcing pattern for each year with the sourcing pattern in the year before</sub>",
)
fig.show()

# As table
with sps.pandas_display_context(float_format=sps.format_float):
    series_ = data.groupby("number_cnpjs")["percent_stayed"].median()
    series_ = series_.rename("Median overlap (%)")
    series_.index = series_.index.rename("Number of CNPJs in exporter group")
    df_ = series_.reset_index()
    df_.index = [""] * len(df_)  # hide index
    display(df_)

# -------------------------------------------------------- #
# Overlap broken down by volume quartile
# -------------------------------------------------------- #

# As box plot
fig = px.box(
    data,
    x="volume_bin",
    y="percent_stayed",
    labels={
        "volume_bin": "Yearly export volume",
        "percent_stayed": "Overlap (%)",
    },
    title="Typical overlap in an exporter group's sourcing pattern<br /><sub>Comparing each exporter group's sourcing pattern for each year with the sourcing pattern in the year before</sub>",
)
fig.show()

# As table
with sps.pandas_display_context(float_format=sps.format_float):
    series_ = data.groupby("volume_bin")["percent_stayed"].median()
    series_ = series_.rename("Median overlap (%)")
    series_.index = series_.index.rename("Yearly export volume")
    df_ = series_.reset_index()
    df_.index = [""] * len(df_)  # hide index
    display(df_)

This demonstrates that exporter group size is indeed very important when considering how much sourcing patterns overlap over time. For the biggest exporter groups, we only see around 40% overlap in the sourcing pattern from one year to the next.

Dividing Brazilian municipalities into "high emissions" and "low emissions"

Now we bring in the additional concept of emissions. When assessing suitability for a scope three emissions analysis, it is important to know not only how much the sourcing pattern is changing, but also whether those changes are significant for the emissions footprint. It might be that the changes happen on high-emissions frontiers.

Our emissions data is municipality-level. However, there are over 5,000 municipalities! To make the analysis simpler, we would like to group them into "high" and "low" emissions municipalities. This is done for each year with the following algorithm:

  1. Calculate the total production of soy in Brazil for that year
  2. If a municipality produces less than 0.001% of Brazil's total soy production, designate it "low emissions"
  3. For all remaining municipalities (which we call "soy-producing municipalities") calculate the emissions per ton of soy: the total emissions of the municipality divided by the total production of the municipality.
  4. Take the average of this value across all soy-producing municipalities. Designate a soy-producing municipality "high emissions" if the emissions per ton of soy is above the average, otherwise designate it "low emissions".

# the "production cutoff": a municipality must produce more than 0.001% of yearly Brazil soy
production_cutoff_percent = 0.001
groups = df_emissions.groupby(["municipality_of_production_trase_id", "year"])
production = groups["production"].transform("sum")
total_production = df_emissions.groupby("year")["production"].transform("sum")
cutoff = total_production * (production_cutoff_percent / 100)
has_sufficient_production = production > cutoff

# the "emissions cutoff": among municipalities passing the production filter,
# emissions per ton of soy must be above the average
d = df_emissions[has_sufficient_production]
filtered_groups = d.groupby(["municipality_of_production_trase_id", "year"])
mean_emission_per_production = (filtered_groups["emissions"].sum() / filtered_groups["production"].sum()).mean()

emissions = groups["emissions"].transform("sum")
emission_per_production = emissions / production
emissions_above_mean = emission_per_production > mean_emission_per_production

df_emissions_categorised = df_emissions.copy()
df_emissions_categorised["produces_sufficient_soy"] = has_sufficient_production
df_emissions_categorised["above_average_emissions"] = emissions_above_mean
df_emissions_categorised["is_high_emissions"] = has_sufficient_production & emissions_above_mean

Here is a decision tree demonstrating the algorithm and how many municipalities fall into each category in 2022:

from graphviz import Digraph
from IPython.display import Image, display  # avoid shadowing display() imported earlier

dot = """
digraph G {
  all[label="All municipalities\n5,572"]
  production[label="Produces more than 0.001%\nof yearly Brazil soy?", shape=diamond]
  emissions[label="Above average \nemissions-per-ton-soy?", shape=diamond]
  high[label="High emissions\n378"]
  low[label="Low emissions\n5,194"]

  all -> production
  production -> emissions[label="Yes\n 2,051"]
  production -> low[label="No\n 3,521"]
  emissions -> high[label="Yes\n 378"]
  emissions -> low[label="No\n 1,673"]
}
"""

data = df_emissions_categorised[df_emissions_categorised["year"] == 2022].copy()
counts = data.groupby(["is_high_emissions", "produces_sufficient_soy", "above_average_emissions"])["year"].count().rename("count").to_dict()

produces_sufficient_soy_yes = sum(count for (key, count) in counts.items() if key[1])
produces_sufficient_soy_no = sum(count for (key, count) in counts.items() if not key[1])
high_emissions = sum(count for (key, count) in counts.items() if key[0])
low_emissions = sum(count for (key, count) in counts.items() if not key[0])
above_average_emissions_per_ton_yes = sum(count for (key, count) in counts.items() if key[1] and key[2])
above_average_emissions_per_ton_no = sum(count for (key, count) in counts.items() if key[1] and not key[2])

graph = Digraph()
graph.attr(rankdir='TD')
graph.attr('node', shape='box')

graph.node('all', f'All municipalities\n{sum(counts.values()):,d}')
graph.node('production', f'Produces more than {production_cutoff_percent}%\nof yearly Brazil soy?', shape="diamond")
graph.node('emissions', 'Above average\nemissions-per-ton-soy?', shape="diamond")
graph.node('high', f'High emissions\n{high_emissions:,d}')
graph.node('low', f'Low emissions\n{low_emissions:,d}')

graph.edge('all', 'production')
graph.edge('production', 'emissions', label=f'Yes\n{produces_sufficient_soy_yes:,d}')
graph.edge('production', 'low', label=f'No\n{produces_sufficient_soy_no:,d}')
graph.edge('emissions', 'high', label=f'Yes\n{above_average_emissions_per_ton_yes:,d}')
graph.edge('emissions', 'low', label=f'No\n{above_average_emissions_per_ton_no:,d}')

graph.render('graph', format='png', cleanup=True)
display(Image(filename='graph.png'))

Another way to visualise the algorithm is a scatter graph with emissions on the y-axis, production on the x-axis, and municipalities as dots. High-emissions municipalities sit in the upper region of the scatter plot, beyond an x-axis cutoff on production (note: both axes are log-scale, and the graph shows 2022 data):

data = df_emissions_categorised[df_emissions_categorised["year"] == 2022]
assert not any(data["municipality_of_production_trase_id"].duplicated())
fig = px.scatter(
    data,
    x="production",
    y="emissions",
    color="is_high_emissions",
    log_x=True,
    log_y=True,
    title="Emissions and soy production of Brazilian municipalities<br /><sup>In 2022. Both axes are log-scale</sup>"
)
fig.show()

Finally, it is interesting to look at where the high-emissions municipalities are located in Brazil. They mostly reside in the Cerrado savanna in the centre of Brazil and the Pampas grasslands in the south:

response = requests.get("https://raw.githubusercontent.com/eupimenta/geoJSON/master/BRMUE250GC_SIR.geojson")
response.encoding = "utf-8-sig"
data = df_emissions_categorised[df_emissions_categorised["year"] == 2022].copy()
data["geocode"] = data["municipality_of_production_trase_id"].str.slice(3)
fig = px.choropleth(
    data,
    geojson=response.json(),
    locations="geocode",
    featureidkey="properties.CD_GEOCMU",
    color="is_high_emissions"
)
fig.update_geos(fitbounds="locations")
fig.update_traces(marker_line_width=0)
fig.update_layout(title="Categorisation of high (red) and low (blue) emission-per-kg-soy municipalities in 2022")
fig.show()

Armed with this categorisation, we can now repeat the analysis of how much a sourcing pattern overlaps one year to the next; except that we consider sourcing from only two locations: the part of Brazil which is "high-emissions" and the part which is "low-emissions".

That is, from one year to the next, what percentage of an exporter group's soy sourcing moves between the low- and high-emissions regions?

# filter the SEI-PCS results to only the years for which we have emissions data
df = pd.merge(df_sei_pcs, df_emissions["year"].drop_duplicates())
assert not df.empty

# add in the categorisation of high/low into the SEI-PCS results
df = pd.merge(
    df,
    df_emissions_categorised[["municipality_of_production_trase_id", "year", "is_high_emissions"]],
    on=["municipality_of_production_trase_id", "year"],
    how="left",
    validate="many_to_one",
    indicator=True
)
assert all(df.pop("_merge") == "both")

# now calculate the percentage sourced in each region
df = sps.consolidate(df, ["volume"], ["year", "is_high_emissions", "exporter_group"])
df["yearly_exporter_volume"] = df.groupby(["exporter_group", "year"])["volume"].transform("sum")
df["percent"] = 100 * df["volume"] / df["yearly_exporter_volume"]

# do the calculation for different lags
dfs = {
    lag: calculate_overlap(df, "is_high_emissions", lag)
    for lag in [1, 2, 3]
}

# display it in a table
data = sps.concat(df.assign(lag=n) for (n, df) in dfs.items())
series_ = data.groupby("lag")["percent_stayed"].median()
series_ = series_.rename("Median overlap (%)")
series_.index = series_.index.rename("Time lag in years")
df__ = series_.reset_index()
df__.index = [""] * len(df__)  # hide index
with sps.pandas_display_context(float_format=lambda x: f"{x:.1f}"):
    print(df__.to_string())
# # This cell is used to test the correctness of the above calculation by choosing a random row
# # from the results, and printing the source data. By manually doing the calculation on the
# # source data we can see if the results are reproduced.

# random_row = dfs[1].sample(1)
# exporter_group = random_row["exporter_group"].values[0]
# year = random_row["year"].values[0]
# lag = 1

# print("A random row from our results:")
# print(random_row.to_markdown())

# print("The source data:")
# is_previous = (df["exporter_group"] == exporter_group) & (df["year"] == year)
# is_current = (df["exporter_group"] == exporter_group) & (df["year"] == year - lag)
# pd.merge(
#     df[is_previous][["is_high_emissions", "percent"]],
#     df[is_current][["is_high_emissions", "percent"]],
#     on=["is_high_emissions"],
#     suffixes=("_previous", "_current"),
#     how="outer",
# )
data = df.pivot(columns=["is_high_emissions"], index=["exporter_group", "year"], values=["percent"]).fillna(0)
data.columns = [{('percent', False): "low_emissions", ('percent', True): "high_emissions"}[c] for c in data.columns]
data = data.reset_index()

# plot of sample
number_of_years = data.groupby("exporter_group")["year"].transform("count")
interesting = number_of_years > 1
random_groups = data[interesting]["exporter_group"].drop_duplicates().sample(10)
fig = px.line(
    pd.merge(data, random_groups),
    x="year",
    y="high_emissions",
    color="exporter_group",
    width=1000,
)
fig.update_layout(
    title="Percentage of soy sourced from high-emissions municipalities<br /><sup>Ten random exporter groups</sup>",
    yaxis_title="% soy from high-emissions municipalities",
)
fig.show()
# histogram
fig = px.histogram(
    data,
    x="high_emissions",
)
fig.update_layout(
    bargap=0.1,
    title="Histogram: percentage of supply from high-emissions municipalities",
    yaxis_title="Number of exporter groups per year",
    xaxis_title="Percentage of soy sourced from high-emissions municipalities",
)
fig.show()

Change in emissions footprint

We now ask ourselves: how much does the actual emissions footprint of an exporter change over time?

We are interested primarily in the emissions per ton of exported soy*. First, we embed the emissions data into the SEI-PCS results. Then, for each exporter group we add up the total volume of soy and the total emissions. We divide the latter by the former to give an average emissions per ton of exported soy, for each exporter group and each year.

Here is a line plot showing the data, for ten random exporter groups:

* which is also sourced from a known municipality
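
The embedding arithmetic itself is simple; here is a toy check with made-up numbers:

# toy check of the embedding arithmetic (all values made up)
municipality_production = 1_000.0  # tons of soy produced in the municipality
municipality_emissions = 5_000.0   # t CO2eq attributed to soy in the municipality
volume = 200.0                     # tons of soy an exporter sourced from it

emissions_per_production = municipality_emissions / municipality_production  # 5 t CO2eq per ton
embedded_emissions = emissions_per_production * volume
print(embedded_emissions)  # 1000.0 t CO2eq embedded in the sourced volume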

def embed_emissions(df):
    # add emissions and production to trade data
    df_embedded = pd.merge(
        df,
        df_emissions_categorised.rename(
            columns={
                "emissions": "municipality_emissions",
                "production": "municipality_production",
            },
            errors="raise",
        ),
        on=["municipality_of_production_trase_id", "year"],
        validate="many_to_one",
        how="left",
        indicator=True,
    )

    # if a value for a municipality is missing then we assume both production and emissions to be zero
    missing = df_embedded.pop("_merge") != "both"
    df_embedded["municipality_emissions"] = df_embedded["municipality_emissions"].mask(missing, 0)
    df_embedded["municipality_production"] = df_embedded["municipality_production"].mask(missing, 0)

    # convert the municipality-level emissions intensity into emissions embedded in the traded volume
    emissions_per_production = df_embedded["municipality_emissions"] / df_embedded["municipality_production"]
    volume = df_embedded["volume"]
    df_embedded["emissions"] = emissions_per_production * volume

    # now calculate year-on-year emissions footprint per exporter group
    return df_embedded

def embed_emissions_and_calculate_emissions_per_volume(df):
    # embed emissions and calculate the average emissions per volume for each exporter group & year
    df = embed_emissions(df)
    df = sps.consolidate(df, ["volume", "emissions"], ["exporter_group", "year"])
    df["emissions_per_volume"] = df["emissions"] / df["volume"]
    return df

# embed emissions into SEI-PCS results
df_sei_pcs_embedded = embed_emissions_and_calculate_emissions_per_volume(df_sei_pcs)

# show a graph of ten random groups
random_groups = df_sei_pcs_embedded["exporter_group"].drop_duplicates().sample(10)
fig = px.line(
    pd.merge(df_sei_pcs_embedded, random_groups),
    x="year",
    y="emissions_per_volume",
    color="exporter_group",
    width=1000,
)
fig.update_layout(
    title="Emissions-per-volume over time<br /><sup>Ten random exporter groups</sup>",
    yaxis_title="emissions per ton of exported soy (t CO2eq / t)",
)
fig.show()

We are interested in how much this data changes. Or rather: if I want the emissions-per-ton value for year X but I use the value from year X - 1, by how much am I wrong? We use this formula:

100 * abs(emissions_per_ton(X) - emissions_per_ton(X - 1)) / emissions_per_ton(X - 1)

You can think of this as the slope between each pair of points on the line chart above.
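
For instance, with hypothetical values, if an exporter group's emissions per ton were 2.0 in year X - 1 and 2.5 in year X:

# a minimal numeric check of the relative-error formula above (made-up values)
emissions_per_ton_previous = 2.0  # year X - 1
emissions_per_ton_current = 2.5   # year X
error = 100 * abs(emissions_per_ton_current - emissions_per_ton_previous) / emissions_per_ton_previous
print(f"{error:.0f}%")  # 25%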

If we take all of these values (one for every exporter group for every year of data) and plot them in a box plot, we get a picture of our typical relative error. The median value is quite high: almost 50%. That is to say, in the median case the emissions per ton in year X - 1 is 50% different from the emissions per ton in year X.

# now imagine that we want to predict the current emissions-per-volume, in say 2017,
# using the value from 2016. how wrong were we?
# we take the relative difference compared to the old value to get error bounds:
#
#     100 * abs(current - previous) / previous
#
# so we can say, for example, that the 2016 value is 20% wrong
# we exclude comparisons that cross the 2017/2018 methodology boundary
data = df_sei_pcs_embedded.copy()
data["previous_year"] = data["year"] - 1
bad_comparison = data["year"] == 2018
data = data[~bad_comparison]
data = pd.merge(
    data,
    data[["year", "exporter_group", "emissions_per_volume"]].rename(columns={"year": "previous_year"}),
    on=["previous_year", "exporter_group"],
    suffixes=("", "_previous"),    
)
current = data["emissions_per_volume"]
previous = data["emissions_per_volume_previous"]
data["error"] = 100 * np.abs(current - previous) / previous

px.box(
    data, 
    y="error", 
    labels={"error": "relative error (%)"},
    title="Relative error in the emissions per ton of each exporter group compared to <br />the year before<br /><sup>Y-axis is log scale. Points show outliers</sup>",
    log_y=True,
).show()


print(f"Median Error: {data['error'].median():.1f}%")

Comparison to a simpler model

We now move away from the question of how much Trase data changes over time. Instead, we look at whether Trase data performs better than a much simpler model. Our reference model works as follows:

  1. The input data is:
       • Trade data (customs declarations before mid-2018, bills of lading after)
       • SECEX trade statistics
       • Production data
  2. First, we divide each shipment among all "states of origin" in the SECEX data.
  3. Then, we divide those flows among all municipalities in the state, proportionally to the total production of each municipality.

This is a production-weighted model, but adhering to the constraints of the SECEX dataset.
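
To make the reference model concrete, here is a toy sketch of the two-step allocation (an illustration only, not the production pipeline; the column names and numbers are made up):

# toy inputs: a single 1,000 t shipment, SECEX state-of-origin shares,
# and municipal production totals
shipment_volume = 1_000.0
state_shares = pd.DataFrame({"state": ["MT", "GO"], "share": [0.6, 0.4]})
production = pd.DataFrame({
    "state": ["MT", "MT", "GO"],
    "municipality": ["BR-M1", "BR-M2", "BR-M3"],
    "production": [300.0, 100.0, 50.0],
})

# step 1: divide the shipment among the states of origin in the SECEX data
state_shares["state_volume"] = shipment_volume * state_shares["share"]

# step 2: divide each state's volume among its municipalities,
# proportionally to each municipality's production
df_alloc = state_shares.merge(production, on="state")
weight = df_alloc["production"] / df_alloc.groupby("state")["production"].transform("sum")
df_alloc["allocated_volume"] = df_alloc["state_volume"] * weight
print(df_alloc[["state", "municipality", "allocated_volume"]])
# MT: BR-M1 450 t, BR-M2 150 t; GO: BR-M3 400 t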

df_embedded_sei_pcs = embed_emissions_and_calculate_emissions_per_volume(df_sei_pcs)
df_embedded_sei_pcs_mdic = embed_emissions_and_calculate_emissions_per_volume(df_sei_pcs_mdic)

df = pd.merge(
    df_embedded_sei_pcs,
    df_embedded_sei_pcs_mdic,
    on=["exporter_group", "year"],
    validate="one_to_one",
    suffixes=("_sei_pcs", "_sei_pcs_mdic"),
)
random_groups = df["exporter_group"].drop_duplicates().sample(10)

data = sps.concat([
    df_embedded_sei_pcs[df_embedded_sei_pcs["exporter_group"].isin(random_groups)].assign(source="SEI-PCS"),
    df_embedded_sei_pcs_mdic[df_embedded_sei_pcs_mdic["exporter_group"].isin(random_groups)].assign(source="SECEX model"),    
])
px.line(
    data,
    x="year",
    y="emissions_per_volume",
    color="source",
    facet_row="exporter_group",
    height=5000,
)
df_embedded_sei_pcs
def process(df):
    df = sps.consolidate(df, ["volume", "emissions"], ["year", "exporter_group"])
    df["emissions_per_volume"] = df["emissions"] / df["volume"]
    return df

a = process(df_embedded_sei_pcs)
b = process(df_embedded_sei_pcs_mdic)

sps.dumbbell_compare(
    a,
    b,
    "emissions_per_volume",
    ["exporter_group"],
    labels=("SEI-PCS", "SEI-PCS (MDIC)")
)

# sps.compare_dataframes_single(
#     a,
#     b,
#        "emissions_per_volume",
#     ["exporter_group"], 
# )