Can time-lagged Trase data on Brazil soy be used for scope three emissions assessments?¶

Harry Biddle, 10th April 2024

This notebook explores the possibility of using Trase Brazil soy data to perform scope three emissions assessments for a company when that data is a few years older than the year in which the assessments is being done. For example, using 2022 Trase data to perform an emissions assessment in 2025.

We interpret the company to be an "exporter group" in Trase: that is, a collection of companies which act in the role of exporters/shippers of Brazil soy.

The data used in this notebook is as follows:

  • Brazil soy Trase data for 2020 and earlier is taken from the database, specifically the supply_chains.brazil_soy_v2_6_0 view
  • Brazil soy Trase data for 2021 and 2022 is taken from S3, specifically the files s3://trase-storage/brazil/soy/sei_pcs/v2.6.1/SEIPCS_BRAZIL_SOY_%YEAR%.csv.
  • Municipality-level emissions and production data is from s3://trase-storage/brazil/soy/indicators/out/q4_2023/soy_Net_and_Gross_ghg_deforestation_2013_2022.csv.

When we read the Brazil soy Trase data we exclude:

 - Domestic trade
  • The unknown exporter group
  • Exports for which the municipality of production is unknown

To convert this notebook into HTML output, use the following command:

jupyter nbconvert  --no-input --to html brazil_soy_sourcing_analysis.ipynb

Here are three random rows of the Trase Brazil soy data that we have loaded. We are interested only in exporter groups and their sourcing patterns: that is, which municipalities Trase has identified as producing the soy that gets exported, and in which volumes they are sourced. Volumes are in metric tonnes of soy bean equivalent:

year municipality_of_production_trase_id exporter_group volume
21107 2004 BR-4117255 BIANCHINI 17.567314
45765 2015 BR-4317806 ADM 21.630404
24771 2014 BR-5107925 C. VALE 5.498450

Note that we cluster the model results into exporter groups based on the name of the group (e.g. "COFCO"). This means that our grouping is very dependent on how much data cleaning has been done. Larger exporter groups will have been extensively cleaned, so that, for example, all of the exporters associated with Bunge are under the "BUNGE" exporter group. However smaller exporter groups will not have been cleaned, and will consist of only one exporter.

Here are a few random examples to illustrate this:

Exporters associated with exporter group BUNGE:
  - BUNGE FERTILIZANTES SA
  - BUNGE
  - BUNGE ALIMENTOS S/A
  - BUNGE FOOD SERVICE COMERCIO E SERVICOS LTDA
  - USINA MOEMA ACUCAR E ALCOOL LTDA
  - PEDRO AFONSO ACUCAR & BIOENERGIA
  - BUNGE IBERICA

Exporters associated with exporter group AGROMON AGRICULTURA E PECUARIA:
  - AGROMON AGRICULTURA E PECUARIA

Exporters associated with exporter group BONATO COUROS SA:
  - BONATO COUROS SA

Here are three random rows of the emission and production data. We have emissions and production for each municipality

municipality_of_production_trase_id year production emissions
48423 BR-4311601 2015 15000.0 26806.205755
2245 BR-1504752 2017 70500.0 109934.655838
54566 BR-5212204 2018 25280.0 7437.334536

Do sourcing patterns vary across years?¶

To start with, we exclude the emissions data and focus on exported volume. We would like to investigate whether the sourcing pattern of companies varies a lot between years. By sourcing pattern, we mean the distribution of volume across municipalities of production. To be able to make a better comparison across years, we normalise volumes to the company's total export in that year.

The definition of a sourcing pattern for a company X in year Y, is the percentage of soy sourced by each municipality in that year, where 100% represents the total soy export of company X in year Y from known municipalities. We are interested this because it gives us a way to compare a typical ton of a company's soy across different years. We can say that a in 2020 Bunge sourced, say, 20% from municipality A and 80% from municipality B. Yet in 2022 they might have source only 15% from municipality A.

Below, we print an example sourcing pattern of a random exporter group in a random year.

Also bear in mind the exclusion criteria at the top of this notebook: we excluded trades with unknown municipality of production. So we are looking only at the "known" sourcing pattern of each company.

volume yearly_exporter_volume sourcing_percentage
year exporter_group municipality_of_production_trase_id
2021 SHANGHAI PENGXIN GROUP CO., LTD. BR-5102637 14,736 32,360 46
BR-5107768 6,504 32,360 20
BR-5106240 5,197 32,360 16
BR-5108006 4,861 32,360 15
BR-5108907 569 32,360 2
BR-5106505 346 32,360 1

Now we ask: if we compare a company's sourcing pattern from one year to the next, how much overlap is there? To calculate this, for each exporter group we take two years' of data, and look at each municipality in turn. We have a value for the first year and a value for the second. We take the minimum of these two values. This represents the overlap of the sourcing from that municipality. When we add that up across Brazil, we have a number for the total overlap.

For example, suppose a company only sources from two municipalities, A and B:

Municipality Sourcing (Year 1) Sourcing (Year 2) Overlap
A 80% 70% 70% = min(80, 70)
B 20% 30% 20% = min(20, 30)
Total overlap - - 90% = 70 + 20

We get a single number—90%—representing the overlap in the sourcing pattern of this exporter group between year one and year two. We repeat the above procedure for every exporter group and every pair of subsequent years (2010/11, 2011/12, etc), giving us a lot of numbers. The chart below shows those numbers binned and presented in a histogram. This gives us a picture of how much companies' sourcing patterns typically overlap over time.

We are interested not only in a year's lag but also what happens when that lag increases. What if we want to do a scope 3 emissions analysis in 2025 but only have data from 2022? Therefore, the chart below also includes lags of two years (2010/12, 2011/13, ...) and three years (2010/13, 2011/14, ...).

Additionally, since there was a change in methodology for Trase's Brazil soy model between the 2017 and 2018, we exclude any comparisons that cross that time boundary.

Bear in mind that we are only looking at the change in municipalities. Some municipalities cover a large area, some a small area - they are not evenly distributed over Brazil. Additionally, we have no measure of the "distance" that soy moves: moving to a neighbouring municipality counts just as moving to a municipality on the other side of the country.

Here are the median values expressed in a table:

Time lag in years Median overlap (%)
1 82
2 77
3 75

The above results demonstrate that if you take a typical exporter group, you would expect 82% of their known sourcing pattern to be the same one year to the next. If you increase the gap to three years, then 77% of their sourcing pattern would be unchanged.

The effect of exporter group size¶

The above analysis is run for all exporter groups which appear in subsequent years: about 500 in total.

However, we hypothesise that the size of the exporter group will make a big difference to how well the sourcing pattern overlaps. The CNPJ that an exporter is listed with in the customs/shipment data is very often used to locate processing assets, and if an exporter group changes CNPJs often, the sourcing pattern will change too. This may happen more or less with bigger exporter groups. Usually, that is because we have focussed data cleaning efforts to associate CNPJs with exporter groups on the larger exporter groups.

We use two heuristics for size of an exporter group:

  1. Yearly export volume, split into quartiles, where the 1st Quartile is the smallest 25% of exporters and the 4th Quartile the largest 25%.
  2. Number of CNPJs that the exporter group has listed in customs/shipment records over the whole time series

In the graphs below we show the same data as the previous box plot, except broken down by these two heuristics. We only show a time lag of one year (i.e. comparing a sourcing pattern to that of the previous year).

Number of CNPJs in exporter group Median overlap (%)
1 100
2 42
3 42
4 47
5 45
6 45
Yearly export volume Median overlap (%)
1st Quartile 100
2nd Quartile 100
3rd Quartile 59
4th Quartile 38

This demonstrates that exporter group size is indeed very important when considering how much sourcing patterns overlap over time. For the biggest exporter groups, we only see around 40% overlap in the sourcing pattern from one year to the next.

Dividing Brazilian municipalities into "high emissions" and "low emissions"¶

Now we bring in the additional concept of emissions. When looking at the suitability for a scope three emissions analysis, it is important not only to know how much the sourcing pattern is changing, but also whether those changes are significant for the emissions footprint. It might be that the changes are on frontiers of high emissions.

Our emissions data is municipality-level. However, there are over 5,000 municipalities! To make the analysis simpler, we would like to group them into "high" and "low" emissions municipalities. This is done in each year with the following algorithm:

  1. Calculate the total production of soy in Brazil for that year
  2. If a municipality produces less than 0.001% of Brazil's total soy production, designate it "low emissions"
  3. For all remaining municipalities (which we call "soy-producing municipalities") calculate the emissions per ton of soy: the total emissions of the municipality divided by the total production of the municipality.
  4. Take the average of this value across all soy-producing municipalities. Designate a soy-producing municipality "high emissions" if the emissions per ton of soy is above the average, otherwise designate it "low emissions".

Here is a decision tree demonstrating the algorithm and how many municipalities fall into each category in 2022:

Another way to visualise the algorithm is a scatter graph with emissions on the y-axis, production on the x-axis, and municipalities as dots. High emissions municipalities are in the upper rectangle of the scatter plot, with a x-axis cutoff on the production (note, graph is in log-scale and for 2022 data):

Finally, it is interesting to look at where the high-producing municipalities are located in Brazil. They mostly reside in the Cerrado savanna in the centre of Brazil and the Pampas grasslands in the south:

Change in emissions footprint¶

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[298], line 2
      1 # add emissions and production to trade data
----> 2 df_embedded = pd.merge(
      3     df,
      4     df_emissions.rename(
      5         columns={
      6             "emissions": "municipality_emissions",
      7             "production": "municipality_production",
      8         },
      9         errors="raise",
     10     ),
     11     on=["municipality_of_production_trase_id", "year"],
     12     validate="many_to_one",
     13     how="left",
     14     indicator=True,
     15 )
     17 # if a value for a municipality is missing then we assume both production and emissions to be zero
     18 missing = df_embedded.pop("_merge") != "both"

File /opt/tljh/user/lib/python3.8/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @Substitution("\nleft : DataFrame or named Series")
     94 @Appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
--> 110     op = _MergeOperation(
    111         left,
    112         right,
    113         how=how,
    114         on=on,
    115         left_on=left_on,
    116         right_on=right_on,
    117         left_index=left_index,
    118         right_index=right_index,
    119         sort=sort,
    120         suffixes=suffixes,
    121         indicator=indicator,
    122         validate=validate,
    123     )
    124     return op.get_result(copy=copy)

File /opt/tljh/user/lib/python3.8/site-packages/pandas/core/reshape/merge.py:703, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    696 self._cross = cross_col
    698 # note this function has side effects
    699 (
    700     self.left_join_keys,
    701     self.right_join_keys,
    702     self.join_names,
--> 703 ) = self._get_merge_keys()
    705 # validate the merge keys dtypes. We may need to coerce
    706 # to avoid incompatible dtypes
    707 self._maybe_coerce_merge_keys()

File /opt/tljh/user/lib/python3.8/site-packages/pandas/core/reshape/merge.py:1179, in _MergeOperation._get_merge_keys(self)
   1175 if lk is not None:
   1176     # Then we're either Hashable or a wrong-length arraylike,
   1177     #  the latter of which will raise
   1178     lk = cast(Hashable, lk)
-> 1179     left_keys.append(left._get_label_or_level_values(lk))
   1180     join_names.append(lk)
   1181 else:
   1182     # work-around for merge_asof(left_index=True)

File /opt/tljh/user/lib/python3.8/site-packages/pandas/core/generic.py:1850, in NDFrame._get_label_or_level_values(self, key, axis)
   1844     values = (
   1845         self.axes[axis]
   1846         .get_level_values(key)  # type: ignore[assignment]
   1847         ._values
   1848     )
   1849 else:
-> 1850     raise KeyError(key)
   1852 # Check for duplicates
   1853 if values.ndim > 1:

KeyError: 'municipality_of_production_trase_id'