Harry Biddle, 10th April 2024
This notebook explores the possibility of using Trase Brazil soy data to perform scope three emissions assessments for a company when that data is a few years older than the year in which the assessments is being done. For example, using 2022 Trase data to perform an emissions assessment in 2025.
We interpret the company to be an "exporter group" in Trase: that is, a collection of companies which act in the role of exporters/shippers of Brazil soy.
The data used in this notebook is as follows:
supply_chains.brazil_soy_v2_6_0 views3://trase-storage/brazil/soy/sei_pcs/v2.6.1/SEIPCS_BRAZIL_SOY_%YEAR%.csv.s3://trase-storage/brazil/soy/indicators/out/q4_2023/soy_Net_and_Gross_ghg_deforestation_2013_2022.csv.When we read the Brazil soy Trase data we exclude:
- Domestic trade
To convert this notebook into HTML output, use the following command:
jupyter nbconvert --no-input --to html brazil_soy_sourcing_analysis.ipynb
Here are three random rows of the Trase Brazil soy data that we have loaded. We are interested only in exporter groups and their sourcing patterns: that is, which municipalities Trase has identified as producing the soy that gets exported, and in which volumes they are sourced. Volumes are in metric tonnes of soy bean equivalent:
| year | municipality_of_production_trase_id | exporter_group | volume | |
|---|---|---|---|---|
| 21107 | 2004 | BR-4117255 | BIANCHINI | 17.567314 |
| 45765 | 2015 | BR-4317806 | ADM | 21.630404 |
| 24771 | 2014 | BR-5107925 | C. VALE | 5.498450 |
Note that we cluster the model results into exporter groups based on the name of the group (e.g. "COFCO"). This means that our grouping is very dependent on how much data cleaning has been done. Larger exporter groups will have been extensively cleaned, so that, for example, all of the exporters associated with Bunge are under the "BUNGE" exporter group. However smaller exporter groups will not have been cleaned, and will consist of only one exporter.
Here are a few random examples to illustrate this:
Exporters associated with exporter group BUNGE: - BUNGE FERTILIZANTES SA - BUNGE - BUNGE ALIMENTOS S/A - BUNGE FOOD SERVICE COMERCIO E SERVICOS LTDA - USINA MOEMA ACUCAR E ALCOOL LTDA - PEDRO AFONSO ACUCAR & BIOENERGIA - BUNGE IBERICA Exporters associated with exporter group AGROMON AGRICULTURA E PECUARIA: - AGROMON AGRICULTURA E PECUARIA Exporters associated with exporter group BONATO COUROS SA: - BONATO COUROS SA
Here are three random rows of the emission and production data. We have emissions and production for each municipality
| municipality_of_production_trase_id | year | production | emissions | |
|---|---|---|---|---|
| 48423 | BR-4311601 | 2015 | 15000.0 | 26806.205755 |
| 2245 | BR-1504752 | 2017 | 70500.0 | 109934.655838 |
| 54566 | BR-5212204 | 2018 | 25280.0 | 7437.334536 |
To start with, we exclude the emissions data and focus on exported volume. We would like to investigate whether the sourcing pattern of companies varies a lot between years. By sourcing pattern, we mean the distribution of volume across municipalities of production. To be able to make a better comparison across years, we normalise volumes to the company's total export in that year.
The definition of a sourcing pattern for a company X in year Y, is the percentage of soy sourced by each municipality in that year, where 100% represents the total soy export of company X in year Y from known municipalities. We are interested this because it gives us a way to compare a typical ton of a company's soy across different years. We can say that a in 2020 Bunge sourced, say, 20% from municipality A and 80% from municipality B. Yet in 2022 they might have source only 15% from municipality A.
Below, we print an example sourcing pattern of a random exporter group in a random year.
Also bear in mind the exclusion criteria at the top of this notebook: we excluded trades with unknown municipality of production. So we are looking only at the "known" sourcing pattern of each company.
| volume | yearly_exporter_volume | sourcing_percentage | |||
|---|---|---|---|---|---|
| year | exporter_group | municipality_of_production_trase_id | |||
| 2021 | SHANGHAI PENGXIN GROUP CO., LTD. | BR-5102637 | 14,736 | 32,360 | 46 |
| BR-5107768 | 6,504 | 32,360 | 20 | ||
| BR-5106240 | 5,197 | 32,360 | 16 | ||
| BR-5108006 | 4,861 | 32,360 | 15 | ||
| BR-5108907 | 569 | 32,360 | 2 | ||
| BR-5106505 | 346 | 32,360 | 1 |
Now we ask: if we compare a company's sourcing pattern from one year to the next, how much overlap is there? To calculate this, for each exporter group we take two years' of data, and look at each municipality in turn. We have a value for the first year and a value for the second. We take the minimum of these two values. This represents the overlap of the sourcing from that municipality. When we add that up across Brazil, we have a number for the total overlap.
For example, suppose a company only sources from two municipalities, A and B:
| Municipality | Sourcing (Year 1) | Sourcing (Year 2) | Overlap |
|---|---|---|---|
| A | 80% | 70% | 70% = min(80, 70) |
| B | 20% | 30% | 20% = min(20, 30) |
| Total overlap | - | - | 90% = 70 + 20 |
We get a single number—90%—representing the overlap in the sourcing pattern of this exporter group between year one and year two. We repeat the above procedure for every exporter group and every pair of subsequent years (2010/11, 2011/12, etc), giving us a lot of numbers. The chart below shows those numbers binned and presented in a histogram. This gives us a picture of how much companies' sourcing patterns typically overlap over time.
We are interested not only in a year's lag but also what happens when that lag increases. What if we want to do a scope 3 emissions analysis in 2025 but only have data from 2022? Therefore, the chart below also includes lags of two years (2010/12, 2011/13, ...) and three years (2010/13, 2011/14, ...).
Additionally, since there was a change in methodology for Trase's Brazil soy model between the 2017 and 2018, we exclude any comparisons that cross that time boundary.
Bear in mind that we are only looking at the change in municipalities. Some municipalities cover a large area, some a small area - they are not evenly distributed over Brazil. Additionally, we have no measure of the "distance" that soy moves: moving to a neighbouring municipality counts just as moving to a municipality on the other side of the country.
Here are the median values expressed in a table:
| Time lag in years | Median overlap (%) | |
|---|---|---|
| 1 | 82 | |
| 2 | 77 | |
| 3 | 75 |
The above results demonstrate that if you take a typical exporter group, you would expect 82% of their known sourcing pattern to be the same one year to the next. If you increase the gap to three years, then 77% of their sourcing pattern would be unchanged.
The above analysis is run for all exporter groups which appear in subsequent years: about 500 in total.
However, we hypothesise that the size of the exporter group will make a big difference to how well the sourcing pattern overlaps. The CNPJ that an exporter is listed with in the customs/shipment data is very often used to locate processing assets, and if an exporter group changes CNPJs often, the sourcing pattern will change too. This may happen more or less with bigger exporter groups. Usually, that is because we have focussed data cleaning efforts to associate CNPJs with exporter groups on the larger exporter groups.
We use two heuristics for size of an exporter group:
In the graphs below we show the same data as the previous box plot, except broken down by these two heuristics. We only show a time lag of one year (i.e. comparing a sourcing pattern to that of the previous year).
| Number of CNPJs in exporter group | Median overlap (%) | |
|---|---|---|
| 1 | 100 | |
| 2 | 42 | |
| 3 | 42 | |
| 4 | 47 | |
| 5 | 45 | |
| 6 | 45 |
| Yearly export volume | Median overlap (%) | |
|---|---|---|
| 1st Quartile | 100 | |
| 2nd Quartile | 100 | |
| 3rd Quartile | 59 | |
| 4th Quartile | 38 |
This demonstrates that exporter group size is indeed very important when considering how much sourcing patterns overlap over time. For the biggest exporter groups, we only see around 40% overlap in the sourcing pattern from one year to the next.
Now we bring in the additional concept of emissions. When looking at the suitability for a scope three emissions analysis, it is important not only to know how much the sourcing pattern is changing, but also whether those changes are significant for the emissions footprint. It might be that the changes are on frontiers of high emissions.
Our emissions data is municipality-level. However, there are over 5,000 municipalities! To make the analysis simpler, we would like to group them into "high" and "low" emissions municipalities. This is done in each year with the following algorithm:
Here is a decision tree demonstrating the algorithm and how many municipalities fall into each category in 2022:
Another way to visualise the algorithm is a scatter graph with emissions on the y-axis, production on the x-axis, and municipalities as dots. High emissions municipalities are in the upper rectangle of the scatter plot, with a x-axis cutoff on the production (note, graph is in log-scale and for 2022 data):
Finally, it is interesting to look at where the high-producing municipalities are located in Brazil. They mostly reside in the Cerrado savanna in the centre of Brazil and the Pampas grasslands in the south:
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[298], line 2 1 # add emissions and production to trade data ----> 2 df_embedded = pd.merge( 3 df, 4 df_emissions.rename( 5 columns={ 6 "emissions": "municipality_emissions", 7 "production": "municipality_production", 8 }, 9 errors="raise", 10 ), 11 on=["municipality_of_production_trase_id", "year"], 12 validate="many_to_one", 13 how="left", 14 indicator=True, 15 ) 17 # if a value for a municipality is missing then we assume both production and emissions to be zero 18 missing = df_embedded.pop("_merge") != "both" File /opt/tljh/user/lib/python3.8/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate) 93 @Substitution("\nleft : DataFrame or named Series") 94 @Appender(_merge_doc, indents=0) 95 def merge( (...) 108 validate: str | None = None, 109 ) -> DataFrame: --> 110 op = _MergeOperation( 111 left, 112 right, 113 how=how, 114 on=on, 115 left_on=left_on, 116 right_on=right_on, 117 left_index=left_index, 118 right_index=right_index, 119 sort=sort, 120 suffixes=suffixes, 121 indicator=indicator, 122 validate=validate, 123 ) 124 return op.get_result(copy=copy) File /opt/tljh/user/lib/python3.8/site-packages/pandas/core/reshape/merge.py:703, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate) 696 self._cross = cross_col 698 # note this function has side effects 699 ( 700 self.left_join_keys, 701 self.right_join_keys, 702 self.join_names, --> 703 ) = self._get_merge_keys() 705 # validate the merge keys dtypes. We may need to coerce 706 # to avoid incompatible dtypes 707 self._maybe_coerce_merge_keys() File /opt/tljh/user/lib/python3.8/site-packages/pandas/core/reshape/merge.py:1179, in _MergeOperation._get_merge_keys(self) 1175 if lk is not None: 1176 # Then we're either Hashable or a wrong-length arraylike, 1177 # the latter of which will raise 1178 lk = cast(Hashable, lk) -> 1179 left_keys.append(left._get_label_or_level_values(lk)) 1180 join_names.append(lk) 1181 else: 1182 # work-around for merge_asof(left_index=True) File /opt/tljh/user/lib/python3.8/site-packages/pandas/core/generic.py:1850, in NDFrame._get_label_or_level_values(self, key, axis) 1844 values = ( 1845 self.axes[axis] 1846 .get_level_values(key) # type: ignore[assignment] 1847 ._values 1848 ) 1849 else: -> 1850 raise KeyError(key) 1852 # Check for duplicates 1853 if values.ndim > 1: KeyError: 'municipality_of_production_trase_id'