Skip to content

Customs 2019

View or edit on GitHub

This page is synchronized from trase/models/brazil/customs_2019/README.md. Last modified on 2025-12-13 00:30 CET by Trase Admin. Please view or edit the original file there; changes should be reflected here after a midnight build (CET time), or manually triggering it with a GitHub action (link).

Brazil MDIC Disaggregation (2019)

Problem Description

Our Brazil models require datasets of customs data (CD). Unfortunately, these datasets are not available from 2018 onwards. To get around this we attempt to make a best-effort reproduction of the customs data using two key datasets:

  • Datasets of domestic flows from the Ministry of Development, Industry and Foreign Trade (MDIC)
  • Privately-signed shipping contracts called Bills of Lading (BoL)

These two datasets report quite different data! The MDIC dataset is a summarised version of the customs declaration whereas the BoL reports on maritime trade.

The BoL is an interesting dataset. Traders don't want to expose information about their activities. However, they do sign contracts which are important for liability; i.e. insurance against goods breaking or becoming lost during shipping.

Aggregation of MDIC

Unfortunately, MDIC isn't available to us in its entirety. Instead, it is only available as three separate aggregations of the original data:

  • MDIC (Port): aggregated by the port of export and the state of production.
  • MDIC (Municipal): aggregated by commodity (HS4) code and the municipality of taxation.
  • SECEX: aggregated by company tax code (CNPJ), municipality of taxation and an economic activity code called the National Classification of Economic Activities (CNAE).

The below diagram illustrates this aggregation:

Graph of MDIC aggregation

Overview of Matching Problem

Our goal, then, is to combine the information in the MDIC, BoL, and SECEX dataset to produce a "CD-like" dataset for each commodity. The below diagram illustrates the matching problem.

Graph of MDIC-BoL matching problem

The dotted lines indicate fields that are related, but not necessarily identical:

  • Exporter: the BoL record contains the CNPJ of an exporting company. The MDIC record also contains the CNPJ of an exporting company, available in an aggregated form in the SECEX dataset. However, these companies may not be the same! The CNPJ in the MDIC data is a legal requirement: it will be for the legal entity where _the cargo was last transformed in a sigificant way (often this is a logistics hub of some kind, but it can also be an office, some banks even export soy!). In contrast, the BoL is simply a private contract in which traders use a CNPJ that is convenient to them: this can be very different from what you see in MDIC.
  • Country: these are quite different between MDIC and BoL. The latter contains maritime locations, while MDIC, being derived from customs, takes into account the actual importing country.
  • Month: the months can be offset between MDIC and BoL, and as such are not reliable to compare against. BoL is reported just before the trade happens whereas MDIC is ex post.
  • Volume: BoL usually reports volumes calculated off standard shipping container sizes using lookup tables. The volumes MDIC can be different to this, perhaps using different tables.
  • Port: for the most part the port is the same, but there are two reasons for differences: Firstly, MDIC, being derived from customs, can reflect "changes in the plan" made after the shipment left the source port. In contrast, BoL represents the "initially intended route" for the shipment; a contract before the trade happened. Secondly, MDIC covers more than just maritime trade, so includes a wider range of locations than the maritime ports you find in BoL. Finally, the names often are just slightly different. We have a mapping of port names for this purpose.

Commodity-specific points

There are a few points specific to certain commodities:

  • Resolution of commodity codes: HS4 is sufficient for Soy, but we need HS6 for Chicken due to offal and canned meat.
  • We have some extra constraints for slaughtered commodities such as Chicken. In particular, (a) datasets of slaughterhouses and (b) a "halal constraint" for certain countries which only receive meat from halal-approved slaughterhouses.

Approach

We start off with the BoL as our base data, loaded as flows. The MDIC (Municipal) will be used to comfirm the municipality of taxation against the declaring CNPJ in the BoL.

To do this we slice the BoL data into groups of:

  • Municipality of taxation according to the exporter CNPJ in SECEX,
  • Country of import, and
  • HS4 code

For each slice we get the equivalent records from MDIC (Municipal). If there is more volume in MDIC than there is in the BoL, then we:

  1. Adjust the FOB of the BoL records using the average price in MDIC.
  2. Adjust the volume of each row of MDIC (Municipal) downwards by a proportion so that the total volume now matches BoL.

We consider these flows to be solved.

The remaining flows, i.e. where there was less (or no) volume in MDIC, go to a linear programming step. The linear programming step is a simple source/sink/cost (transportation) problem:

  • The sinks are groups of exporter/HS4/country, which demand their total volume in the BoL.
  • The sources are groups of municipality of taxation/HS4. They supply the total volume in the MDIC (Municipal). However, we adjust this supply upwards by 10% across the whole country to ensure there is enough supply.
  • The costs between each source and sink are constructed using the following algorithm:
    1. If there are any records at all in MDIC (Munipical) which indicate an export to the country of the sink for the given HS4 code then the cost is 100 units. Otherwise, it is 1000 units.
    2. If SECEX has any records for the exporter of the sink that are in the municipality of the source, discount the cost by 90%.

Once we have solved the linear programming problem, we use it to split the records of the BoL. To do this we slice the BoL by exporter/HS4/country, and for each slice we iterate over the municipalities of taxation and their volumes that the LP provided as a solution. We split each row proportionally among these municipalities based on a "ratio" that is derived from the MDIC (Port) via an agorithm I haven't yet fully understood.

Should the output file be more like BoL or more like MDIC?

In the above appraoch we have decided to produce output data that "looks" more like BoL than MDIC. The motivation for this was the notion that Trase should be more "sensitive" to a company's description of their trade (their market share, ports, etc.) than the origin of the commodity. This is because Trase typically targets a per-company view of international trade.

If we were to switch the problem and use MDIC as our "base" we would have to solve, for example, for N different companies out of each location. In other words, using the MDIC as a base would have that cost that we might mess up the companies, weighed against the advantages that (a) we get the countries correct, and (b) we would match UN statistics exactly, as these statistics are derived from customs data.

The above approach is a sort of hybrid of the both, in which we take the company-volume-port from BoL and look for the municipality of origin based on all the potentials CNPJs per municipality of taxation from the customs data, based on the percentage that a given company in the past was sourcing from each municipality.

Outstanding Questions

  • Can we not use the information in MDIC (Port) to our advantage?
  • Is LP a good choice given that the costs are just one of three values (10, 100, or 1000)?
  • Why do we not consider month resolution to improve the accuracy of the model?
  • How is this ratio calculated? The code looked a bit complicated, I saw mention of a median in there.

Future Ideas

  • Use previous years' CDs as auxiliary datasets to add additional constraints.