5/14/2021

We’ve known there are issues with the CD data for a while

An excerpt from a very old QA document

The issue

  • Argentina customs declarations information does not align with data from COMTRADE.

  • Each commodity appears to have unique issues.

How we solved this issue in the past

  • Vivian == 🧙‍️
  • Neat inferential techniques
  • Identifying reliable data points and using averages (e.g. FOB/ton) to infer missing data

BUT THEN

dramatic music

lightning crash

…I started looking at shipments by date, rather than aggregated to the year level

the smoking gun

Nearly 6 months of missing data

  • First, export country is always listed as unknown for a few months… and then there is no trade volume data at all for a few months

  • All shipments from 2017/10/31 - 2018/04/03 have a listed country of origin as “DESCONOCIDO” (unknown)

  • We had been filtering out all this information, and didn’t realize we were missing anything 😬

Why is this the case?

  • New data privacy law meant records were obscured upstream of our data provider
  • After searching around, we decided to use AFIP data, which is much more crude, to fill in missing export records

More info at Pull request 2225

So, we solved the issue! Right?

WRONG

WRONG

WRONG

WRONG

WRONG

Trade volumes do not align

Even after accounting for missing months of data, and replacing them with other (more crude) data, volumes do not match up.

There are large discrepancies year over year and month over month

CD trade volumes != AFIP trade volumes

Large discrepancies persist

What next?

  • Despite progress… no resolution
  • Time to go into the belly of the beast

A decision tree style solution

  1. Open up CD data
  2. Find rows that had obvious discrepancies
  3. Code up a method to filter for and then fix those rows
  4. Move “solved” rows to their own object

And bit by bit…

B1 Both measures as KG, and values are within 1% of each other

We solved larger and larger issues…

B2 one measure is Tons, the other is KG, and they match after conversion (within 1%)

B3 No information in VOLUME_RAW, but there is a value in Total Quantity 1, and we can confirm via FOB per Unit and FOB

B7 No second value is listed, but a value and measure are listed for statistical unit and weight, and the declaration number only appears once

Part C: Adjusting where multiple rows include a summed volume

There are many cases where the listed VOLUME_RAW/Cantidad Estadistica lists not the correct volume for a single flow, but for a group of records.

Until nearly everything was matched

WOOHOO!

Summary:

  • Identified unknown/missing data
  • Found new data source (AFIP), incorporated that crude data when available
  • Went through CD bit by bit and resolved missing or grouped values
  • Mike is a saint
  • Andrew is a grump

Appendix (for records)

Soy:

For all of 2018, the CD has .93x the volume of Comtrade (30.5 million tons vs 32.7 million). For the month of April, AFIP has 1.8 million tons more trade volume than the CD. Suggested course of action: Use CD data and backfill with AFIP.

Corn:

For all 2018, the CD has .8x the volume of Comtrade (18.6 million tons vs 23.2 million) For the month of April, AFIP has 2.1 million tons more trade volume than the CD. Suggested course of action: Use CD data and backfill with AFIP.

Beef:

For all of 2018, CD has 1.7x the volume of Comtrade (887,000 tons vs 521,000). For the month of April, AFIP has 19,000 fewer tons than the CD. Using AFIP data instead of CD data for April would have a small effect on aligning our values with Comtrade. Suggested course of action: Use CD, ignore AFIP

Shrimp:

For all 2018, the CD has 6.2x the volume of Comtrade (1.1 million tons vs 184,000). For the month of April, AFIP has 4,500 fewer tons than Comtrade. Using AFIP data instead of CD data for April would have a negligible effect towards aligning our values with Comtrade. Suggested course of action: Used CD, ignore AFIP

Cotton: No AFIP data. Have to use CD.

Wood pulp: No AFIP data. Have to use CD.