

Preprocessing

Data Specification

This document details some conventions about Trase's data systems, such as how missing values are represented, string normalization, column names, and so on.

Outliers in Trade Data

We very frequently come up against outliers in customs data: impossibly expensive shipments, or just really odd ones. This page is an attempt to capture learnings from many hours of tackling this.

  • Very often the FOB value is a more reliable field than the volume: it can be possible to impute the volume from the FOB value and a $/kg price.
  • If volumes are small, then outliers are entirely possible (a batch of genetically modified soybeans, a high-quality variety, a specific demand for which logistics costs surpass the value of the small cargo, etc.). For containerized goods it is also important to stress that if the data is not disentangled by the customs office (e.g. it is a BoL, or the customs office is mirroring the shipping manifests) then only the main HS code inside the container will be recorded, not a list of all goods inside the container. Other goods can therefore vastly modify the total weight or total value, depending on the combination of goods in the container.
  • Always look for potential issues with different units (pounds vs kilos, tonnes, cubic meters, etc.).
  • We always try to use net weight, but on some occasions the data can be in gross weight (i.e. including the weight of the bottles, the packaging, etc.).
  • Remember that trade volumes are not weighed; they are estimated from volumetric quantities (e.g. a 20-foot container, or TEU; a large container ship carries thousands of such containers).
  • Outliers can be correct. Sometimes a kilo of soy is genetically engineered and more expensive, or there is a missing payment or a compensation that is adjusted on a later shipment. The important thing is to look at the price per kilo for large shipments, not for cases of a few kilos (often transported by air); see the sketch after this list.
  • On duplicates: remember that these are often standardized shipments, so duplicates can be correct. If you own 20 containers of the same product, you get 20 identical records.
  • Unless you see something flagrantly wrong and explainable (they meant kilos but it is clearly tonnes, they used a non-metric unit, etc.), just accept the data.
  • It is not always clear whether the price provided is FOB or CIF.
  • SOY RESIDUES can cause confusion: sometimes they are present when not expected, and sometimes we do want them included.
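
As a concrete illustration of the price-per-kilo check mentioned above, here is a minimal pandas sketch. The file name, column names (fob_usd, volume_kg) and thresholds are hypothetical; adapt them to the dataset at hand.

import pandas as pd

# Hypothetical file and column names; adjust to the dataset being checked.
df = pd.read_csv("shipments.csv", sep=";", dtype=str, keep_default_na=False)
df = df.astype({"fob_usd": float, "volume_kg": float})

df["usd_per_kg"] = df["fob_usd"] / df["volume_kg"]

# Only flag large shipments: tiny (often air-freighted) consignments can
# legitimately have extreme unit prices.
large = df["volume_kg"] > 10_000
typical_price = df.loc[large, "usd_per_kg"].median()
df["suspect"] = large & (
    (df["usd_per_kg"] > 10 * typical_price)
    | (df["usd_per_kg"] < typical_price / 10)
)

print(df.loc[df["suspect"], ["fob_usd", "volume_kg", "usd_per_kg"]])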

Processing

Bills of Lading

A bill of lading is a crucial commercial document used in the international shipping industry. It serves three primary functions: it is a receipt issued by the carrier acknowledging that they have received the cargo; it is a contract of carriage that outlines the terms under which the goods are transported; and it is a document of title, meaning it can be used to claim ownership of the cargo or transfer it to another party. This multifunctionality makes it a legally binding instrument in international trade, and it has been a cornerstone of maritime logistics for centuries.

Importantly, every bill of lading must list a consignee, the person or entity meant to receive the cargo. Even if the final destination or importer is unclear at the time of shipment—such as in some speculative or opportunistic trades—the document must still designate someone as the responsible receiver. This ensures legal clarity and prevents disputes over who is entitled to collect the goods. In many cases, shipping operations cannot proceed—such as unloading cargo—unless the consignee can present an endorsed bill of lading, highlighting its critical role in ensuring the secure transfer of goods across borders.

Bills of lading are private documents. They are not supposed to be shared externally, and when they are available through third-party data vendors it is possible that they were obtained illegally; this is different from customs documents.

However, similar information could potentially be acquired by using AI web scrapers to monitor imports reported in sources such as Reuters, and by auto-downloading daily open-access reports from specific ports. This is the approach taken by Alejandro Soumah at the University of California San Diego; see his Medium article here: https://medium.com/@alejandrosoumah_6185/holistic-data-integration-for-the-ivorian-cocoa-industry-using-automatic-llm-agents-and-kalman-c56b39ede36e

Often a company might try to fill in the bill of lading in such a way that satisfies the parties involved in the shipment yet obfuscates their supply chain.

Here are some common terms that you might find in source data:

| Trase | Column Name | Notes |
| --- | --- | --- |
| Importer | Company Consignee | The person to whom the carrier is supposed to deliver the goods. In most cases the consignee is the buyer of the goods, but not always: it could also be the agent nominated by the buyer, or the buyer's bank. |
| Exporter | Company Shipper | The person who is responsible for packing and preparing a shipment to turn over to the carrier for transport. For example, this might be a supplier sending your company parts for use in your manufacturing processes. |
| | Carrier | The party that actually moves the cargo. |
| | Company Notify | The party that the carrier is supposed to notify about the vessel's arrival ETA. The notify party is then responsible for arranging the arrival formalities of the vessel. The notify party could be an agent, the receiver of the cargo, or any other person or entity with an interest in the arrival of the cargo, such as third parties that need to be made aware of shipment updates, progress and delivery. |
| | Place_and_Ports/POR | "Place of receipt": the location where the cargo will be received from the shipper. |
| Port of export* | Place_and_Ports/POL* | "Port of loading": a port or place where goods are loaded aboard a ship, secured and ready to be transported. The POL can also be called the port of exit. |
| | Place_and_Ports/POD | "Port of discharge" (also known as the "port of unloading"): the port where the cargo arrives and where a vessel discharges or unloads some or all of its shipments, which are thereafter dispatched to their various consignees. If the port of delivery has been mentioned in the bill of lading, the responsibility to deliver the goods at the place mentioned in the bill of lading rests with the carrier. The port of discharge can be a destination sea port while the place of delivery is an inland location away from the port of discharge; if the carrier accepted goods from the shipper to deliver at a place other than the port of discharge, the cost of delivering the cargo from the port of discharge to that place has to be met by the carrier. |
| | Place_and_Ports/DEST | The final destination that the shipper has agreed to deliver the goods to. |
| | Place_and_Ports/POMD | "Port of maritime destination" |
| | Place_and_Ports/POMO* | "Port of maritime origin" |
| Volume | WTKG | Trade volume (actually a weight), usually in kilograms or tonnes. |

* Should we use Port of Loading (POL) or Port of Maritime Origin (POMO) as what we call the "Port of Export"? Practically speaking these are almost always the same: for example, when we analysed the bills of lading for Brazil in 2020 we found that they were identical for 99.6% of all volume. We therefore consider the choice a very minor issue. However, we generally recommend using Port of Loading (POL).
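
To make the mapping concrete, here is a minimal pandas sketch that renames bill-of-lading source columns to the Trase names in the table above. Source field names vary between data vendors, so treat the mapping (and the file name) as illustrative rather than definitive.

import pandas as pd

# Illustrative mapping from BoL source fields to Trase column names;
# the exact source field names differ between vendors.
BOL_TO_TRASE = {
    "Company Consignee": "Importer",
    "Company Shipper": "Exporter",
    "Place_and_Ports/POL": "Port of export",
    "WTKG": "Volume",
}

df = pd.read_csv("bol_sample.csv", sep=";", dtype=str, keep_default_na=False)
df = df.rename(columns=BOL_TO_TRASE)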

Here are a few abbreviations that you might see in trade data:

  • P/C: in French this can be "pour le compte de" ("for the account of"). If a company is marked as P/C then they may be receiving the shipment for another company (the real buyer).

Customs Declarations

A customs declaration is an official document submitted to a country's customs authority, detailing goods that are being imported or exported. This document includes information such as the nature, quantity, value, and origin or destination of the cargo. The declaration enables governments to monitor trade flows, enforce regulations, and collect taxes and duties. Submitting accurate customs declarations is a legal requirement for cross-border trade, and failure to do so can lead to fines, confiscations, or delays.

Customs declarations play a vital role in national security, public health, and economic policy. They allow customs authorities to detect illegal goods (e.g., drugs, weapons), prevent under-invoicing or mislabeling for tax evasion, and track commodities subject to trade restrictions or sanctions. Moreover, trade statistics that inform national policy and international agreements are typically derived from customs data. While exporters may sometimes be lax in reporting, importing countries are often more rigorous, as they rely on declarations to collect revenue and enforce import controls.

Processing

If you are processing a dataset independently for Trase, please follow these guidelines:

  1. Organize work on a given dataset within a single folder. Within the root, create another folder called 'ORIGINALS' which contains an original copy of the data before any modification. This is kept as a reference copy and should not be altered.

  2. The finished product should be stored in the root folder, following standard naming conventions.

  3. Document your process meticulously, preferably with a well-commented script. If you use Excel or a GIS program, be sure to write out your steps in detail in a README.txt file so that the process is reproducible and transparent.

If you use a script, make all paths relative to the root directory. No absolutes here. For example if using R...

#setwd('my/local/path/to/data/root/')
infile <- 'ORIGINALS/originalData.csv'
outfile <- 'descriptively_named_data_dont_worry_about_length_use_underscores_never_spaces.csv'
# code for transforming original file(s) to processed file

(4.) IS TO BE REVISED:

  1. Add any and all metadata to a README.txt file. Key elements include source, website, citation information and temporal range.

Example structure

someDataSet/
├── out
│   └── processedFile.csv
├── raw
│   ├── rawCSV.csv
│   └── rawShapefile.shp
├── README.txt
└── script.sh


Processing with metadata access: TO BE UPDATED

The centralized metadata file for Trase ETL can simplify the task of scripting by removing the need for hard-coded file paths and other parameters. The script can simply query the metadata file to find out the proper location of inputs and outputs. This adds a layer of independence between scripting and file-storage, ensuring that the script will still 'work' even if files move around or other parameters change.
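
As a purely hypothetical sketch of what this could look like (the metadata file name, layout and keys below are assumptions, not the actual Trase ETL interface):

# Hypothetical sketch only: the metadata file layout and keys are assumptions.
import yaml  # PyYAML

with open("metadata.yml") as f:
    metadata = yaml.safe_load(f)

# Look up input and output locations by dataset title instead of hard-coding paths.
entry = metadata["bra_env_fire_inpe"]
input_key = entry["input"]    # e.g. an S3 key under trase-storage
output_key = entry["output"]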

CSV formatting

  • Please use semicolon separators (note the change from comma separators in January 2018!)

TO BE REVISED:

  • If data is multi-temporal, please make sure there is a column called 'year'. Please do not include different years as columns.

For example please reformat this:

| geoid | 2001 | 2002 | 2003 |
| --- | --- | --- | --- |
| COL2341 | 10 | 1 | 1 |
| COL2342 | 1 | 5 | 6 |

to this:

| geoid | year | variable | value |
| --- | --- | --- | --- |
| COL2341 | 2001 | yield | 10 |
| COL2341 | 2002 | yield | 1 |
| COL2341 | 2003 | yield | 1 |
| COL2342 | 2001 | yield | 1 |
| COL2342 | 2002 | yield | 5 |
| COL2342 | 2003 | yield | 6 |
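
A minimal pandas sketch of this wide-to-long reshape (the file name and the 'yield' variable name are placeholders):

import pandas as pd

# Placeholder file name; wide format with one column per year.
wide = pd.read_csv("yield_by_year.csv", sep=";", dtype=str, keep_default_na=False)

long = wide.melt(id_vars=["geoid"], var_name="year", value_name="value")
long["variable"] = "yield"
long = long[["geoid", "year", "variable", "value"]]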

Geospatial Formatting

Please use WGS 84 projection if possible.

The proj4 string for WGS84 lat/lon is: +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0

Alternatively, some processes require equal-area projections. For Brazil we have used SAD69 / Brazil Polyconic (EPSG:29101): +proj=poly +lat_0=0 +lon_0=-54 +x_0=5000000 +y_0=10000000 +ellps=aust_SA +units=m +no_defs
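
For example, with geopandas a layer can be reprojected to WGS 84 or to the Brazil Polyconic projection; the file name here is a placeholder:

import geopandas as gpd

gdf = gpd.read_file("municipalities.shp")  # placeholder file name

gdf_wgs84 = gdf.to_crs(epsg=4326)       # WGS 84 lat/lon
gdf_polyconic = gdf.to_crs(epsg=29101)  # SAD69 / Brazil Polyconic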


Naming conventions

  • No spaces, please (bad for unix and mac systems)
  • scripts are typically named based on the categories they fall under, with increasing specificity, and ending with the specific datasource. For example, a dataset on Brazilian peanut production statistics from the IBGE would be called brazil_production_peanuts_ibge.R

Zero Deforestation Commitments

Each flow, or row of trade data, may have one value representing a zero deforestation commitment. This can take the following values:

  • A single commitment, e.g. TAC
  • Multiple commitments separated by &, e.g. COMPANY COMMITMENT & TAC
  • The value NONE, which means that it has been confirmed that the flow has no commitment.
  • The value UNKNOWN, which means that it is not known whether the flow has a commitment or not.

No other values (empty strings, NA, etc.) are permitted.
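
A minimal sketch of validating these values in a trade-flow table (the file name and the zero_deforestation_commitment column name are placeholders):

import pandas as pd

def is_valid_commitment(value: str) -> bool:
    # Either NONE/UNKNOWN, or one or more commitment names joined by " & ".
    # Empty strings and NA-like values are not allowed.
    if value in ("NONE", "UNKNOWN"):
        return True
    parts = [p.strip() for p in value.split("&")]
    return all(parts)

df = pd.read_csv("flows.csv", sep=";", dtype=str, keep_default_na=False)
bad = df[~df["zero_deforestation_commitment"].map(is_valid_commitment)]
assert bad.empty, f"{len(bad)} flows have invalid commitment values"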

Spatially-situated ZDCs

We produce a dataset in which we aggregate the percentage of exported goods that are covered by ZDC commitments, for each producing jurisdiction. If a jurisdiction does not export any goods then it will be absent from this dataset.
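
As an illustration only (the column names and the exact definition of "covered" are assumptions, not the production method), the aggregation could look roughly like this:

import pandas as pd

# Assumed columns: jurisdiction, volume, zero_deforestation_commitment.
df = pd.read_csv("flows.csv", sep=";", dtype=str, keep_default_na=False)
df = df.astype({"volume": float})

# Treat a flow as covered if it carries at least one named commitment.
df["covered"] = ~df["zero_deforestation_commitment"].isin(["NONE", "UNKNOWN"])

coverage = (
    df.assign(covered_volume=df["volume"] * df["covered"])
      .groupby("jurisdiction", as_index=False)[["volume", "covered_volume"]]
      .sum()
)
coverage["pct_volume_covered"] = 100 * coverage["covered_volume"] / coverage["volume"]
# Jurisdictions that export nothing never appear in 'coverage', as described above.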

Forest 500

The Forest 500 commodity score (0 to 5) at Trase is derived from the Forest 500 commodity score percentage and grouped as:
0% = 0/5
0.1-19.9% = 1/5
20-39.9% = 2/5
40-59.9% = 3/5
60-79.9% = 4/5
80-100% = 5/5

f500_commodity <- f500_commodity %>% mutate(
    commodity_score_perc = (commodity_score / commodity_max_pts) * 100,
    commodity_score_0to5 = case_when(
        commodity_score_perc == 0 ~ 0,
        commodity_score_perc == 100 ~ 5,
        .default = 1 + floor(commodity_score_perc / 20)
    )
)
Companies not assessed should be assigned 'NOT ASSESSED' during embedding.

String Normalisation

In the database, strings are always stored:

  • In UPPER CASE
  • Without leading or trailing whitespace
  • Without repeated whitespace (two or more spaces are collapsed into one, so HELLO  WORLD becomes HELLO WORLD)
  • As unicode characters only, with control characters etc. removed
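
A minimal Python sketch applying these rules (the helper name is arbitrary, and the exact production implementation may differ):

import unicodedata

def normalise(text: str) -> str:
    # Drop control/format characters (Unicode category "C*"), collapse repeated
    # whitespace, strip leading/trailing whitespace, and upper-case the result.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))
    return " ".join(text.split()).upper()

assert normalise("  hello   world\t") == "HELLO WORLD"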

Preprocessing Conventions

This document describes our conventions for scripts that preprocess data. Most of the code examples are in Python but all of the advice is equally applicable to R or any other language.

Script Location

Scripts should reside in this repository under "trase/data". The structure follows that of S3:

trase/data
└── <country>
    ├── auxiliary
    ├── boundaries
    ├── indicators
    ├── logistics
    ├── trade
    │   ├── cd
    │   └── ...
    └── <commodity>
        ├── auxiliary
        ├── logistics
        └── ...    

Where possible, the preprocessing script should be named to match exactly the file that it produces. Try to avoid names like "clean_boundaries.R" or "scrape_trade_data.py". By doing this we reduce the cognitive burden of the layout of S3 (already very unwieldy and chock-a-block with exceptions!) by avoiding introducing a second, subtly different layout in GitHub.

For example, if a script produces the following file:

s3://trase-storage/brazil/logistics/abiove/br_crushingFacilities.csv

Then it should reside in the following location:

trase/data/brazil/logistics/abiove/br_crushingFacilities.py

This is not always possible when the script produces multiple files, but try to match at least a common prefix. For example a script that uploads multiple years could be trase/data/brazil/trade/bol/BRAZIL_BOL_201X.py.

Reading data

Always read data directly from S3. Do NOT read data from your local hard-drive / from your local Dropbox folder / etc. This will make it impossible for anybody else to run the script, and "breaks" the link to the original source data.

In Python you can use the get_pandas_df or get_pandas_df_once function:

## DO NOT do this
df = pd.read_csv("~/Dropbox/bolivia/comex_bolivia_soy_2019.csv")

## DO do this
from trase.tools.sps import get_pandas_df_once 
df = get_pandas_df_once(
   "bolivia/trade/comex/comex_bolivia_soy_2019.csv", dtype=str, keep_default_na=False
)

When you read data, try to include the full S3 path as a single string literal. This makes it easier to find the script that produced a file in S3 simply by searching the codebase for the S3 path.

## DO NOT do this
root = "bolivia/trade/"
dataset = "comex"
years = [2019, 2020]
for year in years:
    s3_key = f"{root}/{dataset}/{dataset}_bolivia_soy_{year}.csv"

## DO do this
s3_keys = [
   "bolivia/trade/comex/comex_bolivia_soy_2019.csv",
   "bolivia/trade/comex/comex_bolivia_soy_2020.csv"
]
for s3_key in s3_keys:
   # ...

Caveats:

  • While the script is in development you can read from your local drive. However, before pushing it to GitHub switch the source to S3 and check the script runs.
  • Reading data from the Trase database / Splitgraph for cleaning

Writing data

Your script should NOT automatically upload the data to S3 by default. This raises the risk that somebody will run your script and accidentally update files on S3 without even realising it.

Instead, follow this pattern:

  1. When your script is finished, write the data to a temporary location
  2. Inform the user where the data has been written so that they can inspect it
  3. Inform the user that they should now upload the data to S3
  4. Alternatively, provide a non-default command-line option to your script that will perform the upload

For Python, always use the write_csv_for_upload function:

## DO NOT do this
df.to_csv("comex_bolivia_soy_2019.csv")
s3.upload_file(
   Filename="comex_bolivia_soy_2019",
   Bucket="trase-storage", 
   Key="bolivia/trade/comex/comex_bolivia_soy_2019.csv",
)

## DO do this
from trase.tools.aws.metadata import write_csv_for_upload

write_csv_for_upload(
    df, "bolivia/trade/comex/comex_bolivia_soy_2019.csv",
)

If you do this you will see the following message:

$ python trase/data/bolivia/trade/comex/comex_bolivia_soy_2019.py
Written /var/folders/_m/s00h1rdj3m75ptngsv9tv52w0000gp/T/comex_bolivia_soy_2019.csv
Written /var/folders/_m/s00h1rdj3m75ptngsv9tv52w0000gp/T/comex_bolivia_soy_2019.yml

The file has been written but not uploaded. Once you have inspected the file you can either:

  1) Re-run the script with --upload, or

  2) Use the Trase CLI:

       trase s3 upload '/var/folders/_m/s00h1rdj3m75ptngsv9tv52w0000gp/T/comex_bolivia_soy_2019.csv' 'bolivia/trade/comex/comex_bolivia_soy_2019.csv'

Think of the next person to find your script

We will quite likely need to re-run your script a year from now. It's quite possible that person won't be you. Or, it will be you, but you will have forgotten about the script.

Try to make the next person's life easier: leave comments, make sure the code is laid out logically, etc.

Minimising dependencies and code-reuse

We usually write a script, run it, and then don't even look at it for years. We have no automated systems that guarantee that a script will continue to run. In fact, it's pretty much certain that the script won't work a year from now!

To minimise the burden on the next person who tries to run the script, try to make scripts as standalone as possible.

  • Do not import Python code from elsewhere in the repository. Instead, copy the function definition into the script.
    # DO NOT do this
    from trase.tools.etl.utilities import clean_string
    
    # DO do this:
    from unidecode import unidecode  # make the copied function's dependency explicit

    def clean_string(text):
        text = " ".join(text.split())
        text = unidecode(text)
        return text.upper()
    
    If you would like to develop a function or library for use in many scripts, consider putting it in trase/tools.
  • Scripts should not import code from other scripts. Instead, copy-and-paste / duplicate the code that you need.
  • If you use a not-so-common external library put a comment on the import to say which version it was:
    import fcsparser  # 0.2.4
    

Caveats:

  • A few specific functions (get_pandas_df, get_pandas_df_once, aws_session, write_csv_for_upload) are OK

Pandas Best Practice

See Pandas for best practice when using Pandas.

Using GitHub

Always create a branch for your code changes. Create a draft pull request while you are working on the code. Create a pull request when it is ready to merge. Try to merge the code into the master as soon as it is finished and working.

Archiving Old Scripts

Trase is a project over many years and with tens of contributors. Our codebase is already quite large! If we kept all of the scripts we ever created, we would drown in code.

The master branch should contain the "most recent" version of a script. If a script has been superseded or the dataset is no longer relevant then you should archive the script in Git, for example using the git rm command or GitHub's "delete file".

The script is never deleted completely; we will always be able to find it again in the Git history.

DO NOT move the script into an "archive" folder.

Naming Conventions and Hierarchy

See also List of names for data standardization

Naming rules

General

  • Avoid spaces or special characters in names! (They are hard to work with from the command line.) Instead, please use underscores _.
  • Favor clarity over brevity.

TO BE REVISED

GEO IDS

  • Each country has its own geographic hierarchy (and quirks). While geographies do change over time (for example a municipality splitting into two), they are relatively stable and to make transitions between databases easier we have decided to use a set of permanent geoids uniquely naming each feature.

Countries

| Level | Level Type | Description | Example Code | Example Entity |
| --- | --- | --- | --- | --- |
| 1 | Country | ISO 2 Code | AF | AFGHANISTAN |
| 1 | Country | ISO 2 Code | XX | UNKNOWN COUNTRY |

Brazil

| Level | Level Type | Description | Example Code | Example Entity |
| --- | --- | --- | --- | --- |
| 1 | Country | ISO-2 Code | BR | BRAZIL |
| 2 | Region | IBGE region code | BR-5 | Centro-Oeste |
| 3 | State | IBGE state code | BR-51 | Mato Grosso |
| 4 | Mesoregion | IBGE mesoregion code | BR-5105 | Sudeste Mato-Grossense |
| 5 | Microregion | IBGE microregion code | BR-51020 | Tesouro |
| 6 | Municipality | IBGE 7-digit code | BR-5107008 | Poxoreo |
| X | Biome | Arbitrary coding | BR-BIO-4 | CAATINGA |

Colombia

| Level | Level Type | Description | Example Code | Example Entity |
| --- | --- | --- | --- | --- |
| 1 | Country | ISO-2 Code | CO | Colombia |
| 2 | Department | DANE department code | CO-05 | Antioquia |
| 3 | Municipality | DANE municipality code | CO-05002 | Abejorral |

Paraguay

| Level | Level Type | Description | Example Code | Example Entity |
| --- | --- | --- | --- | --- |
| 1 | Country | ISO-2 Code | PY | Paraguay |
| 2 | Department | Code from GADM | PY-10 | Central |
| 3 | District | Code from GADM | PY-100083 | Capiata |

Argentina

| Level | Level Type | Description | Example Code | Example Entity |
| --- | --- | --- | --- | --- |
| 1 | Country | ISO-2 Code | AR | Argentina |
| 2 | Province | INDEC province code | AR-50 | Mendoza |
| 3 | District | INDEC district code | AR-50077 | Malargue |

Indonesia

| Level | Level Type | Description | Example Code | Example Entity |
| --- | --- | --- | --- | --- |
| 1 | Country | ISO-2 Code | ID | Indonesia |
| 2 | Province | BPS province code | ID-12 | North Sumatra |
| 3 | Kabupaten | BPS kabupaten/kota code | ID-1210 | Dairi |

Datasets

  • Each dataset entered into the system gets a character id, referenced by 'title' in the metadata file. The format for this id is: <three-letter ISO country code>_<category name>_<subcategory>_<organization or author>. So, for example, fire reports from Brazil's space program INPE are titled bra_env_fire_inpe. The category names are as follows:

| Type | Definition |
| --- | --- |
| env | environmental data |
| social | social data |
| meta | metadata -- codes, keys, etc. |
| prod | production |
| trade | trade-related data |
| spatial | boundary spatial objects |
| logistics | related to supply chain logistics (ports, roads, etc.) |
| actors | related to supply chain actors |

Scripts

  • Scripts should follow the conventions above. For legacy reasons, however, the category part and the geography part have been switched.

Folder Hierarchy

The trase filesystem is split into two main branches: IN and OUT. IN contains all incoming 'raw' data. Its organization is a bit more cluttered, but that is OK, because it is not meant to be accessed by many. In general the folders are organized according to TOPIC/[SUB]TOPIC/REGION/DATASET/SPECIFIC DATA.

OUT contains the processed datasets that are ready for analysis and integration into Trase. Here is a snapshot of this structure, revealing the general pattern of REGION / TOPIC / [SUB]TOPIC / [SUB-SUB]TOPIC / DATASET-ORGANIZATION.

It is important that the last folder in the path (the leaf, i.e. the folder immediately containing the data files) be labeled according to the dataset source, for clarity and to distinguish different sources for the same data type (e.g. deforestation).

├── ARGENTINA
│   ├── 1-TRADE
│   │   └── CUSTOMS
│   │       ├── BOVINE
│   │       └── SOY
│   ├── 4-METADATA
│   │   └── CODES
│   └── 7-GEOGRAPHY
│       └── ADMIN
├── BRAZIL
│   ├── 1-TRADE
│   │   ├── BoL
│   │   │   ├── archive
│   │   │   └── SOY
│   │   ├── CUSTOMS
│   │   │   ├── archive
│   │   │   ├── CD_NEW
│   │   │   ├── CD_OLD
│   │   │   └── MDIC
│   │   ├── KASTNER
│   │   └── MERGED
│   ├── 2-PRODUCTION
│   │   └── IBGE
│   │       ├── agave
│   │       ├── apple
...
│   │       ├── wheat
│   │       └── yerbamate
│   ├── 3-LOGISTICS
│   │   ├── COSTS
│   │   ├── PORTS
│   │   ├── ROADS
│   │   ├── SILOS
│   │   │   ├── ABIOVE
│   │   │   ├── SELFDEC
│   │   │   └── SICARM
│   │   └── SOYSHEDS
│   ├── 4-METADATA
│   │   └── CODES
│   │       ├── CEP
│   │       ├── CNPJ
│   │       └── CNPJ_MDIC
│   ├── 5-SOCIAL
│   │   ├── CONFLICT
│   │   ├── EMBARGOES
│   │   ├── GDP
│   │   ├── HDI
│   │   ├── LEGALITY
│   │   │   └── APP
│   │   ├── SMALLHOLDERS
│   │   └── ZERO_DEFORESTATION
│   ├── 6-ENVIRONMENTAL
│   │   ├── BIODIVERSITY_IMPACT
│   │   ├── CARBON
│   │   │   ├── GHG
│   │   │   └── STOCKS
│   │   ├── FIRE
│   │   │   ├── INPE
│   │   │   └── old
│   │   ├── LULC
│   │   │   ├── AMAZON
│   │   │   ├── BRAZIL
│   │   │   ├── CERRADO
│   │   │   ├── MATA_ATLANTICA
│   │   │   └── TOTAL
│   │   └── WATER
│   │       └── ANA
│   ├── 7-GEOGRAPHY
│   │   ├── ADMIN
│   │   ├── BIOMES
│   │   ├── PROTECTED_AREAS
│   │   │   └── MMA
│   │   ├── SOYSHEDS
│   │   └── WATER
│   ├── 8-TRASE
│   │   └── V2_0
│   └── 9-ACTORS
│       └── COMMITMENTS
│           └── SOY
├── COLOMBIA
│   ├── 1-TRADE
│   │   ├── BOL
│   │   │   ├── BEEF
│   │   │   ├── PALM
│   │   │   └── SOY
│   │   └── CUSTOMS
│   │       └── PALM
│   ├── 4-METADATA
│   │   └── CODES
│   │       └── ADMIN
│   ├── 5-SOCIAL
│   │   └── POVERTY
│   ├── 6-ENVIRONMENTAL
│   │   ├── GHG
│   │   │   └── IDEAM
│   │   └── LAND_COVER
│   └── 7-GEOGRAPHY
│       └── ADMIN
├── INDONESIA
│   ├── 1-TRADE
│   │   └── CUSTOMS
│   │       └── PALM
│   ├── 2-PRODUCTION
│   │   └── OILPALM
│   ├── 3-LOGISTICS
│   │   └── PORTS
│   ├── 4-METADATA
│   │   └── CODES
│   └── 7-GEOGRAPHY
├── PARAGUAY
│   ├── 1-TRADE
│   │   ├── BOL
│   │   │   └── SOY
│   │   └── CUSTOMS
│   │       ├── BOVINE
│   │       └── SOY
│   ├── 2-PRODUCTION
│   │   └── SOY
│   ├── 3-LOGISTICS
│   │   └── PORTS
│   │       └── ADUANAS
│   ├── 4-METADATA
│   │   └── CODES
│   │       └── ADMIN
│   ├── 6-ENVIRONMENTAL
│   │   └── DEFORESTATION
│   │       ├── ESA
│   │       └── SOY
│   └── 7-GEOGRAPHY
│       └── ADMIN
├── PERU
│   └── 1-TRADE
│       └── BOL
│           ├── BEEF
│           ├── PALM
│           └── SOY
└── WORLD
    ├── 2-PRODUCTION
    │   └── FAO
    ├── 3-LOGISTICS
    │   └── PORTS
    ├── 4-METADATA
    │   └── CODES
    └── 7-GEOGRAPHY
        └── ADMIN

Copies of the raw files fed into Trase should be kept 'as is' as much as seems reasonable*.

Save the files to S3 in an 'ORIGINALS' folder in the proper location following standard data naming conventions and hierarchy.

Scripts used to download datasets should be stored in a /download repo on the TRASE github, with a name that clearly identifies the dataset being downloaded. In our experience, however, it is not recommended to over-automate the download process with scripting, as websites tend to change their formats frequently.

* An exception to this would be simple formatting manipulations which are easy to do by hand, don't change the data and would be really annoying to code.

Pandas

Here are a few tips to make Pandas more robust and readable!

Always prefer column operations to iterrows and friends

Pandas is exceptionally fast if you do operations on columns. You also usually find that your code is shorter, more readable, and less prone to ordering bugs if you stick to working on whole columns. The Series class has a huge range of functions that you can run: check them out!
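
For example, computing a unit-price column (hypothetical column names):

import pandas as pd

df = pd.DataFrame({"fob_usd": [100.0, 250.0], "volume_kg": [50.0, 100.0]})

## DO NOT do this: row-by-row iteration is slow and error-prone
prices = []
for _, row in df.iterrows():
    prices.append(row["fob_usd"] / row["volume_kg"])
df["usd_per_kg"] = prices

## DO do this: operate on the whole column at once
df["usd_per_kg"] = df["fob_usd"] / df["volume_kg"]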

Avoid NaNs

Missing data in Pandas dataframes, and NaNs in Python generally, are very weird objects. Their behaviour is often unclear or surprising (see The Weird World of Missing Values in Pandas). Another example is that NaN is never equal to NaN, except when using pd.merge. It's best to avoid them!

They can often creep in if you use pd.read_csv: unless you explicitly tell it not to, it will try to auto-detect missing values, which is particularly sad for Norway, whose ISO code is "NA"! Always pass keep_default_na=False to this function.

Avoid auto-detection

It is generally better to be explicit than try to infer behaviour from the data. This is more robust to errors and easier to read. That means:

  • Do not auto-detect encodings in code: instead pass the encoding explicitly, e.g. pd.read_csv(encoding="latin-1", ...)
  • Do not auto-detect CSV separators: pass the separator explicitly, e.g. pd.read_csv(sep=",", ...)
  • Do not auto-detect missing values: handle these cases in your code explicitly
  • Do not auto-detect types: cast explicitly to the types you are expecting
  • Explicitly select the columns you are expecting rather than just reading them all in, e.g. pd.read_csv(usecols=["country"], ...)

If you use pd.read_csv or similar, always pass these arguments explicitly:

df = pd.read_csv(filename, sep=";", dtype=str, encoding="latin-1", keep_default_na=False, usecols=["exporter", "vol"])
df = df[df["exporter"] != "NA"].copy()
df = df.astype({"vol": int})

Learn to use groupby and merge

They will make your code a lot faster and more readable!
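
A short sketch of both, with hypothetical column names and illustrative values:

import pandas as pd

flows = pd.DataFrame({
    "exporter": ["A", "A", "B"],
    "geoid": ["BR-5107008", "BR-5107008", "BR-5100102"],
    "vol": [10.0, 5.0, 2.0],
})
municipalities = pd.DataFrame({
    "geoid": ["BR-5107008", "BR-5100102"],
    "name": ["POXOREO", "OTHER_MUNICIPALITY"],  # illustrative names
})

# Aggregate volume per exporter and municipality in one call
totals = flows.groupby(["exporter", "geoid"], as_index=False)["vol"].sum()

# Attach municipality names with a merge instead of a manual lookup loop
totals = totals.merge(municipalities, on="geoid", how="left")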