Cost Matrix V2


This page is synchronized from trase/data/brazil/logistics/cost_matrix/cost_matrix_v2/README.md. Last modified on 2026-03-21 22:30 CET by Trase Admin. Please view or edit the original file there; changes should be reflected here after a midnight build (CET time), or manually triggering it with a GitHub action (link).

Brazilian Freight Cost Matrix Generator

Pipeline Documentation | OSRM + Machine Learning Integration

📄 Overview

This pipeline was engineered to generate a comprehensive national road transportation cost matrix for Brazil. It computes transportation costs for all possible origin–destination routes between municipalities across the country. These costs are represented by travel time and freight cost per ton. This approach makes both variables available, allowing users to rely on either one individually or combine them into a single Generalized Cost index, depending on whether time or price (or a combination of both) proves to be the superior predictor.

Data Sources: Real freight prices from CONAB (~200 routes per year) merged with the OpenStreetMap (OSM) road network.
Routing Engine: OSRM (Open Source Routing Machine) to calculate distances and travel times.
Intelligence: ML Model (Scikit-Learn) to estimate freight costs for routes without values.

Final Output: A Parquet file containing over 31 million routes connecting all 5,570 Brazilian municipalities.
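The "over 31 million routes" figure follows from pairing every municipality with every municipality. A quick check, assuming the matrix stores every ordered origin-destination pair including self-pairs (this assumption matches the V2 total of 31,024,900 reported in the QA section):

```python
# Number of Brazilian municipalities, as stated above.
n_municipalities = 5_570

# Every ordered (origin, destination) pair, self-pairs included (assumption).
n_routes = n_municipalities ** 2

# The same count if self-pairs were excluded, shown for comparison only.
n_routes_no_self = n_municipalities * (n_municipalities - 1)

print(f"{n_routes:,} routes including self-pairs")
print(f"{n_routes_no_self:,} routes excluding self-pairs")
```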

💻 Technical Requirements

Environment: Docker Container (18 GB RAM, 4 GB Disk).
Performance: Total processing completed in ~9 minutes.

⚙️ Pipeline Execution Steps

Infrastructure Setup: Deployment of a local OSRM backend via Docker, pre-loaded with the Brazilian OSM network.
- Install and run Docker on your environment
- Download the OSM road network by running: trase/data/brazil/logistics/cost_matrix/cost_matrix_v2/osm_source.sh
- Run docker compose: docker-compose -f trase/data/brazil/logistics/cost_matrix/cost_matrix_v2/docker-compose.yml up -d
Generation of Brazilian Transportation Costs
- Run the script trase/data/brazil/logistics/cost_matrix/cost_matrix_v2/1_generate_cost_matrix.py
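
Once the OSRM backend is up, distances and travel times can be requested from its HTTP table service. The sketch below is not the code of 1_generate_cost_matrix.py; it only illustrates the kind of request involved. The base URL (OSRM's default port 5000), the `driving` profile, and the helper names are assumptions, and the actual port depends on the docker-compose.yml in this folder.

```python
import json
from urllib.request import urlopen


def build_table_url(coords, base_url="http://localhost:5000"):
    """Build an OSRM table request for a list of (lon, lat) tuples."""
    coord_str = ";".join(f"{lon:.6f},{lat:.6f}" for lon, lat in coords)
    return f"{base_url}/table/v1/driving/{coord_str}?annotations=duration,distance"


def parse_table(payload):
    """Convert an OSRM table response (seconds, meters) to hours and km.

    Pairs without a route come back as null and are kept as None.
    """
    if payload.get("code") != "Ok":
        raise ValueError(f"OSRM request failed: {payload.get('code')}")
    hours = [[None if s is None else s / 3600 for s in row]
             for row in payload["durations"]]
    km = [[None if m is None else m / 1000 for m in row]
          for row in payload["distances"]]
    return hours, km


def fetch_table(coords, base_url="http://localhost:5000"):
    """Query a running OSRM backend and return (hours, km) matrices."""
    with urlopen(build_table_url(coords, base_url)) as resp:
        return parse_table(json.load(resp))
```

For example, `fetch_table([(-55.7, -12.5), (-46.6, -23.5)])` would request the 2 x 2 matrix between two illustrative points (roughly Sorriso and São Paulo), provided the container is running.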

Methodology

Distance and travel time were derived directly from OSM for all possible origin-destination routes. Because freight prices are not available for every route across Brazil, this pipeline uses a Log-Log Linear Regression model trained on high-quality historical data from CONAB to fill the freight price gap.

The model estimates the freight cost per ton ($\text{price}$, in BRL/ton) by applying linear regression to the natural logarithms of the continuous input features, plus origin-state dummies. The relationship is defined by the following equation, where the sum runs over origin states $s$ and $\mathbb{1}[\cdot]$ is a 0/1 indicator:

$$\log(\text{price}) = \beta_0 + \beta_1 \log(\text{distance}) + \beta_2 \log(\text{duration}) + \sum_{s} \gamma_s \, \mathbb{1}[\text{State}_{\text{origin}} = s]$$

The process involves three critical steps:

* Feature Engineering: Continuous variables (distance in km and duration in hours) are transformed into their logarithmic forms to linearize the non-linear relationship between distance and cost.
* Spatial Context: We incorporate State-level dummies (e.g., orig_MT, orig_SP) to capture regional price variations, as indicators of infrastructure quality, tax differences (ICMS), and local market demand.
* Back-Transformation: After predicting the log-price, we apply the exponential function ($\exp$) to return the value to its original scale (BRL/ton).
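
The three steps can be sketched end to end on synthetic data. This is a minimal illustration, not the pipeline's implementation: it uses NumPy least squares instead of the Scikit-Learn model, noise-free synthetic prices instead of CONAB data, and a hypothetical subset of three origin states with the first one dropped as the reference level. The coefficient values are made up.

```python
import numpy as np

STATES = ["MT", "SP", "PR"]  # hypothetical subset of origin states


def design_matrix(distance_km, duration_h, origin_state):
    """Columns: intercept, log(distance), log(duration), origin-state dummies."""
    states = np.array(origin_state)
    cols = [np.ones(len(states)), np.log(distance_km), np.log(duration_h)]
    # Drop the first state as reference level to avoid collinearity
    # with the intercept (an assumption of this sketch).
    cols += [(states == s).astype(float) for s in STATES[1:]]
    return np.column_stack(cols)


# Synthetic routes: distances in km, durations from a random average speed.
rng = np.random.default_rng(0)
n = 300
distance_km = rng.uniform(50, 3000, n)
duration_h = distance_km / rng.uniform(40, 90, n)
origin_state = [STATES[i % len(STATES)] for i in range(n)]

# Made-up coefficients used only to generate noise-free training prices.
true_coef = np.array([1.0, 0.6, 0.2, 0.3, -0.1])
X = design_matrix(distance_km, duration_h, origin_state)
price_ton = np.exp(X @ true_coef)

# Fit on log(price), then back-transform predictions with exp().
coef, *_ = np.linalg.lstsq(X, np.log(price_ton), rcond=None)
predicted_price_ton = np.exp(X @ coef)
```

Because the synthetic prices carry no noise, the fit recovers `true_coef` almost exactly; with real CONAB prices the residuals would be non-zero and the fit should be judged on a held-out validation set.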

Why this method was chosen

We selected this specific approach over more complex models (like Random Forests or Gradient Boosting) for several strategic reasons:

Capturing Cost Degressivity: Road transportation costs are not strictly linear; the "cost per kilometer" typically decreases as the total distance increases. A log-log model is the mathematical standard for capturing this power-law degressivity efficiently.
Computational Efficiency: With a target matrix of over 31 million routes, the model must support "Massive Inference." Linear Regression is extremely fast, allowing the entire matrix to be filled in minutes.
Interpretability: Each coefficient has a direct reading. The distance and duration coefficients are elasticities (the percentage change in price for a 1% change in the input), and each origin-state coefficient is a proportional shift in price, which is essential for auditing the logistics logic.
Handling Heteroscedasticity: Logarithmic transformations help stabilize the variance of error terms, which is often higher for long-haul routes in large-scale datasets.
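
The degressivity argument can be made explicit by back-transforming the model. In this short derivation, duration and origin state are held fixed, and $C$ is a symbol introduced here for illustration that collects the intercept, duration, and state terms:

$$\text{price} = C \cdot \text{distance}^{\beta_1}, \qquad \frac{\text{price}}{\text{distance}} = C \cdot \text{distance}^{\beta_1 - 1}$$

Cost per kilometer therefore falls with distance whenever $\beta_1 < 1$. If duration grows roughly in proportion to distance, the same argument applies with the exponent $\beta_1 + \beta_2$ in place of $\beta_1$. The fitted coefficient values are not reported on this page, so the condition should be checked against the trained model.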

Figure 1: Parity Plot - Predicted vs. Actual Freight Costs (CONAB Validation Set).

📊 Quality Assessment (QA) & Results

1. Connectivity & Reliability Comparison

Comparison between the legacy matrix and the current V2 version focusing on network integrity:

| Metric | Legacy Matrix (2019) | V2 Matrix (2023) |
|---|---|---|
| Total Routes | 31,013,761 | 31,024,900 |
| Zero Duration Errors | 0.0180% | 0.0180% |
| Critical Failures (< 1 min duration) | 2 | 0 (Cleaned) |

2. V2 Dataset Insights & Predictive Performance `NEW IN V2`

In this version we introduce two variables, distance and freight price, from which we can derive the following metrics:

| Metric | Value |
|---|---|
| Avg. Speed | 72.72 km/h |
| Avg. Distance | 1,853.77 km |
| Avg. Real Price (CONAB, per ton) | R$ 315.38 |
| Avg. Predicted Price (per ton) | R$ 441.65 |
| Price Outliers (>2x P99) | 0.00% |
3. Travel Time Differences Between Datasets

This comparison quantifies the shift in travel time for identical routes (under 50 hours) across the two dataset versions. By isolating routes present in both matrices, we can observe how the Brazilian road network changed over a four-year period, from the 2019 Legacy baseline to the 2023 V2 update.

Figure 2: Distribution of travel time deltas for routes under 50 hours

The histogram above shows that most routes associated with Sorriso have shorter travel times in the new dataset (negative values). A small number of routes, by contrast, have shorter travel times in the previous dataset than in the new one. These few cases correspond to locations that are quite far from Sorriso and occur when OSM could not identify a feasible route and instead applied a fallback parameter. This does not pose a risk to the dataset, because the fallback consistently yields long travel times in such cases. See the figure below.

Figure 3: Map of Travel Time Differences for routes under 50 hours
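
The travel-time comparison in this section can be sketched as a join on the origin-destination pair followed by a subtraction. This is a simplified illustration using plain dictionaries; the function name, the municipality labels, and the rule that a route must be under 50 hours in both versions are assumptions.

```python
def travel_time_deltas(old, new, max_hours=50.0):
    """Return new - old duration (hours) for routes present in both matrices.

    old, new: dicts mapping (origin, destination) -> duration in hours.
    Negative deltas mean the route became faster in the new matrix.
    """
    deltas = {}
    for route in old.keys() & new.keys():
        # Keep only routes under the threshold in both versions (assumption).
        if old[route] < max_hours and new[route] < max_hours:
            deltas[route] = new[route] - old[route]
    return deltas


# Made-up durations for hypothetical municipalities A-D.
old_matrix = {("A", "B"): 10.0, ("A", "C"): 60.0, ("B", "C"): 5.0}
new_matrix = {("A", "B"): 8.5, ("A", "C"): 55.0, ("C", "D"): 3.0}
deltas = travel_time_deltas(old_matrix, new_matrix)
```

Here only the A to B route qualifies: A to C exceeds 50 hours, and the other two routes exist in just one of the matrices.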

Map of Surface Cost with all routes

Figure 4: Map of Surface Cost in 2019 and 2023

More details can be found in the notebook located at qa/cm_exploration.ipynb

🧪 Research Goal & Future Decision

While our current models rely heavily on Travel Time as a proxy for operational effort, the introduction of the Freight Cost variable serves a strategic purpose:

Direct Economic Output: We are testing whether using actual freight prices (which account for fuel, tolls, and regional market volatility) provides a more "ground-truth" reflection of the Brazilian logistics landscape than time-based metrics alone.
Hybrid Modeling: We are exploring whether a combination of duration and cost yields a more robust indicator for supply chain optimization.
Validation Phase: These new variables are currently included to determine if the added complexity of the ML price-prediction layer significantly enhances the decision-making quality of the Trase.earth Logistics Module.

Note for Analysts: At this stage, we recommend continuing to use travel time. Future iterations will determine whether to consolidate both variables into a single "Generalized Cost" index or whether one specific metric (Time vs. Price) proves to be the superior predictor.
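
For reference, one common textbook form of a generalized cost adds the monetary price to travel time multiplied by a value of time. The sketch below shows only that form. No formula has been adopted for this pipeline, and the `value_of_time` parameter (BRL per ton-hour) and its example value are purely hypothetical.

```python
def generalized_cost(price_ton, duration_h, value_of_time):
    """Generalized cost in BRL/ton: freight price plus monetized travel time.

    value_of_time is a hypothetical conversion factor in BRL per ton-hour.
    """
    return price_ton + value_of_time * duration_h


# Made-up route: R$ 300 per ton, 20 hours, value of time of R$ 2.50 per ton-hour.
example_cost = generalized_cost(300.0, 20.0, 2.5)
```

Setting `value_of_time` to zero reduces the index to the freight price alone, which makes the trade-off between the two variables easy to test.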

Author: Jailson S. - Data Scientist, Data Product Team
Date: January 2026
