Skip to content

Osm Br Soy Distance Matrix

s3://trase-storage/brazil/mars_pilot/in/soy/osm_br_soy_distance_matrix.csv

Dbt path: trase_production.main_brazil.osm_br_soy_distance_matrix

Explore on Metabase: Full table; summary statistics

Containing yaml file link: trase/data_pipeline/models/brazil/mars_pilot/in/soy/_schema.yml

Model file link: trase/data_pipeline/models/brazil/mars_pilot/in/soy/osm_br_soy_distance_matrix.py

Calls script: trase/data/brazil/soy/mars_pilot/distance_matrix/calculate_cost_distance_matrix_origin_to_silo.py

Dbt test runs & lineage: Test results ยท Lineage

Full dbt_docs page: Open in dbt docs (includes lineage graph -at the bottom right-, tests, and downstream dependencies)

Tags: mock_model, brazil, in, mars_pilot, soy


osm_br_soy_distance_matrix

Description

This model was auto-generated based off .yml 'lineage' files in S3. The DBT model just raises an error; the actual script that created the data lives elsewhere. The script is located at trase/data/brazil/soy/mars_pilot/calculate_cost_distance_matrix_origin_to_silo.py [permalink]. It was last run by Osvaldo Pereira.


Details

Column Type Description

Models / Seeds

  • source.trase_duckdb.trase-storage-raw.location_soy_silos_origins_from_to

Sources

  • ['trase-storage-raw', 'location_soy_silos_origins_from_to']
"""
Scrapping cost distance from openstreetmap API (https://project-osrm.org/docs/v5.22.0/api/#general-options) using
parallel processing from list of locations.
The input data should be a table with columns for: 1. origin/destination; 2. lat/long from origin and; 3. lat/long
to destination, showing all desired connections between origin and destination.
About parallel processing: the scrapper accesses osm databased opening several requests at the same time.
Avoid using more than 10 jobs to avoid "null" values.
created by: osvaldo pereira
"""

import pandas as pd
import requests
import json
from joblib import Parallel, delayed
from trase.tools.aws.aws_helpers import get_pandas_df
from trase.tools.aws.metadata import write_csv_for_upload

# input matrix wih lat and long for all routes
df = get_pandas_df(
    "brazil/mars_pilot/ori/soy/location_soy_silos_origins_from_to_step1_test.csv",
    sep=";",
    keep_default_na=True,
)


def f(x):
    """request from osm page. For this case, we are using the "driving mode" with
    no alternative rotes to create only one possible distance/duration by connection.
    All options for osm api can be found at: http://project-osrm.org/docs/v5.5.1/api/#general-options
    """

    res = requests.get(
        (
            f"http://router.project-osrm.org/route/v1/driving/"
            f"{x['from_lat']},{x['from_long']};{x['to_lat']},{x['to_long']}"
            f"?overview=false"
        ),
        timeout=300,
    )

    # parse output into a dict if valid. 200 is the code for valid requests
    if res.status_code == 200:
        data = json.loads(res.text)
        output = {
            "origin_destination": x["origin_destination"],
            "origin_long": data["waypoints"][0]["location"][0],
            "origin_lat": data["waypoints"][0]["location"][1],
            "destination_long": data["waypoints"][1]["location"][0],
            "destination_lat": data["waypoints"][1]["location"][1],
            "distance": data["routes"][0]["distance"],
            "duration": data["routes"][0]["duration"],
        }
    # Failed requests will show as NaN in the output
    else:
        output = {
            "origin_destination": x["origin_destination"],
            "origin_long": x["origin_long"],
            "origin_lat": x["origin_lat"],
            "destination_long": x["origin_long"],
            "destination_lat": x["origin_lat"],
            "distance": float("nan"),
            "duration": float("nan"),
        }

    return output


"""run f(x) in parallel. A high number of jobs may cause failure. 
A low number of jobs may significantly increase the processing time.
each row takes about 1 sec to run from osm API. Always check the output for 
NaN values. re-run the model decreasing jobs until achieving 100% of
coverage"""
distance_matrix = Parallel(n_jobs=10, backend="threading")(
    delayed(f)(x) for index, x in df.iterrows()
)

# cost distance matrix output with duration (sec) and distance (m) for each route
distance_matrix = pd.DataFrame(distance_matrix)
distance_matrix[["origin", "destination"]] = distance_matrix[
    "origin_destination"
].str.split("&", 1, expand=True)

write_csv_for_upload(
    distance_matrix,
    "brazil/mars_pilot/in/soy/osm_br_soy_distance_matrix_step1_test.csv",
    columns=[
        "origin",
        "destination",
        "origin_long",
        "origin_lat",
        "destination_long",
        "destination_lat",
        "distance",
        "duration",
    ],
)
import pandas as pd


def model(dbt, cursor):
    dbt.source("trase-storage-raw", "location_soy_silos_origins_from_to")

    raise NotImplementedError()
    return pd.DataFrame({"hello": ["world"]})