Osm Br Soy Distance Matrix
s3://trase-storage/brazil/mars_pilot/in/soy/osm_br_soy_distance_matrix.csv
Dbt path: trase_production.main_brazil.osm_br_soy_distance_matrix
Explore on Metabase: Full table; summary statistics
Containing yaml file link: trase/data_pipeline/models/brazil/mars_pilot/in/soy/_schema.yml
Model file link: trase/data_pipeline/models/brazil/mars_pilot/in/soy/osm_br_soy_distance_matrix.py
Calls script: trase/data/brazil/soy/mars_pilot/distance_matrix/calculate_cost_distance_matrix_origin_to_silo.py
Dbt test runs & lineage: Test results ยท Lineage
Full dbt_docs page: Open in dbt docs (includes lineage graph -at the bottom right-, tests, and downstream dependencies)
Tags: mock_model, brazil, in, mars_pilot, soy
osm_br_soy_distance_matrix
Description
This model was auto-generated based off .yml 'lineage' files in S3. The DBT model just raises an error; the actual script that created the data lives elsewhere. The script is located at trase/data/brazil/soy/mars_pilot/calculate_cost_distance_matrix_origin_to_silo.py [permalink]. It was last run by Osvaldo Pereira.
Details
| Column | Type | Description |
|---|---|---|
Models / Seeds
source.trase_duckdb.trase-storage-raw.location_soy_silos_origins_from_to
Sources
['trase-storage-raw', 'location_soy_silos_origins_from_to']
"""
Scrapping cost distance from openstreetmap API (https://project-osrm.org/docs/v5.22.0/api/#general-options) using
parallel processing from list of locations.
The input data should be a table with columns for: 1. origin/destination; 2. lat/long from origin and; 3. lat/long
to destination, showing all desired connections between origin and destination.
About parallel processing: the scrapper accesses osm databased opening several requests at the same time.
Avoid using more than 10 jobs to avoid "null" values.
created by: osvaldo pereira
"""
import pandas as pd
import requests
import json
from joblib import Parallel, delayed
from trase.tools.aws.aws_helpers import get_pandas_df
from trase.tools.aws.metadata import write_csv_for_upload
# input matrix wih lat and long for all routes
df = get_pandas_df(
"brazil/mars_pilot/ori/soy/location_soy_silos_origins_from_to_step1_test.csv",
sep=";",
keep_default_na=True,
)
def f(x):
"""request from osm page. For this case, we are using the "driving mode" with
no alternative rotes to create only one possible distance/duration by connection.
All options for osm api can be found at: http://project-osrm.org/docs/v5.5.1/api/#general-options
"""
res = requests.get(
(
f"http://router.project-osrm.org/route/v1/driving/"
f"{x['from_lat']},{x['from_long']};{x['to_lat']},{x['to_long']}"
f"?overview=false"
),
timeout=300,
)
# parse output into a dict if valid. 200 is the code for valid requests
if res.status_code == 200:
data = json.loads(res.text)
output = {
"origin_destination": x["origin_destination"],
"origin_long": data["waypoints"][0]["location"][0],
"origin_lat": data["waypoints"][0]["location"][1],
"destination_long": data["waypoints"][1]["location"][0],
"destination_lat": data["waypoints"][1]["location"][1],
"distance": data["routes"][0]["distance"],
"duration": data["routes"][0]["duration"],
}
# Failed requests will show as NaN in the output
else:
output = {
"origin_destination": x["origin_destination"],
"origin_long": x["origin_long"],
"origin_lat": x["origin_lat"],
"destination_long": x["origin_long"],
"destination_lat": x["origin_lat"],
"distance": float("nan"),
"duration": float("nan"),
}
return output
"""run f(x) in parallel. A high number of jobs may cause failure.
A low number of jobs may significantly increase the processing time.
each row takes about 1 sec to run from osm API. Always check the output for
NaN values. re-run the model decreasing jobs until achieving 100% of
coverage"""
distance_matrix = Parallel(n_jobs=10, backend="threading")(
delayed(f)(x) for index, x in df.iterrows()
)
# cost distance matrix output with duration (sec) and distance (m) for each route
distance_matrix = pd.DataFrame(distance_matrix)
distance_matrix[["origin", "destination"]] = distance_matrix[
"origin_destination"
].str.split("&", 1, expand=True)
write_csv_for_upload(
distance_matrix,
"brazil/mars_pilot/in/soy/osm_br_soy_distance_matrix_step1_test.csv",
columns=[
"origin",
"destination",
"origin_long",
"origin_lat",
"destination_long",
"destination_lat",
"distance",
"duration",
],
)
import pandas as pd
def model(dbt, cursor):
dbt.source("trase-storage-raw", "location_soy_silos_origins_from_to")
raise NotImplementedError()
return pd.DataFrame({"hello": ["world"]})