Results Full Clean Pivoted
s3://trase-storage/world/UK_indicator_data_TEST/results_full_clean_pivoted.csv
Dbt path: trase_production.main.results_full_clean_pivoted
Explore on Metabase: Full table; summary statistics
Containing yaml file link: trase/data_pipeline/models/world/uk_indicator_data_test/_schema.yml
Model file link: trase/data_pipeline/models/world/uk_indicator_data_test/results_full_clean_pivoted.py
Calls script: trase/data/world/UK_indicator_data_TEST/results_full_clean_pivoted.py
Dbt test runs & lineage: Test results ยท Lineage
Full dbt_docs page: Open in dbt docs (includes lineage graph -at the bottom right-, tests, and downstream dependencies)
Tags: mock_model, UK_indicator_data_TEST, world
results_full_clean_pivoted
Description
This model was auto-generated based off .yml 'lineage' files in S3. The DBT model just raises an error; the actual script that created the data lives elsewhere. The script is located at trase/data/world/UK_indicator_data_TEST/results_full_clean_pivoted.py [permalink]. It was last run by Harry Biddle.
Details
| Column | Type | Description |
|---|---|---|
Models / Seeds
source.trase_duckdb.trase-storage-raw.results_full_clean
Sources
['trase-storage-raw', 'results_full_clean']
"""
The UK Indicator data is long:
year ... metric value_clean
---- --- ------ -----------
2005 ... production_embedded_in_consumption 714.663896900188
2005 ... production_embedded_in_consumption 11800.3963584192
2005 ... production_embedded_in_consumption 1109.40947849267
However, Splitgraph stores data column-wise. To try to improve performance, in this
script we pivot the data:
year ... production_embedded_in_consumption
---- --- ------
2005 ... 714.663896900188
2005 ... 11800.3963584192
2005 ... 1109.40947849267
"""
from trase.tools.aws import get_pandas_df
import pandas as pd
from trase.tools.aws.metadata import write_csv_for_upload
INDEX_COLUMNS = [
"year",
"commodity_code",
"commodity_name",
"producing_country_code",
"producing_country_name",
"consuming_country_code",
"consuming_country_name",
]
def main():
df = get_pandas_df(
"world/UK_indicator_data_TEST/results_full_clean.csv",
sep=";",
dtype=str,
keep_default_na=False,
)
# there seems to be a ton of duplicates: not sure why that is
df = df.drop_duplicates()
# pivot on metric. the units are unique to the metric, so we add them to the name
df["metric"] = df["metric"] + "_" + df["units"]
df = df.set_index(INDEX_COLUMNS)
df = df.pivot(columns="metric")["value_clean"]
df = df.reset_index()
# sort columns by order of cardinality
columns = list(df[INDEX_COLUMNS].apply(pd.Series.nunique).sort_values().index)
df = df.sort_values(by=columns)
# write to disk: we leave it to the user to do the actual upload
write_csv_for_upload(
df, "world/UK_indicator_data_TEST/results_full_clean_pivoted.csv"
)
if __name__ == "__main__":
main()
import pandas as pd
def model(dbt, cursor):
dbt.source("trase-storage-raw", "results_full_clean")
raise NotImplementedError()
return pd.DataFrame({"hello": ["world"]})