IBGE Biome Municipality 2024 Gold

s3://trase-storage/brazil/spatial/boundaries/ibge/br_municipality_biome/ibge_biome_municipality_2024_gold.parquet

Dbt path: trase_production.main_brazil.ibge_biome_municipality_2024_gold

Explore on Metabase: Full table; summary statistics

Containing yaml file link: trase/data_pipeline/models/brazil/spatial/boundaries/ibge/br_municipality_biome/_schema.yml

Model file link: trase/data_pipeline/models/brazil/spatial/boundaries/ibge/br_municipality_biome/ibge_biome_municipality_2024_gold.py

Calls script: trase/data/brazil/spatial/boundaries/ibge/br_municipality_biome/ibge_biome_municipality_2024_gold.py

Dbt test runs & lineage: Test results · Lineage

Full dbt_docs page: Open in dbt docs (includes the lineage graph at the bottom right, tests, and downstream dependencies)

Tags: brazil

ibge_biome_municipality_2024_gold

Description

IBGE – Relationship Between Municipalities and Biomes (2024)

This dataset contains the relationship between Brazilian municipalities and their predominant biome, as published by the Brazilian Institute of Geography and Statistics (IBGE) in June 2024. It is part of IBGE’s Biomas project, which supports environmental analysis, policymaking, and statistical studies.

What the dataset contains

The source data — Bioma_Predominante_por_Municipio_2024 — assigns one biome to each municipality or special administrative area in Brazil. This biome is the one covering the largest territorial area in that municipality.

The file includes:

- 5,568 municipalities
- The Federal District (Brasília – DF)
- The State District of Fernando de Noronha – PE
- Two State Operational Areas (Lagoa dos Patos and Lagoa Mirim – RS)

Total: 5,572 rows, one per location.
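
As a rough illustration of the "largest territorial area" rule described above, the sketch below picks one biome per location from a hypothetical table of overlap areas. IBGE publishes only the resulting assignment; the `overlaps` table, its column names, the geocodes, and the area values are all invented for this example.

```python
import pandas as pd

# Hypothetical overlap areas in km²; geocodes and values are invented.
overlaps = pd.DataFrame(
    {
        "geocode": ["9999991", "9999991", "9999992"],
        "biome": ["Cerrado", "Amazônia", "Pampa"],
        "area_km2": [120.0, 480.0, 75.0],
    }
)


def predominant_biome(overlaps: pd.DataFrame) -> pd.DataFrame:
    """Keep, for each geocode, the biome row with the largest area."""
    idx = overlaps.groupby("geocode")["area_km2"].idxmax()
    return overlaps.loc[idx, ["geocode", "biome"]].reset_index(drop=True)


result = predominant_biome(overlaps)

# Row count stated above: municipalities + Federal District
# + Fernando de Noronha + two State Operational Areas.
expected_rows = 5568 + 1 + 1 + 2
```

The row-count arithmetic at the end simply restates the breakdown in the list above.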

How we use the dataset in Trase

This dataset is:

- Used throughout our Brazil pipelines whenever biome information is needed.
- Ingested into the Trase database.

How to fetch the data from the source

Update frequency: Irregular (based on IBGE releases)
Latest release: June 2024
Next expected update: Unknown; depends on IBGE biomes project updates

The original IBGE data can be downloaded manually from https://www.ibge.gov.br/.

The script used to process/clean the dataset

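The script trase/data/brazil/spatial/boundaries/ibge/br_municipality_biome/ibge_biome_municipality_2024_gold.py processes the IBGE list into a clean, analysis-ready dataset. It drops the metadata footer rows and the two lagoon operational areas, builds a Trase ID from each geocode, matches municipality and biome nodes in the Trase PostgreSQL database, and renames the columns to the conventions listed under Details.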
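
The first cleaning steps can be tried without database access. The sketch below is a minimal stand-in, not the production script: it reads a toy CSV with the same read options the script uses (`sep=";"`, `dtype=str`, `na_filter=False`), drops footer and lagoon rows, and derives the Trase ID. The municipality rows, the abbreviated footer, the biome values, and the helper name `tidy` are assumptions made for illustration; only the two lagoon geocodes come from the script.

```python
import io

import pandas as pd

# Toy stand-in for the IBGE CSV. Municipality rows and biome values are
# placeholders, and the footer is abbreviated.
RAW_CSV = (
    "Geocódigo;Nome do município;Sigla da UF;Bioma predominante\n"
    "9999991;Exemplo A;XX;Cerrado\n"
    "9999992;Exemplo B;XX;Mata Atlântica\n"
    "4300001;Lagoa Mirim;RS;Pampa\n"
    "4300002;Lagoa dos Patos;RS;Pampa\n"
    "--------;;;\n"
    "CAMPO;;;\n"
)

# Everything is read as text so geocodes stay strings and no cell becomes NaN.
raw = pd.read_csv(io.StringIO(RAW_CSV), sep=";", dtype=str, na_filter=False)

FOOTER_MARKERS = ["--------", "CAMPO"]  # subset of the markers in the script
LAGOONS = ["4300001", "4300002"]  # Lagoa Mirim, Lagoa dos Patos


def tidy(df: pd.DataFrame) -> pd.DataFrame:
    """Drop footer and lagoon rows, then build a Trase ID per geocode."""
    df = df[~df["Geocódigo"].isin(FOOTER_MARKERS + LAGOONS)].copy()
    geocodes = df["Geocódigo"]
    if not (geocodes.str.isdigit() & (geocodes.str.len() == 7)).all():
        raise ValueError("Geocódigos must be 7 digits long")
    df["trase_id"] = "BR-" + geocodes
    return df


clean = tidy(raw)
```

Reading every column as a string is what makes the `"BR-" + geocode` concatenation and the digit check straightforward.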

History

2025-08: Cleaned by Harry

Details

Columns

Column	Type	Description
`MUNICIPALITY_GEOCODE_IBGE`	`VARCHAR`	7-digit IBGE municipality geocode (or special area code) from the source file.
`MUNICIPALITY_LABEL`	`VARCHAR`	Municipality name as provided by IBGE (accents and spacing preserved).
`MUNICIPALITY_STATE_UF`	`VARCHAR`	Two-letter state (UF) abbreviation.
`BIOME_LABEL_ORIGINAL`	`VARCHAR`	Biome label exactly as provided by IBGE (“Bioma predominante”).
`MUNICIPALITY_TRASE_ID`	`VARCHAR`	Trase node identifier for the municipality, formatted as BR- followed by the 7-digit IBGE geocode.
`MUNICIPALITY_NODE_ID`	`BIGINT`	Integer node ID of the municipality in the Trase PostgreSQL database.
`BIOME_LABEL_CLEANED`	`VARCHAR`	Cleaned biome label (uppercased/diacritics-normalized via clean_string) used for node matching.
`BIOME_NAME`	`VARCHAR`	Canonical/default node name for the matched biome in the Trase PostgreSQL database.
`BIOME_TRASE_ID`	`VARCHAR`	Trase node identifier for the biome.
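
To make the relationship between `BIOME_LABEL_ORIGINAL`, `BIOME_LABEL_CLEANED`, and `MUNICIPALITY_TRASE_ID` concrete, here is a small sketch. `normalize_label` is a hypothetical stand-in for `clean_string` (uppercasing and stripping diacritics, as the column description suggests); the real helper may differ in details. The biome names are the six Brazilian biomes IBGE uses, and the geocode in the usage lines is a placeholder.

```python
import unicodedata


def normalize_label(label: str) -> str:
    """Hypothetical stand-in for clean_string: strip accents, uppercase."""
    decomposed = unicodedata.normalize("NFKD", label)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Collapse repeated whitespace and uppercase the result.
    return " ".join(without_marks.upper().split())


def trase_id_from_geocode(geocode: str) -> str:
    """Build a municipality Trase ID from a 7-digit IBGE geocode."""
    if not (geocode.isdigit() and len(geocode) == 7):
        raise ValueError(f"Expected a 7-digit geocode, got {geocode!r}")
    return "BR-" + geocode


IBGE_BIOMES = [
    "Amazônia",
    "Caatinga",
    "Cerrado",
    "Mata Atlântica",
    "Pampa",
    "Pantanal",
]
cleaned = {biome: normalize_label(biome) for biome in IBGE_BIOMES}
example_id = trase_id_from_geocode("9999991")  # placeholder geocode
```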

Depends On

Models / Seeds

source.trase_duckdb.trase-storage-raw.br_municipality_biome_2024

Sources

['trase-storage-raw', 'br_municipality_biome_2024']

Called script code

import pandas as pd
from psycopg2 import sql

from trase.tools import get_node_sub_type_id, get_country_id, CNX
from trase.tools.aws import get_pandas_df
from trase.tools.aws.metadata import write_parquet_for_upload
from trase.tools.pandasdb.find import find_nodes_by_trase_id, find_nodes_by_name
from trase.tools.utilities.helpers import clean_string


def process(df):

    # the bottom of the CSV contains a footer with metadata - let's skip this
    geocodes_in_footer = [
        "--------",
        "CAMPO",
        "Geocódigo",
        "Nome do município",
        "Sigla da UF",
        "Bioma predominante",
        "Referência",
    ]
    df = df[~df["Geocódigo"].isin(geocodes_in_footer)]

    # remove two lagoons
    lagoons = [
        "4300001",  # Lagoa Mirim
        "4300002",  # Lagoa dos Patos
    ]
    # explicit copy avoids pandas' SettingWithCopyWarning on the assignments below
    df = df[~df["Geocódigo"].isin(lagoons)].copy()

    # add trase id
    ibge_geocodes = df["Geocódigo"]
    assert all(
        ibge_geocodes.str.isdigit() & (ibge_geocodes.str.len() == 7)
    ), "Geocódigos must be 7 digits long"
    df["trase_id"] = "BR-" + ibge_geocodes

    # add municipality node id
    df[["municipality_node_id"]] = find_nodes_by_trase_id(
        df[["trase_id"]],
        returning=["node_id"],
    )
    assert not any(df["municipality_node_id"].isna())
    df["municipality_node_id"] = df["municipality_node_id"].astype(int)
    assert all(df["municipality_node_id"] > 0)

    # identify the biomes
    biome_sub_type_id = get_node_sub_type_id("BIOME")
    brazil_country_node_id = get_country_id("BRAZIL")
    df["biome_label"] = df["Bioma predominante"].apply(clean_string)
    df[["biome_name", "biome_trase_id"]] = find_nodes_by_name(
        df,
        returning=["default_name", "trase_id"],
        name=sql.Identifier("biome_label"),
        on_extra_columns="ignore",
        sub_type_id=sql.Literal(biome_sub_type_id),
        parent_id=sql.Literal(brazil_country_node_id),
    )
    assert not any(df["biome_name"].isna())

    # are there any municipalities without a biome?
    # yes - but only placeholder nodes: -AGGREGATED, XXXXX, IMPORTED-MUNICIPALITY
    municipality_sub_type_id = get_node_sub_type_id("MUNICIPALITY")
    df_all_municipalities = pd.read_sql(
        f"""
        select id as municipality_node_id, trase_id from main.nodes
        where sub_type_id = {municipality_sub_type_id}
        and trase_id like 'BR-%'
        """,
        CNX.cnx,
    )
    df_municipalities_without_biome = pd.merge(
        df_all_municipalities,
        df[["trase_id"]],
        on="trase_id",
        how="outer",
        indicator=True,
    )
    indicator = df_municipalities_without_biome.pop("_merge")
    assert not any(indicator == "right_only")
    df_municipalities_without_biome = df_municipalities_without_biome[
        indicator == "left_only"
    ]
    trase_ids_without_biome = df_municipalities_without_biome["trase_id"]
    assert all(
        trase_ids_without_biome.str.endswith("-AGGREGATED")
        | trase_ids_without_biome.str.endswith("XXXXX")
        | trase_ids_without_biome.str.endswith("IMPORTED-MUNICIPALITY")
    )

    # match conventional names for other indicators
    df.columns = [clean_string(column) for column in df.columns]
    df = df.rename(
        columns={
            "GEOCODIGO": "MUNICIPALITY_GEOCODE_IBGE",
            "NOME DO MUNICIPIO": "MUNICIPALITY_LABEL",
            "SIGLA DA UF": "MUNICIPALITY_STATE_UF",
            "BIOMA PREDOMINANTE": "BIOME_LABEL_ORIGINAL",
            "BIOME_LABEL": "BIOME_LABEL_CLEANED",
            "TRASE_ID": "MUNICIPALITY_TRASE_ID",
        },
        errors="raise",
    )

    return df


if __name__ == "__main__":
    df_original = get_pandas_df(
        "brazil/spatial/boundaries/ibge/br_municipality_biome/Bioma_Predominante_por_Municipio_2024.csv",
        sep=";",
        dtype=str,
        na_filter=False,
    )
    df = process(df_original)
    write_parquet_for_upload(
        df,
        "brazil/spatial/boundaries/ibge/br_municipality_biome/ibge_biome_municipality_2024_gold.parquet",
    )

Model code

from trase.data.brazil.spatial.boundaries.ibge.br_municipality_biome.ibge_biome_municipality_2024_gold import (
    process,
)


def model(dbt, cursor):
    dbt.config(materialized="external")

    df = dbt.source("trase-storage-raw", "br_municipality_biome_2024").df()
    return process(df)
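
The coverage check inside `process` relies on an outer merge with `indicator=True`. A minimal, database-free sketch of that pattern follows. Both frames are toy data with invented Trase IDs, and `db_nodes` stands in for the result of the query against `main.nodes`.

```python
import pandas as pd

# Toy stand-in for municipality nodes stored in the database (invented IDs).
db_nodes = pd.DataFrame(
    {
        "trase_id": [
            "BR-9999991",
            "BR-9999992",
            "BR-99-AGGREGATED",
            "BR-XXXXXXX",
        ]
    }
)
# Toy stand-in for the municipalities that received a biome.
with_biome = pd.DataFrame({"trase_id": ["BR-9999991", "BR-9999992"]})

merged = pd.merge(
    db_nodes, with_biome, on="trase_id", how="outer", indicator=True
)
indicator = merged.pop("_merge")

# right_only: present in the IBGE-derived data but unknown to the database.
unknown_to_db = merged.loc[indicator == "right_only", "trase_id"]
# left_only: present in the database but without a biome assignment.
missing_biome = merged.loc[indicator == "left_only", "trase_id"]

# Only placeholder-style nodes are allowed to lack a biome.
allowed_suffixes = ("-AGGREGATED", "XXXXX", "IMPORTED-MUNICIPALITY")
all_placeholders = missing_biome.apply(
    lambda trase_id: trase_id.endswith(allowed_suffixes)
).all()
```

The `_merge` column labels each row as `left_only`, `right_only`, or `both`, so a single merge answers both questions: whether any IBGE row is missing from the database, and which database nodes have no biome.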