Empresas Cadastro 2020
s3://trase-storage/brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_2020.csv
Dbt path: trase_production.main_brazil.empresas_cadastro_2020
Explore on Metabase: Full table; summary statistics
Containing yaml file link: trase/data_pipeline/models/brazil/auxiliary/secex/_schema.yml
Model file link: trase/data_pipeline/models/brazil/auxiliary/secex/cleaned/empresas_cadastro_2020.py
Calls script: trase/data/brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_201X.py
Dbt test runs & lineage: Test results · Lineage
Full dbt_docs page: Open in dbt docs (includes the lineage graph at the bottom right, tests, and downstream dependencies)
Tags: mock_model, auxiliary, brazil, cleaned, secex
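The table can also be inspected at the dbt path above. A minimal sketch, assuming a local read-only copy of the trase_production DuckDB database (the file name is an assumption; adjust it to wherever the catalog lives in your setup):
import duckdb

# Preview the cleaned register from a local copy of the production DuckDB
# database. The database file name below is hypothetical.
con = duckdb.connect("trase_production.duckdb", read_only=True)
preview = con.execute(
    "select * from main_brazil.empresas_cadastro_2020 limit 10"
).df()
print(preview)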
empresas_cadastro_2020
Description
SECEX Empresas Cadastro
The Cadastro de Empresas Exportadoras e Importadoras (Registration of Exporting and Importing Companies) was a dataset produced by the Subsecretariat of Intelligence and Foreign Trade Statistics (SECEX), part of Brazil's Ministry of Development, Industry, Commerce, and Services (MDIC). It listed the establishments that engaged in export, import, or both during a given reference year, and included information on where each CNPJ is registered (postal code (CEP), municipality, etc.).
The dataset was discontinued due to privacy concerns.
The reasoning is detailed in this document (also available on S3 under s3://trase-storage/brazil/auxiliary/secex/Nota-sobre-lista-de-exportadores-e-importadores.pdf).
The notice describes a confidentiality problem: by combining the two published datasets, (a) the list of exporting/importing companies and (b) export/import statistics by municipality, it was possible to infer tax-sensitive information about individual companies. SECEX therefore decided to withdraw one of the two, and chose to remove the company list, on the grounds that the municipality statistics are more widely used and more useful for public policy.
The last year published was 2021, as can be seen in the final web-archive snapshot of the download page. We downloaded and processed all years up to and including 2021 and stored them on S3.
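To check which years are actually present, the cleaned objects can be listed from S3. A minimal sketch, assuming boto3 is installed and your AWS credentials have read access to the trase-storage bucket:
import boto3

# List the cleaned EMPRESAS_CADASTRO files stored under the cleaned/ prefix.
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="trase-storage",
    Prefix="brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_",
)
for obj in response.get("Contents", []):
    print(obj["Key"])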
How we use the dataset in Trase
This dataset was used in the "MDIC Disaggregation" model, which is an important input to the Brazil beef SEI-PCS model. It was also used as part of the Brazil soy SEI-PCS model.
How often the dataset is updated, and when the next update is likely to be.
The dataset is no longer updated.
How to re-fetch the dataset from the original source.
The dataset used to be available at https://www.gov.br/produtividade-e-comercio-exterior/pt-br/assuntos/comercio-exterior/estatisticas/outras-estatistica-de-comercio-exterior, but that link is no longer active. The data is, however, stored in our AWS S3 account.
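For analysis, the cleaned file listed at the top of this page can be read straight from S3 with pandas. A minimal sketch, assuming the s3fs package is installed and your AWS credentials grant read access to trase-storage:
import pandas as pd

# Read the cleaned 2020 register directly from S3, keeping identifiers such as
# CNPJ and CEP as strings.
df = pd.read_csv(
    "s3://trase-storage/brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_2020.csv",
    dtype=str,
)
print(df.head())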
The script that is used to process/clean the dataset
trase/data/brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_201X.py.
When the dataset was last updated, and by whom
The 2021 dataset was last updated by Nanxu Su on 2023-07-26. Previous years were updated by Harry Biddle.
A history of changes/notes of the dataset
- 2023: the above PDF notice was posted and the website was taken down
Acceptance criteria for sufficient level of quality of the dataset
There are no acceptance criteria defined.
Details
| Column | Type | Description |
|---|---|---|
Models / Seeds
source.trase_duckdb.trase-storage-raw.uf
source.trase_duckdb.trase-storage-raw.originals_empresas_cadastro_2020
Sources
['trase-storage-raw', 'uf']
['trase-storage-raw', 'originals_empresas_cadastro_2020']
import pandas as pd
import stdnum.br.cnpj
from trase.tools.aws.aws_helpers_cached import get_pandas_df_once
from trase.tools.aws.metadata import write_csv_for_upload
from trase.tools.aws.tracker import S3_OBJECTS_ACCESSED_IN_CURRENT_SESSION
from trase.tools.utilities.helpers import clean_string
from trase.tools.pcs.connect import uses_database
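# Reference years processed by this script; the original workbooks are stored on S3.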
YEARS = [2015, 2017, 2018, 2019, 2020, 2021]
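# Clean each year's workbook and write the result as a CSV ready for upload to S3.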
def main():
for year in YEARS:
S3_OBJECTS_ACCESSED_IN_CURRENT_SESSION.clear()
df = process(year)
write_csv_for_upload(
df, f"brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_{year}.csv"
)
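# Load one year's original workbook from S3, normalise it, and return the cleaned frame.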
def process(year):
df = get_pandas_df_once(
f"brazil/auxiliary/secex/originals/EMPRESAS_CADASTRO_{year}.xlsx",
sheet_name="EXP_CNPJ14",
skiprows=7,
encoding="latin-1",
dtype=str,
sep=";",
keep_default_na=False,
xlsx=True,
)
df = df.rename(
columns={
"CNPJ": "cnpj",
"EMPRESA": "label",
"ENDEREÇO": "street",
"NÚMERO": "house_number",
"BAIRRO": "neighbourhood",
"CEP": "cep",
"MUNICÍPIO": "municipality.label",
"UF": "state.uf",
"CNAE PRIMÁRIA": "activity.description",
"NATUREZA JURÍDICA": "legal_nature.description",
},
errors="raise",
)
df = df.applymap(clean_string)
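    # Drop the sentinel CNPJ row and mark missing municipality/state values as
    # explicitly "not declared".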
df = df[df["cnpj"] != "99999997999999"].copy()
df.loc[df["municipality.label"] == "", "municipality.label"] = "NAO DECLARADO"
df.loc[df["state.uf"] == "", "state.uf"] = "ND"
df = clean_activity_cnae_codes(df)
df = clean_legal_nature_codes(df)
df = replace_state_uf_code_with_trase_id(df)
df = clean_municipalities(df)
assert_cnpjs_valid(df)
return df
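# Every CNPJ must be a 14-digit string that passes the official check-digit validation.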
def assert_cnpjs_valid(df):
assert all(df["cnpj"].str.len() == 14)
assert all(df["cnpj"].apply(stdnum.br.cnpj.is_valid))
def clean_activity_cnae_codes(df):
df = df.copy()
df["activity.cnae"] = df["activity.description"].str.slice(0, 4)
df["activity.description"] = df["activity.description"].str.slice(7)
assert all(df["activity.cnae"].str.len() == 4)
assert all(df["activity.cnae"].str.isdigit())
return df
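# Split the combined legal-nature field into a 3-digit code (first three characters)
# and the description (from position 6 onward).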
def clean_legal_nature_codes(df):
df = df.copy()
df["legal_nature.code"] = df["legal_nature.description"].str.slice(0, 3)
df["legal_nature.description"] = df["legal_nature.description"].str.slice(6)
assert all(df["legal_nature.code"].str.len() == 3)
assert all(df["legal_nature.code"].str.isdigit())
return df
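# Attach Trase municipality identifiers by joining on municipality label and state;
# the assertion guarantees that every row found a match.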
def clean_municipalities(df):
df = pd.merge(
df,
get_municipalities(),
on=["municipality.label", "state.trase_id"],
how="left",
validate="many_to_one",
)
assert not any(df.isna().any())
return df
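# Municipality synonyms and trase_ids for Brazil, read from the views.regions table.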
@uses_database
def get_municipalities(cnx=None):
return pd.read_sql(
"""
select distinct
name as "municipality.name",
unnest(synonyms) as "municipality.label",
trase_id as "municipality.trase_id",
substring(trase_id, 0, 6) as "state.trase_id"
from views.regions
where country = 'BRAZIL' and region_type = 'MUNICIPALITY'
""",
cnx.cnx,
)
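# Attach the Trase state identifier corresponding to each two-letter UF code.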
def replace_state_uf_code_with_trase_id(df):
df = pd.merge(
df, get_state_uf_codes(), on="state.uf", how="left", validate="many_to_one"
)
assert not any(df.isna().any())
return df
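# UF-to-IBGE-code lookup read from S3, with an extra "ND"/"XX" row for undeclared states.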
def get_state_uf_codes():
df = get_pandas_df_once(
"brazil/metadata/UF.csv",
usecols=["CO_UF_IBGE", "UF"],
sep=",",
dtype=str,
keep_default_na=False,
)
df = df.rename(
columns={"CO_UF_IBGE": "state.code", "UF": "state.uf"}, errors="raise"
)
    # DataFrame.append was removed in pandas 2.0; use concat to add the
    # fallback row for undeclared states.
    df = pd.concat(
        [df, pd.DataFrame([{"state.uf": "ND", "state.code": "XX"}])],
        ignore_index=True,
    )
df["state.trase_id"] = "BR-" + df["state.code"]
return df
if __name__ == "__main__":
main()
import pandas as pd
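# Mock dbt Python model (see the mock_model tag above): it declares the upstream
# sources so that lineage is tracked, but it is not meant to be executed.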
def model(dbt, cursor):
dbt.source("trase-storage-raw", "uf")
dbt.source("trase-storage-raw", "originals_empresas_cadastro_2020")
raise NotImplementedError()
return pd.DataFrame({"hello": ["world"]})