Empresas Cadastro 2020
s3://trase-storage/brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_2020.csv
Dbt path: trase_production.main_brazil.empresas_cadastro_2020
Explore on Metabase: Full table; summary statistics
Containing yaml file link: trase/data_pipeline/models/brazil/auxiliary/secex/_schema.yml
Model file link: trase/data_pipeline/models/brazil/auxiliary/secex/cleaned/empresas_cadastro_2020.py
Calls script: trase/data/brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_201X.py
Dbt test runs & lineage: Test results · Lineage
Full dbt_docs page: Open in dbt docs (includes the lineage graph at the bottom right, tests, and downstream dependencies)
Tags: mock_model, auxiliary, brazil, cleaned, secex
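The table can also be inspected at the dbt path above. A minimal sketch, assuming a local read-only copy of the trase_production DuckDB database (the file name is an assumption; adjust it to wherever the catalog lives in your setup):
import duckdb

# Preview the cleaned register from a local copy of the production DuckDB
# database. The database file name below is hypothetical.
con = duckdb.connect("trase_production.duckdb", read_only=True)
preview = con.execute(
    "select * from main_brazil.empresas_cadastro_2020 limit 10"
).df()
print(preview)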
empresas_cadastro_2020
Description
SECEX Empresas Cadastro
The Cadastro de Empresas Exportadoras e Importadoras (Registration of Exporting and Importing Companies) was a dataset produced by the Subsecretariat of Intelligence and Foreign Trade Statistics (SECEX), part of Brazil's Ministry of Development, Industry, Commerce, and Services (MDIC). It listed the establishments that engaged in export, import, or both during a given reference year, and included information on where each CNPJ is registered (postal code (CEP), municipality, etc.).
The dataset was discontinued due to privacy concerns.
The reasoning is detailed in this document (also available on S3 under s3://trase-storage/brazil/auxiliary/secex/Nota-sobre-lista-de-exportadores-e-importadores.pdf).
The notice describes a confidentiality problem: by combining the two published datasets, (a) the list of exporting/importing companies and (b) export/import statistics by municipality, it was possible to infer tax-sensitive information about individual companies. SECEX therefore decided to withdraw one of the two, and chose to remove the company list, on the grounds that the municipality statistics are more widely used and more useful for public policy.
The last year published was 2021, as can be seen in the final web-archive snapshot of the download page. We downloaded and processed all years up to and including 2021 and stored them on S3.
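To check which years are actually present, the cleaned objects can be listed from S3. A minimal sketch, assuming boto3 is installed and your AWS credentials have read access to the trase-storage bucket:
import boto3

# List the cleaned EMPRESAS_CADASTRO files stored under the cleaned/ prefix.
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="trase-storage",
    Prefix="brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_",
)
for obj in response.get("Contents", []):
    print(obj["Key"])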
How we use the dataset in Trase
This dataset was used in the "MDIC Disaggregation" model, which is an important input to the Brazil beef SEI-PCS model. It was also used as part of the Brazil soy SEI-PCS model.
How often the dataset is updated, and when the next update is likely to be.
The dataset is no longer updated.
How to re-fetch the dataset from the original source.
The dataset used to be available at https://www.gov.br/produtividade-e-comercio-exterior/pt-br/assuntos/comercio-exterior/estatisticas/outras-estatistica-de-comercio-exterior, but that link is no longer active. The data is, however, stored in our AWS S3 account.
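For analysis, the cleaned file listed at the top of this page can be read straight from S3 with pandas. A minimal sketch, assuming the s3fs package is installed and your AWS credentials grant read access to trase-storage:
import pandas as pd

# Read the cleaned 2020 register directly from S3, keeping identifiers such as
# CNPJ and CEP as strings.
df = pd.read_csv(
    "s3://trase-storage/brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_2020.csv",
    dtype=str,
)
print(df.head())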
The script that is used to process/clean the dataset
trase/data/brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_201X.py.
When the dataset was last updated, and by whom
The 2021 dataset was last updated by Nanxu Su on 2023-07-26. Previous years were updated by Harry Biddle.
A history of changes/notes of the dataset
- 2023: the above PDF notice was posted and the website was taken down
Acceptance criteria for sufficient level of quality of the dataset
There are no acceptance criteria defined.
Details
| Column | Type | Description |
|---|---|---|
Models / Seeds
source.trase_duckdb.trase-storage-raw.uf
source.trase_duckdb.trase-storage-raw.originals_empresas_cadastro_2020
Sources
['trase-storage-raw', 'uf']
['trase-storage-raw', 'originals_empresas_cadastro_2020']
import pandas as pd
import stdnum.br.cnpj
from trase.tools.aws.aws_helpers_cached import get_pandas_df_once
from trase.tools.aws.metadata import write_csv_for_upload
from trase.tools.aws.tracker import S3_OBJECTS_ACCESSED_IN_CURRENT_SESSION
from trase.tools.utilities.helpers import clean_string
from trase.tools.pcs.connect import uses_database
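# Reference years processed by this script; the original workbooks are stored on S3.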
YEARS = [2015, 2017, 2018, 2019, 2020, 2021]
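# Clean each year's workbook and write the result as a CSV ready for upload to S3.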
def main():
for year in YEARS:
S3_OBJECTS_ACCESSED_IN_CURRENT_SESSION.clear()
df = process(year)
write_csv_for_upload(
df, f"brazil/auxiliary/secex/cleaned/EMPRESAS_CADASTRO_{year}.csv"
)
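# Load one year's original workbook from S3, normalise it, and return the cleaned frame.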
def process(year):
df = get_pandas_df_once(
f"brazil/auxiliary/secex/originals/EMPRESAS_CADASTRO_{year}.xlsx",
sheet_name="EXP_CNPJ14",
skiprows=7,
encoding="latin-1",
dtype=str,
sep=";",
keep_default_na=False,
xlsx=True,
)
df = df.rename(
columns={
"CNPJ": "cnpj",
"EMPRESA": "label",
"ENDEREÇO": "street",
"NÚMERO": "house_number",
"BAIRRO": "neighbourhood",
"CEP": "cep",
"MUNICÍPIO": "municipality.label",
"UF": "state.uf",
"CNAE PRIMÁRIA": "activity.description",
"NATUREZA JURÍDICA": "legal_nature.description",
},
errors="raise",
)
df = df.applymap(clean_string)
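    # Drop the sentinel CNPJ row and mark missing municipality/state values as
    # explicitly "not declared".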
df = df[df["cnpj"] != "99999997999999"].copy()
df.loc[df["municipality.label"] == "", "municipality.label"] = "NAO DECLARADO"
df.loc[df["state.uf"] == "", "state.uf"] = "ND"
df = clean_activity_cnae_codes(df)
df = clean_legal_nature_codes(df)
df = replace_state_uf_code_with_trase_id(df)
df = clean_municipalities(df)
assert_cnpjs_valid(df)
return df
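# Every CNPJ must be a 14-digit string that passes the official check-digit validation.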
def assert_cnpjs_valid(df):
assert all(df["cnpj"].str.len() == 14)
assert all(df["cnpj"].apply(stdnum.br.cnpj.is_valid))
def clean_activity_cnae_codes(df):
df = df.copy()
df["activity.cnae"] = df["activity.description"].str.slice(0, 4)
df["activity.description"] = df["activity.description"].str.slice(7)
assert all(df["activity.cnae"].str.len() == 4)
assert all(df["activity.cnae"].str.isdigit())
return df
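# Split the combined legal-nature field into a 3-digit code (first three characters)
# and the description (from position 6 onward).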
def clean_legal_nature_codes(df):
df = df.copy()
df["legal_nature.code"] = df["legal_nature.description"].str.slice(0, 3)
df["legal_nature.description"] = df["legal_nature.description"].str.slice(6)
assert all(df["legal_nature.code"].str.len() == 3)
assert all(df["legal_nature.code"].str.isdigit())
return df
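# Attach Trase municipality identifiers by joining on municipality label and state;
# the assertion guarantees that every row found a match.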
def clean_municipalities(df):
df = pd.merge(
df,
get_municipalities(),
on=["municipality.label", "state.trase_id"],
how="left",
validate="many_to_one",
)
assert not any(df.isna().any())
return df
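# Municipality synonyms and trase_ids for Brazil, read from the views.regions table.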
@uses_database
def get_municipalities(cnx=None):
return pd.read_sql(
"""
select distinct
name as "municipality.name",
unnest(synonyms) as "municipality.label",
trase_id as "municipality.trase_id",
substring(trase_id, 0, 6) as "state.trase_id"
from views.regions
where country = 'BRAZIL' and region_type = 'MUNICIPALITY'
""",
cnx.cnx,
)
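# Attach the Trase state identifier corresponding to each two-letter UF code.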
def replace_state_uf_code_with_trase_id(df):
df = pd.merge(
df, get_state_uf_codes(), on="state.uf", how="left", validate="many_to_one"
)
assert not any(df.isna().any())
return df
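# UF-to-IBGE-code lookup read from S3, with an extra "ND"/"XX" row for undeclared states.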
def get_state_uf_codes():
df = get_pandas_df_once(
"brazil/metadata/UF.csv",
usecols=["CO_UF_IBGE", "UF"],
sep=",",
dtype=str,
keep_default_na=False,
)
df = df.rename(
columns={"CO_UF_IBGE": "state.code", "UF": "state.uf"}, errors="raise"
)
    # DataFrame.append was removed in pandas 2.0; use concat to add the
    # fallback row for undeclared states.
    df = pd.concat(
        [df, pd.DataFrame([{"state.uf": "ND", "state.code": "XX"}])],
        ignore_index=True,
    )
df["state.trase_id"] = "BR-" + df["state.code"]
return df
if __name__ == "__main__":
main()
import pandas as pd
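# Mock dbt Python model (see the mock_model tag above): it declares the upstream
# sources so that lineage is tracked, but it is not meant to be executed.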
def model(dbt, cursor):
dbt.source("trase-storage-raw", "uf")
dbt.source("trase-storage-raw", "originals_empresas_cadastro_2020")
raise NotImplementedError()
return pd.DataFrame({"hello": ["world"]})