DBT: Cnpj 2023 08 12 Gold

File location: s3://trase-storage/brazil/logistics/cnpj/gold/cnpj_2023_08_12/cnpj_2023_08_12_gold.parquet

DBT model name: cnpj_2023_08_12_gold

Explore on Metabase: Full table; summary statistics

DBT details

Lineage
Test results
Dbt path: trase_production.main_brazil.cnpj_2023_08_12_gold
Containing yaml link: trase/data_pipeline/models/brazil/logistics/cnpj/_schema_cnpj_receita_federal_do_brasil.yml
Model file: trase/data_pipeline/models/brazil/logistics/cnpj/gold/cnpj_2023_08_12_gold.sql
Original dbt_docs
Tags: brazil, cnpj, rfb, gold, 2023

Description

Final CNPJ table based on data from 'Receita Federal do Brasil'. Versions of different years should be located at brazil/logistics/cnpj/gold/cnpj_YYYY_MM_DD/cnpj_YYYY_MM_DD_gold.parquet.

Brazilian National Registry of Legal Entities (CNPJ) from the Federal Tax Agency (RFB)

This is a dataset of CNPJs (Cadastro Nacional de Pessoas Jurídicas, or National Registry of Legal Entities) downloaded from the publicly-available dataset hosted at Brazil's RFB (Receita Federal do Brasil, or Federal Tax Agency). CNPJs are unique identifiers for legal entities in Brazil. This dataset is important for Trase's efforts to map and model supply chains; in particular for all of our SEI-PCS models in Brazil.

It is a large dataset: in the 2023 download the final, cleaned CNPJ table was 57 million rows and 24GB uncompressed.

It typically includes the following information:

File	Description
cnpj-metadados.pdf	Data Dictionary
Cnaes	Economic activity codes
Empresas*	Companies
Estabelecimentos	Establishments
Motivos	Reasons
Municipios	Municipalities
Naturezas	Natures
Paises	Countries
Qualificaoes	Qualifications
Simples
Socios*	Partners
Regime Tributario	Tax regime data: Immune and Exempt, Arbitrated Profit, Presumed Profit and Actual Profit

The data can be accessed as a Parquet file on AWS S3, but it is also available on AWS Athena, which makes it easier to query specific subsets of the data efficiently without needing to download or process the full dataset locally. This is especially useful for large-scale analyses, filtering by CNPJ, or joining with other data sources in-place using SQL. Here is an example of how to query the Athena table using Python and Pandas:

import pandas as pd
from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor

# Define a list of CNPJs to enrich with metadata
example_cnpjs = pd.DataFrame({
    "cnpj": [11641575000100, 65004913000121, 12345678910]  # Example CNPJ values
})

# Athena connection settings
athena_table = "s3_big_data.cnpj_2023_08_12"  # CNPJ table in Athena
s3_staging_dir = "s3://trase-temp/athena/"    # Temporary staging path for query results

# Prepare list of CNPJs for SQL query
cnpj_values = ", ".join(str(cnpj) for cnpj in example_cnpjs["cnpj"])

# Build SQL query to retrieve metadata from Athena
query = f"""
SELECT
    CAST(cnpj AS BIGINT) AS cnpj,
    company_name,
    municipality
FROM {athena_table}
WHERE TRY_CAST(cnpj AS BIGINT) IN ({cnpj_values})
"""

# Execute the query and load results into DataFrame
cursor = connect(s3_staging_dir=s3_staging_dir).cursor(ArrowCursor)
df_athena = cursor.execute(query).as_arrow().to_pandas()

# Join the local CNPJ list with the Athena results
df_merged = pd.merge(
    example_cnpjs,
    df_athena,
    on="cnpj",
    how="left",               # Keep all local CNPJs; fill missing metadata with NaN
    validate="one_to_one"     # Ensure no duplicates in join
)

# Display the result
print(df_merged)

# Example output:
#           cnpj                          company_name  municipality
# 0  11641575000100  LJR SERVICOS DE APOIO ADMINISTRATIVO LTDA     4205407
# 1  65004913000121     COMERCIO E CONFECCOES DEME'S LTDA         3550308
# 2     12345678910                                  NaN           NaN

How often is the data updated?

The data is updated on a monthly basis.

How to re-fetch the dataset from the original source.

First, visit https://dados.gov.br/dados/conjuntos-dados/cadastro-nacional-da-pessoa-juridica---cnpj. Then, download every file and PDF on the page to the following location, where YYYY-MM-DD is the date under Data da última extração (Date of last extraction) on the website:

s3://trase-storage/brazil/logistics/cnpj/receita_federal_do_brasil/originals/YYYY-MM-DD/compressed/

Now, open the script trase/data/brazil/logistics/cnpj/receita_federal_do_brasil/receita_federal.py.

Update the S3_BASE_DIR variable to the latest location of the data.

Create a new schema in PostgreSQL, e.g. cnpj_YYYY_MM_DD and update the SCHEMA variable in the script.

Finally, run the script to write the tables to PostgreSQL.

History of updates

2nd February 2023 Creation Date

The files have the following row counts:

File	Row Count
Lucro Real.csv	1,379,742
Lucro Presumido 2021.csv	1,047,742
Lucro Presumido 2020.csv	1,040,711
Lucro Presumido 2019.csv	990,632
Lucro Presumido 2018.csv	945,775
Lucro Presumido 2017.csv	927,272
Lucro Presumido 2016.csv	893,435
Lucro Arbitrado.csv	7,133

24th November 2022 Creation Date

The files have the following row counts:

File	Row Count
Estabelecimentos*.csv	57,156,641
Empresas*.csv	54,287,877
simples.csv	35,918,970
Socio*.csv	22,978,660
Imunes e isentas.csv	1,474,161
municipios.csv	5,571
cnae.csv	1,359
paises.csv	255
naturezas.csv	90
Leiaute dos Arquivos	77
qualificaoes.csv	68
motivos.csv	61

Acceptance criteria for sufficient level of quality of the dataset

We currently have no acceptance criteria defined for this dataset.

Details

ColumnsDepends OnModel code

Column	Type	Description

Models / Seeds

model.trase_duckdb.cnpj_site_2023_08_12_silver
model.trase_duckdb.cnpj_company_2023_08_12_silver
model.trase_duckdb.cnpj_nature_2023_08_12_silver
model.trase_duckdb.cnpj_reason_2023_08_12_silver
model.trase_duckdb.cnpj_country_2023_08_12_silver
model.trase_duckdb.cnpj_cnae_2023_08_12_silver
model.trase_duckdb.cnpj_city_2023_08_12_silver
model.trase_duckdb.cnpj_city_ibge_2023_08_12_silver

No called script or script source not found.

-- cnpj_2023_08_12_gold
-- Code adapted from `trase/data/brazil/logistics/cnpj/receita_federal_do_brasil/receita_federal.py`
{{ config(materialized='external') }}

SELECT
    site.cnpj,
    company.company_name,
    company.juridic_nature,
    company.nature_name,
    company.company_social_capital,
    company.company_size,
    company.company_size_name,
    company.national_responsible_entity,
    site.id_head_branch,
    site.head_branch_name,
    site.trade_name,
    site.id_registration_status,
    site.registration_status_name,
    site.registration_status_date,
    site.id_registration_status_reason,
    reason.reason_name,
    site.foreign_city_name,
    site.id_country,
    country.country_name,
    site.start_activity_date,
    site.cnae,
    cnae.cnae_name,
    site.cnae_secondary,
    site.cnae_secondary_array,
    site.address_type,
    site.address_street,
    site.address_number,
    site.address_complement,
    site.address_neighbourhood,
    site.postal_code,
    site.id_city,
    city.city_name,
    city_ibge.municipality::TEXT AS municipality,
    city_ibge.id_state,
    city_ibge.state,
    site.phone_1_area_code,
    site.phone_1,
    site.phone_2_area_code,
    site.phone_2,
    site.fax_area_code,
    site.fax, email,
    site.special_status,
    site.special_status_date
FROM {{ ref('cnpj_site_2023_08_12_silver') }} AS site
LEFT JOIN (
    SELECT
        a.cnpj_basic,
        a.company_name,
        a.juridic_nature,
        b.nature_name,
        a.company_social_capital,
        a.company_size,
        a.company_size_name,
        a.national_responsible_entity
    FROM {{ ref('cnpj_company_2023_08_12_silver') }} AS a
    JOIN {{ ref('cnpj_nature_2023_08_12_silver') }} AS b
        ON a.juridic_nature = b.id_nature
) AS company
    ON site.cnpj_basic = company.cnpj_basic
LEFT JOIN {{ ref('cnpj_reason_2023_08_12_silver') }} AS reason
    ON site.id_registration_status_reason = reason.id_reason
LEFT JOIN {{ ref('cnpj_country_2023_08_12_silver') }} AS country
    ON site.id_country = country.id_country
LEFT JOIN {{ ref('cnpj_cnae_2023_08_12_silver') }} AS cnae
    ON site.cnae = cnae.id_cnae
LEFT JOIN {{ ref('cnpj_city_2023_08_12_silver') }} AS city
    ON site.id_city = city.id_city
LEFT JOIN {{ ref('cnpj_city_ibge_2023_08_12_silver') }} AS city_ibge
    ON city.city_name = city_ibge.city_name
    AND site.state = city_ibge.state
ORDER BY site.cnpj ASC