Skip to content

DBT: Cnpj 2023 08 12 Gold

File location: s3://trase-storage/brazil/logistics/cnpj/gold/cnpj_2023_08_12/cnpj_2023_08_12_gold.parquet

DBT model name: cnpj_2023_08_12_gold

Explore on Metabase: Full table; summary statistics

DBT details


Description

Final CNPJ table based on data from 'Receita Federal do Brasil'. Versions of different years should be located at brazil/logistics/cnpj/gold/cnpj_YYYY_MM_DD/cnpj_YYYY_MM_DD_gold.parquet.

Brazilian National Registry of Legal Entities (CNPJ) from the Federal Tax Agency (RFB)

This is a dataset of CNPJs (Cadastro Nacional de Pessoas Jurídicas, or National Registry of Legal Entities) downloaded from the publicly-available dataset hosted at Brazil's RFB (Receita Federal do Brasil, or Federal Tax Agency). CNPJs are unique identifiers for legal entities in Brazil. This dataset is important for Trase's efforts to map and model supply chains; in particular for all of our SEI-PCS models in Brazil.

It is a large dataset: in the 2023 download the final, cleaned CNPJ table was 57 million rows and 24GB uncompressed.

It typically includes the following information:

File Description
cnpj-metadados.pdf Data Dictionary
Cnaes Economic activity codes
Empresas* Companies
Estabelecimentos Establishments
Motivos Reasons
Municipios Municipalities
Naturezas Natures
Paises Countries
Qualificaoes Qualifications
Simples
Socios* Partners
Regime Tributario Tax regime data: Immune and Exempt, Arbitrated Profit, Presumed Profit and Actual Profit

The data can be accessed as a Parquet file on AWS S3, but it is also available on AWS Athena, which makes it easier to query specific subsets of the data efficiently without needing to download or process the full dataset locally. This is especially useful for large-scale analyses, filtering by CNPJ, or joining with other data sources in-place using SQL. Here is an example of how to query the Athena table using Python and Pandas:

import pandas as pd
from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor

# Define a list of CNPJs to enrich with metadata
example_cnpjs = pd.DataFrame({
    "cnpj": [11641575000100, 65004913000121, 12345678910]  # Example CNPJ values
})

# Athena connection settings
athena_table = "s3_big_data.cnpj_2023_08_12"  # CNPJ table in Athena
s3_staging_dir = "s3://trase-temp/athena/"    # Temporary staging path for query results

# Prepare list of CNPJs for SQL query
cnpj_values = ", ".join(str(cnpj) for cnpj in example_cnpjs["cnpj"])

# Build SQL query to retrieve metadata from Athena
query = f"""
SELECT
    CAST(cnpj AS BIGINT) AS cnpj,
    company_name,
    municipality
FROM {athena_table}
WHERE TRY_CAST(cnpj AS BIGINT) IN ({cnpj_values})
"""

# Execute the query and load results into DataFrame
cursor = connect(s3_staging_dir=s3_staging_dir).cursor(ArrowCursor)
df_athena = cursor.execute(query).as_arrow().to_pandas()

# Join the local CNPJ list with the Athena results
df_merged = pd.merge(
    example_cnpjs,
    df_athena,
    on="cnpj",
    how="left",               # Keep all local CNPJs; fill missing metadata with NaN
    validate="one_to_one"     # Ensure no duplicates in join
)

# Display the result
print(df_merged)

# Example output:
#           cnpj                          company_name  municipality
# 0  11641575000100  LJR SERVICOS DE APOIO ADMINISTRATIVO LTDA     4205407
# 1  65004913000121     COMERCIO E CONFECCOES DEME'S LTDA         3550308
# 2     12345678910                                  NaN           NaN

How often is the data updated?

The data is updated on a monthly basis.

How to re-fetch the dataset from the original source.

First, visit https://dados.gov.br/dados/conjuntos-dados/cadastro-nacional-da-pessoa-juridica---cnpj. Then, download every file and PDF on the page to the following location, where YYYY-MM-DD is the date under Data da última extração (Date of last extraction) on the website:

s3://trase-storage/brazil/logistics/cnpj/receita_federal_do_brasil/originals/YYYY-MM-DD/compressed/

Now, open the script trase/data/brazil/logistics/cnpj/receita_federal_do_brasil/receita_federal.py.

Update the S3_BASE_DIR variable to the latest location of the data.

Create a new schema in PostgreSQL, e.g. cnpj_YYYY_MM_DD and update the SCHEMA variable in the script.

Finally, run the script to write the tables to PostgreSQL.

History of updates

2nd February 2023 Creation Date

The files have the following row counts:

File Row Count
Lucro Real.csv 1,379,742
Lucro Presumido 2021.csv 1,047,742
Lucro Presumido 2020.csv 1,040,711
Lucro Presumido 2019.csv 990,632
Lucro Presumido 2018.csv 945,775
Lucro Presumido 2017.csv 927,272
Lucro Presumido 2016.csv 893,435
Lucro Arbitrado.csv 7,133

24th November 2022 Creation Date

The files have the following row counts:

File Row Count
Estabelecimentos*.csv 57,156,641
Empresas*.csv 54,287,877
simples.csv 35,918,970
Socio*.csv 22,978,660
Imunes e isentas.csv 1,474,161
municipios.csv 5,571
cnae.csv 1,359
paises.csv 255
naturezas.csv 90
Leiaute dos Arquivos 77
qualificaoes.csv 68
motivos.csv 61

Acceptance criteria for sufficient level of quality of the dataset

We currently have no acceptance criteria defined for this dataset.


Details

Column Type Description

Models / Seeds

  • model.trase_duckdb.cnpj_site_2023_08_12_silver
  • model.trase_duckdb.cnpj_company_2023_08_12_silver
  • model.trase_duckdb.cnpj_nature_2023_08_12_silver
  • model.trase_duckdb.cnpj_reason_2023_08_12_silver
  • model.trase_duckdb.cnpj_country_2023_08_12_silver
  • model.trase_duckdb.cnpj_cnae_2023_08_12_silver
  • model.trase_duckdb.cnpj_city_2023_08_12_silver
  • model.trase_duckdb.cnpj_city_ibge_2023_08_12_silver

No called script or script source not found.

-- cnpj_2023_08_12_gold
-- Code adapted from `trase/data/brazil/logistics/cnpj/receita_federal_do_brasil/receita_federal.py`
{{ config(materialized='external') }}

SELECT
    site.cnpj,
    company.company_name,
    company.juridic_nature,
    company.nature_name,
    company.company_social_capital,
    company.company_size,
    company.company_size_name,
    company.national_responsible_entity,
    site.id_head_branch,
    site.head_branch_name,
    site.trade_name,
    site.id_registration_status,
    site.registration_status_name,
    site.registration_status_date,
    site.id_registration_status_reason,
    reason.reason_name,
    site.foreign_city_name,
    site.id_country,
    country.country_name,
    site.start_activity_date,
    site.cnae,
    cnae.cnae_name,
    site.cnae_secondary,
    site.cnae_secondary_array,
    site.address_type,
    site.address_street,
    site.address_number,
    site.address_complement,
    site.address_neighbourhood,
    site.postal_code,
    site.id_city,
    city.city_name,
    city_ibge.municipality::TEXT AS municipality,
    city_ibge.id_state,
    city_ibge.state,
    site.phone_1_area_code,
    site.phone_1,
    site.phone_2_area_code,
    site.phone_2,
    site.fax_area_code,
    site.fax, email,
    site.special_status,
    site.special_status_date
FROM {{ ref('cnpj_site_2023_08_12_silver') }} AS site
LEFT JOIN (
    SELECT
        a.cnpj_basic,
        a.company_name,
        a.juridic_nature,
        b.nature_name,
        a.company_social_capital,
        a.company_size,
        a.company_size_name,
        a.national_responsible_entity
    FROM {{ ref('cnpj_company_2023_08_12_silver') }} AS a
    JOIN {{ ref('cnpj_nature_2023_08_12_silver') }} AS b
        ON a.juridic_nature = b.id_nature
) AS company
    ON site.cnpj_basic = company.cnpj_basic
LEFT JOIN {{ ref('cnpj_reason_2023_08_12_silver') }} AS reason
    ON site.id_registration_status_reason = reason.id_reason
LEFT JOIN {{ ref('cnpj_country_2023_08_12_silver') }} AS country
    ON site.id_country = country.id_country
LEFT JOIN {{ ref('cnpj_cnae_2023_08_12_silver') }} AS cnae
    ON site.cnae = cnae.id_cnae
LEFT JOIN {{ ref('cnpj_city_2023_08_12_silver') }} AS city
    ON site.id_city = city.id_city
LEFT JOIN {{ ref('cnpj_city_ibge_2023_08_12_silver') }} AS city_ibge
    ON city.city_name = city_ibge.city_name
    AND site.state = city_ibge.state
ORDER BY site.cnpj ASC