DBT: Cnpj 2019 02 Gold
File location: s3://trase-storage/brazil/logistics/cnpj/gold/cnpj_2019_02/cnpj_2019_02_gold.parquet
DBT model name: cnpj_2019_02_gold
Explore on Metabase: Full table; summary statistics
DBT details
- Lineage
-
Dbt path:
trase_production.main_brazil.cnpj_2019_02_gold -
Containing yaml link: trase/data_pipeline/models/brazil/logistics/cnpj/_schema_cnpj_receita_federal_do_brasil.yml
-
Model file: trase/data_pipeline/models/brazil/logistics/cnpj/gold/cnpj_2019_02_gold.sql
-
Tags:
brazil,cnpj,rfb,gold,2019
Description
Final CNPJ table based on a download made by Javier and uploaded to s3 on 2 seperate files: one with the cnpj data and another with the secondary cnaes data
Brazilian National Registry of Legal Entities (CNPJ) from the Federal Tax Agency (RFB)
This is a dataset of CNPJs (Cadastro Nacional de Pessoas Jurídicas, or National Registry of Legal Entities) downloaded from the publicly-available dataset hosted at Brazil's RFB (Receita Federal do Brasil, or Federal Tax Agency). CNPJs are unique identifiers for legal entities in Brazil. This dataset is important for Trase's efforts to map and model supply chains; in particular for all of our SEI-PCS models in Brazil.
It is a large dataset: in the 2023 download the final, cleaned CNPJ table was 57 million rows and 24GB uncompressed.
It typically includes the following information:
| File | Description |
|---|---|
| cnpj-metadados.pdf | Data Dictionary |
| Cnaes | Economic activity codes |
| Empresas* | Companies |
| Estabelecimentos | Establishments |
| Motivos | Reasons |
| Municipios | Municipalities |
| Naturezas | Natures |
| Paises | Countries |
| Qualificaoes | Qualifications |
| Simples | |
| Socios* | Partners |
| Regime Tributario | Tax regime data: Immune and Exempt, Arbitrated Profit, Presumed Profit and Actual Profit |
The data can be accessed as a Parquet file on AWS S3, but it is also available on AWS Athena, which makes it easier to query specific subsets of the data efficiently without needing to download or process the full dataset locally. This is especially useful for large-scale analyses, filtering by CNPJ, or joining with other data sources in-place using SQL. Here is an example of how to query the Athena table using Python and Pandas:
import pandas as pd
from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor
# Define a list of CNPJs to enrich with metadata
example_cnpjs = pd.DataFrame({
"cnpj": [11641575000100, 65004913000121, 12345678910] # Example CNPJ values
})
# Athena connection settings
athena_table = "s3_big_data.cnpj_2023_08_12" # CNPJ table in Athena
s3_staging_dir = "s3://trase-temp/athena/" # Temporary staging path for query results
# Prepare list of CNPJs for SQL query
cnpj_values = ", ".join(str(cnpj) for cnpj in example_cnpjs["cnpj"])
# Build SQL query to retrieve metadata from Athena
query = f"""
SELECT
CAST(cnpj AS BIGINT) AS cnpj,
company_name,
municipality
FROM {athena_table}
WHERE TRY_CAST(cnpj AS BIGINT) IN ({cnpj_values})
"""
# Execute the query and load results into DataFrame
cursor = connect(s3_staging_dir=s3_staging_dir).cursor(ArrowCursor)
df_athena = cursor.execute(query).as_arrow().to_pandas()
# Join the local CNPJ list with the Athena results
df_merged = pd.merge(
example_cnpjs,
df_athena,
on="cnpj",
how="left", # Keep all local CNPJs; fill missing metadata with NaN
validate="one_to_one" # Ensure no duplicates in join
)
# Display the result
print(df_merged)
# Example output:
# cnpj company_name municipality
# 0 11641575000100 LJR SERVICOS DE APOIO ADMINISTRATIVO LTDA 4205407
# 1 65004913000121 COMERCIO E CONFECCOES DEME'S LTDA 3550308
# 2 12345678910 NaN NaN
How often is the data updated?
The data is updated on a monthly basis.
How to re-fetch the dataset from the original source.
First, visit https://dados.gov.br/dados/conjuntos-dados/cadastro-nacional-da-pessoa-juridica---cnpj.
Then, download every file and PDF on the page to the following location, where YYYY-MM-DD is the date under Data da última extração (Date of last extraction) on the website:
s3://trase-storage/brazil/logistics/cnpj/receita_federal_do_brasil/originals/YYYY-MM-DD/compressed/
Now, open the script trase/data/brazil/logistics/cnpj/receita_federal_do_brasil/receita_federal.py.
Update the S3_BASE_DIR variable to the latest location of the data.
Create a new schema in PostgreSQL, e.g. cnpj_YYYY_MM_DD and update the SCHEMA variable in the script.
Finally, run the script to write the tables to PostgreSQL.
History of updates
2nd February 2023 Creation Date
The files have the following row counts:
| File | Row Count |
|---|---|
| Lucro Real.csv | 1,379,742 |
| Lucro Presumido 2021.csv | 1,047,742 |
| Lucro Presumido 2020.csv | 1,040,711 |
| Lucro Presumido 2019.csv | 990,632 |
| Lucro Presumido 2018.csv | 945,775 |
| Lucro Presumido 2017.csv | 927,272 |
| Lucro Presumido 2016.csv | 893,435 |
| Lucro Arbitrado.csv | 7,133 |
24th November 2022 Creation Date
The files have the following row counts:
| File | Row Count |
|---|---|
| Estabelecimentos*.csv | 57,156,641 |
| Empresas*.csv | 54,287,877 |
| simples.csv | 35,918,970 |
| Socio*.csv | 22,978,660 |
| Imunes e isentas.csv | 1,474,161 |
| municipios.csv | 5,571 |
| cnae.csv | 1,359 |
| paises.csv | 255 |
| naturezas.csv | 90 |
| Leiaute dos Arquivos | 77 |
| qualificaoes.csv | 68 |
| motivos.csv | 61 |
Acceptance criteria for sufficient level of quality of the dataset
We currently have no acceptance criteria defined for this dataset.
Details
| Column | Type | Description |
|---|---|---|
Models / Seeds
source.trase_duckdb.source_brazil.cnpj_2019_02_cnae_secondary_bronzesource.trase_duckdb.source_brazil.cnpj_2019_02_bronze
Sources
['source_brazil', 'cnpj_2019_02_cnae_secondary_bronze']['source_brazil', 'cnpj_2019_02_bronze']
No called script or script source not found.
{{ config(materialized='external') }}
SELECT
cnpj.cnpj,
cnpj.razao_social,
cnpj.nome_fantasia,
cnpj.municipio,
cnpj.uf,
cnpj.tipo_de_registro::INT AS tipo_de_registro,
cnpj.indicador::BOOLEAN AS indicador,
cnpj.tipo_atualizacao,
cnpj.identificador_matriz_filial::INT AS identificador_matriz_filial,
cnpj.situacao_cadastral,
CAST(cnpj.data_situacao_cadastral AS DATE) AS data_situacao_cadastral,
cnpj.motivo_situacao_cadastral,
cnpj.codigo_natureza_juridica::INT AS codigo_natureza_juridica,
CAST(cnpj.data_inicio_atividade AS DATE) AS data_inicio_atividade,
cnpj.cnae_primary,
cnae.cnae_secondary_array,
cnpj.descricao_tipo_logradouro,
cnpj.logradouro,
cnpj.numero,
cnpj.complemento,
cnpj.bairro,
cnpj.cep,
cnpj.ddd_telefone_1,
cnpj.ddd_telefone_2,
cnpj.ddd_fax,
cnpj.correio_eletronico,
cnpj.qualificacao_responsavel,
CAST(REPLACE(cnpj.capital_social_empresa, ',', '.') AS DOUBLE) AS capital_social_empresa,
cnpj.porte_empresa,
cnpj.opcao_pelo_simples::INT AS opcao_pelo_simples,
cnpj.opcao_pelo_mei,
cnpj.geocodmun,
cnpj.uf_cod::INT AS uf_code,
cnpj.type
FROM {{ source('source_brazil', 'cnpj_2019_02_bronze') }} AS cnpj
LEFT JOIN (
SELECT
cnpj,
ARRAY_AGG(DISTINCT cnae_secondary) AS cnae_secondary_array
FROM {{ source('source_brazil', 'cnpj_2019_02_cnae_secondary_bronze') }}
GROUP BY cnpj
) AS cnae
ON cnpj.cnpj = cnae.cnpj
ORDER BY cnpj.cnpj ASC