
This page is synchronized from doc/DuckDB.md. Last modified on 2026-03-20 23:00 CET by Trase Admin. Please view or edit the original file there; changes will be reflected here after the midnight build (CET time), or by manually triggering it with a GitHub action (link).

DuckDB

DuckDB is a high-performance analytical database that runs in-process: it doesn't require installing and constantly running a separate database server, but can simply be included within a script (Python, R, or others) by importing it as a library, started from the command line, or used through a user interface. DuckDB is useful for querying data files (csv, parquet, json, excel, geojson, google spreadsheets, etc.) or writing to them, both locally and in "cloud" object stores such as s3 or google cloud storage. Within Python, it can query and write to dataframes such as pandas or polars, and in R it can work using dbplyr/dplyr syntax.
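As a minimal illustration of querying and writing files from SQL (the file names below are made-up examples, not real Trase data):

```sql
-- query a local csv file directly, as if it were a table
SELECT commodity, SUM(volume) AS total_volume
FROM 'trade_flows.csv'
GROUP BY commodity;

-- write a query result out as a parquet file
COPY (SELECT * FROM 'trade_flows.csv')
TO 'trade_flows.parquet' (FORMAT parquet);
```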

DuckDB can also connect to other database systems (PostgreSQL, BigQuery, Athena, etc.) to read and write in them.

Furthermore, as there is a DBT adapter for DuckDB (https://github.com/duckdb/dbt-duckdb), we can use it to define, document, and run data pipelines that rely on our existing Python or R scripts and s3 data. See our DBT documentation for more details on this.
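As a rough sketch of what the dbt-duckdb side looks like (the profile and target names here are made up; see the adapter's documentation and our DBT docs for the real configuration), a profiles.yml entry can be as simple as:

```yaml
trase_pipeline:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: local.duckdb   # local database file used while running models
```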

Startup script

It can be useful to have DuckDB automatically run some common commands on startup. For instance, to load AWS credentials so DuckDB can read (and write) files from S3. To do this, create a file ~/.duckdbrc with the following:

-- load AWS credentials to read and write from S3
CALL load_aws_credentials();
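With the credentials loaded, S3 paths can then be queried like local files; for example (the bucket and path below are placeholders):

```sql
SELECT COUNT(*) FROM read_parquet('s3://my-bucket/some/file.parquet');
```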

Connecting to other databases

As we use PostgreSQL, BigQuery, and Athena, it is useful to be able to read from them, especially when integrating with DBT-defined pipelines. For instance, we might be interested in taking some reference data from them, doing further transformations that also rely on other s3 data, and writing the results to s3.

The following sections describe how to connect to each of these systems.

Connecting to PostgreSQL

DuckDB has a built-in extension to connect to PostgreSQL (documentation in https://duckdb.org/docs/stable/core_extensions/postgres).

Supposing you have an active .pg_service.conf file with connection details, including a production_readonly profile as mentioned in the main Trase README, you can connect to Postgres as follows:

ATTACH 'service=production_readonly dbname=trase' AS pg_db_connection (TYPE POSTGRES, READ_ONLY); 
SET pg_experimental_filter_pushdown=true;
Otherwise, you need to provide the full connection details as parameters in the ATTACH command, or through environment variables, as specified in the extension documentation linked above.
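For reference, an ATTACH with a full connection string looks like the following (all values here are placeholders, to be replaced with your actual connection details):

```sql
ATTACH 'host=localhost port=5432 dbname=trase user=myuser password=mypassword'
    AS pg_db_connection (TYPE POSTGRES, READ_ONLY);
```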

The SET pg_experimental_filter_pushdown=true; command enables pushing filters (WHERE clauses) down to Postgres before it returns results, so DuckDB doesn't have to fetch all the data from a table on each query; this can significantly reduce the data transferred over the network and speed up queries.

Afterwards, you can query Postgres tables by referring to the newly attached database, for example:

SELECT * FROM pg_db_connection.supply_chains_datasets.brazil_beef_v2_2_0 LIMIT 10;

In certain cases you might want to write the query using Postgres SQL syntax instead of DuckDB's. For this you can use the postgres_query function, for example:

SELECT * FROM postgres_query(
    'pg_db_connection', 
    'SELECT cnpj::text AS cnpj FROM cnpj.cnpj WHERE cnpj::text IN (....)'
    );
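Combining the above, a typical pattern is to pull reference data from Postgres and write the result to s3 (the output path below is a made-up example):

```sql
COPY (
    SELECT * FROM pg_db_connection.supply_chains_datasets.brazil_beef_v2_2_0
) TO 's3://my-bucket/exports/brazil_beef.parquet' (FORMAT parquet);
```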

Example usage in DBT

In trase/data_pipeline/macros/attach_postgres.sql you can find an example of attaching to PostgreSQL in DBT. This is done automatically for all models within trase/data_pipeline/models/postgres_views, as defined in the DBT project definition in trase/data_pipeline/dbt_project.yml.
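Such a macro is along these lines (a sketch only, not the actual contents of attach_postgres.sql; the macro and alias names are illustrative):

```sql
-- sketch of a DBT macro that attaches Postgres before models run
{% macro attach_postgres() %}
ATTACH IF NOT EXISTS 'service=production_readonly dbname=trase'
    AS pg_db_connection (TYPE POSTGRES, READ_ONLY);
{% endmacro %}
```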

Connecting to BigQuery

There is a DuckDB 'community' extension to connect to BigQuery (https://duckdb.org/community_extensions/extensions/bigquery). Its usage is very similar to the PostgreSQL one. Here is an example connecting to the BigQuery project dopastoaoprato-278fd in read-only mode, and enabling filter pushdown:

INSTALL bigquery FROM community; LOAD bigquery;
ATTACH 'project=dopastoaoprato-278fd' AS bq_db_connection (TYPE bigquery, READ_ONLY);
SET bq_experimental_filter_pushdown=true;

-- Do a normal query:
SELECT * FROM bq_db_connection.brazil_beef.reference_data.clean_cnpj_2019 LIMIT 10;

-- Use GoogleSQL syntax directly:
SELECT * FROM bigquery_query(
    'bq_db_connection', 
    'SELECT cnpj AS cnpj FROM brazil_beef.reference_data.clean_cnpj_2019');

For this to work, you need to have the Google Cloud CLI installed and be authenticated with the right permissions, as mentioned in the main Trase README.

Connecting to AWS Athena

Athena is a serverless query service that allows querying data stored in S3 (csv's, parquets, etc.) using standard SQL. It is particularly useful for querying large parquet files (several gigabytes in size), without requiring you to download them locally or to have a computer with enough memory to process them.

In Trase's case, a couple of examples are Brazil's company database, which includes information on all companies with a CNPJ number, and Brazil's animal movement records (GTAs).

DuckDB can connect to Athena using ODBC (an open standard for accessing databases). This requires first installing ODBC and the specific driver for Athena, and then setting some configuration files pointing to the driver and the connection details, in a way somewhat similar to how it's done for PostgreSQL.

Installing ODBC and Athena ODBC Driver

To connect DuckDB to Athena via ODBC, follow these steps:

  1. Install ODBC

    • MacOS (using Homebrew):
      brew install unixodbc
      
    • Linux (Ubuntu example, root installation):
      sudo apt-get install unixodbc
      
    • Linux (conda example - non-root):
      conda install -c conda-forge unixodbc pyodbc rpm-tools cpio xorg-libx11 dbus
      
  2. Install the Athena ODBC Driver

    For Windows or macOS, follow the installer instructions from the driver download page.

    For Linux, identify the driver download link from the same page, then download and unpack it in the path where it will be installed. For example, on SageMaker, it can go within a path where other local library and executable files exist (e.g. /home/sagemaker-user/.admin/athena). Go to that directory and do:

    curl -O https://downloads.athena.us-east-1.amazonaws.com/drivers/ODBC/v2.0.5.1/Linux/AmazonAthenaODBC-2.0.5.1.rpm
    rpm2cpio AmazonAthenaODBC-2.0.5.1.rpm | cpio -idmv
    

  3. Install the iODBC Framework (macOS or Windows, not Linux)

    • The Athena driver depends on iODBC. Download and install it from the iODBC Downloads page

Configuring ODBC for Athena

After installing, you need to create some configuration files:

  • odbcinst.ini: this file tells ODBC where the Athena driver is located. Check the documentation on the Athena ODBC driver download page for common locations of this file depending on your OS.
    • If installing on macOS using Homebrew, a common location is /opt/homebrew/etc/odbcinst.ini.
    • If installing on Linux in a local path, it could be /home/sagemaker-user/.admin/athena/opt/athena/odbc/ini/odbcinst.ini

The content of the file would be something like this (adjust the path to the driver accordingly):

[Amazon Athena ODBC Driver]
Description=Amazon Athena ODBC Driver
Driver=/opt/athena/odbc/arm64/lib/libathena-odbc-arm64.dylib

For Linux, the Driver location could instead be something like Driver=/home/sagemaker-user/.admin/athena/opt/athena/odbc/lib/libathena-odbc.so.

  • Create a DSN (Data Source Name) configuration file odbc.ini: this file tells ODBC how to connect to Athena, including AWS credentials, the S3 output location, etc. A common location for this file is in the home directory, as ~/.odbc.ini. An example of a connection called AthenaProd, connecting to the s3_big_data dataset in the AWS Glue catalog and using s3://trase-temp/athena/ as a temporary location for query results, would be:
    [AthenaProd]
    Driver             = Amazon Athena ODBC 2.x
    AwsRegion          = eu-west-1
    Catalog            = AwsDataCatalog
    Schema             = s3_big_data
    Workgroup          = primary
    S3OutputLocation   = s3://trase-temp/athena/
    AuthenticationType = Default Credentials
    
  • Export environment variables so DuckDB can find the ODBC configuration settings.
    • For the MacOS example this could be:
      export ODBCSYSINI=/opt/homebrew/etc
      export ODBCINI="$HOME/.odbc.ini"
      export DYLD_LIBRARY_PATH=/opt/athena/odbc/arm64/lib:$DYLD_LIBRARY_PATH
      
    • For the Linux example this could be:
      export ODBCINI=$HOME/.odbc.ini
      export ODBCSYSINI=$HOME/.admin/athena/opt/athena/odbc/ini
      export LD_LIBRARY_PATH="$HOME/.admin/athena/opt/athena/odbc/lib:$CONDA_PREFIX/lib"
      
      Some additional observations for Linux:
      • Note that the library path goes in a different environment variable (LD_LIBRARY_PATH instead of DYLD_LIBRARY_PATH).
      • You can add these variables to the specific conda environment from which you are working, so you don't have to set them each session:
        • Add them in the variables: section of the environment's defining yaml file, or, within the activated environment, run conda env config vars set ODBCINI=/path/to/odbc.ini.
        • You can check the configured conda environment variables with conda env config vars list.
      • Remove the contents of the test user configuration that comes with the Athena driver, if it exists (i.e. empty the contents of /home/sagemaker-user/.admin/athena/opt/athena/odbc/ini/odbc.ini).

Testing general ODBC configuration

  • Check the location of the ODBC configuration files:
    odbcinst -j
    
    You should see something like:
    unixODBC 2.3.14
    DRIVERS............: /opt/homebrew/etc/odbcinst.ini
    SYSTEM DATA SOURCES: /opt/homebrew/etc/odbc.ini
    FILE DATA SOURCES..: /opt/homebrew/etc/ODBCDataSources
    USER DATA SOURCES..: /Users/nicolasmartin/.odbc.ini
    SQLULEN Size.......: 8
    SQLLEN Size........: 8
    SQLSETPOSIROW Size.: 8
    
  • Check the Athena ODBC driver is registered:
    odbcinst -q -d
    
    You should see something like:
    [Amazon Athena ODBC 2.x]
    
  • Check the DSN connection to Athena (it requires you have valid AWS credentials active in your environment), using isql (comes with unixodbc):
    isql -v AthenaProd
    
    You should see something like:
    +---------------------------------------+
    | Connected!                            |
    |                                       |
    | sql-statement                         |
    | help [tablename]                      |
    | echo [string]                         |
    | quit                                  |
    |                                       |
    +---------------------------------------+
    SQL> 
    

Testing the connection in DuckDB

To run queries against Athena from DuckDB, you need to use an ODBC extension called odbc_scanner and then send an ODBC query. You can test the connection as follows:

INSTALL odbc_scanner;
LOAD odbc_scanner;

SELECT * FROM odbc_list_data_sources();

SELECT * FROM odbc_query(
  'DSN=AthenaProd',
  'SELECT 1'
);

If it's working correctly, you can now try a query against actual data in S3. For example, the following query reads from a table registered as cnpj_2025_02_04, which actually queries a 5 GB parquet file stored in s3://trase-storage/brazil/logistics/cnpj/gold/cnpj_2025_02_04/cnpj_2025_02_04_gold.parquet:

SELECT * FROM odbc_query(
  'DSN=AthenaProd',
  'SELECT * FROM "AwsDataCatalog"."s3_big_data"."cnpj_2025_02_04" LIMIT 10'
);
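If you will iterate on the same data repeatedly, it can help to materialize a filtered sample into a local DuckDB table once, then work against that locally (the table name cnpj_sample below is just an example):

```sql
-- pull a bounded sample from Athena once, into a local DuckDB table
CREATE TABLE cnpj_sample AS
SELECT * FROM odbc_query(
  'DSN=AthenaProd',
  'SELECT * FROM "AwsDataCatalog"."s3_big_data"."cnpj_2025_02_04" LIMIT 1000'
);
```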

To-do

Other things that would be useful to document about DuckDB usage in the future:

  • Querying special files (excel, json, geojson, shp)
  • Common csv options
    • Querying iso-8859-1 encoded csvs
    • all_varchar, header, delimiter (example)
  • Using the UI
  • Further DBT - Duckdb integration
  • Example of Athena use in DBT