This page is synchronized from doc/DuckDB.md. Last modified on 2026-03-20 23:00 CET by Trase Admin.
Please view or edit the original file there; changes should be reflected here after a midnight build (CET time),
or manually triggering it with a GitHub action (link).
DuckDB
DuckDB is a high-performance analytical database that runs in-process: it doesn't require installing and constantly running a separate database server, but can simply be included within a script (Python, R, or others) by importing it as a library, started from the command line, or used through a user interface. DuckDB is useful for querying data files (csv, parquet, json, excel, geojson, google spreadsheets, etc.) or writing to them, both locally and in "cloud" object stores such as s3 or google cloud storage. Within Python, it can query and write to dataframes such as pandas or polars, and in R it can work using dbplyr/dplyr syntax.
DuckDB can also connect to other database systems (PostgreSQL, BigQuery, Athena, etc.) to read from and write to them.
Furthermore, as there is a DBT adapter for DuckDB (https://github.com/duckdb/dbt-duckdb), we can use it to define, document, and run data pipelines that rely on our existing Python or R scripts and s3 data. See our DBT documentation for more details on this.
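As an illustrative sketch only (the profile name and path below are hypothetical placeholders; see the dbt-duckdb README for the authoritative options), a minimal profiles.yml entry for running DBT against a local DuckDB file looks like:

```yaml
# hypothetical profiles.yml entry; profile name and path are placeholders
trase_pipeline:
  target: dev
  outputs:
    dev:
      type: duckdb                 # provided by the dbt-duckdb adapter
      path: /tmp/trase_dev.duckdb  # database file dbt will create/use
      threads: 4
```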
Startup script
It can be useful to have DuckDB automatically run some common commands on startup. For instance, to load AWS credentials so DuckDB can read (and write) files from S3.
To do this, create a file ~/.duckdbrc with the following:
-- load AWS credentials to read and write from S3
CALL load_aws_credentials();
Connecting to other databases
As we use PostgreSQL, BigQuery, and Athena, it is useful to be able to read from them, especially when integrating with DBT-defined pipelines. For instance, we might be interested in taking some reference data from them, doing further transformations that also rely on other s3 data, and writing the results to s3.
The following sections describe how to connect to each of them.
Connecting to PostgreSQL
DuckDB has a built-in extension to connect to PostgreSQL (documentation in https://duckdb.org/docs/stable/core_extensions/postgres).
Supposing you have an active .pg_service.conf file with connection details, including a production_readonly profile as mentioned in the main Trase README, you can connect to Postgres as follows:
ATTACH 'service=production_readonly dbname=trase' AS pg_db_connection (TYPE POSTGRES, READ_ONLY);
SET pg_experimental_filter_pushdown=true;
Connection details can also be provided directly in the ATTACH command, or through environment variables, as specified in the extension documentation linked above.
The SET pg_experimental_filter_pushdown=true; command pushes filters (WHERE clauses) down to Postgres before results are returned, so DuckDB doesn't have to pull all the data from a table on each query; this can significantly reduce the data transferred over the network and speed up queries.
Afterwards, you can query Postgres tables by referring to the newly attached database, for example:
SELECT * FROM pg_db_connection.supply_chains_datasets.brazil_beef_v2_2_0 LIMIT 10;
In certain cases you might want to write the query using Postgres SQL syntax instead of DuckDB's. For this you can use the postgres_query function, for example:
SELECT * FROM postgres_query(
'pg_db_connection',
'SELECT cnpj::text AS cnpj FROM cnpj.cnpj WHERE cnpj::text IN (....)'
);
Example usage in DBT
In trase/data_pipeline/macros/attach_postgres.sql you can find an example of attaching to PostgreSQL in DBT, which is done automatically for all models within trase/data_pipeline/models/postgres_views, as defined in the DBT project definition in trase/data_pipeline/dbt_project.yml.
Connecting to BigQuery
There is a DuckDB 'community' extension to connect to BigQuery (https://duckdb.org/community_extensions/extensions/bigquery). Its usage is very similar to PostgreSQL's. Here is an example connecting to the BigQuery project dopastoaoprato-278fd in read-only mode, and enabling filter pushdown:
INSTALL bigquery FROM community; LOAD bigquery;
ATTACH 'project=dopastoaoprato-278fd' AS bq_db_connection (TYPE bigquery, READ_ONLY);
SET bq_experimental_filter_pushdown=true;
-- Do a normal query:
SELECT * FROM bq_db_connection.brazil_beef.reference_data.clean_cnpj_2019 LIMIT 10;
-- Use GoogleSQL syntax directly:
SELECT * FROM bigquery_query(
'bq_db_connection',
'SELECT cnpj AS cnpj FROM brazil_beef.reference_data.clean_cnpj_2019');
For this to work, you need to have the Google Cloud CLI installed and be authenticated with the right permissions, as mentioned in the main Trase README.
Connecting to AWS Athena
Athena is a serverless query service that allows querying data stored in S3 (csv's, parquets, etc.) using standard SQL. It is particularly useful for querying large parquet files (several gigabytes in size), without having to download them locally or have a machine with enough memory to process them.
In Trase's case, a couple of examples are Brazil's company database, which includes information for all companies with a CNPJ number, and Brazil's animal movement records (GTAs).
DuckDB can connect to Athena using ODBC (an open standard for accessing databases). This requires first installing ODBC and the specific driver for Athena, and then setting some configuration files pointing to the driver and the connection details, in a way somewhat similar to how it's done for PostgreSQL.
Installing ODBC and Athena ODBC Driver
To connect DuckDB to Athena via ODBC, follow these steps:
1. Install ODBC
   - MacOS (using Homebrew):
     brew install unixodbc
   - Linux (Ubuntu example, root installation):
     sudo apt-get install unixodbc
   - Linux (conda example, non-root):
     conda install -c conda-forge unixodbc pyodbc rpm-tools cpio xorg-libx11 dbus
2. Install the Athena ODBC Driver
   - Download the driver from AWS: Athena ODBC Driver Download
   - For Windows or Mac, follow the installer instructions.
   - For Linux, identify the link above, then download and unpack the driver in the path where it will be installed. For example, for SageMaker, it can be within a path where other local library and executable files exist (e.g. /home/sagemaker-user/.admin/athena). Go to that directory and do:
     curl -O https://downloads.athena.us-east-1.amazonaws.com/drivers/ODBC/v2.0.5.1/Linux/AmazonAthenaODBC-2.0.5.1.rpm
     rpm2cpio AmazonAthenaODBC-2.0.5.1.rpm | cpio -idmv
3. Install the iODBC Framework (Mac or Windows, not Linux)
   - The Athena driver depends on iODBC. Download and install it from iODBC Downloads.
Configuring ODBC for Athena
After installing, you need to create some configuration files:
* odbcinst.ini: this file tells ODBC where the Athena driver is located. Check the documentation on the Athena ODBC driver download page for common locations of this file depending on your OS.
  * If installing on MacOS using Homebrew, a common location is /opt/homebrew/etc/odbcinst.ini.
  * If installing on Linux in a local path, it could be /home/sagemaker-user/.admin/athena/opt/athena/odbc/ini/odbcinst.ini
The content of the file would be something like this (adjust the Driver path to your OS and install location, keeping only the line that applies):

[Amazon Athena ODBC Driver]
Description=Amazon Athena ODBC Driver
; MacOS (arm64) example:
Driver=/opt/athena/odbc/arm64/lib/libathena-odbc-arm64.dylib
; Linux example:
; Driver=/home/sagemaker-user/.admin/athena/opt/athena/odbc/lib/libathena-odbc.so
- Create a DSN (Data Source Name) configuration file odbc.ini: this file tells ODBC how to connect to Athena, including AWS credentials, S3 output location, etc. A common location for this file is in the home directory, ~/.odbc.ini. An example of a connection called AthenaProd, connecting to the s3_big_data dataset in the AWS Glue catalog and using s3://trase-temp/athena/ as a temporary location for query results, would be:

  [AthenaProd]
  Driver = Amazon Athena ODBC 2.x
  AwsRegion = eu-west-1
  Catalog = AwsDataCatalog
  Schema = s3_big_data
  Workgroup = primary
  S3OutputLocation = s3://trase-temp/athena/
  AuthenticationType = Default Credentials

- Export environment variables so DuckDB can find the ODBC configuration settings.
  - For the MacOS example this could be:
    export ODBCSYSINI=/opt/homebrew/etc
    export ODBCINI="$HOME/.odbc.ini"
    export DYLD_LIBRARY_PATH=/opt/athena/odbc/arm64/lib:$DYLD_LIBRARY_PATH
  - For the Linux example this could be:
    export ODBCINI=$HOME/.odbc.ini
    export ODBCSYSINI=$HOME/.admin/athena/opt/athena/odbc/ini
    export LD_LIBRARY_PATH="$HOME/.admin/athena/opt/athena/odbc/lib:$CONDA_PREFIX/lib"

  Some additional observations for Linux:
  - Note the library path goes in a different environment variable (LD_LIBRARY_PATH instead of DYLD_LIBRARY_PATH).
  - You can add these variables to the specific conda environment from which you are working, so you don't have to set them each session. Add them in the variables: section of the environment's defining yaml file, or, within the activated environment, by doing conda env config vars set ODBCINI=/path/to/odbc.ini. You can check the configured conda environment variables with conda env config vars list.
  - Remove the contents of the test user configuration that comes with the Athena driver, if it exists (empty the contents of /home/sagemaker-user/.admin/athena/opt/athena/odbc/ini/odbc.ini).
Testing general ODBC configuration
- Check the location of the ODBC configuration files:
  odbcinst -j
  You should see something like:
  unixODBC 2.3.14
  DRIVERS............: /opt/homebrew/etc/odbcinst.ini
  SYSTEM DATA SOURCES: /opt/homebrew/etc/odbc.ini
  FILE DATA SOURCES..: /opt/homebrew/etc/ODBCDataSources
  USER DATA SOURCES..: /Users/nicolasmartin/.odbc.ini
  SQLULEN Size.......: 8
  SQLLEN Size........: 8
  SQLSETPOSIROW Size.: 8
- Check the Athena ODBC driver is registered:
  odbcinst -q -d
  You should see something like:
  [Amazon Athena ODBC 2.x]
- Check the DSN connection to Athena (it requires you to have valid AWS credentials active in your environment), using isql (comes with unixodbc):
  isql -v AthenaProd
  You should see something like:
  +---------------------------------------+
  | Connected!                            |
  |                                       |
  | sql-statement                         |
  | help [tablename]                      |
  | echo [string]                         |
  | quit                                  |
  |                                       |
  +---------------------------------------+
  SQL>
Testing the connection in DuckDB
To run queries from DuckDB against Athena, you need to use an ODBC extension called odbc_scanner and then send an ODBC query. You can test the connection as follows:
INSTALL odbc_scanner;
LOAD odbc_scanner;
SELECT * FROM odbc_list_data_sources();
SELECT * FROM odbc_query(
'DSN=AthenaProd',
'SELECT 1'
);
If it's working correctly, you can now try a query against actual data in S3. For example, the following query against a table registered as cnpj_2025_02_04, which actually queries a 5GB parquet file stored in s3://trase-storage/brazil/logistics/cnpj/gold/cnpj_2025_02_04/cnpj_2025_02_04_gold.parquet:
SELECT * FROM odbc_query(
'DSN=AthenaProd',
'SELECT * FROM "AwsDataCatalog"."s3_big_data"."cnpj_2025_02_04" LIMIT 10'
);
To-do
Other things that would be useful to document about DuckDB usage in the future:
- Querying special files (excel, json, geojson, shp)
- Common csv options
  - Querying iso-8859-1 encoded csvs
  - all_varchar, header, delimiter (example)
- Using the UI
- Further DBT - DuckDB integration
- Example of Athena use in DBT