This page is synchronized from README.md; it was last modified on 2025-12-09 00:30 CET by Trase Admin. Please view or edit the original file there; changes are reflected here after the midnight build (CET time), or by manually triggering it with a GitHub action (link).

# Trase Codebase

Trase follows a "mono-repo" pattern: there are a lot of different things in this repository! You will likely not need to follow all of the steps below; just pick what is relevant to you.

Google Earth Engine (GEE) code is kept on GEE, not on GitHub. GEE has its own version control system, which can be viewed here. Note that to retain version-control history, a file must keep the same name and consistently be stored in the same location.

## Repository Structure and Key Files

```text
├── doc
│     ├── *.md                     # Lots of documentation about our technology and processes
│     ├── db                       # A static site that displays the schema of our main PostgreSQL database
│     └── runbook                  # How we release data to the public
├── poetry.lock                    # Lockfile that pins Python dependencies
├── pyproject.toml                 # Python dependencies and other Python tool configuration
└── trase                          # Root folder for all Trase code
      ├── admin
      │     ├── deforestationfree  # Scripts to manage our JupyterHub instance
      │     └── scripts            # Other various administrative scripts
      ├── config.py                # Defines the Python configuration file
      ├── data                     # Pre-processing / cleaning scripts for our data
      ├── database                 # Managing the schema of main PostgreSQL database (but not the data itself, see runbook/ below)
      ├── default.toml             # Default Python configuration
      ├── models                   # SEI-PCS models
      ├── products                 # Data products, like the trase.earth website
      ├── runbook                  # Ingestion and processing of data in our main PostgreSQL database
      └── tools                    # Library code (mostly Python), such as wrappers for interacting with AWS and PostgreSQL
```

## Contributing to this repository

We push code to branches and use pull requests to contribute code to the main branch ("master") in this repository. For information about this and lots more see How we use Git.

When writing documentation, it is recommended that there be a table of contents. We generate these using markdown-toc: install it, ensure that you have the toc/tocstop markers in your file, and use our wrapper script to generate the table of contents:

```bash
trase/admin/scripts/add_toc_to_docs.sh path/to/my_file.md
```
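The wrapper script relies on markdown-toc's marker comments. If your file does not already contain them, add a pair where the table of contents should appear (this sketch assumes markdown-toc's default `<!-- toc -->`/`<!-- tocstop -->` markers):

```markdown
# My Document

<!-- toc -->
<!-- tocstop -->

## First Section
```

markdown-toc replaces everything between the two markers with the generated table of contents, so it is safe to re-run the script after adding sections.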

## Installation

First, follow [GitHub's documentation on how to clone a repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository).

### A: If you are going to be running Python code

1. Ensure that you have installed Python 3.10.\*
   For handling multiple versions of Python [pyenv](https://github.com/pyenv/pyenv) is good (or [pyenv-win](https://github.com/pyenv-win/pyenv-win) for Windows folks).
2. We use [Poetry](https://python-poetry.org/) to track our Python dependencies and manage the virtual environment.
   Follow [their installation instructions to install it](https://python-poetry.org/docs/#installation).
3. Activate the virtual environment and install "core" dependencies:
   ```bash
   $ cd path/to/TRASE
   $ poetry install
   $ poetry shell
   ```
4. (Optional) Install "extra" dependencies
   ```bash
   poetry run pip install --requirement extra-requirements.txt
   ```
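Since the repository pins Python 3.10.\*, a mismatched interpreter is a common cause of `poetry install` failures. A quick check from Python (a sketch; `check_python_version` is a hypothetical helper, not part of the codebase):

```python
import sys

def check_python_version(required=(3, 10), actual=None):
    """Return True if the interpreter's major.minor version matches `required`.

    The repository pins Python 3.10.* in pyproject.toml, so a 3.9 or 3.11
    interpreter will fail Poetry's version check at install time.
    """
    if actual is None:
        actual = sys.version_info[:2]
    return tuple(actual) == tuple(required)

print(check_python_version())
```

If this prints `False`, switch interpreters with pyenv before running `poetry install`.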

### B: If you are going to be interacting with AWS

(For example, accessing files on S3 or _managing_ RDS instances).

You can access AWS resources through your `@trase.earth` access credentials, or through AWS access keys. If you have a `@trase.earth` account, SSO is the preferred and more secure method.

#### b1) Configuring AWS access using a `@trase.earth` account

You need to follow these instructions once per computer you want access from. After that, you only need to run `aws sso login` to start a new session.

Configuration instructions:

1. [Install the AWS Command Line Interface (CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
2. On the command line run the `aws configure sso` command, and fill in the following values:
   ```bash
   $ aws configure sso
   SSO session name (Recommended): trase-sso
   SSO start URL [None]: https://trase-earth.awsapps.com/start
   SSO region [None]: eu-west-1
   SSO registration scopes [None]:
   ```
   You can leave the last option blank.

   The AWS CLI will try to open a web browser with an authentication window, where you can enter or select your `@trase.earth` Google-based credentials. If the window doesn't open, an access link is also shown, which you can manually copy and paste into a browser window.
3. Once you've been authenticated, you can select the default profile configuration. If you have more than one access role available, you can select 
   and set the default one in an interactive menu. For example:
   ```text
   Using the account ID 614804060947
      There are 3 roles available to you.
       [ReadOnlyAccess]
       [EditAccess] X
       [AdminAccess]

      Using the role name "EditAccess"

   Default client Region [None]: eu-west-1
   CLI default output format (json if not specified) [None]:
   Profile name [614804060947_EditAccess]: default
   ```

   You can review your current profile with the command: `aws sts get-caller-identity`. Also, a file called `~/.aws/config` should have been
   created, including the information you provided. You can add profiles to this document if you need to. Here is an example including an `Admin` profile:
   ```toml
   [default]
   sso_session = trase-sso
   sso_account_id = 614804060947
   sso_role_name = EditAccess
   region = eu-west-1
   [profile Admin]
   sso_session = trase-sso
   sso_account_id = 614804060947
   sso_role_name = AdminAccess
   region = eu-west-1
   [sso-session trase-sso]
   sso_start_url = https://trase-earth.awsapps.com/start
   sso_region = eu-west-1
   sso_registration_scopes = sso:account:access
   ```

   If you have multiple profiles, you can specify which one to use by adding `--profile` to an aws command. This is especially useful for
   differentiating administrative tasks (creating users, starting services, etc.). For example:

   `aws s3 cp my_file.txt s3://trase-temp/my_folder/ --profile Admin`

   Or to use the `Admin` profile within a Python script do the following. Note that if you're using the default profile, you can omit the `profile_name` parameter:
   ```python
   import boto3
   session = boto3.Session(profile_name='Admin')
   s3_client = session.client('s3')
   ```

For security, after several hours of inactivity you will have to log in again. This is done with the following command, which will again try to open a browser window:

```bash
aws sso login
```

The previous instructions have been adapted from the AWS "Configuring IAM Identity Center authentication with the AWS CLI" page.

#### b2) Configuring AWS credentials through static access keys

1. Post in the `#data-systems` Slack channel to ask for access keys. They come as a pair: an Access Key ID and a Secret Access Key.
2. Create an AWS credentials file. You can do this in one of two ways.
3. Either use the AWS CLI (recommended):
   1. [Install the AWS Command Line Interface (CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
   2. On the command line run the `aws configure` command:
      ```bash
      $ aws configure
      AWS Access Key ID [None]: xxxx
      AWS Secret Access Key [None]: xxxx
      Default region name [None]: eu-west-1
      Default output format [None]:
      ```
4. Or create the file yourself:
   1. Create an empty file:
      - Mac/Linux: `~/.aws/credentials`
      - Windows: `C:\Users\<User>\.aws\credentials`
   2. Insert these contents, replacing `xxxx` with your access key ID and secret access key:
      ```ini
      [default]
      aws_access_key_id = xxxx
      aws_secret_access_key = xxxx
      ```
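The credentials file is plain INI, so you can sanity-check it from Python with the standard library. This is a sketch only: boto3 reads this file automatically, so in practice you never parse it yourself, and `read_access_key_id` is a hypothetical helper:

```python
import configparser
from pathlib import Path

def read_access_key_id(path=Path.home() / ".aws" / "credentials", profile="default"):
    """Return the access key ID stored for `profile` in an AWS credentials file."""
    config = configparser.ConfigParser()
    config.read(path)
    # Raises KeyError if the profile or key is missing, which is itself
    # a useful signal that the file is not set up correctly.
    return config[profile]["aws_access_key_id"]
```
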
      

### C: If you are going to be interacting with the Trase PostgreSQL database

(For example, running ingestions or processing datasets).

1. Get one of the existing team members to provide you with a PostgreSQL username and password.
2. Create a PostgreSQL Connection Services file:
   1. Windows: Create an empty file at `C:\Users\MyUser\AppData\postgresql\pg_services.conf`. (If your username has spaces this might be problematic! In that case choose a different location without spaces.) Then set a system-wide environment variable `PGSERVICEFILE` to the above file path. Restart your shell or terminal emulator to pick up the new environment variable.
   2. Unix: Run `touch ~/.pg_service.conf`.
   3. Open the file and add the following:
      ```ini
      [production]
      user=my_user_name
      host=trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com
      dbname=trase

      [production_readonly]
      user=my_user_name
      host=trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com
      dbname=trase
      options=-c default_transaction_read_only=on
      ```
      Note: `production_readonly` is a required profile for the DBT-DuckDB pipeline in `trase/data_pipeline`.
3. Create a PostgreSQL password file:
   1. Windows: Create an empty file at `C:\Users\MyUser\AppData\postgresql\pgpass.conf`.
   2. Unix: Run `touch ~/.pgpass && chmod 0600 ~/.pgpass`.
   3. Open the file and add the following, replacing `xxxxxx` with your password:
      ```text
      trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com:5432:*:my_user_name:xxxxxx
      ```
4. If you have `psql` installed, test that your connection works:
   ```bash
   $ psql service=production -c "select 'hello, world'"
      ?column?
   --------------
    hello, world
   (1 row)
   ```
5. If you have already followed the Python instructions above, select which database service Trase should use:
   ```text
   $ poetry run trase config db
       service name  user     host                                                         dbname  port
   --  ------------  -------  -----------------------------------------------------------  ------  ----
   1   production    my_user  trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com   trase   5432
   Select a PostgreSQL service: 1
   ```
   For more information on configuration see Configuration in Trase.
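One gotcha with the password file on Unix: libpq silently ignores `~/.pgpass` if it is readable by group or others, which makes authentication failures confusing. A small check (a sketch; `pgpass_permissions_ok` is a hypothetical helper, and this rule applies on Unix only — Windows relies on directory ACLs instead):

```python
import os
import stat

def pgpass_permissions_ok(path):
    """Return True if `path` has no group/other permission bits set (0600 or stricter).

    On Unix, libpq refuses to use a password file that is readable by
    group or others.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

If this returns `False`, `chmod 0600 ~/.pgpass` fixes it.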

If you are going to run DBT models, follow the instructions in `trase/database/dbt/README.md`.

### D: If you are going to be contributing Python or Jupyter code to the repository

Certain coding standards are automatically applied and enforced using pre-commit.

1. Install pre-commit. If you are using a virtual environment, install pre-commit outside the virtual environment:
   ```bash
   $ deactivate || true  # deactivate virtual environment
   $ pip3 install pre-commit
   ```
2. Install the pre-commit hooks:
   ```bash
   $ cd path/to/TRASE
   $ pre-commit install
   pre-commit installed at .git/hooks/pre-commit
   ```
3. Run the hooks to test that they are working:
   ```bash
   $ pre-commit run --all-files
   black....................................................................Passed
   jupyter-notebooks-clear-outputs..........................................Passed
   ```
4. (Optional) Pre-commit will abort the Git commit if your code is not conformant. This can be annoying because you have to commit twice! To avoid this, configure your code editor to auto-format your code when you save the file:
   1. First find where the pre-commit binary is by running the command `which pre-commit`.
   2. Then add a "file watcher":
      - PyCharm: Add a new file watcher (Preferences/Settings > File Watchers > +) with the following settings:

        | Key | Value |
        | --- | --- |
        | Name | e.g. Pre-Commit |
        | File Type | Any |
        | Scope | Project Files |
        | Program | output of `which pre-commit` |
        | Arguments | `run --files $FilePath$` |
        | Output paths to refresh | `$FilePath$` |
        | Working Directory and Environment Variables > Working directory | `$ProjectFileDir$` |
        | Advanced Options > Auto-save edited files to trigger the watcher | Untick |
        | Advanced Options > Show console | Always |
      - Emacs: A pre-save hook using the "blacken" package can be added to your init file:
        ```elisp
        (defun blacken-python ()
          (when (and (stringp buffer-file-name)
                     (string-match "\\.py\\'" buffer-file-name))
            (blacken-buffer)))
        (add-hook 'before-save-hook 'blacken-python)
        ```
      - RStudio: Add an R environment variable to save files upon styling, and add a hotkey for styling + saving your active file:
        1. Install the styler package: `install.packages("styler")`.
        2. Run `usethis::edit_r_profile()` to open your `.Rprofile` file, which runs each time you start R.
        3. Add the line `Sys.setenv(save_after_styling = TRUE)` to your `.Rprofile` file, and save.
        4. In RStudio, click through Tools -> AddIns -> Browse Add-Ins. That brings up a list of available add-ins; in the bottom-left of the box there is a "Keyboard Shortcuts..." button. Click it, and in the next menu you can add a keyboard shortcut for "Style active file".
        5. Set the shortcut for "Style active file" to whatever combination you want to use to save and style an R or R Markdown file (Ctrl + Alt + S is a nice default). Note that this shortcut will not work for non-R file types.
        6. Restart R. The active file in RStudio will now be styled and saved with Ctrl + Alt + S.
   3. Let's check it's working. Edit any file within the project (i.e. not a scratch file) in your IDE and then save it. If you are using PyCharm, check that you get a dialog box with the output from pre-commit, e.g.:
      ```text
      black....................................................................Passed
      check yaml...............................................................Passed
      ```
      You may need to tick a box which says "Trust project and run" in order for this to work.
   4. If you are using PyCharm, go back to the settings for the file watcher and choose Advanced options > Show console > Never.

You might also want to use isort for sorting imports and ssort for ordering statements within a file. Neither of these is enforced in the codebase, however.
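For reference, the hooks that pre-commit runs are declared in a `.pre-commit-config.yaml` at the repository root. A minimal configuration enabling Black looks roughly like this (illustrative only — the `rev` here is a placeholder, and the repository's actual hook list lives in its own config file):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2        # placeholder version; pin whatever the repository uses
    hooks:
      - id: black
```

Running `pre-commit autoupdate` bumps each `rev` to the hook repository's latest tag.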

### E: If you are going to be interacting with Google Cloud Platform (GCP)

If you want to access a GCP service like BigQuery or Gemini from scripts or the command line then you need to generate credentials.

For this step you will need:

- A `trase.earth` user account (note: an email alias is insufficient)
- That user account to have been added to the "trase" (`trase-396112`) project
- The `gcloud` command-line tool

Once these prerequisites are met, follow these steps:

1. Authenticate the CLI, ensuring that you use your `@trase.earth` account when the browser opens:
   ```bash
   gcloud auth login
   ```
2. Generate API access keys, called "Application Default Credentials" (ADCs):
   ```bash
   gcloud auth application-default login
   ```
   This will populate a file on your local computer:
   - Linux/macOS: `~/.config/gcloud/application_default_credentials.json`
   - Windows: `%APPDATA%/gcloud/application_default_credentials.json`
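Client libraries find these credentials automatically, so you rarely need the path yourself, but it is handy when debugging. A sketch of how the default location is resolved (`adc_path` is a hypothetical helper; the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, when set, overrides the default location):

```python
import os
from pathlib import Path

def adc_path():
    """Return the Application Default Credentials location for this platform."""
    override = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if override:
        return Path(override)
    if os.name == "nt":  # Windows
        return Path(os.environ["APPDATA"]) / "gcloud" / "application_default_credentials.json"
    return Path.home() / ".config" / "gcloud" / "application_default_credentials.json"
```
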

#### Google Cloud Roles

Here are some roles that we use:

| Effect | Role(s) |
| --- | --- |
| Permission to query datasets in BigQuery | `roles/bigquery.dataViewer`, `roles/bigquery.jobUser` |
| Permission to delete/create/etc. datasets | `roles/bigquery.dataOwner` |
| Full admin permission on BigQuery | `roles/bigquery.admin` |

The easiest way to assign these roles is via the Google Cloud Console. Or you can use the `gcloud` CLI:

```bash
gcloud projects add-iam-policy-binding trase-396112 \
  --member="user:joe.blogs@trase.earth" \
  --role="roles/bigquery.admin"
```

### F: Other dependencies

- We use Docker for our AWS Lambda jobs.
- The data pipeline app can be deployed using the Elastic Beanstalk command-line interface (EB CLI).
- Some diagrams are written using GraphViz.
- There are many Jupyter notebooks in this repository which you can run locally if you wish.
- SEI-PCS models require GLPK, a solver for linear programming problems.
  - You can install GLPK on Ubuntu with `sudo apt-get install glpk-utils` and on macOS with `brew install glpk`.
  - If you do not have GLPK installed and you try to run one of the SEI-PCS models, you may see an error like: `pulp.apis.core.PulpSolverError: PuLP: cannot execute glpsol`
- Tables of contents in documentation are generated using markdown-toc.
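Whether the solver is reachable can be checked up front, roughly the way PuLP itself does, by looking for the `glpsol` binary on the PATH (a sketch; `glpk_available` is a hypothetical helper):

```python
import shutil

def glpk_available():
    """Return True if the `glpsol` binary (GLPK's command-line solver) is on the PATH."""
    return shutil.which("glpsol") is not None

print("GLPK found" if glpk_available() else "GLPK missing - install glpk-utils")
```
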

## On Python Dependencies

Our Python code relies heavily on external libraries. It is important that we all install the same libraries, and moreover the same versions of those libraries. We specify Python dependencies in two places:

1. In `pyproject.toml`, managed by Poetry.
2. In `extra-requirements.txt`, managed by Pip.

If you wish to add a dependency, you have two choices:

1. Add to Poetry using `poetry add my-library`.
2. Add to Pip:
   1. Run `poetry run pip install my-library --upgrade-strategy only-if-needed`.
   2. Note the version that was installed, say "1.2.3".
   3. Edit the file `extra-requirements.txt`, adding a line `my-library==1.2.3` so that others are able to install the same version.

Generally you would choose the first option for a "core" library, for example one that is required by `trase.tools.pcs` or similar; and the second for an "extra" library, such as one that you have used in a preprocessing script in `trase/data`.
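A sketch of what a pin line in `extra-requirements.txt` encodes, assuming the conventional `name==version` form (`parse_pin` is a hypothetical helper; real tooling would use `packaging.requirements.Requirement` rather than a regex):

```python
import re

def parse_pin(line):
    """Split a pinned requirement like 'my-library==1.2.3' into (name, version)."""
    match = re.fullmatch(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s#]+)\s*", line)
    if match is None:
        # Unpinned specifiers like 'my-library>=1.0' are rejected: they
        # defeat the purpose of the file, which is identical installs.
        raise ValueError(f"not a pinned requirement: {line!r}")
    return match.group(1), match.group(2)
```
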

Below we list the advantages and disadvantages of both approaches:

| | Poetry | Pip |
| --- | --- | --- |
| Advantages | Dependencies resolve to a compatible set of versions.<br>Easy install using `poetry install`.<br>Easy to bulk-upgrade dependencies. | Simple to understand.<br>Good for optional / "opt-in" dependencies. |
| Disadvantages | Dependency resolution can take more time the more dependencies you have (see poetry #2094).<br>Cannot install packages which are technically incompatible but practically cause no issues.<br>Harder to pick and choose during install: all dependencies are installed when you run `poetry install`. | Running `pip install` can cause unexpected upgrades of Poetry-managed dependencies. |