Trase Codebase Info
This page is synchronized from README.md. Last modified on 2025-12-09 00:30 CET by Trase Admin.
Please view or edit the original file there; changes should be reflected here after a midnight build (CET time),
or manually triggering it with a GitHub action (link).
Trase Codebase
Trase follows a "mono-repo" pattern: there are a lot of different things in this repository! Likely you will not need to do all of the steps below: instead, just pick what is relevant to you.
- Repository Structure and Key Files
- Contributing to this repository
- Installation
- A: If you are going to be running Python code
- B: If you are going to be interacting with AWS
- C: If you are going to be interacting with the Trase PostgreSQL database
- D: If you are going to be contributing Python or Jupyter code to the repository
- E: If you are going to be interacting with Google Cloud Platform (GCP)
- F: Other dependencies
- On Python Dependencies
Google Earth Engine (GEE) code is kept on GEE, not on GitHub. GEE has its own version control system which can be viewed here. Note that to retain version control information for a file, it must keep the same name and consistently be stored in the same location.
Repository Structure and Key Files
├── doc
│ ├── *.md # Lots of documentation about our technology and processes
│ ├── db # A static site that displays the schema of our main PostgreSQL database
│ └── runbook # How we release data to the public
├── poetry.lock # Lockfile that pins Python dependencies
├── pyproject.toml # Python dependencies and other Python tool configuration
└── trase # Root folder for all Trase code
├── admin
│ ├── deforestationfree # Scripts to manage our JupyterHub instance
│ └── scripts # Other various administrative scripts
├── config.py # Defines the Python configuration file
├── data # Pre-processing / cleaning scripts for our data
├── database # Managing the schema of main PostgreSQL database (but not the data itself, see runbook/ below)
├── default.toml # Default Python configuration
├── models # SEI-PCS models
├── products # Data products, like the trase.earth website
├── runbook # Ingestion and processing of data in our main PostgreSQL database
└── tools # Library code (mostly Python), such as wrappers for interacting with AWS and PostgreSQL
Contributing to this repository
We push code to branches and use pull requests to contribute code to the main branch ("master") in this repository. For information about this and lots more see How we use Git.
When writing documentation, it is recommended that there be a table of contents. We generate these using markdown-toc: install it, ensure that you have the toc/tocstop markers in your file, and use our wrapper script to generate the table of contents:
```bash
trase/admin/scripts/add_toc_to_docs.sh path/to/my_file.md
```
## Installation
First, follow [GitHub's documentation on how to clone a repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository).
### A: If you are going to be running Python code
1. Ensure that you have installed Python 3.10.\*
For handling multiple versions of Python [pyenv](https://github.com/pyenv/pyenv) is good (or [pyenv-win](https://github.com/pyenv-win/pyenv-win) for Windows folks).
2. We use [Poetry](https://python-poetry.org/) to track our Python dependencies and manage the virtual environment.
Follow [their installation instructions to install it](https://python-poetry.org/docs/#installation).
3. Activate the virtual environment and install "core" dependencies:
```bash
$ cd path/to/TRASE
$ poetry install
$ poetry shell
```
4. (Optional) Install "extra" dependencies
```bash
poetry run pip install --requirement extra-requirements.txt
```
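To confirm that the environment uses the Python version named in step 1, a check along these lines can be run with `poetry run python`. This is a minimal sketch and not part of the repository; `is_supported` and `REQUIRED` are hypothetical names introduced for illustration.

```python
import sys

# Major and minor version named in step 1 of the instructions above
REQUIRED = (3, 10)


def is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor of `version_info` equals `required`."""
    return tuple(version_info[:2]) == required


current = ".".join(str(part) for part in sys.version_info[:3])
if is_supported():
    print(f"Python {current}: OK")
else:
    print(f"Python {current}: expected {REQUIRED[0]}.{REQUIRED[1]}.x")
```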
### B: If you are going to be interacting with AWS
(For example, accessing files on S3 or _managing_ RDS instances).
You can access AWS resources through your `@trase.earth` access credentials, or through AWS access keys. If you have a `@trase.earth` account,
this is the preferred and more secure method.
#### b1) Configuring AWS access using a `@trase.earth` account
You need to follow these instructions once per computer you want to have access from. After this, you will have to issue the command `aws sso login` when logging in.
Configuration instructions:
1. [Install the AWS Command Line Interface (CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
2. On the command line run the `aws configure sso` command, and enter the following values:
```bash
$ aws configure sso
SSO session name (Recommended): trase-sso
SSO start URL [None]: https://trase-earth.awsapps.com/start
SSO region [None]: eu-west-1
SSO registration scopes [None]:
```
You can leave the last option blank.
The AWS CLI will try to open an internet browser with an authentication window, where you can enter or select your `@trase.earth` Google-based account. If the window doesn't open, an access link is also shown, which you can manually copy and paste into a browser window.
3. Once you've been authenticated, you can select the default profile configuration. If you have more than one access role available, you can select
and set the default one in an interactive menu. For example:
```text
Using the account ID 614804060947
There are 3 roles available to you.
[ReadOnlyAccess]
[EditAccess] X
[AdminAccess]
Using the role name "EditAccess"
Default client Region [None]: eu-west-1
CLI default output format (json if not specified) [None]:
Profile name [614804060947_EditAccess]: default
```
You can review your current profile with the command: `aws sts get-caller-identity`. Also, a file called `~/.aws/config` should have been
created, including the information you provided. You can add profiles to this file if you need to. Here is an example including an `Admin` profile:
```ini
[default]
sso_session = trase-sso
sso_account_id = 614804060947
sso_role_name = EditAccess
region = eu-west-1
[profile Admin]
sso_session = trase-sso
sso_account_id = 614804060947
sso_role_name = AdminAccess
region = eu-west-1
[sso-session trase-sso]
sso_start_url = https://trase-earth.awsapps.com/start
sso_region = eu-west-1
sso_registration_scopes = sso:account:access
```
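Because `~/.aws/config` is an INI-style file, Python's standard `configparser` module can list the profiles it defines. The following is a minimal sketch: `list_profiles` is a hypothetical helper (not part of the Trase codebase or the AWS SDK), and it parses sample text mirroring the example above rather than your real file.

```python
import configparser

SAMPLE_CONFIG = """\
[default]
sso_session = trase-sso
sso_account_id = 614804060947
sso_role_name = EditAccess
region = eu-west-1

[profile Admin]
sso_session = trase-sso
sso_account_id = 614804060947
sso_role_name = AdminAccess
region = eu-west-1

[sso-session trase-sso]
sso_start_url = https://trase-earth.awsapps.com/start
sso_region = eu-west-1
sso_registration_scopes = sso:account:access
"""


def list_profiles(config_text):
    """Map each profile name to its SSO role, skipping sso-session sections."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    profiles = {}
    for section in parser.sections():
        if section == "default":
            name = "default"
        elif section.startswith("profile "):
            name = section[len("profile "):]
        else:
            continue  # for example "sso-session trase-sso"
        profiles[name] = parser[section].get("sso_role_name")
    return profiles


print(list_profiles(SAMPLE_CONFIG))
```

To inspect your own configuration, the text could be read from `~/.aws/config` and passed to the same function.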
If you have multiple profiles, you can specify which one to use by adding the `--profile` option to an `aws` command. This is especially useful for
distinguishing administrative tasks (creating users, starting services, etc.). For example:
`aws s3 cp my_file.txt s3://trase-temp/my_folder/ --profile Admin`
Or to use the `Admin` profile within a Python script do the following. Note that if you're using the default profile, you can omit the `profile_name` parameter:
```python
import boto3
session = boto3.Session(profile_name='Admin')
s3_client = session.client('s3')
```
For security, after several hours of inactivity you will have to log in again. This is done with the following command,
which will again try to open an internet browser window.
```bash
aws sso login
```
The previous instructions have been adapted from the AWS page "Configuring IAM Identity Center authentication with the AWS CLI".
b2) Configuring AWS credentials through static access keys
- Post in the #data-systems Slack channel to ask for access keys. They come as a pair, called Access Key ID and Secret Access Key.
- Create an AWS credentials file. You can do this in one of two ways:
  - Either use the AWS CLI (recommended):
    - Install the AWS Command Line Interface (CLI).
    - On the command line run the `aws configure` command:
          $ aws configure
          AWS Access Key ID [None]: xxxx
          AWS Secret Access Key [None]: xxxx
          Default region name [None]: eu-west-1
          Default output format [None]:
  - Or create the file yourself:
    - Create an empty file:
      - Mac/Linux: `~/.aws/credentials`
      - Windows: `C:\Users\<User>\.aws\credentials`
    - Insert these contents, replacing xxxx with your access key ID and secret access key:
          [default]
          aws_access_key_id = xxxx
          aws_secret_access_key = xxxx
C: If you are going to be interacting with the Trase PostgreSQL database
(For example, running ingestions or processing datasets).
- Get one of the existing team members to provide you with a PostgreSQL username and password.
- Create a PostgreSQL Connection Services file:
  - Windows: Create an empty file at `C:\Users\MyUser\AppData\postgresql\pg_services.conf`. (If your username has spaces this might be problematic! In this case choose a different location without spaces). Then set a system-wide environment variable `PGSERVICEFILE` to the above file path. Restart your shell or terminal emulator to pick up this new environment variable.
  - Unix: Run `touch ~/.pg_service.conf`.
  - Open the file and add the following:
        [production]
        user=my_user_name
        host=trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com
        dbname=trase
        [production_readonly]
        user=my_user_name
        host=trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com
        dbname=trase
        options=-c default_transaction_read_only=on
    Note: `production_readonly` is a required profile for the DBT-DuckDB pipeline in trase/data_pipeline.
- Create a Postgres Password File:
  - Windows: Create an empty file at `C:\Users\MyUser\AppData\postgresql\pgpass.conf`.
  - Unix: Run `touch ~/.pgpass && chmod 0600 ~/.pgpass`.
  - Open the file and add the following:
        trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com:5432:*:my_user_name:xxxxxx
- If you have `psql` installed, test your connection works:
  ```bash
  $ psql service=production -c "select 'hello, world'"
     ?column?
  --------------
   hello, world
  (1 row)
  ```
- If you have already followed the Python instructions above, select which database service Trase should use:
      $ poetry run trase config db
          service name  user     host                                                        dbname  port
      --  ------------  -------  ----------------------------------------------------------  ------  ----
       1  production    my_user  trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com  trase   5341
      Select a PostgreSQL service: 1
  For more information on configuration see Configuration in Trase.
If you are going to run DBT models then follow the instructions in [trase/database/dbt/README.md].
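On Unix, the service file and password file described in this section could also be created from Python. The sketch below is illustrative only: `write_pg_files` is a hypothetical helper, it writes just the `[production]` service, and it targets a temporary directory rather than your home directory so that it can be tried safely. The owner-only permission matters because libpq ignores a password file that is readable by group or others.

```python
import stat
import tempfile
from pathlib import Path

HOST = "trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com"


def write_pg_files(directory, user, password):
    """Write a service file and a password file (Unix names) into `directory`."""
    directory = Path(directory)
    service_file = directory / ".pg_service.conf"
    service_file.write_text(
        f"[production]\nuser={user}\nhost={HOST}\ndbname=trase\n"
    )
    password_file = directory / ".pgpass"
    # Line format: hostname:port:database:username:password
    password_file.write_text(f"{HOST}:5432:*:{user}:{password}\n")
    # Equivalent of `chmod 0600`: read/write for the owner only
    password_file.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return service_file, password_file


with tempfile.TemporaryDirectory() as tmp:
    service_file, password_file = write_pg_files(tmp, "my_user_name", "xxxxxx")
    print(service_file.read_text())
    print(password_file.read_text())
```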
D: If you are going to be contributing Python or Jupyter code to the repository
Certain coding standards are automatically applied and enforced using pre-commit.
- Install pre-commit.
  If you are using a virtual environment, install pre-commit outside the virtual environment:
      $ deactivate || true # deactivate virtual environment
      $ pip3 install pre-commit
- Install the pre-commit hooks:
      $ cd path/to/TRASE
      $ pre-commit install
      pre-commit installed at .git/hooks/pre-commit
- Run the hooks to test that it is working:
      $ pre-commit run --all-files
      black....................................................................Passed
      jupyter-notebooks-clear-outputs..........................................Passed
- (Optional) Pre-commit will abort the Git commit if your code is not conformant.
  This can be annoying because you have to commit twice!
  To avoid this, configure your code editor to auto-format your code when you save the file:
  1. First find where the pre-commit binary is by running the command `which pre-commit`.
  2. Then add a "file watcher":
     - PyCharm: Add a new file watcher (Preferences/Settings > File Watchers > + > ) with the following settings:
       | Key | Value |
       |---|---|
       | Name | e.g. Pre-Commit |
       | File Type | Any |
       | Scope | Project Files |
       | Program | output of `which pre-commit` |
       | Arguments | `run --files $FilePath$` |
       | Output paths to refresh | `$FilePath$` |
       | Working Directory and Environment Variables > Working directory | `$ProjectFileDir$` |
       | Advanced Options > Auto-save edited files to trigger the watcher | Untick |
       | Advanced Options > Show console | Always |
     - Emacs: A pre-save hook using the package "blacken" can be added to the init file:
       ```emacs
       ;; Run blacken on Python buffers just before they are saved
       (defun blacken-python ()
         (when (and (stringp buffer-file-name)
                    (string-match "\\.py\\'" buffer-file-name))
           (blacken-buffer)))
       (add-hook 'before-save-hook 'blacken-python)
       ```
     - RStudio: Add an R environment variable to save files upon styling, and add a hotkey for styling + saving your active file:
       1. Install the styler package: `install.packages("styler")`
       2. Run `usethis::edit_r_profile()` to open up your .Rprofile file, which runs each time you start R.
       3. Add the line `Sys.setenv(save_after_styling = TRUE)` to your .Rprofile file, and save.
       4. In RStudio, click through Tools -> AddIns -> Browse Add-Ins. That brings up a list of available add-ins, and in the bottom left-hand side of the box there is a "Keyboard Shortcuts..." button. Click that, and in the next menu you can add a keyboard shortcut for "Style active file."
       5. Set the shortcut for "Style active file" to whatever combination you want to use to save and style an R or R Markdown file (`Ctrl + Alt + S` is a nice default). Note that this shortcut will not work for non-R file types.
       6. Restart R. The active file in RStudio will now be styled and saved with `Ctrl + Alt + S`.
  3. Let's check it's working. Edit any file within the project (i.e. not a scratch file) in your IDE and then save it.
     If you are using PyCharm, check that you get a dialog box with the output from pre-commit, e.g.:
         black....................................................................Passed
         check yaml...............................................................Passed
     You may need to tick a box which says "Trust project and run" in order for this to work.
  4. If you are using PyCharm, go back to the settings for the file watcher and choose Advanced options > Show console > Never.
You might also want to use isort for sorting imports and ssort for ordering statements within a file. Neither of these is enforced in the codebase, however.
E: If you are going to be interacting with Google Cloud Platform (GCP)
If you want to access a GCP service like BigQuery or Gemini from scripts or the command line then you need to generate credentials.
For this step you will need:
- A trase.earth user account (note: an email alias is insufficient)
- For that user account to have been added to the "trase" (`trase-396112`) project
- The `gcloud` command-line tool
Once these prerequisites are met, follow these steps:
- Authenticate the CLI, ensuring that you use your @trase.earth account when the browser opens:
      gcloud auth login
- Generate API access keys, called "Application Default Credentials" (ADCs):
      gcloud auth application-default login
  This will populate a file on your local computer:
  - Linux/MacOS: `~/.config/gcloud/application_default_credentials.json`
  - Windows: `%APPDATA%/gcloud/application_default_credentials.json`
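After generating the credentials, you may want to confirm that the file landed where expected. The following is a minimal sketch using only the standard library; `adc_path` is a hypothetical helper that simply rebuilds the two locations listed above, and it does not validate the contents of the file.

```python
import os
from pathlib import Path

ADC_FILENAME = "application_default_credentials.json"


def adc_path(os_name=os.name, environ=os.environ):
    """Return the location where gcloud is expected to write the ADC file."""
    if os_name == "nt":  # Windows: %APPDATA%/gcloud/...
        return Path(environ.get("APPDATA", "")) / "gcloud" / ADC_FILENAME
    # Linux/MacOS: ~/.config/gcloud/...
    return Path.home() / ".config" / "gcloud" / ADC_FILENAME


path = adc_path()
print(f"{path}: {'found' if path.exists() else 'missing'}")
```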
Google Cloud Roles
Here are some roles that we use:
| Effect | Role(s) |
|---|---|
| Permission to query datasets in BigQuery | roles/bigquery.dataViewer, roles/bigquery.jobUser |
| Permission to delete/create/etc datasets | roles/bigquery.dataOwner |
| Full admin permission on BigQuery | roles/bigquery.admin |
The easiest way to assign these roles is via the Google Cloud Console.
Or you can use the gcloud CLI:
gcloud projects add-iam-policy-binding trase-396112 \
  --member="user:joe.blogs@trase.earth" \
  --role="roles/bigquery.admin"
F: Other dependencies
- We use Docker for our AWS Lambda jobs.
- The data pipeline app can be deployed using the Elastic Beanstalk command line interface (EB CLI).
- Some diagrams are written using GraphViz.
- There are many Jupyter notebooks in this repository which you can run locally if you wish.
- SEI-PCS models require GLPK, a solver for linear programming problems.
  - You can install GLPK for Ubuntu like this: `sudo apt-get install glpk-utils`, and for MacOS like this: `brew install glpk`
  - If you do not have GLPK installed, and you try to run one of the SEI-PCS models, you may see an error like this:
        pulp.apis.core.PulpSolverError: PuLP: cannot execute glpsol
- Tables of contents in documentation are generated using markdown-toc.
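To catch a missing GLPK installation before a model run fails with the `PulpSolverError` shown above, you could check whether the `glpsol` executable is on your `PATH`. This is a minimal sketch using the standard library; `find_glpsol` is a hypothetical helper and is not part of the Trase tools.

```python
import shutil


def find_glpsol(which=shutil.which):
    """Return the path to the glpsol executable, or None if it is not on PATH."""
    return which("glpsol")


location = find_glpsol()
if location is None:
    print("glpsol not found: install GLPK before running SEI-PCS models")
else:
    print(f"glpsol found at {location}")
```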
On Python Dependencies
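Our Python code relies heavily on external libraries. It is important that we all have the same libraries installed, and moreover the same versions of those libraries. We specify Python dependencies in two places:
- `pyproject.toml` (pinned by `poetry.lock`), managed by Poetry, for "core" dependencies
- `extra-requirements.txt`, installed with Pip, for "extra" dependencies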
If you wish to add a dependency, you have two choices:
- Add to Poetry using `poetry add my-library`
- Add to Pip:
  - Run `poetry run pip install my-library --upgrade-strategy only-if-needed`
  - Note the version that was installed, say "1.2.3"
  - Edit the file `extra-requirements.txt`, adding a line `my-library===1.2.3` so that others are able to install the same library.
Generally you would do the first option for a "core" library; for example one that is required by trase.tools.pcs or similar.
You would do the second for an "extra" library; such as one that you have used in trase/data in a preprocessing script.
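For the Pip route, the line to add to `extra-requirements.txt` can be produced from the version that was actually installed. The following is a minimal sketch using the standard library's `importlib.metadata`; `pin_line` is a hypothetical helper, and `my-library` is the same placeholder name used above.

```python
from importlib import metadata


def pin_line(distribution, version=None):
    """Build an extra-requirements.txt line pinning `distribution` exactly."""
    if version is None:
        # Look up the version installed in the current environment;
        # raises PackageNotFoundError if the distribution is not installed
        version = metadata.version(distribution)
    return f"{distribution}==={version}"


print(pin_line("my-library", "1.2.3"))
```

When the version argument is omitted, run the script with `poetry run python` so that the lookup happens inside the Poetry-managed environment.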
Below we list the advantages and disadvantages of both approaches:
| | Poetry | Pip |
|---|---|---|
| Advantages | | |
| Disadvantages | | |