Using R on Trase
A collection of R snippets which perform common actions used in SEI-PCS.
- Making Markdown Graphs Visible in GitHub
- Working with Git and RStudio
- All praise the tidyverse
- Connect R to S3
- Load single csv from S3
- Load single parquet from S3
- Load multiple or partitioned parquet files from S3
- Load multiple csvs simultaneously from S3
- Load spatial data
- Load single excel file from S3
- Save parquet to S3
- Save csv/sf to S3
- Save JSON to S3
- Save/load R objects to S3
- Move files from Google drive to S3
- Remove special characters from strings
- Load data from PostgreSQL DB -> R
- Load data from BigQuery DB -> R
- Working with nested data.frames
- Visualizing data in R
- Using VSCode
Making Markdown Graphs Visible in GitHub
RMarkdown and Quarto are great: embedding graphs means that anyone coming to a script in the future has an easy, visual way to understand it, much like a notebook. The preferred way of using markdown with R is Quarto, which produces GitHub-friendly Markdown files that can be viewed directly on GitHub. However, RMarkdown works perfectly well too. Here is an example of how it looks.
Here is how to do this using Quarto:
- If you are using VS Code, install the Quarto extension from within the Extensions tab.
- Install the R packages needed: install.packages(c("quarto", "tidyverse", "palmerpenguins"))
- Create a .qmd file with the following YAML header at the start:
---
title: "Title"
author: "Joe Bloggs"
date: Jan XX, 20XX
format: gfm
execute:
  warning: false
  message: false
  echo: false
---
```{r}
#| label: fig-airquality
#| fig-cap: "Temperature and ozone level."
#| warning: false
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) +
  geom_point() +
  geom_smooth(method = "loess")
```
- Render (preview in VS Code) the Quarto document into its final output format.
- Commit all files and graphics (.md, .qmd, .png) to the repository. The markdown document will automatically pull all figures into the document when rendered. Here is an example of how that looks.

Sources: Quarto
Here is how to do this using RMarkdown:
- If you are not using RStudio and haven't installed Pandoc, then install it (http://pandoc.org).
- Install RMarkdown:
install.packages('rmarkdown')
- Include the following text at the start of your .RMD file:
---
title: "Title"
author: "Joe Bloggs"
date: Jan XX, 20XX
output:
  md_document:
    variant: markdown_github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.align = 'center', fig.width = 12, fig.asp = 0.7, warning = FALSE, message = FALSE)
```
- Render the R Markdown document into its final output format. There are two ways to do this:
  - If you are using RStudio or VSCode, the "Knit" button (Ctrl+Shift+K) will render the document and display a preview of it.
  - If you are not using RStudio or VSCode, call the rmarkdown::render function, for example: Rscript -e "rmarkdown::render('path/to/input.Rmd')". Note that both methods use the same mechanism; RStudio's "Knit" button calls rmarkdown::render() under the hood.
- Commit all files and graphics (.md, .rmd, .png) to the repository. The markdown document will automatically pull all figures into the document when rendered. Here is an example of how that looks.
If you want PDF output you can install tinytex:
install.packages('tinytex')
tinytex::install_tinytex()
Working with Git and RStudio
You can read an introduction written by Erasmus here.
All praise the tidyverse
library(tidyverse)
Connect R to S3
First ensure that you have set up AWS credentials.
Then connect like so:
library(aws.s3)
aws.signature::use_credentials()
Sys.setenv("AWS_DEFAULT_REGION" = "eu-west-1")
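As a quick check that the connection works, you can list the buckets your credentials can access:

```r
# Returns a data.frame of accessible buckets if credentials are set up correctly
bucketlist()
```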
Load single csv from S3
Also see our documentation on CSV Files in Trase.
# Load single csv
# ... can also use s3read_using()
muni_biomes <- read_delim(
  get_object(
    "brazil/spatial/BOUNDARIES/ibge/old/boundaries municipalities/MUNICIPALITIES_BIOMES_TRASE.csv",
    bucket = "trase-storage",
    check_region = TRUE
  ),
  col_types = cols(geocode = col_character()),
  delim = ";"
)
Load single parquet from S3
# Load single parquet
library(arrow)
data <- arrow::read_parquet("s3://trase-storage/ethiopia/trade/bol/2020/gold/gold_ethiopia_coffee_2020.parquet")
Load multiple or partitioned parquet files from S3
For multiple parquet files use open_dataset instead; specify partitions if there are any, and filter on them to avoid loading the whole dataset into R.
library(arrow)
library(dplyr)
dataset <- open_dataset("s3://trase-storage/world/trade/statistical_data/comtrade/comtrade_monthly/", partitioning = c("refYear", "cmdCode"))
result <- dataset %>%
  filter(
    refYear == 2020,
    cmdCode %in% c('090111', '090112', '090121', '090122', '090190', '210111', '210112')
  ) %>%
  group_by(reporterISO, reporterDesc, cmdCode) %>%
  summarise(tonnes = round(sum(netWgt, na.rm = TRUE) / 1000, 2), .groups = "drop") %>%
  arrange(reporterISO, desc(tonnes))
# This query will be pushed down to Arrow and only final results will be brought into R.
result_df <- collect(result)
head(result_df)
Load multiple csvs simultaneously from S3
Also see our documentation on CSV Files in Trase.
bucket <- "trase-storage"
path <- "brazil/beef/trade/cd/combined/"
keys <- map_chr(get_bucket(bucket, prefix = path), "Key")
keys <- keys[str_detect(keys, "CD_COMBINED_BEEF_")]
keys <- keys[!grepl("archived", keys)]
years_of_interest <- c(2018, 2019, 2020) # example years; adjust as needed
which_keys_to_load <- unlist(map(years_of_interest, ~ keys[grepl(.x, keys)]))
objects_csv <- map(which_keys_to_load, ~ get_object(., bucket))
data_csv <- map_df(
  objects_csv,
  ~ read_delim(
    .,
    delim = ";",
    col_types = cols(ncm = col_character()),
    locale = locale(encoding = "UTF-8"),
    trim_ws = TRUE
  )
)
rm(which_keys_to_load, path, keys, objects_csv) # tidy up
Load spatial data
# json, geojson
departments <- s3read_using(
  object = "argentina/spatial/BOUNDARIES/departments/2019/AR_DEPARTMENTS_SIMPLIFIED.json",
  FUN = read_sf,
  bucket = "trase-storage",
  opts = c("check_region" = TRUE)
)
# Load shp
# ... this saves the file locally, then reads it in.
# NB It's better to work with geojson/json
ts <- "trase-storage"
path <- "brazil/spatial/BOUNDARIES/ibge/old/boundaries states/"
keys <- map_chr(get_bucket(ts, prefix = path), "Key")
walk(keys, ~ save_object(., ts, file = str_c(tempdir(),"/", basename(.))))
objects <- list.files(tempdir(), pattern = "BRUFE250GC_SIR", full.names = TRUE)
objects_main <- objects[grepl("\\.shp$", objects)]
states <- read_sf(objects_main)
rm(ts, path, keys, objects, objects_main)
Load single excel file from S3
# Load excel
# ... read_excel() comes from the readxl package
# ... NB this file has since been moved on S3; the path below needs replacing
library(readxl)
data_excel <- s3read_using(
  object = "data/1-TRADE/CD/EXPORT/COTE_DIVOIRE/ORIGINALS/RELEVE EXPORT 2016-2018.xlsx",
  bucket = "trase-cocoa",
  FUN = read_excel,
  sheet = "Cacao",
  skip = 3,
  opts = c("check_region" = TRUE)
)
Save parquet to S3
library(arrow)
arrow::write_parquet(dataframe, "s3://trase-storage/path/to/file.parquet")
For the above to work you need to have compiled Arrow with S3 support:
install.packages("arrow", repos = "https://cloud.r-project.org/", configure.args = "--with-s3")
Alternatively, you can write to a local file first:
temporary_file_path <- tempfile(fileext = ".parquet")
write_parquet(dataframe, temporary_file_path)
put_object(
  file = temporary_file_path,
  object = "path/to/file.parquet",
  bucket = "trase-storage"
)
unlink(temporary_file_path)
Save csv/sf to S3
Also see our documentation on CSV Files in Trase.
# Use write_delim for csv
s3write_using(
  x,
  object,
  bucket = "trase-storage",
  FUN = write_delim,
  delim = ";",
  na = "",
  opts = c("check_region" = TRUE)
)
For database ingestion, character strings need double quotation marks ("double").
# Use write.table for creating csvs for database ingestion
options(scipen = 999) # avoid scientific notation (applies to write.table, but not to write_delim above)
s3write_using(
  x,
  object,
  bucket = "trase-storage",
  FUN = write.table,
  sep = ";",
  row.names = FALSE,
  quote = TRUE,
  opts = c("check_region" = TRUE, multipart = TRUE)
)
Save JSON to S3
# NB cannot write JSONs with s3write_using()
# ... toJSON() comes from jsonlite, write_lines() from readr
library(jsonlite)
zz <- rawConnection(raw(0), "r+")
sifs_all3 %>%
  toJSON(pretty = TRUE, Date = "ISO8601") %>%
  write_lines(zz)
aws.s3::put_object(
  file = rawConnectionValue(zz),
  bucket = "trase-storage",
  object = "data/3-LOGISTICS/BRAZIL/SANITARY_INSPECTIONS/ANIMAL_PRODUCTS/SIF/OUT/2020_SIF.json",
  check_region = TRUE
)
close(zz)
rm(zz)
Save/load R objects to S3
There are convenient s3save and s3load functions in {aws.s3}.
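A minimal sketch (the object key below is illustrative):

```r
library(aws.s3)
# Save one or more R objects to an .Rdata file on S3 ...
fit <- lm(mpg ~ wt, data = mtcars)
s3save(fit, bucket = "trase-storage", object = "path/to/objects.Rdata")
# ... and load them back into the current session later
s3load(object = "path/to/objects.Rdata", bucket = "trase-storage")
```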
Move files from Google drive to S3
library(googledrive)
library(sf)
library(arrow)
library(tidyverse) # for select(), pivot_longer(), str_sub()
# authenticate
drive_auth()
# Download the file to a temporary location
drive_download("XX.geojson", "~/downloads/XX.geojson", overwrite = TRUE)
tab <- read_sf("~/downloads/XX.geojson")
# example: convert from wide to long format
tab_l <- tab %>%
  as_tibble() %>%
  select(-geometry) %>%
  pivot_longer(-c(ibge_munic, ibge_state, name), names_to = "type", values_to = "ha")
tab_l <- tab_l %>%
  mutate(
    class = paste0(str_sub(type, end = -6), "_", "ha"),
    year = as.numeric(str_sub(type, start = -4))
  )
# write as parquet file to S3
s3write_using(
  tab_l,
  FUN = write_parquet,
  object = "country/commodity/indicators/in/qX_XXXX/XXX.parquet",
  opts = c("check_region" = TRUE),
  bucket = "trase-storage"
)
Remove special characters from strings
str_trans <- function(x) {
x %>%
stringi::stri_trans_toupper() %>%
stringi::stri_trans_general("Latin-ASCII")
}
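For example:

```r
str_trans("São Paulo")
#> [1] "SAO PAULO"
```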
Load data from PostgreSQL DB -> R
This requires permissions to access the database, as mentioned in the main TRASE README PostgreSQL section, and the system environment variables PGUSER and PGPASSWORD set to your credentials.
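One way to set these for the current session (the values are placeholders); alternatively, add them to your ~/.Renviron file:

```r
# Placeholder credentials -- replace with your own
Sys.setenv(PGUSER = "your_username", PGPASSWORD = "your_password")
```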
The following example loads some flows from Brazil soy:
# Load data from DB -> R
library(DBI)
library(RPostgres)
library(dplyr)
library(dbplyr)
# Establish database connection -------------------------------------------
# User and password are taken from PGUSER and PGPASSWORD environment variables
con <- dbConnect(
  Postgres(),
  host = "trase-db-instance.c6jsbtgl0u2s.eu-west-1.rds.amazonaws.com",
  dbname = "trase"
)
# Review available tables -------------------------------------------------
tables <- dbListTables(con)
sort(tables)
# Get data ----------------------------------------------------------------
# connect to brazil soy flows
data <- tbl(con, in_schema("supply_chains", "brazil_soy_v2_6_0"))
# Standard dplyr syntax works
brazil_commodities_2017 <- data %>%
  filter(country_of_first_import == "NETHERLANDS") %>%
  group_by(key = year) %>%
  summarise(value = sum(volume, na.rm = TRUE)) %>%
  collect() # Don't forget to collect the results
See Trase Database for more.
Load data from BigQuery DB -> R
This requires that you have access permissions to BigQuery.
If you are running the script interactively (manually), you need to first install the Google Cloud CLI (installation instructions here), and then specify the Google account to use for authentication. To do so, run the following Google Cloud CLI command:
gcloud auth application-default login
It should prompt you to log in with a Google account, which must have the appropriate permissions.
For non-interactive scripts, you can use a Service Account, download its authentication JSON, and have the environment variable GOOGLE_APPLICATION_CREDENTIALS point to it. Instructions on how to create these service credentials can be found at https://github.com/sei-international/TRASE/blob/master/README.md#e-if-you-are-going-to-be-interacting-with-google-cloud-platform-gcp.
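For example, from within R (the key path is a placeholder):

```r
# Point Google client libraries at the service account key before connecting
Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = "/path/to/service-account-key.json")
```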
Bear in mind that dbplyr supports only a subset of dplyr's data manipulation functions. Although dbplyr translates them into BigQuery SQL, you can also send SQL statements directly (see the sketch below), which gives you greater data manipulation capabilities (e.g. window functions, custom SQL functions, complex joins and subqueries, and BigQuery-specific functions).
bigrquery uses lazy evaluation: data is only fetched and transformations only applied when you ask for the result, usually via collect(). Only collect once you know the result is reasonably small. This keeps aggregations and transformations in BigQuery, which tends to be much more efficient for gigabyte-scale or million-row inputs.
See Bigrquery for more.
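As a minimal sketch of the direct-SQL route mentioned above, reusing the project, dataset and table from the example below:

```r
library(bigrquery)
# Run the aggregation in BigQuery; only the small result is downloaded
tb <- bq_project_query(
  "trase-396112",
  "SELECT context_slug, SUM(volume) AS total_volume
   FROM website.supply_chains_timeseries
   GROUP BY context_slug
   ORDER BY total_volume DESC"
)
result <- bq_table_download(tb)
```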
library(dplyr)
library(dbplyr)
library(bigrquery)
# Establish database connection -------------------------------------------
bq_project <- "trase-396112"
bq_dataset <- "website"
bq_con <- src_bigquery(project = bq_project, dataset = bq_dataset)
sc_timeseries_bq <- tbl(bq_con, in_schema(bq_dataset, "supply_chains_timeseries"))
# Review available tables -------------------------------------------------
tables_list <- bq_dataset_tables(paste(bq_project, bq_dataset, sep = "."))
sapply(tables_list, function(x) x$table)
# Review metadata of a table
bq_table_meta(paste(bq_project, "pg_website_dev", "supply_chains_timeseries", sep = "."))
# Get data ----------------------------------------------------------------
# You can use dplyr syntax. Note that not all functions work
sc_timeseries_selection <- sc_timeseries_bq %>%
  filter(year %in% c(2020, 2021)) %>%
  group_by(context_slug) %>%
  summarize(
    total_volume = sum(volume, na.rm = TRUE) # sum volume, excluding NA values
  ) %>%
  arrange(desc(total_volume)) %>%
  collect()
View(head(sc_timeseries_selection, 5))
Working with nested data.frames
Below are some examples of how to nest and unnest data in data.frames. Much more detail and many more examples are available in the Wickham book.
library(tidyverse)
data("mtcars")
# Extract car name
mtcars <- mtcars %>%
janitor::clean_names("screaming_snake")
mtcars$CAR_NAME <- row.names(mtcars)
mtcars <- as_tibble(mtcars)
# Get a row with the car brand
mtcars <- mtcars %>%
mutate(CAR_BRAND = str_split(CAR_NAME, pattern = " "),
CAR_BRAND = map_chr(CAR_BRAND, 1))
# Simple way to nest a tibble (using the existing data)
mtcars %>%
select(CAR_BRAND, MPG) %>%
group_by(CAR_BRAND) %>%
nest(MPG_LIST = MPG)
# Make a tibble of data to sample
my_sample_wt <-
mtcars %>%
distinct(CAR_BRAND, WT)
# Add a nested list of car weights (per brand)
fn_sample_car_weights <- function(brand, data, size) {
  data %>%
    filter(CAR_BRAND == brand) %>%
    sample_n(size = size, replace = TRUE) %>%
    pull(WT)
}
mtcars_with_nested_list <-
mtcars %>%
distinct(CAR_BRAND) %>%
mutate(MY_SAMPLE_WT = map(.x = CAR_BRAND,
.f = fn_sample_car_weights,
data = my_sample_wt,
size = 100))
mtcars_with_nested_list
# Add a nested tibble of car weights
# ... here, the nested df has a 'NUM_RANDOMISATION'
mtcars_with_nested_df <-
mtcars_with_nested_list %>%
mutate(MY_SAMPLE_WT_DF = map(.x = MY_SAMPLE_WT,
.f = ~ tibble("WEIGHT" = .x) %>% mutate(NUM_RANDOMISATION = row.names(.)))) %>%
select(CAR_BRAND, MY_SAMPLE_WT_DF)
mtcars_with_nested_df
# Unnest, check the number of NUM_RANDOMISATION for Mazdas
# ... should be 100
mtcars_with_nested_df %>%
slice(1) %>%
unnest(cols = c(MY_SAMPLE_WT_DF)) %>%
distinct(NUM_RANDOMISATION)
# Another way of building nested data.frames
# ... using nest() of some columns already in the data
# ... then add Make column with MPG per brand, plus SAMPLE_NUMBER
mtcars %>%
select(CAR_BRAND, MPG) %>%
group_by(CAR_BRAND) %>%
nest(MPG_LIST = MPG) %>%
ungroup() %>%
mutate(MPG_LIST_WITH_NUMBER = map(MPG_LIST, ~ .x %>% mutate(NUM_RANDOMISATION = row.names(.))))
Visualizing data in R
There are Trase templates in Observable (e.g. https://observablehq.com/@trase/visual-id), and a traseviz package in R for visualisation. Some traseviz options (e.g. color palettes) are described here and here.
Using VSCode
This should work on Windows and Mac, but has only been tested on Mac.
- Install VS-code: https://code.visualstudio.com/
- Install R: https://cloud.r-project.org/
VS Code extensions:
- Install the R extension
- Install AWS CLI CONFIGURE
- Install AWS Toolkit
- Install GitHub Codespaces
- Install GitHub Repositories
Key packages to install in R to work with VS Code:
- Install languageserver in R, then toggle the Format on Type, Format On Save, and Format On Paste settings in the VS Code editor preferences, depending on your needs.
- Install httpgd in R.
Useful settings to add to settings.json
{
"files.associations": {
"*.rmd": "markdown",
"*.Rmd": "markdown",
"*.qmd": "quarto"
},
"r.rmarkdown.preview.autoShowPreviewToSide": true,
"r.rmarkdown.codeLens": true,
"r.rmarkdown.chunkArgs": ["eval = FALSE"],
"r.alwaysUseActiveTerminal": true,
"r.bracketedPaste": true,
"r.helpPanel.enableHoverLinks": false,
"r.lsp.diagnostics": false,
"application.shellEnvironmentResolutionTimeout": 30,
"terminal.integrated.scrollback": 10000,
"githubPullRequests.pullBranch": "never",
"workbench.startupEditor": "none",
"terminal.explorerKind": "external",
"files.dialog.defaultPath": "/Users/floriangollnow/dev",
"r.removeLeadingComments": true,
"r.workspaceViewer.removeHiddenItems": true,
"notebook.formatOnSave.enabled": true,
"editor.formatOnSave": true,
"r.rterm.option": ["--no-save", "--no-restore"],
"[r]": {
"editor.defaultFormatter": "REditorSupport.r",
"editor.formatOnSave": true
},
"[jsonc]": {
"editor.defaultFormatter": "esbenp.prettier-vscode",
"editor.formatOnSave": true
},
"[quarto]": {
"editor.defaultFormatter": "REditorSupport.r",
"editor.formatOnSave": true
},
"[rmarkdown]": {
"editor.defaultFormatter": "REditorSupport.r",
"editor.formatOnSave": true
},
"r.lsp.use_stdio": true,
"r.plot.useHttpgd": true
}
Useful keybindings to add in keybindings.json
// Place your key bindings in this file to override the defaults
[
// keybindings for R scripts.
{
"key": "Cmd+Shift+m",
"command": "type",
"args": {
"text": " %>% "
},
"when": "editorTextFocus && editorLangId == r"
},
{
"key": "ctrl+-",
"command": "type",
"args": {
"text": " <- "
},
"when": "editorTextFocus && editorLangId == r"
},
// keybindings for Rmarkdown
{
"key": "Cmd+Shift+m",
"command": "type",
"args": {
"text": " %>% "
},
"when": "editorTextFocus && editorLangId == rmd"
},
{
"key": "ctrl+-",
"command": "type",
"args": {
"text": " <- "
},
"when": "editorTextFocus && editorLangId == rmd"
},
{
"key": "ctrl+cmd+i",
"command": "editor.action.insertSnippet",
"when": "editorTextFocus",
"args": {
"snippet": "```{r}\n$0\n```"
}
},
// keybindings for quarto
{
"key": "Cmd+Shift+m",
"command": "type",
"args": {
"text": " %>% "
},
"when": "editorTextFocus && editorLangId == quarto"
},
// keybindings for R terminal (radian included)
{
"key": "Cmd+Shift+m",
"command": "workbench.action.terminal.sendSequence",
"args": {
"text": " %>% "
},
"when": "terminalFocus"
},
{
"key": "ctrl+-",
"command": "workbench.action.terminal.sendSequence",
"args": {
"text": " <- "
},
"when": "terminalFocus"
},
{
"key": "cmd+enter",
"command": "r.runSelection",
"when": "editorTextFocus && editorLangId == 'r'"
}
]
Sources:
- https://www.reddit.com/r/bioinformatics/comments/qp6kes/rmarkdown_on_vs_code/
- https://stackoverflow.com/questions/57395681/getting-visual-studio-code-to-auto-format-r-code
- https://www.schmidtynotes.com/blog/r/2021-09-28-vscode-rmd-code-chunk-snippet/