This page is synchronized from trase/models/argentina/soy/siogranos_matching/siogranos_matching_analysis.ipynb. Last modified on 2025-12-13 00:30 CET by Trase Admin.
SIOGRANOS matching with OpenRefine
We used OpenRefine and reconcile-csv (http://okfnlabs.org/reconcile-csv/) to do fuzzy matching on a combined identifier of PROVINCE + LOCALITY for our Argentina data. We looked at every possible match where more than 100,000 tons were listed in the SIOGRANOS_2016_UNMATCHED.csv file.
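For context, the combined identifier can be built on both sides before reconciling. This is only a sketch: the BAHRA column names (`PROVINCE`, `LOCALITY`) and the reconcile-csv invocation below are assumptions based on the files used in this notebook and on the reconcile-csv documentation, not code from our actual pipeline.

```r
library(tidyverse)

# Build the combined PROVINCE + LOCALITY key on both datasets
bahra <- read_csv("BAHRA_CLEAN_edited2.csv") %>%
  mutate(PROVINCE_LOCALITY = paste0(PROVINCE, "_", LOCALITY))

siogranos <- read_delim("SIOGRANOS_2016_UNMATCHED.csv", delim = ";") %>%
  mutate(PROVINCE_LOCALITY = paste0(PROVINCE, "_", LOCALITY_CLEAN))

# reconcile-csv then serves the BAHRA file as a local reconciliation
# endpoint that OpenRefine fuzzy-matches against, invoked roughly as:
#   java -jar reconcile-csv-0.1.2.jar BAHRA_CLEAN_edited2.csv PROVINCE_LOCALITY ID
```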
This notebook briefly reviews our results to determine how much of the total volume has been matched so far, and to check the effectiveness of using OpenRefine.
library(tidyverse)  # also loads ggplot2
library(scales)
Read in data
We're reading in a few datasets here: the original files used for matching (BAHRA and SIOGRANOS), and two outputs from OpenRefine (one containing all results, and one containing unmatched results only).
post_match_data <- read_csv("SIOGRANOS_2016_ALL_POST_MATCHING.csv")
openrefine_no_match <- read_csv("SIOGRANOS_2016_NO_MATCH.csv")
original_df <- read_delim("SIOGRANOS_2016_UNMATCHED.csv", delim = ';')
bahra_matching <- read_csv("BAHRA_CLEAN_edited2.csv")
Taking a look at these datasets
Just a quick sanity check and a refresher on what's in these datasets.
head(bahra_matching)
head(post_match_data)
head(openrefine_no_match)
Adding signifiers for matches
We add a column indicating whether each row has been matched at this point, and another flagging rows that only matched after OpenRefine's fuzzy reconciliation, i.e. where the Province + Locality identifier did not match directly across datasets (useful to know, since direct matches would not have needed OpenRefine to be linked).
post_match_data <- post_match_data %>%
  mutate(
    # TRUE if the (possibly OpenRefine-corrected) identifier exists in BAHRA
    MATCHED = PROVINCE_LOCALITY %in% bahra_matching$PROVINCE_LOCALITY,
    # TRUE if matched, but only because OpenRefine changed the identifier
    FUZZY_MATCHED_BY_OPENREFINE = MATCHED &
      (paste0(PROVINCE, "_", LOCALITY_CLEAN) != PROVINCE_LOCALITY)
  )
post_match_data %>%
group_by(MATCHED, FUZZY_MATCHED_BY_OPENREFINE) %>%
summarise(count = n(), TONS = sum(TONS)) %>%
ungroup() %>%
mutate(PCT = TONS / sum(TONS))
It looks like 22% of the data (by volume) was matched directly, and another 52% was matched with OpenRefine. So far, we've matched about 1.8 billion tons.
To reach the target of 90% matched volume, we need about 2.2 billion tons, so let's figure out how much work is required to cover the remaining ~400 million tons:
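As a quick sanity check, the target follows from the rounded figures above (this is back-of-the-envelope arithmetic on the percentages, not a computation from the raw data):

```r
# Rough arithmetic behind the 90% target, from the rounded figures above
matched_tons <- 1.8e9                 # matched so far (~22% + ~52% = ~74% of total)
total_tons   <- matched_tons / 0.74   # ~2.43 billion tons overall
target_tons  <- 0.90 * total_tons     # ~2.19 billion tons needed for 90%
target_tons - matched_tons            # ~0.39 billion tons still to match
```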
# Look at the top N rows of the unmatched data and see how much volume is covered:
top_n_tons <- function(n) {
  openrefine_no_match %>%
    slice(1:n) %>%
    summarise(sum(TONS)) %>%
    pull()
}

top_5  <- top_n_tons(5)
top_10 <- top_n_tons(10)
top_25 <- top_n_tons(25)
top_50 <- top_n_tons(50)

cat(paste0("5: ", comma(top_5), "\n", "10: ", comma(top_10), "\n", "25: ", comma(top_25), "\n", "50: ", comma(top_50)))
openrefine_no_match %>%
  mutate(id = row_number()) %>%
  filter(id <= 100) %>%
  ggplot() +
  geom_line(aes(x = id, y = cumsum(TONS))) +
  scale_y_continuous(labels = scales::comma)
It looks like manually researching roughly the top 25 remaining rows will get us to 90% matched volume.