Mill Merge


This page is synchronized from trase/data/indonesia/palm_oil/logistics/mills/mill_merge/readme.md. Last modified on 2025-12-14 23:19 CET by Trase Admin. Please view or edit the original file there; changes should be reflected here after a midnight build (CET time), or manually triggering it with a GitHub action (link).

Mill data consolidation process

Author: Jason J Benedict (jasonjb82@ucsb.edu)
Date: December 5, 2018

First step: This step uses the millsMerge.py script. For every mill in an external dataset, it generates matrices of mill name, coordinate and parent company matches against the mills in the AKA files (alternative mill names, coordinates and parent companies).
Input files:
- AKA files in CSV format (mill names, coordinates and parent companies)
- External mill dataset, which should ideally have at least columns for mill names, mill parent company names and coordinates
Output files: Matrix files (CSV format) of matches for each mill in the external dataset against the reference mill AKA files, covering mill name, parent company and coordinates
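
As a rough illustration of what such match matrices can look like, the sketch below scores every external mill against every reference mill. It is not the millsMerge.py implementation: the field names (`name`, `parent`, `lat`, `lon`), the use of `difflib` for name similarity and the haversine distance are all assumptions made for the example, and the sample rows are made up.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity ratio in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def build_matrices(external, reference):
    """Return (name, parent, distance) matrices.

    Rows are external mills, columns are reference (AKA) mills.
    """
    name_m = [[name_similarity(e["name"], r["name"]) for r in reference] for e in external]
    parent_m = [[name_similarity(e["parent"], r["parent"]) for r in reference] for e in external]
    dist_m = [
        [haversine_km(e["lat"], e["lon"], r["lat"], r["lon"]) for r in reference]
        for e in external
    ]
    return name_m, parent_m, dist_m


# Made-up example rows (hypothetical field names and values).
external = [{"name": "PT Example Mill", "parent": "Example Group", "lat": 1.0, "lon": 101.0}]
reference = [
    {"name": "pt example mill", "parent": "Example Group", "lat": 1.0, "lon": 101.0},
    {"name": "Other Mill", "parent": "Other Holdings", "lat": 2.0, "lon": 102.0},
]
name_m, parent_m, dist_m = build_matrices(external, reference)
```

Each matrix could then be written to its own CSV file, with one row per external mill and one column per reference mill.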
Second step: This step uses the matrixReformat.py script. It takes the matrix output files from the previous step and creates a combined mill matching output file plus two (2) separate spreadsheets, one with potential matches and one with unmatched mills. Mill names and coordinates (distances between mills) are matched at several levels, and the thresholds for each level can be adjusted to get better-matched outputs. For each level of matching, a column is added containing an integer that gives its position in the matching sequence.
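
The tiered matching described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the logic of matrixReformat.py: the threshold values, the looser review limits and the row fields (`ext_id`, `ref_code`, `name_score`, `dist_km`) are all hypothetical.

```python
# Hypothetical levels: (sequence number, minimum name similarity, maximum distance in km).
LEVELS = [(1, 0.95, 1.0), (2, 0.85, 5.0), (3, 0.70, 10.0)]
# Hypothetical looser limits for candidates that are sent to manual review.
REVIEW_NAME, REVIEW_KM = 0.50, 25.0


def match_level(name_score, dist_km):
    """Return the first (strictest) level whose thresholds are met, else None."""
    for level, min_name, max_km in LEVELS:
        if name_score >= min_name and dist_km <= max_km:
            return level
    return None


def split_matches(candidates):
    """Split candidate rows into matched, potential and unmatched groups."""
    by_mill = {}
    for row in candidates:
        by_mill.setdefault(row["ext_id"], []).append(row)

    matched, potential, unmatched = [], [], []
    for ext_id, rows in by_mill.items():
        hits = []
        for row in rows:
            level = match_level(row["name_score"], row["dist_km"])
            if level is not None:
                hits.append((level, row["dist_km"], row))
        if hits:
            # Prefer the strictest level, then the nearest mill.
            level, _, best = min(hits, key=lambda h: (h[0], h[1]))
            matched.append({**best, "match_level": level})
            continue
        review = [r for r in rows if r["name_score"] >= REVIEW_NAME or r["dist_km"] <= REVIEW_KM]
        if review:
            potential.extend(review)
        else:
            unmatched.append(ext_id)
    return matched, potential, unmatched


# Made-up candidate rows.
candidates = [
    {"ext_id": "E1", "ref_code": "M-001", "name_score": 0.97, "dist_km": 0.4},
    {"ext_id": "E1", "ref_code": "M-002", "name_score": 0.60, "dist_km": 30.0},
    {"ext_id": "E2", "ref_code": "M-003", "name_score": 0.55, "dist_km": 40.0},
    {"ext_id": "E3", "ref_code": "M-004", "name_score": 0.20, "dist_km": 80.0},
]
matched, potential, unmatched = split_matches(candidates)
```

Tightening the values in `LEVELS` moves more mills into the potential and unmatched groups, and loosening them does the opposite, which is the trade-off to keep in mind when adjusting thresholds.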
Third step: This step uses the nonMatches.py script. The spreadsheet with potential matches from the previous step gives you one or more matches to choose from for each mill. Mark the correct match by adding a value of 'Y' or 'N' in an additional column called 'confirm'. The spreadsheet is then run through the script to generate the unmatched mills spreadsheet. In this sheet, you will need to manually add the mill code of the correct match in a column called 'match'. If there is no match, insert 'none' in the match column. The script finally combines the initial automated matches with the manually checked matches from the potential and unmatched mills spreadsheets. It also assigns new codes to new mills from the external dataset (if applicable).
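
A minimal sketch of the final combination is shown below. It is an assumption-laden illustration rather than the nonMatches.py code: the row fields (`ext_id`, `ref_code`), the `NEW-` code prefix and the choice to treat every 'none' entry as a new mill needing a new code are hypothetical; only the 'confirm' and 'match' column names come from the description above.

```python
def consolidate(auto_matches, potential_rows, unmatched_rows, prefix="NEW-", start=1):
    """Map each external mill id to a mill code.

    auto_matches:   rows matched automatically in the previous step
    potential_rows: reviewed rows with a 'confirm' value of 'Y' or 'N'
    unmatched_rows: reviewed rows with a 'match' value (a mill code or 'none')
    """
    final = {row["ext_id"]: row["ref_code"] for row in auto_matches}

    # Keep only the candidates a reviewer confirmed with 'Y'.
    for row in potential_rows:
        if row["confirm"].strip().upper() == "Y":
            final[row["ext_id"]] = row["ref_code"]

    # Use the manually entered code, or mint a new one for 'none' (assumption).
    next_number = start
    for row in unmatched_rows:
        code = row["match"].strip()
        if code.lower() == "none":
            final[row["ext_id"]] = f"{prefix}{next_number:04d}"
            next_number += 1
        else:
            final[row["ext_id"]] = code
    return final


# Made-up reviewed rows.
auto_matches = [{"ext_id": "E1", "ref_code": "M-001"}]
potential_rows = [
    {"ext_id": "E2", "ref_code": "M-003", "confirm": "Y"},
    {"ext_id": "E2", "ref_code": "M-005", "confirm": "N"},
]
unmatched_rows = [
    {"ext_id": "E3", "match": "M-010"},
    {"ext_id": "E4", "match": "none"},
]
final = consolidate(auto_matches, potential_rows, unmatched_rows)
```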
Fourth step: This step uses the updateAKAs.py script. It updates the AKA files with mill names, coordinates and parent company names from the consolidated master dataset. It takes the current AKA files as input and outputs the updated AKA files.
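
The update can be pictured as appending any alternative name that is not yet recorded for a mill. The sketch below covers mill names only and assumes a hypothetical in-memory layout (a dict from mill code to a list of names, and master rows with `code` and `name` fields); the real AKA file layout and the updateAKAs.py logic may differ.

```python
def update_aka_names(aka, master):
    """Return a copy of the AKA mapping with unseen names from the master rows added.

    Names are compared case-insensitively so the same name is not stored twice.
    """
    updated = {code: list(names) for code, names in aka.items()}
    for row in master:
        names = updated.setdefault(row["code"], [])
        seen = {n.strip().lower() for n in names}
        name = row["name"].strip()
        if name and name.lower() not in seen:
            names.append(name)
    return updated


# Made-up example data.
aka = {"M-001": ["PT Example Mill"]}
master = [
    {"code": "M-001", "name": "Example Mill Unit 1"},  # new alternative name
    {"code": "M-001", "name": "pt example mill"},      # already known, different case
    {"code": "NEW-0001", "name": "Brand New Mill"},    # mill not yet in the AKA data
]
updated = update_aka_names(aka, master)
```

The same pattern would apply to coordinates and parent company names, each with a comparison suited to that field.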