Machine Learning Rediscovers Simpler Equations Driving the Ocean’s Iron Cycle

Researchers applied symbolic regression, a machine learning technique for equation discovery, to output from an ocean biogeochemical model. The method did not recover the original equation exactly, but it derived simpler alternative equations governing the cycling of colloidal iron, a key component of the ocean iron cycle that influences phytoplankton productivity and carbon sequestration.

Symbolic regression starts with basic mathematical operators and searches for simple, interpretable equations that best fit the data. This differs from typical “black box” ML models (like neural networks) by producing explicit, human-readable formulas.

The target process is the cycling of colloidal iron (microscopic suspended iron particles) in the ocean. Ocean models use simplified equations for such complex biogeochemical processes, often based on limited observations and assumptions.

The ML-derived equations (a suite of six) differed from the original model equations but were functionally simpler. They performed comparably well in reproducing large-scale patterns of iron distribution.

This work validates symbolic regression (and equation discovery more broadly) for complex Earth systems. Traditional model equations rely on expert knowledge and sparse data; ML can help derive or refine them directly from simulations or observations, potentially improving climate and ocean models. It also provides practical guidance for future field sampling.

The paper is titled “Toward Using Equation Discovery to Generate Parameterizations of Biogeochemical Processes” (Wang et al., 2026). It builds on broader trends using ML for ocean science, such as data assimilation, process emulation, and hybrid physics-ML modeling.

_____________________________________________________________________________________

Toward Using Equation Discovery to Generate Parameterizations of Biogeochemical Processes

Paper Title: Toward Using Equation Discovery to Generate Parameterizations of Biogeochemical Processes

Authors: Chengwang Wang (University of Liverpool), B. B. Cael (University of Chicago), Alessandro Tagliabue (University of Liverpool)

Journal: Geophysical Research Letters (2026), Volume 53, Issue 12

DOI: 10.1029/2025gl121380

Plain Language Summary (from the paper)

Ocean models use simplified equations to represent complex biogeochemical processes, but these are often based on limited observations and strong assumptions. Symbolic regression offers a way to discover equations directly from data for more objective and transparent parameterizations.

The study tested whether symbolic regression can “rediscover” a known equation for colloidal iron (CFe) cycling using output from a state-of-the-art ocean biogeochemical model as a controlled surrogate for real observations. While it did not exactly reproduce the original empirical equation, it found simpler alternatives that performed similarly and reproduced large-scale iron patterns equally well.

Key finding: Success depends heavily on data sampling—full water-column coverage and observations from multiple ocean basins are essential. This suggests symbolic regression can bridge complex models and simplified parameterizations, and it will improve as observational datasets (e.g., GEOTRACES) expand.

GEOTRACES is an international program studying the marine biogeochemical cycles of trace elements and isotopes (TEIs), with dissolved iron (dFe) as a core “key parameter” measured on nearly all sections due to its role as a limiting micronutrient for phytoplankton in much of the ocean.

GEOTRACES uses a standardized, contamination-free approach:

  • Trace-metal clean sampling: Specialized rosettes, Niskin bottles, or in-situ pumps on dedicated cruises or GEOTRACES-compliant sections.
  • High-resolution sections: Full-depth profiles (often every 10–50 m in the upper ocean, sparser deeper) along basin-scale transects, plus process studies.
  • Multi-parameter approach: dFe is paired with isotopes (e.g., δ⁵⁶Fe for source tracing), ligands, size-fractionated (soluble vs. colloidal), particulate Fe, and other tracers (Ra, Th, Al, Mn) from the same water samples.
  • Colloidal iron (cFe): Often operationally defined as the fraction between ~0.02 µm and 0.2 µm (dFe minus soluble Fe). Measured via cross-flow filtration or similar on select cruises; it typically comprises a significant and dynamic portion of dFe (often ~half or more in some regions).
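
The operational definition in the last bullet is a subtraction of two filtered fractions. A minimal sketch of that arithmetic follows; the function name `colloidal_iron`, the zero clamp, and the example concentrations are illustrative assumptions, not part of any GEOTRACES protocol.

```python
def colloidal_iron(dfe, sfe):
    """Return (cFe, cFe/dFe) from dissolved Fe (<0.2 µm) and soluble Fe (<0.02 µm).

    Hypothetical helper for illustration; both inputs share one concentration unit.
    """
    if dfe <= 0:
        raise ValueError("dissolved Fe must be positive")
    # Assumption: measurement noise can push sFe above dFe, so clamp at zero.
    cfe = max(dfe - sfe, 0.0)
    return cfe, cfe / dfe


# Made-up concentrations in nmol/L, not real measurements.
cfe, fraction = colloidal_iron(dfe=0.60, sfe=0.25)
print(f"cFe = {cfe:.2f} nmol/L ({fraction:.0%} of dFe)")
```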

The latest GEOTRACES Intermediate Data Product 2025 (IDP2025) includes data from 123 cruises, with ~23,912 dissolved Fe values (up substantially from prior versions). Data are quality-controlled via intercalibration and crossover stations.

GEOTRACES data prompted a paradigm shift: sediments and hydrothermal sources are more important than previously thought (beyond just dust), with boundary exchange and long-distance transport playing major roles. This directly informs the equation discovery work in Wang et al. (2026), which highlighted the need for full water-column, multi-basin colloidal and dissolved Fe data to derive robust parameterizations.

Abstract (key excerpts)

Equation discovery methods like symbolic regression show promise for objective, data-driven biogeochemical parameterizations. Here, the authors applied it to rediscover an empirical equation for colloidal iron in an ocean model.

They introduced a robustness metric combining global pattern reproduction (R²) and functional similarity (EMD-SHAP). The discovered equations were simpler than the original but performed comparably. Robust results required full-depth, multi-basin sampling. The framework is transferable to other processes.
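
The two ingredients of the robustness metric can be sketched in a few lines. The code below computes R² and a one-dimensional earth mover's distance (EMD) between two equal-size samples. It assumes the samples stand in for SHAP attribution values of one input variable under the original and the discovered equation. How the paper combines the two numbers into a single score is not reproduced here, and all values are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def emd_1d(a, b):
    """Earth mover's distance between two equal-size 1-D samples.

    With equal sizes and uniform weights this is the mean absolute
    difference between the sorted samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Made-up target values and predictions from a candidate equation.
y_model = [1.0, 2.0, 3.0, 4.0]
y_candidate = [1.1, 1.9, 3.2, 3.8]

# Made-up attribution samples standing in for SHAP values (assumption).
attr_original = [0.2, -0.1, 0.4, 0.0]
attr_candidate = [0.1, -0.1, 0.5, 0.1]

print("R2 :", round(r_squared(y_model, y_candidate), 3))
print("EMD:", round(emd_1d(attr_original, attr_candidate), 3))
```

A high R² with a small EMD would indicate an equation that both reproduces the pattern and responds to its inputs in a similar way to the reference.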

Background and Approach

Target process:

Colloidal iron (CFe) cycling. In models, CFe is often derived from dissolved iron (DFe) minus Fe solubility, using an empirical equation from Liu and Millero (1999) based on temperature, salinity, and pH.
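
The structure of this calculation can be sketched as follows. The solubility function below is a made-up placeholder and is not the Liu and Millero equation, whose coefficients are not reproduced here. The zero clamp and all names (`colloidal_fe`, `toy_solubility`) are likewise assumptions for illustration.

```python
def toy_solubility(temperature_c, salinity, ph):
    """PLACEHOLDER solubility in nmol/L. NOT the Liu and Millero equation.

    The coefficients are invented only to show a dependence on
    temperature, salinity, and pH.
    """
    value = (0.4
             - 0.01 * temperature_c
             + 0.002 * (salinity - 35.0)
             + 0.5 * (8.1 - ph))
    return max(value, 0.0)


def colloidal_fe(dfe, temperature_c, salinity, ph, solubility=toy_solubility):
    """CFe as dissolved Fe in excess of the solubility limit.

    Assumption: CFe is clamped at zero when DFe is below solubility.
    """
    fe_sol = solubility(temperature_c, salinity, ph)
    return max(dfe - fe_sol, 0.0)


# Made-up water properties, not model output.
print(colloidal_fe(dfe=0.7, temperature_c=10.0, salinity=35.0, ph=8.1))
```

Passing the solubility function as an argument makes it easy to swap the placeholder for a real parameterization or for an equation returned by symbolic regression.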

Method:

Symbolic regression (e.g., via tools like PySR) starts with basic math operators and evolves equations via genetic algorithms to fit data while favoring simplicity.
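
The core idea, fitting the data while penalizing complexity, can be shown with a toy search. The sketch below does not use PySR or a genetic algorithm. It swaps in an exhaustive search over expressions with at most one operator, which is only feasible for very small problems. The variable names, constants, penalty weight, and data are all made up.

```python
import itertools

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def candidates(variables, constants):
    """Yield (text, function, complexity) for leaves and one-operator expressions."""
    leaves = [(v, (lambda row, v=v: row[v])) for v in variables]
    leaves += [(str(c), (lambda row, c=c: c)) for c in constants]
    for text, fn in leaves:
        yield text, fn, 1
    for (ta, fa), (tb, fb) in itertools.product(leaves, repeat=2):
        for sym, op in OPS.items():
            fn = lambda row, fa=fa, fb=fb, op=op: op(fa(row), fb(row))
            yield f"({ta} {sym} {tb})", fn, 3


def mse(fn, rows, targets):
    """Mean squared error of a candidate function on the data."""
    return sum((fn(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def discover(rows, targets, variables, constants, penalty=0.01):
    """Return the expression minimizing MSE + penalty * complexity."""
    best = min(
        candidates(variables, constants),
        key=lambda c: mse(c[1], rows, targets) + penalty * c[2],
    )
    return best[0]


# Made-up data: the target depends on dfe only; sal is a distractor.
rows = [
    {"dfe": 0.6, "sal": 34.5},
    {"dfe": 0.9, "sal": 35.0},
    {"dfe": 1.2, "sal": 35.2},
    {"dfe": 0.7, "sal": 34.8},
]
targets = [r["dfe"] - 0.5 for r in rows]

print(discover(rows, targets, ["dfe", "sal"], [0.5, 1.0]))
```

In this toy case the search drops the uninformative variable because including it adds complexity without reducing error. Real tools explore far larger expression spaces, which is why they rely on evolutionary search rather than enumeration.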

Data:

Output from a global ocean biogeochemical model (as “perfect” surrogate data), plus subsampling to mimic real observational sparsity (e.g., GEOTRACES-like datasets).
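
A subsampling experiment of this kind starts by restricting the surrogate data to what a field campaign might plausibly collect. A minimal sketch follows, assuming each record is a dictionary with hypothetical `basin` and `depth_m` keys; the record layout and values are illustrative, not the paper's data format.

```python
def subsample(records, max_depth_m=None, basins=None):
    """Keep records no deeper than max_depth_m and within the named basins.

    A value of None disables the corresponding filter.
    """
    kept = []
    for rec in records:
        if max_depth_m is not None and rec["depth_m"] > max_depth_m:
            continue
        if basins is not None and rec["basin"] not in basins:
            continue
        kept.append(rec)
    return kept


# Made-up records standing in for model output.
records = [
    {"basin": "Atlantic", "depth_m": 50, "cfe": 0.20},
    {"basin": "Atlantic", "depth_m": 2500, "cfe": 0.35},
    {"basin": "Pacific", "depth_m": 100, "cfe": 0.10},
    {"basin": "Pacific", "depth_m": 3000, "cfe": 0.40},
    {"basin": "Southern", "depth_m": 800, "cfe": 0.15},
]

surface_only = subsample(records, max_depth_m=200)
atlantic_only = subsample(records, basins={"Atlantic"})
full_coverage = subsample(records)

print(len(surface_only), len(atlantic_only), len(full_coverage))
```

Each subset would then be passed to the equation search, and the resulting equations compared against those found with full coverage.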

Main Results

  • Symbolic regression produced a suite of simpler equations that omitted variables like salinity (which varies little in relevant regimes) and still matched large-scale CFe distributions.
  • Sampling insights:
    • Full water-column data yielded far more robust equations than depth-limited samples.
    • Multi-basin coverage is critical; sparse or regional data leads to less robust equations.
    • Combining colloidal Fe observations with dissolved Fe data from key sections improved results.
  • The ML equations served as effective emulators of the underlying model process.

This validates equation discovery for ocean biogeochemistry and provides actionable guidance for future fieldwork and model development.

Abstract

Equation discovery methods, such as symbolic regression, show great promise to generate parameterizations of biogeochemical processes in an objective data-driven manner, yet remain untested in ocean biogeochemistry. Here, we apply symbolic regression to a state-of-the-art ocean biogeochemical model, using it as a surrogate data set to rediscover an empirical equation used to calculate colloidal iron in the model. We introduce a robustness metric combining R² (global pattern reproduction) and EMD-SHAP (similarity of functional behaviors) for discovered equations. While symbolic regression did not rediscover the original equation because of its empirical complexity, it generated simpler equations with similar performance and functional behaviors, indicating symbolic regression’s potential as an emulator bridging between models. Subsampling experiments show that robust equations require full-depth and multi-basin sampling, underscoring sampling priorities on colloidal iron. This framework can be broadly applicable to other poorly constrained biogeochemical processes.

