link: https://arxiv.org/pdf/2601.10583

This work provides an overview of COAML (Combinatorial Optimization Augmented Machine Learning), including a unifying framework for COAML, an overview of the state of the art in COAML, and a taxonomy of problem settings.

Introduction

COAML mainly aims at solving CSO (contextual stochastic optimization) problems, in which we need to determine a stochastic policy \(y\sim \pi(\cdot |x)\) that maps a context to a decision (e.g. road conditions -> traffic control). After the decision \(y\) has been made, a real-life uncertainty \(\xi\) is realized, giving us a cost of \(c(x,y,\xi)\). Note that \(x,y,\xi\) are all realizations of random variables, and \(c\) is a deterministic cost function.

The goal of such a problem is \(\min_{\pi} E_{x,\xi\sim P,\,y\sim \pi(\cdot|x)} c(x,y,\xi)\). However, it is hard to derive an optimal \(\pi\) because the distribution \(P\) is unknown.

In the COAML framework, we address the problem in the following form:

  • A statistical model \(\varphi_w\) that produces a parameter \(\theta=\varphi_w(x)\).
  • A surrogate CO problem \(\pi_w(x)\in \arg\max_y \kappa (\theta,y)\) parameterized by \(\theta\). In many cases \(\kappa(\theta,y)=\theta^Ty\).
  • Find the parameter \(w\) that minimizes \(E_{x,\xi\sim P}\,c(x,\pi_w(x),\xi)\).
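
To make the pipeline concrete, here is a minimal sketch of the three components, assuming a linear surrogate \(\kappa(\theta,y)=\theta^Ty\) and a toy feasible set (binary vectors with exactly \(k\) ones) so that the argmax has a closed form; the names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def statistical_model(x, w):
    # phi_w: maps the context x to the CO parameter theta (here a simple linear model).
    return x @ w

def co_layer(theta, k=2):
    # Surrogate CO problem: argmax_y theta^T y over binary vectors with exactly k ones.
    # For this toy feasible set the argmax is just "pick the k largest scores".
    y = np.zeros_like(theta)
    y[np.argsort(theta)[-k:]] = 1.0
    return y

def policy(x, w, k=2):
    # pi_w(x): compose the statistical model and the CO layer.
    return co_layer(statistical_model(x, w), k)

# Illustrative usage: 3 context features, 5 decision variables.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 5))
x = rng.normal(size=3)
print(policy(x, w))  # binary decision vector selecting k items
```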

In traditional methods (PTO, predict-then-optimize), we first train \(\varphi_w\) using SL (supervised learning) and then optimize with the predicted parameters. In COAML methods, we directly optimize a decision-focused loss and obtain an end-to-end model. However, this introduces the challenge of differentiating through the CO layer.

Formalization

Example 1. VSP problem

Vehicle Scheduling Problem: construct a sequence of requests for each vehicle such that all requests are collectively covered at minimum cost, e.g. aircraft routing. In this case, we can formalize the problem as follows: there are \(n\) requests (each request being a trip from one place to another) and \(m\) aircraft, and an aircraft can serve request \(y_i\) directly after request \(x_i\) if they are compatible (e.g. \(y_i\) is "flying from B to C", while \(x_i\) is "flying from A to B").

We can model this as a graph problem, where we build an edge from \(x_i\) to \(y_i\) for each compatible pair. If we model the total cost as the sum of the costs of all used edges, then it becomes a simple minimum-cost flow (MCF) problem.

However, the real cost is often more complex. For example, there is a lot of randomness in aircraft delays, and a single delay propagates to subsequent flights. Minimizing the expected delay cost is challenging.

Parmentier [2021] proposed a COAML method that sets \(\theta\) to be learned edge costs and takes the surrogate \(\kappa(\theta,y)\) to be a deterministic MCF problem.
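
Below is a small sketch of what such a surrogate layer could look like, using networkx's min-cost-flow solver; the request-splitting construction, the helper name `vsp_surrogate`, and the cost scaling are my own illustrative choices, not Parmentier's implementation.

```python
import networkx as nx

def vsp_surrogate(requests, theta, n_vehicles):
    """Deterministic MCF surrogate for the VSP (illustrative sketch).

    requests: list of request ids.
    theta: dict mapping (i, j) to a learned cost for serving request j right
           after request i (these values would come from the statistical model).
    """
    G = nx.DiGraph()
    G.add_node("src", demand=0)
    G.add_node("snk", demand=0)
    # Circulation arc back to the source; its capacity limits the fleet size.
    G.add_edge("snk", "src", capacity=n_vehicles, weight=0)
    for r in requests:
        # Split each request so that exactly one unit of flow must cover it.
        G.add_node(f"{r}_in", demand=1)
        G.add_node(f"{r}_out", demand=-1)
        G.add_edge("src", f"{r}_in", capacity=1, weight=0)   # start a chain at r
        G.add_edge(f"{r}_out", "snk", capacity=1, weight=0)  # end a chain after r
    for (i, j), cost in theta.items():
        # Learned chaining cost; min_cost_flow prefers integer weights, hence the rounding.
        G.add_edge(f"{i}_out", f"{j}_in", capacity=1, weight=int(round(100 * cost)))
    return nx.min_cost_flow(G)

# Illustrative usage: 3 requests, 2 aircraft, learned chaining costs.
print(vsp_surrogate(["r1", "r2", "r3"], {("r1", "r2"): 0.3, ("r2", "r3"): 1.7}, 2))
```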

Multi-stage Stochastic Optimization

In some scenarios, we can make a further (recourse) decision \(z\) after \(\xi\) is realized. That is, we have \(\tilde c(x,y,\xi)=\min_z c(x,y,z,\xi)\) and minimize \(E[\tilde c]\). This is not difficult to handle in COAML by proceeding as above.

In general, we may encounter an MDP setting, where we need to find a policy \(\pi:x_t\in X_t\to y_t\in Y_t\) and minimize the expected cost \(E[\sum_t c_t(x_t,y_t,\xi)\mid\pi]\). When the state and action spaces are moderate in size, we can optimize it by stochastic dynamic programming. While deep RL struggles with huge combinatorial action spaces, COAML lets the NN learn the state representation and delegates the action decision to the combinatorial layer.
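
As a rough illustration of that last point, here is a sketch of a single MDP step in which a (here linear) statistical model scores every vehicle-request pair and the combinatorial layer turns those scores into an action by solving an assignment problem; the feature encoding and function names are assumptions made for the sake of the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def act(pair_features, w):
    """One decision step with a combinatorial action layer.

    pair_features: (n_vehicles, n_requests, d) features describing the current
                   state x_t for each vehicle-request pair (hypothetical encoding).
    w: (d,) weights of a linear model standing in for the NN.
    """
    # Statistical model: score every vehicle-request pair.
    theta = pair_features @ w                        # shape (n_vehicles, n_requests)
    # Combinatorial layer: the action is the best one-to-one assignment.
    rows, cols = linear_sum_assignment(theta, maximize=True)
    return list(zip(rows.tolist(), cols.tolist()))   # vehicle -> request matching

rng = np.random.default_rng(1)
features = rng.normal(size=(3, 4, 5))  # 3 vehicles, 4 requests, 5 features per pair
print(act(features, rng.normal(size=5)))
```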

Methodology

Supervised Learning

The argmax is piecewise constant, so its gradient is zero almost everywhere and it is not suitable for backpropagation. The paper presents two techniques, used in combination, to address this issue.

  • Fenchel-Young loss

    A convex surrogate loss whose gradient with respect to \(\theta\) is \(\hat y-\bar y\), the (regularized) prediction minus the target solution.

  • Perturbation

    Add Gaussian noise \(Z\) to \(\theta\) and set \(\hat y=E_Z[\arg\max_y(\theta+Z)^Ty]\). In practice, sample \(Z\) several times, average the resulting solutions to estimate \(\hat y\), and use \(\hat y-\bar y\) (prediction minus target) as the gradient estimate.
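
A minimal sketch of this Monte-Carlo gradient estimate is given below, reusing the same toy top-k oracle as in the earlier sketch; the sample count, noise scale, and function names are illustrative choices.

```python
import numpy as np

def co_argmax(theta, k=2):
    # Toy linear maximizer over binary vectors with exactly k ones.
    y = np.zeros_like(theta)
    y[np.argsort(theta)[-k:]] = 1.0
    return y

def perturbed_fy_gradient(theta, y_target, n_samples=20, sigma=1.0, rng=None):
    """Estimate of the Fenchel-Young loss gradient w.r.t. theta: y_hat - y_target,
    where y_hat averages the argmax over Gaussian perturbations of theta."""
    rng = rng or np.random.default_rng()
    y_hat = np.mean(
        [co_argmax(theta + sigma * rng.normal(size=theta.shape)) for _ in range(n_samples)],
        axis=0,
    )
    return y_hat - y_target

# Illustrative usage: one gradient step directly on theta
# (in a real model the gradient would be backpropagated into phi_w).
theta = np.array([0.5, -0.2, 1.3, 0.1])
y_target = np.array([1.0, 0.0, 0.0, 1.0])  # supervised target solution
theta -= 0.1 * perturbed_fy_gradient(theta, y_target)
print(theta)
```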

Empirical Cost Minimization

This is the more important part, where we optimize the empirical expected cost \(\frac{1}{N}\sum_{i} c(x_i,\pi_w(x_i),\xi_i)\).

  • Alternating minimization algorithm (Bouvier et al. [2025])

    Alternate between the following two steps (see the sketch after this list):

    1. Freeze the NN. Solve a CO problem for the realized \(x\) and \(\xi\), with objective \(c(x,y,\xi)-\epsilon\cdot (\theta+Z)^Ty\), where \(Z\sim\) Gaussian. Sample \(Z\) several times to obtain a target \(y\).
    2. Use that target \(y\) as the label for supervised learning of the NN.
  • Structured Reinforcement Learning (Hoppe et al. [2025])

    SRL is an extension of alternating minimization in which a learnable critic model is introduced in step 1.

    In each step, we update the actor as follows:

    • For each batch sampled from the replay buffer, compute the target \(\bar y\) using the \(Q\)-values produced by the critic (just as in SL, perturbation is needed).

    Then update the critic.
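
Here is the sketch referenced above: one round of the alternating scheme on a toy problem whose feasible set is small enough to enumerate. The function names, the toy cost, and the direct update on \(\theta\) (instead of backpropagating into \(\varphi_w\)) are all simplifications for illustration.

```python
import numpy as np
from itertools import combinations

def candidates(n, k=2):
    # Toy feasible set: binary vectors with exactly k ones.
    for idx in combinations(range(n), k):
        y = np.zeros(n)
        y[list(idx)] = 1.0
        yield y

def alternating_min_round(x, xi, theta, cost_fn, eps=0.1, n_samples=10, lr=0.1, rng=None):
    rng = rng or np.random.default_rng()
    # Step 1 (NN frozen): solve the perturbed, cost-driven CO problem several
    # times and average the solutions to obtain a target decision y_bar.
    targets = []
    for _ in range(n_samples):
        z = rng.normal(size=theta.shape)
        best = min(candidates(len(theta)),
                   key=lambda y: cost_fn(x, y, xi) - eps * (theta + z) @ y)
        targets.append(best)
    y_bar = np.mean(targets, axis=0)
    # Step 2 (supervised): pull the scores toward the target. With a real NN this
    # would be a Fenchel-Young gradient step backpropagated into phi_w.
    y_hat = max(candidates(len(theta)), key=lambda y: theta @ y)
    return theta + lr * (y_bar - y_hat)

# Illustrative usage with a made-up cost: xi gives the per-item cost of selecting it.
cost = lambda x, y, xi: float(xi @ y)
theta = np.zeros(4)
print(alternating_min_round(x=None, xi=np.array([3.0, 0.5, 2.0, 0.2]), theta=theta, cost_fn=cost))
```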

Applying COAML

Important note: "ALWAYS begin by training a simple GLM scoring model using the FY loss." In such a simple case convergence is guaranteed, so a failure can be attributed directly to a bug in the CO layer. Also, the CO layer is the cornerstone, so we should move to richer architectures only after the CO layer is fixed and validated.
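
As a starting point, such a GLM scorer can be as small as the sketch below; the class name and shapes are illustrative, and training would plug its output \(\theta\) into the perturbed FY gradient shown earlier.

```python
import numpy as np

class GLMScorer:
    """Minimal generalized linear scoring model phi_w: theta = features @ w."""

    def __init__(self, n_features, rng=None):
        self.w = (rng or np.random.default_rng()).normal(scale=0.01, size=n_features)

    def __call__(self, features):
        # features: (n_decision_vars, n_features) -> theta: (n_decision_vars,)
        return features @ self.w

    def update(self, features, grad_theta, lr=0.1):
        # Chain rule through the linear model: dL/dw = features^T (dL/dtheta),
        # where dL/dtheta would be the perturbed Fenchel-Young gradient.
        self.w -= lr * features.T @ grad_theta

# Illustrative usage: score 5 decision variables described by 3 features each.
model = GLMScorer(n_features=3, rng=np.random.default_rng(2))
print(model(np.random.default_rng(3).normal(size=(5, 3))))
```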

The pros and cons of each architecture are listed in the paper. It is worth mentioning that a GNN naturally captures relational structure when the CO layer is a graph algorithm, though it requires more computation and is harder to train.

The information listed in the paper is quite detailed, so I will skip this part. Refer back to the paper if necessary.

SOTA

Application Perspective

  • Hard CO problems. Some hard CO problems can be converted into tractable surrogate CO problems by COAML, including the VSP [Parmentier, 2021], the SMSP [Parmentier and T’kindt, 2023], and the FCTP [Spieckermann et al., 2025].

    VSP is explained above.

    The SMSP is the Single Machine Scheduling Problem, where jobs with different release times and different importances must be scheduled on a single machine.

    The FCTP is the Fixed Charge Transportation Problem, which is like an MCF problem except that using an edge incurs a fixed charge rather than a cost proportional to its flow.

  • CSO problems. These include the DIRP [Greif et al., 2024] and traffic equilibria prediction [Jungel et al., 2024].

    The DIRP is the Dynamic Inventory Routing Problem, where we need to decide when to deliver goods to which store, so that stores never run out of stock, while minimizing storage and driving costs.

    Traffic equilibria prediction: predict traffic flow under the assumption of selfish drivers. It is like a game-theoretic equilibrium problem, except that the road costs are unknown.

  • MDP stochastic optimization problems. These include ambulance redeployment [Rautenstrauß and Schiffer, 2025], the DVRP [Baty et al., 2024], and the autonomous mobility-on-demand problem [Enders et al., 2023, Hoppe et al., 2024, Jungel et al., 2025]. RL methods are being leveraged increasingly in this setting.

    DVRP: the Dynamic Vehicle Routing Problem, where customer requests arrive in real time rather than being known in advance.

    AMoD: managing a fleet of self-driving taxis (like a robotic Uber). The decisions include assigning cars to passengers and also repositioning empty cars to areas where demand is expected.

  • MDPs with continuous or large action spaces, such as robot control, recommendation systems, and large-scale scheduling. COAML inspires solutions to the challenges RL faces on such problems.