A 2D slice of the simulated wave field propagating through a 320 × 320 × 320 wavelength structure of randomly packed spheres - solved in just 45 minutes on two A40 GPUs.

Breaking the Memory Barrier: Large-Scale Wave Simulations with Domain Decomposition
45 min
to simulate 320³ wavelengths on 2 GPUs
2.16B
voxels - largest MBS simulation to date
20×
faster than single-domain CPU simulation
WaveSim can now split simulations across multiple GPUs, enabling a 320 × 320 × 320 wavelength structure of 2.16 billion voxels to be simulated in 45 minutes on a dual-GPU setup.
WaveSim is already one of the fastest and most accurate wave-propagation solvers available. Until now, however, every simulation had to fit within the memory of a single compute device, a hard ceiling that left the largest real-world problems out of reach.
A new publication by Swapnil Mache and Ivo M. Vellekoop from the Biomedical Photonic Imaging Group at the University of Twente removes that barrier, introducing a domain decomposition framework that distributes simulations across multiple GPUs while fully preserving WaveSim's accuracy, memory efficiency, and guaranteed monotonic convergence.
PUBLICATION
Mache, S. & Vellekoop, I.M.
Domain decomposition of the modified Born series approach for large-scale wave propagation simulations
Journal of Computational Physics 550 (2026) 114619
What problem were you trying to solve?
WaveSim is based on the modified Born series and is already a highly memory-efficient solver for the Helmholtz and Maxwell’s equations. However, it remains constrained by a fundamental hardware limit: the available memory of a single compute device. Regardless of algorithmic efficiency, sufficiently large simulations will eventually exceed that capacity. Falling back to CPU execution, as many popular solvers do, is a practical alternative, but solve times become prohibitive at large scales.
Our goal was to remove that hardware-imposed ceiling. We wanted WaveSim to scale transparently across multiple GPUs while preserving its core numerical properties. No degradation in accuracy. No change in convergence behaviour. Simply the ability to run larger simulations by distributing the workload across more memory and more compute resources.
How did you solve it?
We introduced a domain decomposition framework that allows WaveSim simulations for the Helmholtz equation to be distributed across multiple GPUs.
Instead of solving the entire Helmholtz problem on a single device, the simulation domain is divided into smaller subdomains. Each subdomain is assigned to a GPU, where the modified Born series iteration is performed locally.
To ensure the wave field remains consistent across the whole simulation, neighbouring GPUs exchange field values in a tiny region along their shared boundaries during every iteration.
Because this decomposition follows directly from the block structure of the global Helmholtz system, the distributed formulation remains mathematically equivalent to solving the original single-domain problem. As a result, the solver retains the same numerical behaviour, including stable iterations and monotonic convergence.
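The halo-exchange pattern described above can be sketched in a few lines. The toy below substitutes a plain Jacobi iteration on a 1D Poisson problem for the modified Born series update (the actual MBS operator is more involved), but it shows the essential structure: each "device" updates only its own subdomain and swaps one edge value with its neighbour every iteration, and the split solve matches the single-domain solve.

```python
import numpy as np

# Toy illustration of the halo-exchange pattern (NOT the actual MBS solver):
# solve -u'' = f on [0, 1] with u(0) = u(1) = 0 by Jacobi iteration,
# split into two subdomains that each hold one extra "halo" cell per side.
n = 64                            # interior points of the global grid
h = 1.0 / (n + 1)
f = np.ones(n)                    # constant source term
half = n // 2

left = np.zeros(half + 2)         # interior cells 1..half, plus two halo cells
right = np.zeros(n - half + 2)    # interior cells half+1..n, plus two halo cells

for _ in range(40000):
    # halo exchange: copy the neighbour's edge value into the local halo cell
    left[-1] = right[1]
    right[0] = left[-2]
    # local Jacobi update on each "device": u_i = (u_{i-1} + u_{i+1} + h^2 f_i)/2
    left[1:-1] = 0.5 * (left[:-2] + left[2:] + h**2 * f[:half])
    right[1:-1] = 0.5 * (right[:-2] + right[2:] + h**2 * f[half:])

u_dd = np.concatenate([left[1:-1], right[1:-1]])

# single-domain reference: solve the same tridiagonal system directly
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
u_ref = np.linalg.solve(A, h**2 * f)
print(np.max(np.abs(u_dd - u_ref)))   # negligible: the split solve matches
```

In WaveSim the exchanged region is a few voxels deep and also serves the wraparound correction, but the communication pattern per iteration is the same.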
Does splitting the simulation affect result quality?
No. The distributed solver preserves the same convergence behaviour and accuracy as the original single-domain WaveSim method.
The solver continues to converge monotonically, meaning the residual decreases at every iteration, and this property remains unchanged when the simulation domain is decomposed across multiple GPUs.
In the large-scale 3D simulations presented in the paper, the relative difference between the domain-decomposed solution and the single-domain reference solution was 2.9 × 10⁻⁴, demonstrating that accuracy is effectively preserved.
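Monotonic convergence means the residual norm ‖b − Ax‖ shrinks at every step. The snippet below illustrates that property with a damped Richardson iteration on a small symmetric positive-definite system; this is a generic stand-in for a monotonically converging fixed-point scheme, not WaveSim's preconditioned MBS iteration.

```python
import numpy as np

# Damped Richardson iteration x <- x + alpha * (b - A x) on an SPD system.
# With alpha = 1 / ||A||_2, the error map I - alpha*A is a contraction,
# so the residual norm decreases at every single iteration.
rng = np.random.default_rng(0)
B = rng.standard_normal((16, 16))
A = B @ B.T + 16 * np.eye(16)          # symmetric positive definite
b = rng.standard_normal(16)
alpha = 1.0 / np.linalg.norm(A, 2)     # spectral norm of A

x = np.zeros(16)
residuals = []
for _ in range(50):
    r = b - A @ x
    residuals.append(np.linalg.norm(r))
    x = x + alpha * r

# residual norm is strictly decreasing at every iteration
print(all(r1 < r0 for r0, r1 in zip(residuals, residuals[1:])))
```

WaveSim's preconditioned iteration has the same monotone character, and the paper shows this is preserved exactly when the domain is split across GPUs.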
Accuracy is controlled by a single parameter: the number of voxels near each subdomain boundary used for inter-domain communication and wraparound correction. Even just four voxels keep errors below 0.1%, while the paper uses eight voxels as the default. For the two-billion-voxel simulation this boundary region amounts to about 0.2 GB; for smaller simulations it is just a few megabytes.
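The ~0.2 GB figure can be sanity-checked with a rough back-of-the-envelope calculation. The assumptions here are mine, not taken from the paper's code: a split along one axis, eight boundary voxels exchanged per side, and complex64 fields (8 bytes per voxel).

```python
# Rough accounting of the communication region for the 2-billion-voxel run.
points_per_axis = 320 * 4                  # 320 wavelengths, 4 points/wavelength
interface_voxels = points_per_axis ** 2    # one cross-section of the grid
boundary_depth = 8                         # voxels exchanged per side (paper default)
bytes_per_voxel = 8                        # complex64 field values
one_side = interface_voxels * boundary_depth * bytes_per_voxel
total_gb = 2 * one_side / 1e9              # both directions of the exchange
print(round(total_gb, 2))                  # ~0.21 GB, consistent with ~0.2 GB
```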

A small 2D slice of 100 × 100 wavelengths from the large 3D problem. The light source is a plane wave propagating from the left along the x axis through a collection of small spheres. The image shows 160,000 voxels, just 0.007% of the 2.16 billion that WaveSim simulated!
How large a simulation can this approach handle?
The paper demonstrates simulations containing more than two billion voxels.
In the large-scale example, a three-dimensional domain spanning 320 × 320 × 320 wavelengths was simulated using four grid points per wavelength. This corresponds to approximately 2.16 billion voxels, solved using just two GPUs.
By distributing the domain across multiple GPUs, simulation size is no longer limited by the memory of a single device. Instead, the maximum problem size scales with the total memory available across the GPU system.
Does adding more GPUs make the simulation faster?
Yes. Distributing the computational domain across multiple GPUs significantly reduces the total simulation time. By dividing the workload into subdomains, each GPU processes a portion of the grid in parallel, allowing the simulation to complete much faster than on a single CPU.
In the demonstration reported in the paper, the two-GPU simulation was roughly 20× faster than a CPU implementation for a problem that could not fit into the memory of a single GPU. As additional GPUs are used, the workload can be distributed further, decreasing the time-to-solution while preserving the same numerical accuracy. Some communication overhead occurs between GPUs, but this is small compared with the performance gains for large simulations.
Is there any overhead involved in splitting the simulation across GPUs?
Yes, but the overhead is relatively small compared with the performance gains of distributed computation.
Two main factors contribute. First, the number of solver iterations increases because splitting the domain reduces the internal scaling parameter in the MBS preconditioner, which slows convergence. In the experiments presented in the paper, this increased iterations by 3-3.5×, depending on the refractive index contrast of the simulated structure - a one-time cost that does not grow further as more subdomains are added along the same axis.
Second, neighbouring GPUs must exchange boundary data at each iteration, which introduces a small communication cost. For very large simulations, this cost becomes relatively minor compared to the computation within each subdomain.
Who will benefit most from this, and what comes next?
Researchers and engineers facing memory- or size-limited simulations will benefit most. This is particularly relevant in nanophotonics, geophysics, and acoustic or seismic wave simulation, where realistic structures can span hundreds or thousands of wavelengths in three dimensions. For problems that previously exceeded GPU memory or required days to weeks of CPU computation, domain decomposition provides an effective solution.
Looking ahead, the same framework can be extended to more complex physical models. Because the approach operates at the level of the global Helmholtz system, it can in principle support simulations of electromagnetic wave propagation governed by Maxwell’s equations, including in complex media such as birefringent materials.
As GPU hardware continues to advance and multi-GPU systems become more accessible, this scalability opens the door to increasingly large and realistic wave-propagation simulations.
