Why Lift-and-Shift Doesn’t Solve Reliability at Insurance-Group Scale
When pricing and valuation runs span tens of thousands of cores, reliability stops being an operations problem and becomes an architecture problem.
Dr. Le Weiliang
Chief Architect, StrinGaze
There is a quiet assumption inside many large insurers that scaling actuarial compute is a procurement problem. Buy more cores. Move the cluster to a hyperscaler. Sign a managed-service agreement. The numbers in the pricing deck improve; the slides look reassuring; the migration plan goes to the steering committee. Then the model runs start failing in ways the on-prem version never did, and the conversation shifts from cost-per-core to why the month-end valuation is late again.
What "scale" actually means in actuarial workloads
A capital-aware repricing run on a mid-sized regional book is not a 200-core job. With nested stochastic projection, capital stress testing, and IFRS 17 measurement layered into the same pass, the sustained core demand for a single quarterly cycle routinely exceeds 20,000 cores. For group-level valuation runs across multiple entities and regulatory regimes, peak demand exceeds 60,000 cores.
At those scales, a platform’s behaviour under failure dominates its behaviour under success. The probability that something — a node, a network segment, a storage volume, a worker process — will fail during a multi-hour run approaches one. The question is no longer whether failures happen, but what the platform does when they do.
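To make the "approaches one" claim concrete, here is a minimal sketch. It assumes failures are independent and that every component (node, volume, worker process) has the same small hourly failure probability. Both are simplifying assumptions, and the function name and the figures in the usage note are illustrative, not measured.

```python
import math


def prob_any_failure(components: int, hourly_failure_prob: float, hours: float) -> float:
    """Probability that at least one component fails during a run.

    Assumes independent, identical components (a simplification).
    Computes 1 - (1 - p) ** (components * hours) in log space so that
    very small p and very large component counts stay numerically stable.
    """
    if not 0.0 <= hourly_failure_prob < 1.0:
        raise ValueError("hourly_failure_prob must be in [0, 1)")
    # log of the probability that every component survives every hour
    log_all_survive = components * hours * math.log1p(-hourly_failure_prob)
    # 1 - exp(x) written as -expm1(x) to avoid cancellation near zero
    return -math.expm1(log_all_survive)
```

With a hypothetical hourly failure probability of 1e-4 and a six-hour run, 200 components give a probability of roughly 0.11 that something fails, while 20,000 components push it above 0.99. The per-component rate is unchanged; only the scale differs.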
There are essentially two answers, and the gap between them is the gap between a platform that ships month-end on time and one that does not.
Architecture A — stateful workers with shared files
The first answer is the inheritance from desktop-era actuarial software. Workers carry state between calculation steps. Intermediate results are written to shared file systems. Job orchestration is driven by file presence and lock acquisition rather than by a control plane.
This works on a single workstation. It works, with some friction, on a small departmental grid. It fails — predictably and expensively — at insurance-group scale, for three reasons.
First, recovery is coarse. When a stateful worker dies mid-run, the orchestration layer cannot cleanly recover its in-flight work because the state is tied to that worker’s local context. The pragmatic recovery is to rerun the entire model-point block, or in the worst case, the entire scenario.
Second, shared storage becomes the bottleneck. As thousands of workers hit the same shared file system to read inputs and write intermediates, the storage layer becomes the constraint long before CPU does. Throughput plateaus and tail latency spikes well below the cluster’s notional capacity.
Third, the cloud doesn’t fix this; it makes it more expensive. Lifting and shifting this architecture into a managed cloud service replaces local NAS with cloud storage and local nodes with virtual ones. The architectural pathology is unchanged, but now you are paying hyperscaler rates for the contention.
Architecture B — stateless services with externalised state
The second answer is the architecture pattern the broader software industry adopted a decade ago for any system that needs to run at scale: stateless services, externalised state, control-plane orchestration.
Workers hold no state between calculations. The unit of work — a model point, a scenario, a stochastic path — is dispatched by the control plane and executed against externalised state stores: a replicated cache for sessions, a replicated database for persistent metadata, and an artefact store (object storage on public cloud, NAS on-prem) for bulk inputs and outputs. Workers are interchangeable.
Failure becomes a non-event. When a worker dies mid-run, the control plane reissues its unit of work to a healthy node within seconds. Recovery is fine-grained: the failed task reruns, not the entire job.
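The fine-grained recovery described here can be sketched as a small dispatch loop. This is an illustration under stated assumptions, not any particular product's control plane: `run_job`, `WorkerFailure`, and `max_attempts` are hypothetical names, the dictionary stands in for an externalised result store, and a real system would dispatch to remote workers in parallel instead of calling a function in-process.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, Tuple


class WorkerFailure(Exception):
    """Raised when a (simulated) worker dies while holding a unit of work."""


def run_job(
    task_ids: Iterable[Hashable],
    execute: Callable[[Hashable], object],
    max_attempts: int = 3,
) -> Tuple[Dict[Hashable, object], Dict[Hashable, int]]:
    """Dispatch each unit of work; on worker failure, reissue only that unit."""
    tasks = list(task_ids)
    results: Dict[Hashable, object] = {}   # stand-in for the external result store
    attempts = {task_id: 0 for task_id in tasks}
    pending = deque(tasks)

    while pending:
        task_id = pending.popleft()
        attempts[task_id] += 1
        try:
            # Workers are stateless: everything needed is derived from task_id,
            # so any healthy worker can pick the unit up again.
            results[task_id] = execute(task_id)
        except WorkerFailure:
            if attempts[task_id] >= max_attempts:
                raise
            pending.append(task_id)        # requeue the failed unit only

    return results, attempts
```

Because results are keyed by task and held outside the worker, a failure costs one repeated unit of work. Completed units are never recomputed, which is the contrast with rerunning a whole model-point block.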
Storage scales with the cluster. Object storage and replicated state stores are designed for the access pattern; throughput grows roughly linearly with parallelism rather than collapsing under contention.
Capacity becomes elastic by configuration. The cluster can grow from a few hundred to tens of thousands of cores for a quarterly close and shrink back, because workers carry nothing that prevents them from being added or removed.
The TCO inversion
This is where the economic argument flips, and where most procurement decks miss the point.
A managed lift-and-shift offering looks competitive on the per-core line item. It almost always loses on total cost of ownership at scale, for three reasons. Reliability has a price: every failed run is a delayed close, and every delayed close has business consequences. Storage architecture has a price: shared file systems at hyperscaler rates, sized for peak contention, cost more than object storage sized for parallel access. Elasticity has a price: an architecture that cannot shrink costs as much mid-quarter as it does at quarter-end.
In the engagements we have benchmarked, the architectural difference between A and B translates to a 30–40% lower total infrastructure spend over a three-year horizon — not because the per-core rate is cheaper, but because the cluster genuinely scales down.
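The scale-down effect can be expressed as simple arithmetic on provisioned core-hours. The sketch below is an assumption-laden illustration: the function and parameter names are hypothetical, it models only a two-level demand profile (peak and baseline), and it counts core-hours, not total spend, so it does not reproduce the benchmarked figure above.

```python
def provisioned_core_hours(
    peak_cores: int,
    baseline_cores: int,
    peak_hours: float,
    period_hours: float,
    elastic: bool,
) -> float:
    """Core-hours provisioned over a period for a two-level demand profile.

    A fixed cluster is sized for the peak for the whole period.
    An elastic cluster runs at peak size only during peak hours and
    at baseline size for the remainder.
    """
    if not 0 <= peak_hours <= period_hours:
        raise ValueError("peak_hours must lie between 0 and period_hours")
    if not elastic:
        return peak_cores * period_hours
    return peak_cores * peak_hours + baseline_cores * (period_hours - peak_hours)
```

Feeding in your own run calendar (hours at peak per quarter, peak and baseline core counts) gives a first-order view of how much of a fixed cluster's cost is paid for idle capacity. Storage, licensing, and the cost of failed runs would need to be added separately.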
What this means for procurement
If your evaluation framework for an actuarial platform stops at per-core cost, throughput-on-paper, and feature parity with the incumbent, you are evaluating the surface of the system, not the system. The questions that matter at insurance-group scale are architectural.
What happens when a worker fails mid-run? "Rerun the block" is a different answer to "reissue the unit of work".
Where does intermediate state live? "Local to the worker" is a different answer to "externalised in object storage with replicated coordination".
Does the cluster genuinely shrink between cycles? "Auto-scaling on paper" is a different answer to "architecturally stateless".
The right platform for a single product line and a few hundred cores is not necessarily the right platform for an insurance group’s valuation engine. Lift-and-shift cannot resolve that.
It is an architecture problem. It deserves an architecture answer.
Dr. Le Weiliang is Chief Architect at StrinGaze. He has led actuarial platform architecture for life insurers internationally for over a decade.