Hosted Is Not Cloud-Native. The Architecture Distinction That Matters at Scale.
A managed cloud service running legacy software is not the same as a platform built for the cloud. The difference shows up first in your storage bill, then in your incident reports.
Dr. Le Weiliang
Chief Architect, StrinGaze
A piece of vocabulary has quietly drifted in actuarial software procurement, and it is costing buyers real money. When a vendor markets its platform as "available on the cloud" or "running on a major hyperscaler", that statement is almost always true and almost always misleading. There is a distinction that the broader software industry settled in the mid-2010s but that actuarial software has not yet been forced to confront: the distinction between a platform that is hosted in the cloud and a platform that is built for the cloud.
What "hosted" actually means
A hosted offering takes a software product designed for an earlier era — typically, a workstation or a small departmental grid — and runs it on infrastructure provisioned by a cloud provider. The vendor takes on the operational burden of provisioning, patching, and uptime. The customer no longer needs to run the data centre. The marketing line is reasonable: you focus on the actuarial work, we handle the infrastructure.
What this arrangement does not change is the underlying architecture of the software itself. If the product was designed around stateful workers, shared file systems, and coarse-grained job orchestration, those design choices are preserved into the hosted environment. The workers are now virtual instead of physical. The shared file system is now a managed cloud filer instead of an on-prem NAS. The orchestrator is the same orchestrator.
Three things follow.
First, the failure modes are the same. A worker dying mid-run still requires a block-level rerun, because the state is still tied to the worker. Cloud infrastructure is, in fact, more prone to individual node failures than dedicated hardware — that is an explicit trade-off in hyperscaler design. A hosted architecture pays the failure cost of the cloud without reaping the recovery benefit a cloud-native architecture would have.
Second, the storage cost surfaces directly. Shared file systems at managed-cloud rates are an order of magnitude more expensive than the equivalent object storage. If your product writes intermediate state to shared files, your storage bill scales with the number of workers, not with the volume of useful output.
Third, capacity does not genuinely flex. Workers that hold state cannot be removed safely while in use, and the orchestration layer cannot reissue their work cleanly when they fail. This forces the cluster to be sized for peak demand and held at that size for the duration of the cycle. The cloud’s elasticity is, in this architecture, theoretical.
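The third point can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names and the hourly demand profile are hypothetical assumptions, not measurements from any platform. It compares the worker-hours paid for when a cluster is held at peak with the worker-hours paid for when the cluster tracks demand.

```python
def worker_hours_held_at_peak(demand):
    """Worker-hours billed when the cluster is sized for peak and never shrunk."""
    return max(demand) * len(demand)


def worker_hours_tracking_demand(demand):
    """Worker-hours billed when the cluster grows and shrinks with hourly demand."""
    return sum(demand)


# Hypothetical profile: workers needed in each of 8 consecutive hours.
demand = [100, 100, 800, 800, 800, 200, 100, 100]

held = worker_hours_held_at_peak(demand)
elastic = worker_hours_tracking_demand(demand)
print(held, elastic)
```

The gap between the two figures is capacity that is paid for but idle. It only closes if workers can be removed without losing state.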
What "cloud-native" actually means
A cloud-native platform takes the design assumptions of large-scale distributed systems — the assumptions that allowed services like Stripe, Snowflake, and Databricks to scale to the workloads they handle — and applies them to actuarial computation.
The differences from a hosted, lift-and-shift deployment are not branding choices. They are architectural commitments, and they reinforce each other; they only deliver their benefits as a set.
Worker state — lift-and-shift keeps it stateful, with workers carrying context between steps. Cloud-native makes it stateless; workers hold no state between calculations.
Coordination state — lift-and-shift puts it in shared files or in-process memory, where a single failure cascades. Cloud-native externalises it to a replicated cache and a replicated database, with automatic failover.
Artefact storage — lift-and-shift uses shared file systems for everything, conflating bulk artefacts with coordination state. Cloud-native separates them: object storage (or NAS, on-prem) for bulk read-once or write-once artefacts; replicated stores for coordination. The artefact store is not the coordination bus.
Failure recovery — lift-and-shift forces a block-level rerun on worker death. Cloud-native reissues the in-flight unit of work to a healthy node, fine-grained and automatic.
Cluster scaling — lift-and-shift sizes for peak and holds. Cloud-native scales by configuration, growing and shrinking with demand at minute-level granularity.
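The worker-state and failure-recovery lines above can be sketched together. This is a minimal single-process simulation, not any vendor's implementation: `run_with_reissue`, `value_unit`, and the `fail_first_attempt` parameter are hypothetical names, and plain dictionaries stand in for a replicated coordination store. Because the worker function is stateless, a unit whose worker dies is put back on the queue and completed on a later attempt; nothing else is rerun.

```python
from collections import deque


class WorkerDied(Exception):
    """Raised to simulate a worker dying while it holds a unit of work."""


def value_unit(unit_id):
    """Stateless worker function: the result depends only on the input."""
    return unit_id * unit_id


def run_with_reissue(unit_ids, fail_first_attempt):
    """Process every unit, reissuing only the units whose worker died.

    `results` and `attempts` stand in for externalised coordination state.
    """
    pending = deque(unit_ids)
    results = {}
    attempts = {}
    while pending:
        unit_id = pending.popleft()
        attempts[unit_id] = attempts.get(unit_id, 0) + 1
        try:
            # Simulate a worker death on the first attempt of selected units.
            if unit_id in fail_first_attempt and attempts[unit_id] == 1:
                raise WorkerDied(unit_id)
            results[unit_id] = value_unit(unit_id)
        except WorkerDied:
            pending.append(unit_id)  # reissue just this unit
    return results, attempts


results, attempts = run_with_reissue(range(5), fail_first_attempt={2})
total_attempts = sum(attempts.values())
print(results, total_attempts)
```

Five units with one simulated death take six attempts in total. A block-level rerun would instead repeat every unit in the affected block.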
Storage isn’t one thing
The most common architectural misunderstanding in this space — the one that lets hosted platforms claim cloud parity — is treating storage as a single concern. It isn’t.
There are two kinds of storage in any actuarial run, and they have completely different requirements. The artefact store holds bulk read-once or write-once files — model point inputs, result outputs, scenario libraries. NAS works fine for this. So does object storage — OSS, S3, or any S3-compatible store. On public cloud, object storage replaces NAS entirely; on-prem, NAS does the job.
The coordination store is different. It holds the state that determines whether the system survives a node failure: session state, distributed locks, in-flight job tracking, intermediate results that one worker hands to another. Coordination state in a single shared file system is the failure mode that defines the lift-and-shift architecture. Coordination state in a replicated cache plus a replicated database — with automatic failover — is the cloud-native answer.
A vendor that says "we use NAS" or "we use a shared file system" isn’t telling you anything until you ask which role it plays: the artefact store, or the coordination bus. The first is a deployment choice. The second is the architectural distinction.
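One way to picture the separation is as two interfaces that never share a backend. The sketch below is an assumption-laden illustration: `ArtefactStore`, `CoordinationStore`, `run_unit`, and the in-memory classes are hypothetical names, and the in-memory classes only stand in for object storage and for a replicated cache plus database (there is no replication here).

```python
from typing import Protocol


class ArtefactStore(Protocol):
    """Bulk read-once or write-once files: model points, results, scenarios."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class CoordinationStore(Protocol):
    """Small, hot state that must survive a node failure."""

    def claim(self, unit_id: str, worker_id: str) -> bool: ...
    def complete(self, unit_id: str) -> None: ...
    def in_flight(self) -> dict[str, str]: ...


class InMemoryArtefacts:
    """Stand-in for object storage or NAS."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class InMemoryCoordination:
    """Stand-in for a replicated cache plus database."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}
        self._done: set[str] = set()

    def claim(self, unit_id: str, worker_id: str) -> bool:
        # A unit that is in flight or already finished cannot be claimed.
        if unit_id in self._owners or unit_id in self._done:
            return False
        self._owners[unit_id] = worker_id
        return True

    def complete(self, unit_id: str) -> None:
        self._owners.pop(unit_id, None)
        self._done.add(unit_id)

    def in_flight(self) -> dict[str, str]:
        return dict(self._owners)


def run_unit(unit_id: str, worker_id: str,
             artefacts: ArtefactStore, coord: CoordinationStore) -> bool:
    """Claim a unit, read its input, write its output, then mark it complete."""
    if not coord.claim(unit_id, worker_id):
        return False  # another worker holds or has finished this unit
    data = artefacts.get(f"inputs/{unit_id}")
    artefacts.put(f"outputs/{unit_id}", data.upper())  # placeholder "calculation"
    coord.complete(unit_id)
    return True


artefacts = InMemoryArtefacts()
coord = InMemoryCoordination()
artefacts.put("inputs/u1", b"model points")
done = run_unit("u1", "worker-a", artefacts, coord)
```

Swapping NAS for object storage changes only the `ArtefactStore` implementation. Whether the system survives a node failure depends on what sits behind `CoordinationStore`.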
The "worker state" trap
If there is one line in the comparison above I would underline for any procurement team, it is worker state.
Stateful workers are the original sin. They are also the easiest sin to hide in a demo. A demo runs a small model on a small cluster, finishes in a few minutes, and shows a clean output. Nothing about the demo exercises the failure modes that emerge when 8,000 workers run for 6 hours and 12 of them die mid-stream.
In the field, the symptoms of a stateful worker architecture only show up at scale, and they show up consistently. A monthly close that finishes most of the time, with a rerun playbook around it. A storage bill that grew faster than the cluster did. A migration to a more performant cloud tier that did not move the reliability number. An auto-scaling configuration that exists in policy but is rarely allowed to scale down, because the team has been burned by losing in-flight state.
These are not operational issues. They are architectural ones, surfacing through operational symptoms.
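To put rough numbers on the scenario above (8,000 workers, 6 hours, 12 deaths), the sketch below compares the work discarded under block-level rerun with the work discarded under unit-level reissue. Only those three figures come from the text. The block size, the timing of the failures, and the unit duration are hypothetical assumptions chosen for illustration, and each death is assumed to hit a different block.

```python
def rerun_cost_worker_minutes(deaths, block_workers, elapsed_minutes):
    """Work discarded when each death forces its whole block to restart."""
    return deaths * block_workers * elapsed_minutes


def reissue_cost_worker_minutes(deaths, unit_minutes):
    """Work discarded when only the in-flight unit is reissued (upper bound)."""
    return deaths * unit_minutes


# From the scenario in the text.
workers, run_minutes, deaths = 8_000, 6 * 60, 12

# Hypothetical assumptions, chosen only for illustration.
block_workers = 100      # workers whose progress is lost with one block
elapsed_minutes = 180    # each death occurs halfway through the run
unit_minutes = 3         # duration of one fine-grained unit of work

total = workers * run_minutes
rerun = rerun_cost_worker_minutes(deaths, block_workers, elapsed_minutes)
reissue = reissue_cost_worker_minutes(deaths, unit_minutes)
print(total, rerun, reissue)
```

Under these assumptions, block-level reruns discard 216,000 of the run's 2,880,000 worker-minutes (7.5%), against 36 worker-minutes for reissue. The exact figures depend entirely on the assumed block size and failure timing. What holds regardless is that one cost is multiplied by the block size and the other is not.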
What to ask the vendor
If you are evaluating an actuarial platform that markets itself as "cloud" or "cloud-ready", four questions separate hosted from cloud-native faster than any RFP response.
Where does coordination state live during a run, and is it replicated? "Shared files" or "worker-local memory" means hosted. "A replicated cache plus a replicated database" means cloud-native.
What happens to the in-flight unit of work when a worker dies? "Rerun the block" is the hosted answer. "Reissue the unit of work" is the cloud-native answer.
How does storage cost scale as you double the worker count? "Linearly with workers" means the storage architecture is the bottleneck, and the budget item.
At quarter-end, the cluster is at peak. At mid-quarter, what is it at? "The same size, because we cannot safely shrink it" means elasticity is a slide, not a property of the system.
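The storage question can be checked with a back-of-the-envelope model. The sketch below is illustrative only: the function names and all volumes are hypothetical assumptions, and it compares stored volume rather than price, since prices vary by provider and tier.

```python
def hosted_storage_gb(workers, state_gb_per_worker, output_gb):
    """Intermediate state on shared files grows with the worker count."""
    return workers * state_gb_per_worker + output_gb


def cloud_native_storage_gb(workers, output_gb):
    """Stateless workers: stored volume tracks useful output.

    `workers` is accepted but deliberately unused.
    """
    return output_gb


# Hypothetical volumes: 500 GB of useful output, 2 GB of state per worker.
output_gb, state_gb_per_worker = 500, 2

hosted = [hosted_storage_gb(w, state_gb_per_worker, output_gb) for w in (1_000, 2_000)]
native = [cloud_native_storage_gb(w, output_gb) for w in (1_000, 2_000)]
print(hosted, native)
```

If a vendor's answer resembles the first function, doubling the cluster for a faster close also roughly doubles the intermediate storage, and that storage is billed at shared-file-system rates.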
None of these questions require deep knowledge of the actuarial workload. They are general distributed-systems questions. That is precisely the point: the architecture problem actuarial platforms now face is the same problem the broader software industry solved a decade ago.
The marketing language has converged. The architectures have not. At insurance-group scale, that gap is where the surprises live — and it is the gap worth resolving before, not after, the migration goes to the steering committee.
Dr. Le Weiliang is Chief Architect at StrinGaze. StrinGaze is an actuarial intelligence platform for life insurance, headquartered in Singapore, built cloud-native from the engine up.