
40 Integrations Later: What Insurance EDI Taught Us About Building Reliable Software

What 40+ carrier and enrollment platform integrations taught us about data trust, failure modes, and why integration engineering is the most undervalued skill in enterprise software.

By Igor Riera

Before Cerberus Labs, I spent years building data integrations in the insurance industry. The work was invisible to everyone except the people whose enrollment data had to actually flow between systems – and the people who got calls when it didn’t.

Forty-plus integrations, each one a different combination of data format, authentication method, transformation requirements, and business rules. Each one breaking in its own way, with its own potential pitfalls. What follows is what that work taught us about building software that has to survive contact with systems you don’t control.

The landscape

The insurance benefits space sits between two worlds that were never designed to talk to each other, with data structures that don’t always align.

On one side: web-based enrollment platforms like Ease, Employee Navigator, Selerix, and others. Modern-ish UIs where HR administrators and employees manage benefits selections. These systems know what coverage an employee chose, when they chose it, and what their demographic information looks like.

On the other side: carrier policy administration systems. MetLife, Lincoln Financial, Principal, Guardian Life, Reliance Standard, HCSC, and dozens more. Legacy platforms – some decades old – that manage the actual insurance policies. These systems determine whether coverage is active, what the premium is, and what gets paid on a claim.

In between: HIPAA 834 EDI feeds, custom 834s, CSVs, XML formats, RESTful APIs with varying degrees of documentation, flat files in formats that were modern when they were designed, and, since about seven years ago, a set of evolving LIMRA LDEx standards that everyone interprets slightly differently.

Every integration is a snowflake

There’s no standard integration. The 834 EDI format is technically a standard, but every non-medical carrier implements it differently. One carrier requires the subscriber’s SSN in a specific loop. Another rejects files that include it. One expects coverage effective dates in the maintenance segment. Another derives them from the enrollment segment and ignores what you send.

The REST APIs are worse in some ways, because they create an illusion of modernity. You get a Swagger doc and an endpoint, and you think the hard part is over. Then you discover that the API validates against business rules that aren’t documented, returns error messages that are carrier-internal codes, and has rate limits that aren’t in the spec.

Authentication alone is a matrix:

  • SFTP with PGP-encrypted files and IP whitelisting
  • OAuth 2.0 with client credentials
  • Mutual TLS with carrier-issued certificates
  • Basic auth over VPN tunnels
  • API keys rotated on carrier-specific schedules

Each one has its own credential management lifecycle, its own renewal process, and its own failure mode when a certificate expires at 2 AM on a Saturday.
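
One way to keep that 2 AM expiry from being a surprise is to track every credential's expiration in a single inventory and check it on a schedule. The sketch below is a minimal illustration in Python, not a description of any carrier's process: the inventory layout, the carrier names, the dates, and the 30-day warning window are all assumptions.

```python
from datetime import date, timedelta


def expiring_soon(credentials, today, warn_days=30):
    """Return credentials expiring within `warn_days` of `today`, soonest first.

    Credentials that have already expired are included as well.
    """
    cutoff = today + timedelta(days=warn_days)
    due = [c for c in credentials if c["expires_on"] <= cutoff]
    return sorted(due, key=lambda c: c["expires_on"])


# Hypothetical inventory; carrier names, kinds, and dates are placeholders.
inventory = [
    {"carrier": "carrier_a", "kind": "mtls_cert", "expires_on": date(2024, 3, 9)},
    {"carrier": "carrier_b", "kind": "pgp_key", "expires_on": date(2025, 1, 1)},
    {"carrier": "carrier_c", "kind": "api_key", "expires_on": date(2024, 2, 20)},
]

due = expiring_soon(inventory, today=date(2024, 2, 15))
for cred in due:
    print(cred["carrier"], cred["kind"], cred["expires_on"].isoformat())
```

The check is deliberately dumb. The hard part is keeping the inventory complete, since each authentication method has its own renewal process.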

The failure modes nobody tests for

Unit tests pass. Integration tests pass. The file generates correctly against the spec. Then it hits production and breaks, because production means another organization’s system with its own interpretation of reality.

The “effective date” vs “signature date” problem. Two systems define “effective date” differently or, worse, are configured with different rules driven by the signature date. The enrollment platform says coverage starts on the first of the month following enrollment. The carrier says coverage starts on the date the file is processed. An employee enrolls on January 15th. The platform sends February 1st. The carrier processes the file on January 20th and activates coverage on January 20th. Now two systems disagree about when coverage started, and neither thinks it’s wrong.
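
The disagreement is easy to reproduce in a few lines of Python. The two functions below are hypothetical stand-ins for each system's rule as described in this example, not any platform's or carrier's actual logic, and the year 2024 is assumed purely for illustration.

```python
from datetime import date


def platform_effective_date(enrolled_on: date) -> date:
    """Hypothetical platform rule: first of the month following enrollment."""
    if enrolled_on.month == 12:
        return date(enrolled_on.year + 1, 1, 1)
    return date(enrolled_on.year, enrolled_on.month + 1, 1)


def carrier_effective_date(processed_on: date) -> date:
    """Hypothetical carrier rule: coverage starts the day the file is processed."""
    return processed_on


enrolled_on = date(2024, 1, 15)   # employee enrolls (year is illustrative)
processed_on = date(2024, 1, 20)  # carrier processes the file

sent = platform_effective_date(enrolled_on)
applied = carrier_effective_date(processed_on)
drift_days = (sent - applied).days

print(f"platform sent {sent}, carrier applied {applied}, drift {drift_days} days")
```

Both functions are correct by their own definition, which is why neither side's tests catch the drift.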

The deduplication problem. An enrollment file runs daily. An employee’s record hasn’t changed, but the platform sends it anyway because it sends all active enrollments. The carrier sees the same record arrive again and interprets it as a change. Now there’s a duplicate enrollment event, and the premium calculation is wrong.

The timing problem. A qualifying life event (marriage, birth of a child) triggers a mid-month enrollment change. The next scheduled file run is in three days. The carrier’s system processes retroactive changes differently than prospective ones. By the time the file arrives, the carrier applies a different business rule than what the enrollment platform assumed. When EOI/SOH adjudication is involved, it adds another layer that can change the effective date of coverage and has to be accounted for. This is a process problem, not a technology problem.

These aren’t edge cases. They’re any given Tuesday.

What we learned about building reliable integrations

Trust nothing from external systems. Validate every field, every record, every file. Not because the other system is broken, but because the other system has a different definition of correct. Schema validation catches format errors; business rule validation catches semantic disagreements. Both are required, not optional.
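
A minimal sketch of the two layers, in Python. The record fields, the nine-digit SSN check, and the two business rules are assumptions chosen for illustration; real rules vary by carrier and have to come from the carrier, not from this example.

```python
from dataclasses import dataclass
from datetime import date
import re


@dataclass
class EnrollmentRecord:
    # Hypothetical fields for illustration; real feeds carry many more.
    member_id: str
    ssn: str
    plan_code: str
    effective_date: date
    signature_date: date


def schema_errors(rec: EnrollmentRecord) -> list[str]:
    """Format checks: is each field well formed?"""
    errors = []
    if not rec.member_id.strip():
        errors.append("member_id is empty")
    if not re.fullmatch(r"\d{9}", rec.ssn):
        errors.append("ssn must be 9 digits")
    return errors


def business_rule_errors(rec: EnrollmentRecord, known_plans: set[str]) -> list[str]:
    """Semantic checks: well-formed data can still break an assumed rule."""
    errors = []
    if rec.plan_code not in known_plans:
        errors.append(f"unknown plan_code {rec.plan_code!r}")
    # Assumed rule: coverage cannot start before the election was signed.
    if rec.effective_date < rec.signature_date:
        errors.append("effective_date precedes signature_date")
    return errors


rec = EnrollmentRecord("M001", "123456789", "LIFE-X", date(2024, 1, 1), date(2024, 1, 15))
format_problems = schema_errors(rec)
semantic_problems = business_rule_errors(rec, known_plans={"LIFE-A", "DENTAL-B"})
```

Here the record passes every format check and still fails both business rules, which is the case schema validation alone never catches.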

Design for the failure mode you haven’t seen yet. Every integration will encounter a scenario that wasn’t in the spec, because the spec is an approximation of how the system actually behaves. Build observability into every integration point: log the full payload in and out, track processing outcomes per record, and alert on patterns (not just errors). A single rejected record might be data quality. Ten rejected records with the same error code is a business rule change the carrier didn’t announce.
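
Alerting on patterns can start as something very small: count rejections by error code per run and flag any code that crosses a threshold. The Python sketch below assumes a hypothetical outcome record with status and error_code fields; the threshold of ten mirrors the example above and is not a recommended value.

```python
from collections import Counter


def repeated_rejection_codes(outcomes, threshold=10):
    """Return error codes seen on at least `threshold` rejected records."""
    counts = Counter(
        o.get("error_code") for o in outcomes if o.get("status") == "rejected"
    )
    return {code: n for code, n in counts.items() if n >= threshold}


# Hypothetical per-record outcomes from one file run; codes are made up.
outcomes = (
    [{"record_id": i, "status": "rejected", "error_code": "E417"} for i in range(10)]
    + [{"record_id": 10, "status": "rejected", "error_code": "E102"}]
    + [{"record_id": 11, "status": "accepted", "error_code": None}]
)

flagged = repeated_rejection_codes(outcomes)
print(flagged)
```

The single E102 rejection stays below the threshold and is left for data-quality review, while the repeated E417 is surfaced as a likely rule change.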

Idempotency isn’t optional. Files get reprocessed. APIs get retried. Network failures cause duplicate submissions. Every integration must handle receiving the same data twice without creating duplicate outcomes. This means tracking what you’ve processed, matching on business keys (not just technical IDs), and building reconciliation into the normal workflow, not as an afterthought.
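
As a rough sketch of what "tracking what you've processed" can look like, the Python below keys each record on business fields and fingerprints its content. Everything here is an assumption for illustration: the field names, the choice of subscriber plus plan as the business key, the list of volatile fields, and the in-memory dictionary standing in for a durable store.

```python
import hashlib
import json

# Hypothetical transport-level fields that change every run and carry no business meaning.
VOLATILE_FIELDS = {"file_id", "row_number"}


def business_key(record):
    """Identify an enrollment by business meaning, not by a technical row ID."""
    return (record["subscriber_id"], record["plan_code"])


def content_hash(record):
    """Stable fingerprint of the record's business content."""
    stable = {k: v for k, v in record.items() if k not in VOLATILE_FIELDS}
    canonical = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ProcessedLedger:
    """In-memory stand-in for a durable record of what has been processed."""

    def __init__(self):
        self._seen = {}

    def classify(self, record):
        key, digest = business_key(record), content_hash(record)
        previous = self._seen.get(key)
        if previous == digest:
            return "duplicate"  # same data again: emit no new event
        self._seen[key] = digest
        return "new" if previous is None else "changed"


ledger = ProcessedLedger()
monday = {"file_id": "f1", "row_number": 7, "subscriber_id": "S100",
          "plan_code": "DENTAL", "tier": "EE"}
tuesday = dict(monday, file_id="f2", row_number=3)      # resent, unchanged
wednesday = dict(tuesday, file_id="f3", tier="EE+SP")   # a real change

results = [ledger.classify(r) for r in (monday, tuesday, wednesday)]
print(results)
```

Tuesday's record arrives in a different file at a different row, yet it is classified as a duplicate because only the business content is compared. That is the behavior the daily full-file scenario needs.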

Reconciliation is a feature, not a maintenance task. Every integration should produce a reconciliation report that compares what was sent with what was received and processed. Discrepancies surface within hours instead of weeks. When an employee’s coverage doesn’t activate and the first sign is a denied claim or a member unable to validate coverage at point of service, the integration has failed, even if every file was delivered successfully and every system shows green.
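
At its core, a reconciliation report is a three-way diff. The Python sketch below assumes both sides can be reduced to a mapping from a business key to a coverage effective date; the key format and the sample data are hypothetical, and a real report would compare more than one attribute.

```python
def reconcile(sent, confirmed):
    """Compare what was sent with what the other side reports as processed.

    Both arguments map a business key to a coverage effective date.
    """
    missing = sorted(k for k in sent if k not in confirmed)
    unexpected = sorted(k for k in confirmed if k not in sent)
    mismatched = sorted(
        k for k in sent if k in confirmed and sent[k] != confirmed[k]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Hypothetical data: keys are "subscriber|plan", values are effective dates.
sent = {
    "S100|DENTAL": "2024-02-01",
    "S101|LIFE": "2024-02-01",
    "S102|VISION": "2024-02-01",
}
confirmed = {
    "S100|DENTAL": "2024-01-20",  # carrier applied a different effective date
    "S101|LIFE": "2024-02-01",
    "S999|LIFE": "2024-02-01",    # carrier holds a record we never sent
}

report = reconcile(sent, confirmed)
print(report)
```

Every file in this example could have been delivered successfully. The report is what shows that one enrollment never landed and another landed with the wrong date.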

Why this matters beyond insurance

The insurance industry’s integration challenges aren’t unique. They’re a concentrated version of what happens whenever systems built by different organizations, with different assumptions, need to exchange data reliably.

Every industry has its version of the 834 file – a data format that’s technically standardized but practically variable. Every industry has its version of the “effective date” vs “signature date” problem: a business concept that means different things in different systems. Every industry has its version of the silent failure: data that’s delivered successfully but processed incorrectly.

This work taught us more about building reliable software than any greenfield project ever did. It’s why every system Cerberus Labs ships is built to survive contact with systems we don’t control. Because in production, that’s the only kind of system there is.