
Trust Nothing From External Systems

After 40+ integrations across insurance, payments, and manufacturing, one pattern holds regardless of domain: external systems lie structurally. Here's the architecture that accounts for it.

By Igor Riera

Software engineers tend to think “the integration is done” when the file is delivered or the API returns 200.

It isn’t. That’s when the real integration begins.

After 40+ integrations across insurance benefits, payments, and industrial manufacturing, we’ve internalized a set of architectural principles that hold regardless of domain. The systems change. The data formats change. The industries have nothing in common on the surface. The failure modes are nearly identical underneath.

The root cause is always the same: two systems that were built independently, by different teams, with different assumptions about what words mean. Both are correct by their own definition. Neither is wrong. They just disagree, and the disagreement surfaces in production at the worst possible time.

What “done” actually means

A delivered file is not a processed file. A 200 response is not a successful operation.

We’ve seen EDI 834 files land on a carrier’s SFTP server and sit for 72 hours because a downstream processing queue was paused for maintenance. The delivery confirmation came back immediately. The file wasn’t touched. Nobody knew until an employee tried to use their benefits.

We’ve seen REST APIs return 200 with a response body that contains an error code buried four levels deep in the JSON - an internal status field the documentation didn’t mention. From the outside: success. From the carrier’s system: the record was rejected.
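One way to guard against this is to treat the HTTP status as necessary but not sufficient, and to scan the parsed body for failure markers at any depth. The sketch below is a minimal illustration, not any carrier's actual contract: the field names in `STATUS_KEYS` and the values in `FAILURE_VALUES` are hypothetical and would have to be learned from the real responses.

```python
from typing import Any, Iterator

# Hypothetical names: the real fields and values come from observing
# the external system, since the documentation may not list them.
STATUS_KEYS = {"status", "internalStatus", "errorCode"}
FAILURE_VALUES = {"REJECTED", "ERROR", "FAILED"}


def find_embedded_failures(body: Any, path: str = "$") -> Iterator[tuple[str, Any]]:
    """Walk a parsed JSON body and yield (path, value) for every
    status-like field whose value signals a failure, at any depth."""
    if isinstance(body, dict):
        for key, value in body.items():
            child = f"{path}.{key}"
            if (
                key in STATUS_KEYS
                and isinstance(value, str)
                and value.upper() in FAILURE_VALUES
            ):
                yield child, value
            yield from find_embedded_failures(value, child)
    elif isinstance(body, list):
        for index, item in enumerate(body):
            yield from find_embedded_failures(item, f"{path}[{index}]")


def is_real_success(http_status: int, body: Any) -> bool:
    """A 2xx status counts as success only if no failure is buried in the body."""
    return 200 <= http_status < 300 and not list(find_embedded_failures(body))


# A 200 response whose rejection sits four levels down.
response_body = {
    "result": {
        "batch": {
            "records": [{"id": "A1", "meta": {"internalStatus": "REJECTED"}}]
        }
    }
}
failures = list(find_embedded_failures(response_body))
```

Returning the path alongside the value makes the finding usable in logs and alerts, which matters when the field in question is not in the documentation.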

The integration isn’t done when the data leaves your system. It’s done when the data is correctly represented in the destination system’s state and you have evidence of that. Everything before that point is delivery, not integration.

This distinction shapes how we design systems that touch external boundaries.

The boundary problem

Every external system is a boundary. At that boundary, your data changes ownership and interpretation.

Validation at the boundary is non-negotiable - but there are two kinds of validation that get conflated.

Schema validation catches format errors. Required fields are present. Date formats match. Numeric fields contain numbers. This is table stakes. Any integration that doesn’t do schema validation before transmission is one malformed record away from an opaque failure.

Business rule validation catches semantic disagreements. This one is harder, and most teams skip it until something breaks.

Insurance is where we learned this most clearly. HIPAA 834 EDI is a published standard. Every carrier in the space - MetLife, Lincoln Financial, Principal, Guardian Life - implements it differently. One requires the subscriber SSN in a specific loop. Another rejects the file if SSN appears anywhere outside the member segment. Same standard. Opposite rules. Both carriers are “correct.”

LIMRA LDEx was designed to improve this. It’s a more modern standard for benefits data exchange, and over the last several years, more carriers have adopted it. It still doesn’t solve the interpretation problem. Every carrier reads the specification and builds to their interpretation of it. When those interpretations conflict, the integration breaks.

The REST API surface isn’t better. A Swagger doc looks complete. Then you discover business rules that aren’t documented, internal error codes that don’t appear in the spec, rate limits that the documentation doesn’t mention. The illusion of modernity doesn’t reduce the ambiguity - it just makes the ambiguity less visible.

Boundary validation has to account for both layers. Schema validation tells you the data is syntactically correct. Business rule validation tells you the data will be interpreted the way you expect. You need both checks before transmission, not after.
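The two layers can be kept separate in code, so that carrier-specific rules are data rather than scattered conditionals. The following is a rough sketch under stated assumptions: the record shape, the `ssn_segments` field, and the carrier names `carrier_a` and `carrier_b` are all hypothetical, and the two rules only mimic the opposite SSN requirements described above.

```python
from datetime import date
from typing import Callable, Optional

REQUIRED_FIELDS = ("member_id", "ssn", "effective_date", "ssn_segments")


def schema_errors(record: dict) -> list[str]:
    """Layer 1: format checks that do not depend on the receiving carrier."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    ssn = record.get("ssn")
    if ssn and not (isinstance(ssn, str) and ssn.isdigit() and len(ssn) == 9):
        errors.append("ssn must be 9 digits")
    effective = record.get("effective_date")
    if effective:
        try:
            date.fromisoformat(effective)
        except (TypeError, ValueError):
            errors.append("effective_date must be an ISO date")
    return errors


# Layer 2: each rule returns an error message, or None if the record passes.
Rule = Callable[[dict], Optional[str]]


def ssn_required_in_subscriber_loop(record: dict) -> Optional[str]:
    if "subscriber_loop" in record["ssn_segments"]:
        return None
    return "SSN missing from subscriber loop"


def ssn_only_in_member_segment(record: dict) -> Optional[str]:
    extra = set(record["ssn_segments"]) - {"member_segment"}
    return f"SSN not allowed in: {sorted(extra)}" if extra else None


# Hypothetical carriers with opposite interpretations of the same standard.
CARRIER_RULES: dict[str, list[Rule]] = {
    "carrier_a": [ssn_required_in_subscriber_loop],
    "carrier_b": [ssn_only_in_member_segment],
}


def validate_for_carrier(record: dict, carrier: str) -> list[str]:
    """Run schema checks first; only well-formed records reach business rules."""
    errors = schema_errors(record)
    if errors:
        return errors
    return [msg for rule in CARRIER_RULES[carrier] if (msg := rule(record)) is not None]


record = {
    "member_id": "M1",
    "ssn": "123456789",
    "effective_date": "2024-01-01",
    "ssn_segments": ["subscriber_loop"],
}
errors_a = validate_for_carrier(record, "carrier_a")
errors_b = validate_for_carrier(record, "carrier_b")
```

The same record is valid for one carrier and invalid for the other, which is the point: the business rule layer has to be keyed by destination.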

Sequence, timing, and deduplication

These three failure modes look similar in the logs but have different causes.

Sequence failures happen when records arrive in an order that violates an assumption the receiving system makes. An enrollment termination processes before the enrollment itself. A coverage change arrives before the coverage was ever established. The receiving system either rejects the record or applies it incorrectly, and the error message doesn’t always explain why.
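A simple defense is to order events before transmission and to check each one against the state it assumes. This is a minimal sketch with assumed event types (`enroll`, `change`, `terminate`) and an assumed record shape; real feeds will have more event kinds and more subtle preconditions.

```python
# Assumed tie-break order for events that share a member and effective date.
EVENT_ORDER = {"enroll": 0, "change": 1, "terminate": 2}


def order_for_transmission(events: list[dict]) -> list[dict]:
    """Sort by member, then effective date (ISO strings sort chronologically),
    then by event type so an enrollment precedes changes and terminations."""
    return sorted(
        events,
        key=lambda e: (e["member_id"], e["effective_date"], EVENT_ORDER[e["type"]]),
    )


def sequence_violations(events: list[dict], already_active=frozenset()) -> list[dict]:
    """Return events that assume coverage which does not exist yet,
    given the order in which they would be processed."""
    active = set(already_active)
    problems = []
    for event in events:
        member = event["member_id"]
        if event["type"] == "enroll":
            active.add(member)
        elif member not in active:
            problems.append(event)
        elif event["type"] == "terminate":
            active.discard(member)
    return problems


raw_events = [
    {"member_id": "M1", "type": "terminate", "effective_date": "2024-03-01"},
    {"member_id": "M1", "type": "enroll", "effective_date": "2024-01-01"},
]
violations_as_received = sequence_violations(raw_events)
violations_after_sort = sequence_violations(order_for_transmission(raw_events))
```

Running the check on your own side, before sending, turns an opaque rejection from the receiver into an explicit finding you can act on.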

Timing failures happen when two systems have different definitions of when something happened. Coverage effective date is the canonical example in insurance. One system says coverage starts on the first of the month following enrollment. Another derives effective date from when the file was processed. Same enrollment event, two different effective dates, and neither system flags an error.

In payments, we see the equivalent with SPEI settlement timing. T1 Pagos processes transactions through Mexico’s SPEI rails. Settlement isn’t instantaneous. The transaction timestamp, the settlement timestamp, and the timestamp the merchant’s bank records are three different values - sometimes by hours, sometimes by a calendar day when settlement crosses midnight. Reconciliation against the wrong timestamp produces discrepancies that look like missing transactions.
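The midnight case is easy to reproduce. The sketch below assumes a simplified transaction with only two timestamps and a fixed UTC-6 local offset chosen for illustration; the class, field names, and matching function are hypothetical and are not a description of any real settlement system.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone

# Assumed fixed local offset, for illustration only.
LOCAL_TZ = timezone(timedelta(hours=-6))


@dataclass(frozen=True)
class Transaction:
    txn_id: str
    initiated_at: datetime
    settled_at: datetime


def local_date(ts: datetime) -> date:
    """Calendar date of a timestamp in the assumed local zone."""
    return ts.astimezone(LOCAL_TZ).date()


def unmatched_statement_ids(
    txns: list[Transaction],
    statement_ids: set[str],
    statement_date: date,
    timestamp_field: str,
) -> set[str]:
    """IDs on the bank statement for a date that we cannot find on our side
    when we bucket our transactions by the chosen timestamp."""
    ours = {
        t.txn_id
        for t in txns
        if local_date(getattr(t, timestamp_field)) == statement_date
    }
    return set(statement_ids) - ours


# Initiated just before midnight, settled just after.
txn = Transaction(
    "T1",
    initiated_at=datetime(2024, 5, 10, 23, 50, tzinfo=LOCAL_TZ),
    settled_at=datetime(2024, 5, 11, 0, 20, tzinfo=LOCAL_TZ),
)
statement_date = date(2024, 5, 11)
by_initiated = unmatched_statement_ids([txn], {"T1"}, statement_date, "initiated_at")
by_settled = unmatched_statement_ids([txn], {"T1"}, statement_date, "settled_at")
```

Bucketing by initiation date reports the transaction as unmatched; bucketing by settlement date matches it. Nothing is missing in either case, only the choice of timestamp differs.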

Deduplication failures happen when an external system sends the same logical event more than once, or when retry logic on your side causes duplicate submissions. A daily enrollment file that includes all active records - not just changed ones - means the same employee record arrives every day. If the receiving system interprets a repeat record as a change event, premiums double. If your retry logic fires after a network timeout without confirming whether the first request succeeded, you submit twice.

Idempotency isn’t optional. Every integration needs a deduplication strategy: a business key (not a technical ID) that identifies whether a record has already been processed, and logic to handle receiving the same data a second time without producing a duplicate outcome.
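As a minimal sketch of that strategy, the code below classifies each incoming record as new, duplicate, or changed. The choice of business key (employee, plan, effective date) is an assumption for illustration, and the in-memory dictionary stands in for whatever durable store a real system would use.

```python
import hashlib
import json


def business_key(record: dict) -> tuple:
    """Identify the logical record by business meaning, not by a technical ID.
    The fields used here are an assumed example."""
    return (record["employee_id"], record["plan_code"], record["effective_date"])


def content_hash(record: dict) -> str:
    """Stable fingerprint of the record's content."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


class Deduplicator:
    """Tracks the last content seen per business key (in memory, for the sketch)."""

    def __init__(self) -> None:
        self._seen: dict[tuple, str] = {}

    def classify(self, record: dict) -> str:
        key = business_key(record)
        digest = content_hash(record)
        previous = self._seen.get(key)
        if previous == digest:
            return "duplicate"  # same logical record, same content: do nothing
        self._seen[key] = digest
        return "new" if previous is None else "changed"


dedup = Deduplicator()
record = {
    "employee_id": "E100",
    "plan_code": "DENTAL",
    "effective_date": "2024-01-01",
    "tier": "employee_only",
}
first = dedup.classify(record)
second = dedup.classify(record)  # the same record in tomorrow's full file
third = dedup.classify({**record, "tier": "family"})
```

Only `new` and `changed` should produce a downstream effect. A full daily file then becomes harmless, because repeated records classify as `duplicate`.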

The three failure modes interact. A sequence failure can cause a timing failure. A deduplication failure can look like a sequence failure. Debugging integration problems without good observability means guessing which one you’re dealing with.

Reconciliation as architecture

Most teams treat reconciliation as a recovery task - something you do after something breaks.

We design it into the system from day one.

Reconciliation means the system continuously compares what it believes to be true against what the external system believes to be true, and surfaces discrepancies before they become operational problems. This is not a cron job that runs weekly. It’s a core workflow component.

For insurance integrations, this means producing a reconciliation report after every file cycle. Records sent versus records acknowledged. Coverage states in the enrollment platform versus coverage states at the carrier. Discrepancies surfaced within hours, not when an employee tries to use their benefits and finds out their coverage isn’t active.
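A per-cycle report of this kind can be small. The sketch below assumes three inputs that a real pipeline would have to produce: what was sent (member ID to expected coverage state), which IDs the carrier acknowledged, and the carrier's current view of coverage. All names and shapes are hypothetical.

```python
def reconcile_cycle(
    sent: dict[str, str],
    acknowledged: set[str],
    carrier_state: dict[str, str],
) -> dict:
    """Compare what we sent and expect against what the carrier reports."""
    sent_ids = set(sent)
    return {
        # Sent, but the carrier never confirmed receipt of the record.
        "unacknowledged": sorted(sent_ids - set(acknowledged)),
        # Our expected coverage state differs from the carrier's state.
        "state_mismatches": {
            member: {"ours": sent[member], "carrier": carrier_state.get(member)}
            for member in sorted(sent_ids)
            if carrier_state.get(member) != sent[member]
        },
        # The carrier holds coverage for members we did not send.
        "unexpected_at_carrier": sorted(set(carrier_state) - sent_ids),
    }


report = reconcile_cycle(
    sent={"M1": "active", "M2": "active", "M3": "terminated"},
    acknowledged={"M1", "M3"},
    carrier_state={"M1": "active", "M3": "active", "M9": "active"},
)
```

In this example M2 was never acknowledged, M3 was acknowledged but is still active at the carrier, and M9 exists only on the carrier's side. Each is a different kind of discrepancy and likely needs a different owner.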

For payment systems, it means tracking every transaction through its complete lifecycle - initiated, settled, reconciled against the bank statement - and alerting when a transaction doesn’t complete each stage within the expected window. An advisory lock on financial state ensures that concurrent settlement operations don’t produce inconsistent balances.
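The lifecycle check can be sketched as a table of maximum time allowed in each stage. The stage names follow the text above; the window values and the record shape are assumptions for illustration, not recommended thresholds.

```python
from datetime import datetime, timedelta, timezone

# Assumed maximum time a transaction may stay in a stage before we alert.
# "reconciled" is terminal and therefore has no window.
MAX_TIME_IN_STAGE = {
    "initiated": timedelta(hours=6),
    "settled": timedelta(days=2),
}


def stalled_transactions(txns: list[dict], now: datetime) -> list[str]:
    """Return IDs of transactions that have exceeded their stage window."""
    stalled = []
    for txn in txns:
        limit = MAX_TIME_IN_STAGE.get(txn["stage"])
        if limit is not None and now - txn["entered_stage_at"] > limit:
            stalled.append(txn["txn_id"])
    return stalled


now = datetime(2024, 5, 12, 12, 0, tzinfo=timezone.utc)
transactions = [
    {"txn_id": "T1", "stage": "initiated",
     "entered_stage_at": datetime(2024, 5, 12, 1, 0, tzinfo=timezone.utc)},
    {"txn_id": "T2", "stage": "settled",
     "entered_stage_at": datetime(2024, 5, 11, 12, 0, tzinfo=timezone.utc)},
    {"txn_id": "T3", "stage": "reconciled",
     "entered_stage_at": datetime(2024, 5, 1, 0, 0, tzinfo=timezone.utc)},
    {"txn_id": "T4", "stage": "initiated",
     "entered_stage_at": datetime(2024, 5, 12, 9, 0, tzinfo=timezone.utc)},
]
alerts = stalled_transactions(transactions, now)
```

Passing `now` in as a parameter keeps the check deterministic and easy to test, which is useful for a rule that only fires when something has gone quiet.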

For manufacturing, the data feeds from production systems carry their own version of this problem. Equipment data arrives from systems that weren’t designed to talk to external software, timestamps are generated by industrial controllers with varying clock accuracy, and the same production event can arrive through multiple channels. The reconciliation question is which source is authoritative, not just whether the data arrived.

Designing for reconciliation changes the data model. You need to track processing state per record, not just per file. You need a way to identify the same logical record across multiple transmissions. You need alerting that triggers on patterns, not just on complete failures: ten records with the same error code is a business rule change, not noise.
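Pattern-based alerting falls out naturally once state is tracked per record. The sketch below assumes each record carries a `state` and, when rejected, an `error_code`; the threshold of ten and the codes themselves are hypothetical.

```python
from collections import Counter


def error_patterns(record_states: list[dict], threshold: int = 10) -> dict[str, int]:
    """Return error codes that appear on at least `threshold` rejected records.
    A cluster of identical errors suggests a rule change on the other side."""
    counts = Counter(
        r["error_code"] for r in record_states if r["state"] == "rejected"
    )
    return {code: n for code, n in counts.items() if n >= threshold}


# Per-record state for one file cycle: ten rejections share a code,
# one rejection is unrelated, and the rest were accepted.
record_states = (
    [{"record_id": f"R{i}", "state": "rejected", "error_code": "E42"} for i in range(10)]
    + [{"record_id": "R10", "state": "rejected", "error_code": "E07"}]
    + [{"record_id": f"R{i}", "state": "accepted", "error_code": None} for i in range(11, 50)]
)
patterns = error_patterns(record_states)
```

Here the file as a whole mostly succeeded, so a file-level check would stay green, while the per-record view surfaces the cluster of `E42` rejections.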

Knowing which side owns correctness

Some integrations treat the external system as the source of truth. Some treat your system. The answer affects every architectural decision downstream, and teams often get it wrong because they don’t make the decision explicitly.

When an insurance carrier’s system is the source of truth for whether coverage is active, your system is a subscriber - you listen, you sync, you reconcile against the carrier’s state. You do not take actions that assume correctness until you’ve confirmed against the authoritative source. An enrollment that looks successful in your system isn’t successful until the carrier has processed it and your reconciliation layer has confirmed the match.

When your system is the source of truth - as it often is in payments, where your transaction record is the legal record of what happened - the external system is a downstream receiver. You transmit to it; you don’t trust its state over yours. If the external system loses data, you resubmit from your record.

The hybrid case is the hardest: integrations where both sides contribute authoritative data for different fields. An enrollment platform owns demographic information. A carrier owns underwriting decisions and premium calculations. Neither owns the full record. Reconciliation in this case requires field-level source-of-truth tracking, not just record-level.
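Field-level ownership can be made explicit as a lookup table that drives both the merge and the conflict report. This is a minimal sketch; the field names and the ownership assignments are assumptions that mirror the example in the text.

```python
# Assumed ownership map: which side is authoritative for each field.
FIELD_OWNER = {
    "first_name": "platform",
    "address": "platform",
    "premium": "carrier",
    "underwriting_decision": "carrier",
}


def merge_by_owner(platform: dict, carrier: dict) -> tuple[dict, dict]:
    """Build the combined record from each field's owner, and report
    fields where both sides hold a value and the values disagree."""
    merged, conflicts = {}, {}
    for field, owner in FIELD_OWNER.items():
        ours, theirs = platform.get(field), carrier.get(field)
        merged[field] = ours if owner == "platform" else theirs
        if ours is not None and theirs is not None and ours != theirs:
            conflicts[field] = {
                "platform": ours,
                "carrier": theirs,
                "authoritative": owner,
            }
    return merged, conflicts


platform_record = {"first_name": "Ana", "address": "Calle 1", "premium": 100.0}
carrier_record = {
    "first_name": "Ana",
    "address": "Calle 2",
    "premium": 120.0,
    "underwriting_decision": "approved",
}
merged, conflicts = merge_by_owner(platform_record, carrier_record)
```

The conflict report names the authoritative side for each disagreement, so a discrepancy on `address` gets pushed to the carrier while a discrepancy on `premium` gets pulled from it.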

Getting this wrong produces systems where the same piece of data has two conflicting “correct” values and nobody can tell which one to trust.

The rule

Every system we build that touches an external boundary is designed to survive contact with that system under adversarial conditions.

Not adversarial in the security sense - the other system isn’t malicious. Adversarial in the operational sense: the other system will send unexpected data, will implement standards differently than documented, will change behavior without notice, will return success responses that aren’t successes.

This means validation at ingestion, deduplication by business key, reconciliation as a first-class workflow, and explicit decisions about which side owns correctness for every field that crosses the boundary.

It also means accepting that you can’t fully know an external system until you’ve run against it in production. The spec is the beginning of the conversation, not the end of it. Build the system assuming you’re going to learn things about the external system that aren’t in the documentation - because you will.

The authentication matrix alone - SFTP with PGP encryption, OAuth 2.0 with client credentials, mutual TLS with carrier-issued certificates, API keys on carrier-specific rotation schedules - means that every external system has its own credential lifecycle, its own failure mode when a certificate expires at 2 AM, its own process for renewal. That’s before you’ve sent a single byte of data.

The integration is done when you can detect your own failures, reconcile your state against the external system’s state, and recover from discrepancies without manual intervention. Everything before that is plumbing.