Designing a Daily Financial Settlement System That Can't Run Twice

Every Bip QR recharge processed through MACHBANK creates a liability: money the platform owes to Orcen (Globe), the underlying transit card provider. That debt needs to be settled daily, on a precise schedule, through a chain of real banking operations — across accounts, across systems, against a deadline the provider doesn’t move.

The challenge wasn’t just building the payment logic. It was building a flow that could run automatically every business day without human intervention, handle infrastructure failures gracefully, never pay twice, and still give the finance team enough visibility and control to catch discrepancies before money left the account.

Why This Is Hard

Financial settlement flows sit at the intersection of two failure modes that normally don’t coexist in software:

Infrastructure failures — database timeouts, network errors, downstream service unavailability — should be retried automatically. The operation didn’t complete; try again.

Business logic failures — wrong state, already processed, missing data — should never be retried. Retrying them doesn’t fix anything. It just tries to do something wrong, again.

Standard retry logic doesn’t distinguish between them. A naive “retry on any error” approach in a payment flow is a path to double payments. A naive “dequeue on any error” approach means a transient MongoDB timeout silently drops a settlement.

The architecture needs to tell them apart.

A State Machine With One Rule Per Handler

The compensation flow is modeled as a state machine with five states:

CREATED → INTERNAL_TRANSFERED → AMOUNT_RECEIVED → INTERNAL_RETRIEVED → COMPLETED

Each state transition belongs to exactly one handler. No handler can execute unless the compensation is in the state it expects. If the state is wrong, the handler’s guard throws a BusinessError and the message is dequeued without side effects, which is what makes a duplicate delivery harmless. The guard for the first handler:

function validateCompensationStatus(compensationStatus: CompensationState): void {
  const validStates = new Set([CompensationState.CREATED]);
  if (!validStates.has(compensationStatus)) {
    throw new BusinessError(errors.invalidStatus.name, errors.invalidStatus.message);
  }
}
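
One way to picture the one-handler-per-transition rule is a transition table. The sketch below is illustrative only: the nextState table, the advance helper, and the in-memory record shape are assumptions made for this example, not the production code.

```typescript
// Illustrative sketch: `nextState`, `advance`, and the record shape are assumptions.
const CompensationState = {
  CREATED: 'CREATED',
  INTERNAL_TRANSFERED: 'INTERNAL_TRANSFERED',
  AMOUNT_RECEIVED: 'AMOUNT_RECEIVED',
  INTERNAL_RETRIEVED: 'INTERNAL_RETRIEVED',
  COMPLETED: 'COMPLETED',
} as const;
type CompensationState = (typeof CompensationState)[keyof typeof CompensationState];

class BusinessError extends Error {}

// Every state has at most one successor, so every transition has exactly one owner.
const nextState: Record<CompensationState, CompensationState | null> = {
  CREATED: 'INTERNAL_TRANSFERED',
  INTERNAL_TRANSFERED: 'AMOUNT_RECEIVED',
  AMOUNT_RECEIVED: 'INTERNAL_RETRIEVED',
  INTERNAL_RETRIEVED: 'COMPLETED',
  COMPLETED: null,
};

interface Compensation {
  id: string;
  state: CompensationState;
}

// Advance only if the record is in the state this handler owns; otherwise reject.
function advance(record: Compensation, expected: CompensationState): Compensation {
  if (record.state !== expected) {
    throw new BusinessError(`invalid status: ${record.state}`);
  }
  const next = nextState[record.state];
  if (next === null) {
    throw new BusinessError('compensation already completed');
  }
  return { ...record, state: next };
}

const created: Compensation = { id: 'c-1', state: CompensationState.CREATED };
const funded = advance(created, CompensationState.CREATED);
console.log(funded.state); // INTERNAL_TRANSFERED
```

Calling advance(funded, CompensationState.CREATED) a second time throws, which is the behaviour a duplicate message delivery would hit.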

The shouldRequeueError function draws the line between the two failure modes:

export function shouldRequeueError(error: Error): boolean {
  if (isMongoConnectionOrTimeoutError(error)) return true;
  const requeueable = new Set(['The request was not successful']);
  return requeueable.has(error.message);
}

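MongoDB connection or timeout errors and HTTP failures from the banking service (matched here by the message 'The request was not successful'): requeue. Everything else, including the BusinessError thrown by a state guard: dequeue. The retry decision is made at the infrastructure boundary, not in business logic.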
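
To make the split concrete, here is a minimal sketch of how a queue consumer might turn a thrown error into a queue action. The decideQueueAction helper and the name-based Mongo check are assumptions for illustration; the real isMongoConnectionOrTimeoutError is not shown in this article.

```typescript
// Illustrative sketch: `decideQueueAction` and the Mongo check below are assumptions.
type QueueAction = 'REQUEUE' | 'DEQUEUE';

class BusinessError extends Error {}

// Stand-in for the real helper; assumes driver errors can be recognised by name.
function isMongoConnectionOrTimeoutError(error: Error): boolean {
  const names = ['MongoNetworkError', 'MongoNetworkTimeoutError', 'MongoServerSelectionError'];
  return names.includes(error.name);
}

function shouldRequeueError(error: Error): boolean {
  if (isMongoConnectionOrTimeoutError(error)) return true;
  const requeueable = new Set(['The request was not successful']);
  return requeueable.has(error.message);
}

// Anything that is not explicitly retryable is dropped from the queue.
function decideQueueAction(thrown: unknown): QueueAction {
  const error = thrown instanceof Error ? thrown : new Error(String(thrown));
  return shouldRequeueError(error) ? 'REQUEUE' : 'DEQUEUE';
}

const timeout = new Error('connection timed out');
timeout.name = 'MongoNetworkTimeoutError';

console.log(decideQueueAction(timeout)); // REQUEUE
console.log(decideQueueAction(new Error('The request was not successful'))); // REQUEUE
console.log(decideQueueAction(new BusinessError('invalid status'))); // DEQUEUE
```

The default is the conservative one: an error nobody classified is treated as a business failure and never retried.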

Two Cronjobs, Two Moments

The flow has a natural split: calculate and fund internally on accounting day close; pay the provider the following day. This split is enforced by two AWS EventBridge rules firing at different times.

Cron 1 — 14:00 CLT, every weekday:

"schedule_expression": "cron(0 17 ? * MON-FRI *)"

17:00 UTC is 14:00 in Chile while the country is on summer time (UTC-3); on winter time (UTC-4) the same expression fires at 13:00 local. 14:00 is when the accounting day closes. The compensation-triggered handler fires, calculates everything owed for the prior accounting period, and funds the Transfer Account immediately.

Cron 2 — 12:00 CLT, every weekday:

"schedule_expression": "cron(0 15 ? * MON-FRI *)"

15:00 UTC is 12:00 in Chile at UTC-3. This is two hours before Orcen’s deadline. Firing at noon gives the finance team a review window before the provider expects the money. The retrieve-to-internal-account handler fires and initiates the actual provider payment.

The Business Day Problem

Chilean banking doesn’t run on weekends or public holidays. If you fire a cron on Monday, the prior business day was Friday — but if there was a long weekend, it might have been Thursday or earlier. The compensation window needs to cover the entire gap.

The solution is a lookback loop:

function getCompensationStartDate(endDate: Date): Date {
  let startDate: Date | undefined;
  let counter = 0;
  do {
    counter += 1;
    startDate = new Date(
      endDate.getFullYear(),
      endDate.getMonth(),
      endDate.getDate() - counter,
      startDateHours,
      startDateMinutes,
    );
  } while (!isBusinessDay(startDate));
  // Apply Chilean timezone offset
  const dateOffsetDueToTimezone = getDateOffsetInHours(startDate);
  startDate.setHours(startDate.getHours() + dateOffsetDueToTimezone);
  return startDate;
}

isBusinessDay uses the date-holidays library configured for Chile (new Holidays('CL')), which knows Chilean national holidays. The loop walks backwards one day at a time until it finds a business day — Friday for a normal weekend, Thursday for a three-day weekend, and so on.

The entire compensation window is computed correctly regardless of what happened between the last business day and today.
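
The lookback can be checked against a concrete long weekend. The sketch below is a simplified stand-in: it replaces the date-holidays lookup with a hard-coded sample set (Good Friday 2024) and works on UTC calendar days, so previousBusinessDay, toIsoDay, and sampleHolidays are all assumptions for illustration.

```typescript
// Illustrative sketch: the holiday set is sample data, not the date-holidays library.
const sampleHolidays: ReadonlySet<string> = new Set(['2024-03-29']); // Good Friday 2024

function toIsoDay(date: Date): string {
  return date.toISOString().slice(0, 10);
}

function isBusinessDay(date: Date, holidays: ReadonlySet<string>): boolean {
  const day = date.getUTCDay(); // 0 = Sunday, 6 = Saturday
  return day !== 0 && day !== 6 && !holidays.has(toIsoDay(date));
}

// Walk back one calendar day at a time until a business day is found.
function previousBusinessDay(from: Date, holidays: ReadonlySet<string>): Date {
  const cursor = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate()));
  do {
    cursor.setUTCDate(cursor.getUTCDate() - 1);
  } while (!isBusinessDay(cursor, holidays));
  return cursor;
}

// Monday after Easter: Sunday, Saturday, and Good Friday are all skipped.
const afterEaster = previousBusinessDay(new Date('2024-04-01T00:00:00Z'), sampleHolidays);
console.log(toIsoDay(afterEaster)); // 2024-03-28

// An ordinary Monday falls back to the previous Friday.
const ordinaryMonday = previousBusinessDay(new Date('2024-04-08T00:00:00Z'), sampleHolidays);
console.log(toIsoDay(ordinaryMonday)); // 2024-04-05
```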

The Dual-Amount Problem

This is the design decision that drew the most questions. MACHBANK calculates the compensation amount internally by summing completed payments. Orcen sends its own expected amount via email. These two numbers should match, but they might not.

Why maintain two amounts instead of just using one?

The internal amount funds the Transfer Account at 14:00. This step happens before Orcen’s email arrives. MACHBANK can’t wait for the provider to say how much to move; it needs to pre-fund the account using its own accounting.

The provider amount is what actually gets paid to Orcen. Their email arrives the following day, before the 12:00 cron fires. The retrieve-to-internal-account and deposit-to-provider-account handlers use this figure for the actual transfer.

If the two amounts differ, the Transfer Account has been over- or under-funded. The finance team sees this discrepancy surfaced in Slack with a percentage difference:

// toFixed(2) returns a string with two decimals, ready for the Slack message
const percentageDifference = Math.abs(
  ((providerAmount - amount) / amount) * 100
).toFixed(2);

They have from when the email arrives until 12:00 to decide if they need to intervene. After payment, reconciliation adjustments handle any remaining gap.

The compensation always happens — discrepancies are reconciled after the fact, not used as a reason to halt.
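
A worked example helps when reading those Slack messages. The sketch below is a hypothetical helper (describeDiscrepancy and its return shape are assumptions) that applies the same percentage formula, adds the direction of the gap, and guards against a zero internal amount, which would otherwise divide by zero.

```typescript
// Illustrative sketch: `describeDiscrepancy` and its return shape are assumptions.
interface Discrepancy {
  difference: number; // providerAmount minus internalAmount
  percentage: string | null; // null when the internal amount is zero
  funding: 'MATCH' | 'OVER_FUNDED' | 'UNDER_FUNDED';
}

function describeDiscrepancy(internalAmount: number, providerAmount: number): Discrepancy {
  const difference = providerAmount - internalAmount;
  const percentage =
    internalAmount === 0 ? null : Math.abs((difference / internalAmount) * 100).toFixed(2);
  // The Transfer Account was funded with the internal amount, so a larger
  // provider amount means the account holds less than what must be paid.
  const funding = difference === 0 ? 'MATCH' : difference > 0 ? 'UNDER_FUNDED' : 'OVER_FUNDED';
  return { difference, percentage, funding };
}

const gap = describeDiscrepancy(1_000_000, 1_012_500);
console.log(gap); // { difference: 12500, percentage: '1.25', funding: 'UNDER_FUNDED' }
```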

The Transfer Account as a Buffer

The money doesn’t flow directly from the Product Account to Orcen. It moves through an intermediate Transfer Account (a BCI Cuenta Corriente):

Product Account → Transfer Account → Orcen Account

This two-step movement exists for a reason. The internal funding step (Product → Transfer) uses MACHBANK’s calculated amount. The payment step (Transfer → Orcen) uses the provider’s amount. The Transfer Account absorbs any difference between them — it’s the buffer where reconciliation happens.

Both steps use BCI’s Cuenta Corriente service via the internal business-payroll microservice — transferFunds for credits, retrieveFunds for debits. Because Orcen’s account also lives in BCI, the money never leaves the BCI ecosystem, which makes the operation faster and removes the need for inter-bank coordination.

Audit Trail at Every Step

Each handler sends a structured Slack notification before returning. The finance team can follow the entire flow in real time:

Compensation created → amount calculated → period covered
Internal transfer complete → funds in Transfer Account
Provider amount received → percentage difference from internal calculation
Charge executed → provider payment initiated
COMPENSACIÓN FINALIZADA → payment complete

These aren’t just notifications. They’re the operational audit trail. If something goes wrong at any step, the team knows exactly where the flow stopped, what the amounts were, and what the state machine’s last known state was.
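
For the notifications to work as an audit trail, each one needs to carry the same core fields. The sketch below shows one possible shape; StepNotification, formatStepMessage, and the sample values are assumptions, and the real messages may be formatted quite differently.

```typescript
// Illustrative sketch: the payload shape and `formatStepMessage` are assumptions.
interface StepNotification {
  compensationId: string;
  state: string; // state machine state after the handler ran
  amount: number; // internally calculated amount
  periodStart: string;
  periodEnd: string;
}

// One line per step, always with the same fields in the same order.
function formatStepMessage(n: StepNotification): string {
  return [
    `Compensation ${n.compensationId}`,
    `state: ${n.state}`,
    `amount: ${n.amount}`,
    `period: ${n.periodStart} to ${n.periodEnd}`,
  ].join(' | ');
}

const message = formatStepMessage({
  compensationId: 'c-1',
  state: 'CREATED',
  amount: 1000000,
  periodStart: '2024-03-28',
  periodEnd: '2024-04-01',
});
console.log(message);
// Compensation c-1 | state: CREATED | amount: 1000000 | period: 2024-03-28 to 2024-04-01
```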

The Timezone Problem Nobody Talks About

The cron fires at 17:00 UTC. In Chile, clocks show 14:00 — but only sometimes. Chile observes daylight saving time, shifting between UTC-3 in summer and UTC-4 in winter. The UTC offset isn’t a constant.

This matters because the compensation window is defined in Chilean business time: from 14:00 CLT of the last business day to 13:59 CLT of today. If the code hardcodes the UTC offset, it is correct for part of the year and wrong for the rest, calculating a window that is off by a full hour and potentially including or excluding an hour of completed payments.

The solution is to compute the offset dynamically at runtime:

function getDateOffsetInHours(date: Date): number {
  const chileanDate = transformDateToChileanDate(date);
  // Difference between the two readings, rounded to whole minutes, then expressed in hours
  const diffInMinutes = Math.round((date.getTime() - chileanDate.getTime()) / (1000 * 60));
  return diffInMinutes / 60;
}

transformDateToChileanDate converts the UTC date to its Chilean equivalent using the America/Santiago timezone. The difference between the two gives the current offset — dynamically, for that specific date. This value is then applied when constructing the window boundaries:

function getCompensationEndDate(triggeredDate: Date): Date {
  const offSet = getDateOffsetInHours(triggeredDate);
  return new Date(
    triggeredDate.getFullYear(),
    triggeredDate.getMonth(),
    triggeredDate.getDate(),
    endDateHours + offSet,  // endDateHours is 13 (CLT) → adjusted to UTC
    endDateMinutes,
  );
}

The same offset is applied when computing the start date. This means the compensation window is always [14:00 CLT → 13:59 CLT] regardless of what UTC offset Chile is currently using — and regardless of when in the year the code runs.

It’s a small function. If it’s wrong, the financial window is shifted by an hour for part of every year: silent, invisible, and discovered only during reconciliation.
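
The offset can be verified for any date without touching the production helpers. The sketch below uses the standard Intl.DateTimeFormat API with the America/Santiago zone; chileOffsetInHours is a hypothetical helper written for this example and is not the article’s transformDateToChileanDate. It assumes a runtime with full time zone data, which current Node.js releases ship by default.

```typescript
// Illustrative sketch: `chileOffsetInHours` is a hypothetical helper.
function chileOffsetInHours(instant: Date): number {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: 'America/Santiago',
    hourCycle: 'h23',
    year: 'numeric',
    month: 'numeric',
    day: 'numeric',
    hour: 'numeric',
    minute: 'numeric',
    second: 'numeric',
  }).formatToParts(instant);

  const get = (type: Intl.DateTimeFormatPartTypes): number =>
    Number(parts.find((part) => part.type === type)?.value);

  // Rebuild the Santiago wall-clock reading as if it were UTC, then compare.
  const wallClockAsUtc = Date.UTC(
    get('year'),
    get('month') - 1,
    get('day'),
    get('hour'),
    get('minute'),
    get('second'),
  );
  return Math.round((instant.getTime() - wallClockAsUtc) / 60_000) / 60;
}

// The same 17:00 UTC trigger, in summer and in winter.
console.log(chileOffsetInHours(new Date('2024-01-15T17:00:00Z'))); // 3
console.log(chileOffsetInHours(new Date('2024-07-15T17:00:00Z'))); // 4
```

With an offset of 3 the 17:00 UTC trigger lands at 14:00 in Santiago; with an offset of 4 it lands at 13:00, which is the hour a hardcoded offset would get wrong.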

The Escape Hatch

EventBridge cronjobs are the normal trigger, but the handlers accept a triggerMethod parameter (AUTOMATIC or MANUAL) and a date override. The business logic is fully decoupled from the scheduler:

export default async function compensationTriggered(
  args: ICompensationTriggeredInput
): Promise<ICompensationTriggeredOutput | undefined> {
  const { triggerMethod, date = new Date() } = args;
  if (!isBusinessDay(date)) return undefined;
  // ...
}

If a cron fires incorrectly, if a weekend override is needed, or if a step fails and needs to be rerun with a specific date — the same handlers work for manual invocation. The cron is just a delivery mechanism.
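
A manual rerun then looks like an ordinary function call. The sketch below is a simplified, synchronous stand-in for the handler: the output shape (windowEndDay) and the weekend-only isBusinessDay are assumptions, and the real handler is async and does far more after the guard.

```typescript
// Illustrative sketch: a synchronous stand-in with a weekend-only isBusinessDay.
type TriggerMethod = 'AUTOMATIC' | 'MANUAL';

interface ICompensationTriggeredInput {
  triggerMethod: TriggerMethod;
  date?: Date; // defaults to "now" when the scheduler invokes the handler
}

// Hypothetical output shape, only for this example.
interface ICompensationTriggeredOutput {
  triggerMethod: TriggerMethod;
  windowEndDay: string;
}

function isBusinessDay(date: Date): boolean {
  const day = date.getUTCDay();
  return day !== 0 && day !== 6;
}

function compensationTriggered(
  args: ICompensationTriggeredInput,
): ICompensationTriggeredOutput | undefined {
  const { triggerMethod, date = new Date() } = args;
  if (!isBusinessDay(date)) return undefined; // same guard as the scheduled path
  return { triggerMethod, windowEndDay: date.toISOString().slice(0, 10) };
}

// Rerun the settlement for Friday 8 March 2024 by hand.
const rerun = compensationTriggered({
  triggerMethod: 'MANUAL',
  date: new Date('2024-03-08T17:00:00Z'),
});
console.log(rerun); // { triggerMethod: 'MANUAL', windowEndDay: '2024-03-08' }
```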

What Made It Work

The hardest part of this system wasn’t the banking integration or the scheduling. It was the state machine discipline.

It’s tempting to build payment flows as sequential scripts — do step 1, then step 2, then step 3. That’s fine until step 2 fails halfway through, or until the same script gets triggered twice, or until a downstream service times out at step 3. Sequential scripts make it hard to answer “where are we?” and impossible to safely resume.

Atomic handlers with explicit state transitions make every question answerable. The compensation record is always in a well-defined state. Any handler can run again safely on the same compensation — it will either advance the state or reject the execution because the state is wrong. There’s no ambiguous in-between.

That property — the ability to re-run any step safely — is what makes the retry strategy possible and what makes the system reliable in production.