
How Two Functions Gave Us Full Visibility Into Every External Provider

When your platform depends on third-party providers, blind spots are expensive. This is the story of how two small functions (a thin wrapper over Axios and the event recorder beneath it) gave MACHBANK real-time dashboards, status code breakdowns, latency tracking, and webhook analytics, all without touching any business logic.

MACHBANK’s QR payment platform depends on two external providers: Transbank for payment processing and Spreedly for card vaulting. Combined, they handle every transaction that flows through the platform.

For a long time, we had no direct visibility into their health. Not zero metrics — we had the usual suspects: Lambda error rates, service-level latency, database timings. But nothing that answered the operational questions that actually matter: Is Transbank returning more 422s than usual? Which specific endpoint is slow? How long does it typically take for a Transbank webhook to arrive after a transaction is initiated?

When something degraded, we found out through support tickets. The fix was smaller than you’d expect.

The Constraint That Shaped the Design

New Relic was already running in the service. Adding observability wasn’t a question of infrastructure — it was a question of what data to emit and when.

The constraint: don’t change business logic. Integration code in a payments service is dense and careful. Adding instrumentation by modifying each individual HTTP call means touching many files, adding lines inside try/catch blocks, and creating new failure modes if the monitoring call itself throws. Every touch point is a place the next engineer has to understand.

The right design instruments at the boundary, not inside.

One Wrapper, All Providers

Every outbound HTTP call to Transbank and Spreedly already went through Axios. That’s a single chokepoint. A wrapper function at that chokepoint can record the outcome of every call without any integration code knowing it’s there.

import axios, { AxiosError, AxiosRequestConfig } from 'axios';
// MonitoredRequestContent is the project's own event payload type.

export async function makeMonitoredRequest<T>(
  requestConfiguration: AxiosRequestConfig,
  eventData: MonitoredRequestContent,
): Promise<T> {
  const { provider, endpoint, ...customEventData } = eventData;
  const initialTime = performance.now();
  try {
    const { data, status } = await axios.request<T>(requestConfiguration);
    recordMonitoredEvent({
      provider, endpoint,
      method: requestConfiguration.method as string,
      statusCode: status.toString(),
      type: 'success',
      time: (performance.now() - initialTime) / 1000,
      ...customEventData,
    });
    return data;
  } catch (error) {
    const err = error as AxiosError;
    recordMonitoredEvent({
      provider, endpoint,
      method: requestConfiguration.method as string,
      type: 'error',
      statusCode: (err.response?.status ?? 500).toString(),
      time: (performance.now() - initialTime) / 1000,
      ...customEventData,
    });
    throw error;
  }
}
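The wrapper delegates to recordMonitoredEvent, which the post doesn't show. A minimal sketch, assuming it's a thin shim over the New Relic Node agent's recordCustomEvent(eventType, attributes) API; the agent call is injected here only so the sketch stays self-contained without the agent installed:

```typescript
type EventAttributes = Record<string, string | number | boolean>;
type CustomEventRecorder = (eventType: string, attributes: EventAttributes) => void;

// In production the recorder would be newrelic.recordCustomEvent from the
// Node agent; injecting it keeps the sketch runnable and testable.
export function makeRecordMonitoredEvent(recordCustomEvent: CustomEventRecorder) {
  return function recordMonitoredEvent(event: EventAttributes): void {
    // One fixed event type ('Providers') is what keeps every provider
    // and endpoint queryable from a single NRQL table.
    recordCustomEvent('Providers', event);
  };
}
```

The fixed event type is the important detail: every call site writes to the same table, so generic queries keep working as providers are added.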

Three things worth noting:

It always re-throws. The error path records the event and then throws the original error unchanged. Business logic that was handling a specific AxiosError still handles that exact same error. Nothing changes downstream.

Timing is measured end-to-end. performance.now() is called before the request and the delta is computed in both success and error paths. This captures real latency including DNS, TLS, and response body transfer — not just processing time.

The event schema is open. The ...customEventData spread lets callers attach any additional fields relevant to that specific call. An endpoint might include the channelCode or a transaction-type identifier. These extra fields show up on the event and become filterable dimensions in NRQL queries.
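A quick illustration of what that spread does (channelCode is a hypothetical extra field, not from the post):

```typescript
// Splitting off provider/endpoint leaves the extra fields in customEventData;
// spreading them back makes each one a top-level, filterable event attribute.
const eventData = { provider: 'Transbank', endpoint: 'qr-codes', channelCode: 'APP' };
const { provider, endpoint, ...customEventData } = eventData;

export const event = {
  provider,
  endpoint,
  method: 'GET',
  statusCode: '200',
  type: 'success',
  time: 0.142,
  ...customEventData, // contributes channelCode: 'APP'
};
```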

The call site change is minimal:

// Before
const { data, status } = await axios.request<IValidateQRCodeResponse>(config);

// After
const data = await makeMonitoredRequest<IValidateQRCodeResponse>(config, {
  provider: PROVIDERS.transbank,
  endpoint: 'qr-codes',
});

The type signature is preserved. The error behavior is identical. The only change visible to the caller is the function name and the event metadata argument.

What a Consistent Schema Unlocks

The core value isn’t the wrapper — it’s the schema consistency it enforces. Every Providers event in New Relic has the same shape: provider, endpoint, method, statusCode, type, time, plus any custom fields. Because the schema is consistent across all providers and all endpoints, you can write generic queries that work everywhere.

Status code distribution for any provider:

SELECT count(*) AS 'Transbank Status Code'
FROM Providers
WHERE provider = 'Transbank' AND endpoint = 'qr-codes' AND method = 'GET'
FACET statusCode
TIMESERIES EXTRAPOLATE SINCE 1 hour ago

Success rate comparison across endpoints:

SELECT percentage(count(*), WHERE type = 'success')
FROM Providers
WHERE provider = 'Transbank'
FACET endpoint TIMESERIES SINCE 6 hours ago

P95 latency over time:

SELECT percentile(time, 95)
FROM Providers
WHERE provider = 'Spreedly'
FACET endpoint TIMESERIES SINCE 1 day ago

These queries work the same day the wrapper is deployed. No dashboard configuration, no data pipeline, no aggregation job. The events land in New Relic and NRQL can query them immediately.

Custom Events for What HTTP Can’t Capture

HTTP status codes and latency tell you how the integration is behaving. They don’t tell you about the business flow. Transbank’s QR payment process includes a webhook step: after a transaction is submitted, Transbank sends a callback with the final payment status. The time between transaction initiation and webhook arrival is a direct measure of user-perceived payment speed.
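That gap can be computed from two stored timestamps; a hypothetical sketch (the timestamp sources are assumed, and the two-decimal rounding mirrors the emitted event):

```typescript
// Hypothetical helper: seconds between transaction initiation and webhook
// arrival, rounded to two decimals before being attached to the event.
export function secondsBetweenStatusChange(initiatedAt: Date, webhookAt: Date): number {
  const seconds = (webhookAt.getTime() - initiatedAt.getTime()) / 1000;
  return parseFloat(seconds.toFixed(2));
}
```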

That’s not an HTTP response. But recordMonitoredEvent — the inner function the wrapper calls — is available independently:

recordMonitoredEvent({
  provider: PROVIDERS.transbank,
  name: 'time_passed_for_tbk_webhook',
  finalStatus: paymentTransaction.status,
  timeBetweenChange: parseFloat(secondsBetweenStatusChange.toFixed(2)),
  transactionId: transaction.transactionId,
});

This emits to the same Providers event table. The same dashboards. The same alerting infrastructure. And it enables queries that are genuinely useful for understanding real-world payment behavior:

SELECT percentage(count(*), WHERE finalStatus = 'AUTHORIZED') as 'Authorized'
FROM Providers
WHERE provider = 'Transbank' AND name = 'time_passed_for_tbk_webhook'
TIMESERIES EXTRAPOLATE SINCE 1 day ago

Authorization rate over time. Not estimated — measured directly from the event stream, per transaction.

Alerting on Top of It

New Relic supports threshold-based alerts configured on any NRQL query. Once the data is flowing, you can alert on:

  • 5xx rate for a specific endpoint exceeding a threshold
  • Average latency crossing a limit
  • Webhook authorization rate dropping below an expected floor
  • Any dimension in the event schema, at any granularity
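As an illustration, the first of those could sit on a query like this one (the endpoint is a placeholder; numeric() casts the stored string statusCode so it can be compared as a number):

```sql
SELECT percentage(count(*), WHERE numeric(statusCode) >= 500)
FROM Providers
WHERE provider = 'Transbank' AND endpoint = 'qr-codes'
```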

The alerts are configured manually per metric in New Relic — there’s no automation on that side. But the cost of setting up a new alert is writing an NRQL query and setting a number. The hard part (getting the data into the system with the right shape) was done once, at the wrapper.

The Broader Point

The value here isn’t New Relic-specific. It’s a pattern: identify the chokepoint in your integration layer, instrument it once, enforce a consistent schema, and every analysis question about provider health becomes a query.

The alternative — adding ad-hoc metrics per endpoint, per provider, case by case — produces inconsistent schemas, missed coverage, and maintenance burden. When you add a new endpoint or a new provider, consistent coverage requires either discipline or a process. The wrapper provides it automatically.

Two functions. Roughly 70 lines of code. Full observability over every external integration.