
MACHBANK · 2023 – 2024

External Provider Monitoring

Senior Software Engineer

Full visibility into Transbank and Spreedly health with zero changes to business logic — from blind spots to per-endpoint dashboards and alerting in a single wrapper function.

TypeScript · Node.js · Axios · New Relic · NRQL

The Problem

MACHBANK’s QR payment platform depends on two critical external providers: Transbank (Chile’s largest payment network) and Spreedly (payment method vault). Every QR transaction flows through one or both of them.

When something went wrong — an elevated error rate, a slow endpoint, a degraded Transbank status code distribution — the team had no direct visibility. Issues surfaced through support tickets or user complaints, not metrics. There was no way to answer “is Transbank returning more 5xx errors than usual right now?” without digging through raw logs.

The gap wasn’t a tooling problem. New Relic was already in the stack. The gap was that no structured data was reaching it.

The Solution: Wrap Once, Observe Everything

The insight was simple: every outbound call to an external provider already went through Axios. If you wrap that call in a single function that records the outcome as a structured event, you get observability across all providers without touching any business logic.

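Two functions: makeMonitoredRequest, which wraps the Axios call, and recordMonitoredEvent, which emits the structured event. That’s the entire implementation.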

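import axios, { AxiosError, AxiosRequestConfig } from 'axios';
// MonitoredRequestContent and recordMonitoredEvent are defined elsewhere in the project (not shown).

export async function makeMonitoredRequest<T>(
  requestConfiguration: AxiosRequestConfig,
  eventData: MonitoredRequestContent,
): Promise<T> {
  const { provider, endpoint, ...customEventData } = eventData;
  // Axios defaults to GET when no method is set; normalize so NRQL filters
  // such as method = 'GET' match however the call site spelled it.
  const method = (requestConfiguration.method ?? 'GET').toUpperCase();
  const initialTime = performance.now();
  try {
    const { data, status } = await axios.request<T>(requestConfiguration);
    recordMonitoredEvent({
      // Custom fields first, so they cannot overwrite the core schema.
      ...customEventData,
      provider, endpoint, method,
      statusCode: status.toString(),
      type: 'success',
      time: (performance.now() - initialTime) / 1000, // seconds
    });
    return data;
  } catch (error) {
    const err = error as AxiosError;
    recordMonitoredEvent({
      ...customEventData,
      provider, endpoint, method,
      type: 'error',
      // No response at all (timeout, DNS or connection failure) is recorded
      // as 500, so it shows up alongside provider 5xx responses.
      statusCode: (err.response?.status ?? 500).toString(),
      time: (performance.now() - initialTime) / 1000, // seconds
    });
    // Re-throw so callers see exactly the same error as before.
    throw error;
  }
}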
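The wrapper above does not show recordMonitoredEvent itself. A minimal sketch of what it might look like, assuming it forwards to the New Relic Node.js agent's `recordCustomEvent(eventType, attributes)` under the event type `Providers`. The factory name `createRecordMonitoredEvent` and the injected `sink` are hypothetical, added so the sketch runs without the agent installed. It keeps only string, number, and boolean values, since custom event attributes are flat primitives.

```typescript
// Hypothetical sketch: the project's real recordMonitoredEvent is not shown.
type AttributeValue = string | number | boolean;

// Same shape as the New Relic agent's recordCustomEvent(eventType, attributes).
// Injected so this sketch runs without the agent.
type CustomEventSink = (
  eventType: string,
  attributes: Record<string, AttributeValue>,
) => void;

const EVENT_TYPE = 'Providers';

function createRecordMonitoredEvent(sink: CustomEventSink) {
  return function recordMonitoredEvent(event: Record<string, unknown>): void {
    const attributes: Record<string, AttributeValue> = {};
    for (const [key, value] of Object.entries(event)) {
      // Keep flat primitives; drop undefined, null, objects and arrays
      // instead of sending values that cannot be stored as attributes.
      if (
        typeof value === 'string' ||
        typeof value === 'number' ||
        typeof value === 'boolean'
      ) {
        attributes[key] = value;
      }
    }
    sink(EVENT_TYPE, attributes);
  };
}
```

With the agent loaded, the sink could simply be `(type, attrs) => newrelic.recordCustomEvent(type, attrs)`; injecting it also makes the recorder trivial to stub in unit tests.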

Every call site changes from:

const { data } = await axios.request(config);

To:

const data = await makeMonitoredRequest(config, {
  provider: PROVIDERS.transbank,
  endpoint: 'qr-codes',
});

The error behavior is unchanged — the function re-throws. The business logic is unchanged. The only addition is the recordMonitoredEvent call on both the success and failure paths, emitting a Providers event to New Relic with a consistent schema.
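The claim that error behavior is unchanged can be checked by restating the same wrap-once pattern without Axios. In this hypothetical `monitored` helper the transport is an injected function and the recorder is a parameter, so a test can confirm that the original error object is re-thrown and that both paths emit an event. `statusOf` is a hypothetical hook for reading a status from a thrown error; by default it finds none, which falls back to 500 as the wrapper does.

```typescript
// Hypothetical, dependency-free restatement of the wrap-once pattern.
type Outcome = 'success' | 'error';

interface RecordedEvent {
  provider: string;
  endpoint: string;
  type: Outcome;
  statusCode: string;
  time: number; // seconds
}

interface TransportResult<T> {
  data: T;
  status: number;
}

async function monitored<T>(
  call: () => Promise<TransportResult<T>>,
  meta: { provider: string; endpoint: string },
  record: (event: RecordedEvent) => void,
  statusOf: (error: unknown) => number | undefined = () => undefined,
): Promise<T> {
  // Date.now() keeps the sketch dependency-free; the wrapper uses performance.now().
  const start = Date.now();
  const elapsed = () => (Date.now() - start) / 1000;
  try {
    const { data, status } = await call();
    record({ ...meta, type: 'success', statusCode: String(status), time: elapsed() });
    return data;
  } catch (error) {
    record({ ...meta, type: 'error', statusCode: String(statusOf(error) ?? 500), time: elapsed() });
    throw error; // the caller receives the very same error object
  }
}
```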

What the Event Schema Enables

Every Providers event carries: provider, endpoint, method, statusCode, type (success/error), and time (seconds). Any additional fields passed as customEventData are included too.
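The schema can be written down as a TypeScript type. The field names below come from the list above; the type names and `buildProviderEvent` are illustrative, not the project's code. One design choice worth making explicit: spreading custom fields before the core ones means a caller-supplied `type` or `statusCode` cannot silently overwrite what the wrapper measured.

```typescript
// Hypothetical typing of the Providers event schema; field names are from the
// text, type and function names are illustrative.
type Primitive = string | number | boolean;

interface ProviderEventCore {
  provider: string;
  endpoint: string;
  method: string;            // e.g. 'GET', 'POST'
  statusCode: string;        // stored as a string, e.g. '200', '503'
  type: 'success' | 'error';
  time: number;              // seconds
}

// Core fields plus any extra primitive attributes from customEventData.
type ProviderEvent = ProviderEventCore & Record<string, Primitive>;

function buildProviderEvent(
  core: ProviderEventCore,
  custom: Record<string, Primitive> = {},
): ProviderEvent {
  // Custom fields first: the core fields measured by the wrapper always win.
  return { ...custom, ...core };
}
```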

That schema is queryable with NRQL — New Relic’s SQL-like query language — immediately after deployment. No additional configuration. The data is just there.

Per-endpoint status code distribution:

SELECT count(*) FROM Providers
WHERE provider = 'Transbank' AND endpoint = 'qr-codes' AND method = 'GET'
FACET statusCode TIMESERIES EXTRAPOLATE SINCE 1 hour ago

Success rate over time:

SELECT percentage(count(*), WHERE type = 'success') FROM Providers
WHERE provider = 'Transbank'
TIMESERIES EXTRAPOLATE SINCE 1 day ago

Average latency per endpoint:

SELECT average(time) FROM Providers
WHERE provider = 'Transbank'
FACET endpoint TIMESERIES SINCE 1 hour ago
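To make the semantics of these three queries concrete, here is a rough in-memory analogue over an array of events. It is illustrative only: it ignores `TIMESERIES` bucketing, the `SINCE` window and `EXTRAPOLATE` (which compensates for agent-side event sampling), and the function names are hypothetical.

```typescript
// Illustrative in-memory analogue of the NRQL queries above.
interface ProviderEvent {
  provider: string;
  endpoint: string;
  method: string;
  statusCode: string;
  type: 'success' | 'error';
  time: number; // seconds
}

// SELECT count(*) ... FACET statusCode
function countByStatusCode(events: ProviderEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) counts.set(e.statusCode, (counts.get(e.statusCode) ?? 0) + 1);
  return counts;
}

// SELECT percentage(count(*), WHERE type = 'success')
function successPercentage(events: ProviderEvent[]): number {
  if (events.length === 0) return NaN; // undefined over an empty window
  const ok = events.filter((e) => e.type === 'success').length;
  return (ok / events.length) * 100;
}

// SELECT average(time) ... FACET endpoint
function averageTimeByEndpoint(events: ProviderEvent[]): Map<string, number> {
  const sums = new Map<string, { total: number; n: number }>();
  for (const e of events) {
    const s = sums.get(e.endpoint) ?? { total: 0, n: 0 };
    s.total += e.time;
    s.n += 1;
    sums.set(e.endpoint, s);
  }
  return new Map(Array.from(sums, ([k, s]): [string, number] => [k, s.total / s.n]));
}
```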

Custom Events for Non-HTTP Observability

Not everything worth measuring is an HTTP call. Transbank’s QR payment flow uses webhooks — the payment result arrives asynchronously, and the time between a transaction being initiated and the webhook arriving is a direct user experience metric.

recordMonitoredEvent handles this case. When a webhook is processed:

recordMonitoredEvent({
  provider: PROVIDERS.transbank,
  name: 'time_passed_for_tbk_webhook',
  finalStatus: paymentTransaction.status,
  timeBetweenChange: parseFloat(secondsBetweenStatusChange.toFixed(2)),
  transactionId: transaction.transactionId,
});
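The source does not show how `secondsBetweenStatusChange` is computed. A minimal sketch, assuming two timestamps are available: a hypothetical `initiatedAt` recorded when the transaction was created and a `receivedAt` taken when the webhook is processed. Rounding with `parseFloat(value.toFixed(2))` matches the snippet above and keeps the value numeric, so NRQL can aggregate it.

```typescript
// Hypothetical: where these two timestamps come from is an assumption.
// initiatedAt: when the transaction was created on our side.
// receivedAt: when the Transbank webhook was processed.
function secondsBetween(initiatedAt: Date, receivedAt: Date): number {
  const ms = receivedAt.getTime() - initiatedAt.getTime();
  // Guard against negative values if the timestamps come from different clocks.
  return Math.max(0, ms) / 1000;
}

// Two-decimal rounding that stays a number (not a string) for NRQL aggregation.
function roundTo2(value: number): number {
  return parseFloat(value.toFixed(2));
}
```

In the handler this would feed the event as `timeBetweenChange: roundTo2(secondsBetween(initiatedAt, new Date()))`.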

This becomes queryable as any other Providers event:

SELECT percentage(count(*), WHERE finalStatus = 'AUTHORIZED') as 'Authorized'
FROM Providers WHERE provider = 'Transbank' AND name = 'time_passed_for_tbk_webhook'
TIMESERIES EXTRAPOLATE SINCE 1 day ago

The same dashboard infrastructure that shows HTTP status codes also shows webhook authorization rates and timing distributions — because it’s all the same event table.

Alerting

Once the data is in New Relic, threshold-based alerts can be configured directly on any NRQL query: alert if 5xx rate exceeds X%, if average latency crosses Y seconds, if webhook authorization rate drops below Z%. The alert configuration lives in New Relic and is set manually per metric, but the underlying data that powers it requires no additional instrumentation — it’s already there from the wrapper.
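The three example thresholds can be read as plain predicates over one evaluation window. This sketch is illustrative only: New Relic evaluates its own NRQL conditions, the function name and the `Thresholds` fields standing in for X, Y and Z are hypothetical, and treating an empty window as "no breach" is a choice made here (New Relic handles missing data through separate condition settings).

```typescript
// Illustrative only: the example alert conditions as predicates over a window.
interface WindowEvent {
  statusCode?: string;  // present on HTTP events
  time?: number;        // seconds, present on HTTP events
  finalStatus?: string; // present on webhook events
}

interface Thresholds {
  max5xxPercent: number;        // X
  maxAvgLatencySeconds: number; // Y
  minAuthorizedPercent: number; // Z
}

function percent(part: number, whole: number): number {
  return whole === 0 ? 0 : (part / whole) * 100;
}

function breachedConditions(events: WindowEvent[], t: Thresholds): string[] {
  const http = events.filter((e) => e.statusCode !== undefined);
  const webhooks = events.filter((e) => e.finalStatus !== undefined);
  const breaches: string[] = [];

  // 5xx rate exceeds X%
  const fiveXx = http.filter((e) => (e.statusCode ?? '').startsWith('5')).length;
  if (percent(fiveXx, http.length) > t.max5xxPercent) breaches.push('5xx-rate');

  // Average latency crosses Y seconds
  const timed = http.filter((e) => e.time !== undefined);
  const avg = timed.length === 0 ? 0 : timed.reduce((s, e) => s + (e.time ?? 0), 0) / timed.length;
  if (avg > t.maxAvgLatencySeconds) breaches.push('latency');

  // Webhook authorization rate drops below Z%
  const authorized = webhooks.filter((e) => e.finalStatus === 'AUTHORIZED').length;
  if (webhooks.length > 0 && percent(authorized, webhooks.length) < t.minAuthorizedPercent) {
    breaches.push('authorization-rate');
  }
  return breaches;
}
```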