MACHBANK’s SDK channel authentication was slow. Not occasionally — structurally, by design. Every API call from every SDK channel partner passed through authentication before reaching any business logic, and every token refresh meant multiple sequential database operations before the request could proceed.
There were two separate performance problems, and they were compounding each other. This is the story of both.
Problem 1: The Lambda Authorizer Was Calling a Microservice
AWS API Gateway sits in front of all MACHBANK SDK endpoints. Before forwarding any request, it invokes a Lambda authorizer (mach-maas-authorizer-lambda) to validate the JWT and decide allow/deny.
The old authorizer didn’t validate the JWT itself. It made an HTTP call to mach-auth-service — a separate Node.js microservice — and waited for the response.
This means every single SDK request followed this path:
Client → API Gateway → Lambda (invoked) → HTTP → mach-auth-service → HTTP response → Lambda → API Gateway → BFF service
The microservice hop added latency on every request. More importantly, it coupled the authorizer’s availability to mach-auth-service’s availability. A slow deploy, a transient error, a cold Lambda warming up — any of it manifested as latency across the entire SDK platform simultaneously.
The fix was to move the validation into the Lambda itself.
JWT validation doesn’t require a network call. All you need is the signing secret. The new authorizer fetches the secret from AWS Secrets Manager and caches it in Lambda memory:
const { SecretsManager } = require('@aws-sdk/client-secrets-manager')
const secretsManager = new SecretsManager()

const secretCache = {}

async function getSecret(secretName) {
  if (secretCache[secretName]) return secretCache[secretName]
  const { SecretString } = await secretsManager.getSecretValue({ SecretId: secretName })
  secretCache[secretName] = SecretString // cache the secret itself, not the response envelope
  return SecretString
}
This pattern exploits Lambda’s execution model. A Lambda function instance is not torn down between invocations — it stays warm. The first invocation of a warm instance pays the Secrets Manager call (~50–100ms). Every subsequent invocation on that same instance retrieves the secret from secretCache in microseconds. No network. No external dependency.
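One caveat: as written, the cache never expires, so a rotated secret is only picked up when instances recycle. A TTL-bounded variant keeps the fast path while bounding staleness. A sketch, where `fetchSecret` stands in for the Secrets Manager call and the 5-minute TTL is an illustrative choice, not the service's actual setting:

```javascript
const TTL_MS = 5 * 60 * 1000 // illustrative refresh window
const ttlCache = new Map() // name -> { value, fetchedAt }

// Pure freshness check, split out so it is easy to reason about.
function isFresh(entry, nowMs, ttlMs) {
  return Boolean(entry) && nowMs - entry.fetchedAt < ttlMs
}

// fetchSecret(name) is whatever performs the real Secrets Manager call.
async function getSecretWithTtl(name, fetchSecret) {
  const entry = ttlCache.get(name)
  if (isFresh(entry, Date.now(), TTL_MS)) return entry.value
  const value = await fetchSecret(name) // network call only on miss or expiry
  ttlCache.set(name, { value, fetchedAt: Date.now() })
  return value
}
```

Within the TTL every invocation stays in memory; after expiry the next invocation pays one fetch, so a rotated secret propagates within minutes instead of requiring a redeploy.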
Then validation is a single in-process call:
const decoded = jwt.verify(token, await getSecret('SDK_SECRET'))
// ~1ms. cryptographic. no network.
The old path: Lambda invocation → network → microservice → validation → network → Lambda.
The new path: Lambda invocation → in-memory cache hit → jwt.verify().
This alone was worth doing. It reduced the authorizer from a network-bound operation to a compute-bound one, and eliminated the dependency between the authorizer and the auth service entirely.
We also upgraded the Lambda runtime to Node.js 22 and extended the context object forwarded downstream to include deviceOs — something downstream BFF services needed but the old authorizer wasn’t passing through.
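Putting the pieces together: the authorizer's output is just an IAM policy plus a context map. A sketch of that shape, assuming API Gateway's REST authorizer response format; `verifyToken` is injected so the sketch stays self-contained (in the real Lambda it would be the `jwt.verify` call above):

```javascript
// Builds the allow/deny document API Gateway expects from a Lambda authorizer.
function buildAuthorizerResponse(principalId, effect, methodArn, context = {}) {
  return {
    principalId,
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{ Action: 'execute-api:Invoke', Effect: effect, Resource: methodArn }],
    },
    context, // flat key/value pairs forwarded downstream, e.g. deviceOs
  }
}

// verifyToken(token) -> claims, throwing on an invalid token.
function authorize(event, verifyToken) {
  try {
    const claims = verifyToken(event.authorizationToken)
    return buildAuthorizerResponse(claims.deviceId, 'Allow', event.methodArn, {
      channelId: claims.channelId,
      deviceOs: claims.deviceOs, // the field the old authorizer dropped
    })
  } catch {
    return buildAuthorizerResponse('anonymous', 'Deny', event.methodArn)
  }
}
```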
Problem 2: The Session Service Had Six Phases and Eight Database Calls
Behind the authorizer, mach-auth-service managed sessions via a full OAuth 2.0 PKCE flow backed by MongoDB.
The flow supported two channel types:
- Ally channels (MACH): full PKCE — the authorization code phase stores a codeChallenge + codeChallengeMethod in MongoDB; the token exchange phase validates the client's codeVerifier against the stored challenge via SHA-256. Also requires resolving machId from documentNumber via the Account Service.
- Partner channels (BCI, SSFFQR): PKCE optional, machId optional. Simpler bootstrap path.
Six phases per session:
- Device registration → basic access token (DB write)
- Authorization code generation → store PKCE challenge (DB write)
- Token exchange → validate PKCE verifier, issue session tokens (DB write × 2)
- Token refresh → new access token, retain refresh token (DB write)
- Acknowledge refresh → invalidate all prior sessions (updateMany)
- Access token verification → scope check + fraud blacklist lookup (DB read × 2)
Every phase touched MongoDB. Tracing the token refresh path — the one that runs on every SDK API call requiring a new access token — revealed this sequence:
1. Feature flag check (Eolian network call) ~50ms
2. DB lookup: find refresh token by jti ~30ms
3. Account Service: resolve machId from documentNumber ~80ms
4. DB writes (new access + refresh tokens) ~60ms
5. DB updateMany: invalidate previous sessions ~50–200ms
The updateMany at step 5 scans and invalidates all prior sessions for the device — under load, with accumulated tokens, that operation scales poorly and could dominate the total time.
Add it up: 270–420ms per token refresh, before any business logic ran. Under load, consistently worse.
The bottleneck wasn’t a single slow query. It was the architecture. Every token refresh required multiple database round-trips because session state lived in MongoDB, not in the token itself.
The Insight: Move State Into the Token
JWT tokens are not just authentication credentials — they’re a signed serialization format. You can encode arbitrary claims into the payload, and any holder of the signing key can verify and read them without consulting any external system.
The old system used JWTs, but only as keys. The token carried a jti (JWT ID), and validating the token meant looking up that jti in MongoDB to check it wasn’t revoked. The token was a pointer; the state was in the database.
The new design moved all session state into the token itself:
// JWT payload — everything downstream needs is here
{
channelId: 'BCI',
deviceId: 'device-uuid-123',
deviceOs: 'ios',
channelCode: 'bci-1653495634000',
documentNumber: '12345678',
machId: 'mach-uuid-456', // optional, ally channels only
sessionId: 'session-uuid-789' // unique per session start
}
Validating this token is one line:
const decoded = jwt.verify(token, secrets.sdkSecret)
// microseconds. no network. no database.
jwt.verify checks the HMAC-SHA256 signature and the expiration claim. If the signature is valid, the token is valid. There is no other source of truth.
Replacing the Six-Phase OAuth Flow
The new service (mach-sdk-partner-auth-service) reduces the six-phase OAuth flow to two token types with a single handoff.
B2B tokens (2-minute TTL) — channel partners each hold a channel-specific secret. When a user needs to authenticate, the partner backend generates a short-lived JWT and passes it to the mobile client. The client exchanges it at /auth/session. The token is never stored anywhere.
// Partner backend generates:
jwt.sign(
{ documentNumber, channelCode },
channelSpecificSecret,
{ issuer: 'machEcoApi', expiresIn: 2 * 60 }
)
Session tokens — issued by this service after B2B validation:
- Access token: 20-minute TTL, signed with SDK_SECRET
- Refresh token: 30-day TTL, signed with SDK_REFRESH_SECRET
The sessionId field is what distinguishes a session token from a B2B bootstrap token. The refresh endpoint rejects any token without sessionId — making the type discrimination explicit and cryptographic:
const decoded = jwt.verify(refreshToken, secrets.sdkRefreshSecret)
if (decoded.scopes) throw new UnauthorizedError('Deprecated auth flow')
if (!decoded.sessionId) throw new UnauthorizedError('Not a session token')
Session creation is the only point in the flow where a network call happens: the Account Service lookup that resolves machId for ally channels. Every subsequent request is validated purely cryptographically.
The Two-Secret Strategy
Using separate secrets for access and refresh tokens isn’t just hygiene — it enables independent rotation.
// Access token: 20 minutes
jwt.sign(payload, secrets.sdkSecret, { expiresIn: 20 * 60 })
// Refresh token: 30 days — different secret
jwt.sign(payload, secrets.sdkRefreshSecret, { expiresIn: 30 * 24 * 60 * 60 })
If you use a single secret for both, rotating it to revoke compromised tokens also forces every user to re-authenticate immediately. With separate secrets, you can rotate SDK_SECRET to invalidate all access tokens (20-minute TTL means the system self-heals within the hour) without touching refresh tokens, or vice versa.
Async Device Registration
The old system registered devices synchronously during session creation — a blocking call to the Device Service on the hot path of every new session. In the new service:
// Non-blocking. Session response is returned immediately.
registerDevice(payload).catch((error) => {
logger.error({ error }, 'Device registration failed')
})
Device registration is useful for analytics and push notifications but doesn’t affect the correctness of the current session. Making it fire-and-forget removes it from the latency budget entirely.
The general principle: if downstream work doesn’t affect the correctness of the current response, don’t wait for it.
Live Secret Rotation Without Downtime
Rotating JWT secrets in a live system with millions of active tokens requires careful handling. The backward-compatibility strategy:
try {
decoded = jwt.verify(refreshToken, secrets.sdkRefreshSecret) // new secret
} catch {
decoded = jwt.verify(refreshToken, secrets.sdkSecret) // old secret (fallback)
}
This lets you deploy the new SDK_REFRESH_SECRET while old tokens (signed with SDK_SECRET) continue to work. Clients gradually cycle out their old tokens as they expire. After the 30-day refresh TTL, no valid old-format tokens exist, and the fallback can be removed.
What the Numbers Looked Like
Lambda authorizer — before:
Invoke Lambda → HTTP → mach-auth-service → validation → HTTP response
Total added latency: ~80–150ms per request (plus coupling risk)
Lambda authorizer — after:
In-memory secret cache hit → jwt.verify()
Total: ~1ms
Channel session service — before:
Feature flag check (Eolian): ~50ms
DB refresh token lookup: ~30ms
Account Service call: ~80ms
DB writes (access + refresh): ~60ms
updateMany (revoke old): ~50–200ms
─────────────────────────────────────────
Total per token refresh: ~270–420ms+
Session service — after:
JWT cryptographic verification: <1ms
Account Service call (session creation only, not per-request): ~80ms (once)
Device registration: 0ms (async)
New access token signing: <1ms
─────────────────────────────────────────────────────────────────
Total per-request (after session creation): <5ms
Session creation (first call only): ~90ms
The 20x improvement came from removing work from the hot path — not from making the remaining work faster.
What We’d Change
Secret rotation tooling should be built alongside the secrets. The backward-compatibility fallback works, but it requires manual deployment coordination: deploy the new secret, wait for the TTL to cycle, deploy again to remove the fallback. An automated rotation workflow with built-in TTL tracking would have been less error-prone.
Revocation is a real limitation of stateless JWT. The new architecture has no per-session revocation. Revoking a specific session means rotating the secret globally. For MACHBANK’s use case this was acceptable — access tokens are 20 minutes, so the exposure window on a compromised token is bounded. But for systems where immediate per-session revocation matters, you’d need a token blacklist, which reintroduces a database lookup. The tradeoff is real and worth acknowledging upfront.