Building and Hardening the API Layer: A BFF Gateway with Dual-Server Failover
My first approach was calling the upstream telematics API directly from the Ionic app. That lasted about two weeks before two separate problems made it impossible to continue.
The first was performance. The fleet API returns 40+ fields per vehicle; the mobile app needs 12. Getting a vehicle's live position meant two sequential HTTP calls - fetch the vehicle list, then fetch GPS positions - followed by a client-side loop through both arrays to stitch them together. I was watching simple screens take 3-4 seconds to load on real devices over 3G connections, and tracing it back to the same pattern every time: too many calls, too much payload, and the merging happening on a phone CPU.
The second problem I almost missed entirely. The upstream API uses HTTP Basic Auth with service credentials. I was about to ship a mobile app binary with those credentials hardcoded in it. Anyone who decompiled the APK would have had full access to the upstream telematics platform. I caught this during a late-night code review and it stopped me cold.
So I built a BFF - a purpose-built Node.js/Express gateway that sits between the mobile app and the upstream services. It aggregates multiple calls into one, normalises inconsistent response shapes, handles all authentication, and returns clean responses shaped for exactly what the app needs.
Then the primary server went down during a maintenance window. Fleet operators managing trucks on the N1 highway called to say they couldn't track their vehicles. I had built a gateway and then made it a single point of failure.
This article covers both halves - how I built the BFF, and how I made it survive outages.
The structure - deliberately thin
The BFF is organised as vertical slices. Each feature gets a router for HTTP handling and a service file for upstream calls. No heavyweight framework, no ORM, no GraphQL. Just Express with Axios.
```
ta-ionic-api/
├── index.js            # Express app, middleware, route mounting
├── helper.js           # Auth builder, time formatting, base URLs
├── error-handler.js    # Global error middleware
├── mailer.js           # Nodemailer for operational alerts
└── routes/
    ├── vehicle/
    │   ├── vehicle.js            # Express router
    │   └── vehicle.service.js    # Axios calls to upstream
    ├── markers/
    ├── snapshots/
    ├── tracking/
    ├── investigation/
    ├── video-playlist/
    ├── panic/
    ├── technicalLogging/
    ├── notifications-detect/
    ├── extra-details/
    └── clientID/
```
Around 2,000 lines total. I looked at NestJS for structure but the overhead wasn't justified for what amounts to a translation layer between two APIs. Express with Axios is readable, debuggable at 11pm, and doesn't require anyone to understand a framework to follow what's happening.
The Express setup is straightforward - routes mounted by feature domain, compression() middleware to reduce payload size on mobile connections, and a global error handler at the bottom that always returns structured JSON rather than stack traces:
```js
const app = express();

app.use(compression());

app.use((req, res, next) => {
  res.header("Access-Control-Allow-Origin", "*");
  res.header("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
  res.header(
    "Access-Control-Allow-Headers",
    "Authorization, Origin, X-Requested-With, Content-Type, Accept"
  );
  next();
});

app.use('/api/video-playlist', videoPlaylistRouter);
app.use('/api/snapshot', snapShotsRouter);
app.use('/api/technicalLogging', technicalLoggingRouter);
app.use('/api/clientID', clientIDRouter);
app.use('/api/vehicle', vehicleRouter);
app.use('/api/investigation', investigationRouter);
app.use('/api/tracking', trackingRouter);
app.use('/api/markers', markersRouter);
app.use('/api/extra-details', extraDetailsRouter);
app.use('/api/notifications', notifRouter);
app.use('/api/panic', panicRouter);

app.use(errorHandler);
```
Every error response from this gateway has a consistent { success: false, message: "..." } shape. That consistency matters on the mobile side - the app can always check response.success and render a meaningful message rather than parsing raw HTTP status codes and hoping the body is parseable.
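The error-handler.js module itself is small. Here's a minimal sketch of what that middleware looks like - the status mapping and log format shown are illustrative, not lifted verbatim:

```js
// error-handler.js - minimal sketch; status mapping and logging
// details here are illustrative
function errorHandler(err, req, res, next) {
  // Use the upstream status if the failed Axios call carried one
  const status = err.response?.status || err.status || 500;

  // Log the real error server-side; never send a stack trace to the app
  console.error(`[${req.method}] ${req.originalUrl} - ${err.message}`);

  res.status(status).json({
    success: false,
    message: err.message || "Something went wrong. Please try again."
  });
}

module.exports = errorHandler;
```

Because it's mounted last, every route can just throw or call next(err) and still produce the same { success: false, message } shape.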
The aggregation pattern - the main reason this exists
The vehicle endpoint is the clearest example of why the BFF earns its place. Before it existed, loading the vehicle list on the mobile app meant two sequential HTTP calls - vehicles first, then GPS positions - then merging them on the client. On a 3G connection on a moving vehicle this was genuinely painful to watch.
The BFF does the merge server-side:
```js
async function getAllVehicleData(params) {
  // Call 1: the vehicle list from the upstream fleet API
  const response = await axios.get(
    helpers.returnUrl() + '/api/vehicle?uid=' + params.uid +
      '&clientId=' + params.clientId,
    { headers: { "Authorization": helpers.returnAuth() } }
  );
  let vehicles = response.data;

  // Call 2: live GPS positions. If this fails, degrade gracefully
  // instead of failing the whole response
  const gpsDataArray = await getGPSData(params.uid, params.clientId);
  if (gpsDataArray === null) {
    return vehicles.map(vehicle => ({
      ...vehicle,
      gps: null,
      gpsError: 'Failed to fetch GPS data'
    }));
  }

  // Index GPS rows by deviceNr for O(1) lookup during the merge
  const gpsMap = {};
  if (Array.isArray(gpsDataArray)) {
    gpsDataArray.forEach(gpsItem => {
      gpsMap[gpsItem.deviceNr] = gpsItem;
    });
  }

  return vehicles.map(vehicle => {
    const gpsData = gpsMap[vehicle.deviceNr];
    if (gpsData) {
      // Normalise firmware-specific key variants
      const gpsTime = gpsData.gpstime || gpsData.gps_time;
      const updatedStatus = helpers.checkVehicleStatusByGpsTime(
        gpsTime,
        gpsData.status
      );
      return {
        ...vehicle,
        gps: {
          latitude: gpsData.latitude,
          longitude: gpsData.longitude,
          lat: gpsData.latitude,
          lng: gpsData.longitude,
          status: updatedStatus,
          online: updatedStatus,
          ignition: gpsData.ignition,
          speed: gpsData.speed,
          gps_time: gpsTime,
          address: gpsData.address,
          course: gpsData.course || gpsData.ang,
        }
      };
    }
    return { ...vehicle, gps: null };
  });
}
```
A few things worth unpacking here.
The gpsMap lookup gives O(1) matching by deviceNr instead of nested find() calls that would be O(n*m). For a fleet of 200+ vehicles that difference is real.
The graceful GPS failure path - returning gps: null with a gpsError field rather than blowing up the whole response - came from an incident I hadn't anticipated. About a month into production the GPS API went down for 20 minutes during a peak morning shift. My old code treated GPS failure as a total failure and showed every operator a blank screen. The fleet manager couldn't see any vehicles, couldn't manage dispatch, and called me while I was in a lecture. After that I made GPS failure graceful. The vehicle list loads. The map shows vehicles without positions. Operations can continue.
The key normalisation - gpsData.gpstime || gpsData.gps_time and gpsData.course || gpsData.ang - exists because different GPS hardware firmware versions return the same fields with different names. I discovered this when a firmware rollout on part of the fleet caused half the vehicles to disappear from the map. The BFF normalises both variants now so the mobile app never has to know this inconsistency exists.
Business logic belongs here, not in the app
The vehicle status rule is the clearest example: if the last GPS reading is within 5 minutes, the vehicle is online; if it's more than 30 minutes stale, it's offline; between 5 and 30 minutes, trust what the hardware reports.
```js
function checkVehicleStatusByGpsTime(gpsTime, currentStatus) {
  if (!gpsTime) return currentStatus || "offline";

  const gpsDate = new Date(gpsTime);
  // new Date() never throws - bad input yields an Invalid Date whose
  // getTime() is NaN, so guard on that rather than a try/catch
  if (isNaN(gpsDate.getTime())) return currentStatus || "offline";

  const now = new Date();
  const diffMinutes = (now - gpsDate) / (1000 * 60);

  if (diffMinutes > 30) return "offline";  // stale - force offline
  if (diffMinutes <= 5) return "online";   // fresh - force online
  return currentStatus || "online";        // in between - trust hardware
}
```
GPS hardware sometimes lies - a tracker can report "online" even after a vehicle has been parked for an hour if the device hasn't rebooted. The 30-minute timeout catches this. More importantly, this rule could change. The fleet manager might decide 15 minutes is the right threshold. If this logic lived in the mobile app, changing it means pushing an update through App Store review and waiting several days for it to propagate. In the BFF it's one edit, one deployment.
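If that day comes, the thresholds can even be lifted into configuration so the change is a restart rather than a redeploy. A sketch - the env var names are illustrative, not existing config:

```js
// Sketch only - STATUS_ONLINE_MINUTES / STATUS_OFFLINE_MINUTES are
// illustrative names. checkVehicleStatusByGpsTime would compare
// against these constants instead of the hard-coded 5 and 30.
const ONLINE_WITHIN_MIN = Number(process.env.STATUS_ONLINE_MINUTES) || 5;
const OFFLINE_AFTER_MIN = Number(process.env.STATUS_OFFLINE_MINUTES) || 30;
```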
The authentication shield
The credentials near-miss I mentioned in the intro got fixed like this:
```js
// helper.js
var auth = 'Basic ' + Buffer.from(
  salt.usery + ':' + salt.password
).toString('base64');

// Every upstream call
const response = await axios.get(url, {
  headers: { "Authorization": helpers.returnAuth() }
});
```
Credentials live in environment variables on the server. The auth header is built once at startup and reused. The mobile app talks to the BFF with no upstream credentials at all - the BFF is the trust boundary. Decompiling the APK reveals nothing useful about the upstream telematics platform. This was one of those decisions that felt obvious once I understood the threat model, and embarrassing that I'd almost shipped the alternative.
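The salt object is just the environment variables plus the accessor - roughly this, with the env var names as my own illustration:

```js
// helper.js - sketch; the env var names are illustrative
const salt = {
  usery: process.env.UPSTREAM_API_USER,
  password: process.env.UPSTREAM_API_PASSWORD
};

const auth = 'Basic ' + Buffer.from(
  salt.usery + ':' + salt.password
).toString('base64');

function returnAuth() {
  return auth; // built once at startup, reused on every upstream call
}

module.exports = { returnAuth /* , ... */ };
```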
The investigation module - the messiest transformation
The investigation screen is where the BFF earns its keep most visibly. The upstream API returns flat arrays - incidents, drivers, registrations, fleet numbers - all separate, all using different keys to refer to the same vehicles. The mobile app needs a pre-structured object it can render directly without any client-side grouping.
I used Lodash to reshape everything server-side:
```js
registrationWithDrivers = _.mapValues(
  _.groupBy(registrationsDrivers, 'registration_nr'),
  dlist => dlist.map(d => _.omit(d, 'registration_nr'))
);

driversArray = _.mapValues(
  _.groupBy(drivers, 'deviceNr'),
  dlist => dlist.map(d => _.omit(d, 'deviceNr'))
);

let registrationsArray = _.mapValues(
  _.groupBy(registrations, 'registration_nr'),
  rlist => rlist.map(r => _.omit(r, 'registration_nr'))
);

let result = {
  ...registrationWithDrivers,
  ...driversArray,
  ...registrationsArray,
  ...fleetsArray,
  ...deviceNrsArray
};
```
Without this the grouping and merging would happen on the phone every time the investigation screen loaded - on a battery-constrained device on a mobile data connection. The BFF does it once, server-side, where CPU is cheap. The mobile app receives one object it can render directly.
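To make the shape concrete, here's an illustrative before and after for the registrations-drivers pass - the values are invented:

```js
// Illustrative only - values are made up to show the shape.
// Upstream: a flat array, one row per driver assignment
const registrationsDrivers = [
  { registration_nr: 'ND 123-456', driver: 'S. Dlamini' },
  { registration_nr: 'ND 123-456', driver: 'T. Naidoo' }
];

// After groupBy + mapValues + omit: keyed by registration,
// ready for the app to render directly
// {
//   'ND 123-456': [{ driver: 'S. Dlamini' }, { driver: 'T. Naidoo' }]
// }
```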
The full surface area
Every row in this table is a merge or transformation the mobile app no longer does:
| BFF Endpoint | Upstream calls merged | What it saved |
|---|---|---|
| /api/vehicle/getAllVehicleData | Vehicle + GPS | 2 round-trips to 1, merge server-side |
| /api/investigation/getInvestigation | Incidents + Drivers + Video | 3 round-trips to 1, grouping server-side |
| /api/video-playlist/getAllFlvVideos | Video metadata | URL extraction, key normalisation |
| /api/snapshot/requestSnapsByParams | Snapshots + Channels | Time-range bucketing server-side |
| /api/tracking/getTracking | GPS history | Polyline pre-processing |
| /api/panic/sendPanic | Panic dispatch | Credential-shielded emergency forwarding |
| /api/technicalLogging/postJobForm | Multipart form | Photos + job card in one request |
| /api/markers/gps-by-client | GPS positions | Pre-filtered map marker payload |
| /api/notifications/getNotiInfo | GPS notification API | Active alerts, normalised |
| /api/clientID/getID | User + Firebase | Client/technician identity |
The aggregate reduction in mobile round-trips is the single biggest performance win in the entire architecture. On 3G in a rural area each saved round-trip is 200-800ms. The operators never mentioned the performance improvement specifically - they just stopped calling about the app being slow. That's the right outcome.
Then the server went down
I deployed the BFF. The app felt significantly faster. I moved on to other features.
Then the primary server went down for a scheduled maintenance window - the outage from the intro. Fleet operators managing trucks on the N1 highway couldn't track their vehicles. I had built something that improved performance and then introduced a single point of failure that took the entire app down with it.
I deployed the same BFF application to a second server with a different hostname. Two identical gateways running independently. The question was how the mobile app should handle the failover when the first one was unreachable.
The failover pattern - and why I didn't use an interceptor
The app is configured with two base URLs in Angular's environment files:
```ts
export const environment = {
  production: false,
  apiUrl: "http://app.1t-assist.com:8080/api/",
  fallBackApiUrl: "http://app.truck-assist.com:8080/api/",
  firebase: { /* ... */ }
};
```
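Each service pulls those into a pair of fields - the names below match what the later snippets reference:

```ts
// In each service - maps the environment config onto the fields
// used by the failover snippets below
private primaryUrl = environment.apiUrl;
private fallbackUrl = environment.fallBackApiUrl;
```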
My first instinct was an HTTP interceptor - one piece of code, retries all failed requests on the fallback, no repetition per service. I built it. Then I started thinking through the edge cases.
The video service needs to return a structured error object with specific status codes so the UI can show "no camera authority" versus "servers unavailable." The auth service needs to redirect to the login screen on certain failures. The panic service needs hard timeouts - I'll get to that. An interceptor retries all failed requests identically, losing the per-service context. I'd end up with a complex interceptor full of endpoint-specific conditionals that was harder to read than the repetition it was replacing.
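For contrast, a sketch of what that interceptor approach looks like - the class name is illustrative, and this is the shape of what I discarded, not shipped code:

```ts
import { Injectable } from '@angular/core';
import {
  HttpEvent, HttpHandler, HttpInterceptor, HttpRequest
} from '@angular/common/http';
import { Observable } from 'rxjs';
import { catchError } from 'rxjs/operators';
import { environment } from '../environments/environment';

// Every failed request is retried against the fallback identically -
// no per-service timeouts, error shapes, or redirects
@Injectable()
export class FallbackInterceptor implements HttpInterceptor {
  intercept(req: HttpRequest<any>, next: HttpHandler): Observable<HttpEvent<any>> {
    return next.handle(req).pipe(
      catchError(() => {
        const retried = req.clone({
          url: req.url.replace(environment.apiUrl, environment.fallBackApiUrl)
        });
        return next.handle(retried);
      })
    );
  }
}
```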
So I went with nested catchError in every service:
```ts
getVehicles() {
  return this.http
    .get(`${this.primaryUrl}vehicle/getall/${this.authService.returnClientID()}`)
    .pipe(
      catchError((error) => {
        return this.http.get(
          `${this.fallbackUrl}vehicle/getall/${this.authService.returnClientID()}`
        );
      })
    );
}
```
Try primary. If it fails, retry against the fallback. Same pattern across every service and every endpoint:
```ts
// auth.service.ts
saveClientID() {
  const uid = this.fetchUID();
  return this.http.get(`${this.primaryUrl}clientID/getID/${uid}`).pipe(
    catchError((error) => {
      return this.http.get(`${this.fallbackUrl}clientID/getID/${uid}`);
    })
  );
}

// technical-logging.service.ts
getAllTechJobs() {
  const uid = this.authService.fetchUID();
  return this.http.get(
    `${this.primaryUrl}technicalLogging/getAllTechJobs/${uid}`
  ).pipe(
    catchError((error) => {
      return this.http.get(
        `${this.fallbackUrl}technicalLogging/getAllTechJobs/${uid}`
      );
    })
  );
}
```
It's repetitive. That's deliberate. Each service independently handles its own failover with no shared state, no coordination. Every developer who opens any service file immediately understands what it does. No mental model of how a central interceptor works required.
The panic service is the exception
There's one place where simple catchError isn't enough.
Emergency dispatch can't wait for the default HTTP timeout - which on a mobile network can be 30 seconds or more - before trying the fallback. If a driver hits the panic button and the primary server is slow to respond, waiting 30 seconds to even attempt the fallback is unacceptable.
The panic service gets explicit timeout() operators:
```ts
sendPanic(body: any) {
  body.uid = this.authService.fetchUID();
  const headers = { "Content-Type": "application/json" };

  return this.http
    .post(this.primaryUrl + "panic/sendPanic", body, { headers })
    .pipe(
      timeout(15000), // don't wait on a slow primary during an emergency
      catchError((error) => {
        return this.http
          .post(this.fallbackUrl + "panic/sendPanic", body, { headers })
          .pipe(
            timeout(15000),
            catchError((fallbackError) => {
              return throwError(
                () => new Error(
                  "Both servers are unavailable. Please try again later."
                )
              );
            })
          );
      })
    );
}
```
15 seconds per server. Worst case the dispatch takes 30 seconds. For an emergency that's the outer bound of acceptable - and it's bounded, which matters. The driver gets a result, success or failure, within 30 seconds. Not an infinite spinner.
The panic service is the only one in the entire app with explicit timeout() operators. For everything else - vehicle lists, map data, video streams - the default HTTP timeout is fine. A vehicle list loading 3 seconds slower is annoying. An emergency alert taking longer is a different category of problem entirely.
When both servers fail - structured error responses
The final catchError determines what the user sees when everything goes wrong. The video service shows the most sophisticated version of this:
```ts
private handleVideoError(error: HttpErrorResponse): Observable<any> {
  // Permission problem, not an outage - tell the operator so
  if (error.status === 403 || (error.error?.includes?.('authority'))) {
    return of({
      success: false,
      error: true,
      message: 'No vehicle or device operating authority',
      data: []
    });
  }
  return of({
    success: false,
    error: true,
    message: error.message || 'Failed to retrieve video stream',
    data: []
  });
}
```
The service returns a structured response using of() rather than throwing. This means the component can always render something - data, an empty state, or an error message. No unhandled promise rejections, no blank screens, no infinite spinners.
The 403 check distinguishes between "both servers are down" and "this user doesn't have permission to view this camera." Without that distinction every video error would look the same to the operator, and they'd have no way to know whether to call IT or to check their account permissions.
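On the component side, consuming that structured response is uniform: check the flags, render accordingly. A sketch - the component fields are illustrative:

```ts
// Sketch - errorMessage and videos are illustrative component fields
this.videoService.getAllFlvVideos(deviceNr).subscribe((res) => {
  if (res.error) {
    this.videos = [];
    this.errorMessage = res.message; // empty state + message, no spinner
    return;
  }
  this.videos = res.data;
});
```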
What happened during the first maintenance window
After deploying the second server and rolling out the failover pattern, the primary went down again for the next scheduled maintenance.
Zero user-facing errors. Fallback requests added about 2-4 seconds to loading times - the first request times out, the fallback succeeds - but every feature kept working. The fleet operators didn't call. Nobody noticed.
That's the goal. The best failover is the one nobody knows happened.
This article is part of a series on building a fleet telematics platform.