Client-side telemetry: Series overview

An overview of why and how you would roll your own client-side telemetry solution with AWS CDK.

Graeme Zinck

Senior software engineer at LVL Wellbeing

This is the 1st article in a 5-part series:

Rolling your own client-side telemetry solution using AWS CDK

A step-by-step walkthrough on deploying a client-side telemetry stack using AWS CDK, Lambda, API Gateway, and CloudWatch.

  1. Client-side telemetry: Series overview
  2. Client-side telemetry: Setting up a new CDK project
  3. Client-side telemetry: Deploying a Typescript Lambda function with CDK
  4. Client-side telemetry: Lambda permissions and APIs in CDK
  5. Client-side telemetry: Alarms

Have you ever launched an app only to find out a month later that it crashes? Everyone's been there. When we're on a tight timeline, we can't cover every edge case—bugs will inevitably make it into production.

When things go wrong, we need all the diagnostics we can get so we can fix them.

Fast.

This is where client-side telemetry comes in. Every time something unexpected happens, we should get an email saying, "Whoops! Something is wrong. Here's all the information so you can find a solution ;-)"

When I worked at Amazon, they had it down to a science. Every engineer went on-call for a week in rotation and spent their workday responding to customer requests, getting paged, fixing problems, and resetting alarms. But what about startups with no existing toolkits and no budget?

That's where we started at LVL Wellbeing. We needed a solution to see what was going on in our React Native app. Furthermore, due to business requirements, we needed to either use an AWS solution or self-host a solution on AWS.

Existing client-side solutions

I started looking around.

  • CloudWatch RUM is Amazon's tool for client-side monitoring. It has good integration with other AWS products, but the SDK doesn't support React Native and it has steep per-request pricing.
  • Google Firebase Analytics is Google's tool for client-side monitoring. It's actually pretty awesome. However, it's outside of AWS with no self-hosting option.
  • Azure Monitor is Microsoft's tool, but it's also not self-hosted.
  • OpenReplay fit most of our needs, including providing a self-hosted option. However, it didn't support Android on React Native as of 2024.
  • Sentry was the top contender. It's elegant. It supports self-hosting. It has features to last a lifetime. And it's also expensive as heck to deploy. We decided it was overkill (and out of budget, in both time and money).

At the end of my search, I decided it would suffice to roll a simple telemetry solution myself.

Problem overview

For the MVP, the needs were simple: create an endpoint that takes an error, triggers an alarm, and throws the error logs in a database of some sort.

Why not just create an endpoint on the existing backend that throws the error in an existing database? Well, that would tightly couple our telemetry with our backend. If the backend went down, frontend alarms would stop working. Also, if the telemetry endpoint got hit with a DDoS attack, our entire backend could go down with it.

A better solution would be:

  1. Isolated without any reliance on the existing infrastructure.
  2. Scalable without worrying about server and database capacity.
  3. Simple so it's not going to take valuable dev time to set up, maintain, or modify.

Desired API

We want something really simple. The app should be able to report an error with a request like this:

curl -X POST 'https://fe-telemetry.{local|staging|production}.{your-domain}.com/error' \
  -H "Content-Type: application/json" \
  -d '{
    "severity": 2,
    "errorCode": "UNCAUGHT_ERROR",
    "device": "iPhone 15",
    "os": "iOS 17.2",
    "appVersion": "1.0.0",
    "error": "Some error message"
  }'

  • The severity parameter is a number between 1 and 5. Lower numbers need to be addressed more urgently and have alarms to reflect that.
  • The errorCode parameter is a string that uniquely identifies the error type. This helps create graphs that show the frequency of different error types.
  • The rest of the body contains extra information to help us debug the error.
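
To make that contract concrete, here's a minimal TypeScript sketch of the payload and a check for it. The field names come from the curl example above; the names TelemetryError, isTelemetryError, and buildErrorRequest are made up for illustration, and treating severity as a whole number is my assumption (the description only says "a number between 1 and 5").

```typescript
// Hypothetical shape of the request body from the curl example.
interface TelemetryError {
  severity: 1 | 2 | 3 | 4 | 5; // lower numbers are more urgent
  errorCode: string; // stable identifier, e.g. "UNCAUGHT_ERROR"
  [extra: string]: unknown; // device, os, appVersion, error, ...
}

// Runtime check that an unknown value has the two required fields.
// Assumption: severity must be an integer in the range 1..5.
function isTelemetryError(value: unknown): value is TelemetryError {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const { severity, errorCode } = value as Record<string, unknown>;
  return (
    typeof severity === "number" &&
    Number.isInteger(severity) &&
    severity >= 1 &&
    severity <= 5 &&
    typeof errorCode === "string" &&
    errorCode.length > 0
  );
}

// Builds the URL and options for the POST request without sending it,
// so the same helper works in an app and in a unit test.
function buildErrorRequest(
  baseUrl: string,
  payload: TelemetryError
): {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
} {
  return {
    url: `${baseUrl}/error`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}
```

In an app, the result could be passed straight to fetch(url, init). Keeping the builder separate from the network call is just one way to make the payload easy to test.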

Proposed solution

Let's get cracking!

  1. First off, we need an access point for our app to send us error events. This is super easy to implement using API Gateway, which connects the outside internet to our telemetry service.
  2. To process requests and dump them in a database, we could use any server. However, in this case, we don't need anything fancy; a stateless Lambda function will do nicely. It's cheap and scalable!
  3. For the database, we could use anything. In this tutorial, we'll use CloudWatch Logs since that colocates all our frontend telemetry with server logs from Elastic Container Service (ECS). It's also relatively affordable and scalable!
  4. To trigger email alerts, we can use CloudWatch Metrics with alarms that notify a Simple Notification Service (SNS) topic. Whenever an error occurs, an alarm will trigger and send a notification via SNS to the dev team.
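
To give a feel for step 2, here's a rough sketch of what the Lambda function could look like. It isn't the code from later in the series: ProxyEvent and ProxyResult are simplified stand-ins I've made up for the API Gateway event and response shapes, handleErrorEvent is a hypothetical name, and the integer check on severity is an assumption. The one thing it leans on is that anything a Lambda function writes with console.log ends up in CloudWatch Logs.

```typescript
// Simplified stand-ins for the API Gateway proxy event and response,
// so the sketch has no dependency on AWS type packages.
interface ProxyEvent {
  body: string | null;
}
interface ProxyResult {
  statusCode: number;
  body: string;
}

// The logger is injectable so the logic can run outside Lambda.
type Logger = (line: string) => void;

function reply(statusCode: number, message: string): ProxyResult {
  return { statusCode, body: JSON.stringify({ message }) };
}

function handleErrorEvent(event: ProxyEvent, log: Logger = console.log): ProxyResult {
  let payload: unknown;
  try {
    payload = JSON.parse(event.body ?? "");
  } catch {
    return reply(400, "Body must be valid JSON");
  }
  if (typeof payload !== "object" || payload === null || Array.isArray(payload)) {
    return reply(400, "Body must be a JSON object");
  }
  const record = payload as Record<string, unknown>;
  const { severity, errorCode } = record;
  // Assumption: severity is a whole number from 1 to 5.
  if (
    typeof severity !== "number" ||
    !Number.isInteger(severity) ||
    severity < 1 ||
    severity > 5
  ) {
    return reply(400, "severity must be an integer from 1 to 5");
  }
  if (typeof errorCode !== "string" || errorCode.length === 0) {
    return reply(400, "errorCode must be a non-empty string");
  }
  // One JSON object per line keeps the log entry easy to search later.
  log(JSON.stringify({ ...record, receivedAt: new Date().toISOString() }));
  return reply(200, "ok");
}

// In a real Lambda, this async wrapper would be the exported entry point.
const handler = async (event: ProxyEvent): Promise<ProxyResult> =>
  handleErrorEvent(event);
```

How the logged entries turn into metrics and alarms (step 4) is deliberately left out of this sketch.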

Using a similar stack to the one we're building, it costs LVL a whopping ~$0.01/month for thousands of requests 💸

We want to be able to set up a separate stack for multiple environments. At LVL, we have three environments:

  1. Local env: where the errors go when an engineer is messing around on their simulator. We don't want alarms to go off in this env.
  2. Staging env: where the errors go when the testing application breaks.
  3. Production env: where the errors go when the live application breaks. We want alarms to tell us immediately when something happens here.
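
One way to picture this is a small per-environment config that the rest of the setup reads from. This is only a sketch: EnvName, EnvConfig, ENVIRONMENTS, and telemetryEndpoint are hypothetical names, the URL pattern is copied from the curl example earlier, and leaving alarms off in staging is my assumption since only local and production are spelled out above.

```typescript
type EnvName = "local" | "staging" | "production";

interface EnvConfig {
  alarmsEnabled: boolean;
}

const ENVIRONMENTS: Record<EnvName, EnvConfig> = {
  local: { alarmsEnabled: false }, // engineers experimenting: no alarms
  staging: { alarmsEnabled: false }, // assumption: not stated above
  production: { alarmsEnabled: true }, // tell us immediately
};

// Follows the pattern from the curl example:
// https://fe-telemetry.{local|staging|production}.{your-domain}.com/error
function telemetryEndpoint(env: EnvName, domain: string): string {
  return `https://fe-telemetry.${env}.${domain}.com/error`;
}
```

For example, telemetryEndpoint("staging", "example") gives https://fe-telemetry.staging.example.com/error.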

Setting up all these resources via the AWS console would be time-consuming since we'd have to repeat all the setup for each env, each time we make a change. That's both error-prone and boring!

In the next article, we'll walk through why we should use AWS CDK to set up our infrastructure. We'll also set up boilerplate code so we can start deploying our resources 🚀