How Distributed Tracing with Zipkin Solves Microservice Debugging Nightmares

Discover how to trace requests across microservices using Zipkin to pinpoint performance issues and debug faster than ever.

I was debugging a production issue last week. A user reported that their order took over 30 seconds to complete. The problem? Our system is built with microservices. The request traveled through an API gateway, a user service, an inventory service, and a payment service. Traditional logs showed each service was “working.” But which one was the slowpoke? I couldn’t tell. That frustrating experience is why I’m writing this. If you’re building interconnected services, you need a way to see the whole story. Let’s build that visibility together.

Think of distributed tracing as a high-tech receipt for a request. A single user action, like placing an order, creates a unique trace ID. This ID gets passed along as the request hops from service to service. Each service stamps the “receipt” with its own entry: what it did, how long it took, and any important details. At the end, you have a complete timeline. You can see exactly where the time was spent. No more guessing.
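Concretely, Zipkin carries that “receipt” between services in the standard B3 HTTP headers. Here’s a quick sketch of what one hop passes along (the hex IDs are invented examples):

```javascript
// A sketch of the B3 propagation headers one hop carries.
const exampleHeaders = {
  'X-B3-TraceId': '80f198ee56343ba864fe8b2a57d3eff7', // same on every hop of the request
  'X-B3-SpanId': 'e457b5a2e4d86bd1',                  // unique to this hop's operation
  'X-B3-ParentSpanId': '05e3ac9a4f6e3b90',            // links this hop to its caller
  'X-B3-Sampled': '1'                                 // should this trace be recorded?
};
```

Every service in the chain reads these headers, does its work, and forwards them to the next service.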

Why is this better than just checking server logs? Logs are isolated. They tell you what happened inside one box. A trace connects the dots across your entire architecture. It shows you the relationships between services. You can see if a delay in the payment service is actually caused by a slow database call in the user service three steps earlier. It turns a pile of separate clues into a single, clear map.

So, how do we start? We need a central place to collect and view these traces. For this, we’ll use Zipkin. It’s a robust, open-source system. You can run it easily with Docker. Here’s a simple setup to get it going alongside a MongoDB instance for our application data.

# docker-compose.yml
version: '3.8'
services:
  zipkin:
    image: openzipkin/zipkin:latest
    ports:
      - "9411:9411"

  mongodb:
    image: mongo:latest
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: secret

Run docker-compose up, and Zipkin will be available at http://localhost:9411. Now, our services need to send data to it. The core unit of work in tracing is called a “span.” A span represents a single operation, like an HTTP request or a database query. A trace is a collection of spans linked together.
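For a concrete picture, here is a sketch of a single span in Zipkin’s v2 JSON format, the shape that gets POSTed to /api/v2/spans. All the IDs and timings here are invented:

```javascript
// One span as Zipkin's v2 API sees it (illustrative values only).
const exampleSpan = {
  traceId: '80f198ee56343ba864fe8b2a57d3eff7', // shared by every span in the trace
  id: 'e457b5a2e4d86bd1',                      // this span's own ID
  parentId: '05e3ac9a4f6e3b90',                // the span that triggered this one
  name: 'get /users/:id',                      // the operation
  timestamp: 1718000000000000,                 // start time, microseconds since epoch
  duration: 42000,                             // 42 ms, also in microseconds
  localEndpoint: { serviceName: 'user-service' },
  tags: { 'http.path': '/users/123' }
};
```

The parentId field is what links spans into a tree: Zipkin stitches every span sharing a traceId into one timeline.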

Let’s build a shared tracing module. This keeps our code clean and consistent across all services. We’ll create a Tracer class that sets up the connection to Zipkin.

// shared/tracer.js
const {
  Tracer,
  BatchRecorder,
  ExplicitContext,
  sampler,
  jsonEncoder: { JSON_V2 }
} = require('zipkin');
const { HttpLogger } = require('zipkin-transport-http');

class AppTracer {
  constructor(serviceName, sampleRate = 1) {
    const recorder = new BatchRecorder({
      logger: new HttpLogger({
        endpoint: 'http://localhost:9411/api/v2/spans',
        jsonEncoder: JSON_V2 // the v2 endpoint expects v2-encoded spans
      })
    });

    this.tracer = new Tracer({
      ctxImpl: new ExplicitContext(), // required; zipkin-context-cls works across async calls
      recorder,
      localServiceName: serviceName, // e.g., 'user-service'
      sampler: new sampler.CountingSampler(sampleRate) // 1 = trace every request
    });
  }

  getTracer() {
    return this.tracer;
  }
}

module.exports = AppTracer;

With the tracer ready, we need to instrument our Express.js services. The goal is to automatically create a span for every incoming HTTP request. We can use middleware for this. The middleware will extract any existing trace ID from the request headers (if this request is part of an ongoing trace) or start a new one.

// shared/middleware.js
const { expressMiddleware } = require('zipkin-instrumentation-express');

function createTracingMiddleware(tracer) {
  // The service name is taken from the tracer's localServiceName
  return expressMiddleware({ tracer });
}

module.exports = createTracingMiddleware;

Now, in your Express service, using it is straightforward.

// user-service/index.js
const express = require('express');
const AppTracer = require('../shared/tracer');
const createTracingMiddleware = require('../shared/middleware');

const app = express();
const tracer = new AppTracer('user-service').getTracer();

// Apply the tracing middleware
app.use(createTracingMiddleware(tracer));

app.get('/users/:id', async (req, res) => {
  // The span for this route is already active!
  res.json({ id: req.params.id, name: 'Jane Doe' });
});

app.listen(3001);

But what about calls between services? The trace ID must travel with the request. When your API gateway calls the user service, it needs to pass the current trace context in the HTTP headers. This is called propagation. We need to instrument our HTTP client, like axios or fetch, to handle this automatically.

// shared/http-client.js
// Note: the package published from the official zipkin-js repo
// is named zipkin-instrumentation-axiosjs
const wrapAxios = require('zipkin-instrumentation-axiosjs');
const axios = require('axios');

function createTracedAxios(tracer, remoteServiceName) {
  return wrapAxios(axios, { tracer, remoteServiceName });
}

module.exports = createTracedAxios;

In your gateway service, you’d use this wrapped client.

// api-gateway/service.js
const createTracedAxios = require('../shared/http-client');

const tracedAxios = createTracedAxios(tracer, 'user-service');

app.get('/order/:id', async (req, res) => {
  // This call will automatically forward the trace headers.
  const userResponse = await tracedAxios.get('http://localhost:3001/users/123');
  // ... more logic
});

Now the trace flows seamlessly. But we’re missing a critical piece: database operations. A slow MongoDB query could be our real bottleneck. How do we see that? We need to wrap our database calls. Let’s create a simple instrumented MongoDB client helper.

// shared/mongo-instrumentation.js
const { Annotation } = require('zipkin'); // note: the export is capitalized

function traceMongoCall(tracer, callName, query, callback) {
  // Start a new child span for the DB operation and return the query's promise
  return tracer.scoped(() => {
    tracer.setId(tracer.createChildId());
    const spanId = tracer.id; // capture so async callbacks can re-enter the span

    tracer.recordServiceName('mongodb');
    tracer.recordRpc(callName);
    tracer.recordBinary('db.query', JSON.stringify(query));
    tracer.recordAnnotation(new Annotation.ClientSend());

    // Execute the query; Zipkin derives the span's duration
    // from the send/receive annotations
    return callback()
      .then(result => {
        tracer.letId(spanId, () =>
          tracer.recordAnnotation(new Annotation.ClientRecv()));
        return result;
      })
      .catch(err => {
        tracer.letId(spanId, () => {
          tracer.recordBinary('error', err.message);
          tracer.recordAnnotation(new Annotation.ClientRecv());
        });
        throw err;
      });
  });
}

module.exports = traceMongoCall;

In your user service, you’d use it like this:

// user-service/db.js (assumes `db` and `tracer` are initialized elsewhere)
const traceMongoCall = require('../shared/mongo-instrumentation');

async function findUser(userId) {
  const query = { _id: userId };
  return traceMongoCall(tracer, 'findOne', query, () => {
    return db.collection('users').findOne(query);
  });
}

Suddenly, that database call appears as its own span in Zipkin, nested within the GET /users/:id span. You can see its exact duration. Is it taking 2000ms? That’s your problem right there.

You might be thinking, “This sounds great, but won’t it slow everything down?” That’s a fair concern. In production, you often don’t need to trace every single request. That’s where sampling comes in. You can configure your tracer to collect only a percentage of traces, say 10% or 1%. This gives you a representative picture without the overhead. You can pass a CountingSampler when constructing the Tracer to control this rate.
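To see what a counting sampler actually does, here is a minimal hand-rolled sketch of the idea. zipkin-js ships the real thing as sampler.CountingSampler(rate); this version exists purely for illustration:

```javascript
// Keep 1 of every N traces: the essence of a counting sampler.
function makeCountingSampler(rate) {
  const everyN = Math.round(1 / rate); // rate 0.1 → record every 10th trace
  let count = 0;
  return () => count++ % everyN === 0; // true → record this trace
}

const shouldSample = makeCountingSampler(0.1);

// Over 100 requests, exactly 10 get recorded:
let sampled = 0;
for (let i = 0; i < 100; i++) {
  if (shouldSample()) sampled++;
}
```

Counting (rather than random) sampling keeps the rate exact over small windows, which is why Zipkin’s default sampler works this way.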

What do you do once you have all this data? You open Zipkin. You search for a slow trace by its ID or by a service name. The UI shows you a waterfall diagram. Each bar is a span. The length of the bar is the duration. You can instantly see which service or database call is the widest, slowest bar. You click on it for details: the exact query, the timestamp, any errors. Debugging goes from a days-long hunt to a minutes-long inspection.
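The UI isn’t the only way in: Zipkin also exposes a query API, so you can script searches for slow traces. Here’s a hypothetical helper (the slowTraceQuery name is mine, not part of any library) that builds such a query against the GET /api/v2/traces endpoint, which accepts serviceName, minDuration (in microseconds), and limit, among other parameters:

```javascript
// Build a URL that asks Zipkin for recent slow traces of one service.
function slowTraceQuery(serviceName, minMillis) {
  const params = new URLSearchParams({
    serviceName,
    minDuration: String(minMillis * 1000), // Zipkin expects microseconds
    limit: '10'
  });
  return `http://localhost:9411/api/v2/traces?${params}`;
}

// e.g. every user-service trace slower than one second:
// slowTraceQuery('user-service', 1000)
```

Fetching that URL returns an array of traces, each an array of spans, which is handy for dashboards or nightly performance reports.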

This approach transforms how you understand your system. It moves you from reactive debugging to proactive observation. You can set up alerts based on span durations. You can identify poorly performing endpoints before users complain. You can understand the real cost of adding a new service dependency.

I encourage you to start simple. Add tracing to just two services that talk to each other. See the trace appear in Zipkin. Then, add database instrumentation. Watch the story get more detailed. The clarity it brings is worth the initial setup. If you found this walkthrough helpful, please share it with a colleague who’s also wrestling with microservice complexity. Have you tried implementing tracing before? What was your biggest hurdle? Let me know in the comments below.

