I was building a reporting feature for a client last month. The requirement was simple: generate a clean, styled PDF from a complex user dashboard. My first instinct was to reach for a traditional PDF library. After a few hours of wrestling with layout quirks and broken CSS, I realized I was solving the wrong problem. What if, instead of building a PDF, I could just take a perfect picture of the webpage itself? That thought led me down a path of integrating Puppeteer, a tool that controls a real browser, with NestJS, my preferred framework for structured backend work. The result was not just a solution, but a new way of thinking about server-side automation. If you’ve ever struggled with dynamic content in PDFs or needed to reliably pull data from the modern web, this combination is worth your attention.
Let’s start with the core idea. Puppeteer is a Node.js library that gives you a programmatic interface to a Chromium browser. You can tell it to go to a webpage, click elements, fill out forms, and, crucially for us, save what it sees as a PDF or an image. NestJS provides the robust, scalable home for this automation on your server. It handles the incoming requests, manages dependencies cleanly, and ensures your browser automation logic is maintainable and testable.
Why is this better than a standard PDF library? Modern web applications are built with frameworks like React, Vue, or Angular. They have complex styles, interactive charts, and custom fonts. A traditional PDF library often struggles to interpret this correctly. With Puppeteer inside NestJS, you render the page exactly as a user would see it in Chrome, and then capture that perfect representation. The fidelity is unmatched.
Setting this up is straightforward. First, you create a dedicated module in your NestJS application to manage the Puppeteer lifecycle. This is a good practice because launching a browser is resource-intensive. You don’t want to launch a new browser for every single request.
Here’s a basic service provider in a file like puppeteer.service.ts:
```typescript
import { Injectable, OnModuleInit, OnModuleDestroy } from '@nestjs/common';
import * as puppeteer from 'puppeteer';

@Injectable()
export class PuppeteerService implements OnModuleInit, OnModuleDestroy {
  private browser: puppeteer.Browser;

  async onModuleInit() {
    this.browser = await puppeteer.launch({
      headless: 'new', // Use the new Headless mode
      args: ['--no-sandbox', '--disable-setuid-sandbox'], // Often needed for Docker
    });
  }

  async generatePDF(htmlContent: string, options?: puppeteer.PDFOptions) {
    const page = await this.browser.newPage();
    await page.setContent(htmlContent);
    const pdfBuffer = await page.pdf({
      format: 'A4',
      printBackground: true, // Crucial for capturing CSS backgrounds
      ...options,
    });
    await page.close();
    return pdfBuffer;
  }

  async onModuleDestroy() {
    await this.browser.close();
  }
}
```
This service launches one browser instance when your module starts. The generatePDF method creates a new page (like a new tab), sets your HTML content, and creates a PDF buffer. Notice printBackground: true. This is the small detail that makes your reports look professional. Have you ever generated a PDF only to find all the colors and backgrounds missing? This setting fixes that.
But what about generating a PDF from a URL, like a user’s dashboard? The process is similar, but you navigate instead of setting content.
```typescript
async generatePDFFromUrl(url: string) {
  const page = await this.browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' }); // Wait for the page to fully load
  const pdfBuffer = await page.pdf({ format: 'A4', printBackground: true });
  await page.close();
  return pdfBuffer;
}
```
The waitUntil: 'networkidle0' option is key. It tells Puppeteer to wait until there are no more network connections for at least 500 ms. This ensures all your fonts, images, and API data have finished loading before the snapshot is taken. Without this, you might capture a half-loaded page.
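Once you have the buffer, the last step is sending it back to the client with the right response headers. The controller wiring depends on your setup, but the header logic itself is framework-agnostic. Here is a sketch of a hypothetical helper (the name `pdfHeaders` and its signature are mine, not part of Puppeteer or NestJS):

```typescript
// Hypothetical helper: builds the response headers for serving a generated
// PDF, either inline (preview in the browser tab) or as a download.
function pdfHeaders(filename: string, download = false): Record<string, string> {
  const disposition = download ? 'attachment' : 'inline';
  return {
    'Content-Type': 'application/pdf',
    'Content-Disposition': `${disposition}; filename="${filename}"`,
  };
}

// In an Express-style controller you might then do something like:
// res.set(pdfHeaders('report.pdf', true)).send(pdfBuffer);
```

Whether you choose inline or attachment is a UX decision: inline lets users preview the report before saving, attachment forces a download.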
Now, let’s talk about the other powerhouse use case: web scraping. The modern web is built with JavaScript. Many sites load their data dynamically after the initial page load. Traditional HTTP request libraries can’t execute this JavaScript, so they see empty pages. Puppeteer, being a real browser, handles this with ease.
Imagine you need to monitor the price of a product on an e-commerce site. Here’s how you might structure a scraping service:
```typescript
async scrapeProductPrice(url: string): Promise<number> {
  const page = await this.browser.newPage();
  await page.goto(url);
  // Wait for the selector that contains the price to appear
  await page.waitForSelector('.product-price');
  // Evaluate JavaScript in the context of the page to extract data;
  // textContent can be null, so fall back to an empty string
  const priceText = await page.$eval('.product-price', (el) => el.textContent ?? '');
  await page.close();
  // Clean up the text and convert to a number
  const price = parseFloat(priceText.replace(/[^0-9.]/g, ''));
  return price;
}
```
The page.$eval method is your gateway to the page’s DOM. You can write any JavaScript here to extract data, click buttons, or fill forms. It’s as if you have a script running in the browser’s developer console, but it’s fully automated from your server.
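The text you get back is raw: currency symbols, thousands separators, stray whitespace. Pulling that cleanup into a small pure function keeps the scraping code focused and gives you somewhere to fail loudly when a page layout changes. `parsePrice` below is a hypothetical utility, and it assumes a dot decimal separator (European-formatted prices like "1.299,99" would need different handling):

```typescript
// Hypothetical helper: normalizes scraped price text like "$1,299.99" or
// "USD 49.00" into a number. Assumes '.' is the decimal separator.
function parsePrice(raw: string): number {
  const cleaned = raw.replace(/[^0-9.]/g, ''); // strip symbols, commas, whitespace
  const value = parseFloat(cleaned);
  if (Number.isNaN(value)) {
    // Failing loudly here catches layout changes on the target site early
    throw new Error(`Could not parse a price from: "${raw}"`);
  }
  return value;
}
```

With this in place, `scrapeProductPrice` can simply return `parsePrice(priceText)`, and a selector that starts matching the wrong element surfaces as an explicit error instead of a silent NaN.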
Of course, with great power comes the need for responsibility. Running a full browser on your server consumes significant memory and CPU. If ten PDF generation requests arrive at once, each opening its own page in your single browser instance, memory usage spikes and requests can start timing out. For a production system, you need a more robust setup.
A common pattern is to implement a connection pool for browser pages. Instead of managing one browser, you create a pool of ready-to-use pages that can handle tasks concurrently. Another critical consideration is error handling and timeouts. Always wrap your Puppeteer operations in try-catch blocks and ensure pages are closed even if an error occurs. A forgotten page is a memory leak.
```typescript
async safeGeneratePDF(html: string) {
  let page: puppeteer.Page | null = null;
  try {
    page = await this.browser.newPage();
    await page.setContent(html);
    return await page.pdf({ format: 'A4' });
  } catch (error) {
    console.error('PDF generation failed:', error);
    throw new Error('Could not generate PDF');
  } finally {
    if (page && !page.isClosed()) {
      await page.close();
    }
  }
}
```
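The pooling idea mentioned above can be sketched without any Puppeteer-specific code: at its core it is a semaphore that caps how many tasks run at once while the rest wait their turn. This is a minimal sketch of that mechanism, not a production pool; it has no timeouts or page health checks, which is where a library like generic-pool earns its keep:

```typescript
// Minimal concurrency limiter: at most `limit` tasks run at the same time;
// the rest wait in a FIFO queue until a slot frees up.
class ConcurrencyLimiter {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    while (this.active >= this.limit) {
      // Park this caller until a running task releases a slot
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake the next waiting caller, if any
    }
  }
}

// Usage sketch: cap concurrent PDF pages at 3
// const limiter = new ConcurrencyLimiter(3);
// const pdf = await limiter.run(() => this.generatePDF(html));
```

Wrapping every `newPage()` call through a limiter like this turns a burst of requests into a bounded, orderly queue instead of an out-of-memory crash.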
When deploying, especially in Docker containers, remember the --no-sandbox argument. The Chrome sandbox often requires special permissions that aren’t available in containerized environments. This is a standard practice for running headless Chrome on servers.
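If the same codebase runs both locally and in containers, it helps to build the launch options in one place instead of scattering flags around. A small sketch of that idea, with the caveat that the `IN_DOCKER` environment variable is a made-up convention for this example, and that flags beyond the two sandbox ones depend on your base image:

```typescript
// Hypothetical helper: builds Puppeteer launch options. The sandbox flags
// are only added when running in a container, signalled here by a made-up
// IN_DOCKER environment variable.
function buildLaunchOptions(inDocker: boolean): { headless: string; args: string[] } {
  const args: string[] = [];
  if (inDocker) {
    args.push('--no-sandbox', '--disable-setuid-sandbox');
  }
  return { headless: 'new', args };
}

// e.g. puppeteer.launch(buildLaunchOptions(process.env.IN_DOCKER === '1'));
```

Keeping the sandbox exception conditional means your local development still benefits from Chrome's sandboxing, and only the containerized deployments opt out.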
The applications are endless. From generating invoices with complex tables and logos, to creating certificates with custom fonts and seals, to capturing weekly analytics reports as shareable snapshots. On the scraping side, it enables data aggregation from multiple sources, automated testing of your own web interfaces, and monitoring of external site changes.
I found that moving from specialized libraries to this browser-based approach felt like upgrading from a typewriter to a word processor. It aligns with how the web actually works. You design something to look good in a browser, and now you can programmatically capture that exact vision as a document or data point.
This integration turns your NestJS application into a powerful automation engine. It bridges the gap between the dynamic, visual web and the structured, data-driven needs of server-side applications. Start with a simple service, handle your errors, manage your resources, and you’ll unlock capabilities that can define entire features for your users.
What problem could you solve if you could perfectly capture any webpage as a document? Or reliably extract data from any site, regardless of its frontend framework? The tools are here and they work remarkably well together. Give it a try in your next NestJS project.
If this approach to handling dynamic content and automation resonates with you, please share this article with a colleague who might be battling similar challenges. Have you implemented something similar? I’d love to hear about your experiences or answer any questions in the comments below. Let’s build more robust and capable applications together.