Extract Text from PDF Files Using pdfjs-dist

by iamvkr on

Introduction

Reading content from PDF files directly in a React app can be a powerful feature, whether you’re building a document reader, search tool, or parser. One of the most reliable tools for working with PDFs in the browser is pdfjs-dist, the prebuilt distribution of PDF.js, Mozilla’s PDF rendering library.

In this post, we’ll walk through how to extract text from a PDF file in React using pdfjs-dist.

🛠️ Setup

First, install pdfjs-dist:

npm install pdfjs-dist

The Core Idea

We’ll use a simple React component that allows users to upload a PDF, then extract and display the text content of all its pages.

🧪 The Code

Here’s the full example:

import React, { useState } from "react";
import * as pdfjsLib from "pdfjs-dist";

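// Resolve the worker through the bundler so the path also works in production builds.
// Assumes a bundler that supports new URL(..., import.meta.url), such as Vite or webpack 5.
pdfjsLib.GlobalWorkerOptions.workerSrc = new URL(
  "pdfjs-dist/build/pdf.worker.mjs",
  import.meta.url
).toString();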

const App = () => {
  const [pdfFile, setPdfFile] = useState(null);
  const [text, setText] = useState("");

  const handleFileChange = (event) => {
    setPdfFile(event.target.files[0]);
  };
  const getText = () => {
    if (!pdfFile) {
      return console.log("empty or invalid pdf");
    }
    const fileReader = new FileReader();
    // Attach the handler before starting the read.
    fileReader.onload = async () => {
      try {
        const typedArray = new Uint8Array(fileReader.result);
        const pdf = await pdfjsLib.getDocument({ data: typedArray }).promise;
        let fullText = "";
        for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
          const page = await pdf.getPage(pageNum);
          const textContent = await page.getTextContent();
          const pageText = textContent.items.map((item) => item.str).join(" ");
          // Separate pages with a newline so words on adjacent pages don't run together.
          fullText += pageText + "\n";
        }
        setText(fullText);
      } catch (error) {
        // Corrupt or password-protected files reject here instead of failing silently.
        console.error("Failed to extract text from PDF:", error);
      }
    };
    fileReader.readAsArrayBuffer(pdfFile);
  };

  return (
    <div>
      <input type="file" accept=".pdf" onChange={handleFileChange} />
      <div>
        <button onClick={getText}>Get Text</button>
      </div>
      {text && (
        <div>
          <p>RESULT</p>
          <p>{text}</p>
        </div>
      )}
    </div>
  );
};

export default App;
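If the FileReader callback feels clunky, the read step can also be written with the promise-based Blob API, which File objects inherit. This is only a small sketch of that variation; the helper name readAsUint8Array is made up for this post and is not part of pdfjs-dist.

```javascript
// Hypothetical helper: reads a File (or any Blob) into a Uint8Array
// using Blob.prototype.arrayBuffer() instead of FileReader.
async function readAsUint8Array(file) {
  const buffer = await file.arrayBuffer();
  return new Uint8Array(buffer);
}

// Possible use inside the component (assumes pdfjsLib is imported as above):
//   const data = await readAsUint8Array(pdfFile);
//   const pdf = await pdfjsLib.getDocument({ data }).promise;
```

Because the helper returns a promise, it fits inside a single async function with one try/catch around the whole flow.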

🔍 What’s Happening Here?

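  1. User Selects a PDF: A file input lets users pick a .pdf file. The file is read in the browser; nothing is uploaded to a server.
  2. Read File: The PDF is read as an ArrayBuffer using FileReader.
  3. Load PDF Document: pdfjsLib.getDocument() returns a loading task whose promise resolves to the PDF document.
  4. Extract Text: Each page is loaded with getPage(), and getTextContent() returns its text items, whose str values are joined into one string.
  5. Display Result: The extracted text is stored in state and displayed in the UI.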
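Steps 3 to 5 can also be pulled out of the component into a plain function, which makes the loop easier to reuse and test. The sketch below assumes it receives an already-loaded document, i.e. the object resolved by pdfjsLib.getDocument(...).promise. The function name extractAllText is my own, not something the library provides.

```javascript
// Hypothetical helper: collects the text of every page of a loaded PDF document.
// `pdf` is assumed to expose `numPages` and `getPage(n)`, as pdf.js documents do.
async function extractAllText(pdf) {
  const pages = [];
  // pdf.js page numbers start at 1, not 0.
  for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
    const page = await pdf.getPage(pageNum);
    const textContent = await page.getTextContent();
    // Each item carries one text run in its `str` property.
    pages.push(textContent.items.map((item) => item.str).join(" "));
  }
  // One line per page keeps page boundaries visible in the result.
  return pages.join("\n");
}
```

Inside the component, the result could be passed straight to setText. Collecting pages in an array first also makes it easy to return them separately if you later need per-page results.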

⚠️ Common Pitfalls

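  • Make sure the worker script path (pdf.worker.mjs) resolves in your build setup. A path into node_modules usually works only on a dev server that serves the project root; production bundles need the worker copied or emitted as an asset, and its version must match the installed pdfjs-dist.
  • This method extracts raw text only, not formatting, images, or styles. Scanned PDFs without a text layer return no text at all.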
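Raw text does not have to mean one long line, though. The text items returned by getTextContent() carry a boolean hasEOL flag marking the end of a line, so line breaks can be kept when joining. Here is a rough sketch of that idea; the function name joinTextItems and the choice to skip whitespace-only items are my own assumptions, and results will vary with how a given PDF was produced.

```javascript
// Hypothetical helper: joins pdf.js text items into a string,
// starting a new line wherever an item is flagged with `hasEOL`.
function joinTextItems(items) {
  const lines = [];
  let current = [];
  for (const item of items) {
    const str = item.str ?? "";
    // Skip empty or whitespace-only runs so joining does not add double spaces.
    if (str.trim() !== "") current.push(str);
    if (item.hasEOL) {
      lines.push(current.join(" "));
      current = [];
    }
  }
  // Flush the last line if the page did not end with an EOL marker.
  if (current.length > 0) lines.push(current.join(" "));
  return lines.join("\n");
}
```

To use it, swap it in for the map/join on textContent.items. Keep in mind that a plain p element collapses newlines, so the output needs a pre element or white-space: pre-wrap to show the line breaks.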

✅ Conclusion

Using pdfjs-dist in React makes PDF text extraction quite straightforward. Whether you need to search documents, analyze content, or just preview text—this method gives you a reliable foundation.

Hope you liked the post! Leave a comment below and share it. Happy coding!