Extract Text from PDF Files Using pdfjs-dist
by iamvkr on
Introduction
Reading content from PDF files directly in a React app can be a powerful feature, whether you’re building a document reader, search tool, or parser. One of the most reliable tools for working with PDFs in the browser is pdfjs-dist, a PDF rendering library from Mozilla.
In this post, we’ll walk through how to extract text from a PDF file in React using pdfjs-dist.
🛠️ Setup
First, install pdfjs-dist:
npm install pdfjs-dist
The Core Idea
We’ll use a simple React component that allows users to upload a PDF, then extract and display the text content of all its pages.
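As a warm-up for the first half of that idea, here is a minimal sketch of turning an uploaded file into bytes that pdf.js can load. It uses Blob.arrayBuffer() instead of FileReader and checks for the %PDF- signature at the very start of the file. The helper names looksLikePdf and readPdfBytes are made up for this post, and the strict check is an assumption: some otherwise valid PDFs carry a few stray bytes before the header and would be rejected here.

```javascript
// Hypothetical helpers for illustration; they are not part of pdfjs-dist.

// Returns true when the bytes begin with the "%PDF-" signature.
function looksLikePdf(bytes) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return (
    bytes.length >= signature.length &&
    signature.every((byte, i) => bytes[i] === byte)
  );
}

// Reads a File or Blob into a Uint8Array and rejects anything that
// does not start with the PDF signature (a strict check, by assumption).
async function readPdfBytes(file) {
  const bytes = new Uint8Array(await file.arrayBuffer());
  if (!looksLikePdf(bytes)) {
    throw new Error("Selected file does not look like a PDF");
  }
  return bytes;
}
```

The returned bytes could then be handed to pdf.js as pdfjsLib.getDocument({ data: bytes }).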
🧪 The Code
Here’s the full example:
import React, { useState } from "react";
import * as pdfjsLib from "pdfjs-dist";

// pdf.js parses documents in a worker script. Adjust this path to your build setup.
pdfjsLib.GlobalWorkerOptions.workerSrc = `./node_modules/pdfjs-dist/build/pdf.worker.mjs`;

const App = () => {
  const [pdfFile, setPdfFile] = useState(null);
  const [text, setText] = useState("");

  // Remember the selected file; extraction starts when the button is clicked.
  const handleFileChange = (event) => {
    setPdfFile(event.target.files[0]);
  };

  const getText = () => {
    if (!pdfFile) {
      return console.log("empty or invalid pdf");
    }
    const fileReader = new FileReader();
    // Attach the handlers before starting the read.
    fileReader.onerror = () => console.error("could not read file", fileReader.error);
    fileReader.onload = async () => {
      try {
        const typedArray = new Uint8Array(fileReader.result);
        const pdf = await pdfjsLib.getDocument({ data: typedArray }).promise;
        let fullText = "";
        // pdf.js page numbers start at 1.
        for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
          const page = await pdf.getPage(pageNum);
          const textContent = await page.getTextContent();
          const pageText = textContent.items.map((item) => item.str).join(" ");
          // Separate pages so the last word of one page does not run into the next.
          fullText += pageText + "\n";
        }
        setText(fullText);
      } catch (error) {
        console.error("could not extract text", error);
      }
    };
    fileReader.readAsArrayBuffer(pdfFile);
  };

  return (
    <div>
      <input type="file" accept=".pdf" onChange={handleFileChange} />
      <div>
        <button onClick={getText}>Get Text</button>
      </div>
      {text && (
        <div>
          <p>RESULT</p>
          <p>{text}</p>
        </div>
      )}
    </div>
  );
};

export default App;
🔍 What’s Happening Here?
- User Uploads PDF: A file input lets users upload a .pdf file.
- Read File: The PDF is read as an ArrayBuffer using FileReader.
- Load PDF Document: pdfjsLib.getDocument() loads the PDF document.
- Extract Text: Each page is fetched with getPage() and parsed with getTextContent().
- Display Result: The extracted text is stored in state and displayed in the UI.
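The load-and-extract steps can also live outside the component, which makes them easier to reuse and to test. Below is a minimal sketch of that idea. The function name extractPageTexts is made up for this post, and it assumes the pdf argument is the object you get from pdfjsLib.getDocument(...).promise. It returns one string per page instead of a single blob of text, and it skips any items that have no str field.

```javascript
// Hypothetical helper for illustration; it is not part of pdfjs-dist.
// `pdf` is assumed to expose `numPages` and `getPage(n)`, like the
// document object resolved from pdfjsLib.getDocument(...).promise.
async function extractPageTexts(pdf) {
  const pages = [];
  // pdf.js page numbers start at 1.
  for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
    const page = await pdf.getPage(pageNum);
    const textContent = await page.getTextContent();
    const pageText = textContent.items
      .filter((item) => typeof item.str === "string") // skip items without text
      .map((item) => item.str)
      .join(" ");
    pages.push(pageText);
  }
  return pages;
}
```

In the component, the result could be shown with something like setText(pages.join("\n\n")), and keeping pages separate also makes it simple to report which page a search hit came from.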
⚠️ Common Pitfalls
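- Make sure the worker script path (pdf.worker.mjs) is correctly set for your app’s build setup. A path under ./node_modules usually resolves only on a dev server that serves that folder; a production build typically needs the worker file copied to a public directory or resolved through the bundler.
- This method extracts raw text only, not formatting, images, or styles. Scanned PDFs that contain only page images have no text layer, so they return no text without OCR.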
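One piece of layout you can often keep is line breaks. Here is a minimal sketch, assuming your pdf.js version sets a boolean hasEOL flag on text items that end a line and emits spaces between words as their own items. The helper name itemsToLines is made up for this post. Instead of joining every item with a space, it concatenates the strings as they come and adds a newline after each item flagged with hasEOL.

```javascript
// Hypothetical helper for illustration; it is not part of pdfjs-dist.
// Assumes each text item has a `str` string and a boolean `hasEOL`
// flag marking the end of a line.
function itemsToLines(items) {
  let text = "";
  for (const item of items) {
    if (typeof item.str !== "string") continue; // skip items without text
    text += item.str;
    if (item.hasEOL) text += "\n";
  }
  return text;
}
```

You would call it with textContent.items for each page. To see the line breaks in the browser, render the result in a pre element or with the CSS rule white-space: pre-wrap, since a plain paragraph collapses newlines.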
✅ Conclusion
Using pdfjs-dist in React makes PDF text extraction quite straightforward. Whether you need to search documents, analyze content, or just preview text, this method gives you a reliable foundation.
Hope you liked the post! Make sure to leave a comment below and share it! Happy coding!!