The short version
Some PDFs look like text but aren’t. You try to select a sentence and nothing highlights, because the whole page is really just a picture: a scan, a phone snapshot, a fax that got saved as PDF. There’s no text layer to grab. Optical character recognition (OCR) reads the pixels, figures out the letters, and hands you back actual text you can copy, search, and edit.
That’s what this does. Upload a scanned PDF, pick its language, and get the words out.
Under the hood
OCR works by looking at the shapes on each page and matching them to characters. We render every page of your PDF and run a recognition engine across the image, then stitch the results together in reading order.
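If it helps to see that flow as code, here is a minimal sketch of the render, recognize, stitch loop. It is an illustration under assumptions, not the tool’s actual implementation: `ocr_document` and the `recognize` callable are hypothetical names, and the "images" in the usage lines are plain labels standing in for rendered pages. Ordering inside a page is left to the recognizer; the sketch only keeps the pages themselves in order.

```python
from typing import Callable, Iterable, List


def ocr_document(
    page_images: Iterable[object],
    recognize: Callable[[object], str],
) -> str:
    """Run `recognize` on each rendered page and join the text in page order."""
    page_texts: List[str] = []
    for image in page_images:
        # One recognition pass per rendered page image.
        page_texts.append(recognize(image).strip())
    # Stitch the pages back together, keeping the original page order.
    return "\n\n".join(page_texts)


# Usage with a stand-in recognizer: each "image" here is just a label.
fake_pages = ["page-1", "page-2"]
text = ocr_document(fake_pages, recognize=lambda img: f"text from {img}")
print(text)
```

Passing the recognizer in as a callable keeps the sketch independent of any particular OCR engine, since this page doesn’t say which one runs behind the scenes.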
Language matters a lot here, more than people expect. The engine loads a model trained on the alphabet and word patterns of whatever language you select, so telling it the right one sharpens accuracy considerably. You’ve got nine to choose from: English, Turkish, Spanish, French, German, Portuguese, Italian, Dutch, and Russian. Pick one at a time, and pick the one actually printed on the page.
Two practical notes. The first run is the slow one, because the language model has to load before any reading happens. After that it moves faster. And there’s a 20-page ceiling per file: OCR is heavy work, so a hard cap keeps things responsive instead of timing out on a 300-page scan.
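Both notes are easy to picture in code. The sketch below is a rough illustration, not the service’s real code: `check_limits` and `load_model` are hypothetical names, the 50 MB figure comes from the FAQ further down (treating it as binary megabytes is an assumption), and caching one model per language is only a guess at why later runs feel quicker.

```python
from functools import lru_cache

MAX_PAGES = 20                # per-file page cap stated on this page
MAX_BYTES = 50 * 1024 * 1024  # 50 MB limit from the FAQ; binary MB is an assumption


def check_limits(page_count: int, size_bytes: int) -> None:
    """Raise ValueError if the file exceeds either limit."""
    if page_count > MAX_PAGES:
        raise ValueError(f"{page_count} pages exceeds the {MAX_PAGES}-page cap")
    if size_bytes > MAX_BYTES:
        raise ValueError("file is larger than 50 MB")


@lru_cache(maxsize=None)
def load_model(language: str) -> dict:
    """Hypothetical stand-in for the slow model load, cached per language."""
    # A real engine would read trained model data here; that is the slow part.
    return {"language": language}


check_limits(page_count=12, size_bytes=3_000_000)  # within both limits, no error
first = load_model("English")   # first call does the (simulated) load
second = load_model("English")  # later calls reuse the cached object
print(first is second)
```

Checking the limits before any rendering starts means an oversized file is rejected up front rather than after minutes of work.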
Every result comes back with a confidence score from 0 to 100. Treat it as a rough quality gauge. Expect high confidence on a crisp 300-dpi scan and lower confidence on a crumpled photo shot at an angle in bad light. Garbled output almost always traces back to a blurry, skewed, or low-resolution source rather than the engine itself.
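One way to act on that score is to sort results into rough buckets. This is a sketch only: `confidence_label` is a hypothetical helper, and the cut-offs of 60 and 85 are assumptions picked for illustration, not thresholds the tool publishes. Tune them against your own documents.

```python
def confidence_label(
    score: float,
    review_below: float = 60.0,  # assumed cut-off, not an official value
    trust_at: float = 85.0,      # assumed cut-off, not an official value
) -> str:
    """Map a 0-100 confidence score to a rough next step."""
    if not 0 <= score <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if score >= trust_at:
        return "likely clean"
    if score >= review_below:
        return "spot-check"
    return "rescan or proofread"


for score in (92, 70, 30):
    print(score, confidence_label(score))
```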
When you’d use this, and when you wouldn’t
Reach for OCR when the PDF is genuinely image-based: scanned contracts, old paper records that got digitized, receipts photographed with a phone, screenshots saved to PDF, a faxed form. Anything where the page is a picture of text rather than text.
Here’s the part worth being straight about. If your PDF already has a real, selectable text layer, this is the wrong tool. PDF to Text reads that layer directly, which is faster, cleaner, and exact, because it copies the stored characters instead of guessing from pixels. Quick test: open the PDF and try to select a word with your cursor. If it highlights, use PDF to Text. If nothing happens, you’ve got an image, and that’s OCR’s job.
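The cursor test has a scripted cousin if you’re sorting a pile of files. The sketch below assumes you already have each page’s extracted text as a string from whatever PDF library you use (for example, pypdf’s `extract_text()`); `has_text_layer`, `pick_tool`, and the 20-character threshold are hypothetical choices for illustration.

```python
from typing import Iterable


def has_text_layer(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Return True if the extracted strings hold enough non-space characters."""
    # Strip all whitespace so blank or near-blank extractions don't count.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def pick_tool(page_texts: Iterable[str]) -> str:
    """Suggest plain extraction when a text layer exists, OCR otherwise."""
    return "PDF to Text" if has_text_layer(page_texts) else "OCR"


print(pick_tool(["This agreement is made between the parties."]))
print(pick_tool(["", "  \n"]))
```

The small character threshold is there because some scanned files carry a stray stamp or page number as real text while the body is still an image.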
Working with a single picture instead of a PDF? Image to Text does the same recognition on JPG, PNG, and other image files.
FAQ
How is this different from regular PDF text extraction?
Plain extraction copies a text layer that’s already inside the file. OCR is for PDFs that have no text layer, just images, so it reads the letters off the pixels instead. Slower and approximate, but it’s the only option when there’s nothing to copy.
Which languages does it support?
Nine: English, Turkish, Spanish, French, German, Portuguese, Italian, Dutch, and Russian. Choose the one printed in the document, one per run, for the best accuracy.
Why is the first run so slow?
The language model loads on that first pass before any reading starts. Once it’s warm, the next runs are quicker.
What does the confidence percentage mean?
It’s the engine’s own estimate of how sure it is about what it read. Sharp, high-resolution scans score high; blurry or tilted photos score lower. Use it as a sanity check on the output.
Is there a page limit?
Yep, 20 pages per file. OCR is processing-heavy, so the cap keeps results coming back in a reasonable time. Files can be up to 50 MB.
What happens to my uploaded PDF?
It’s processed on the server and deleted automatically about an hour later. It’s used only to run the recognition, and nothing is kept after that.