
Search Unsearchable Web-Based PDF Files

Google has announced a significant addition to its search capabilities: a huge collection of online PDF files that were not previously being indexed now are, thanks to optical character recognition (OCR) technology.

Previously, when paper files were scanned to PDF and not OCR’d (see How Text Works in OCR’d and Scanned PDF files), the contents of the files, though they appeared to be text, were treated as images. Google now runs OCR on these images to extract the text, which can then be indexed and included in search results.
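To make the before-and-after concrete, here is a minimal sketch of why a text layer matters to a search index. It is not Google's method: the file names, the sample text, and the `build_index` and `search` helpers are all hypothetical, and the OCR step is simulated by simply supplying the recovered text.

```python
from collections import defaultdict
import re


def build_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = [index.get(w, set()) for w in words]
    return set.intersection(*results)


# Hypothetical corpus: extracted text per PDF. A scanned PDF without OCR
# yields no text, so there is nothing to index.
before_ocr = {"manual_scan.pdf": "", "report.pdf": "quarterly sales report"}

# The same corpus after a simulated OCR pass recovers the scanned text.
after_ocr = dict(before_ocr, **{"manual_scan.pdf": "user manual for model 42 printer"})
```

Searching the first index for "printer manual" finds nothing, because the scanned file contributes no words. Searching the second finds manual_scan.pdf, since the recovered text is indexed like any other document.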

There must be an enormous number of PDF files online, such as old user manuals, specification sheets, and pages from magazines and newspapers, that are now searchable.

Google explained its release this way:

In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document– so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format. This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world’s information accessible and useful.

The technology employed by Google has most likely come out of the open source OCR engine it sponsors (OCRopus) and the company’s book scanning project.

It will be interesting to see where this heads. The most obvious and significant direction would be for Google to start OCRing all text contained in all images on all web pages, a massive processing task but a possible one. Time will tell.

One question comes to mind. Has anyone put files on the Web as scanned PDF files to ensure the content was not indexed and made searchable by search engines? If so, they’ll be scrambling to get them offline.

Update: If you’d like to OCR PDF documents yourself, our new Nitro Pro OCR product will let you do it. Follow the link to take it for a test drive.
http://www.nitropdf.com/professional/ocr.htm
