What Authors Should Know About OCR

By: Ben Denckla | June 27, 2016

Typewritten manuscripts are especially difficult for OCR

If you published a book before 2008, its ebook edition was probably created using optical character recognition (OCR). And if your ebook was created using OCR, it probably has typos in it. That’s the bad news.

The good news: you don’t have to accept this situation.

What’s special about the year 2008? Nothing, really. I just chose 2008 because the first Kindle came out in late 2007. So 2008 is the earliest year I can imagine a significant number of publishers adopting a single-source workflow: a workflow in which the ebook is created from the same files used to create the paper book. For example, nowadays Adobe InDesign can create an ebook and a paper book (well, a PDF) from the same file. A single-source workflow avoids OCR and OCR-caused typos. It doesn’t avoid all problems, but it goes a long way toward making higher-quality ebooks.

Many publishers continued to use OCR for books published after 2008. On the other hand, commendably, some publishers used single-source workflows for books published before 2008. Since the files used to create the paper book may still be available for books published as long ago as the 1970s, single-source workflows are possible (though unlikely) for books published while Jeff Bezos was still a child.

The bottom line for authors is this: regardless of its year of paper publication, ask your publisher whether OCR was used to create the ebook edition of your book.

If OCR was used, your ebook probably has typos in it. It was probably spellchecked, but not carefully. The whole conversion, including spellchecking, was probably outsourced to inexpensive workers who, even if their English skills were good, were probably working under severe time constraints. And even the most careful spellchecking, as you know, is no substitute for good old proofreading. Your ebook was almost certainly not proofread.
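To see why spellchecking alone falls short, here is a minimal sketch in Python. It assumes a tiny made-up word list standing in for a real spellchecker's dictionary, and a made-up sentence in which OCR has misread "rn" as "m" (a well-known OCR confusion). The names DICTIONARY and spellcheck are hypothetical, chosen only for this illustration.

```python
# Hypothetical mini-dictionary standing in for a real spellchecker's word list.
DICTIONARY = {
    "the", "a", "reader", "turned",
    "modern", "modem",   # both are real English words
    "corner", "comer",   # both are real English words
}

def spellcheck(text):
    """Return the words a simple dictionary-based spellchecker would flag."""
    return [w for w in text.lower().split() if w not in DICTIONARY]

# What the paper book says, and what OCR produced (illustrative, not real data).
original = "the modern reader turned a corner"
ocr_output = "the modem reader tumed a comer"

# Words the spellchecker catches.
flagged = spellcheck(ocr_output)

# Words that differ from the original but pass the spellchecker anyway.
missed = [
    ocr_word
    for orig_word, ocr_word in zip(original.split(), ocr_output.split())
    if orig_word != ocr_word and ocr_word not in flagged
]

print("Flagged by spellcheck:", flagged)
print("Missed by spellcheck:", missed)
```

In this toy example the spellchecker catches "tumed" because it is not a word, but "modem" and "comer" slip through because they are words, just not the right ones. Only a human reading for sense would catch them.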

So what can you do?

