PDF, the ubiquitous document format, is great for sharing documents while preserving fonts, images, and general layout across platforms. However, is there an easy way to keep this formatting while copying and pasting text from the document?
Today’s Q&A session is courtesy of SuperUser – a division of Stack Exchange, a community-driven grouping of Q&A websites.
SuperUser reader Colen is looking for a way to extract text from PDFs while preserving the formatting:
When I copy text from a PDF file into a text editor, it gets garbled in a number of ways. Formatting such as bold and italics is lost; soft line breaks within a paragraph of text are converted to hard line breaks; hyphens that break a word across two lines are retained even when they shouldn’t be; and single and double quotes are replaced with ? characters.
Ideally, I want to be able to copy text from a PDF file and have the formatting converted to HTML, “smart quotes” converted to plain " and ' characters, and line breaks handled correctly. Is there any way to do this?
Is there a quick and easy way for Colen (and the rest of us) to capture text without sacrificing formatting?
SuperUser contributor Frabjous offers a solution combined with a large dose of caution:
First, you need to understand what a PDF is. PDFs are designed to mimic a printed page, and they only serve as an output format, not an input format. A PDF is basically a map that contains the exact location of characters (individual letters or punctuation marks, etc.) or images. In most cases, a PDF doesn’t even store information about where one word ends and another begins, let alone things like soft breaks vs. hard breaks for paragraph ends.
(Some newer PDFs do store some of this information, but the technology is new and you would be lucky to find such PDFs. Even if you do, your PDF viewer may not know what to do with it.)
So it’s up to your software to apply some kind of “artificial intelligence” to work out what is a word, what is a paragraph, and so on from the positions of individual characters. Some software will do this better than others, and the results also depend on how the PDF was created. In any case, you should never expect perfect results. A PDF, as output, is not the same as the source document; it is far better to get the source document if you can.
The standard solution to your problem is to use Adobe Acrobat Professional (the expensive product, not the free Reader) to convert the PDF to HTML. Even that doesn’t lead to perfect results.
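To make the idea of inferring words from character positions concrete, here is a minimal sketch in Python. It is not how any particular PDF viewer works; the `Glyph` record, the `glyphs_to_lines` function, and the `gap_ratio` and `line_tol` thresholds are all hypothetical names and values chosen for illustration. The sketch assumes each character arrives with only an x position, a baseline y, and a width, and it guesses a word break wherever the horizontal gap between neighbours is noticeably wide.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character, roughly what a PDF page stores (illustrative)."""
    char: str
    x: float      # left edge
    y: float      # baseline (PDF y coordinates grow upward)
    width: float


def glyphs_to_lines(glyphs, gap_ratio=0.3, line_tol=2.0):
    """Guess lines and word breaks from glyph positions alone.

    gap_ratio and line_tol are assumed thresholds, not values from any standard.
    """
    # Group glyphs into lines: walk from the top of the page downward and
    # start a new line whenever the baseline moves by more than line_tol.
    lines, current = [], []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if current and abs(g.y - current[0].y) > line_tol:
            lines.append(current)
            current = []
        current.append(g)
    if current:
        lines.append(current)

    text_lines = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left-to-right reading order
        out = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A gap that is wide relative to the previous glyph is taken
            # to be a space between words.
            if gap > gap_ratio * prev.width:
                out += " "
            out += cur.char
        text_lines.append(out)
    return text_lines
```

Even this toy version shows why results vary: justified text, unusual fonts, or multi-column layouts would all break the simple thresholds used here.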
There is free software that can extract text from PDFs with some formatting intact, but again, don’t expect perfect results. See, for example, Calibre (which can convert to RTF format), pdftohtml / pdfreflow, or the AbiWord word processor (with all import / export plugins enabled). There is also a PDF import plugin for OpenOffice.
But please don’t expect perfection from any of these. You are swimming against the current here. PDF is just not meant to be an editable input format.
If you can’t decide which tool to start with, Calibre is a true Swiss Army knife. You can also use it to convert PDF files for your e-book reader and to organize your e-book and document library.
Have something to add to the explanation? Sound off in the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.
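Whichever tool does the extraction, the plain text that comes out often still shows the problems described in the question. The following Python sketch is one possible cleanup pass, not part of any of the tools named above; `clean_extracted_text` is a hypothetical helper. It assumes that a blank line marks a real paragraph break, that a single newline is a soft break, and that a hyphen at the end of a line was inserted for layout. That last assumption is wrong for genuinely hyphenated words such as “well-known” split at the hyphen, so the output still needs proofreading.

```python
import re

# Map curly "smart" quotes to plain ASCII quotes, as the question asks.
_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def clean_extracted_text(raw: str) -> str:
    """Heuristic cleanup of text copied out of a PDF (illustrative only)."""
    text = raw.replace("\r\n", "\n").translate(_QUOTES)
    # Rejoin words split across lines: "out-\nput" becomes "output".
    # Assumption: the hyphen was added for layout, which is not always true.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; newlines inside a paragraph are
    # treated as soft breaks and collapsed to single spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

For example, `clean_extracted_text("an out-\nput format")` yields `an output format`, with the line break and layout hyphen removed.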