Dealing With Those Darn PDFs

By Jost Zetzsche

jost@internationalwriters.com

If you look through the archives of discussion lists for translators, two questions come up more often than any others: First, what are the differences among the various computer-assisted translation (CAT) tools? Second, do any of them support PDF files, and if not, what's the best way to translate those files?

I've written a lot about the first question, but I've always shied away from answering the second because there is just no really good answer.

First of all, no CAT tool really supports PDF files. Wordfast (see www.wordfast.net) does list this as one of its supported formats, but it adds a lot of disclaimers in its manual about the effectiveness of its method (through MS Word).

Second, no CAT tool ever WILL support PDF files (I would love to be proven wrong on this one!). One of the major reasons for the existence of PDFs is content protection (yes, PDF stands for "Portable Document Format," but in my opinion it could just as well stand for "Protected Document Format"), and this gives you some idea of why it's so hard to get text out of PDFs. It is possible in newer editions of Adobe Acrobat to save a PDF as an RTF, text, or XML file, but these formats have the same problems that you encounter when simply copying and pasting content out of a PDF file: text that used to be a field (page numbers or cross-references) is now plain text, non-visible fields (such as index entries) are gone, no styles are preserved, the formatting is gone, graphics are ignored, and, worst of all, every line break is replaced with a paragraph mark (making the file essentially unusable for CAT tools, which then treat every line as a separate segment). And if you're really out of luck, all the text will be garbled because your system does not support the fonts that were used in the PDF.

So, not much light on the horizon for PDF translation, except. . .

All PDF files were originally created in a format other than PDF. Many clients send translators the PDF because they're simply too lazy to look for the original files. After all, the translator's headaches are not their headaches -- until you make them their headaches. One easy way of doing this is to add a hefty surcharge for PDF jobs! I have found it quite revealing how many of the "lost" source files are suddenly discovered.

Obviously, there are cases where this does not work. Either the source file truly cannot be found (or accessed), or the source format is one that you could not support anyway (the file may have been created in Quark, InDesign, or one of the other expensive DTP programs that many translators don't have), or there are legal limitations that prevent the client from simply giving you the source files. Whatever the reason, at that point you will have to find another solution.

There are a great many conversion programs on the market that convert PDFs to RTF or HTML files (see for instance http://www.pdfstore.com/category.asp?CtgID=7), and over the years I have worked unhappily with a decent number of them. Most of them do not solve the problem of the paragraph mark at the end of each line. Even if they do, they add another layer of complication to the formatting by placing everything in text boxes. And any text inside graphics stays part of the graphic and cannot be directly translated.
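If you are stuck with such a file and are comfortable with a little scripting, the stray paragraph marks can sometimes be cleaned up before the text goes into a CAT tool. The following Python sketch is only an illustration and not a feature of any of the tools mentioned here. The function name rejoin_lines and the short_ratio threshold are made up for this example, and it assumes plain text in which a real paragraph ends with a line that is both noticeably shorter than the widest line and closed by sentence-final punctuation.

```python
def rejoin_lines(text, short_ratio=0.75):
    """Merge hard-wrapped lines back into paragraphs (heuristic only)."""
    lines = [ln.strip() for ln in text.splitlines()]
    # Use the widest line as a stand-in for the original column width.
    width = max((len(ln) for ln in lines), default=0)

    paragraphs = []
    current = ""
    for line in lines:
        if not line:
            # A blank line is treated as a genuine paragraph boundary.
            if current:
                paragraphs.append(current)
                current = ""
            continue

        if current.endswith("-") and line[0].islower():
            # Probably a word hyphenated across the line break: glue the halves.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line

        # Guess that a short line ending in . ! or ? closes the paragraph.
        if line.endswith((".", "!", "?")) and len(line) < short_ratio * width:
            paragraphs.append(current)
            current = ""

    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        "Most conversion programs place a para-\n"
        "graph mark at the end of every line.\n"
        "Short last line.\n"
        "A second paragraph starts here and\n"
        "runs on to this line."
    )
    print(rejoin_lines(sample))
```

A guess like this will get headings, lists, and genuinely hyphenated compounds wrong (a "computer-" at the end of a line would be fused with the next word), so the result still needs to be checked by eye. It also does nothing about lost fields, styles, or text in graphics.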

The one solution that I like and use in a very productive manner is an optical character recognition (OCR) program of the kind normally used for scanning, such as OmniPage (see www.omnipage.com/omnipage) or ABBYY FineReader (see www.abbyy.com/finereader). Newer versions of these programs can now convert PDF files into Word documents without your actually having to scan them (they "scan" them internally). If the typeface of the originating PDF is clearly legible, the results are great, particularly because even text in graphics is transformed into translatable text! The unnecessary and annoying paragraph marks are eliminated, and the only thing that doesn't work is the re-conversion of former fields into actual fields. This means that there may be some work for you to do once you have your PDF converted, but it's significantly less than with other solutions.

The makers of both OmniPage and FineReader have realized that this has become an increasingly popular feature of their OCR programs (which in themselves are pretty expensive), so they have now created much less expensive stand-alone programs that are specifically geared toward that process: PDF Converter (www.omnipage.com/pdfconverter) and PDF Transformer (www.abbyy.com/pdftransformer), with the former even supporting PDF creation in some versions.

© International Writers' Group. Excerpt from the Tool Kit Newsletter, a biweekly newsletter for people in the translation industry who want to get more out of their computers. For more information see www.internationalwriters.com/toolkit