Dealing With Those Darn PDFs
By
Jost Zetzsche
If you look through the archives of discussion lists
for translators, these are the two questions that
are most often asked: First, what are the differences
between the different computer-assisted translation
tools? Second, do any of them support PDF files, and
if not, what's the best way to translate those files?
I've written a lot about the first
question, but I've always shied away from answering
the second because there is just no really good answer.
First of all, no CAT tool really supports
PDF files. Wordfast (see www.wordfast.net)
does list this as one of its supported formats, but
it adds a lot of disclaimers in its manual about the
effectiveness of its method (through MS Word).
Second, no CAT tool ever WILL support
PDF files (I would love to be proven wrong on this
one!). One of the major reasons for the existence
of PDFs is content protection (yes, PDF stands for
"Portable Document Format," but in my opinion
it could just as well stand for "Protected Document
Format"), and this gives you some idea of why
it's so hard to get text out of PDFs. It is possible
in newer editions of Adobe Acrobat to save a PDF to
an RTF, text, or XML file, but these formats have
the same set of problems that you also encounter when
simply copying and pasting content out of a PDF file:
text that used to be a field (page numbers or cross-references)
is now plain text, non-visible fields (such as the
index) are gone, no styles are preserved, the formatting
is gone, graphics are ignored, and, worst of all,
every line break is replaced with a paragraph mark
(making the text essentially unusable for CAT tools). And
if you're really out of luck, all the text will be
garbled if your system does not support the fonts
that were used in the PDFs.
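To illustrate the paragraph-mark problem, here is a minimal Python sketch of how text copied out of a PDF might be stitched back into paragraphs before it goes into a CAT tool. It is a heuristic built on assumptions of my own, not a feature of any tool mentioned here: the function name rejoin_paragraphs and the width_ratio parameter are hypothetical, and the rule it applies (a line ends a paragraph if it ends in sentence punctuation and is clearly shorter than the longest line) will misfire on headings, lists, and tables.

```python
def rejoin_paragraphs(text, width_ratio=0.75):
    """Merge hard-wrapped lines back into paragraphs (heuristic sketch).

    Assumption: a line closes a paragraph when it ends with sentence
    punctuation AND is shorter than width_ratio times the longest line.
    Blank lines always close a paragraph.
    """
    lines = [ln.strip() for ln in text.splitlines()]
    non_empty = [ln for ln in lines if ln]
    if not non_empty:
        return ""
    max_len = max(len(ln) for ln in non_empty)

    paragraphs = []
    current = []
    for ln in lines:
        if not ln:
            # A blank line is treated as a real paragraph boundary.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(ln)
        # Ignore trailing quotes/brackets when checking for end punctuation.
        ends_sentence = ln.rstrip("\"')]").endswith((".", "!", "?", ":"))
        if ends_sentence and len(ln) < width_ratio * max_len:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    # One output line per reconstructed paragraph.
    return "\n".join(paragraphs)


sample = (
    "So, not much light on the horizon\n"
    "for PDF translation.\n"
    "All PDF files were originally created\n"
    "in a format other than PDF."
)
print(rejoin_paragraphs(sample))
```

The result still needs a manual check, since a short sentence that happens to end a line in mid-paragraph is indistinguishable from a real paragraph end.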
So, not much light on the horizon
for PDF translation, except...
All PDF files were originally created
in a format other than PDF. Many clients send translators
the PDF because they're simply too lazy to look for
the original files. After all, the translator's headaches
are not their headaches -- until you make them their
headaches. One easy way of doing this is to charge
a hefty surcharge! I have found it quite revealing
that suddenly many of the "lost" source
files were discovered.
Obviously, there are cases where this
does not work. Either the source file truly cannot
be found (or accessed), or the source format is one
that you could not work with anyway (the
file may have been created in Quark, InDesign, or
one of the other expensive DTP programs that many
translators don't have), or there are legal limitations
that prevent the client from simply giving you the
source files. Whatever the reason, at that point you
will have to find another solution.
There are a great many conversion
programs on the market that convert PDFs to RTF or
HTML files (see for instance http://www.pdfstore.com/category.asp?CtgID=7),
and over the years I have worked unhappily with a
decent number of them. Most of them do not solve the
problem of the paragraph mark at the end of each line.
Even if they do, they add another layer of complication
to the formatting by placing everything in text boxes.
And any text inside graphics remains part of the image and
cannot be translated directly.
The one solution that I like and use
in a very productive manner is an optical character
recognition (OCR) program for scanning, such as OmniPage
(see www.omnipage.com/omnipage)
or ABBYY FineReader (see www.abbyy.com/finereader).
Newer versions of these programs can now convert PDF
files into Word documents without actually scanning
them (they scan them internally). If the typeface
of the original PDF is clearly legible, the results
are great, particularly because even text in graphics
is transformed into translatable text! The unnecessary
and annoying paragraph markers are eliminated, and
the only thing that doesn't work is the re-conversion
of former fields into actual fields. This means that
there may be some work for you to do once you have
your PDF converted, but it's significantly less than
with other solutions.
The makers of both OmniPage and FineReader have realized
that this has become an increasingly popular feature
of their OCR programs (which are themselves pretty expensive),
so they have now created much less expensive stand-alone
programs that are specifically geared toward that
process: PDF Converter (www.omnipage.com/pdfconverter)
and PDF Transformer (www.abbyy.com/pdftransformer),
with the former even supporting PDF creation in some
versions.
© International Writers'
Group. Excerpt from the Tool Kit Newsletter, a biweekly
newsletter for people in the translation industry
who want to get more out of their computers. For more
information see www.internationalwriters.com/toolkit