Best way to add a bilingual pdf file into a TM
Thread poster: ND1169
ND1169
Local time: 07:39
English to Japanese
Oct 11, 2012

Hi, I have some previously translated files that were outsourced to an unknown company. They are PDFs, and I would like to incorporate them into the new translation memory I have made. However, both the source language (English) and the target language (Japanese) are in the same PDF file. Here is a screen

What is the most efficient way to go about doing this?

Obviously the easiest way would be to get in contact with whoever translated this and get a TMX file, but that's not possible.

Should I just copy and paste like a madman into Excel? That seems very time-consuming but possible.

I've tried to align the files, but the aligner puts both languages in both source and target, and because these are PDFs with lots of tables etc., I get a lot of formatting trouble that doesn't seem worth my time. I've also tried LF Aligner and I get the same formatting errors, but I can at least get the data into an XLS file. Is there an easy way to select all Roman characters or all Japanese characters in an Office application like Excel or Word and delete them? That is, select the target row... then delete all Japanese characters?

Or should I just convert it to .txt and/or .doc first and try it that way? Although the problem of both languages appearing in both source and target won't go away.


Thanks for any and all help!


 
FarkasAndras
Local time: 23:39
English to Hungarian
Separate Oct 11, 2012

One thing you could try to separate the EN and the JP text is to OCR it with ABBYY or some other OCR software and hope that the text colours are recognized consistently. Then you could use "select text with similar formatting" in Word to select all the JP text and cut and paste it into a new doc.
Theoretically, it should also be possible to separate the texts based on their differing character sets using regex, e.g. in Notepad++. If the JP text is always in the second half of each line, the regex would be:
replace
^([^allJPcharacters]*)(.*)$
with
\1\t\2
Here allJPcharacters is a placeholder for the character ranges that cover the Japanese text. If this works, you end up with a tab-separated text file which is ready for converting into a TMX or importing into Studio as a bilingual file.
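
The same split can be tried outside Notepad++. Below is a minimal Python sketch of the idea, not a tested recipe for any particular file. The Unicode ranges chosen for "Japanese" (CJK punctuation, hiragana, katakana, common kanji, full-width forms) and the names split_line and to_tsv are assumptions made for illustration.

```python
import re

# Assumed ranges: CJK punctuation + hiragana + katakana, kanji, full-width forms.
JP = "\u3000-\u30ff\u4e00-\u9fff\uff00-\uffef"

# Group 1: everything before the first Japanese character. Group 2: the rest.
SPLIT = re.compile(f"^([^{JP}]*)(.*)$")


def split_line(line):
    """Return (english, japanese), split at the first Japanese character."""
    match = SPLIT.match(line.rstrip("\n"))
    return match.group(1).strip(), match.group(2).strip()


def to_tsv(lines):
    """Build tab-separated text, skipping lines where either half is empty."""
    rows = (split_line(line) for line in lines)
    return "\n".join(f"{en}\t{ja}" for en, ja in rows if en and ja)
```

For example, split_line("Press the button ボタンを押してください") gives the English and the Japanese part as a pair. A line where the Japanese half contains Latin text before its first Japanese character (a product name, say) will be split in the wrong place, so the output still needs a manual check.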

Both solutions are error prone depending on what your file looks like.
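
For the conversion step, a tab-separated file can be turned into a bare-bones TMX with Python's standard library. This is only a sketch: the function name tsv_to_tmx, the language codes en-US and ja-JP, and the header values are assumptions, and your CAT tool may want different ones.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name out as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tsv_to_tmx(tsv_text, src="en-US", tgt="ja-JP"):
    """Convert 'source<TAB>target' lines into a minimal TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "tsv_to_tmx",      # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "tsv",
        "adminlang": "en-US",
        "srclang": src,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for line in tsv_text.splitlines():
        parts = line.split("\t")
        # Skip anything that is not exactly two non-empty columns.
        if len(parts) != 2 or not all(p.strip() for p in parts):
            continue
        tu = ET.SubElement(body, "tu")
        for lang, text in zip((src, tgt), parts):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text.strip()
    return ET.tostring(tmx, encoding="unicode")
```

The returned string has no XML declaration; when saving it, write the file as UTF-8 so the Japanese segments survive.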

[Edited at 2012-10-11 10:04 GMT]


 
Stanislaw Czech, MCIL CL
United Kingdom
Local time: 22:39
Member (2006)
English to Polish
I would outsource it Oct 11, 2012

I don't know how useful such a TM would be to you. If you expect a lot of matches, then it could make sense to outsource the job to someone who will copy the text into two columns in Excel. Once that is done, you could take over and align the text.

In my experience, there are websites where you can find someone willing to undertake such a simple job at a fraction of a translator's hourly pay.

Alternatively, you could do nothing: copy these files into a single directory and run a search on the contents of the files (for instance in Windows Explorer) when you need to find a particular term.

Good luck
Stanislaw


 
Samuel Murray
Netherlands
Local time: 23:39
Member (2006)
English to Afrikaans
What does your OCR program do? Oct 11, 2012

ND1169 wrote:
However both the source language (English) and the target language (Japanese) are in the same pdf file. Here is a screen.


Some OCR programs can extract text from a PDF perfectly and still attempt to format the text as it appears in the PDF. Does your OCR program successfully format the Japanese text in blue? If so, you can use MS Word's advanced find/replace function on two copies of the file, removing all black text from one and all blue text from the other, and then align the two.
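
If the colours do not survive the OCR step, the same two-copy idea can be approximated on character sets instead of formatting. The Python sketch below is an assumption-laden illustration, not a feature of Word or of any OCR program: the ranges treated as Japanese and the names english_only and japanese_only are made up for the example.

```python
import re

# Assumed ranges: CJK punctuation, hiragana, katakana, kanji, full-width forms.
JP_RUN = re.compile("[\u3000-\u30ff\u4e00-\u9fff\uff00-\uffef]+")
# Plain ASCII letters only; digits and punctuation are left alone.
ROMAN_RUN = re.compile("[A-Za-z]+")


def english_only(text):
    """Drop every run of Japanese characters and tidy the whitespace."""
    return " ".join(JP_RUN.sub(" ", text).split())


def japanese_only(text):
    """Drop every run of ASCII letters and tidy the whitespace."""
    return " ".join(ROMAN_RUN.sub(" ", text).split())
```

Running english_only over one copy of each cell or line and japanese_only over the other gives two single-language versions to align. Digits and ASCII punctuation stay in both versions, and Latin words that legitimately belong in the Japanese text are lost, so expect some cleanup.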


 

