Decoder for HTML Special Characters/Trados font codes in TMX segments (CAT Tools Technical Help)


Thread poster: Michael Beijer

Michael Beijer

United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...

Jun 30, 2011

Does anyone know of a clever way to quickly convert all of the HTML special characters in a TMX to plain text, that is, to decode them?

This website: http://www.web2generators.com/html/entities has a nice encoder/decoder, but I would like to be able to do it automatically on only the content of the TMX files. If I copy and paste the entire content of the TMX (after opening it in a...

but this only removes TMX codes, not the HTML stuff

I am trying to figure out a way of cleaning up Trados exported memories from clients that are so full of HTML special characters that I can't use them in memoQ.

Also, I noticed that some of the "stuff" (which isn't HTML special chars) doesn't get decoded by the above mentioned decoder, such as in this segment:

Any suggestions would be welcome!
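
One way to restrict the decoding to segment text is to let an XML parser handle the first round and then decode whatever entity references are still left inside each `<seg>`. What follows is only a minimal Python sketch under stated assumptions, not a tested tool: `decode_segments` and `sample` are made-up names, the input is assumed to be well-formed XML, and nothing beyond the standard library (`html.unescape`, `xml.etree.ElementTree`) is used.

```python
import html
import xml.etree.ElementTree as ET


def decode_segments(tmx_text: str) -> str:
    """Decode leftover HTML entities in <seg> text only (hypothetical helper)."""
    root = ET.fromstring(tmx_text)  # parsing already resolves one level of entities
    for seg in root.iter("seg"):
        for node in seg.iter():  # the <seg> itself plus any inline elements
            if node.text:
                node.text = html.unescape(node.text)
            if node is not seg and node.tail:
                node.tail = html.unescape(node.tail)
    # Serialising re-escapes &, < and > so the result stays valid XML.
    return ET.tostring(root, encoding="unicode")


# Made-up sample: the segment text was entity-encoded twice.
sample = (
    '<tmx version="1.4"><header/><body><tu><tuv xml:lang="en">'
    "<seg>caf&amp;eacute; &amp;quot;menu&amp;quot;</seg>"
    "</tuv></tu></body></tmx>"
)
print(decode_segments(sample))
```

Header attributes and TU metadata are left alone because only text under `<seg>` is touched. The round trip through `ET.tostring` does not keep the DOCTYPE line or the original layout, so it is best tried on a copy of the file.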

[Edited at 2011-06-30 12:29 GMT]

FarkasAndras

Local time: 21:45
English to Hungarian
+ ...

Character entities

Jun 30, 2011

In principle, this isn't a problem you should need to concern yourself with. Perhaps you should fish around for some setting in memoQ that ignores the inline formatting tags. It should be there somewhere among the TMX import settings.
As to the HTML-style character entities (such as &quot;), these shouldn't pose a problem. Only a couple of these (IIRC, 5) are allowed in TMX, and they should all be recognized by memoQ. I.e. the TMX contains &quot; but this should show up for you in memoQ as " when you do a concordance search or such. If you're seeing &quot; in concordance hits, then the TMX probably contains &amp;quot; due to a messed-up "doubly encoded" source file that somebody was translating.
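
To check whether a file really is doubly encoded, one could parse it and look for entity references that are still visible after the parser has decoded the first level. This is a rough diagnostic sketch in Python; `find_double_encoded`, `ENTITY_RE` and the sample data are hypothetical, and the pattern will also flag text that legitimately discusses entities.

```python
import re
import xml.etree.ElementTree as ET

# Matches named (&quot;), decimal (&#8220;) and hex (&#x201C;) references.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_double_encoded(tmx_text: str) -> list:
    """Return segment texts that still show entity references after parsing."""
    root = ET.fromstring(tmx_text)
    suspects = []
    for seg in root.iter("seg"):
        text = "".join(seg.itertext())  # the parser has already decoded one level
        if ENTITY_RE.search(text):
            suspects.append(text)
    return suspects


# Made-up sample: the first TU is encoded once (fine), the second twice.
sample = (
    '<tmx version="1.4"><header/><body>'
    '<tu><tuv xml:lang="en"><seg>He said &quot;no&quot;</seg></tuv></tu>'
    '<tu><tuv xml:lang="en"><seg>He said &amp;quot;no&amp;quot;</seg></tuv></tu>'
    "</body></tmx>"
)
print(find_double_encoded(sample))
```

An empty list would suggest the entities are ordinary single-level XML escapes that any CAT tool should display correctly.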
If you want to do a complete cleanup, I have a program in my grab bag called "TMX to tabbed" that can handle both inline formatting tags and character entities. It converts TMX files to tab-delimited text, which you can then review and convert back to TMX. This is pretty radical, though... Don't do it unless you must.
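
The general idea of flattening a TMX into tab-delimited text can be sketched in a few lines of Python. This is not the "TMX to tabbed" program itself, whose internals are not shown in this thread; `plain_text`, `tmx_to_tabbed` and `CODE_TAGS` are names invented for the illustration, and the sketch assumes the native codes sit in the standard TMX inline elements.

```python
import xml.etree.ElementTree as ET

CODE_TAGS = {"bpt", "ept", "it", "ph", "ut"}  # TMX elements that hold native codes


def plain_text(elem) -> str:
    """Text of an element with the content of code elements left out."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in CODE_TAGS:
            parts.append(plain_text(child))  # e.g. <hi> wraps translatable text
        parts.append(child.tail or "")
    return "".join(parts)


def tmx_to_tabbed(tmx_text: str) -> str:
    """One line per <tu>, one tab-separated column per <tuv>, in file order."""
    root = ET.fromstring(tmx_text)
    lines = []
    for tu in root.iter("tu"):
        cells = []
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            text = plain_text(seg) if seg is not None else ""
            # Collapse whitespace so embedded tabs or newlines cannot break columns.
            cells.append(" ".join(text.split()))
        lines.append("\t".join(cells))
    return "\n".join(lines)


# Made-up sample TU with one pair of inline codes.
sample = (
    '<tmx version="1.4"><header/><body><tu>'
    '<tuv xml:lang="en"><seg>Press <bpt i="1">&lt;b&gt;</bpt>Start'
    '<ept i="1">&lt;/b&gt;</ept> now</seg></tuv>'
    '<tuv xml:lang="nl"><seg>Druk nu op Start</seg></tuv>'
    "</tu></body></tmx>"
)
print(tmx_to_tabbed(sample))
```

Like any conversion of this kind, the sketch is lossy: dates, creator names and the inline codes themselves do not survive.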

Jaroslaw Michalak

Poland
Local time: 21:45
Member (2004)
English to Polish

SITE LOCALIZER

Maybe Okapi Rainbow?

Jun 30, 2011

Okapi Rainbow allows you to process TMX files - you can search and replace regular expressions, which should be enough for you...

http://www.opentag.com/okapi/wiki/index.php?title=Search_and_Replace_Step

By the way, it also has a function called "Inline Codes Removal", but I am not sure if it strips any kind of code or just TMX specific o...
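
To give an idea of what a regular-expression search and replace for this job might look like, here is a small sketch using Python's `re` module. It is not Rainbow's Search and Replace step or its configuration; `INLINE_CODE_RE` and `strip_inline_codes` are hypothetical names, and the pattern assumes the code elements are not nested inside one another.

```python
import re

# Paired or self-closing TMX code elements, e.g. <bpt i="1">...</bpt> or <ph x="2"/>.
INLINE_CODE_RE = re.compile(
    r"<(bpt|ept|it|ph|ut)\b[^>]*?(?:/>|>.*?</\1>)",
    re.DOTALL,
)


def strip_inline_codes(raw_tmx: str) -> str:
    """Delete code elements and their content from raw TMX text."""
    return INLINE_CODE_RE.sub("", raw_tmx)


# Made-up segment with a paired code and a self-closing placeholder.
raw = (
    '<seg>Press <bpt i="1">&lt;b&gt;</bpt>Start'
    '<ept i="1">&lt;/b&gt;</ept> now<ph x="2"/></seg>'
)
print(strip_inline_codes(raw))
```

Because the pattern works on the raw file text, it leaves entity references in the remaining segment text untouched; decoding those would be a separate step.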

FarkasAndras

Local time: 21:45
English to Hungarian
+ ...

Sample

Jun 30, 2011

Also, I noticed that some of the "stuff" (which isn't HTML special chars) doesn't get decoded by the above mentioned decoder, such as in this segment:

This looks like the file has been the victim of some messed-up character encoding, probably more than one round of it as well. All those &lt;-s and &gt;-s should be tags with literal <...>, not character entities. There is stuff like &lt; and &quot; in there which should probably be literal < and ". Is this excerpt from the original file as you received it or after some conversions? Also, could it be that the tags are actually the translatable text? I.e. somebody was translating a document that discusses and "quotes" this sort of tagged markup?
I can't quite figure it out from a glance at that sample, but that whole mess could be inside inline tags, which you can choose to ignore when you're importing the TMX.

[Edited at 2011-06-30 15:06 GMT]

Michael Beijer

United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

@FarkasAndras

Jun 30, 2011

No, the document doesn't discuss "tags", or "markup", or anything like that, and it is taken straight from a Trados Workbench exported memory, which starts with the huge RTF preamble:

{\fonttbl
{\f1 \fmodern\fprq1 \fcharset0 Courier New;}
{\f2 \fswiss\fprq2 \fcharset0 Arial;}
{\f3 \froman\fprq2 {\*\panose 02020603050405020304}\fcharset0 Times New Roman;}
{\f4 \froman\fprq2 {\*\panose 05050102010706020507}\fcharset2 Symbol;}
{\f5 \fswiss\fprq2 {\*\panose 020b0604020202020204}\fcharset0 Helvetica;}
{\f6 \fnil\fprq2 {\*\panose 05000000000000000000}\fcharset2 Wingdings;}
{\f7 \fswiss\fprq2 {\*\panose 020b0604030504040204}\fcharset0 Tahoma;}
{\f8 \froman\fprq2 {\*\panose 02020500000000000000}{\*\falt Times New Roman}\fcharset0 Melior;}
{\f9 \froman\fprq2 {\*\panose 02020603050405020304}\fcharset0 Times;}
{\f10 \froman\fprq2 {\*\panose 02020404030301010803}\fcharset0 Garamond;}
{\f11 \fswiss\fprq2 {\*\panose 00000000000000000000}\fcharset0 Univers Condensed;}
{\f12 \froman\fprq2 {\*\panose 00000000000000000000}\fcharset0 MS Serif;}
{\f13 \froman\fprq2 \fcharset238 Times New Roman CE;}
{\f14 \froman\fprq2 \fcharset204 Times New Roman Cyr;}
{\f15 \froman\fprq2 \fcharset161 Times New Roman Greek;}
{\f16 \froman\fprq2 \fcharset162 Times New Roman Tur;}
{\f17 \froman\fprq2 \fcharset177 Times New Roman (Hebrew);}
{\f18 \froman\fprq2 \fcharset178 Times New Roman (Arabic);}
{\f19 \froman\fprq2 \fcharset186 Times New Roman Baltic;}
{\f20 \froman\fprq2 \fcharset163 Times New Roman (Vietnamese);}
{\f21 \fswiss\fprq2 \fcharset238 Arial CE;}
{\f22 \fswiss\fprq2 \fcharset204 Arial Cyr;}
{\f23 \fswiss\fprq2 \fcharset161 Arial Greek;}
{\f24 \fswiss\fprq2 \fcharset162 Arial Tur;}
{\f25 \fswiss\fprq2 \fcharset177 Arial (Hebrew);}
{\f26 \fswiss\fprq2 \fcharset178 Arial (Arabic);}
{\f27 \fswiss\fprq2 \fcharset186 Arial Baltic;}
{\f28 \fswiss\fprq2 \fcharset163 Arial (Vietnamese);}
{\f29 \fmodern\fprq1 \fcharset238 Courier New CE;}
{\f30 \fmodern\fprq1 \fcharset204 Courier New Cyr;}
{\f31 \fmodern\fprq1 \fcharset161 Courier New Greek;}
{\f32 \fmodern\fprq1 \fcharset162 Courier New Tur;}
{\f33 \fmodern\fprq1 \fcharset177 Courier New (Hebrew);}
{\f34 \fmodern\fprq1 \fcharset178 Courier New (Arabic);}
{\f35 \fmodern\fprq1 \fcharset186 Courier New Baltic;}
{\f36 \fmodern\fprq1 \fcharset163 Courier New (Vietnamese);}
{\f37 \fswiss\fprq2 \fcharset238 Tahoma CE;}

etc.

And here is one TU:

[Edited at 2011-06-30 18:41 GMT]

Michael Beijer

United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

@Jabberwock

Jun 30, 2011

Yes, I keep meaning to set aside a little time to try and figure out how to use all the new Okapi Framework stuff, like Rainbow... just haven't had time yet. I still use Olifant quite a lot, partly because Kilgray still haven't built us a decent TM editor!

Michael

FarkasAndras

Local time: 21:45
English to Hungarian
+ ...

header

Jun 30, 2011

The long RTF header is perfectly normal, don't worry about that.

The question is whether memoQ can correctly import the TUs, and if not, why not. There is a lot of crap there, but, as I said earlier, you should be able to tell memoQ to just ignore it all if you want.

Michael Beijer

United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

@FarkasAndras

Mar 16, 2012

Hey, thanks for that 'TMX to tabbed' tool.

I recently realised that a few of my rather large TMXs that have been going back and forth between myself and a client are FULL of that garbage. Take a look at this:

I am trying to fix it by using your tool to see if I can figure out what went wrong and where, and then removing it and re-creating a clean TM.

Michael

[Edited at 2012-03-16 00:53 GMT]

[Edited at 2012-03-16 00:59 GMT]

[Edited at 2012-03-16 01:02 GMT]

FarkasAndras

Local time: 21:45
English to Hungarian
+ ...

What did that?

Mar 16, 2012

It'd be nice to know what tool is responsible for that mess; let me know if you find out.
It seems you have a lot of &lt; ... &gt; in there that should be < ... > (some tool mistakenly encoded the actual tags as character references). It could have been Olifant, I don't know.
Here's an updated version of the TMX to tabbed tool that hasn't made it into the SourceForge package yet:
http://dl.dropbox.com/u/16377950/TMX_to_tabbed_1.5.zip
IIRC the only difference is that it now recognizes and strips some formatting tags that the previous versions missed.
It's hard to tell what it may do with your files. You might have to run the tool several times (the first will fix the tags that were converted to character entities, the second will strip the tags).
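
The two-pass idea (first turn the entity-encoded tags back into real tags, then strip them) can be illustrated on a single segment string. This is a hedged Python sketch of the principle, not the tool's actual code; `two_pass_clean`, `TAG_RE` and the `cf` tag in the example are assumptions made for the illustration.

```python
import html
import re

TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")  # anything shaped like <tag ...> or </tag>


def two_pass_clean(segment: str) -> str:
    """Pass 1 turns entity-encoded tags back into tags; pass 2 strips them."""
    restored = html.unescape(segment)    # &lt;cf ...&gt; becomes <cf ...>
    stripped = TAG_RE.sub("", restored)  # remove everything that now looks like a tag
    return " ".join(stripped.split())    # tidy up leftover whitespace


# Made-up segment text as it might look after the TMX has been parsed.
print(two_pass_clean("&lt;cf font=&quot;Arial&quot;&gt;Hello&lt;/cf&gt; world"))
```

The second pass cannot tell a formatting tag from translatable text that merely looks like one, so segments that genuinely contain angle brackets would be damaged.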
If your CAT can correctly import the TMX, then just import it and export to a new TMX. The rubbish should be gone, or if it's still there, it might not cause problems.

Michael Beijer

United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

Thanks!

Mar 16, 2012

I'll try your new version on them and keep you posted.

I really have no idea what program created this mess. That's what you get from unprotected CAT hopping between agencies and translators, I suppose ;)

The TMs actually seem to work fine in memoQ, but the reason I wanted to clean them up is that I have been experiencing slow-downs in memoQ recently, and I was wondering if these TMXs, with sometimes up to 60% extra garbage characters, might have something to do with this (although it's probably more to do with the fact that I have far too many large TMs in my memoQ project ;)
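
A figure like that could be estimated rather than eyeballed. The Python sketch below counts what share of the characters under `<seg>` sits inside code elements; `code_share`, `code_chars` and the sample are hypothetical, only text inside the standard TMX code elements is counted as garbage, and the markup of the tags themselves is ignored.

```python
import xml.etree.ElementTree as ET

CODE_TAGS = {"bpt", "ept", "it", "ph", "ut"}  # TMX elements that hold native codes


def code_chars(elem) -> int:
    """Characters inside code elements, counting nested codes only once."""
    if elem.tag in CODE_TAGS:
        return len("".join(elem.itertext()))
    return sum(code_chars(child) for child in elem)


def code_share(tmx_text: str) -> float:
    """Fraction of <seg> characters that sit inside code elements."""
    root = ET.fromstring(tmx_text)
    total = 0
    code = 0
    for seg in root.iter("seg"):
        total += len("".join(seg.itertext()))
        code += code_chars(seg)
    return code / total if total else 0.0


# Made-up sample: 2 characters of text, 8 characters of code.
sample = (
    '<tmx version="1.4"><header/><body><tu><tuv xml:lang="en">'
    "<seg>Hi<ph>12345678</ph></seg>"
    "</tuv></tu></body></tmx>"
)
print(code_share(sample))
```

Garbage that has been entity-encoded into the segment text itself would not be caught by this count, since it no longer lives inside a code element.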

Also, I recently tried OmegaT, just for fun, and that is why I noticed them. OmegaT shows all of the zillions of little [insert garbage here] etc. in the TM window, whereas memoQ just hides all of it.

FarkasAndras

Local time: 21:45
English to Hungarian
+ ...

Options

Mar 16, 2012

I see. Perhaps our resident MQ experts will tell you if there is an option in MQ to throw out the TMX formatting tags during either import or export. If you can clean the TMX with a CAT, then the metadata (dates, creator names, etc.) will be left in the TMX, and it's probably a bit safer for the actual content as well. (The TMX to tabbed tool hasn't gone through much testing and it's lossy, so it's intended as a last-resort option.)
