Decoder for HTML Special Characters/Trados font codes in TMX segments
Thread poster: Michael Beijer
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...
Jun 30, 2011

Does anyone know of a clever way to quickly convert all of the HTML special characters in a TMX to plain text, that is, to decode them?

This website: http://www.web2generators.com/html/entities has a nice encoder/decoder, but I would like to be able to do it automatically on only the segment content of the TMX files. If I copy and paste the entire content of the TMX (after opening it in a text editor), everything else in the file gets decoded as well, which is of course not that useful.

Olifant has a Remove Codes function, but this only removes TMX codes, not the HTML stuff.

I am trying to figure out a way of cleaning up Trados exported memories from clients that are so full of HTML special characters that I can't use them in memoQ.

Also, I noticed that some of the "stuff" (which isn't HTML special chars) doesn't get decoded by the above mentioned decoder, such as in this segment:



Any suggestions would be welcome!


[Edited at 2011-06-30 12:29 GMT]


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 21:45
English to Hungarian
+ ...
Character entities Jun 30, 2011

In principle, this isn't a problem you should need to concern yourself with. Perhaps you should fish around for some setting in MemoQ that ignores the inline formatting tags. It should be there somewhere among the TMX import settings.
As to the HTML-style character entities (such as &quot;), these shouldn't pose a problem. Only a handful of these (IIRC, 5) are allowed in TMX, and they should all be recognized by memoQ. I.e. the TMX contains &quot; but this should show up for you in memoQ as " when you do a concordance search or such. If you're seeing &quot; in concordance hits, then the TMX probably contains &amp;quot; due to a messed-up "doubly encoded" source file that somebody was translating.
If you want to do a complete cleanup, I have a program in my grab bag called "TMX to tabbed" that can handle both inline formatting tags and character entities. It converts TMX files to tab delimited txt, which you can then review and convert back to TMX. This is pretty radical, though... Don't do it unless you must.
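To make the idea of such a conversion concrete, here is a minimal Python sketch. It is not the actual "TMX to tabbed" program, just an illustration under a few assumptions: the file is well-formed XML, the inline formatting lives in the standard TMX code elements (bpt, ept, ph, it, ut), and at most one extra layer of entity encoding needs removing. The function names and the sample TU are made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# Inline TMX elements whose content is formatting code, not translatable text.
CODE_ELEMENTS = {"bpt", "ept", "ph", "it", "ut"}

def plain_text(elem):
    """Return the text of elem, skipping the content of inline code elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in CODE_ELEMENTS:
            # e.g. <hi>: keep the text it wraps
            parts.append(plain_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

def tmx_to_rows(tmx_string):
    """Yield one list of cleaned segment strings per <tu>."""
    root = ET.fromstring(tmx_string)
    for tu in root.iter("tu"):
        row = []
        for seg in tu.iter("seg"):
            # The XML parser has already decoded one entity layer;
            # html.unescape removes a second (doubly encoded) layer.
            text = html.unescape(plain_text(seg))
            row.append(" ".join(text.split()))  # collapse tabs and newlines
        yield row

# Hypothetical sample TU with an RTF-style font code and doubly encoded quotes.
sample = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="nl"><seg><bpt i="1">{\\b </bpt>Hallo<ept i="1">}</ept> &amp;quot;wereld&amp;quot;</seg></tuv>
  <tuv xml:lang="en"><seg>Hello &amp;quot;world&amp;quot;</seg></tuv>
</tu>
</body></tmx>"""

for row in tmx_to_rows(sample):
    print("\t".join(row))
```

Like any approach of this kind, the sketch throws away the formatting codes and the TU metadata, so it is only worth considering when a normal import cannot cope with the file.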


 
Jaroslaw Michalak
Jaroslaw Michalak  Identity Verified
Poland
Local time: 21:45
Member (2004)
English to Polish
SITE LOCALIZER
Maybe Okapi Rainbow? Jun 30, 2011

Okapi Rainbow allows you to process TMX files - you can search and replace using regular expressions, which should be enough for you...

http://www.opentag.com/okapi/wiki/index.php?title=Search_and_Replace_Step

By the way, it also has a function called "Inline Codes Removal", but I am not sure if it strips any kind of code or just TMX specific ones.

http://www.opentag.com/okapi/wiki/index.php?title=Inline_Codes_Removal_Step
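As a rough idea of what a regular-expression search and replace could look like, here is a small sketch using Python's re module rather than Rainbow's own step configuration. It assumes that every doubly encoded reference inside a segment (for example &amp;quot;) is an error, and it only removes one layer of encoding, so the file stays well-formed XML. The names SEG_RE, DOUBLE_RE and collapse_double_encoding are invented for the example.

```python
import re

# Matches one <seg>...</seg> element, including any attributes on <seg>.
SEG_RE = re.compile(r"(<seg\b[^>]*>)(.*?)(</seg>)", re.DOTALL)

# Matches an entity reference whose ampersand was itself escaped,
# e.g. &amp;quot; or &amp;#233; or &amp;lt;
DOUBLE_RE = re.compile(r"&amp;((?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")

def collapse_double_encoding(tmx_text):
    """Remove one layer of entity encoding, but only inside <seg> elements."""
    def fix_seg(match):
        open_tag, content, close_tag = match.groups()
        return open_tag + DOUBLE_RE.sub(r"&\1", content) + close_tag
    return SEG_RE.sub(fix_seg, tmx_text)

# Hypothetical input: the <prop> element must stay untouched.
before = '<prop type="x">A &amp;amp; B</prop><seg>Say &amp;quot;hi&amp;quot; &amp;lt;b&amp;gt;</seg>'
after = collapse_double_encoding(before)
print(after)
```

Because the replacement leaves ordinary references such as &quot; and &lt; in place, any TMX-aware tool should then decode them normally on import.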


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 21:45
English to Hungarian
+ ...
Sample Jun 30, 2011



Also, I noticed that some of the "stuff" (which isn't HTML special chars) doesn't get decoded by the above mentioned decoder, such as in this segment:




This looks like the file has been the victim of some messed-up character encoding, probably more than one round of it as well. All those &lt;-s and &gt;-s should be tags with literal <...>, not character entities. There is stuff like &amp;lt; and &amp;quot; in there which should probably be literal < and ". Is this excerpt from the original file as you received it or after some conversions? Also, could it be that the tags are actually the translatable text? I.e. somebody was translating a document that discusses and "quotes" this sort of tagged markup?
I can't quite figure it out from a glance at that sample, but that whole mess could be inside inline tags, which you can choose to ignore when you're importing the TMX.
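One way to check how many rounds of encoding a string has been through is to decode it repeatedly until it stops changing. The Python sketch below does that with html.unescape from the standard library. The sample string is hypothetical (it is not taken from the TMX discussed here), and decode_fully is just an illustrative name.

```python
import html

def decode_fully(text, max_rounds=5):
    """Apply html.unescape until the text stops changing.

    Returns the decoded text and the number of rounds that changed it.
    """
    rounds = 0
    while rounds < max_rounds:
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
        rounds += 1
    return text, rounds

# Hypothetical segment content that was encoded three times over.
sample = "&amp;amp;lt;b&amp;amp;gt;bold&amp;amp;lt;/b&amp;amp;gt;"
print(decode_fully(sample))
```

Decoding until nothing changes is a diagnostic, not a safe fix: text that legitimately discusses entities would be decoded too, so the result needs to be reviewed by eye.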

[Edited at 2011-06-30 15:06 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
@FarkasAndras Jun 30, 2011

No, the document doesn't discuss "tags", or "markup", or anything like that, and it is taken straight from a Trados Workbench exported memory, which starts with the huge RTF preamble:



{\fonttbl
{\f1 \fmodern\fprq1 \fcharset0 Courier New;}
{\f2 \fswiss\fprq2 \fcharset0 Arial;}
{\f3 \froman\fprq2 {\*\panose 02020603050405020304}\fcharset0 Times New Roman;}
{\f4 \froman\fprq2 {\*\panose 05050102010706020507}\fcharset2 Symbol;}
{\f5 \fswiss\fprq2 {\*\panose 020b0604020202020204}\fcharset0 Helvetica;}
{\f6 \fnil\fprq2 {\*\panose 05000000000000000000}\fcharset2 Wingdings;}
{\f7 \fswiss\fprq2 {\*\panose 020b0604030504040204}\fcharset0 Tahoma;}
{\f8 \froman\fprq2 {\*\panose 02020500000000000000}{\*\falt Times New Roman}\fcharset0 Melior;}
{\f9 \froman\fprq2 {\*\panose 02020603050405020304}\fcharset0 Times;}
{\f10 \froman\fprq2 {\*\panose 02020404030301010803}\fcharset0 Garamond;}
{\f11 \fswiss\fprq2 {\*\panose 00000000000000000000}\fcharset0 Univers Condensed;}
{\f12 \froman\fprq2 {\*\panose 00000000000000000000}\fcharset0 MS Serif;}
{\f13 \froman\fprq2 \fcharset238 Times New Roman CE;}
{\f14 \froman\fprq2 \fcharset204 Times New Roman Cyr;}
{\f15 \froman\fprq2 \fcharset161 Times New Roman Greek;}
{\f16 \froman\fprq2 \fcharset162 Times New Roman Tur;}
{\f17 \froman\fprq2 \fcharset177 Times New Roman (Hebrew);}
{\f18 \froman\fprq2 \fcharset178 Times New Roman (Arabic);}
{\f19 \froman\fprq2 \fcharset186 Times New Roman Baltic;}
{\f20 \froman\fprq2 \fcharset163 Times New Roman (Vietnamese);}
{\f21 \fswiss\fprq2 \fcharset238 Arial CE;}
{\f22 \fswiss\fprq2 \fcharset204 Arial Cyr;}
{\f23 \fswiss\fprq2 \fcharset161 Arial Greek;}
{\f24 \fswiss\fprq2 \fcharset162 Arial Tur;}
{\f25 \fswiss\fprq2 \fcharset177 Arial (Hebrew);}
{\f26 \fswiss\fprq2 \fcharset178 Arial (Arabic);}
{\f27 \fswiss\fprq2 \fcharset186 Arial Baltic;}
{\f28 \fswiss\fprq2 \fcharset163 Arial (Vietnamese);}
{\f29 \fmodern\fprq1 \fcharset238 Courier New CE;}
{\f30 \fmodern\fprq1 \fcharset204 Courier New Cyr;}
{\f31 \fmodern\fprq1 \fcharset161 Courier New Greek;}
{\f32 \fmodern\fprq1 \fcharset162 Courier New Tur;}
{\f33 \fmodern\fprq1 \fcharset177 Courier New (Hebrew);}
{\f34 \fmodern\fprq1 \fcharset178 Courier New (Arabic);}
{\f35 \fmodern\fprq1 \fcharset186 Courier New Baltic;}
{\f36 \fmodern\fprq1 \fcharset163 Courier New (Vietnamese);}
{\f37 \fswiss\fprq2 \fcharset238 Tahoma CE;}

etc.

And here is one TU:





[Edited at 2011-06-30 18:41 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
@Jabberwock Jun 30, 2011

Yes, I keep meaning to set aside a little time to try and figure out how to use all the new Okapi Framework stuff, like Rainbow... just haven't had time yet. I still use Olifant quite a lot, partly because Kilgray still haven't built us a decent TM editor!

Michael


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 21:45
English to Hungarian
+ ...
header Jun 30, 2011

The long RTF header is perfectly normal; don't worry about that.

The question is whether memoQ can correctly import the TUs, and if not, why not. There is a lot of crap there, but, as I said earlier, you should be able to tell memoQ to just ignore it all if you want.


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
@FarkasAndras Mar 16, 2012

Hey, thanks for that 'TMX to tabbed' tool.

I recently realised that a few of my rather large TMXs that have been going back and forth between myself and a client are FULL of that garbage. Take a look at this:






I am trying to fix it by using your tool to see if I can figure out what went wrong and where, and then removing it and re-creating a clean TM.

Michael

[Edited at 2012-03-16 00:53 GMT]

[Edited at 2012-03-16 00:59 GMT]

[Edited at 2012-03-16 01:02 GMT]


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 21:45
English to Hungarian
+ ...
What did that? Mar 16, 2012

It'd be nice to know what tool is responsible for that mess; let me know if you find out.
It seems you have a lot of &lt; ... &gt; in there that should be < ... > (some tool mistakenly encoded the actual tags as character references). It could have been Olifant, I don't know.
Here's an updated version of the TMX to tabbed tool that hasn't made it into the SourceForge package yet:
http://dl.dropbox.com/u/16377950/TMX_to_tabbed_1.5.zip
IIRC the only difference is that it now recognizes and strips some formatting tags that the previous versions missed.
It's hard to tell what it may do with your files. You might have to run the tool several times (the first run will fix the tags that were converted to character entities, the second will strip the tags).
If your CAT can correctly import the TMX, then just import it and export to a new TMX. The rubbish should be gone, or if it's still there, it might not cause problems.
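The two passes mentioned above (first turn the entity-encoded tags back into real tags, then strip the tags) can be sketched in a few lines of Python. This is only an illustration of the idea on a single string, not the code of the tool itself. The names restore_tags and strip_tags and the sample segment are made up.

```python
import html
import re

# Anything between angle brackets is treated as a tag.
TAG_RE = re.compile(r"<[^<>]+>")

def restore_tags(text):
    """Pass 1: turn entity-encoded tags (&lt;b&gt;) back into literal tags."""
    return html.unescape(text)

def strip_tags(text):
    """Pass 2: remove anything that looks like a tag, then tidy whitespace."""
    return " ".join(TAG_RE.sub("", text).split())

# Hypothetical segment whose tags were mistakenly entity-encoded.
segment = "&lt;b&gt;Warning:&lt;/b&gt; press &lt;i&gt;Stop&lt;/i&gt; first"
pass1 = restore_tags(segment)
pass2 = strip_tags(pass1)
print(pass1)
print(pass2)
```

Note that the second pass is lossy: a segment that really contains something like "x < 5 and y > 3" would lose the part between the brackets, so it is worth checking a sample of the output before rebuilding the TM.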


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 20:45
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Thanks! Mar 16, 2012

I'll try your new version on them and keep you posted.

I really have no idea what program created this mess. That's what you get from unprotected CAT hopping between agencies and translators I suppose;)

The TMs actually seem to work fine in memoQ, but the reason I wanted to clean them up is that I have been experiencing slow-downs in memoQ recently, and I was wondering if these TMXs, which sometimes contain up to 60% extra garbage characters, might have something to do with this (although it's probably more to do with the fact that I have far too many large TMs in my memoQ project ;)

Also, I recently tried OmegaT, just for fun, and that is why I noticed them. OmegaT shows all of the zillions of little [insert garbage here] etc. in the TM window, whereas memoQ just hides all of it.


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 21:45
English to Hungarian
+ ...
Options Mar 16, 2012

I see. Perhaps our resident MQ experts will tell you if there is an option in MQ to throw out the TMX formatting tags during either import or export. If you can clean the TMX with a CAT tool, then the metadata (dates, creator names etc.) will be left in the TMX, and it's probably a bit safer for the actual content as well. (The TMX to tabbed tool hasn't gone through much testing and it's lossy, so it's intended as a last-resort option.)

 

