Decoder for HTML Special Characters/Trados font codes in TMX segments Thread poster: Michael Beijer
| Michael Beijer United Kingdom Local time: 20:45 Member (2009) Dutch to English + ...
Does anyone know of a clever way to quickly convert all of the HTML special characters in a TMX to plain text, that is, to decode them? This website: http://www.web2generators.com/html/entities has a nice encoder/decoder, but I would like to be able to do it automatically on only the content of the TMX files. If I copy and paste the entire content of the TMX (after opening it in a text editor), all of the other characters are also decoded, which is of course not that useful.

Olifant has the Remove Codes function, but this only removes TMX codes, not the HTML stuff.

I am trying to figure out a way of cleaning up Trados-exported memories from clients that are so full of HTML special characters that I can't use them in memoQ. Also, I noticed that some of the "stuff" (which isn't HTML special chars) doesn't get decoded by the above-mentioned decoder, such as in this segment:

Any suggestions would be welcome!
[Edited at 2011-06-30 12:29 GMT]

Character entities | Jun 30, 2011
In principle, this isn't a problem you should need to concern yourself with. Perhaps you should fish around for some setting in memoQ that ignores the inline formatting tags. It should be there somewhere among the TMX import settings.

As to the HTML-style character entities (such as &quot;), these shouldn't pose a problem. Only a few of these (IIRC, 5) are allowed in TMX, and they should all be recognized by memoQ. I.e. the TMX contains &quot; but this should show up for you in memoQ as " when you do a concordance search or such. If you're seeing &quot; in concordance hits, then the TMX probably contains &amp;quot; due to a messed-up "doubly encoded" source file that somebody was translating.

If you want to do a complete cleanup, I have a program in my grab bag called "TMX to tabbed" that can handle both inline formatting tags and character entities. It converts TMX files to tab-delimited txt, which you can then review and convert back to TMX. This is pretty radical, though... Don't do it unless you must.

Jaroslaw Michalak Poland Local time: 21:45 Member (2004) English to Polish SITE LOCALIZER
Also, I noticed that some of the "stuff" (which isn't HTML special chars) doesn't get decoded by the above mentioned decoder, such as in this segment:

This looks like the file has been the victim of some messed-up character encoding, probably more than one round of it as well. All those &lt;-s and &gt;-s should be tags with literal <...>, not character entities. There is stuff like &lt; and &quot; in there which should probably be literal < and ". Is this excerpt from the original file as you received it, or after some conversions?

Also, could it be that the tags are actually the translatable text? I.e. somebody was translating a document that discusses and "quotes" this sort of tagged markup? I can't quite figure it out from a glance at that sample, but that whole mess could be inside inline tags, which you can choose to ignore when you're importing the TMX.
[Edited at 2011-06-30 15:06 GMT]
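
The "more than one round" of encoding described above can be checked with a few lines of Python. This is only a minimal sketch: `decode_fully` is a made-up helper name, the sample strings are invented, and it assumes you already have the segment text as a TMX parser would hand it over (i.e. with the XML layer decoded once). Any entity still visible at that point is an extra layer.

```python
import html


def decode_fully(text, max_rounds=5):
    """Unescape character entities repeatedly until the text stops changing.

    Returns (decoded_text, rounds), where rounds counts how many passes
    actually changed the text. max_rounds is an arbitrary safety limit.
    """
    rounds = 0
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
        rounds += 1
    return text, rounds


if __name__ == "__main__":
    # Invented examples: clean text, one extra layer, two extra layers.
    for sample in ['Press "OK"', "Press &quot;OK&quot;", "Press &amp;quot;OK&amp;quot;"]:
        print(repr(sample), "->", decode_fully(sample))
```

A round count above zero suggests the segment was encoded more often than it should have been. Note that this is lossy for text that legitimately talks about entities (a segment that is supposed to read `&amp;` would be changed too), so treat it as a diagnostic rather than a safe automatic fix.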
Michael Beijer United Kingdom Local time: 20:45 Member (2009) Dutch to English + ... TOPIC STARTER
@FarkasAndras | Jun 30, 2011
No, the document doesn't discuss "tags", or "markup", or anything like that, and it is taken straight from a Trados Workbench exported memory, which starts with the huge RTF preamble:

{\fonttbl {\f1 \fmodern\fprq1 \fcharset0 Courier New;} {\f2 \fswiss\fprq2 \fcharset0 Arial;} {\f3 \froman\fprq2 {\*\panose 02020603050405020304}\fcharset0 Times New Roman;} {\f4 \froman\fprq2 {\*\panose 05050102010706020507}\fcharset2 Symbol;} {\f5 \fswiss\fprq2 {\*\panose 020b0604020202020204}\fcharset0 Helvetica;} {\f6 \fnil\fprq2 {\*\panose 05000000000000000000}\fcharset2 Wingdings;} {\f7 \fswiss\fprq2 {\*\panose 020b0604030504040204}\fcharset0 Tahoma;} {\f8 \froman\fprq2 {\*\panose 02020500000000000000}{\*\falt Times New Roman}\fcharset0 Melior;} {\f9 \froman\fprq2 {\*\panose 02020603050405020304}\fcharset0 Times;} {\f10 \froman\fprq2 {\*\panose 02020404030301010803}\fcharset0 Garamond;} {\f11 \fswiss\fprq2 {\*\panose 00000000000000000000}\fcharset0 Univers Condensed;} {\f12 \froman\fprq2 {\*\panose 00000000000000000000}\fcharset0 MS Serif;} {\f13 \froman\fprq2 \fcharset238 Times New Roman CE;} {\f14 \froman\fprq2 \fcharset204 Times New Roman Cyr;} {\f15 \froman\fprq2 \fcharset161 Times New Roman Greek;} {\f16 \froman\fprq2 \fcharset162 Times New Roman Tur;} {\f17 \froman\fprq2 \fcharset177 Times New Roman (Hebrew);} {\f18 \froman\fprq2 \fcharset178 Times New Roman (Arabic);} {\f19 \froman\fprq2 \fcharset186 Times New Roman Baltic;} {\f20 \froman\fprq2 \fcharset163 Times New Roman (Vietnamese);} {\f21 \fswiss\fprq2 \fcharset238 Arial CE;} {\f22 \fswiss\fprq2 \fcharset204 Arial Cyr;} {\f23 \fswiss\fprq2 \fcharset161 Arial Greek;} {\f24 \fswiss\fprq2 \fcharset162 Arial Tur;} {\f25 \fswiss\fprq2 \fcharset177 Arial (Hebrew);} {\f26 \fswiss\fprq2 \fcharset178 Arial (Arabic);} {\f27 \fswiss\fprq2 \fcharset186 Arial Baltic;} {\f28 \fswiss\fprq2 \fcharset163 Arial (Vietnamese);} {\f29 \fmodern\fprq1 \fcharset238 Courier New CE;} {\f30 \fmodern\fprq1 \fcharset204 Courier New Cyr;} {\f31 \fmodern\fprq1 \fcharset161 Courier New Greek;} {\f32 \fmodern\fprq1 \fcharset162 Courier New Tur;} {\f33 \fmodern\fprq1 \fcharset177 Courier New (Hebrew);} {\f34 \fmodern\fprq1 \fcharset178 Courier New (Arabic);} {\f35 \fmodern\fprq1 \fcharset186 Courier New Baltic;} {\f36 \fmodern\fprq1 \fcharset163 Courier New (Vietnamese);} {\f37 \fswiss\fprq2 \fcharset238 Tahoma CE;} etc.

And here is one TU:
[Edited at 2011-06-30 18:41 GMT]

Michael Beijer United Kingdom Local time: 20:45 Member (2009) Dutch to English + ... TOPIC STARTER
Yes, I keep meaning to set aside a little time to try and figure out how to use all the new Okapi Framework stuff, like Rainbow... just haven't had time yet. I still use Olifant quite a lot, partly because Kilgray still haven't built us a decent TM editor!

Michael
The long RTF header is perfectly normal, don't worry about that. The question is whether memoQ can correctly import the TUs, and if not, why not. There is a lot of crap there, but, as I said earlier, you should be able to tell memoQ to just ignore it all if you want.

Michael Beijer United Kingdom Local time: 20:45 Member (2009) Dutch to English + ... TOPIC STARTER
@FarkasAndras | Mar 16, 2012
Hey, thanks for that 'TMX to tabbed' tool. I recently realised that a few of my rather large TMXs that have been going back and forth between myself and a client are FULL of that garbage. Take a look at this: I am trying to fix it by using your tool to see if I can figure out what went wrong and where, and then removing it and re-creating a clean TM. Michael
[Edited at 2012-03-16 00:53 GMT]
[Edited at 2012-03-16 00:59 GMT]
[Edited at 2012-03-16 01:02 GMT]
What did that? | Mar 16, 2012 |
It'd be nice to know what tool is responsible for that mess, let me know if you find out. It seems you have a lot of &lt; ... &gt; in there that should be < ... > (some tool mistakenly encoded the actual tags as character references). It could have been Olifant, I don't know.

Here's an updated version of the TMX to tabbed tool that hasn't made it into the SourceForge package yet: http://dl.dropbox.com/u/16377950/TMX_to_tabbed_1.5.zip

IIRC the only difference is that it now recognizes and strips some formatting tags that the previous versions missed. It's hard to tell what it may do with your files. You might have to run the tool several times (the first run will fix the tags that were converted to character entities, the second will strip the tags).

If your CAT can correctly import the TMX, then just import it and export to a new TMX. The rubbish should be gone, or if it's still there, it might not cause problems.

Michael Beijer United Kingdom Local time: 20:45 Member (2009) Dutch to English + ... TOPIC STARTER
I'll try your new version on them and keep you posted. I really have no idea what program created this mess. That's what you get from unprotected CAT hopping between agencies and translators, I suppose ;)

The TMs actually seem to work fine in memoQ, but the reason I wanted to clean them up is that I have been experiencing slow-downs in memoQ recently, and I was wondering if these TMXs, with sometimes up to 60% of extra garbage characters, might have something to do with this (although it's probably more to do with the fact that I have way too many, and too large, TMs in my memoQ project ;)

Also, I recently tried OmegaT, just for fun, and that is why I noticed them. OmegaT shows all of the zillions of little [insert garbage here] etc. in the TM window, whereas memoQ just hides all of it.
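
For anyone who wants to put a rough number on how much of a TMX is inline "garbage", something like the following Python sketch might help. It is an illustration only: `markup_share` is a made-up name, the sample TU is invented, and it simply assumes that everything nested inside a child element of `<seg>` (`<ut>`, `<bpt>`, `<ph>` and so on) counts as formatting. That is a simplification, since `<hi>` for instance wraps real text, and it only counts text content, not the tag names or attributes themselves.

```python
import xml.etree.ElementTree as ET


def markup_share(tmx_text):
    """Fraction of <seg> text characters that sit inside inline child elements."""
    root = ET.fromstring(tmx_text)
    plain = 0  # characters directly inside <seg>
    total = 0  # all characters inside <seg>, including nested inline content
    for seg in root.iter("seg"):
        # Direct text: the text before the first child plus the tail after each child.
        direct = (seg.text or "") + "".join(child.tail or "" for child in seg)
        plain += len(direct)
        total += len("".join(seg.itertext()))
    return 0.0 if total == 0 else 1 - plain / total


# Invented sample: one segment with two <ut> codes around the word "Hello".
SAMPLE = (
    '<tmx version="1.4"><header/><body>'
    '<tu><tuv xml:lang="en"><seg>'
    "<ut>&lt;cf bold&gt;</ut>Hello<ut>&lt;/cf&gt;</ut>"
    "</seg></tuv></tu>"
    "</body></tmx>"
)

if __name__ == "__main__":
    print(f"{markup_share(SAMPLE):.0%} of the segment text is inline code content")
```

For a real file you would read the TMX from disk instead (for example with `ET.parse`), and the result says nothing about whether the codes actually slow a CAT tool down; it only measures their share of the text.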
I see. Perhaps our resident MQ experts will tell you if there is an option in MQ to throw out the TMX formatting tags during either import or export. If you can clean the TMX with a CAT, then the metadata (dates, creator names etc.) will be left in the TMX, and it's probably a bit safer for the actual content as well. (The TMX to tabbed tool hasn't gone through much testing and it's lossy, so it's intended as a last-resort option.)
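
The two-pass idea mentioned earlier in the thread (first turn entity-encoded tags back into literal tags, then strip the tags) can be sketched in Python as below. To be clear, this is not how the TMX to tabbed tool works internally, which the thread doesn't describe; `clean_segment` and the tag pattern are assumptions made up for illustration, and like the tool it is lossy.

```python
import html
import re

# Assumed pattern for "something that looks like a tag": <name ...> or </name>.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def clean_segment(text):
    """Decode entity-encoded tags, strip tag-like spans, and tidy whitespace."""
    decoded = html.unescape(text)        # pass 1: &lt;b&gt; becomes <b>
    stripped = TAG_RE.sub("", decoded)   # pass 2: drop anything tag-shaped
    return re.sub(r"\s+", " ", stripped).strip()


if __name__ == "__main__":
    # Invented example segment with entity-encoded formatting codes.
    print(clean_segment("&lt;cf bold=&quot;on&quot;&gt;Press  Enter&lt;/cf&gt;"))
```

Because the pattern removes anything tag-shaped, genuine text that happens to look like a tag would be lost too, and doubly encoded input would need the first pass repeated. Keep a copy of the original TMX and review the output before re-importing it.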