Useful segmentation rules for Trados Studio and memoQ Thread poster: Bogdan Dusa
| Bogdan Dusa Romania Local time: 15:45 English to Romanian + ...
Hi all,

After doing some research on regular expressions (regex), and with the final help of a good friend of mine who is an expert in this field, I managed to create some useful segmentation rules for a better import of documents into projects, and I’d like to share them.

Let's say we have:

(a) This is a sentence.
1) This is a sentence.
B) This is a sentence
1.1 This is a sentence.
1.1This is a sentence

Normally, these would be imported as five segments, because there is no automatic numbering there. But it would be much more convenient to have them segmented as follows, with the numbering and the text in separate segments:

(a)
This is a sentence.
1)
This is a sentence.
etc.

To do that, in Trados Studio:

Go to Project settings - Language Pairs - select your TM - Settings - Language Resources - Segmentation Rules - Edit - Add - Advanced view, and add the two codes below in the Before break field (as two new rules), leaving the After break field empty:

^\(?[a-zA-Z0-9]+\)[\s\t]*

It means: Look for all segments that start with an optional left parenthesis, followed by one or more lowercase letters, uppercase letters or digits from 0 to 9, followed by a right parenthesis, followed by zero or more space or tab characters.

^\d{1,}\.\d{1,}[\s\t]*

It means: Look for all segments that start with one or more digits, followed by a dot character, followed by one or more digits, followed by zero or more space or tab characters.
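For anyone who wants to try the two patterns outside a CAT tool first, here is a minimal sketch in Python. It assumes that Python's re module treats these simple patterns the same way the regex engine inside Trados Studio does, which should hold for syntax this basic but is not something the sketch can prove. The helper name numbering_prefix is made up for illustration.

```python
import re

# The two "Before break" patterns from the post, copied unchanged.
PAREN_NUMBERING = re.compile(r"^\(?[a-zA-Z0-9]+\)[\s\t]*")
DOTTED_NUMBERING = re.compile(r"^\d{1,}\.\d{1,}[\s\t]*")


def numbering_prefix(line: str) -> str:
    """Return the leading numbering matched by either pattern, or "" if none.

    Hypothetical helper for illustration; it is not part of Trados or memoQ.
    """
    for pattern in (PAREN_NUMBERING, DOTTED_NUMBERING):
        match = pattern.match(line)
        if match:
            return match.group(0)
    return ""


if __name__ == "__main__":
    samples = [
        "(a) This is a sentence.",
        "1) This is a sentence.",
        "B) This is a sentence",
        "1.1 This is a sentence.",
        "1.1This is a sentence",
        "This is a sentence.",  # no numbering: neither pattern should match
    ]
    for line in samples:
        prefix = numbering_prefix(line)
        # Show where the segment break would fall: numbering | remaining text
        print(repr(prefix), "|", line[len(prefix):])
```

The point where the prefix ends is where the rule would place the segment break. A line with no numbering yields an empty prefix, so it would be left alone.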
In memoQ:

Under the Resource console - Segmentation rules, create your own rule set, or clone and then edit a default one, and add the two codes below in the Rule field:

^\(?[a-zA-Z0-9]+\)[\s\t]*#!#

It means: Look for all segments that start with an optional left parenthesis, followed by one or more lowercase letters, uppercase letters or digits from 0 to 9, followed by a right parenthesis, followed by zero or more space or tab characters, and apply a segment break there.

^\d{1,}\.\d{1,}[\s\t]*#!##cap#

It means: Look for all segments that start with one or more digits, followed by a dot character, followed by one or more digits, followed by zero or more space or tab characters, followed by a capital letter (defined under the #cap# group), and apply the segment break before the capital letter.

To make things even easier, in memoQ you also have the option of simply excluding any numbering a segment starts with. From the original:

(a) This is a sentence.

you can import only:

This is a sentence.

To do that, import the file using Import with options - Change filter and configuration - Add cascading filter. From the Filter drop-down menu, select Regex text filter, go to the Include/Exclude tab and in the Rule field add the same code(s) as above, stopping before the “segment here” part (#!#):

^\(?[a-zA-Z0-9]+\)[\s\t]*

and/or

^\d{1,}\.\d{1,}[\s\t]*

It’s up to you whether to add the second code in order to exclude the numbering from segments that start with 1.1 or 11.2 etc. It depends on the context. These may be paragraph numbers or just numbers (such as 1.1 million), in which case you don’t want to exclude them, as they have to be localized.

Hope you find it useful!
Bogdan | | | Thanks for sharing | Jun 6, 2014 |
I've been hearing about regular expressions for years and I have an idea of their usefulness, but I still haven't got around to sitting down and looking at them. This is a good incentive to try and understand what they actually are. Thank you, Philippe
[Edited at 2014-06-06 08:47 GMT] | | | Bogdan Dusa Romania Local time: 15:45 English to Romanian + ... TOPIC STARTER You're very welcome! | Jun 6, 2014 |
There is always a beginning | | | Laura Harrison United Kingdom Local time: 13:45 French to English + ... Thank you for sharing! | Jun 6, 2014 |
Will see if I can work it into my MemoQ codes | |
Chunyi Chen United States Local time: 06:45 English to Chinese How to modify these segmentation rules so that... | Jun 7, 2014 |
the numbers for bullet items can be excluded from importing to MemoQ?

Hi Bogdan,

I added the two rules you provided to the resource console and was able to exclude most of the numbered items in the file. The ones that failed to be excluded are:

1.(tab)text
2.(tab)text
3.(space space)text
...

As you can see, the source file is not well formatted, with some items using a tab and others using manual spaces. Can you tell me how to modify one of your rules to exclude the numbers (plus the dots) above and only import the text part to MemoQ?

Thanks a lot for your help!
Chun-yi | | | Chunyi Chen United States Local time: 06:45 English to Chinese problem solved | Jun 7, 2014 |
I found out that by setting Tab to "Start new segment" in Document import settings, the item numbers can be separated from the main text. It looks like I don't need special segmentation rules to achieve this goal, so the problem is solved! Chun-yi
[Edited at 2014-06-07 20:23 GMT]
[Edited at 2014-06-07 20:24 GMT] | | | Bogdan Dusa Romania Local time: 15:45 English to Romanian + ... TOPIC STARTER Better solution | Jun 7, 2014 |
Hi Chun-yi,

I do hope this is your first name.

Setting Tab to "Start new segment" is usually a good solution, but it could be tricky. Think of documents that contain intentional tabs used to align the text of a single phrase across two or more rows, or even typing errors with unintentional tabs. You would have to check the file first.

There could be a better solution, i.e. you could change the Regex codes I indicated:

^\(?[a-zA-Z0-9]+\)*[\s\t]{1,}

Modified: * inserted after \)
Meaning: the numbering may or may not be followed by a right parenthesis
Modified: {1,} after [\s\t]
Meaning: the space character or tab character repeats at least once

^\d{1,}\.\d{0,}[\s\t]{1,}

Modified: {0,} after \.\d (instead of {1,})
Meaning: the dot character may or may not be followed by a digit
Modified: {1,} after [\s\t]
Meaning: the space character or tab character repeats at least once

It should work for contexts like this:

(a)(tab)Text
(1)(tab)Text
1(tab)Text
1.(tab)Text
1(space space)Text
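As a quick check of the contexts listed above, the following Python sketch compares the original and the modified patterns when they are used to strip the numbering, which is roughly what the Regex text filter exclusion is meant to do. It assumes Python's re module handles these patterns like the engine inside memoQ, and strip_numbering is a hypothetical helper, not a memoQ function.

```python
import re

# Patterns as posted in this thread (without the memoQ-specific #!# marker).
ORIGINAL = [r"^\(?[a-zA-Z0-9]+\)[\s\t]*", r"^\d{1,}\.\d{1,}[\s\t]*"]
MODIFIED = [r"^\(?[a-zA-Z0-9]+\)*[\s\t]{1,}", r"^\d{1,}\.\d{0,}[\s\t]{1,}"]


def strip_numbering(line: str, patterns: list[str]) -> str:
    """Remove the leading numbering matched by the first pattern that applies.

    Hypothetical helper: it only imitates the idea of excluding the matched
    text on import and returns the line unchanged if nothing matches.
    """
    for pattern in patterns:
        match = re.match(pattern, line)
        if match:
            return line[match.end():]
    return line


if __name__ == "__main__":
    contexts = ["(a)\tText", "(1)\tText", "1\tText", "1.\tText", "1  Text"]
    for line in contexts:
        print(
            repr(line),
            "original ->", repr(strip_numbering(line, ORIGINAL)),
            "modified ->", repr(strip_numbering(line, MODIFIED)),
        )
```

With the original pair, items such as 1.(tab)Text stay untouched because the second pattern requires a digit after the dot. The modified pair strips the numbering from all five contexts.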
[Edited at 2014-06-07 19:09 GMT] | | | Chunyi Chen United States Local time: 06:45 English to Chinese
Hi Bogdan,

Yes, that's my first name :) Thank you so much for the modified regex. I will add them to the MemoQ segmentation rules and see how the files turn out in the MemoQ grid.

I did think of another question to ask: how would these files look when they are passed to an editor who does not have such segmentation rules in his/her MemoQ program? Once the MemoQ XLIFF files are imported into his or her MemoQ program, would the numbers still be separated from the main text as they were sent out?

Chun-yi | |
Bogdan Dusa Romania Local time: 15:45 English to Romanian + ... TOPIC STARTER Yes, they are | Jun 8, 2014 |
Hi Chun-yi,

You're welcome. Yes, the numbers are still separated, because you send the editor your imported file. Regardless of whether or not the editor has defined the same segmentation rules, he or she will only see what you send.

Just to be on the safe side, I ran two tests, one with Export bilingual as memoQ XLIFF and another with Export bilingual as Two-column RTF. The result was the same: the exported file contained only the plain text, excluding any numbering from the original file.

Bogdan | | | Chunyi Chen United States Local time: 06:45 English to Chinese
Thank you so much for the additional information! I have decided to add these rules to the resource console in MemoQ. I was just adding the new rules, but MemoQ told me they are invalid. Can you tell me if these are the ones I should add?

^\(?[a-zA-Z0-9]+\)*[\s\t]{1,}#!#
^\d{1,}\.\d{0,}[\s\t]{1,}#!##cap#

These rules didn't seem to do what they were supposed to do. I must have messed them up but don't know how to fix it.

Thank you again!
Chun-yi | | | Bogdan Dusa Romania Local time: 15:45 English to Romanian + ... TOPIC STARTER Segmentation rules or Regex text filter? | Jun 8, 2014 |
Hi Chun-yi, I ran a test with your modified codes, and apparently there is nothing wrong with them as long as you use them under Segmentation rules. #!# means "segment break here". Otherwise, if you want to use them under Regex text filter in order to exclude any numbering, delete the final part of the codes (#!# and #!##cap# respectively), as it is not needed there. Bogdan | | | Chunyi Chen United States Local time: 06:45 English to Chinese segmentation rules | Jun 8, 2014 |
Hi Bogdan,

Thank you for not giving up on me. I was adding the regex rules under segmentation rules. The modified rules chopped up sentences, such as material[seg]sensitivity reactions, infection[seg] or allergic reaction.

Since I can separate numbers from text with MemoQ's existing feature (start as new segment), I will just use it for this project and come back to try these rules when I am more familiar with regex.

Chun-yi | |
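One possible reason for unwanted breaks can be checked outside memoQ: once the right parenthesis becomes optional, the first modified pattern no longer requires any numbering marker, so it also matches an ordinary word followed by a space. The Python sketch below shows this on isolated strings only. It does not reproduce how memoQ applies segmentation rules inside a paragraph, and the TIGHTER pattern is a hypothetical variant that does not come from this thread and has not been tested in memoQ.

```python
import re

# First modified rule from the thread (without the memoQ #!# marker).
MODIFIED_PAREN = re.compile(r"^\(?[a-zA-Z0-9]+\)*[\s\t]{1,}")

# Hypothetical tighter variant, NOT from the thread and untested in memoQ:
# either up to three letters/digits closed by ")", or digits with an
# optional dotted part, followed by at least one whitespace character.
TIGHTER = re.compile(r"^(?:\(?[a-zA-Z0-9]{1,3}\)|\d+(?:\.\d*)?)[\s\t]+")


def matched_prefix(pattern: re.Pattern, text: str) -> str:
    """Return the text matched at the start of the string, or "" if none."""
    match = pattern.match(text)
    return match.group(0) if match else ""


if __name__ == "__main__":
    for text in ["allergic reaction", "(a)\tText", "1.\tText", "1  Text"]:
        print(
            repr(text),
            "modified ->", repr(matched_prefix(MODIFIED_PAREN, text)),
            "tighter ->", repr(matched_prefix(TIGHTER, text)),
        )
```

The tighter variant still matches plain numbers at the start of a line (for example 1.1 million), so the caveat about numbers that need to be localized applies to it as well.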
Bogdan Dusa Romania Local time: 15:45 English to Romanian + ... TOPIC STARTER Everybody learns | Jun 8, 2014 |
Hi Chun-yi,

No problem, we all learn from each other here. How do you think I started with the Regex codes myself? Anyway, if you want to dig further, you can take a look at these sites (among many others):

http://www.regular-expressions.info/
http://www.jedit.org/users-guide/regexps.html
http://www.dreambank.net/regex.html

Good luck!
Bogdan