Segment length analysis?
Thread poster: Mirko Mainardi
Mirko Mainardi
Mirko Mainardi  Identity Verified
Italy
Local time: 18:25
Member
English to Italian
Apr 26, 2019

Hi everyone.

For some reason I never really thought about this before... but are there CAT tools that provide an analysis of segments based on their length? I mean, everyone knows that translating 2,500 isolated one-word terms is totally different from translating 25 segments of 100 words of cohesive text each, even if the total word count is the same.

And if the answer to the above question is "no"... why?


Jorge Payan
 
RWS Community
RWS Community
United Kingdom
Local time: 18:25
English
What would you like to see? Apr 26, 2019

You already have the analysis by character, word and segment. So if there were 2500 segments and 2500 words you'd know. How would you like to see the analysis?

Regards

Paul
http://xl8.one


 
Mirko Mainardi
Mirko Mainardi  Identity Verified
Italy
Local time: 18:25
Member
English to Italian
TOPIC STARTER
Agnostic Apr 26, 2019

Thank you for your reply Paul. At any rate, this was supposed to be an agnostic question (i.e. not specifically SDL-related).

Also, if I'm not mistaken, what the Studio analysis says is based on totals and "fuzzy bands", while what I meant is an analysis specifically based on number of words per segment.

In other words, if the analysis says a file has 100 segments, 2500 words, and 10000 characters, all I know is what the average length per segment is (25 words), but in reality, I could have a few segments with big chunks of text and a lot of smaller/tiny segments.

So, what I'm talking about is a breakdown based on segment length rather than (or "in addition to", of course...) fuzzy matching, so that a translator would have an additional metric to discern how time-consuming a task could be, at a glance.


 
Philippe Etienne
Philippe Etienne  Identity Verified
Spain
Local time: 18:25
Member
English to French
Warning Apr 26, 2019

A breakdown by source segment length could look like this:
< 5 words 18% (titles, software strings, headlines, tables: more time)
5-19 words 64% (sentences: standard)
> 19 words 18% (long sentences: perhaps more time to convey with style)
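A breakdown like Philippe's can be computed in a few lines of Python; the sketch below uses his band boundaries, but the segment list and function names are made up for illustration, and word counting is naive whitespace splitting rather than any CAT tool's counting rules:

```python
from collections import Counter

def band(word_count: int) -> str:
    """Assign a segment to one of three length bands (Philippe's boundaries)."""
    if word_count < 5:
        return "< 5 words"
    if word_count <= 19:
        return "5-19 words"
    return "> 19 words"

def length_breakdown(segments):
    """Return the share of segments falling in each length band."""
    counts = Counter(band(len(s.split())) for s in segments)
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items()}

segments = ["Save", "Cancel all pending jobs",
            "The quick brown fox jumps over the lazy dog near the river bank."]
print(length_breakdown(segments))
```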

The middle band deserves discounts, I think.

Philippe


 
Jean Dimitriadis
Jean Dimitriadis  Identity Verified
English to French
+ ...
Filter segments by length Apr 26, 2019

In CafeTran Espresso, in addition to the CAT file analysis (number of segments/words/characters) [and SDL Trados can also provide these details without any fuzzy matching/TM attached], which gives a good idea of the average number of words per segment, you can quickly sort (filter) segments by length (short or long first). I think memoQ offers that as well.

You can also use a QA step for displaying only segments above a user-defined maximum character count.

This is not a standard analysis as you mean it, but it does provide a rough overview that should be sufficient to understand at a glance whether the project has many short segments, many long segments, or a mix.
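That QA-style length check is also easy to approximate outside any CAT tool; a minimal Python sketch (the function name and default threshold are invented here, not CafeTran's):

```python
def over_length(segments, max_chars=100):
    """Return (index, segment) pairs whose source text exceeds a
    user-defined character count, longest first."""
    hits = [(i, s) for i, s in enumerate(segments) if len(s) > max_chars]
    return sorted(hits, key=lambda pair: len(pair[1]), reverse=True)
```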

When speaking of translation difficulty, a quantitative analysis (especially total word count alone) can only get you so far.

I’m still refining my own pre-translation analysis process for estimating time and translation difficulty; it is a tricky subject for sure.

[Edited at 2019-04-26 18:23 GMT]


 
Mirko Mainardi
Mirko Mainardi  Identity Verified
Italy
Local time: 18:25
Member
English to Italian
TOPIC STARTER
Additional metric Apr 26, 2019

Jean Dimitriadis wrote:

When speaking of translation difficulty, a quantitative analysis (especially total word count alone) can only get you so far.


Yes Jean, I do agree, and that's why I wrote this would be "an additional metric to discern how time-consuming a task could be, at a glance".

However, good point about sorting segments by length, although I would much prefer a report.

Philippe Etienne wrote:

A breakdown by source segment length could look like this:
< 5 words 18% (titles, software strings, headlines, tables: more time)
5-19 words 64% (sentences: standard)
> 19 words 18% (long sentences: perhaps more time to convey with style)

The middle band deserves discounts, I think.


Yeah, something like that, even though I would like a detailed breakdown, especially for shorter (i.e. <7 words) segments. Also, I don't think this could be used to give or request further discounts (in addition to those for fuzzies...). Just as an example, a lot of 1-2 word segments would basically amount to glossary building, and would in any case take more time than longer, cohesive text, so in my opinion it would be useful to have a quick way to check that (ideally before accepting a project...).

[Edited at 2019-04-26 20:05 GMT]


Philippe Etienne
 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 18:25
Member (2006)
English to Afrikaans
+ ...
@Paul Apr 27, 2019

SDL Community wrote:
So if there were 2500 segments and 2500 words you'd know.


What if there were 1000 segments and 10 000 words? That's 10 words per segment, on average. But the time saving on very long segments does not cancel out the time wastage on very short segments. A 30-word segment does not really take more time per word than a 20-word segment, but a 3-word segment takes up much more time per word than a 10-word segment.

I mean, suppose 100 of those segments have only 1 word, and 100 have only 2 words, and 100 have only 3 words, then the average length of the remaining 700 segments (the remaining 9400 words) is 13 words per segment. The 300 short segments will take up far more time per word than the average.

It takes me (generally) just as long to translate a 1-word segment as a 3-word segment or even a 5-word segment. So for me, if I wanted the weighted word count to be an accurate indication of the amount of time it will take to do the job, all segments of 5 words or fewer should be counted as 5 words.

So let's recalculate the 10 000-word example:

100 x 1-word segments: 100 words actual, 500 words weighted
100 x 2-word segments: 200 words actual, 500 words weighted
100 x 3-word segments: 300 words actual, 500 words weighted
Other segments: 9400 words actual

The adjusted word count, then, is 10900 words (i.e. it would take two to three hours longer to complete the job than a strictly average 10 000 words).
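Samuel's floor-at-5-words rule is mechanical enough to script; a minimal Python sketch reproducing his arithmetic (the function name is invented for illustration):

```python
def weighted_words(segment_word_counts, floor=5):
    """Count every segment as at least `floor` words, per Samuel's rule."""
    return sum(max(n, floor) for n in segment_word_counts)

# Reproduce the 10 000-word example:
counts = [1] * 100 + [2] * 100 + [3] * 100  # 600 actual words in short segments
# The remaining 700 segments hold 9400 words (~13 words each, all >= 5),
# so they are counted as-is.
adjusted = weighted_words(counts) + 9400
print(adjusted)  # 10900
```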


[Edited at 2019-04-27 06:25 GMT]


RWS Community
 
RWS Community
RWS Community
United Kingdom
Local time: 18:25
English
All agreed... Apr 27, 2019

Samuel Murray wrote:

SDL Community wrote:
So if there were 2500 segments and 2500 words you'd know.


What if there were 1000 segments and 10 000 words? That's 10 words per segment, on average. But the time saving on very long segments does not cancel out the time wastage on very short segments. A 30-word segment does not really take more time per word than a 20-word segment, but a 3-word segment takes up much more time per word than a 10-word segment.

(...)


That's why I asked what you'd like to see. In terms of helping with project estimation this seems like an interesting way forward. Perhaps this is something we could do as a small plugin so you have an additional analysis. Any developer could add this using the API... but if nobody here can develop it, perhaps I'll add it to our list of things to do.

Regards

Paul


 
Philippe Etienne
Philippe Etienne  Identity Verified
Spain
Local time: 18:25
Member
English to French
MeToo Apr 29, 2019

Samuel Murray wrote:
...
It takes me (generally) just as long to translate a 1-word segment as a 3-word segment or even a 5-word segment. So for me, if I wanted the weighted word count to be an accurate indication of the amount of time it will take to do the job, all segments of 5 words or fewer should be counted as 5 words.
...

While I'm opposed, for philosophical reasons, to weighted word counts potentially higher than the actual word count, I see the point. In all fairness, small segments shouldn't be "discounted".

Simpler to visualise than a segment word-count breakdown, such a weighted word count would, I think, already lead to a much more accurate anticipation of the translation time required.

Yet CAT tool makers, while coming up with "partial matches", "analyses", "non-existing matches that will exist later", "tags/numbers that don't count" and the like, have never implemented a threshold (I also think around 3-5 words is realistic) below which the contents of small segments are reported as full words, neither weighted nor discounted.
If there are only a few mini-segments, the buyer would "lose" a few pennies, and if there are a lot, the translator would actually be paid for the extra time needed.
However, I am aware that weighted word counts have long lost their primary function of anticipating the time required: for instance, 80% discounts on 95-99% concordance matches seem to be common practice with a certain type of agency, whereas 15 years ago most agencies used a single discount rate for all 75-99% matches.
To actually anticipate the time needed, I use a slightly amended historical version of the three-thirds scheme (33/66/100), with fuzzies in the 75-99% concordance band.

Besides, I can't imagine any CAT tool maker implementing a small-segment threshold, because its analyses would consistently yield higher weighted word counts than the competition's. Hardly a selling point in the agency market, which to a significant extent shapes which CAT tools translators buy.
After almost 20 years of daily CAT tool use, I've never seen any "ground-breaking", "innovative" or "killer" feature increase weighted word counts! And don't get me started on the "significant productivity gains" invoked to justify the downward trend of weighted word counts, together with the downward trend of discount grids and the stagnation of the unit rate.

Philippe


Mirko Mainardi
 
Mirko Mainardi
Mirko Mainardi  Identity Verified
Italy
Local time: 18:25
Member
English to Italian
TOPIC STARTER
News? Apr 3, 2020

Any news on this, one year later? Maybe some external tools to do it, if CAT tool developers don't feel this deserves their attention (for whatever reason, as it seems pretty important to me as a translator...)?

 
Robin LEPLUMEY
Robin LEPLUMEY
France
Local time: 18:25
English to French
+ ...
VBA macro Apr 6, 2020

Mirko Mainardi wrote:

Any news on this, one year later? Maybe some external tools to do it, if CAT tool developers don't feel this deserves their attention (for whatever reason, as it seems pretty important to me as a translator...)?


Hi Mirko,

I guess it would be feasible to write a VBA macro for Word able to do this.

Edit: maybe this thread would be a good start: https://superuser.com/questions/1170594/count-number-of-words-in-each-sentence-in-microsoft-word

Robin

[Edited at 2020-04-06 13:01 GMT]


 
Mirko Mainardi
Mirko Mainardi  Identity Verified
Italy
Local time: 18:25
Member
English to Italian
TOPIC STARTER
Thanks Apr 6, 2020

Robin LEPLUMEY wrote:

Hi Mirko,

I guess it would be feasible to write a VBA macro for Word able to do this.

Edit: maybe this thread would be a good start: https://superuser.com/questions/1170594/count-number-of-words-in-each-sentence-in-microsoft-word


Hi Robin, thanks for your reply. That's an interesting idea, and I guess something even "easier" (to some extent...) would be feasible just with formulas in a spreadsheet, but I believe this goes beyond my abilities and/or the time I'm willing to spend on something like that! Besides, I believe what a translator would actually need is an "agnostic" solution, so ideally something able to process different file formats. In other words, yes, I might need to run a quick analysis on a bunch of .doc files, but they might also be .po files, or .xlsx, or .xliff, etc.

Incidentally, that's also why the ideal solution would be for a CAT tool developer to integrate that functionality in their software, so that it could be used on whatever file format is supported by it (and again, I'm still baffled as to why they don't...).


Robin LEPLUMEY
 


