Very large TMs (~10 million TU)
FarkasAndras (English to Hungarian), thread poster
Nov 11, 2012

So, I'm in the process of generating some truly massive TMs, in the 6 to 15 million segment range. What are people's experiences with massive TMs? I'm interested in data about all major CAT tools. Studio, memoQ, Wordfast Pro and whatever else people are using.
What are the maximum supported sizes as claimed by the vendors? What is the largest TM you have successfully used? What tricks have you tried to get these massive TMs imported and working? What are the minimum hardware recommendations (how much memory does a 1 million+ TU TM use when loaded, and how much HDD space does it take up)?
I've only just started testing, and an 8 million TU import failed in Studio despite decent hardware (blazing fast SSD, mid-range Core i5 and 8 GB RAM). The import progressed pretty quickly up to the first 200,000 or so TUs, then slowed to a crawl. It was nowhere near done after 8 hours and was progressing at less than 5 segments a second, so I shut it down. Maybe I'll chop it up into 8 TMs of 1 million TUs each. But I don't know if, for instance, importing several smaller TMXs into the same TM might work better than importing one large TMX. I also don't know whether Studio performs better when searching several smaller TMs than one gigantic TM. The same applies to other tools, of course.
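For anyone planning the same kind of chop-up: this can be scripted. Below is a minimal Python sketch (file names and chunk size are made up; it assumes a well-formed TMX) that streams a big TMX and writes it out in fixed-size chunks, copying the original header into each one:

```python
import copy
import xml.etree.ElementTree as ET

INFILE = "big.tmx"       # hypothetical input file
CHUNK_SIZE = 1_000_000   # TUs per output chunk

def write_chunk(header, tus, index):
    # Rebuild a standalone TMX file around this batch of TUs.
    root = ET.Element("tmx", version="1.4")
    root.append(header)
    body = ET.SubElement(root, "body")
    body.extend(tus)
    ET.ElementTree(root).write(f"big_{index:02d}.tmx",
                               encoding="utf-8", xml_declaration=True)

header, tus, index = None, [], 1
# iterparse streams the file, so the whole TMX never sits in memory.
for _, elem in ET.iterparse(INFILE, events=("end",)):
    if elem.tag == "header":
        header = copy.deepcopy(elem)
    elif elem.tag == "tu":
        tus.append(copy.deepcopy(elem))
        elem.clear()                      # keep memory use flat
        if len(tus) >= CHUNK_SIZE:
            write_chunk(header, tus, index)
            tus, index = [], index + 1
if tus:                                   # write the last partial chunk
    write_chunk(header, tus, index)
```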

Paul, if you are willing to help clear things up / get some testing done on Studio, please contact me.


 
Siegfried Armbruster (Germany, English to German), in memoriam
This is truly massive Nov 11, 2012

FarkasAndras wrote:

So, I'm in the process of generating some truly massive TMs, in the 6 to 15 million segment range.


Hi Farkas, the largest TMs we have created so far are around 1 million segments (about 1 GB in size). Using them in Trados 2011 produced no problems at all, and using several TMs in the 300,000 to 1 million segment range at the same time also worked without problems.

The main systems used here are dual-core machines with 8 GB RAM and Windows 7 64-bit.


 
Hermann Bruns (English to German)
MetaTexis for Word Nov 11, 2012

Hello Andras,

most CAT tools should have no problem importing large TMs - in principle, at least. The database engines used by professional CAT tools usually support huge database sizes. This is also true for MetaTexis for Word if you use the SQLite engine, MySQL, or MS SQL Server (whereas the MS Access engine will fail). You can download a trial version of MetaTexis for Word at www.metatexis.com.

HOWEVER, importing huge TMs can take a lot of time! And search performance can suffer, too. This simple rule cannot be overcome: the more data there is to search, the longer a search takes. Only the degree of the slowdown differs, depending on the inventiveness of the software developers and on the search settings. E.g. in MetaTexis you can tune the search parameters to increase search speed (at the cost of search accuracy, the number of hits, or both). And, of course, a faster computer always helps...
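To make the rule concrete, here is a generic SQLite sketch - not how MetaTexis actually stores TUs, just an illustration of why the settings matter: an indexed exact-match lookup stays fast as the table grows, while a substring (concordance-style) search has to scan every row:

```python
import sqlite3, time

con = sqlite3.connect("tm.db")          # hypothetical TM database
con.execute("CREATE TABLE IF NOT EXISTS tu (src TEXT, tgt TEXT)")
# A B-tree index keeps exact and prefix lookups near O(log n):
con.execute("CREATE INDEX IF NOT EXISTS idx_src ON tu(src)")

t0 = time.perf_counter()
con.execute("SELECT tgt FROM tu WHERE src = ?",
            ("An exact source sentence",)).fetchall()
t1 = time.perf_counter()

# A substring search cannot use that index: it scans every row,
# so it slows down roughly linearly with TM size.
con.execute("SELECT tgt FROM tu WHERE src LIKE ?",
            ("%some fragment%",)).fetchall()
t2 = time.perf_counter()

print(f"indexed lookup: {t1 - t0:.4f}s, full scan: {t2 - t1:.4f}s")
```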

One strategy to avoid an extremely huge TM is to split it up into subset TMs. Whether this is feasible depends on the nature of the data - that is something only you can decide.

Best regards
Hermann


 
Michael Beijer (United Kingdom, Dutch to English)
The view from memoQ. Nov 11, 2012

memoQ stores its TMs in a folder containing various files.

This is an example of how memoQ stores the large Dutch-English DGT-TM-2011 TM on my computer:

[screenshot of the TM folder omitted]
The Dutch-English DGT-TM-2011 .TMX is 1.28 GB and contains 1,910,694 segments.
However, my memoQ TM takes up 6.15 GB on disk, which is a little strange: I just checked, and most of my memoQ TMs seem to be around double the original TMX size, not 4.8 times bigger. See * below.

Import time

If I remember correctly, importing the DGT-TM-2011 into a new TM took around 2 hours. However, I also remember that I kept running into errors and eventually had to split it into 2 TMXs (using EmEditor) before I could get it to import without errors. Not bad when you compare it to Déjà Vu, which I tested the other day; I gave up after 8 hours of waiting for a TMX to import. Basically, it seems that anything around a million segments needs to be cut up into smaller chunks before it can be imported.

RAM usage

memoQ doesn't load TMs into RAM (as CafeTran does, for example), so the amount of RAM in your computer isn't that important. My work computer is a desktop PC running Win7 64-bit with 16 GB of RAM, a 3.07 GHz i7 and a new SSD. This is subjective, but I think the new 64-bit version of memoQ is even faster than it already was.

Average concordance look-up times

I just ran some tests: a concordance search for a two-word phrase takes around 1.5 seconds. Running a concordance search on an entire segment from the DGT-TM-2011,

'Twee jaar na de ingebruikneming van het VIS, en vervolgens om de twee jaar, legt de beheersautoriteit aan het Europees Parlement, de Raad en de Commissie een verslag voor over de technische werking van het VIS, daaronder begrepen de beveiliging ervan.'

takes about 10 seconds.

This might seem a little long, but I hardly ever do concordance searches for strings that long. Also, I just checked and I have a total of around 8,148,500 (!!!) segments in my various connected TMs in memoQ. And yet, when translating, memoQ never feels slow (especially this new 64-bit version). I can't say the same for pretty much every other CAT tool I have tried.
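memoQ's internals aren't public, so this is only a guess at why it stays responsive: a disk-backed full-text index can answer phrase queries without loading the TM into RAM. A generic SQLite FTS5 sketch of a concordance-style lookup (file and table names, and the sample row, are made up):

```python
import sqlite3

con = sqlite3.connect("concordance.db")   # hypothetical index file
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS tu USING fts5(src, tgt)")
con.execute("INSERT INTO tu VALUES (?, ?)",
            ("Twee jaar na de ingebruikneming van het VIS ...",
             "Two years after the VIS is brought into operation ..."))
con.commit()

# Phrase query against the on-disk index; RAM use stays flat because
# FTS5 reads index pages from disk only as needed.
rows = con.execute(
    "SELECT src, tgt FROM tu WHERE tu MATCH ? LIMIT 20",
    ('"ingebruikneming van het VIS"',)).fetchall()
print(rows)
```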

Michael


*
KDE4 v2: 154,829 segments (.TMX: 147 MB; memoQ: 346 MB)
Europarl3: 1,285,373 segments (.TMX: 1.69 GB; memoQ: 4.43 GB)
European Commission: 368,547 segments (.TMX: 476 MB; memoQ: 1.17 GB)
DGT 2007: 382,556 segments (.TMX: 1.28 GB; memoQ: 6.15 GB)
DGT-TM-2011: 1,910,694 segments
EUconst: 6,479 segments
EMEA: 325,355 segments
ECDC: 2,405 segments
European Central Bank: 66,586 segments
BEIJER_TM: 675,712 segments
BEIJER_TM2: 360,345 segments
ACROSS: 455,585 segments
translationproject.org: 86,522 segments
TAUS: 537,596 segments
TAUS2: 679,626 segments
PHP: 4,325 segments


 
FarkasAndras (English to Hungarian), topic starter
split Nov 11, 2012

Thanks for the info and keep it coming.

Hermann Bruns wrote:
One strategy to avoid an extremely huge TM is to split it up into subset TMs. Whether this is feasible depends on the nature of the data - that is something only you can decide.

I can do this and I plan to. The question is whether it will improve lookup times (i.e. what's faster: a lookup on 5 TMs with 1 million TUs each, or a lookup on a single TM with 5 million TUs - assuming the CAT tool even supports that).
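That question can at least be rehearsed outside any CAT tool. A toy SQLite benchmark (scaled-down sizes, synthetic data; it says nothing about any particular CAT engine, only about indexed lookups in general) comparing one big indexed table against several smaller ones:

```python
import sqlite3, time

N, PARTS = 1_000_000, 5                  # scaled down; bump to taste
rows = [(f"segment {i}",) for i in range(N)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE big (src TEXT)")
con.executemany("INSERT INTO big VALUES (?)", rows)
con.execute("CREATE INDEX i_big ON big(src)")

step = N // PARTS
for p in range(PARTS):
    con.execute(f"CREATE TABLE part{p} (src TEXT)")
    con.executemany(f"INSERT INTO part{p} VALUES (?)",
                    rows[p * step:(p + 1) * step])
    con.execute(f"CREATE INDEX i_part{p} ON part{p}(src)")

probe = ("segment 424242",)
t0 = time.perf_counter()
con.execute("SELECT 1 FROM big WHERE src = ?", probe).fetchall()
t1 = time.perf_counter()
for p in range(PARTS):                   # one lookup per small TM
    con.execute(f"SELECT 1 FROM part{p} WHERE src = ?", probe).fetchall()
t2 = time.perf_counter()
print(f"1 x {N}: {t1 - t0:.6f}s   {PARTS} x {step}: {t2 - t1:.6f}s")
```

On a B-tree index both variants are O(log n) per lookup, so five lookups on 1 million rows should cost about the same as one lookup on 5 million - but real CAT engines do fuzzy matching on top, so only testing in the actual tool will tell.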
I can split up the TM into 20 subsets of varying sizes, each covering a different (sub)field. The only trouble is that many TUs are assigned to more than one field (2, 3 or even 4), so if I split by field (which is advisable, as it makes it possible to pick and choose TMs for different jobs), I get duplicated TUs. I'm not sure how many duplicates there will be; that's one of the things I'll be testing.
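Counting the duplicates can be scripted before importing anything. A sketch, assuming each <tu> carries one <prop type="field"> element per assigned field - that property name is hypothetical, so adjust it to whatever the real TMX uses:

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Tally how many fields each TU is assigned to; a TU with n > 1
# fields would appear in n subset TMs, i.e. n - 1 extra copies.
fields_per_tu = Counter()
for _, elem in ET.iterparse("big.tmx", events=("end",)):
    if elem.tag == "tu":
        n = sum(1 for p in elem.findall("prop")
                if p.get("type") == "field")
        fields_per_tu[n] += 1
        elem.clear()                      # keep memory use flat

extra = sum((n - 1) * count for n, count in fields_per_tu.items() if n > 1)
print(fields_per_tu.most_common(), "->", extra, "duplicated copies")
```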


 
FarkasAndras (English to Hungarian), topic starter
8 million TUs Nov 11, 2012

Michael Beijer wrote:
I have a total of around 8,148,500 (!!!) segments in my various connected TMs in memoQ. And yet, when translating, memoQ never feels slow (especially this new 64-bit version). I can't say the same for pretty much every other CAT tool I have tried.

Michael


[TM list snipped]


Are these all local TMs, stored on your computer and used simultaneously in the same project? If so, that's impressive - and promising for my project. It means that with memoQ, the full dataset can realistically be used all at the same time (split up into various TMs). I expect most CAT tools will become too slow at 5 million TUs and up.


 
Michael Beijer (United Kingdom, Dutch to English)
memoQ = built for speed Nov 11, 2012

FarkasAndras wrote:

Are these all local TMs, stored on your computer and used simultaneously in the same project? If so, that's impressive - and promising for my project. It means that with memoQ, the full dataset can realistically be used all at the same time (split up into various TMs). I expect most CAT tools will become too slow at 5 million TUs and up.


Yup. They are all contained in my memoQ ‘Translation Memories’ folder, which weighs in at 25.2 GB. I usually just have them all switched on when translating and notice no slowdowns whatsoever, unless I am doing a concordance search for a very long phrase. But that's very rare, and the 10 seconds I then have to wait is forgivable. memoQ seems to be designed to deal happily with very large amounts of data. However, I do suspect that this partially relies on having an SSD, a decent amount of RAM and a good CPU.

Michael


 


