Hi all, I want to build a 5-gram language model, so I would like to know whether it is necessary to do these three steps: tokenize the training data, filter out long sentences, and lowercase the training data, and why.
Can anyone help me?
Best regards
plotinus Local time: 22:58 Italian to Portuguese + ...
It depends
Aug 12, 2010
Hi cyrine84,
a language model is not a one-size-fits-all tool that you can apply to every single situation. In short, a language model will (and should) only model the characteristics that you want it to model. You should also be aware that the order you choose (you want a 5-gram model) is subject to practical limits: for many languages, such a large gram size would either result in an extremely large model in terms of bytes (which can make it slow or even impossible to load in memory) or in a model that needs to be pruned.
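To see why the order matters so much, here is a minimal sketch (toy corpus, standard library only) showing that the number of distinct 5-grams grows much faster than the number of distinct words; over a real corpus this is what blows up the model size:

```python
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

tokens = "the cat sat on the mat and the dog sat on the rug".split()

unigram_counts = Counter(ngrams(tokens, 1))
fivegram_counts = Counter(ngrams(tokens, 5))

# Even in 13 tokens, there are already more distinct 5-grams (9)
# than distinct words (8); the gap widens enormously on real data.
print(len(unigram_counts))
print(len(fivegram_counts))
```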
Anyway, filtering out long sentences is usually not necessary for language models unless you are concerned about how long the model takes to build (it matters a lot, however, if you want training candidates for statistical machine translation). The reason is that you will usually prune uncommon grams anyway, i.e., grams whose frequencies fall below a given cutoff, usually obtained with an estimator.
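The pruning step is just a frequency cutoff. A minimal sketch with made-up counts (real toolkits such as SRILM expose this per order, via the -gtNmin options of ngram-count, if I recall correctly):

```python
from collections import Counter

# Hypothetical 5-gram counts from a corpus (toy numbers for illustration).
counts = Counter({
    ("in", "the", "middle", "of", "the"): 42,
    ("i", "would", "like", "to", "thank"): 17,
    ("purple", "monkey", "dishwasher", "on", "mars"): 1,
})

# Keep only grams seen at least MIN_COUNT times; everything rarer
# is dropped and handled by the model's smoothing/backoff instead.
MIN_COUNT = 2
pruned = {g: c for g, c in counts.items() if c >= MIN_COUNT}

print(len(pruned))
```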
Lowercasing the training data is usually done both to get a smaller/faster language model and to make it more general and less context-specific. If you are unsure whether to lowercase (i.e., if what you want to model is not case-significant), you should lowercase.
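As a rough sketch of what tokenizing plus lowercasing a training line means in practice (a crude regex tokenizer for illustration; a real pipeline would use a proper tokenizer, e.g. the Moses tokenizer.perl script, before lowercasing):

```python
import re

def preprocess(line):
    """Lowercase a line and split punctuation off from words,
    so that e.g. "mat." and "mat" become the same token."""
    line = line.lower()
    line = re.sub(r"([.,!?;:])", r" \1 ", line)
    return line.split()

print(preprocess("The cat sat on the mat."))
```

Without this step, "The", "the", and "the," would all be counted as different grams, fragmenting your counts for no gain when case does not matter to what you are modeling.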
Could you be a little more specific about what you want to model and why?