New free & open source aligner (for Windows, OS X and linux) (CAT Tools Technical Help)


Thread poster: FarkasAndras

KylaR

More questions...

Feb 21, 2013

Thank you! But it doesn't work the way we wanted it to. Maybe I didn't use the right formula for columns E and F?

I used the following:
=SI(OU(A1="";B1="");SI(OU(A3="";B3="");A2&" "&A3;A2);SI(OU(A2="";B2="");"";SI(OU(A3="";B3="");A2&" "&A3;A2)))
=SI(OU(B1="";C1="");SI(OU(B3="";C3="");B2&" "&B3;B2);SI(OU(B2="";C2="");"";SI(OU(B3="";C3="");B2&" "&B3;B2)))
=SI(OU(C1="";D1="");SI(OU(C3="";D3="");C2&" "&C3;C2);SI(OU(C2="";D2="");"";SI(OU(C3="";D3="");C2&" "&C3;C2)))

The empty cell in B884 is handled correctly: A883 and A884 are merged, and so are C883 and C884.

But empty cells in column A don't work quite right: C880 and C881 are merged, but not B880 and B881.

***

Another question (probably the only thing left I hadn't asked!): do you think providing a dictionary of character names would help anchor the segments better?

If so, how do I go about it?

Right now, I have:
\LF_aligner_3.11_win\scripts\hunalign\data\en-en.dic (25/01/2013)
\LF_aligner_3.11_win\scripts\hunalign\data\en-es.dic
\LF_aligner_3.11_win\scripts\hunalign\data\en-fr.dic (15/01/2013)
\LF_aligner_3.11_win\scripts\hunalign\data\en-hu.dic
\LF_aligner_3.11_win\scripts\hunalign\data\en-it.dic
\LF_aligner_3.11_win\scripts\hunalign\data\en-nl.dic
\LF_aligner_3.11_win\scripts\hunalign\data\en-pt.dic (16/02/2013)
\LF_aligner_3.11_win\scripts\hunalign\data\info.txt
\LF_aligner_3.11_win\scripts\hunalign\data\null.dic
\LF_aligner_3.11_win\scripts\hunalign\data\raw\dicmaker.pl
\LF_aligner_3.11_win\scripts\hunalign\data\raw\en.txt
\LF_aligner_3.11_win\scripts\hunalign\data\raw\fr.txt
\LF_aligner_3.11_win\scripts\hunalign\data\raw\hu.txt
\LF_aligner_3.11_win\scripts\hunalign\data\raw\it.txt
\LF_aligner_3.11_win\scripts\hunalign\data\raw\nl.txt
\LF_aligner_3.11_win\scripts\hunalign\data\raw\pt.txt
\LF_aligner_3.11_win\scripts\hunalign\data\raw\[many other languages]

If I want to add stuff, do I change the individual en and pt files, or the en-pt.dic? (When are those generated, btw? The modification dates I mentioned are confusing me a bit.)

Or is it better to add a new, separate dictionary? But in that case, how do I tell LF Aligner to use it on top of the general dictionary?

I know your doc says "To add your own dictionary, read Hunalign's documentation.", but I wasn't really able to find anything beyond this page: http://mokk.bme.hu/en/resources/hunalign/... Do you have a specific link?

Thanks!

FarkasAndras

English to Hungarian

TOPIC STARTER

.dic

Feb 21, 2013

To be honest, I can't bring myself to review the Excel formula again. It makes my head hurt. Maybe A881 had a space or some other invisible character in it. Or, if the formula is broken, maybe some Excel wizard will fix it for you, or maybe I'll be really bored on the plane tomorrow :)

Now, to the dictionary files.
Hunalign's documentation says this about .dic files:
"The dictionary consists of newline-separated dictionary items. An item consists of a target language phrase and a source language phrase, separated by the " @ " sequence. Multiword phrases are allowed. The words of a phrase are space-separated as usual. IMPORTANT NOTE: In the current version, for historical reasons, the target language phrases come first. Therefore the ordering is the opposite of the ordering of the command-line arguments or the results."

That's about all you need to know. If you want to add terms to an existing .dic file, just open it in a text editor and add new lines, or if you want to replace it altogether, delete its contents and add your own in the same format. The file won't be overwritten by LF Aligner (it only creates new dictionaries using the data in the raw folder when there isn't one).
LF Aligner generates the .dic files on demand as they are needed, i.e. when you first run an alignment in the language combination in question. Quoting info.txt: "Dictionaries for each language pair are generated automatically as they are needed - there are close to 1000 language combinations, so providing premade dictionaries for all pairs would take up too much space."
It appears that you ran an English-French alignment as early as 15 January, but you only did your first English-Portuguese project on 16 February.
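
For anyone adding names in bulk, the format quoted above can be scripted. What follows is a minimal Python sketch, not part of LF Aligner or hunalign. It assumes the .dic file is UTF-8 text with one "target @ source" item per line, as the quoted documentation describes, and that in en-fr.dic English is the source and French the target. The function names and the sample name pairs are made up for illustration.

```python
from pathlib import Path


def dic_lines(pairs):
    """Turn (source, target) pairs into .dic items, target phrase first."""
    return [f"{target} @ {source}" for source, target in pairs]


def append_to_dic(dic_path, pairs, encoding="utf-8"):
    """Append items to a .dic file, skipping lines it already contains."""
    path = Path(dic_path)
    text = path.read_text(encoding=encoding) if path.exists() else ""
    existing = set(text.splitlines())
    new_items = [item for item in dic_lines(pairs) if item not in existing]
    if new_items:
        # Make sure the first new item starts on its own line.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with path.open("a", encoding=encoding, newline="\n") as handle:
            handle.write(prefix + "\n".join(new_items) + "\n")
    return new_items


# Hypothetical English (source) / French (target) character names.
names = [("Captain Hook", "Capitaine Crochet"), ("Wendy", "Wendy")]
print("\n".join(dic_lines(names)))
```

Calling append_to_dic with the path to en-fr.dic and the same list would add only the items that are not in the file yet. Keep a backup copy of the .dic file before trying it.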

If you're aligning books, yes, adding names might help. If the names are exactly the same in both languages, hunalign might pick them up and use them as anchors anyway, but I'm not sure about that. Either way, adding them can't hurt.
Report back about any difference you notice when fiddling with the dictionaries. I never really tested what impact they have.

[Edited at 2013-02-21 21:02 GMT]

KylaR

No more questions!

Feb 22, 2013

Lol! Don't worry about the formula. I'll use the previous version, there probably wasn't that much loss anyway.

And thank you for the explanations about the dictionaries. I decided to delete the en-fr.dic and edit the raw en.txt and fr.txt. I put my character names at the end of both files, and it worked well.

I haven't noticed much difference in the results so far, but the alignment was almost perfect already, so... If I notice significant differences with other texts,...

Once again: thanks a lot!

[Edited at 2013-02-22 23:43 GMT]

Piotr Bienkowski

Poland
English to Polish

Segmentation rules, exceptions?

Apr 16, 2013

I am really impressed with what the aligner can do for me, but...

I have to repeat my question from the 1st page of this thread. Does the aligner use any segmentation rules that can be configured?

I have a list of exceptions where segments in Polish should not be broken (text should be kept together) and I would like to add them to the aligner to get better results.

Can this be done in the LF Aligner setup?

I would appreciate your help,

Piotr

FarkasAndras

English to Hungarian

TOPIC STARTER

Segmentation

Apr 16, 2013

Yes, you can improve the segmentation to a certain extent.
Read aligner\scripts\sentence_splitter\README.doc, then based on the information there, you can edit
aligner\scripts\sentence_splitter\nonbreaking_prefixes\nonbreaking_prefix.pl

README.doc is the readme file of the sentence segmenter from the europarl project, which is what LF Aligner uses (i.e. the sentence segmenter is not my creation). The part of the readme that you need to look at is the bit about the 'Nonbreaking Prefixes Directory' (ignore the stuff about the tokenizer).

If you edit nonbreaking_prefix.pl, please send the improved version back to me so I can include it in the next release.
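
A small script can take care of merging a list of exceptions into the prefix file. This is only a sketch in Python. It assumes the file follows the convention of the Europarl/Moses splitter (one prefix per line without its trailing period, lines starting with # are comments), which should be checked against README.doc. The function name and the sample Polish abbreviations are illustrative.

```python
def add_prefixes(existing_text, new_prefixes):
    """Return the prefix file content with any missing prefixes appended."""
    lines = existing_text.splitlines()
    # Entries already listed; comment lines (starting with #) are ignored.
    known = {ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")}
    for prefix in new_prefixes:
        entry = prefix.strip().rstrip(".")  # assumed: stored without the period
        if entry and entry not in known:
            lines.append(entry)
            known.add(entry)
    return "\n".join(lines) + "\n"


# Hypothetical starting content and a few Polish abbreviations to add.
current = "# Polish nonbreaking prefixes\nnp\n"
print(add_prefixes(current, ["np.", "tzw.", "ul."]), end="")
```

To apply it, read nonbreaking_prefix.pl into a string, pass it through add_prefixes, and write the result back in the same encoding as the original file.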

If you really want to, you can also try and use the sentence segmenter of your CAT tool. E.g. if you use Trados, you can import the files one by one as translatable documents, then export them into Excel with the sdlxliff converter for MS office. Copy-paste the appropriate column into a txt file, then run the aligner on the txt with paragraph segmentation.

PSA:
As of version 4.0, LF Aligner has a graphical interface for manually correcting alignment. Update if you haven't already.

[Edited at 2013-04-16 11:50 GMT]

esperantisto

Member (2006)
English to Russian

SITE LOCALIZER

Doesn’t work for me

Apr 16, 2013

FarkasAndras wrote:
As of version 4.0, LF Aligner has a graphical interface for manually correcting alignment.

I tried using the GUI with LFA 4.0 on Windows 7. Did not work.
1. Launched LF_aligner_4.0.exe from D:\Program Files\aligner4\.
2. Selected the first option for text, RTF, doc or docx files.
3. Selected English and Russian for source and target languages.
4. Selected two plain-text files in UTF-8 for source and target.
5. Confirmed segmentation.
6. Confirmed "Use the graphical editor".
7. Next, only "Do you want to generate a TMX file?" appeared. I confirmed.
8. I entered EN-US and RU-RU for language codes, unticked Note.
9. Pressed Next and got only a blank window.

Here’s what I see in the command line (some strings are translated from Russian, the system default language):

LF Aligner 4.0
OS detected: Windows
Sentence Splitter v3
Language: en
Sentence Splitter v3
Language: ru
Reading dictionary...
cygwin warning:
MS-DOS style path detected: D:\Program Files\aligner4\scripts\hunalign\data\
-ru.dic
Preferred POSIX equivalent is: /hunalign/data/en-ru.dic
CYGWIN environment variable option "nodosfilewarning" turns off this warning
Consult the user's guide for more details about POSIX paths:
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
109 source language sentences read.
103 target language sentences read.
quasiglobal_stopwordRemoval is set to 0
Simplified dictionary ready.
Rough translation ready.
0 100
Rough translation-based similarity matrix ready.
Matrix built.
Trail found.
Align ready.
Global quality of unfiltered align 0.469747
quasiglobal_spaceOutBySentenceLength is set to 1
Trail spaced out by sentence length.
Global quality of unfiltered align after realign 0.469747
Quality 0.469747
"D:\Program" is not internal or external
command, executable program or batch file.
error:Can't locate Tk/Popup.pm in @INC (@INC contains: C:\Users\user\A
Data\Local\Temp\par-476162696e\cache-5af9552f01bc75f14d144cd9befa80adc07170d9\
c\lib C:\Users\user\AppData\Local\Temp\par-476162696e\cache-5af9552f01b
5f14d144cd9befa80adc07170d9\inc CODE(0x2fa6afc) CODE(0x358f1a4)) at Tk/Widget.
line 270.

Tk::Error: Can't locate Tk/Popup.pm in @INC (@INC contains: C:\Users\user\AppData\Local\Temp\par-476162696e\cache-5af9552f01bc75f14d144cd9befa80adc0717
9\inc\lib C:\Users\user\AppData\Local\Temp\par-476162696e\cache-5af9552
1bc75f14d144cd9befa80adc07170d9\inc CODE(0x2fa6afc) CODE(0x358f1a4)) at Tk/Wid
t.pm line 270.
Tk callback for .dialog.top
Tk callback for .dialog.bottom
Tk callback for .dialog.bottom.frame
Tk::Widget::_AutoloadTkWidget at Tk/Widget.pm line 268
Tk::Widget::AUTOLOAD at Tk/Widget.pm line 338
Tk::DialogBox::Show at Tk/DialogBox.pm line 117
LFA_GUI::__ANON__ at LFA_GUI.pm line 930
Tk::After::repeat at Tk/After.pm line 80
[repeat,[{},after#5502,50,repeat,[\&LFA_GUI::__ANON__]]]
("after" script)

FarkasAndras

English to Hungarian

TOPIC STARTER

Fixed already

Apr 16, 2013

That looks like a bug that was fixed in 4.01 (4.0 didn't support paths with spaces in them). Download the current version (4.04) and the problem should go away.

BTW I see you have the program in the Program Files folder. If you get errors that say something like "No permission to write file", you'll have to move the program to some other folder. AFAIK Windows has pretty strict limitations on who and what can write to Program Files.

Piotr Bienkowski

Poland
English to Polish

You can still use it

Apr 16, 2013

You can start the GUI editor separately and load the tabbed txt file that LF Aligner generates.

HTH

Piotr

esperantisto

Member (2006)
English to Russian

SITE LOCALIZER

It’s only…

Apr 18, 2013

FarkasAndras wrote:

Windows has pretty strict limitations on who and what can write to Program Files.

…for the C: partition. Yeah, Windows is quite a dumb OS; its limitations can easily be bypassed in most cases.

esperantisto

Member (2006)
English to Russian

SITE LOCALIZER

Indeed, fixed

Apr 18, 2013

FarkasAndras wrote:

That looks like a bug that was fixed in 4.01 (4.0 didn't support paths with spaces in them). Download the current version (4.04) and the problem should go away.

I’m happy to confirm that the issue is really fixed.

Piotr Bienkowski

Poland
English to Polish

Comparison with earlier alignment?

Apr 23, 2013

I have recently been doing alignment of similar files in LF Aligner.

It would be nice to be able to compare the current Excel file with the previous TMX to be able to see if some work can be skipped and only new rows saved into a new TMX file.

Or maybe a similar feature or solution already exists? I would appreciate your feedback.
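
Until something like this exists in the program, the comparison could be scripted outside it. A rough Python sketch follows, under assumptions that are not confirmed anywhere in this thread: the earlier TMX holds two <tuv> segments per <tu> in source-target order, and the current alignment is available as tab-separated text with source and target in the first two columns. The function names and sample data are invented for the example.

```python
import xml.etree.ElementTree as ET


def tmx_pairs(tmx_text):
    """Collect (source, target) segment pairs from TMX content."""
    root = ET.fromstring(tmx_text)
    pairs = set()
    for tu in root.iter("tu"):
        segs = ["".join(seg.itertext()).strip() for seg in tu.iter("seg")]
        if len(segs) >= 2:
            pairs.add((segs[0], segs[1]))
    return pairs


def new_rows(tab_text, known_pairs):
    """Keep only the rows whose source/target pair is not in the old TMX."""
    rows = []
    for line in tab_text.splitlines():
        cols = [col.strip() for col in line.split("\t")]
        if len(cols) >= 2 and (cols[0], cols[1]) not in known_pairs:
            rows.append((cols[0], cols[1]))
    return rows


# Tiny made-up TMX and alignment, just to show the comparison.
old_tmx = (
    "<tmx version='1.4'><header/><body><tu>"
    "<tuv xml:lang='EN-US'><seg>Fuel temperature sensor</seg></tuv>"
    "<tuv xml:lang='PL-PL'><seg>Czujnik temperatury paliwa</seg></tuv>"
    "</tu></body></tmx>"
)
current = (
    "Fuel temperature sensor\tCzujnik temperatury paliwa\n"
    "Crank angle sensor\tCzujnik kąta obrotu wału korbowego\n"
)
for source, target in new_rows(current, tmx_pairs(old_tmx)):
    print(source, "|", target)
```

The match here is exact after trimming whitespace, so a segment that differs by a single character counts as new. For a TMX file on disk, ET.parse(path).getroot() can replace ET.fromstring.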

Regards,

Piotr

Piotr Bienkowski

Poland
English to Polish

The other way (segmentation)

Apr 23, 2013

FarkasAndras wrote:

Yes, you can improve the segmentation to a certain extent.

Can I also ADD additional rules for where segments SHOULD be broken?

Regards,

Piotr

FarkasAndras

English to Hungarian

TOPIC STARTER

add rules

Apr 23, 2013

Not without editing the Perl code that does the segmentation. It'd probably be easier to do it yourself, using some regex approach (Perl, sed, Notepad++). I think you could intercept the txt files when LF Aligner asks you whether you're happy with the results of the segmentation. Just insert line breaks wherever you see fit, close and save the files and proceed with the alignment.
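
As an illustration of the regex approach, here is a short Python sketch that is not part of LF Aligner. It assumes the segmented text has been read into a string and that each extra rule can be written as a regular expression matching the text a segment should end with. The function name, the two sample rules and the sample sentence are hypothetical.

```python
import re


def add_breaks(text, patterns):
    """Start a new line after every match of each pattern."""
    for pattern in patterns:
        # Keep the matched text and replace the spaces or tabs that follow
        # it with a single line break.
        text = re.sub(f"({pattern})[ \\t]+", lambda m: m.group(1) + "\n", text)
    return text


# Hypothetical rules: also break after semicolons and colons.
sample = "Check the harness; replace the fuse: see page 4."
print(add_breaks(sample, [r";", r":"]))
```

Since every line break is treated as a segment delimiter, saving the result over the segmented txt file before continuing should carry the extra breaks into the alignment.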

Piotr Bienkowski

Poland
English to Polish

Option to prevent joining lines in txt files?

Apr 24, 2013

FarkasAndras wrote:

Not without editing the Perl code that does the segmentation. It'd probably be easier to do it yourself, using some regex approach (Perl, sed, Notepad++). I think you could intercept the txt files when LF Aligner asks you whether you're happy with the results of the segmentation. Just insert line breaks wherever you see fit, close and save the files and proceed with the alignment.

So I will rephrase my question. Is there an option I overlooked to prevent LF Aligner from running together lines in text files? I already do some of the preprocessing you mentioned on text files before I feed them to LF Aligner. But the aligner sometimes defeats my work by running the lines together. Here is an extreme, but real example (happened today).

Special Tool Check Harness Connector Terminal Arrangement B B B 5V Battery current sensor Engine-ECU Current (A) Output voltage (V) DischargeCharge Equipment side connector Air flow sensor (incorporating No.1 intake air temperature sensor) 8 Manifold absolute pressure sensor 7 No. 2 intake air temperature sensor 4 EGR valve (DC motor) Fuel temperature sensor 11 EGR valve position sensor 2 Crank angle sensor 15 Exhaust differential pressure sensor EGR cooler bypass control solenoid valve

If the option I mentioned does not exist, please consider this an enhancement request.

Regards,

Piotr

FarkasAndras

English to Hungarian

TOPIC STARTER

Line breaks

Apr 25, 2013

To the best of my knowledge, LF Aligner never merges lines*. I.e. every line break is a segment delimiter. Of course hunalign merges segments as it sees fit (merge several segments in one language and pair them up with one longer segment in the other language), but that's a different matter.
When LF Aligner asks you whether you want to revert to the unsegmented file versions, you can open the XXXXX_seg.txt files to see how the segmentation went.
If you're seeing merged lines, send me...
