https://www.proz.com/forum/cat_tools_technical_help/184708-new_free_open_source_aligner_for_windows_os_x_and_linux-page5.html

New free & open source aligner (for Windows, OS X and linux)
Thread poster: FarkasAndras
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:37
English to Hungarian
+ ...
TOPIC STARTER
Segment numbers Feb 10, 2013

That discrepancy is normal. What you see is:

2929/3033 -> 2786 (unsegmented)
and
6413/6770 -> 6223 (segmented)

This relatively small (under 10%) size difference is due to the way Hunalign works. In order to bring the two texts into sync, it merges segments (e.g. if a single English sentence was split into two sentences by the English translator, Hunalign merges them back together). It never splits segments because that would mean that sentences would be split at arbitrary points. So as segments are merged, the texts become "shorter".
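
For anyone who wants to check the figures, here is a small Python sketch of the arithmetic. The counts are the ones quoted above; the labels "side A" and "side B" are placeholders, since the post does not say which number belongs to which language.

```python
# Segment counts quoted in the post: (side A, side B, aligned rows).
# "side A" / "side B" are placeholder labels, not names used by the aligner.
runs = {
    "unsegmented": (2929, 3033, 2786),
    "segmented": (6413, 6770, 6223),
}


def shrink_percent(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


for name, (side_a, side_b, aligned) in runs.items():
    print(
        f"{name}: side A shrinks by {shrink_percent(side_a, aligned):.1f}%, "
        f"side B by {shrink_percent(side_b, aligned):.1f}%"
    )
```

Every one of the four reductions stays below 10%, which fits the explanation that segments are only ever merged, never dropped.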

[Edited at 2013-02-10 20:03 GMT]


 
KylaR
KylaR
Local time: 14:37
UTF-8 no BOM Feb 10, 2013

Thanks for those explanations about the segment numbers ! Glad to know nothing is missing from the source files, that was my fear.

And thanks for the right syntax ! That works much better indeed ! I feel dumb for not guessing that was the way to do it ! ^^

There's just a small issue with the outfile : it is in UTF-8 no BOM, and when opening it in Excel, the accented characters are all wrong.
The individual result files (aligned_HP01ENG-HP01FRE.txt, etc.) are in UTF-8 BOM, and the individual Excel files are fine.
How can I fix that ?


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:37
English to Hungarian
+ ...
TOPIC STARTER
BOM Feb 10, 2013

Well, there are dozens of options. A good text editor like Notepad++ can save files with or without a BOM, you just have to pick. Or you can open a UTF8-BOM file in any text editor, delete whatever is in it and paste the contents of the noBOM file in it. Or just open and save the noBOM file with Notepad, IIRC it adds a BOM to all UTF8 files if they don't have one. Or just open the txt in a text editor and copy-paste the contents to Excel instead of opening the file with Excel.
BTW the reason why the outfile doesn't have a BOM is that it is written in "append" mode, i.e. in each alignment, new lines are added to it without deleting the previous contents. (Otherwise, it would only contain the segments of the last file pair instead of all of them.) So if I were to write it with a BOM, there would be rogue BOMs scattered all over the file (6, in your case).
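
If a script is more convenient than a text editor, the BOM can also be added after the fact. The following is a minimal Python sketch, not part of LF Aligner; the function name `add_bom` is made up for illustration. It prepends the three-byte UTF-8 BOM once, and only if the file does not already start with it.

```python
import codecs
from pathlib import Path


def add_bom(path: str) -> bool:
    """Prepend a UTF-8 BOM to the file unless it already starts with one.

    Returns True if the file was changed, False if it already had a BOM.
    """
    target = Path(path)
    data = target.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    target.write_bytes(codecs.BOM_UTF8 + data)
    return True


# Example call (the path is hypothetical):
# add_bom(r"C:\test\outfile.txt")
```

Run it once after the whole batch has finished, so that the single BOM ends up at the very start of the combined outfile.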

[Edited at 2013-02-10 20:47 GMT]


 
KylaR
KylaR
Local time: 14:37
Thanks Feb 10, 2013

Thanks ! I'm going to try that.

 
KylaR
KylaR
Local time: 14:37
Any way to generate the BAT ? Feb 16, 2013

Hello,

Me again !

Do you have any suggestions to write the BAT file faster ? At some point, I'm gonna need to align thousands of pairs ! And they always have symmetrical names, like File07ENG and File07FRE.

(honestly, I'd love it if the program was able to automagically guess that file A goes with file B, and to process the entire folder with as few clicks / as little batch editing as possible... But that would probably be a bit more complicated ! ;o)


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:37
English to Hungarian
+ ...
TOPIC STARTER
generate BAT Feb 16, 2013

KylaR wrote:

Hello,

Me again !

Do you have any suggestions to write the BAT file faster ? At some point, I'm gonna need to align thousands of pairs ! And they always have symmetrical names, like File07ENG and File07FRE.

(honestly, I'd love it if the program was able to automagically guess that file A goes with file B, and to process the entire folder with as few clicks / as little batch editing as possible... But that would probably be a bit more complicated ! ;o)

The BAT can be generated rather easily in Excel, or in a text editor using regex. That's what I normally do, and I sometimes align upwards of 100,000 file pairs.
Here's how I usually do it:
Open your folder in Total Commander. Select all EN files. Click Mark/Copy names with path to clipboard. Open Excel and paste the name list into column A. Repeat with FR and column B. Make sure file names are correctly paired. Into C1, write something like the following:
="LF_aligner_3.11.exe -f=t -l=en,fr -s=y -r=xn -t=n -i="&A1&","&B1

As you can see, the syntax for Excel cells is pretty simple: start with =, use & between elements, put text in double quotes and reference cells simply with their letter+number code.

Then copy C1 down along column C, then select and copy column C and paste it into a txt file. This should work if there are no spaces in any of the file names or paths. If there are, you need to add double quotes around file paths. You can't just type a double quote inside the quoted text, but you can write a double quote in D1 and put &d1& wherever you need a double quote.


I guess I could write a feature that autodetects file pairs, but then you'd have to name your files somefilename_en.doc and somefilename_fr.doc, i.e. have the exact same file name in each pair, with the language code tacked on. Probably not worth bothering with. I do plan to write a graphical user interface for the batch aligner, which will probably include something to make this easy.
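
For those who would rather script this than go through Excel, here is a rough Python sketch of the same idea. It is not a feature of LF Aligner: the helper names and the ENG/FRE suffix convention are assumptions based on the file names mentioned in this thread, and the command prefix is simply the one from the Excel formula above.

```python
from pathlib import PureWindowsPath

# Command prefix taken from the Excel formula above; adjust options as needed.
PREFIX = "LF_aligner_3.11.exe -f=t -l=en,fr -s=y -r=xn -t=n"


def pair_files(names, src_tag="ENG", tgt_tag="FRE"):
    """Pair names like File07ENG.txt / File07FRE.txt by their shared prefix.

    Files without a partner, or without either tag, are silently skipped.
    """
    by_key = {}
    for name in names:
        # Full path without the extension, e.g. C:\test\File07ENG
        base = str(PureWindowsPath(name).with_suffix(""))
        for tag in (src_tag, tgt_tag):
            if base.endswith(tag):
                by_key.setdefault(base[: -len(tag)], {})[tag] = name
                break
    return [
        (found[src_tag], found[tgt_tag])
        for _, found in sorted(by_key.items())
        if src_tag in found and tgt_tag in found
    ]


def bat_lines(pairs):
    """One aligner call per pair, with both paths wrapped in double quotes."""
    return [f'{PREFIX} -i="{src}","{tgt}"' for src, tgt in pairs]


demo = ["File07ENG.txt", "File07FRE.txt", "My File08ENG.txt", "My File08FRE.txt"]
for line in bat_lines(pair_files(demo)):
    print(line)
```

The printed lines can be pasted into the BAT file. Because every path is quoted, spaces in file or folder names are not a problem.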

[Edited at 2013-02-16 19:54 GMT]

[Edited at 2013-02-16 20:09 GMT]


 
KylaR
KylaR
Local time: 14:37
Awesome ! Feb 16, 2013

Okay, I just tried it writing the following :

="LF_aligner_3.11.exe -f=t -l=en,fr -s=n -r=xn -t=n -o=C:\test\outfile.txt -i="&A11&","&B11

And it worked great ! Thanks a lot.

(I'll be back soon with other questions though ! ;p)


 
KylaR
KylaR
Local time: 14:37
I'm a total Excel noob... Feb 16, 2013

I'm back !

First, it turns out I do need double quotes for some files... And I have no idea how to use that &d1& bit you mentioned ! I tried writing -i="&D1&""&A1&""&D1&","&D1&""&B1&""&D1 ... Yeah, obviously I have no idea how this works ! I didn't even know you could do that with Excel. I tried googling, but... I don't even know what this is called, what you're doing... ;o

Can you tell me what the exact formula would be ?

And my second and last question for tonight : do you know if there is any way to get rid of all the blank cells ? I do not want to get rid of the full row, though. In the following image, I'd like to merge A2 and A3, B2 and B3, then A4 and A5, B4 and B5.



Is there any way to do that automatically on the whole file ?

Thanks a lot...


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:37
English to Hungarian
+ ...
TOPIC STARTER
Double quotes Feb 17, 2013

Well, the double quote thing is because Excel uses double quotes as part of its syntax, so you can't just type them in (Excel will assume you're quoting text). Therefore, you take a cell that you're not using for anything else, such as D1, and write a double quote character in it. Then when you need a double quote, you just tell Excel to insert the contents of D1, which consist of a double quote. There are many other ways of getting double quotes around the filenames but this one is fairly simple if you're using Excel to generate the commands already.
Note: you need to use $d$1 to tell Excel not to use D2, D3, D4 etc. as the formula is copied down the column.
So assuming that you wrote a " in D1, your command would be:

="LF_aligner_3.11.exe -f=t -l=en,fr -s=n -r=xn -t=n -o=C:\test\outfile.txt -i="&$d$1&A1&$d$1&","&$d$1&B1&$d$1

This way all file names are quoted and it's not a problem if the file or folder names contain spaces.


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:37
English to Hungarian
+ ...
TOPIC STARTER
Empties Feb 17, 2013

KylaR wrote:

And my second and last question for tonight : do you know if there is any way to get rid of all the blank cells ? I do not want to get rid of the full row, though. In the following image, I'd like to merge A2 and A3, B2 and B3, then A4 and A5, B4 and B5.



Is there any way to do that automatically on the whole file ?

Thanks a lot...

When you generate a TMX (with the latest version), those rows are skipped by default (i.e. any text that is not paired up with text in the other language is left out). They are left in the txt and xls files, though. Automatically merging all these cells is not currently supported. Maybe I'll add that feature sometime. It could be useful in some situations.

In the meantime, this can be done with Excel and a text editor, but it's not simple. (Warning! More Excel functions follow!)
Write some random text into A1 and B1 so they are not empty. Put your text in A and B from row 2. Then write this in C2*:
=IF(OR($A1="",$B1=""),A2,IF(OR($A2="",$B2=""),"",IF(OR($A3="",$B3=""),A2&" "&A3,A2)))
Copy the function to the entire C and D column from row 2 down. You should see your segments merged as needed. It will fix single blanks only, i.e. if there are empty cells in two consecutive rows, it won't fix that. Copy and paste the content of C and D from row 2 down into a text editor. If certain cells were filled with a zero, remove the zero with find and replace. Remove empty lines (Notepad++ has a menu item that does this). Delete the content of A and B from row 2 down. Paste text back to A2 from the text editor. Rinse and repeat as many times as necessary.


* Excel functions are localized. This is for English Excel; in other versions, you need to use the local equivalents of IF and OR, and possibly ; instead of ,.
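
As an alternative to the spreadsheet round trip, the same merge can be sketched in a few lines of Python. This is an illustration only, not something the aligner does: `merge_unpaired` is a made-up name, and the input is assumed to be a list of (source, target) pairs already read from the tab-separated file. Like the formula, it folds a row with an empty cell into the row above it. Unlike the formula, it also handles several such rows in a row in one pass.

```python
def merge_unpaired(rows):
    """Fold every row that has an empty cell into the preceding row.

    `rows` is a list of (source, target) string pairs. An incomplete row
    at the very top has nothing to merge into and is kept as it is.
    """
    merged = []
    for src, tgt in rows:
        incomplete = src == "" or tgt == ""
        if incomplete and merged:
            prev_src, prev_tgt = merged[-1]
            # Join with a space, skipping empty parts to avoid stray spaces.
            merged[-1] = (
                " ".join(part for part in (prev_src, src) if part),
                " ".join(part for part in (prev_tgt, tgt) if part),
            )
        else:
            merged.append((src, tgt))
    return merged


sample = [
    ("Hello.", "Bonjour."),
    ("How are", "Comment allez-vous ?"),
    ("you?", ""),
    ("Bye.", "Au revoir."),
]
for src, tgt in merge_unpaired(sample):
    print(src, "|", tgt)
```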


[Edited at 2013-02-17 11:31 GMT]


 
KylaR
KylaR
Local time: 14:37
Trouble with pasting stuff into Excel Feb 17, 2013

Thank you so much for taking the time, Farkas!

Double quotes: that works great!

Getting rid of the empties: so those are called "functions" then... Don't laugh! I'm completely ignorant in all things Excel. ;p I'm gonna need to learn more about this.

Anyway, the function in French goes:
=SI(OU($A1="";$B1="");A2;SI(OU($A2="";$B2="");"";SI(OU($A3="";$B3="");A2&" "&A3;A2)))

And it works just like it should.

There were no blank lines; just lines with tabulations on them, so I used Word to get rid of those (although I could have used Trim whitespace before Delete blank lines and that would have done the trick, I guess).

I ran into a problem when copying the text from the text editor back to Excel, though. Which brings me back to one of my earlier questions:

>>There's just a small issue with the outfile : it is in UTF-8 no BOM, and when opening it in Excel, the accented characters are all wrong. The individual result files (aligned_HP01ENG-HP01FRE.txt, etc.) are in UTF-8 BOM, and the individual Excel files are fine. How can I fix that ?
>Well, there are dozens of options.
>A good text editor like Notepad++ can save files with or without a BOM, you just have to pick.
>Or you can open a UTF8-BOM file in any text editor, delete whatever is in it and paste the contents of the noBOM file in it.
>Or just open and save the noBOM file with Notepad, IIRC it adds a BOM to all UTF8 files if they don't have one.
>Or just open the txt in a text editor and copy-paste the contents to Excel instead of opening the file with Excel.

=> I tried all four options and couldn't really work out the problem so far.

The first three still result in crappy characters:


(I realized just now that when trying to open the individual TXT result files in UTF-8 BOM in Excel, they show crappy characters too. So nothing to do with the BOM or no BOM, I guess. ^^)

Now, the fourth option has clean characters... But some rows end up all in one cell:

This is what I have in the individual result file. Seems normal:


Now, when I copy/paste the whole outfile from my text editor in Excel, this is what those cells look like:


I see nothing in those three rows that would explain it...

Here's a second example, maybe that'll help troubleshooting this (although I would really understand if you don't have the time...):

Individual result file:


When I copy/paste the whole outfile from my text editor in Excel:


Any ideas?


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:37
English to Hungarian
+ ...
TOPIC STARTER
Excel being mischievous Feb 18, 2013

Both those things look like Windows/Excel bugs. You'll have to experiment to find a way around them.

I've had the second one myself. I don't really know what causes it, but sometimes line breaks and tabs get lost in Excel. Really odd and really annoying. I think I only got that when pasting from a text editor, not from MS Word. So you could paste the text back to MS Word and copy-paste it to Excel from there.

The first one is some encoding bug. Again, it shouldn't happen. I've never had encoding issues when pasting text into Excel from a text editor. You could try saving the txt file with the text editor and opening it with excel. Then, if you saved the file in UTF-8, which is what I'd recommend, pick UTF-8 in the drop-down list in the file opening dialog.


 
KylaR
KylaR
Local time: 14:37
Will test. Feb 19, 2013

Thanks, Farkas.

Those are interesting ideas. I'm not done testing, but I'll be sure to report back !


 
KylaR
KylaR
Local time: 14:37
Worked out both bugs I think ; problem with the merge function though Feb 21, 2013

I am back !

Pasting the text to Word and copy-pasting it back to Excel didn't work for me; I had the same "some rows end up all in one cell" problem.

When opening the TXT file or copy/pasting text into LibreOffice Calc, I had the same problem, only much worse (many more rows were affected).

Opening a UTF-8 outfile in Excel : my old Excel XP doesn't have an option for UTF-8 in the stupid drop-down:


However, I finally understood what caused the "full row in one cell" problem: in the import wizard, you have to specify "no text identifier" (the setting English Excel calls the text qualifier):

(disregard the pipe on the screencap; that was just a test)
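
To illustrate why that setting matters, here is a small Python sketch. It is only an analogy using Python's `csv` module, not a description of how Excel works internally, and the sample text is invented. When the double quote is treated as a text qualifier, a segment that opens a quotation swallows the tabs and line breaks that follow until the quote is closed. With quoting switched off, every tab and line break counts again.

```python
import csv
import io

# Two tab-separated rows; the quotation opens in row 1 and closes in row 2,
# as can happen with dialogue in aligned fiction. The text is made up.
sample = '"I don\'t know,\tJe ne sais pas,\nhe said."\tdit-il.\n'


def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))


with_qualifier = read_rows(sample, csv.QUOTE_MINIMAL)  # quote acts as qualifier
no_qualifier = read_rows(sample, csv.QUOTE_NONE)       # quote is an ordinary character

print(len(with_qualifier), "row(s) with a text qualifier")
print(len(no_qualifier), "row(s) without one")
```

With the qualifier active, both rows collapse into a single row whose first cell holds almost everything. Without it, the two rows and their two columns come through intact.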

So, what I did was : I opened the TXT outfile with LibO. Their import wizard is much prettier :


I tested the merge function, and it seemed to work just as well as in Excel. However, I'm even less familiar with Calc than I am with Excel, and had no idea how to reach the next blank cell, which made it a bit hard to check the whole file... So...

I saved the file as Excel XP .xls, then opened it in Excel ; at this point, I was expecting encoding issues, but oddly, there were none !

Then, like you explained, I left the English, French and name of files in columns A B C, and added formulas to columns D E F :
=SI(OU($A1="";$B1="");A2;SI(OU($A2="";$B2="");"";SI(OU($A3="";$B3="");A2&" "&A3;A2)))
=SI(OU($A1="";$B1="");B2;SI(OU($A2="";$B2="");"";SI(OU($A3="";$B3="");B2&" "&B3;B2)))
=SI(OU($A1="";$B1="");C2;SI(OU($A2="";$B2="");"";SI(OU($A3="";$B3="");C2&" "&C3;C2)))
(I hope the fact I added a third merged column in the mix doesn't change anything? Otherwise, as I had fewer and fewer rows, the origin of the segments was not accurately reflected anymore.)

Then I pasted the content of columns D E F in a text editor, cleaned the zeros and the blank lines (and btw : hitting Trim whitespace, as I had mentioned earlier, was NOT a good idea as it would have removed some very necessary tabulations! Oops!)

And then... I saved the TXT as UTF-8 ; opened it in LibO, saved it as Excel XP .xls ; and again, opened with Excel ; etc. !

Phew ! Glad I got that sorted out !

***

And now, for the bad news... In the process, I noticed that some bits disappeared instead of being merged. I think it only happens when there's a bad row, a good row and again a bad row, like this :


Here, A4 is merged into D3, but instead of being merged into D5, A6 disappears !

I don't think there are too many cases of this, so if it is the way it is, I can deal. But if you have ideas on how to fix it, I'm all ears !

[Edited at 2013-02-21 08:42 GMT]


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:37
English to Hungarian
+ ...
TOPIC STARTER
edge case Feb 21, 2013

KylaR wrote:


And now, for the bad news... In the process, I noticed that some bits disappeared instead of being merged. I think it only happens when there's a bad row, a good row and again a bad row, like this:
...pic...
Here, A4 is merged into D3, but instead of being merged into D5, A6 disappears !

I don't think there are too many cases of this, so if it is the way it is, I can deal. But if you have ideas on how to fix it, I'm all ears !


You're right, the function didn't account for the case where there's one line with an empty cell, one line with text in both cells, and then one line with an empty cell.
This should fix it:

=SI(OU($A1="";$B1="");SI(OU($A3="";$B3="");A2&" "&A3;A2);SI(OU($A2="";$B2="");"";SI(OU($A3="";$B3="");A2&" "&A3;A2)))
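
To see the edge case without a spreadsheet, the logic of the two formulas can be transcribed into Python. This is a sketch with made-up function names. It covers column A only, and cells outside the data are treated as empty, as blank Excel cells would be.

```python
def get(rows, r):
    """Return (A, B) for 1-based row r; blank cells outside the data."""
    return rows[r - 1] if 1 <= r <= len(rows) else ("", "")


def is_bad(rows, r):
    """Mirror of OR(A="", B=""): the row has an empty cell."""
    a, b = get(rows, r)
    return a == "" or b == ""


def original(rows, r):
    """The first merge formula, evaluated for row r."""
    a2, a3 = get(rows, r)[0], get(rows, r + 1)[0]
    if is_bad(rows, r - 1):
        return a2
    if is_bad(rows, r):
        return ""
    return a2 + " " + a3 if is_bad(rows, r + 1) else a2


def fixed(rows, r):
    """The corrected formula: also looks ahead when the row above is bad."""
    a2, a3 = get(rows, r)[0], get(rows, r + 1)[0]
    if is_bad(rows, r - 1):
        return a2 + " " + a3 if is_bad(rows, r + 1) else a2
    if is_bad(rows, r):
        return ""
    return a2 + " " + a3 if is_bad(rows, r + 1) else a2


# Row 1 is the dummy row; rows 4 and 6 are unpaired (bad, good, bad).
sheet = [("x", "x"), ("a2", "b2"), ("a3", "b3"), ("a4", ""),
         ("a5", "b5"), ("a6", ""), ("a7", "b7"), ("a8", "b8")]

print([original(sheet, r) for r in range(2, 8)])
print([fixed(sheet, r) for r in range(2, 8)])
```

In the first list "a6" is missing altogether; in the second it is attached to "a5", which is the behaviour the corrected formula is meant to give.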


 