website alignment – is it possible, and what tools do you use?
Thread poster: TSDM
TSDM
TSDM
Russian Federation
Local time: 02:17
Russian to English
Mar 13, 2008

Are there any tools that can batch align bi-lingual websites? (With the aim of creating themed, or client specific TMs)

It's quite easy to download full websites, and this would be a great way to make translation memories.

I found a description of a Linux tool in development:
"bitextor – Builds parallel text corpora from webpages. Uses websites as the source of text. Analyzes webpage text for bitexts. Presently works with es, ca, gl, pt, and en languages. Can easily be extended to support new languages."

I assume a tool like this would produce two aligned documents that could be easily turned into a translation memory using the CAT tool of your choice.
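To make that last step concrete, here is a minimal Python sketch of turning aligned segments into a TM. It assumes the aligner has already produced a list of (source, target) pairs; the function name `pairs_to_tmx` and the tool name in the header are made up for this example. The output is TMX, the exchange format that most CAT tools can import.

```python
import xml.etree.ElementTree as ET


def pairs_to_tmx(pairs, src_lang, tgt_lang):
    """Build a minimal TMX 1.4 document from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX; the tool name is a placeholder.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        # One translation unit per aligned pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(pairs_to_tmx([("Привет", "Hello")], "ru", "en"))
```

The hard part is still getting trustworthy pairs in the first place; writing them out is the easy bit.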

Is there anything out there commercially available for Windows or Mac OS X that does this? I've spent easily half a day searching and trying different CAT products, but haven't found anything close to satisfactory.


 
Wolfgang Jörissen
Wolfgang Jörissen  Identity Verified
Belize
Dutch to German
+ ...
Interesting approach, but I'm afraid not Mar 14, 2008

The problem with websites could be the different technologies used. Think of CMSs, Java applets, Flash, etc. And if it is not the technology, it will certainly be the structure, which can differ from one website to the next. Creating a tool that copes with all of that would be a _very_ tough challenge. However, you might want to use one of those grabbers that download all pages of a website to your hard disk (I used HTTP Weazel years ago, and it did a good job), and then check for alignable material. Not fully automated, but at least a step in the right direction.


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 01:17
Member (2006)
English to Afrikaans
+ ...
I wrote one, but it's geeky Mar 14, 2008

chacher wrote:
Are there any tools that can batch align bi-lingual websites? (With the aim of creating themed, or client specific TMs)


You need two things:

* An extractor
* An aligner

For the aligner, you can use any alignment program. I suggest PlusTools from the Wordfast people.

For the extractor, take a look at my humble collection of scripts:
http://leuce.com/tempfile/omtautoit/
...and search the page for "large alignments". There are two versions -- the older version is less sophisticated and therefore less likely to fail you.

I used this when I aligned the text from a multilingual government web site. It's geeky, but it worked for me (you may have to watch it, though, so you can kill it the moment it misbehaves).

On that same page there is also a script named "Abbzz" which is used in conjunction with Abbyy Finereader to bulk extract text from PDFs (sometimes you get web sites offering PDFs in many languages).


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 01:17
Member (2006)
English to Afrikaans
+ ...
Yes, of course, that too... Mar 14, 2008

Wolfgang Jörissen wrote:
However, you might want to use one of those grabbers that download all pages of a website to your harddisk (I used HTTP Weazel years ago, it did a good job), and then check for alignable material.


Yes, of course... I took for granted that the OP would have the web pages on his hard disk already. For ripping a web site, you could also look at Oleg Arny Chernavin's "Web Downloader" (webdown.exe) (abandonware, but excellent). If you're into FLOSS, you could go with HTTrack.


 
TSDM
TSDM
Russian Federation
Local time: 02:17
Russian to English
TOPIC STARTER
responses to Samuel and Wolfgang (My Mum almost named me Wolfgang...) Mar 14, 2008

Samuel – great info, and I'm looking forward to trying your scripts. One question: Won't the extraction process remove valuable html tag information that would be helpful in alignment?

Wolfgang – downloading sites is not a problem (I'm on a Mac, using a great program with a great name: SiteSucker). The trick is finding an alignment program. I'm not looking for something that would capture 100% (Java applets, Flash, etc.), just basic text content.

I found that Multitrans has an alignment tool that handles HTML/PHP, and does a good job, but only up to 10 documents at a time, and they have to be manually paired one by one.

What we're missing is a program that can batch files, and of course, handle subdirectories.

Any other suggestions out there?
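The batch pairing itself is not hard to sketch, at least for the simple case. The Python below is only an illustration and rests on one big assumption: the downloaded site keeps each language in its own mirrored folder (say `site/en/` and `site/ru/`) with identical relative paths. Many sites are not laid out that way. The function name `pair_files` and the folder names are invented for the example.

```python
from pathlib import Path


def pair_files(src_root, tgt_root, suffixes=(".html", ".htm", ".php")):
    """Pair files that share the same relative path under two mirrored trees.

    Returns (pairs, unmatched): pairs is a list of (source, target) paths,
    unmatched lists source files with no counterpart in the target tree.
    """
    src_root, tgt_root = Path(src_root), Path(tgt_root)
    pairs, unmatched = [], []
    # rglob("*") walks every subdirectory, so nested folders are handled.
    for src in sorted(src_root.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in suffixes:
            continue
        tgt = tgt_root / src.relative_to(src_root)
        if tgt.is_file():
            pairs.append((src, tgt))
        else:
            unmatched.append(src)
    return pairs, unmatched


if __name__ == "__main__":
    # Hypothetical folder names; adjust to wherever the site was downloaded.
    pairs, unmatched = pair_files("site/en", "site/ru")
    for src, tgt in pairs:
        print(src, "<->", tgt)
    print(len(unmatched), "source files without a counterpart")
```

The resulting list of pairs could then be fed, one pair at a time, to whatever aligner you settle on.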


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 01:17
Member (2006)
English to Afrikaans
+ ...
Answers Mar 15, 2008

chacher wrote:
Samuel – great info, and I'm looking forward to trying your scripts. One question: Won't the extraction process remove valuable html tag information that would be helpful in alignment?


There are two ways of looking at HTML tag information. You can see it as helpful, or you can see it as unhelpful. I take the latter view. Well, I suppose one could write a very fancy program that actually makes use of the HTML structure to improve the initial alignment, but if your alignment tool is good and if you know both languages well, then I don't think you should be concerned.
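For anyone curious what "making use of the HTML structure" might look like, here is a rough Python sketch. It is not how any particular tool works; it simply assumes that two pages which are translations of each other tend to have a similar sequence of tags, and scores that similarity with the standard library's `difflib`. The names `tag_sequence` and `structure_similarity` are invented for the example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect the sequence of opening and closing tags, ignoring the text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def structure_similarity(html_a, html_b):
    """Return a score from 0.0 to 1.0; 1.0 means identical tag sequences."""
    # autojunk=False keeps very frequent tags (p, a, li) in the comparison.
    matcher = SequenceMatcher(None, tag_sequence(html_a),
                              tag_sequence(html_b), autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    en = "<html><body><h1>Hello</h1><p>One.</p><p>Two.</p></body></html>"
    ru = "<html><body><h1>Привет</h1><p>Раз.</p><p>Два.</p></body></html>"
    print(structure_similarity(en, ru))
```

A score like this could help decide which pages are worth pairing at all, but it says nothing about whether the sentences inside them line up.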

...and they have to be manually paired one by one.


Yes, well, that is what you do in an aligner. The aligner presents a table with two columns and you go through them to see that the segments on the left all match up with a segment on the right.

I have little faith in fully automated procedures. Alignment is only useful if you invest time in it.


 
David Turner
David Turner  Identity Verified
Local time: 01:17
French to English
+ ...
Logiterm Mar 15, 2008


What we're missing is a program that can batch files, and of course, handle subdirectories.
Any other suggestions out there?


Logiterm or Alignment Factory must be among the best batch aligners.
http://www.terminotix.com/index.asp?name=Professional&content=item&brand=2&item=12&lang=en

David Turner


 
TSDM
TSDM
Russian Federation
Local time: 02:17
Russian to English
TOPIC STARTER
comparing commercial alignment software Mar 17, 2008

David – Alignment Factory/Logiterm may be exactly what we're looking for.

Anyone else have recommendations or can comment on experiences with this or other batch/semi-automated alignment software?


 
TSDM
TSDM
Russian Federation
Local time: 02:17
Russian to English
TOPIC STARTER
automated alignment – a farce? Mar 17, 2008

Samuel –
I'd like to hear more about your approach to alignment. It's pretty hard to justify going through years of accumulated documents line by line to align them perfectly.

Do you do all your alignment in advance, or do you use software that shows full-text TMs (rather than units) and allows alignment on the fly?


 
TSDM
TSDM
Russian Federation
Local time: 02:17
Russian to English
TOPIC STARTER
extraction script not working? Mar 17, 2008

Samuel,
The script you have is listed as not working. Can you clarify? Is it possible to use?


 
TSDM
TSDM
Russian Federation
Local time: 02:17
Russian to English
TOPIC STARTER
maybe you can recommend another extraction tool? Mar 17, 2008

I read the script's readme file, and it is just too complex for me.

Can anyone recommend a tool to extract text from websites?


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 01:17
Member (2006)
English to Afrikaans
+ ...
If you find any... Mar 17, 2008

chacher wrote:
Can anyone recommend a tool to extract text from websites?


If you find any, please let us know. All of the HTM2TXT software that I have seen so far puts line breaks in the middle of sentences when converting to text, thereby rendering the extraction useless.

But if you open the HTM file in a browser and then go Ctrl+A, Ctrl+C in it, and then Ctrl+V in a text editor, the sentences remain intact. However, doing that one file at a time will take a long time to complete (unless you pay a student to do it for you).
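The mid-sentence line breaks usually come from the converter honouring line wraps in the HTML source. A small script can avoid that by collapsing whitespace and only starting a new line at block-level tags. The Python below is a minimal sketch using only the standard library; the list of block tags is my own guess at a reasonable set, not a complete one, and real-world pages (frames, scripts that build the text, broken markup) will need more care.

```python
import re
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "th", "title",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}
# Tags whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished paragraphs
        self.current = []     # text pieces of the paragraph being built
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def flush(self):
        # Collapse all whitespace, including source line wraps, to one space.
        text = re.sub(r"\s+", " ", "".join(self.current)).strip()
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.append(data)


def html_to_text(html):
    """Return the visible text, one line per block-level element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.blocks)
```

Inline tags such as bold or links do not break the line, so a sentence that is wrapped or partly formatted in the source still comes out on a single line.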


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 01:17
Member (2006)
English to Afrikaans
+ ...
Alignment must be 100% or zero Mar 17, 2008

chacher wrote:
It's pretty hard to justify going through years of accumulated documents line by line to align them perfectly.


Well, it depends at what level the alignment is. If you align paragraphs, it is easier to do it in a semi-automated way (theoretically speaking). But if you want to have segment matching (fuzzy matching etc) then sentence segmentation is pretty much what you're looking for, right?

If your automated alignment tool missegments one sentence at the top of a file (e.g. by creating two sentences instead of one, or because the editor of one language version added a sentence to the file), then all subsequent segments in that file will be misaligned. Don't you agree?
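To illustrate the knock-on effect, here is a toy Python example. It assumes the crudest possible aligner, one that simply pairs segment 1 with segment 1, segment 2 with segment 2, and so on; real aligners are smarter than this, so treat it as a worst case. The segment labels and function names are invented for the example.

```python
def positional_pairs(src_segments, tgt_segments):
    """Pair segments purely by position, as a naive aligner would."""
    return list(zip(src_segments, tgt_segments))


def count_correct(pairs, gold_pairs):
    """Count how many produced pairs appear in the known-good alignment."""
    gold = set(gold_pairs)
    return sum(1 for pair in pairs if pair in gold)


if __name__ == "__main__":
    src = ["S1", "S2", "S3", "S4", "S5"]
    gold = list(zip(src, ["T1", "T2", "T3", "T4", "T5"]))

    # The first target sentence was wrongly split in two (T1a, T1b):
    # every later segment is pushed down by one position.
    split_at_top = ["T1a", "T1b", "T2", "T3", "T4", "T5"]
    print(count_correct(positional_pairs(src, split_at_top), gold))  # prints 0

    # The same mistake near the end of the file does far less damage.
    split_near_end = ["T1", "T2", "T3", "T4a", "T4b", "T5"]
    print(count_correct(positional_pairs(src, split_near_end), gold))  # prints 3
```

An error at the top of the file spoils every pair after it, while the same error near the bottom only spoils the last few.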


 
Felipe Gútiez Velasco
Felipe Gútiez Velasco
Germany
Local time: 01:17
Member (2002)
German to Spanish
+ ...
Do you know Multitrans or any other good new alignment tool? Oct 6, 2008

Samuel Murray wrote:

chacher wrote:
It's pretty hard to justify going through years of accumulated documents line by line to align them perfectly.


Well, it depends at what level the alignment is. If you align paragraphs, it is easier to do it in a semi-automated way (theoretically speaking). But if you want to have segment matching (fuzzy matching etc) then sentence segmentation is pretty much what you're looking for, right?

If your automated alignment tool missegments one sentence at the top of a file (e.g. by creating two sentences instead of one, or because the editor of one language version added a sentence to the file), then all subsequent segments in that file will be misaligned. Don't you agree?


I am looking for a good alignment tool for InDesign and XML files.
Can Trados handle several languages in one go? How many?


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 00:17
Member (2009)
Dutch to English
+ ...
Try this: May 20, 2010

http://www.youalign.com/

 