Total Recall

Want to load the DGT in Total Recall? See: Using the DGT with Total Recall

Total Recall: What is it?

Total Recall is a unique feature of CafeTran for handling Big Data. Normally you store your segments in TMX files, which CafeTran can open and process without further conversion. These TMX files are loaded into RAM without any indexing. Depending on the amount of RAM and the processor speed of your computer, these TMX Translation Memories are queried very fast.

However, when the size of your TM increases beyond the limit of 500,000 segments, you will want to start using indexed databases for your TMX files.
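
To check whether a TMX file is approaching that size, its translation units can be counted without loading the whole file. The following is a minimal Python sketch and not part of CafeTran: the function name `count_translation_units` and the sample data are made up for illustration, and it assumes a plain TMX file whose units are `<tu>` elements without an XML namespace.

```python
import io
import xml.etree.ElementTree as ET

SUGGESTED_LIMIT = 500_000  # threshold mentioned in the text above


def count_translation_units(tmx_source):
    """Count <tu> elements in a TMX file (path or binary file object), streaming."""
    count = 0
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag == "tu":
            count += 1
            elem.clear()  # drop the unit's content once it has been counted
    return count


# Tiny hypothetical sample; a real call would pass the path of a TMX file.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"/>
  <body>
    <tu><tuv xml:lang="en"><seg>Hello</seg></tuv><tuv xml:lang="nl"><seg>Hallo</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Thank you</seg></tuv><tuv xml:lang="nl"><seg>Dank u</seg></tuv></tu>
  </body>
</tmx>"""

SAMPLE_COUNT = count_translation_units(io.BytesIO(SAMPLE_TMX))
if SAMPLE_COUNT > SUGGESTED_LIMIT:
    print("Consider importing this TMX into a Total Recall database.")
```

Because the file is read as a stream, the count itself should not need much more memory than a single translation unit's worth of content at a time.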

Total Recall offers a mechanism both to import sets of millions of segments into a very fast database and to create extracts of these databases that can be used for concordancing and segment matching.

By default, Total Recall uses a SQLite database (optionally an H2 or Hsqldb database; see Edit > Options > Database > Database connection). It recalls from the database to memory only the segments that are relevant to the project, which avoids loading massive amounts of data into RAM. See the new ‘Total Recall’ menu item. See also the new ‘Recall segments…’ and ‘Recall phrases…’ submenus.

In short, Total Recall finally allows you to pretranslate against massive TMX files.

After you run Total Recall, a temporary TM (named Database name_TM) is created and opened in the tabbed pane. It contains only the relevant hits/matches, just like one of those importable/exportable pretranslation TMX files.

Using Total Recall

  • Load a database:
menu.png
  • Choose Total Recall > Recall segments to memory….

A temporary Translation Memory is created, and the Settings dialogue box for this temporary TM opens.

  • Set the maximum number of segments recalled for a context word, that is, a word that is present in the document being translated (default value is ‘100’):
settings.png
  • After recalling segments from Total Recall to Memory, please run the Pretranslate command (menu Translation > Pretranslate all segments). Then, there will be no delay for the longer segments.
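
The per-word limit from the Settings dialogue can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `segments` and `words` tables, the `recall_segments` function and the sample data are hypothetical and do not reflect CafeTran's actual database schema or matching logic.

```python
import re
import sqlite3

MAX_HITS_PER_WORD = 100  # default value mentioned in the settings above


def context_words(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"\w+", text.lower()))


def recall_segments(conn, document_text, max_hits=MAX_HITS_PER_WORD):
    """Return (source, target) pairs that share a word with the document,
    fetching at most max_hits segments per context word."""
    recalled = {}
    for word in sorted(context_words(document_text)):
        rows = conn.execute(
            "SELECT s.id, s.source, s.target FROM words w "
            "JOIN segments s ON s.id = w.segment_id "
            "WHERE w.word = ? ORDER BY s.id LIMIT ?",
            (word, max_hits),
        )
        for seg_id, source, target in rows:
            recalled[seg_id] = (source, target)  # dict removes duplicates
    return [recalled[seg_id] for seg_id in sorted(recalled)]


# Hypothetical miniature database with a word-to-segment lookup table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (id INTEGER PRIMARY KEY, source TEXT, target TEXT)")
conn.execute("CREATE TABLE words (word TEXT, segment_id INTEGER)")
conn.execute("CREATE INDEX idx_words_word ON words (word)")
for source, target in [
    ("Close the valve", "Sluit de klep"),
    ("Open the valve", "Open de klep"),
    ("Press the button", "Druk op de knop"),
]:
    seg_id = conn.execute(
        "INSERT INTO segments (source, target) VALUES (?, ?)", (source, target)
    ).lastrowid
    conn.executemany(
        "INSERT INTO words VALUES (?, ?)",
        [(word, seg_id) for word in context_words(source)],
    )

capped = recall_segments(conn, "Check valve", max_hits=1)
uncapped = recall_segments(conn, "Check valve")
```

With `max_hits=1` only the first segment containing "valve" is recalled, while the default limit returns both. A higher limit yields a larger temporary TM at the cost of a longer recall.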

Importing a TMX now takes considerably longer, since all TUs are now indexed. But hey, the result is worth it: retrieval (querying) is lightning fast (‘blitzschnell’)!
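
The trade-off between slower import and faster querying is typical of database indexes in general. The sketch below uses Python's built-in `sqlite3` module to show how SQLite's query plan for a word lookup changes once an index exists. The `words` table and the index name are hypothetical and say nothing about how CafeTran actually organises its databases.

```python
import sqlite3


def lookup_plan(conn):
    """Return SQLite's plan description for a lookup by word."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT segment_id FROM words WHERE word = ?",
        ("valve",),
    ).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the description


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, segment_id INTEGER)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [("valve", 1), ("button", 2), ("valve", 3)],
)

plan_before = lookup_plan(conn)  # without an index: a scan of the whole table

# Building the index is the extra work done once, at import time.
conn.execute("CREATE INDEX idx_words_word ON words (word)")

plan_after = lookup_plan(conn)  # with the index: a search that names the index
print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the second plan should mention `idx_words_word`, which means the lookup no longer has to read every row.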

Video on Total Recall

Creating a Total Recall database from a large TMX file (with about 800,000 TUs). Also known as: storing a TM (translation memory) in a DB (database).

Part 1

Loading a Total Recall database into RAM and recalling from it any segments that are relevant to the project, thus creating a new, far smaller TM that can be used to get instantaneous hits.

Part 2


PLEASE NOTE: Total Recall is not only meant for dealing with 'terabytes' of segments. It is a storage system for segments/phrases whose aim is to recall later only those segments that are relevant to the current document, that is, to its context. Each new project recalls a different set of segments to use in the translation.

After recalling, you can use Pretranslation against the recalled segments, as you have done against TMX segments so far.

Total Recall is here for you and me!

Demonstration with a database with 4,673,277 TUs from the EU

Loading the database to RAM: http://youtu.be/nd4jj0Ue_EM

Using the gigantic database for auto-assembling and concordancing: http://youtu.be/WwlmW6ys-VQ

In this screencast, only recognised terms are inserted. The settings:

abc.png

The screencast: http://youtu.be/NqaO0p9cvcI

Background info

Total Recall is a storage and retrieval system for segments or phrases. The previous version of CafeTran did not scale well (it was limited by your computer's RAM), since all segments were loaded into RAM for exact and fuzzy matching, auto-concordancing and auto-assembling.

The new system takes the context of the document being translated into account, recalling from the database only the relevant segments. Thus, from the huge base of segments, you recall just the segments you need for your current translation.

Apart from the document context, you can also filter segments by properties such as Subject, Client, etc. By default the underlying database resides on the translator's computer, but it can also be placed on a server, in which case you provide the URL of that server.
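
Filtering by properties can be sketched as an ordinary database query. The Python example below is an illustration only: the `segments` table, its `subject` and `client` columns, the function `recall_by_properties` and the sample rows are all assumptions, not CafeTran's real schema or API.

```python
import sqlite3


def recall_by_properties(conn, subject=None, client=None):
    """Return (source, target) pairs matching the given properties.
    A property left as None is not used for filtering."""
    clauses, params = [], []
    if subject is not None:
        clauses.append("subject = ?")
        params.append(subject)
    if client is not None:
        clauses.append("client = ?")
        params.append(client)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    query = "SELECT source, target FROM segments" + where + " ORDER BY id"
    return conn.execute(query, params).fetchall()


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments ("
    "id INTEGER PRIMARY KEY, source TEXT, target TEXT, subject TEXT, client TEXT)"
)
conn.executemany(
    "INSERT INTO segments (source, target, subject, client) VALUES (?, ?, ?, ?)",
    [
        ("Close the valve", "Sluit de klep", "Engineering", "Acme"),
        ("Sign the contract", "Onderteken het contract", "Legal", "Acme"),
        ("Open the valve", "Open de klep", "Engineering", "Globex"),
    ],
)

engineering_for_acme = recall_by_properties(conn, subject="Engineering", client="Acme")
```

Passing the values as query parameters rather than pasting them into the SQL text keeps the sketch safe for property values that contain quotes.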

Auto-assembling is an optional feature in CafeTran, since it is only useful for certain kinds of translations and language pairs (such as source languages with ‘limited’ inflection and language pairs with similar syntax).

Users' comments

Hans vd B. wrote:

CafeTran will extract ALL relevant segments, words or phrases from an H2 database and save them in a TMX file. It will "only" go for a maximum of 100 (default) hits so as not to "overheat" the search process. In other words, if there are more than 100 occurrences of a word or phrase in the database, CT will call it a day, and stop searching for the word/phrase.

In short:

  • Let CT convert a resource (TMX or TXT) to an H2 database, or use an existing database
  • CT will automatically index*) the database
  • Start your project, and select Recall segments to memory in Menu | Total Recall
  • CT will automatically create a new TMX memory for those segments, and show it in the tabbed pane
  • Run: Translation > Pretranslate all segments **)
  • Start translating. CT will use the newly created TMX file like any other TMX file for Auto-Assembly, Auto-Complete, and all other features, while you can still search the indexed H2 database in no time for terms and phrases that haven't been extracted because they didn't match the Project exactly (no fuzziness)

*) May take a while
**) It depends on the size of the resulting TMX file, and the (average) length of the segments in the Project, I'd say. I'd start translating, and if I notice a serious delay when CT searches for matches in the generated TMX file, I'd go for pretranslation and give it a head-start of a minute or two. No delay, no pretranslation.

Michael B. wrote:

Indeed, the problem of CT being unable to work with very large databases is now over. I am currently still testing this very new feature, but so far it looks like Total Recall has made CT better at handling large datasets than memoQ, which was previously my favourite CAT tool for dealing with Big data. My Total Recall db now contains around 2 million TUs, and it looks like it can handle a lot more than that. It does help to have a decent computer though. I have 32GB of RAM, a Haswell i7, 2 SSDs, etc. and I am getting very good results in terms of import, indexing & "Total Recall pretranslation" times.

Hans vd B.:

I recorded a screencast of the search for two words in the German-Dutch DGT, attached to the project as a TMX memory, and as an indexed H2 database.

http://www.screencast.com/t/95NPzHRJJIpn

Don't worry, it's only 26 seconds short.

The search in the TMX file shows how things were before the upgrade. Clicking the icon with the coffee cup marked MS (Memory Search) activates the search in the TMX file, so the "old" way. Clicking the icon marked DS starts the search in the indexed H2 database. A third search - in a newly created "Total Recall" TMX file - would undoubtedly be faster, but I don't think mere mortals will notice the increase in speed.

Relevant technical data:

  • DGT GER-DUT, 2 million segments
  • Computer: iMac 27", late October 2009 model
  • Processor: 3.06 GHz Intel Core 2 Duo
  • RAM: 12 GB 1067 MHz DDR3, 8 GB assigned to Java
  • Storage: 1 TB rotational HDD

Opening SQLite databases with a database tool

You can view/edit CafeTran's databases with SQLite Database Browser.

browser.png

See also some in-depth articles in the official CafeTran Freshdesk.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License