Databases or glossaries?

Although relational databases (such as MySQL, SQLite, etc.) can be useful, keep in mind that using one will force you to learn its query language and will reduce the interoperability of your resources. A SQLite .db file can only be opened in specialised programs, whereas a tab-delimited UTF-8 text file (such as this one) can be opened in all manner of tools: a simple text editor, Microsoft Word or a so-called CSV editor. You can’t just double-click your .db file and see its contents. You can only interact with it indirectly via, e.g., SQL queries, which can be difficult to learn.

DVX2/3, for example, stores its resources in some form of relational database. I think it's an Access .db, and you can get at your data using SQL commands. However, it is also notoriously slow, and many people wish Atril had used another DB format.

Compare this with my main CafeTran glossary, a simple tab-delimited text file of about 25 MB that holds around 400,000 entries, many of them carrying a ton of metadata. I can work with this glossary attached to my CafeTran project with no slow-downs whatsoever. Furthermore, if I want to edit its contents, all I need to do is right-click and open it in my CSV editor, which presents me with a beautifully clear, visual view of the data, very similar to how a file looks when opened in Excel. That is, I can filter and sort on columns and do all kinds of cleaning and maintenance operations, all without having to learn anything about SQL or any other database language.
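For anyone who prefers a script to a CSV editor, here is a minimal sketch of the same kind of filtering and sorting using nothing but Python's standard csv module. The three-column layout (source term, target term, note) and the sample entries are assumptions made up for illustration; a real CafeTran glossary may use different columns.

```python
import csv
import io


def load_glossary(handle):
    """Read tab-delimited rows into a list of lists, skipping blank lines."""
    # QUOTE_NONE: treat quote characters as ordinary text, not as field wrappers.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def filter_rows(rows, column, needle):
    """Keep rows whose given column contains needle (case-insensitive)."""
    needle = needle.lower()
    return [r for r in rows if len(r) > column and needle in r[column].lower()]


# Hypothetical sample data: source term, target term, note.
SAMPLE = (
    "huis\thouse\tgeneral\n"
    "huisarts\tgeneral practitioner\tmedical\n"
    "boom\ttree\tgeneral\n"
)

rows = load_glossary(io.StringIO(SAMPLE))

# Filter on the third column, then sort everything on the target term.
medical = filter_rows(rows, 2, "medical")
by_target = sorted(rows, key=lambda r: r[1].lower())
```

To run this against a file on disk instead of the inline sample, pass `open(path, encoding="utf-8", newline="")` to `load_glossary`, where `path` is wherever your glossary lives.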

When it comes to translation memories and translation memory collections, however, I completely agree with Hans: a relational database is absolutely indispensable. There is simply no way to access and edit a TMX collection of 30,000,000 TUs without resorting to some kind of database. TMLookup (which uses an SQLite database) is a good example of what is possible with large TMX collections. It is much faster than TMX handling in, e.g., memoQ, DVX2/3 or SDL Studio. My TMLookup ‘default.db’ is 25 GB on disk, but this doesn't matter: disk space is cheap these days, and what matters is how blisteringly fast it is once loaded in the program.
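To give a rough idea of why a database helps here, the following is a minimal sketch of an indexed translation memory in SQLite, using Python's built-in sqlite3 module. The table name `tu`, its columns and the sample sentences are assumptions for illustration only; this is not TMLookup's actual schema, which I have not inspected.

```python
import sqlite3


def build_tm(pairs):
    """Create an in-memory TM with one row per translation unit."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tu ("
        "id INTEGER PRIMARY KEY, source TEXT NOT NULL, target TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO tu (source, target) VALUES (?, ?)", pairs)
    # The index lets exact lookups avoid scanning every row.
    conn.execute("CREATE INDEX idx_tu_source ON tu (source)")
    conn.commit()
    return conn


def exact_matches(conn, source):
    """Return all targets stored for an exact source segment."""
    cur = conn.execute(
        "SELECT target FROM tu WHERE source = ? ORDER BY id", (source,)
    )
    return [row[0] for row in cur]


def concordance(conn, fragment):
    """Return (source, target) pairs whose source contains fragment.

    This is a plain substring scan; it does not use the index.
    """
    cur = conn.execute(
        "SELECT source, target FROM tu "
        "WHERE instr(lower(source), lower(?)) > 0 ORDER BY id",
        (fragment,),
    )
    return list(cur)


# Hypothetical sample translation units.
conn = build_tm([
    ("Het huis is groot.", "The house is big."),
    ("De boom is hoog.", "The tree is tall."),
    ("Het huis is groot.", "The house is large."),
])
```

Calling `exact_matches(conn, "Het huis is groot.")` returns both stored translations, and `concordance(conn, "boom")` returns the one unit that mentions the word. For a collection of millions of TUs the substring scan would be slow; a full-text index would be the usual next step, though that is beyond this sketch.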

So, in a nutshell, my recommendation would be:

  • termbase: tab-delimited UTF-8 text file (under approx. 400,000 entries)
  • translation memories: relational databases (virtually unlimited # of entries)

Michael
