Defining Segmentation Rules

CafeTran also supports the SRX (Segmentation Rules eXchange) standard for defining segmentation rules for a given language. This means that you can define exceptions to the default segmentation for your language. The default SRX file, Rules.srx, is located in the rules/segmentation directory (‘C:\CafeTran Espresso\cafetran\rules\segmentation’ on Windows). It contains the default rules and example exceptions for a few languages.

To edit or add your own exceptions, go to Edit > Options > Segmentation. Next, select an SRX file from the list and click the Segmentation button.

The segmentation editor has two menus: languagerules and maprules. In the maprules menu, you may edit or add a new language map based on the provided examples.

For example, to create a Dutch language map:
Go to: Edit > Options > Segmentation editor…
Select ‘Rules.srx’ from the drop-down menu
Click on Segmentation editor…
Click on Language maps
Click on New languagemap…

Enter:
languagepattern: [Nn][Ll].*
languagerulename: Dutch

Once a new language map has been created, the segmentation rules for that language are selected automatically when you create a new project.
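
To see how a language map picks a rule set, here is a minimal Python sketch. It assumes the pattern is meant to match both upper-case and lower-case codes (that is, [Nn][Ll].*), that the pattern must match the whole language code, and that the first matching map wins. The names LANGUAGE_MAPS and rules_for are hypothetical and are not part of CafeTran; the catch-all ‘Default’ entry is also only an illustration.

```python
import re

# Hypothetical language maps, checked in order: (languagepattern, languagerulename).
LANGUAGE_MAPS = [
    ("[Nn][Ll].*", "Dutch"),   # the Dutch map from the example above
    (".*", "Default"),         # assumed catch-all entry for every other language
]

def rules_for(language_code, maps=LANGUAGE_MAPS):
    """Return the name of the first rule set whose pattern matches the code."""
    for pattern, rule_name in maps:
        # Assumption: the pattern has to match the entire language code.
        if re.fullmatch(pattern, language_code):
            return rule_name
    return None

for code in ("nl-NL", "NL", "en-US"):
    print(code, "->", rules_for(code))
```

With these assumptions, any code that starts with ‘nl’ in either case is routed to the Dutch rules, and everything else falls through to the catch-all entry.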

In the Language rules menu, you can create, add and edit the rules for segmentation in a given language. The Default rules take precedence over any other rules and should not be changed.

To add a new language to the Languagerules, click ‘New languagerule’ and enter a name for your language.

There are two kinds of rules you can define: ‘Yes rules’, which determine where segments break, and ‘No rules’, which define exceptions to those breaks. Use the provided examples as a model for your own Yes and No rules. In most cases you will only need to create No rules, that is, exceptions to the default breaks, because the Default Yes rules are sufficient. The rules must be written as regular expressions. The SRX specification, available online, provides many examples and can serve as a general guide to creating your own rules.

The Rule menu has two submenus, one for the ‘Beforebreak rule’ and one for the ‘Afterbreak rule’. The Beforebreak rule describes the text immediately before a possible break position and the Afterbreak rule describes the text immediately after it. For example, to prevent segmentation after the English abbreviation Mr., first create a No rule, that is, a rule whose break field is set to ‘no’. Then create the Beforebreak rule: \sMr\. (‘\s’ stands for a whitespace character, and a literal dot must be preceded by the backslash symbol ‘\’). Finally, create the Afterbreak rule: \s (a whitespace character).
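
The interplay of Yes and No rules can be sketched in a few lines of Python. This is only an illustration of the general SRX idea, not CafeTran's actual implementation: at each position in the text the rules are tried in order, and the first rule whose Beforebreak and Afterbreak patterns both match decides whether to break. The Yes rule ‘[.?!]+’ followed by whitespace is a hypothetical stand-in for the Default rules, and the function name segment is invented for this example.

```python
import re

# Each rule: (is_break, beforebreak regex, afterbreak regex), tried in order.
RULES = [
    (False, r"\sMr\.", r"\s"),   # No rule: never break after " Mr."
    (True, r"[.?!]+", r"\s"),    # hypothetical Yes rule: break after ., ? or !
]

def segment(text, rules=RULES):
    """Split text into segments; the first matching rule at a position decides."""
    compiled = [
        # Anchor beforebreak so it must end exactly at the candidate position.
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # later rules are ignored once one rule has matched
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("I met Mr. Smith. He left."))
# ['I met Mr. Smith.', 'He left.']
print(segment("I met Mr. Smith. He left.", RULES[1:]))  # without the No rule
# ['I met Mr.', 'Smith.', 'He left.']
```

In this sketch the No rule has to be listed before the Yes rule, because the first matching rule wins. Without it, the text is wrongly split after ‘Mr.’.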

You can also prepare your own SRX file in an XML editor and place it in the rules/segmentation directory. SRX files may be selected from the list in Edit > Options > Segmentation.
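
If you write an SRX file by hand, it can help to check it before copying it into the rules/segmentation directory. The Python sketch below embeds a very small sample and lists its rules and language maps. The sample follows the element and attribute names of the SRX 2.0 specification; whether CafeTran's own Rules.srx uses exactly this version and layout is an assumption, so compare it with the shipped file first. The names SAMPLE_SRX and read_srx are hypothetical, and the XML declaration that a real file normally starts with is omitted here for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace used by SRX 2.0 documents.
NS = {"srx": "http://www.lisa.org/srx20"}

# Minimal illustrative SRX 2.0 content (not the file shipped with CafeTran).
SAMPLE_SRX = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\sMr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="[Ee][Nn].*" languagerulename="English"/>
    </maprules>
  </body>
</srx>"""

def read_srx(srx_text):
    """Return (rules per language, language maps) found in an SRX 2.0 string."""
    root = ET.fromstring(srx_text)  # raises ParseError if the XML is malformed
    rules = {}
    for lang in root.findall(".//srx:languagerule", NS):
        rules[lang.get("languagerulename")] = [
            (
                rule.get("break", "yes") == "yes",  # break defaults to "yes"
                rule.findtext("srx:beforebreak", default="", namespaces=NS),
                rule.findtext("srx:afterbreak", default="", namespaces=NS),
            )
            for rule in lang.findall("srx:rule", NS)
        ]
    maps = [
        (m.get("languagepattern"), m.get("languagerulename"))
        for m in root.findall(".//srx:languagemap", NS)
    ]
    return rules, maps

rules, maps = read_srx(SAMPLE_SRX)
print(rules)
print(maps)
```

A parse error at this stage usually points to an unclosed tag or an unescaped character such as ‘&’ or ‘<’ inside a rule.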

Here is a generic segmentation rules file that you can use as a starter. It contains many generic German abbreviations.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License