Segmentation rules in Trados Studio and memoQ

Segmentation rules are particularly important in the translation process, as they determine how text is imported into a CAT tool and hence how simple or complex, short or long, the segments will be. Whether you're a project manager, a translator or a reviewer, shorter and simpler segments are easier to handle, which means less time spent and higher productivity. Take this sentence, for instance:

These waters are home for various fish species - carp, trout, perch, bream, pike, roach.

Suppose the sentence lists even more species and the translator, being tired or for any other reason, skips one of them. To prevent such an error from making its way to the client, it would be much more convenient to have the sentence imported into the CAT tool as follows:
These waters are home for various fish species -
carp,
trout,
perch,
bream,
pike,
roach.

Let's take some hands-on examples, first with cumbersome manual paragraph numbering. Suppose you have:
a) This is an object.
1) This is an object.
B) This is an object.
1.1 This is an object.
1.1This is an object.

Normally, these would be imported as five segments, because the numbering is typed manually rather than generated automatically. You could instead import them in a more translator-friendly way:
a)
This is an object.
1)
This is an object.
etc.

To do that, in Trados Studio:

Go to Project settings -> Language Pairs, select your TM, then press Settings -> Language Resources. In the right-hand side view, next to the source language, under Segmentation Rules, click on Default. In the Segmentation Rules window that opens, press Add and in the next Add Segmentation Rule window, select Advanced view, overwrite the expression under the Before break field with the one below, delete the expression under the After break field and press OK (you must give a name to this rule):

^\(?[a-zA-Z0-9]+\)[\s\t]*

It means: Look for all segments that start with an optional opening round bracket, followed by one or more lowercase / uppercase letters or digits from 0 to 9, then by a closing round bracket, then by zero or more space or tab characters.

Follow the same steps and also add this expression:

^\d{1,}\.\d{1,}[\s\t]*

It means: Look for all segments that start with a digit, which occurs one or more times, is followed by a dot character, then by another digit, which occurs one or more times, then by a space character or a tab character, which occurs zero or more times.
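If you want to check the two expressions before adding them to a project, a few lines of Python are enough. This is only a rough sketch: the function name split_numbering is made up for illustration, Python's regex flavour is not the one Trados Studio uses, and the way the label is trimmed here is an assumption made for readable output, not a description of how Studio handles whitespace.

```python
import re

# The two "Before break" expressions from the text; "After break" is left empty.
BEFORE_BREAK = [
    re.compile(r"^\(?[a-zA-Z0-9]+\)[\s\t]*"),
    re.compile(r"^\d{1,}\.\d{1,}[\s\t]*"),
]

def split_numbering(paragraph):
    """Return [label, rest] if a rule matches at the start, else [paragraph]."""
    for pattern in BEFORE_BREAK:
        match = pattern.match(paragraph)
        # Only split when something is left after the numbering label.
        if match and match.end() < len(paragraph):
            label = paragraph[:match.end()].strip()  # trimmed for display only
            return [label, paragraph[match.end():]]
    return [paragraph]

samples = [
    "a) This is an object.",
    "1) This is an object.",
    "B) This is an object.",
    "1.1 This is an object.",
    "1.1This is an object.",
]
for line in samples:
    print(split_numbering(line))
```

Each of the five sample lines should come out as two parts: the numbering label and "This is an object.".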

In memoQ:

Under the Resource console, go to Segmentation rules, press Create new to create your own rule, give it a name, select your language and press OK. Select this new rule, press Edit and in the Edit segmentation rule set window that opens, click on the Advanced view link in the bottom left corner, overwrite the expression in the bottom field with the one below and press Add:

^\(?[a-zA-Z0-9]+\)[\s\t]*#!#

It means: Look for all segments that start with an optional opening round bracket, followed by one or more lowercase / uppercase letters or digits from 0 to 9, then by a closing round bracket, then by zero or more space or tab characters, and apply a segment break (#!#) there.

Follow the same steps and also add this expression:

^\d{1,}\.\d{1,}[\s\t]*#!##cap#

It means: Look for all segments that start with a digit, which occurs one or more times, is followed by a dot character, then by another digit which occurs one or more times, then by a space character or a tab character which occurs zero or more times, then by a capital letter (defined under the #cap# group) and apply the segment break (#!#) before the capital letter.
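To see where the #!# marker would put the break, the memoQ-style rules can be imitated in Python. Treat this as a sketch under assumptions: compile_rule and break_position are hypothetical helper names, #cap# is approximated by plain A-Z (the real list in memoQ is configurable and may contain other letters), and Python's regex engine is not memoQ's.

```python
import re

# Assumption: memoQ's #cap# list is approximated here by plain A-Z.
CAP = "[A-Z]"

def compile_rule(rule):
    """Turn a memoQ-style rule into a Python regex that ends at the break."""
    before, after = rule.split("#!#", 1)
    after = after.replace("#cap#", CAP)
    pattern = f"(?:{before})"
    if after:
        # Lookahead: the text after the break must match but is not consumed.
        pattern += f"(?={after})"
    return re.compile(pattern)

def break_position(rule, text):
    """Return the index where the segment break would fall, or None."""
    match = compile_rule(rule).match(text)
    return match.end() if match else None

BRACKET_RULE = r"^\(?[a-zA-Z0-9]+\)[\s\t]*#!#"
DECIMAL_RULE = r"^\d{1,}\.\d{1,}[\s\t]*#!##cap#"

print(break_position(BRACKET_RULE, "a) This is an object."))
print(break_position(DECIMAL_RULE, "1.1This is an object."))
print(break_position(DECIMAL_RULE, "1.1 this is an object."))
```

The last call returns None, because the text after the numbering starts with a lowercase letter and the #cap# part of the rule is not satisfied.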

Let's go deeper now. Suppose you have lots of repetitive texts, such as names of sections, titles, chapters, articles etc.
Section I "Section name"
Title A "Title name"
Chapter II "Chapter name"
Article 2 "Article name"

Instead of wasting time on modifying fuzzy matches and risking accidental changes to the related letters or numbers, you could simply import the names of sections, titles, chapters, articles etc. as separate segments:
Section I
"Section name"
Title A
"Title name"
Chapter II
"Chapter name"
Article 2
"Article name"

Afterwards, it is very easy to filter for just the segments containing these repetitive texts and, depending on the target language, either copy the source text to the target as it is or copy it and replace the keyword with its translation:
Section I
Section II
Title A
Title B
Chapter I
Chapter II
Article 2
Article 13

To do that, in Trados Studio:

Follow the steps above to get to the Advanced view and add this expression:

^(Section|Title|Chap|Chapter|Art|Article)\s[A-Z0-9]+

It means: Look for all segments that start with any of "Section", "Title", "Chap", "Chapter", "Art" or "Article", followed by a space, followed by any uppercase letter or any digit from 0 to 9, which occurs one or more times.

In memoQ:

Follow the steps above to get to the Advanced view and add this expression:

^(Section|Title|Chap|Chapter|Art|Article)\s[A-Z0-9]+#!#
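The heading expression can be tried out in Python as well. Again, this is only a sketch: split_heading is a made-up helper, and the break is simulated by cutting the line at the end of the match, which is an assumption about how either tool applies the rule.

```python
import re

# The heading expression from the text, without the memoQ #!# marker.
HEADING = re.compile(r"^(Section|Title|Chap|Chapter|Art|Article)\s[A-Z0-9]+")

def split_heading(line):
    """Return [heading, name] if the rule matches, else [line]."""
    match = HEADING.match(line)
    if not match or match.end() == len(line):
        return [line]
    # The leading space of the second part is dropped for display only.
    return [match.group(0), line[match.end():].lstrip()]

samples = [
    'Section I "Section name"',
    'Title A "Title name"',
    'Chapter II "Chapter name"',
    'Article 2 "Article name"',
]
for line in samples:
    print(split_heading(line))
```

Note that in this sketch a line such as "Title Page" would also match, because [A-Z0-9]+ accepts the single capital P. It is worth testing a rule on real project text before relying on it.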

You can play with the segmentation rules any way you want and set whatever you need as a segment separator, for instance the semicolon or the tab character.

To do that, in Trados Studio:

Follow the steps above to get to the Add Segmentation Rule window. Under the Before break and After break drop-down menus, select Anything, and under Break character, select the default value you want: " ; " or " Tab " (you must give these rules a name).

In memoQ:

Follow the steps above to get to the Advanced view and add this expression:

[;\t]#!#
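As a quick illustration of what a break after every semicolon or tab does to a line, here is a small Python sketch. The function split_on_separators is hypothetical, and trimming the surrounding whitespace (which also drops the tab itself) is an assumption made for readable output.

```python
import re

def split_on_separators(text):
    """Cut the text after every semicolon or tab, keeping the semicolon."""
    # Either a run of text ending in a separator, or a final run without one.
    parts = re.findall(r"[^;\t]*[;\t]|[^;\t]+", text)
    return [part.strip() for part in parts if part.strip()]

print(split_on_separators("first item; second item\tthird item"))
```

The sample line comes out as three segments, with the semicolon staying at the end of the first one.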

Hope you found these hints useful!
