Complex queries

When performing a simple query on a set of annotation files, Dolmen attempts to find concordances by examining one item (point or interval) at a time. While an item can match a given search pattern several times if several of its substrings match the pattern, each match is nevertheless limited to a single item.
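The behaviour of a simple query can be pictured with a short Python sketch. This is not Dolmen's code: the function simple_query and the sample labels are hypothetical, and regular-expression matching is assumed here purely for illustration.

```python
import re

# Hypothetical item labels from a single tier (one string per item).
labels = ["the cat sat on the mat", "a dog"]

def simple_query(labels, pattern):
    """Search each item on its own; a match never spans two items."""
    hits = []
    for index, label in enumerate(labels):
        # One item can yield several matches if several substrings fit.
        for m in re.finditer(pattern, label):
            hits.append((index, m.start(), m.group()))
    return hits

hits = simple_query(labels, r"at")
```

Here the first item matches three times ("cat", "sat", "mat") and the second not at all, so every hit carries item index 0.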

Sometimes, however, we might want to match text in several items simultaneously. Such a query is called a complex query in Dolmen. There are 3 types of relations between items, detailed below: alignment, precedence and dominance.

Building a complex query

When you open a search window, two small buttons with a + and a - sign appear below the main search field. These buttons allow you to add and remove search items. Any query which has more than one search item is a complex query.

When you add one or more search items, you will notice that each of them (except the last one) is followed by a selector with 3 possible values: is aligned with, precedes and dominates. They correspond to the tier item relations alignment, precedence and dominance, respectively.

Contrary to simple queries, complex queries do not use the KWIC model to display results. Instead of displaying a matched string in its context, Dolmen lets the user select a display tier, which appears at the top of the search box. The text that is displayed is the concatenation of all the items contained within the time interval defined by simultaneous satisfaction of the constraints on each search item. Several examples are given below.

Alignment relation

Two items are aligned if they are on different tiers and their left and right boundaries coincide. Suppose that you have a part-of-speech (POS) tier (tier 1) and a word tier (tier 2), where each word was segmented and which is aligned with the POS tier. To extract all the nouns in the corpus, you could do the following:

Dolmen will first look for all items on tier 1 whose text contains "NOUN", and will keep those that are exactly aligned with an item on tier 2 that has a non-empty label. Dolmen will then return the list of text labels on tier 2 that meet these criteria.
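The logic of this query can be sketched in Python. This is only an illustration under stated assumptions, not Dolmen's implementation: the Item class, the function names and the sample data are all hypothetical, and "contains" is modelled as a plain substring test.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    tier: int      # tier index
    start: float   # left boundary (seconds)
    end: float     # right boundary (seconds)
    text: str      # label

def is_aligned(a, b):
    """Different tiers, identical left and right boundaries."""
    return a.tier != b.tier and a.start == b.start and a.end == b.end

def aligned_labels(items, pattern, search_tier, display_tier):
    """Labels on display_tier aligned with a matching item on search_tier."""
    hits = [i for i in items if i.tier == search_tier and pattern in i.text]
    return [
        other.text
        for hit in hits
        for other in items
        # Keep only non-empty labels that are exactly aligned with the hit.
        if other.tier == display_tier and other.text and is_aligned(hit, other)
    ]

# Hypothetical data: tier 1 holds POS tags, tier 2 holds the words.
items = [
    Item(1, 0.0, 0.2, "DET"), Item(1, 0.2, 0.7, "NOUN"), Item(1, 0.7, 1.1, "VERB"),
    Item(2, 0.0, 0.2, "the"), Item(2, 0.2, 0.7, "cat"), Item(2, 0.7, 1.1, "sleeps"),
]
nouns = aligned_labels(items, "NOUN", search_tier=1, display_tier=2)
```

With this sample data, nouns holds the single label "cat".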

As another example, suppose you now want to extract all the adverbs that end with -ly. You could do the following:

Assuming that tier 2 contains exactly one word per interval, this will successfully extract all the adverbs on tier 2 that end with -ly.
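A rough Python equivalent of this two-constraint query is given below. The data and the function aligned_matches are hypothetical, and the patterns ADV and ly$ are assumptions about what such a query might contain; the sketch only shows why the one-word-per-interval assumption matters, since ly$ is anchored to the end of the whole label.

```python
import re

# Hypothetical items as (start, end, label); tier 1 = POS, tier 2 = words.
pos_tier = [(0.0, 0.4, "ADV"), (0.4, 0.9, "VERB"), (0.9, 1.5, "ADV")]
word_tier = [(0.0, 0.4, "often"), (0.4, 0.9, "runs"), (0.9, 1.5, "quickly")]

def aligned_matches(pos_items, word_items, pos_pattern, word_pattern):
    """Words whose aligned POS item matches pos_pattern and which match word_pattern."""
    # Index words by their boundaries so aligned items can be looked up directly.
    words_by_span = {(start, end): label for start, end, label in word_items}
    results = []
    for start, end, label in pos_items:
        word = words_by_span.get((start, end))
        if word is None:
            continue  # no exactly aligned item on the word tier
        if re.search(pos_pattern, label) and re.search(word_pattern, word):
            results.append(word)
    return results

ly_adverbs = aligned_matches(pos_tier, word_tier, r"ADV", r"ly$")
```

Both "often" and "quickly" are tagged ADV in the sample, but only "quickly" satisfies the second constraint.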

Precedence relation

Two items are in a precedence relation if one immediately follows the other on the same tier. You can search for arbitrarily long sequences by chaining search items on the same tier. When you specify a sequence, Dolmen will retrieve the text from the display tier that is included within the span defined by the sequence.

Suppose that you have a POS tier (tier 1) and a word tier (tier 2), as in the alignment examples. Instead of searching for a single word, you might be interested in looking for word sequences. To find all the DET+NOUN sequences, you could do the following:

Dolmen will first look for all DET items on tier 1, and will keep only those that are followed by a NOUN item on the same tier. It will then display the text that results from the concatenation of all the items on tier 2 within the span determined by the beginning of the DET item and by the end of the NOUN item on tier 1.
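The two steps (finding the sequence, then collecting the display text) can be sketched as follows. The functions find_sequences and display_text and the sample tiers are hypothetical; the sketch assumes that items are sorted by time, that "immediately follows" means adjacency in that order, and that labels are joined with a space.

```python
# Hypothetical items as (start, end, label); tier 1 = POS, tier 2 = words.
pos_tier = [(0.0, 0.2, "DET"), (0.2, 0.7, "NOUN"), (0.7, 1.1, "VERB"),
            (1.1, 1.3, "DET"), (1.3, 1.6, "ADJ")]
word_tier = [(0.0, 0.2, "the"), (0.2, 0.7, "cat"), (0.7, 1.1, "chased"),
             (1.1, 1.3, "a"), (1.3, 1.6, "small")]

def find_sequences(search_items, labels):
    """Return (start, end) spans where consecutive items carry the given labels."""
    spans = []
    n = len(labels)
    for i in range(len(search_items) - n + 1):
        window = search_items[i:i + n]
        if all(item[2] == label for item, label in zip(window, labels)):
            # Span runs from the start of the first item to the end of the last.
            spans.append((window[0][0], window[-1][1]))
    return spans

def display_text(display_items, span):
    """Concatenate the labels of the display-tier items lying within the span."""
    start, end = span
    return " ".join(label for s, e, label in display_items if s >= start and e <= end)

results = [display_text(word_tier, span)
           for span in find_sequences(pos_tier, ["DET", "NOUN"])]
```

The second DET in the sample is followed by an ADJ rather than a NOUN, so only "the cat" is returned. Longer sequences work the same way, for example ["DET", "NOUN", "VERB"].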

Dominance relation

An item a dominates an item b if a and b are on different tiers, the left boundary of b is greater than or equal to that of a, and the right boundary of b is less than or equal to that of a. Dominance relations typically encode hierarchical structures, for instance word > syllable > segment.

Suppose you have 3 tiers in your file: the first one contains spans which denote syllables ("syll"), the second one contains syllabic constituents ("Onset", "Nucleus", "Coda") and the last one individual segments ("p", "a", "t"...). In order to retrieve all syllables that end in a coda, you could do the following:

This query will first get all the items that have a syll label on the first tier; then, for each of those, it will look for a label Coda on tier 2 within the limits of the span on tier 1; for each item which matches both conditions, it will display the concatenated text of the items on tier 3 that are dominated by the matching item on tier 1.
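The same steps can be written out as a small Python sketch. It is an illustration only: the Item class, the function names and the sample syllables are hypothetical, and labels are compared by exact equality rather than by pattern.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    tier: int      # tier index
    start: float   # left boundary (seconds)
    end: float     # right boundary (seconds)
    text: str      # label

def dominates(a, b):
    """a dominates b: different tiers, and b lies within the boundaries of a."""
    return a.tier != b.tier and b.start >= a.start and b.end <= a.end

# Hypothetical data: two syllables, "pat" (with a coda) and "ta" (without).
items = [
    Item(1, 0.0, 0.3, "syll"), Item(1, 0.3, 0.5, "syll"),
    Item(2, 0.0, 0.1, "Onset"), Item(2, 0.1, 0.2, "Nucleus"), Item(2, 0.2, 0.3, "Coda"),
    Item(2, 0.3, 0.4, "Onset"), Item(2, 0.4, 0.5, "Nucleus"),
    Item(3, 0.0, 0.1, "p"), Item(3, 0.1, 0.2, "a"), Item(3, 0.2, 0.3, "t"),
    Item(3, 0.3, 0.4, "t"), Item(3, 0.4, 0.5, "a"),
]

def syllables_with_coda(items):
    """Segment text of every tier-1 syllable that dominates a Coda on tier 2."""
    results = []
    for syll in (i for i in items if i.tier == 1 and i.text == "syll"):
        has_coda = any(
            i.tier == 2 and i.text == "Coda" and dominates(syll, i) for i in items
        )
        if has_coda:
            # Display tier 3: concatenate the segments dominated by the syllable.
            segments = [i.text for i in items if i.tier == 3 and dominates(syll, i)]
            results.append("".join(segments))
    return results

closed_syllables = syllables_with_coda(items)
```

Only the first syllable dominates a Coda item, so the result contains "pat" and not "ta".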