Just as an email has an addressee, a subject line, a main body, and maybe some attachments, the map has a very specific group of elements which are always in the same order and must follow a required format.

What is a map?

A map is the equivalent of a field in database terms, combined with what other software systems such as Microsoft Office often call a locator, search, or find-and-replace feature.

A map defines the name of the data, defines the rules used to capture the data, and acts as its final container, all in one. Combining these concepts reduces the learning curve and technical overhead, which is what makes datamap easily accessible and unique. Anyone who has experience writing command-line code or database queries will immediately recognise the style of the map syntax. For anyone less familiar with these concepts, a map essentially tells the software, in a few lines, what it should do to retrieve data under certain constraints.

MAP block

The MAP block defines the name for the container of the data.

This can be considered similar to creating a database field in a table, whose purpose is to store a similar type of data extracted from all documents in the folder.

With the default settings, it will be the column name in the final results spreadsheet in Microsoft Excel.

Figure 1 – the MAP block

There is only one MAP block.

FIND block

The FIND block defines the format of each piece of data returned by the map.

It must contain a literal word or a regular expression (a standard syntax for specifying a pattern of characters to search for). By default, the search is not case sensitive.

Figure 2 – the FIND block

There is only one FIND block.
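
The behaviour of a case-insensitive FIND pattern can be sketched outside datamap. The example below is a minimal illustration in Python, not datamap syntax: the pattern, the sample text, and the function name `find_candidates` are all hypothetical, and the regular expression engine used by datamap may differ in detail from Python's `re` module.

```python
import re

# Hypothetical FIND pattern: an invoice reference such as "INV-00123".
FIND_PATTERN = r"inv-\d{5}"


def find_candidates(text, pattern=FIND_PATTERN, case_sensitive=False):
    """Return every piece of text that matches the pattern.

    Matching ignores case unless case_sensitive is True, mirroring the
    default described for the FIND block.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    return [match.group(0) for match in re.finditer(pattern, text, flags)]


sample = "Ref INV-00123 replaces inv-00042."
candidates = find_candidates(sample)
strict_candidates = find_candidates(sample, case_sensitive=True)
print(candidates)
print(strict_candidates)
```

With case sensitivity switched off, both references are returned as potential results; with it switched on, only the lower-case reference matches the lower-case pattern.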

WITHIN blocks

WITHIN blocks are supporting rules that influence the score of each result and help identify or prioritise all the pieces of data returned. You can add many of these blocks to a single map.

These lines have the most complex and variable structure of all the blocks, while still being straightforward to read and understand from the user perspective. They set the distance (and optionally the direction) within which a potential result must lie in relation to each keyword for the associated confidence boost to be applied.

Figure 3 – WITHIN blocks

The elements of a WITHIN block are as follows:

Block Type: Sets whether this is a positive or negative scoring rule. For example:
  • WITHIN (will boost potential results)
  • NOT WITHIN (will reduce the confidence of potential results)

Distance: The distance, in points, that the top left corner of the result must be from the top left corner of the keyword. For example:
  • 50 pt (a result very close to the keyword)
  • 0 pt (a result directly on top of the keyword)

Direction (optional): The direction, as a compass point, that the potential result must lie in relation to the keyword. For example:
  • (N or S) (a result either above or below the keyword)
  • E and NE and SE (a result generally to the right of the keyword)
  • W NW SW (a result generally to the left of the keyword)
Spacer words such as ‘or’ and ‘and’, parentheses, and slashes are ignored. This gives you some freedom in how you express the direction.

Restriction: Whether there are any restrictions on where the result lies in the character story. For example:
  • of (no restriction)
  • before (the result lies before the keyword in the story)
  • after (the result lies after the keyword in the story)

Weight (optional): The confidence boost that will be applied from this WITHIN block if it satisfies the conditions. For example:
  • STRONG (applies a confidence of 100)
  • WEAK (applies a confidence of 40)
  • 20 (applies a confidence of 20)
  • NEUTRAL / 0 (applies a confidence of 0 – useful for restricting keywords that you do not want to actually influence the score)
  • INHERITED / ? (applies original confidence when another map’s result is referenced)
The weight may not be negative. The weight is unit-less but can be considered a percentage. If unspecified, the default weight is 80.

Object: Whether the comparing object is a new unique keyword or the result of another map. For example:
  • KEYWORD (a keyword used for this rule only, which is not returned as a result to any map)
  • RESULT (the result of another map, which is treated like a keyword to score the result of the current map)
If multiple instances of the object are found, it only scores once, and the closest instance to the potential result is used for applying the proximity score.

Regular Expression / Map Name: Even though keywords are normally simple phrases, you can specify a regular expression to search for them more flexibly. If this WITHIN block references the result of another map, state the individual map name here. For example:
  • Invoice Num(ber)? (applies the score if it finds the phrase ‘Invoice Number’ or ‘Invoice Num’ within the search criteria)
  • Map1 (applies the score if it finds a result from ‘Map1’ within the search criteria)
By default, regular expressions are not case sensitive.

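
The weight and direction rules listed above can be illustrated with a short Python sketch. This is not how datamap is implemented; the function names `resolve_weight` and `parse_directions` are hypothetical, and the sketch only reproduces the documented values (STRONG = 100, WEAK = 40, NEUTRAL = 0, default = 80) and the rule that spacer words and punctuation are ignored.

```python
import re

NAMED_WEIGHTS = {"STRONG": 100, "WEAK": 40, "NEUTRAL": 0}
DEFAULT_WEIGHT = 80
COMPASS_POINTS = {"N", "NE", "E", "SE", "S", "SW", "W", "NW"}


def resolve_weight(token=None, inherited=None):
    """Turn a weight token into a number, following the documented values."""
    if token is None:
        return DEFAULT_WEIGHT  # weight left unspecified
    token = token.strip().upper()
    if token in NAMED_WEIGHTS:
        return NAMED_WEIGHTS[token]
    if token in ("INHERITED", "?"):
        if inherited is None:
            raise ValueError("INHERITED needs the referenced result's confidence")
        return inherited
    weight = int(token)  # raises ValueError for anything non-numeric
    if weight < 0:
        raise ValueError("the weight may not be negative")
    return weight


def parse_directions(text):
    """Keep only compass points; 'or', 'and', parentheses and slashes are dropped."""
    directions = []
    for word in re.findall(r"[A-Za-z]+", text):
        point = word.upper()
        if point in COMPASS_POINTS and point not in directions:
            directions.append(point)
    return directions


print(resolve_weight("WEAK"), resolve_weight())
print(parse_directions("(N or S)"), parse_directions("W/NW/SW"))
```

Because the parser keeps only recognised compass points, `(N or S)`, `N S`, and `N/S` are all treated as the same direction set in this sketch.
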
Block Type

A normal WITHIN block aims to boost the score and priority of a potential result.

Use the syntax NOT WITHIN when you have two potential results which are easily confused, and proximity alone is not enough to discern the correct result from the incorrect one. When the conditions are met, the weight is subtracted from the confidence of the result, pushing it further down the final list.
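
The effect of positive and negative blocks on a single potential result can be sketched as follows. This is an illustrative Python model only: the `Rule` class and `score_result` function are hypothetical names, and the sketch assumes straight-line distance between top left corners and treats a keyword exactly at the stated distance as within range. Neither assumption is confirmed by the datamap documentation.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Rule:
    negative: bool            # True for NOT WITHIN, False for WITHIN
    distance_pt: float        # maximum distance in points
    weight: int               # confidence applied when the rule is satisfied
    keyword_positions: list   # (x, y) top left corner of each keyword instance


def score_result(result_xy, rules):
    """Add up the confidence contributed by every satisfied rule."""
    confidence = 0
    for rule in rules:
        if not rule.keyword_positions:
            continue  # keyword not found, so the rule cannot apply
        # Only the closest instance counts, so each rule scores at most once.
        nearest = min(
            hypot(kx - result_xy[0], ky - result_xy[1])
            for kx, ky in rule.keyword_positions
        )
        if nearest <= rule.distance_pt:
            confidence += -rule.weight if rule.negative else rule.weight
    return confidence


rules = [
    Rule(False, 200, 80, [(60, 100), (500, 500)]),  # WITHIN 200 pt, weight 80
    Rule(True, 50, 40, [(100, 130)]),               # NOT WITHIN 50 pt, weight 40
    Rule(False, 50, 100, [(400, 100)]),             # WITHIN 50 pt, too far away
]
total = score_result((100, 100), rules)
print(total)  # 80 - 40 = 40
```

The first rule boosts the result, the second reduces it, and the third contributes nothing because its keyword lies outside the stated distance.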

Distance

Distance is specified in points (pt), a standard unit in Microsoft Word that is most notably used for font size.

One point is 1/72 of an inch.

The default map templates come with a preset distance of 200 pt. This is a medium distance within the context of an A4-size document, enough to narrow down the area but not be too restrictive.
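
To put the 200 pt preset in context, the conversion can be worked through in a few lines of Python. The only facts used are that one point is 1/72 of an inch, one inch is 25.4 mm, and an A4 page measures 210 mm by 297 mm; the function names are illustrative.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def pt_to_inches(points):
    return points / POINTS_PER_INCH


def pt_to_mm(points):
    return points / POINTS_PER_INCH * MM_PER_INCH


def mm_to_pt(mm):
    return mm / MM_PER_INCH * POINTS_PER_INCH


preset_pt = 200
a4_width_pt = mm_to_pt(210)
a4_height_pt = mm_to_pt(297)

print(round(pt_to_inches(preset_pt), 2))      # 2.78 inches
print(round(pt_to_mm(preset_pt), 1))          # 70.6 mm
print(round(a4_width_pt), round(a4_height_pt))  # 595 842
print(round(preset_pt / a4_width_pt, 2))      # 0.34 of the page width
```

In other words, 200 pt is roughly 7 cm, or about a third of the width of an A4 page.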

Direction

Using directions requires careful decision making, since the position of text elements in an OCR representation is not always predictable.

Even though datamap has been designed to work effectively without requiring directions, you can specify them.

Each compass direction covers an angular region rather than a single line, for maximum coverage and flexibility.

Figure 3.1 – the datamap compass, illustrating the angular regions that represent each supported direction
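
One way to picture angular regions is to divide the compass into eight equal sectors. The Python sketch below does exactly that, but note the assumptions: the 45 degree sector width, the sector boundaries, and page coordinates in which y increases downwards are all choices made for this illustration, and the actual regions used by datamap are those shown in the figure. The function name `compass_direction` is hypothetical.

```python
from math import atan2, degrees

# Sectors listed anticlockwise starting from east, 45 degrees each (assumed).
SECTORS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]


def compass_direction(keyword_xy, result_xy):
    """Return the compass sector in which the result lies relative to the keyword."""
    dx = result_xy[0] - keyword_xy[0]
    dy = keyword_xy[1] - result_xy[1]     # flipped so that "above" is positive
    angle = degrees(atan2(dy, dx)) % 360  # 0 = east, 90 = north
    index = int((angle + 22.5) // 45) % 8
    return SECTORS[index]


keyword = (100, 100)
print(compass_direction(keyword, (200, 100)))  # E: directly to the right
print(compass_direction(keyword, (100, 50)))   # N: directly above
print(compass_direction(keyword, (150, 50)))   # NE: above and to the right
print(compass_direction(keyword, (50, 150)))   # SW: below and to the left
```

Under this model, a result slightly above the line of the keyword still counts as E, which is why an angular region is more forgiving than a strict horizontal or vertical test.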

Restriction

A story is a Microsoft Word concept that essentially means the main text of the document, treated as a continuous range of characters. Objects in a document such as text and structural elements are generally laid out top to bottom, left to right. If you know that a certain result always lies before or after a keyword in the story, you can use the relevant restriction.

This is most useful when capturing data as a table – you know that all table content will lie after the first column header, and before some keyword below the table such as the summary of costs or tax information.

Note the following important points on how restrictions work:

  • Restrictions are not linked to directions – the before / after refers to character position in a continuous range of text
    • Results that wrap over a text line could end up W of the keyword or in some other unpredictable direction
    • For such situations and capturing data in the body of text, restrictions are actually more reliable than directions
  • Restrictions are absolute – potential results that are not within the final restricted search area / range (which is calculated first and highlighted yellow when testing with comments) will not be returned, no matter how many positive keywords are nearby
    • If there is a chance that your result could be on the borderline and could slip above or below the keyword when Microsoft Word creates the document structure and full text story, we recommend not using restrictions for this case
  • When restricting results before keywords, datamap makes the search area as large as possible by only cutting off the end of the range at the latest instance of any restricting keyword from any WITHIN block
  • When restricting results after keywords, datamap makes the search area as large as possible by only cutting off the start of the range at the earliest instance of any restricting keyword from any WITHIN block
  • A restriction need only be stated once for a map set and will reduce the potential results for all maps in the set
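
The way the restricted search range is derived from the points above can be sketched in Python. This is a simplified model, not datamap code: the function names are hypothetical, positions are character offsets into the story, and the sketch assumes the range is cut at the start offset of each restricting keyword.

```python
def restricted_range(story_length, before_positions, after_positions):
    """Return (start, end) character offsets of the final search range.

    before_positions: offsets of keywords the result must lie before.
    after_positions:  offsets of keywords the result must lie after.
    """
    # "after" restrictions cut the start at the earliest keyword instance.
    start = min(after_positions) if after_positions else 0
    # "before" restrictions cut the end at the latest keyword instance.
    end = max(before_positions) if before_positions else story_length
    return start, end


def is_returned(result_offset, search_range):
    """Restrictions are absolute: results outside the range are dropped."""
    start, end = search_range
    return start <= result_offset < end


# A table header found at offsets 120 and 400, a footer keyword at 700 and 850.
search_range = restricted_range(1000, before_positions=[700, 850],
                                after_positions=[120, 400])
print(search_range)
print(is_returned(90, search_range), is_returned(500, search_range))
```

Here the range runs from the earliest header instance to the latest footer instance, so a result at offset 500 is kept while one at offset 90 is discarded regardless of how many positive keywords surround it.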

RESULTS block

The RESULTS block is the final collection of results ordered by confidence and proximity to the keywords (which are cumulative totals).

Figure 4 – the RESULTS block

In the taskpane, only these limited elements are displayed, but more are visible when testing with comments (along with positional highlighting).

The full results collection is accessible via the Visual Basic script platform within Microsoft Word, allowing for custom configuration prior to display and export to Microsoft Excel or your chosen location.
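
The ordering of the results collection described for the RESULTS block can be illustrated with a small Python sketch. It assumes that higher confidence ranks first and that, for equal confidence, a smaller cumulative distance to the keywords ranks first; the `Candidate` class, its field names, and the sample values are hypothetical and do not reflect the Visual Basic object model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    confidence: int       # cumulative confidence from all satisfied blocks
    proximity_pt: float   # cumulative distance to the keywords, in points


def rank(candidates):
    """Order by confidence (highest first), then by proximity (closest first)."""
    return sorted(candidates, key=lambda c: (-c.confidence, c.proximity_pt))


ranked = rank([
    Candidate("2023-04-01", 80, 150.0),
    Candidate("2023-03-15", 120, 90.0),
    Candidate("2023-05-20", 80, 60.0),
])
print([c.text for c in ranked])
```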