PDF files

itemis ANALYZE’s PDF adapter identifies fragments of a PDF document as traceable artifacts. These artifacts can be identified by either their textual contents or by comments in the PDF document.

Data access

Configuration

Open the ANALYZE configuration with the ANALYZE configuration editor, and add a new data access as described in section "Data accesses". Select PDF files as data access type.

Supported options:

  • resource – File filter pattern

Example:

resource "*.pdf"

This configuration specifies that ANALYZE should load and analyze all files residing in the workspace whose filename extension is .pdf.

Artifact type

Configuration

Open the ANALYZE configuration with the ANALYZE configuration editor, and add a new artifact type as described in section "Artifact types". Select your previously-configured PDF files data access in the Data access drop-down list.

Keywords

The PDF artifact type configuration supports the following keywords:

  • analyze comments – Analyzes the comments in the PDF document and creates artifacts from them.
  • locate text where pattern matches – Looks up textual contents matching the specified regular expression in the configured files. For each text sequence matching the regular expression an artifact will be created.
  • name expr – Specifies the name of the artifact as the value of the expr expression. By default, this is the text fragment matched by the pattern.
  • group(name) – Retrieves the text matched by the capture group named name in the pattern.
  • identified by – An optional key for the matched artifact. If specified, it should be a value that uniquely identifies the artifact. If several matching text elements yield the same value, ANALYZE creates only a single artifact for them.
  • map – Starts a mapping block for specifying custom attributes.

Example for analyzing comments:

analyze comments

Example for text parsing and artifact extraction:

locate text where pattern matches "(?sm)(?<id>\\[A:.*?])(?<txt>([^\\[])*)" {
	name group("id")
	identified by group("$1")
	map {
		attr to group("txt").substringBefore("HEADER") + group("txt").substringAfter("FOOTER")
	}
}

The pattern switches on dotall and multiline mode ((?sm)), so that . also matches line breaks, and finds all occurrences of specific text elements. Such a text element starts with "[A:" (\\[A:), is followed by zero or more characters (.*?, matched non-greedily), and ends with a closing square bracket (]).

The group txt contains the text that follows, up to the next opening square bracket (([^\\[])*).

Extracted artifacts are named according to the value of the captured group named id. Artifacts are identified by the value of the first matching group, here id. The custom attribute attr is mapped to the txt group.
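
To see which text ends up in which group, the pattern can be tried out outside of ANALYZE. The following is a minimal sketch that assumes the pattern follows Java regular expression semantics; the class name PdfPatternSketch, the extract helper, and the sample text are made up for illustration and are not part of ANALYZE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfPatternSketch {
    // Same pattern as in the configuration example, written as a Java string literal.
    // Group 1 is "id", group 2 is "txt".
    static final Pattern ARTIFACT =
            Pattern.compile("(?sm)(?<id>\\[A:.*?])(?<txt>([^\\[])*)");

    // Hypothetical helper: returns one {id, txt} pair per match, in document order.
    static List<String[]> extract(String text) {
        List<String[]> result = new ArrayList<>();
        Matcher m = ARTIFACT.matcher(text);
        while (m.find()) {
            result.add(new String[] { m.group("id"), m.group("txt") });
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample text standing in for the extracted PDF contents.
        String sample = "[A:REQ-1] The system shall start.\n"
                + "[A:REQ-2] The system\nshall stop.";
        for (String[] hit : extract(sample)) {
            System.out.println(hit[0] + " -> " + hit[1].trim());
        }
    }
}
```

Note that the txt group keeps leading and trailing whitespace, including line breaks; the sketch trims it only for printing.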

If a txt element spans more than one page, attr will include the page's header and footer texts. If you want to strip them off, you can use the substringBefore and substringAfter methods as shown. To actually use the example's code snippet, replace HEADER with the header's first characters and FOOTER with the footer's last characters. These character sequences should be unique so that they match only in the header and in the footer, respectively, and nowhere else.
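
The effect of combining substringBefore and substringAfter can be sketched in plain Java. This is an illustration under stated assumptions, not ANALYZE's implementation: the two helpers are hypothetical stand-ins, substringBefore is assumed to return the whole text when the marker is absent, substringAfter is assumed to return an empty string in that case, and the header and footer are assumed to appear in the extracted text as one contiguous block that starts with the header's first characters and ends with the footer's last characters.

```java
public class HeaderFooterSketch {
    // Hypothetical helper: text before the first occurrence of marker,
    // or the whole text if the marker does not occur (assumed behavior).
    static String substringBefore(String text, String marker) {
        int i = text.indexOf(marker);
        return i < 0 ? text : text.substring(0, i);
    }

    // Hypothetical helper: text after the first occurrence of marker,
    // or an empty string if the marker does not occur (assumed behavior).
    static String substringAfter(String text, String marker) {
        int i = text.indexOf(marker);
        return i < 0 ? "" : text.substring(i + marker.length());
    }

    // Mirrors the mapping expression: everything before the header start
    // plus everything after the footer end.
    static String strip(String txt, String headerStart, String footerEnd) {
        return substringBefore(txt, headerStart) + substringAfter(txt, footerEnd);
    }

    public static void main(String[] args) {
        // Invented sample: "HDR" and "FTR" stand for the unique marker sequences.
        String txt = "first part HDR running title page 3 FTR second part";
        System.out.println(strip(txt, "HDR", "FTR"));
    }
}
```

Under these assumptions, a text without any page break passes through unchanged, because the first helper returns it completely and the second contributes nothing.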