DPL XML Documents

XML

The XML matcher parses XML documents, creating a structure that adapts to the content of the XML document.

The following mapping rules apply when using the XML matcher:

  • Elements are represented as fields.
  • Child elements form nested fields.
  • Attributes are mapped to nested fields by adding @ as a prefix in the field name. For example, attribute <... attr="..."> becomes field @attr.
  • In the presence of attributes or child elements, the text content of elements is represented as a nested #text field.
  • The #text field becomes an array when mixing child elements (or comments) with multiple text content parts.
  • For elements without attributes or child elements, the text content becomes the value of the corresponding field.
  • Elements that occur multiple times are mapped to arrays.
  • Namespace prefixes are included in the field name. For example, element <prefix:element> becomes field prefix:element.
  • XML declarations (version and encoding attributes) are discarded.
  • Attribute values and text content are represented as strings.
  • Comments are discarded.
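
The mapping rules above can be approximated outside DPL to get a feel for the resulting structure. The following is a minimal Python sketch built on the standard `xml.dom.minidom` module; it is not the matcher's implementation. The function name `to_fields` is hypothetical, and skipping whitespace-only text between elements is an assumption made for this illustration.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def to_fields(element):
    """Map one DOM element to a field value, following the rules listed above."""
    fields = {}
    # Attributes become nested fields prefixed with "@".
    for name, value in element.attributes.items():
        fields["@" + name] = value

    text_parts = []  # text runs, split wherever a child element or comment sits
    current = ""
    for node in element.childNodes:
        if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            current += node.data
            continue
        # A child element or comment ends the current text run.
        # Assumption: whitespace-only runs are ignored.
        if current.strip():
            text_parts.append(current)
        current = ""
        if node.nodeType == Node.ELEMENT_NODE:
            value = to_fields(node)
            if node.tagName in fields:
                # Elements that occur multiple times are collected into an array.
                if not isinstance(fields[node.tagName], list):
                    fields[node.tagName] = [fields[node.tagName]]
                fields[node.tagName].append(value)
            else:
                fields[node.tagName] = value
    if current.strip():
        text_parts.append(current)

    if not fields:
        # No attributes or child elements: the text is the field value itself.
        return "".join(text_parts)
    if text_parts:
        # One text run stays a string; several runs become an array.
        fields["#text"] = text_parts[0] if len(text_parts) == 1 else text_parts
    return fields


doc = minidom.parseString(
    '<thread id="1"><topic>XML Parsing</topic>'
    '<message><sender>Alice</sender></message>'
    '<message><sender>Bob</sender></message>'
    '<note type="x">Some <!-- hidden --> included.</note></thread>'
)
root = doc.documentElement
result = {root.tagName: to_fields(root)}
print(result)
```

In this sketch, the prefixed tag name reported by `minidom` (for example `xhtml:b`) is used directly as the field name, which mirrors the namespace rule above.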

Alternatively, you can use the XML_PLAIN or XML_VERBOSE matchers with a fixed output structure.

output type: variant_object

quantifier: none

configuration:

  • rootTag = string specifying the name of the XML element from which to parse content. The first occurrence is selected if multiple such elements exist. By default, the matcher parses the whole XML document.
  • excludeRoot = Boolean value. "true" excludes the root element from the output. The default is false.
  • maxlen = numeric value representing the maximum byte size of an XML document. Set it to parse XML documents larger than the default limit of 128000 bytes.
  • charset = character set name enclosed in single or double quotes (for example, charset="ISO-8859-1").
  • locale = string specifying an IETF BCP 47 language tag enclosed in single or double quotes (see the list). The default locale is English.
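
To make the effect of rootTag and excludeRoot more concrete, the following Python sketch imitates the two options on top of `xml.dom.minidom`. It is an illustration under stated assumptions, not the matcher's implementation: the names `parse_xml`, `root_tag`, `exclude_root`, and `element_to_fields` are hypothetical, the mapping is simplified (attributes and mixed text are ignored), and returning `None` when the tag is absent is an assumption.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def element_to_fields(element):
    """Simplified mapping: child elements become nested fields, leaves become text."""
    children = [n for n in element.childNodes if n.nodeType == Node.ELEMENT_NODE]
    if not children:
        return "".join(
            n.data for n in element.childNodes
            if n.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
        )
    fields = {}
    for child in children:
        value = element_to_fields(child)
        if child.tagName in fields:
            # Repeated elements are collected into an array.
            if not isinstance(fields[child.tagName], list):
                fields[child.tagName] = [fields[child.tagName]]
            fields[child.tagName].append(value)
        else:
            fields[child.tagName] = value
    return fields


def parse_xml(xml_text, root_tag=None, exclude_root=False):
    doc = minidom.parseString(xml_text)
    root = doc.documentElement  # default: the whole document
    if root_tag is not None:
        matches = doc.getElementsByTagName(root_tag)
        if not matches:
            return None  # assumption: nothing to parse when the tag is absent
        root = matches[0]  # first occurrence in document order
    fields = element_to_fields(root)
    # With exclude_root, the selected root element's own name is left out.
    return fields if exclude_root else {root.tagName: fields}


xml_text = (
    "<messages><thread><topic>A</topic></thread>"
    "<thread><topic>B</topic></thread></messages>"
)
print(parse_xml(xml_text))
print(parse_xml(xml_text, root_tag="thread"))
print(parse_xml(xml_text, root_tag="thread", exclude_root=True))
```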

Example

An XML document:

<?xml version="1.0" encoding="UTF-8"?>
<messages xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <thread id="1">
    <topic>XML Parsing</topic>
    <!-- comment -->
    <message id="101">
      <sender>Alice</sender>
      <content type="plain">&lt;b&gt;text&lt;/b&gt;</content>
    </message>
    <message id="102">
      <sender>Bob</sender>
      <content type="plain"><![CDATA[<b>text</b>]]></content>
    </message>
    <message id="103">
      <sender>John</sender>
      <content type="xhtml">More <xhtml:b>text</xhtml:b> here.</content>
    </message>
    <message id="104">
      <sender>Mary</sender>
      <content type="xhtml">Some <!-- hidden text --> included.</content>
    </message>
  </thread>
</messages>

This document can be parsed using the pattern:

XML:xml

The result is:

name | value | type
xml[messages][@xmlns:xhtml] | http://www.w3.org/1999/xhtml | STRING
xml[messages][thread][@id] | 1 | STRING
xml[messages][thread][topic] | XML Parsing | STRING
xml[messages][thread][message][0][@id] | 101 | STRING
xml[messages][thread][message][0][sender] | Alice | STRING
xml[messages][thread][message][0][content][@type] | plain | STRING
xml[messages][thread][message][0][content][#text] | <b>text</b> | STRING
xml[messages][thread][message][1][@id] | 102 | STRING
xml[messages][thread][message][1][sender] | Bob | STRING
xml[messages][thread][message][1][content][@type] | plain | STRING
xml[messages][thread][message][1][content][#text] | <b>text</b> | STRING
xml[messages][thread][message][2][@id] | 103 | STRING
xml[messages][thread][message][2][sender] | John | STRING
xml[messages][thread][message][2][content][@type] | xhtml | STRING
xml[messages][thread][message][2][content][xhtml:b] | text | STRING
xml[messages][thread][message][2][content][#text] | ['More ',' here.'] | STRING_ARRAY
xml[messages][thread][message][3][@id] | 104 | STRING
xml[messages][thread][message][3][sender] | Mary | STRING
xml[messages][thread][message][3][content][@type] | xhtml | STRING
xml[messages][thread][message][3][content][#text] | ['Some ',' included.'] | STRING_ARRAY

XML_PLAIN

The XML_PLAIN matcher produces a streamlined version of the XML data by discarding the attributes of XML elements. This is helpful when attribute information is unnecessary: the output structure is simpler, and the parsed data is easier to work with.

XML_PLAIN uses the following mapping rules:

  • Elements are represented as fields.
  • Child elements form nested fields.
  • Attributes of elements are discarded.
  • The text content of elements becomes the value of the corresponding field.
  • Text content mixed with child elements is discarded.
  • Elements that occur multiple times are mapped to arrays.
  • Namespace prefixes are included in the field name. For example, element <prefix:element> becomes field prefix:element.
  • XML declarations (version and encoding attributes) are discarded.
  • Text content is represented as strings.
  • Comments are discarded.
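
As an illustration of how the XML_PLAIN rules differ from the XML matcher's rules, the following Python sketch applies them with the standard `xml.dom.minidom` module. It is a sketch under stated assumptions rather than the matcher's implementation: the function name `to_plain_fields` is hypothetical, and text parts separated only by a comment are simply concatenated.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def to_plain_fields(element):
    """Map one DOM element following the XML_PLAIN rules listed above."""
    children = [n for n in element.childNodes if n.nodeType == Node.ELEMENT_NODE]
    if not children:
        # Leaf element: the text (including CDATA) becomes the field value.
        # Attributes and comments are never looked at, so they are discarded.
        return "".join(
            n.data for n in element.childNodes
            if n.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
        )
    # Child elements are present: any text mixed with them is discarded.
    fields = {}
    for child in children:
        value = to_plain_fields(child)
        if child.tagName in fields:
            # Elements that occur multiple times are collected into an array.
            if not isinstance(fields[child.tagName], list):
                fields[child.tagName] = [fields[child.tagName]]
            fields[child.tagName].append(value)
        else:
            fields[child.tagName] = value
    return fields


doc = minidom.parseString(
    '<thread id="1"><topic>XML Parsing</topic>'
    '<message id="101"><content type="plain"><![CDATA[<b>text</b>]]></content></message>'
    '<message id="102"><content>More <b>text</b> here.</content></message>'
    '</thread>'
)
plain = to_plain_fields(doc.documentElement)
print(plain)
```

Compared with a mapping that keeps attributes, every leaf here is a plain string, so no `@` or `#text` fields appear in the output.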

output type: variant_object

quantifier: none

configuration:

  • rootTag = string specifying the name of the XML element from which to parse content. The first occurrence is selected if multiple such elements exist. By default, the matcher parses the whole XML document.
  • excludeRoot = Boolean value. "true" excludes the root element from the output. The default is false.
  • maxlen = numeric value representing the maximum byte size of an XML document. Set it to parse XML documents larger than the default limit of 128000 bytes.
  • charset = character set name enclosed in single or double quotes (for example, charset="ISO-8859-1").
  • locale = string specifying an IETF BCP 47 language tag enclosed in single or double quotes (see the list here). The default locale is English.

Example

An XML document:

<?xml version="1.0" encoding="UTF-8"?>
<messages xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <thread id="1">
    <topic>XML Parsing</topic>
    <!-- comment -->
    <message id="101">
      <sender>Alice</sender>
      <content type="plain">&lt;b&gt;text&lt;/b&gt;</content>
    </message>
    <message id="102">
      <sender>Bob</sender>
      <content type="plain"><![CDATA[<b>text</b>]]></content>
    </message>
    <message id="103">
      <sender>John</sender>
      <content type="xhtml">More <xhtml:b>text</xhtml:b> here.</content>
    </message>
    <message id="104">
      <sender>Mary</sender>
      <content type="xhtml">Some <!-- hidden text --> included.</content>
    </message>
  </thread>
</messages>

This document can be parsed using the pattern:

XML_PLAIN:xml

The result is:

name | value | type
xml[messages][thread][topic] | XML Parsing | STRING
xml[messages][thread][message][0][sender] | Alice | STRING
xml[messages][thread][message][0][content] | <b>text</b> | STRING
xml[messages][thread][message][1][sender] | Bob | STRING
xml[messages][thread][message][1][content] | <b>text</b> | STRING
xml[messages][thread][message][2][sender] | John | STRING
xml[messages][thread][message][2][content][xhtml:b] | text | STRING
xml[messages][thread][message][3][sender] | Mary | STRING
xml[messages][thread][message][3][content] | Some included. | STRING

XML_VERBOSE

The XML_VERBOSE matcher produces the most detailed data structure when parsing XML documents. In contrast to the XML matcher, its output structure is fixed and does not depend on the presence of element attributes or child elements.

XML_VERBOSE uses the following mapping rules:

  • Elements are represented as fields.
  • Child elements form nested fields.
  • Attributes are mapped to nested fields by adding @ as a prefix in the field name. For example, attribute <... attr="..."> becomes field @attr.
  • The text content of elements is represented as a nested #text field.
  • The #text field becomes an array when mixing child elements (or comments) with multiple text content parts.
  • Elements that occur multiple times are mapped to arrays.
  • Namespace prefixes are included in the field name. For example, element <prefix:element> becomes field prefix:element.
  • XML declarations (version and encoding attributes) are mapped to fields at the root element level.
  • Attribute values and text content are represented as strings.
  • Comments are discarded.

output type: variant_object

quantifier: none

configuration:

  • rootTag = string specifying the name of the XML element from which to parse content. The first occurrence is selected if multiple such elements exist. By default, the matcher parses the whole XML document.
  • excludeRoot = Boolean value. "true" excludes the root element from the output. The default is false.
  • maxlen = numeric value representing the maximum byte size of an XML document. Set it to parse XML documents larger than the default limit of 128000 bytes.
  • charset = character set name enclosed in single or double quotes (for example, charset="ISO-8859-1").
  • locale = string specifying an IETF BCP 47 language tag enclosed in single or double quotes (see the list here). The default locale is English.

Example

An XML document:

<?xml version="1.0" encoding="UTF-8"?>
<messages xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <thread id="1">
    <topic>XML Parsing</topic>
    <!-- comment -->
    <message id="101">
      <sender>Alice</sender>
      <content type="plain">&lt;b&gt;text&lt;/b&gt;</content>
    </message>
    <message id="102">
      <sender>Bob</sender>
      <content type="plain"><![CDATA[<b>text</b>]]></content>
    </message>
    <message id="103">
      <sender>John</sender>
      <content type="xhtml">More <xhtml:b>text</xhtml:b> here.</content>
    </message>
    <message id="104">
      <sender>Mary</sender>
      <content type="xhtml">Some <!-- hidden text --> included.</content>
    </message>
  </thread>
</messages>

This document can be parsed using the pattern:

XML_VERBOSE:xml

The result is:

name | value | type
xml[@version] | 1.0 | STRING
xml[@encoding] | UTF-8 | STRING
xml[messages][@xmlns:xhtml] | http://www.w3.org/1999/xhtml | STRING
xml[messages][thread][@id] | 1 | STRING
xml[messages][thread][topic] | XML Parsing | STRING
xml[messages][thread][message][0][@id] | 101 | STRING
xml[messages][thread][message][0][sender] | Alice | STRING
xml[messages][thread][message][0][content][@type] | plain | STRING
xml[messages][thread][message][0][content][#text] | <b>text</b> | STRING
xml[messages][thread][message][1][@id] | 102 | STRING
xml[messages][thread][message][1][sender] | Bob | STRING
xml[messages][thread][message][1][content][@type] | plain | STRING
xml[messages][thread][message][1][content][#text] | <b>text</b> | STRING
xml[messages][thread][message][2][@id] | 103 | STRING
xml[messages][thread][message][2][sender] | John | STRING
xml[messages][thread][message][2][content][@type] | xhtml | STRING
xml[messages][thread][message][2][content][xhtml:b] | text | STRING
xml[messages][thread][message][2][content][#text] | ['More ',' here.'] | STRING_ARRAY
xml[messages][thread][message][3][@id] | 104 | STRING
xml[messages][thread][message][3][sender] | Mary | STRING
xml[messages][thread][message][3][content][@type] | xhtml | STRING
xml[messages][thread][message][3][content][#text] | ['Some ',' included.'] | STRING_ARRAY