XML Input Stream (StAX)

Description

The XML Input Stream (StAX) step in the Input plugin for Process Studio workflows reads data from any XML file using a StAX parser. The existing Get Data from XML step is easier to use but relies on DOM parsers, which must hold the parsed content in memory. Purging parts of the file helps, but it is often insufficient when those parts are themselves very large. Choose this step when other steps reach their limits or when you need to parse XML files under the following conditions:

  • You need fast processing whose memory use does not grow with file size. The streaming approach supports files of several gigabytes or more.
  • You need flexible parsing to read different parts of the XML file in different ways and avoid parsing the file multiple times.
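
To illustrate the streaming idea described above, here is a minimal sketch in Python. It uses the standard library's `xml.sax` module, which is a different streaming API from StAX and is swapped in only for illustration. The names `RowEmitter` and `xml_to_rows` and the `(data type, name, value)` row layout are assumptions made for this sketch, not the step's actual implementation.

```python
import xml.sax


class RowEmitter(xml.sax.ContentHandler):
    """Passes one (data_type, name, value) row per parser event to `emit`."""

    def __init__(self, emit):
        super().__init__()
        self._emit = emit   # callback that receives each row as it is produced
        self._text = []     # character data can arrive in several chunks

    def _flush_text(self):
        # Emit buffered text as one trimmed CHARACTERS row, skipping blanks.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self._emit(("CHARACTERS", None, text))

    def startDocument(self):
        self._emit(("START_DOCUMENT", None, None))

    def endDocument(self):
        self._emit(("END_DOCUMENT", None, None))

    def startElement(self, name, attrs):
        self._flush_text()
        self._emit(("START_ELEMENT", name, None))
        for attr_name, attr_value in attrs.items():
            self._emit(("ATTRIBUTE", attr_name, attr_value))

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self._emit(("END_ELEMENT", name, None))


def xml_to_rows(xml_bytes):
    # Collecting into a list is for demonstration only; a real pipeline
    # would hand each row downstream instead of keeping all of them.
    rows = []
    xml.sax.parseString(xml_bytes, RowEmitter(rows.append))
    return rows


sample = b'<products><product id="1"><name>Widget</name></product></products>'
for row in xml_to_rows(sample):
    print(row)
```

Because the handler only buffers the text of the element currently being read, memory use depends on the largest single text node rather than on the size of the file.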

Configurations

No. | Field Name | Description
1 | Step name | Specify the name of the step as it appears in the workflow workspace. The name must be unique within a single workflow.
2 | Filename | Specify the file name of the input XML file.
3 | Add filename to result? | Select Add filename to result? to include the processed XML filename in the workflow results. The step stores a unique list in memory, which you can use in the next job entry, for example in another workflow.
4 | Skip (Elements/Attributes) | Specify the number of elements and attributes to skip. Use this to start processing at a specific location in the file. The parser still reads the skipped part of the file, but produces no rows for it.
5 | Limit (Elements/Attributes) | Specify the number of elements and attributes after which processing stops. Combined, Skip and Limit enable chunked loading driven by an outer loop.
6 | Default String Length (Elements / Attributes) | Specify the default string length for the XML data name and value fields.
7 | Encoding | Specify the encoding of the XML file.
8 | Add Namespace information? | Enable the checkbox to add the XML data type NAMESPACE to the stream, with an optional prefix (given in the XML data name) and URI information (given in the XML data value). For the element data types, a defined prefix is also prepended to the XML data name, e.g. prefix:product.

Performance considerations: The extra namespace handling slightly reduces processing throughput.
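
The chunked loading that the Skip and Limit fields make possible can be sketched as follows. This is a simplified illustration, not the step's implementation: it counts whole rows, whereas the step counts elements and attributes, and it assumes the limit is applied after the skipped part. `read_chunk` and `all_rows` are hypothetical names.

```python
from itertools import islice


def read_chunk(rows, skip, limit):
    """Return up to `limit` rows after discarding the first `skip` rows."""
    # islice still consumes the skipped rows from the iterator; it just
    # does not hand them on, like a parser that reads but emits nothing.
    return list(islice(rows, skip, skip + limit))


def all_rows():
    # Hypothetical stand-in for the step's output: 10 numbered rows.
    return (f"row-{n}" for n in range(1, 11))


# Outer loop: request consecutive chunks of 4 until one comes back empty.
chunk_size = 4
skip = 0
while True:
    chunk = read_chunk(all_rows(), skip, chunk_size)
    if not chunk:
        break
    print(skip, chunk)
    skip += chunk_size
```

Each pass re-reads the source from the start, so later chunks cost more to reach than earlier ones. That is the trade-off of skipping inside a streaming parser.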

9 | Trim strings? | Enable the checkbox to trim the names and values of all elements and attributes. Trimming removes spaces, tabs, carriage returns, and line feeds from the beginning and end of each string.
10 | Include filename in output? / Fieldname | Enable the checkbox to add the processed filename to the given fieldname.
11 | Row number in output? / Fieldname | Enable the checkbox to add the processed row number (starting with 1) to the given fieldname.
12 | XML data type (numeric) in output? / Fieldname | Enable the checkbox to add the processed data type in numeric format to the given fieldname.

The following data types are defined:

0-"UNKNOWN" (not used, reserved)
1-"START_ELEMENT"
2-"END_ELEMENT"
3-"PROCESSING_INSTRUCTION" (not used, reserved)
4-"CHARACTERS"
5-"COMMENT" (not used, reserved)
6-"SPACE" (not used, reserved)
7-"START_DOCUMENT"
8-"END_DOCUMENT"
9-"ENTITY_REFERENCE" (not used, reserved)
10-"ATTRIBUTE"
11-"DTD" (not used, reserved)
12-"CDATA" (not used, reserved)
13-"NAMESPACE" (when namespace information is selected)
14-"NOTATION_DECLARATION" (not used, reserved)
15-"ENTITY_DECLARATION" (not used, reserved)
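
Downstream logic can compare against the numeric codes instead of strings. The sketch below assumes the data types above are numbered in list order from 0 (UNKNOWN) to 15 (ENTITY_DECLARATION), which for 1 to 15 coincides with the event constants in Java's `javax.xml.stream.XMLStreamConstants`. `XmlDataType`, `values_of_type`, and the sample rows are hypothetical.

```python
from enum import IntEnum


class XmlDataType(IntEnum):
    # Assumed numbering: list order, starting at 0.
    UNKNOWN = 0
    START_ELEMENT = 1
    END_ELEMENT = 2
    PROCESSING_INSTRUCTION = 3
    CHARACTERS = 4
    COMMENT = 5
    SPACE = 6
    START_DOCUMENT = 7
    END_DOCUMENT = 8
    ENTITY_REFERENCE = 9
    ATTRIBUTE = 10
    DTD = 11
    CDATA = 12
    NAMESPACE = 13
    NOTATION_DECLARATION = 14
    ENTITY_DECLARATION = 15


def values_of_type(rows, data_type):
    """Return the values of all (code, name, value) rows with the given code."""
    return [value for code, _name, value in rows if code == data_type]


# Hypothetical rows as the step might emit them with the numeric field enabled.
rows = [
    (1, "product", None),
    (10, "id", "1"),
    (4, None, "Widget"),
    (2, "product", None),
]
print(values_of_type(rows, XmlDataType.CHARACTERS))
print(XmlDataType(13).name)
```

Integer comparison avoids carrying a description string on every row, in line with the performance advice for big data loads.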

13 | XML data type (description) in output? / Fieldname | Enable the checkbox to add the processed data type in text format to the given fieldname. Use this instead of the numeric data type to make the workflow easier to read. See XML data type (numeric) for a list of values. Performance considerations: Because strings are slower to process and consume more memory, use the numeric data type format for big data loads.
14 | XML location line in output? / Fieldname | Enable the checkbox to add the line of the processed location in the source XML to the given fieldname.
15 | XML location column in output? / Fieldname | Enable the checkbox to add the column of the processed location in the source XML to the given fieldname.
16 | XML element ID in output? / Fieldname | Enable the checkbox to add the processed element number (starting with 0) to the given fieldname. In contrast to the row number, this field is incremented for each new element, not for each new row. The correct nesting between levels is ensured.
17 | XML parent element ID in output? / Fieldname | Enable the checkbox to add the parent element number to the given fieldname.

Note: Used together, the XML element ID and the XML parent element ID make the complete XML element tree available for later use.
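
As a rough illustration of how the two ID fields describe a tree, the sketch below rebuilds the nesting from rows that carry an element ID and a parent element ID. The row layout, the use of `None` as the parent of the top element, and the helper names `build_children` and `render` are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical START_ELEMENT rows: (element_id, parent_element_id, name)
rows = [
    (0, None, "products"),
    (1, 0, "product"),
    (2, 1, "name"),
    (3, 1, "price"),
    (4, 0, "product"),
]


def build_children(rows):
    """Map each element ID to its name and to the IDs of its children."""
    names = {}
    children = defaultdict(list)
    for element_id, parent_id, name in rows:
        names[element_id] = name
        if parent_id is not None:
            children[parent_id].append(element_id)
    return names, children


def render(element_id, names, children, depth=0):
    """Return the subtree under element_id as indented lines."""
    lines = ["  " * depth + names[element_id]]
    for child_id in children.get(element_id, []):
        lines.extend(render(child_id, names, children, depth + 1))
    return lines


names, children = build_children(rows)
print("\n".join(render(0, names, children)))
```

The rows arrive in document order, so a single pass is enough to build the parent-to-children map.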

18 | XML element level in output? / Fieldname | Enable the checkbox to add the processed element level (starting with 0 for the root START_DOCUMENT and END_DOCUMENT) to the given fieldname.
19 | XML path in output? / Fieldname | Enable the checkbox to add the processed XML path to the given fieldname.
20 | XML parent path in output? / Fieldname | Enable the checkbox to add the processed XML parent path to the given fieldname.
21 | XML data name in output? / Fieldname | Enable the checkbox to add the processed data name of elements, attributes, and optional namespace prefixes to the given fieldname.
22 | XML data value in output? / Fieldname | Enable the checkbox to add the processed data value of elements, attributes, and optional namespace URIs to the given fieldname.
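
The level, path, and parent path fields can all be derived from a stack of open elements. The sketch below shows one way this might work. The slash-separated path format and the empty path for document-level rows are assumptions, and `annotate` is a hypothetical helper rather than part of the step.

```python
def annotate(events):
    """Yield (data_type, name, level, path, parent_path) for each event."""
    stack = []  # names of the currently open elements
    for data_type, name in events:
        if data_type == "START_ELEMENT":
            stack.append(name)
        level = len(stack)  # 0 at document level, 1 for the root element
        path = "/" + "/".join(stack) if stack else ""
        parent_path = "/" + "/".join(stack[:-1]) if len(stack) > 1 else ""
        yield data_type, name, level, path, parent_path
        if data_type == "END_ELEMENT":
            stack.pop()  # the element is closed only after its row is emitted


# Hypothetical event stream for <products><product/></products>
events = [
    ("START_DOCUMENT", None),
    ("START_ELEMENT", "products"),
    ("START_ELEMENT", "product"),
    ("END_ELEMENT", "product"),
    ("END_ELEMENT", "products"),
    ("END_DOCUMENT", None),
]
for row in annotate(events):
    print(row)
```

In this sketch an END_ELEMENT row carries the same level and path as its matching START_ELEMENT row, which makes it easy to pair the two downstream.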