XML Input Stream (StAX)

Description

XML Input Stream (StAX) is a step in the Input Plugin for Process Studio Workflows. It reads data from any type of XML file using the StAX parser. The existing Get Data from XML step is easier to use, but it relies on DOM parsers that process the file in memory, and even purging parts of the file is not sufficient when those parts are very large.

Choose this step when other steps reach their limits, or when XML parsing has to meet the following requirements:

- Very fast, with memory use independent of the file size (files of several GB and more are possible thanks to the streaming approach).

- Very flexible: different parts of the XML file can be read in different ways, which avoids parsing the file several times.
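
To illustrate why a streaming parser keeps memory use low, here is a minimal sketch. StAX itself is a Java API, so this example uses `iterparse` from Python's standard library as a stand-in for the same pull-style idea. It is not the step's implementation, and the function name `stream_events` and the sample XML are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_events(source):
    """Yield (event, tag, text) tuples one at a time instead of building a full tree."""
    for event, elem in ET.iterparse(source, events=("start", "end")):
        # Text is only reliably available once the element has been closed.
        text = (elem.text or "").strip() if event == "end" else ""
        yield event, elem.tag, text
        if event == "end":
            # Drop the element's text, attributes and children once it has been reported.
            elem.clear()


# Hypothetical sample document; a real input would be a (possibly very large) file.
sample = io.BytesIO(
    b"<products><product id='1'>Widget</product>"
    b"<product id='2'>Gadget</product></products>"
)
rows = list(stream_events(sample))
for row in rows:
    print(row)
```

Each event is handled and then released, so the working set stays small however long the document is. A DOM-style parser, by contrast, holds the whole tree in memory.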

Configurations

No. | Field Name | Description
1 | Step name | Specify the name of the step as it appears in the workflow workspace. This name must be unique within a single workflow.
2 | Filename | Specify the file name of the input XML file.
3 | Add filename to result? | Enable the checkbox to add the processed XML filename to the result of this workflow. A unique list is kept in memory and can be used by the next job entry in a job, for example in another workflow.
4 | Skip (Elements/Attributes) | Specify the number of elements/attributes to skip. This can be used to start processing at a specific location in a file. The parser still reads the file, but no rows are produced for the skipped part.
5 | Limit (Elements/Attributes) | Specify the number of elements/attributes after which processing stops. Together, the Skip and Limit properties make it possible to load a file in chunks, with the chunks defined in an outer loop.
6 | Default String Length (Elements / Attributes) | Specify the default string length for the XML data name and value fields.
7 | Encoding | Specify the encoding of the XML file.
8 | Add Namespace information? | Enable the checkbox to add the XML data type NAMESPACE to the stream, with an optional prefix (given in the XML data name) and URI information (given in the XML data value). In addition, a prefix defined for an ELEMENT data type is prepended to the XML data name, e.g. prefix:product.

Performance considerations: Because of the extra namespace handling, this option slightly reduces processing throughput.

9 | Trim strings? | Enable the checkbox to trim all names and values of elements and attributes. Trimming removes white space, tab, CR and LF characters at the beginning and end of the string.
10 | Include filename in output? / Fieldname | Enable the checkbox to add the processed filename to the given fieldname.
11 | Row number in output? / Fieldname | Enable the checkbox to add the processed row number (starting with 1) to the given fieldname.
12 | XML data type (numeric) in output? / Fieldname | Enable the checkbox to add the processed data type in numeric format to the given fieldname.

The following data types are defined:

0 - "UNKNOWN" (not used, reserved)

1 - "START_ELEMENT"

2 - "END_ELEMENT"

3 - "PROCESSING_INSTRUCTION" (not used, reserved)

4 - "CHARACTERS"

5 - "COMMENT" (not used, reserved)

6 - "SPACE" (not used, reserved)

7 - "START_DOCUMENT"

8 - "END_DOCUMENT"

9 - "ENTITY_REFERENCE" (not used, reserved)

10 - "ATTRIBUTE"

11 - "DTD" (not used, reserved)

12 - "CDATA" (not used, reserved)

13 - "NAMESPACE" (when namespace information is selected)

14 - "NOTATION_DECLARATION" (not used, reserved)

15 - "ENTITY_DECLARATION" (not used, reserved)
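
As a sketch of how the numeric data type might be used downstream, the lookup below maps each code to its description and filters out rows that carry no element, attribute, or text data. The codes 1 to 15 are assumed to follow the constants in Java's `javax.xml.stream.XMLStreamConstants`, which agrees with the 14 and 15 entries above. The code 0 for UNKNOWN is assumed from the list order. The names `XML_DATA_TYPES`, `describe`, and the sample rows are hypothetical.

```python
# Assumed mapping of numeric XML data type codes to their descriptions.
XML_DATA_TYPES = {
    0: "UNKNOWN",
    1: "START_ELEMENT",
    2: "END_ELEMENT",
    3: "PROCESSING_INSTRUCTION",
    4: "CHARACTERS",
    5: "COMMENT",
    6: "SPACE",
    7: "START_DOCUMENT",
    8: "END_DOCUMENT",
    9: "ENTITY_REFERENCE",
    10: "ATTRIBUTE",
    11: "DTD",
    12: "CDATA",
    13: "NAMESPACE",
    14: "NOTATION_DECLARATION",
    15: "ENTITY_DECLARATION",
}


def describe(code):
    """Return the text description for a numeric code, falling back to UNKNOWN."""
    return XML_DATA_TYPES.get(code, "UNKNOWN")


# Hypothetical output rows: (numeric data type, data name, data value).
rows = [
    (7, None, None),
    (1, "product", None),
    (10, "id", "1"),
    (4, "product", "Widget"),
    (2, "product", None),
    (8, None, None),
]

# Keep only the rows that carry element starts, attributes or text.
wanted = {1, 4, 10}
payload = [(describe(code), name, value) for code, name, value in rows if code in wanted]
print(payload)
```

Filtering on the integer code and translating to text only for display follows the recommendation to prefer the numeric format for large loads.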

13 | XML data type (description) in output? / Fieldname | Enable the checkbox to add the processed data type in text format to the given fieldname. Use this instead of the numeric data type when readability of the workflow matters. See XML data type (numeric) for a list of values. Performance considerations: Because strings are slower to process and consume extra memory, the numeric data type format is recommended for big data loads.
14 | XML location line in output? / Fieldname | Enable the checkbox to add the processed source XML location line to the given fieldname.
15 | XML location column in output? / Fieldname | Enable the checkbox to add the processed source XML location column to the given fieldname.
16 | XML element ID in output? / Fieldname | Enable the checkbox to add the processed element number (starting with 0) to the given fieldname. In contrast to the Row number, this field is incremented for each new element, not for each new row. The correct nesting between levels is ensured.
17 | XML parent element ID in output? / Fieldname | Enable the checkbox to add the parent element number to the given fieldname.

Note: Using the XML element ID together with the XML parent element ID makes a complete XML element tree available for later use.
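
The note above can be sketched as follows: given rows that carry an element ID and a parent element ID, a later step could rebuild the nesting. This is only an illustration. The sample rows are made up, one row per element is assumed, and a parent ID of `None` for the root is an assumption rather than documented behavior of the step.

```python
from collections import defaultdict

# Hypothetical rows: (element_id, parent_element_id, data_name).
rows = [
    (0, None, "products"),
    (1, 0, "product"),
    (2, 1, "name"),
    (3, 0, "product"),
    (4, 3, "name"),
]


def build_tree(rows):
    """Group element IDs under their parent ID and return nested dictionaries."""
    children = defaultdict(list)
    names = {}
    roots = []
    for element_id, parent_id, name in rows:
        names[element_id] = name
        if parent_id is None:
            roots.append(element_id)
        else:
            children[parent_id].append(element_id)

    def node(element_id):
        return {
            "id": element_id,
            "name": names[element_id],
            "children": [node(child) for child in children[element_id]],
        }

    return [node(root) for root in roots]


def render(nodes, depth=0):
    """Return an indented outline of the tree, one line per element."""
    lines = []
    for n in nodes:
        lines.append("  " * depth + n["name"])
        lines.extend(render(n["children"], depth + 1))
    return lines


tree = build_tree(rows)
outline = render(tree)
print("\n".join(outline))
```

Because every row points at its parent, the tree can be rebuilt in one pass over the rows without parsing the XML again.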

18 | XML element level in output? / Fieldname | Enable the checkbox to add the processed element level (starting with 0 for the root START_DOCUMENT and END_DOCUMENT) to the given fieldname.
19 | XML path in output? / Fieldname | Enable the checkbox to add the processed XML path to the given fieldname.
20 | XML parent path in output? / Fieldname | Enable the checkbox to add the processed XML parent path to the given fieldname.
21 | XML data name in output? / Fieldname | Enable the checkbox to add the processed data name of elements, attributes and optional namespace prefixes to the given fieldname.
22 | XML data value in output? / Fieldname | Enable the checkbox to add the processed data value of elements, attributes and optional namespace URIs to the given fieldname.
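
The chunk loading that the Skip and Limit fields allow can be pictured with the following sketch. It is not the step's implementation. It treats every row as one counted unit, which simplifies the element/attribute counting described above, and the names `read_chunk`, `chunked`, and `make_events` are hypothetical.

```python
from itertools import islice


def read_chunk(events, skip, limit):
    """Return at most `limit` rows after discarding the first `skip` rows."""
    return list(islice(events, skip, skip + limit))


def chunked(make_events, chunk_size):
    """Outer loop: re-read the source for each chunk and move the skip offset forward."""
    skip = 0
    while True:
        chunk = read_chunk(make_events(), skip, chunk_size)
        if not chunk:
            break
        yield chunk
        skip += chunk_size


def make_events():
    # Hypothetical stand-in for one run of the step over the same file.
    return iter(["row-%d" % i for i in range(1, 8)])


chunks = list(chunked(make_events, 3))
for chunk in chunks:
    print(chunk)
```

Each pass reads the source from the beginning, in line with the description of the Skip field: the skipped part is still read by the parser, but no rows are produced for it.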