Reservoir Sampling

Description

Reservoir Sampling selects a random subset of rows from an input stream with uniform probability, so every incoming row has an equal chance of being included in the sample. This step is useful when you need to reduce a large dataset to a manageable size for testing, analysis, or quality checks without introducing selection bias. You control the sample size directly: setting it to zero passes all rows through, while a negative value blocks all rows. A configurable random seed lets you produce repeatable samples across workflow runs.
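The behavior described above can be illustrated with a minimal Python sketch of the classic single-pass reservoir technique (often called Algorithm R). This is an illustration under stated assumptions, not the step's actual implementation: the function name reservoir_sample and its parameters are hypothetical, and the handling of zero and negative sample sizes simply mirrors the description above.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    rows: Iterable[T], sample_size: int, seed: Optional[int] = None
) -> List[T]:
    """Hypothetical helper: uniform sample of `sample_size` rows in one pass."""
    # Assumed to mirror the documented settings: negative blocks all rows,
    # zero passes all rows through unchanged.
    if sample_size < 0:
        return []
    if sample_size == 0:
        return list(rows)

    rng = random.Random(seed)  # a fixed seed gives a repeatable sample
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < sample_size:
            # Fill the reservoir with the first `sample_size` rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, i]; the row replaces a reservoir entry
            # only if the slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    stream = range(1_000)
    first = reservoir_sample(stream, 10, seed=42)
    second = reservoir_sample(stream, 10, seed=42)
    print(first == second)  # same seed, same input: identical sample
```

Because each row after the first sample_size rows replaces a reservoir entry with probability sample_size / (i + 1), every row ends up in the final sample with equal probability, and the stream never has to be held in memory as a whole. If the stream contains fewer rows than the sample size, this sketch returns all of them.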

Configurations

Field Name | Description
S Tab
Step name | Specify the name of the step as it appears in the workflow workspace. The name must be unique within a single workflow.
Sample size | Specify how many rows to sample from the incoming stream. A value of 0 causes all rows to be sampled; a negative value blocks all rows.
Random seed | Specify a seed for the random number generator. Rerunning a workflow with a different seed value results in a different random sample.