Reservoir Sampling
Description
Reservoir Sampling selects a random subset of rows from an input stream with uniform probability, so every incoming row has an equal chance of being included in the sample. This step is useful when you need to reduce a large dataset to a manageable size for testing, analysis, or quality checks without introducing selection bias. You control the sample size directly: setting it to zero passes all rows through, while a negative value blocks all rows. A configurable random seed lets you produce repeatable samples across workflow runs.
Configurations
| Field Name | Description |
|---|---|
| S Tab | |
| Step name | Specify the name of the step as it appears in the workflow workspace. This name must be unique within a single workflow. |
| Sample size | Select how many rows to sample from the incoming stream. A value of 0 causes all rows to be sampled (every row passes through); a negative value blocks all rows. |
| Random seed | Choose a seed for the random number generator. Running the workflow again with the same seed produces a repeatable sample; running it with a different seed results in a different random sample being chosen. |
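
To illustrate how the Sample size and Random seed settings interact, here is a minimal Python sketch of the classic reservoir sampling technique (often called Algorithm R). It is not the step's actual implementation, which is not described here. The function name `reservoir_sample`, its parameters, and the default seed are hypothetical names chosen for this example, and the handling of zero and negative sizes simply mirrors the behavior described in the table above.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], sample_size: int, seed: int = 1) -> List[T]:
    """Return a uniform random sample of at most sample_size rows.

    Hypothetical helper for illustration only; the zero/negative handling
    mirrors the Sample size field described above.
    """
    if sample_size < 0:
        return []           # negative size: block all rows
    if sample_size == 0:
        return list(rows)   # zero: pass every row through

    rng = random.Random(seed)  # seeded generator makes the sample repeatable
    reservoir: List[T] = []

    for i, row in enumerate(rows):
        if i < sample_size:
            # Fill the reservoir with the first sample_size rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, i]; the row is kept only if the slot
            # falls inside the reservoir, which keeps selection uniform.
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = row

    return reservoir


if __name__ == "__main__":
    stream = range(1000)
    first = reservoir_sample(stream, sample_size=10, seed=42)
    second = reservoir_sample(stream, sample_size=10, seed=42)
    print(first == second)  # same seed, same input order: identical sample
```

Because the sketch reads the stream once and holds only `sample_size` rows in memory, it works without knowing the total row count in advance. Note that in this sketch repeatability depends on the rows arriving in the same order as well as on the seed.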