Similarity Lite
Description
Builds a model from the input training data and then, using that model, predicts the answer to the input query, writing the result to the specified output field. Both phases are carried out in a single execution.
Note: Use this step when the data volume is small relative to the hardware configuration of the machine in use.
The step finds the sentences most similar to the input query within input supplied as a single sentence, a paragraph, or longer text.
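The behavior described above can be sketched in plain Python. This is a minimal, hypothetical illustration using scikit-learn (the function name `top_n_similar` and the sample sentences are assumptions, not part of the step's actual implementation): the input rows are vectorized, the query is scored against them, and the Top n closest matches are returned.

```python
# Hypothetical sketch of what the Similarity Lite step does internally:
# vectorize the input sentences, score the query against them with
# cosine similarity, and return the Top n most similar rows.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_n_similar(sentences, query, n=3):
    """Return the n (sentence, score) pairs most similar to the query."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(sentences)   # "build" phase
    query_vec = vectorizer.transform([query])          # "predict" phase
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    ranked = sorted(zip(sentences, scores), key=lambda p: p[1], reverse=True)
    return ranked[:n]

sentences = [
    "The weather today is good",
    "Stock prices fell sharply",
    "It is a sunny day outside",
]
for text, score in top_n_similar(sentences, "How is the weather today?", n=2):
    print(f"{score:.2f}  {text}")
```

Note that both the model build and the prediction happen in the same call, mirroring the single-execution behavior of the step.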
Configurations
No. | Field Name | Description |
---|---|---|
General tab | ||
1 | Step Name | Specify the name of the step. Step names should be unique within a workflow. |
2 | Number of Rows to Process | Specify the total number of rows to be taken as input. (Default value: 500) |
3 | Build using AE Model Version | Select from the dropdown which Python version to use for building the model and making predictions. |
4 | Query | Specify which columns/features are to be considered for building the model. |
5 | Top n results | Specify the number of closest matches to the query to be fetched as output. |
Field Mapping tab | ||
1 | Feature / Name | Feature or name used during the model-building step. |
2 | Text Preprocessing | Preprocessing options used to process the text/string. Refer to the "Classification Model Builder" step documentation. |
3 | Target Field | Specify the output field name into which the prediction value will be written. |
When a feature of type string is processed, as mentioned in the 'Text Preprocessing' row of the table above, it must be converted into numeric features. The Text Vectorization tab governs how all string features are converted into numeric features. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The table below shows how a string is tokenized internally for different n-gram values.
No. | String | N Gram Start/End | Tokens |
---|---|---|---|
1 | Weather today is good | 1-1 | 'Weather', 'today', 'good' |
2 | Weather today is good | 1-2 | 'Weather', 'today', 'good', 'Weather today', 'today good' |
3 | Weather today is good | 1-3 | 'Weather', 'today', 'good', 'Weather today', 'today good', 'Weather today good' |
4 | Weather today is good | 2-3 | 'Weather today', 'today good', 'Weather today good' |
Note: 'is' is treated as a stop word and is not considered.
No. | Field Name | Description |
---|---|---|
Text Vectorization Tab | ||
1 | N Gram start | Must be a numeric value with a minimum of 1. |
2 | N Gram end | Must be a numeric value greater than or equal to N Gram start. |
3 | Vectorization | The N-Gram operation tokenizes the input string feature. Vectorization is the operation in which these tokens are converted into the numeric features required by the algorithms. Three types of vectorizers are supported. - Count Vectorizer: counts the number of times a token appears in the document and uses this count as its weight. - Tfidf Vectorizer: TF-IDF stands for "term frequency-inverse document frequency"; the weight assigned to each token depends not only on its frequency in a document but also on how common the term is across the entire corpus. - Hashing Vectorizer: designed to be as memory-efficient as possible; instead of storing tokens as strings, it applies the hashing trick to encode them as numerical indexes. The downside of this method is that, once vectorized, the feature names can no longer be retrieved. |