Machine-Learning Preprocessing Functions in Confluent Cloud for Apache Flink¶
The following built-in functions are available for ML preprocessing in Confluent Cloud for Apache Flink®. These functions help transform features into representations more suitable for downstream processors.
ML_BUCKETIZE¶
Bucketizes numerical values into discrete bins based on split points.
- Syntax
ML_BUCKETIZE(value, splitBucketPoints [, bucketNames])
- Description
- The
ML_BUCKETIZE
function divides numerical values into discrete buckets based on specified split points. Each bucket represents a range of values, and the function returns the bucket index or name for each input value. - Arguments
- value: Numerical expression to be bucketized. If the input value is
NaN
orNULL
, it is bucketized to theNULL
bucket. - splitBucketPoints: Array of numerical values that define the bucket
boundaries, or split points.
- If the
splitBucketPoints
array is empty, an exception is thrown. - Any split points that are
NaN
orNULL
are removed from thesplitBucketPoints
array. splitBucketPoints
must be in ascending order, or an exception is thrown.- Duplicates are removed from
splitBucketPoints
.
- If the
- bucketNames: (Optional) Array of names of the buckets defined in
splitBucketPoints
.- If the
bucketNames
array is not provided, buckets are namedbin_NULL
,bin_1
,bin_2
…bin_n
, withn
being the total number of buckets insplitBucketPoints
. - If the
bucketNames
array is provided, names must be in the same order as in thesplitBucketPoints
array. - Names for all of the buckets must be provided, including the
NULL
bucket, or an exception is thrown. - If the
bucketNames
array is provided, the first name is the name for theNULL
bucket.
- If the
- value: Numerical expression to be bucketized. If the input value is
- Example
-- returns 'bin_2' SELECT ML_BUCKETIZE(2, ARRAY[1, 4, 7]); -- returns 'b2' SELECT ML_BUCKETIZE(2, ARRAY[1, 4, 7], ARRAY['b_null','b1','b2','b3','b4']);
ML_CHARACTER_TEXT_SPLITTER¶
Splits text into chunks based on character count and separators.
- Syntax
ML_CHARACTER_TEXT_SPLITTER(text, chunkSize, chunkOverlap, separator, isSeparatorRegex [, trimWhitespace] [, keepSeparator] [, separatorPosition])
- Description
The
ML_CHARACTER_TEXT_SPLITTER
function splits text into chunks based on character count and specified separators. This is useful for processing large text documents into smaller, manageable pieces.If any argument other than
text
isNULL
, an exception is thrown.The returned array of chunks has the same order as the input.
The function tries to keep every chunk within the
chunkSize
limit, but if a chunk is more than the limit, it is returned as is.- Arguments
- text: The input text to be split. If the input text is
NULL
, it is returned as is. - chunkSize: The size of each chunk. If
chunkSize < 0
orchunkOverlap > chunkSize
, an exception is thrown. - chunkOverlap: The number of overlapping characters between chunks. If
chunkOverlap < 0
, an exception is thrown. - separator: The separator used for splitting.
- isSeparatorRegex: Whether the separator is a regex pattern.
- trimWhitespace: (Optional) Whether to trim whitespace from chunks.
The default is
TRUE
. - keepSeparator: (Optional) Whether to keep the separator in the chunks.
The default is
FALSE
. - separatorPosition: (Optional) The position of the separator. Valid
values are
START
orEND
. The default isSTART
.START
means place the separator at the start of the following chunk, andEND
means place the separator at the end of the previous chunk.
- text: The input text to be split. If the input text is
- Example
-- returns ['This is the text I would like to ch', 'o chunk up. It is the example text ', 'ext for this exercise'] SELECT ML_CHARACTER_TEXT_SPLITTER('This is the text I would like to chunk up. It is the example text for this exercise', 35, 4, '', TRUE, FALSE, TRUE, 'END');
ML_FILE_FORMAT_TEXT_SPLITTER¶
Splits text into chunks based on specific file format patterns.
- Syntax
ML_FILE_FORMAT_TEXT_SPLITTER(text, chunkSize, chunkOverlap, formatName, [trimWhitespace] [, keepSeparator] [, separatorPosition])
- Description
The
ML_FILE_FORMAT_TEXT_SPLITTER
function splits text into chunks based on specific file format patterns. It uses format-specific separators to split code intelligently or structure text.The returned array of chunks has the same order as the input.
The function starts splitting the chunks with the first separator in the separators list. If a chunk is bigger than
chunkSize
, the function splits the chunk recursively using the next separator in the separators list for the given file format. If separators are exhausted, and the remaining text is bigger thanchunkSize
, the function returns the smallest chunk possible, even though it is bigger thanchunkSize
.- Arguments
text: The input text to be split. If the input text is
NULL
, it is returned as is.chunkSize: The size of each chunk. If
chunkSize < 0
orchunkOverlap > chunkSize
, an exception is thrown.chunkOverlap: The number of overlapping characters between chunks. If
chunkOverlap < 0
, an exception is thrown.formatName: ENUM of the format names. Valid values are:
Valid values for formatName
C
CPP
CSHARP
ELIXIR
GO
HTML
JAVA
JAVASCRIPT
JSON
KOTLIN
LATEX
MARKDOWN
PHP
PYTHON
RUBY
RUST
SCALA
SQL
SWIFT
TYPESCRIPT
XML
- trimWhitespace: (Optional) Whether to trim whitespace from chunks.
The default is
TRUE
. - keepSeparator: (Optional) Whether to keep the separator in the chunks.
The default is
FALSE
. - separatorPosition: (Optional) The position of the separator. Valid
values are
START
orEND
. The default isSTART
.START
means place the separator at the start of the following chunk, andEND
means place the separator at the end of the previous chunk.
- Example
-- returns ['def hello_world():\n print("Hello, World!")', '# Call the function\nhello_world()'] SELECT ML_FILE_FORMAT_TEXT_SPLITTER('def hello_world():\n print("Hello, World!")\n\n# Call the function\nhello_world()\n', 50, 0, 'PYTHON');
ML_LABEL_ENCODER¶
Encodes categorical variables into numerical labels.
- Syntax
ML_LABEL_ENCODER(input, categories [, includeZeroLabel])
- Description
- The
ML_LABEL_ENCODER
function encodes categorical variables into numerical labels. Each unique category is assigned a unique integer label. - Arguments
- input: Input value to encode.
- If the input value is
NULL
,NaN
, orInfinity
, it is considered in the unknown category, which is given the0
label. - If the input value is not one of the
categories
, it is labeled as-1
or0
depending onincludeZeroLabel
:-1
ifincludeZeroLabel
is TRUE and0
ifincludeZeroLabel
is FALSE.
- If the input value is
- categories: Arrays of category values to encode input value to.
Category values must be the same type as the
input
value.- If the
categories
array is empty, all inputs are considered to be in the unknown category, which is given the0
label. - The
categories
array can’t beNULL
, or an exception is thrown. - The
categories
array can’t haveNULL
or duplicate values, or an exception is thrown. - The
categories
array must be sorted in ascending lexicographical order, or an exception is thrown.
- If the
- includeZeroLabel: (Optional) The start index for valid categories is
0
. The default isFALSE
.- If
includeZeroLabel
isTRUE
, the valid categories index starts at0
, and unknown values are labeled as-1
. - If
includeZeroLabel
isFALSE
, the valid categories index starts at1
, and unknown values are labeled as0
.
- If
- input: Input value to encode.
- Example
-- returns 1 SELECT ML_LABEL_ENCODER('abc', ARRAY['abc', 'def', 'efg', 'hikj']); -- returns 0 SELECT ML_LABEL_ENCODER('abc', ARRAY['abc', 'def', 'efg', 'hikj'], TRUE );
ML_MAX_ABS_SCALER¶
Scales numerical values by their maximum absolute value.
- Syntax
ML_MAX_ABS_SCALER(value, absoluteMax)
- Description
- The
ML_MAX_ABS_SCALER
function scales numerical values by dividing them by the maximum absolute value. This preserves zero entries in sparse data. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL
,NaN
, orInfinity
, it is returned as is. - absoluteMax: Absolute Maximum value of the feature data seen in the
dataset.
- If
absoluteMax
isNULL
orNaN
, an exception is thrown. - If
absoluteMax
isInfinity
,0
is returned. - If
absoluteMax
is0
, the scaled value is returned as is.
- If
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.2 SELECT ML_MAX_ABS_SCALER(1, 5);
ML_MIN_MAX_SCALER¶
Scales numerical values to a specified range using min-max normalization.
- Syntax
ML_MIN_MAX_SCALER(value, min, max)
- Description
- The
ML_MIN_MAX_SCALER
function scales numerical values to a specified range using min-max normalization. The function transforms values to the range[0, 1]
by default, or to a custom range ifmin
andmax
are specified. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL
,NaN
, orInfinity
, it is returned as is.- If
value > max
, it is set to1.0
. - If
value < min
, it is set to0.0
. - If
max == min
, the range is set to1.0
to avoid division by zero.
- If
- min: Minimum value of the feature data seen in the dataset. If
min
isNULL
,NaN
, orInfinity
, an exception is thrown. - max: Maximum value of the feature data seen in the dataset.
- If
max
isNULL
,NaN
, orInfinity
, an exception is thrown. - If
max < min
, an exception is thrown.
- If
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.25 SELECT ML_MIN_MAX_SCALER(2, 1, 5);
ML_NGRAMS¶
Generates n-grams from an array of strings.
- Syntax
ML_NGRAMS(input [, nValue] [, separator])
- Description
The
ML_NGRAMS
function generates n-grams from an array of strings. N-grams are contiguous sequences of n items from a given sample of text.The ordering of the returned output is the same as the
input
array.- Arguments
- input: Array of CHAR or VARCHAR to return n-gram for.
- If the
input
array hasNULL
, it is ignored while forming N-GRAMS. - If the
input
array isNULL
or empty, an empty N-GRAMS array is returned. - Empty strings in the
input
array are treated as is. - Strings with only whitespace are treated as empty strings.
- If the
- nValue: (Optional) N value of n-gram function. The default is
2
.- If
nValue < 1
, an exception is thrown. - If
nValue > input.size()
, an empty N-GRAMS array is returned.
- If
- separator: (Optional) Characters to join n-gram values with. The default is whitespace.
- input: Array of CHAR or VARCHAR to return n-gram for.
- Example
-- returns ['ab', 'cd', 'de', 'pwe'] SELECT ML_NGRAMS(ARRAY['ab', 'cd', 'de', 'pwe'], 1, '#'); -- returns ['ab#cd', 'cd#de'] SELECT ML_NGRAMS(ARRAY['ab','cd','de', NULL], 2, '#');
ML_NORMALIZER¶
Normalizes numerical values using p-norm normalization.
- Syntax
ML_NORMALIZER(value, normValue)
- Description
- The
ML_NORMALIZER
function normalizes numerical values using p-norm normalization. This scales each sample to have unit norm. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL
,NaN
, orInfinity
, it is returned as is. - normValue: Calculated norm value of the feature data using p-norm.
- If
normValue
isNULL
orNaN
, an exception is thrown. - If
normValue
isInfinity
,0
is returned. - If
normValue
is0
, which is only possible when all the values are0
, the input value is returned as is.
- If
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.6 SELECT ML_NORMALIZER(3.0, 5.0);
ML_ONE_HOT_ENCODER¶
Encodes categorical variables into a binary vector representation.
- Syntax
ML_ONE_HOT_ENCODER(input, categories [, dropLast] [, handleUnknown])
- Description
- The
ML_ONE_HOT_ENCODER
function encodes categorical variables into a binary vector representation. Each category is represented by a binary vector where only one element is 1 and the rest are 0. - Arguments
- input: Input value to encode. If the input value is
NULL
, it is considered to be in the unknown category. - categories: Array of category values to encode input value to. The
input
argument must be of same type as thecategories
array.- If the
categories
array is empty, an exception is thrown. - The
categories
array can’t beNULL
, or an exception is thrown. - The
categories
array can’t haveNULL
or duplicate values, or an exception is thrown.
- If the
- dropLast: (Optional) Whether to drop the last category. The
default is
TRUE
. By default, the last category is dropped, to prevent perfectly collinear features. - handleUnknown: (Optional)
ERROR
,IGNORE
,KEEP
options to indicate how to handle unknown values. The default isIGNORE
.- If
handleUnknown
isERROR
, an exception is thrown when the input is an unknown value. - If
handleUnknown
isIGNORE
, unknown values are ignored and values of all the columns are 0. - If
handleUnknown
isKEEP
, the unknown category column has value 1. - If
handleUnknown
isKEEP
, the last column is for the unknown category.
- If
- input: Input value to encode. If the input value is
- Example
-- returns [1, 0, 0, 0] SELECT ML_ONE_HOT_ENCODER('abc', ARRAY['abc', 'def', 'efg', 'hikj']); -- returns [0, 0, 0, 0, 1] SELECT ML_ONE_HOT_ENCODER('abcd', ARRAY['abc', 'def', 'efg', 'hik'], TRUE, 'KEEP' );
ML_RECURSIVE_TEXT_SPLITTER¶
Splits text into chunks using multiple separators recursively.
- Syntax
ML_RECURSIVE_TEXT_SPLITTER(text, chunkSize, chunkOverlap [, separators] [, isSeparatorRegex] [, trimWhitespace] [, keepSeparator] [, separatorPosition])
- Description
The
ML_RECURSIVE_TEXT_SPLITTER
function splits text into chunks using multiple separators recursively. It starts with the first separator and recursively applies subsequent separators if chunks are still too large.If any argument other than
text
isNULL
, an exception is thrown.The returned array of chunks has the same order as the input.
- Arguments
- text: The input text to be split. If the input text is
NULL
, it is returned as is. - chunkSize: The size of each chunk. If
chunkSize < 0
orchunkOverlap > chunkSize
, an exception is thrown. - chunkOverlap: The number of overlapping characters between chunks. If
chunkOverlap < 0
, an exception is thrown. - separators: (Optional) The list of separators used for splitting.
The default is
["\n\n", "\n", " ", ""]
- isSeparatorRegex: (Optional) Whether the separator is a regex pattern.
The default is
FALSE
- trimWhitespace: (Optional) Whether to trim whitespace from chunks.
The default is
TRUE
- keepSeparator: (Optional) Whether to keep the separator in the chunks.
The default is
FALSE
- separatorPosition: (Optional) The position of the separator. Valid
values are
START
orEND
. The default isSTART
.START
means place the separator at the start of the following chunk, andEND
means place the separator at the end of the previous chunk.
- text: The input text to be split. If the input text is
- Example
-- returns ['Hello', '. world', '!'] SELECT ML_RECURSIVE_TEXT_SPLITTER('Hello. world!', 0, 0, ARRAY['[!]','[.]'], TRUE, TRUE, TRUE, 'START');
ML_ROBUST_SCALER¶
Scales numerical values using statistics that are robust to outliers.
- Syntax
ML_ROBUST_SCALER(value, median, firstQuartile, thirdQuartile [, withCentering, withScaling)
- Description
- The
ML_ROBUST_SCALER
function scales numerical values using statistics that are robust to outliers. It removes the median and scales the data according to the quantile range. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL
,NaN
, orInfinity
, it is returned as is. - median: Median of the feature data seen in the training dataset. If
median
isNULL
,NaN
, orInfinity
, an exception is thrown. - firstQuartile: First Quartile of feature data seen in the dataset. If
firstQuartile
isNULL
,NaN
, orInfinity
, an exception is thrown. - thirdQuartile: Third Quartile of feature data seen in the dataset.
- If
thirdQuartile
isNULL
,NaN
, orInfinity
, an exception is thrown. - If
thirdQuartile - firstQuartile = 0
, the range is set to1.0
to avoid division by zero.
- If
- withCentering: (Optional) Boolean value indicating to center the
numerical value using median before scaling. The default is
TRUE
. IfwithCentering
isFALSE
, the median value is ignored. - withScaling: (Optional) Boolean value indicating to scale the numerical
value using IQR after centering. The default is
TRUE
. IfwithScaling
isFALSE
, the firstQuartile and thirdQuartile values are ignored.
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.3333333333333333 SELECT ML_ROBUST_SCALER(2, 1, 0, 3, TRUE, TRUE);
ML_STANDARD_SCALER¶
Standardizes numerical values by removing the mean and scaling to unit variance.
- Syntax
ML_STANDARD_SCALER(value, mean, standardDeviation [, withCentering] [, withScaling])
- Description
- The
ML_STANDARD_SCALER
function standardizes numerical values by removing the mean and scaling to unit variance. This is useful for features that follow a normal distribution. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL, NaN
orInfinity
, it is returned as is. - mean: Mean of the feature data seen in the dataset. If
mean
isNULL, NaN
orInfinity
, an exception is thrown. - standardDeviation: Standard Deviation of the feature data seen in the
dataset. If
standardDeviation
isNULL
orNaN
, an exception is thrown.- If
standardDeviation
isInfinity
,0
is returned. - If
standardDeviation
is0
, the value does not need to be scaled, so it is returned as is.
- If
- withCentering: (Optional) Boolean value indicating to center the
numerical value using mean before scaling. The default is
TRUE
. IfwithCentering
isFALSE
, the mean value is ignored. - withScaling: (Optional) Boolean value indicating to scale the numerical
value using std after centering. The default is
TRUE
. IfwithScaling
isFALSE
, thestandardDeviation
value is ignored.
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.2 SELECT ML_STANDARD_SCALER(2, 1, 5, TRUE, TRUE);