:: Experimental :: Binarize a column of continuous features given a threshold.
:: Experimental :: Bucketizer maps a column of continuous features to a column of feature buckets.
:: Experimental :: Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.
:: Experimental :: Model fitted by ChiSqSelector.
:: Experimental :: Extracts a vocabulary from document collections and generates a CountVectorizerModel.
:: Experimental :: Converts a text document to a sparse vector of token counts.
:: Experimental :: A feature transformer that takes the 1D discrete cosine transform of a real vector.
:: Experimental :: Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided weight vector.
:: Experimental :: Maps a sequence of terms to their term frequencies using the hashing trick.
:: Experimental :: Compute the Inverse Document Frequency (IDF) given a collection of documents.
:: Experimental :: Model fitted by IDF.
:: Experimental :: A Transformer that maps a column of indices back to a new column of corresponding string values.
:: Experimental :: Implements the feature interaction transform.
:: Experimental :: Linearly rescale each feature individually to a common range [min, max] using column summary statistics; this is also known as min-max normalization or rescaling.
:: Experimental :: Model fitted by MinMaxScaler.
:: Experimental :: A feature transformer that converts the input array of strings into an array of n-grams.
:: Experimental :: Normalize a vector to have unit norm using the given p-norm.
:: Experimental :: A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
:: Experimental :: PCA trains a model to project vectors to a lower-dimensional space using principal component analysis.
:: Experimental :: Model fitted by PCA.
:: Experimental :: Perform feature expansion in a polynomial space.
:: Experimental :: QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.
:: Experimental :: Implements the transforms required for fitting a dataset against an R model formula.
:: Experimental :: A fitted RFormula.
:: Experimental :: A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or by repeatedly matching the regex (if gaps is false).
:: Experimental :: Implements the transformations defined by a SQL statement.
:: Experimental :: Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
:: Experimental :: Model fitted by StandardScaler.
:: Experimental :: A feature transformer that filters out stop words from input.
:: Experimental :: A label indexer that maps a string column of labels to an ML column of label indices.
:: Experimental :: Model fitted by StringIndexer.
:: Experimental :: A tokenizer that converts the input string to lowercase and then splits it on whitespace.
:: Experimental :: A feature transformer that merges multiple columns into a vector column.
:: Experimental :: Class for indexing categorical feature columns in a dataset of Vector.
:: Experimental :: Transform categorical features to use 0-based indices instead of their original values.
:: Experimental :: This class takes a feature vector and outputs a new feature vector with a subarray of the original features.
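:: Experimental :: Word2Vec trains a model of Map(String, Vector), i.e., it maps each word to a vector for use in further natural language processing or machine learning.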
:: Experimental :: Model fitted by Word2Vec.
In the polynomial feature expansion, the expansion is done via recursion.
Feature transformers
The ml.feature package provides common feature transformers that help convert raw data or features into forms more suitable for model fitting. Most feature transformers are implemented as Transformers, which transform one DataFrame into another, e.g., HashingTF. Some feature transformers are implemented as Estimators, because the transformation requires aggregated information about the dataset, e.g., document frequencies in IDF. For those feature transformers, Estimator.fit must be called first to obtain the model, e.g., IDFModel, before the transformation can be applied. The transformation is usually done by appending new columns to the input DataFrame, so all input columns are carried over.
We try to keep each transformer minimal, so that transformers can be assembled flexibly into feature transformation pipelines. Pipeline can be used to chain feature transformers, and VectorAssembler can be used to combine multiple feature transformations, for example:
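The following is a minimal sketch of such a pipeline, not a definitive implementation. It assumes Spark is on the classpath and uses the SQLContext entry point (deprecated in later Spark releases in favor of SparkSession). The object name FeaturePipelineSketch, the helper buildFeatures, the toy rows, and the column names are all illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical object name, used only for this illustration.
object FeaturePipelineSketch {

  // Fits the pipeline on a toy DataFrame and returns the rows with a "features" column appended.
  def buildFeatures(sqlContext: SQLContext): DataFrame = {
    // Toy input with three columns: id (integer), text (string), rating (double).
    val df = sqlContext.createDataFrame(Seq(
      (0, "Hi I heard about Spark", 3.0),
      (1, "I wish Java could use case classes", 4.0),
      (2, "Logistic regression models are neat", 4.0)
    )).toDF("id", "text", "rating")

    // Transformers: each reads one column and appends a new one.
    val tok = new RegexTokenizer().setInputCol("text").setOutputCol("words")
    val sw = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words")
    val tf = new HashingTF()
      .setInputCol("filtered_words")
      .setOutputCol("tf")
      .setNumFeatures(10000)

    // Estimator: IDF needs document frequencies, so it is fitted before it can transform.
    val idf = new IDF().setInputCol("tf").setOutputCol("tf_idf")

    // Combine the TF-IDF vector and the numeric rating into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("tf_idf", "rating"))
      .setOutputCol("features")

    // Pipeline.fit fits each Estimator stage in order and returns a PipelineModel.
    val pipeline = new Pipeline().setStages(Array(tok, sw, tf, idf, assembler))
    val model = pipeline.fit(df)

    model.transform(df).select("id", "text", "rating", "features")
  }

  def main(args: Array[String]): Unit = {
    // Local master chosen only so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("FeaturePipelineSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      buildFeatures(new SQLContext(sc)).show()
    } finally sc.stop()
  }
}
```

Because every stage appends its output column rather than replacing its input, the intermediate columns (words, filtered_words, tf, tf_idf) remain available on the transformed DataFrame; the final select only narrows the result to the raw inputs plus the assembled features.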
Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn. The major difference is that most scikit-learn feature transformers operate eagerly on the entire input dataset, while MLlib's feature transformers operate lazily on individual columns, which makes them more efficient and flexible when handling large and complex datasets.
See also: scikit-learn.preprocessing