Java-friendly constructor for org.apache.spark.mllib.tree.configuration.Strategy
Learning goal. Supported: org.apache.spark.mllib.tree.configuration.Algo.Classification, org.apache.spark.mllib.tree.configuration.Algo.Regression
Criterion used for information gain calculation. Supported for Classification: org.apache.spark.mllib.tree.impurity.Gini, org.apache.spark.mllib.tree.impurity.Entropy. Supported for Regression: org.apache.spark.mllib.tree.impurity.Variance.
Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
Number of classes for classification. (Ignored for regression.) Default value is 2 (binary classification).
Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
Algorithm for calculating quantiles. Supported: org.apache.spark.mllib.tree.configuration.QuantileStrategy.Sort
A map storing information about the categorical variables and the number of discrete values they take. For example, an entry (n -> k) means that feature n is categorical with k categories 0, 1, 2, ... , k-1. Note that features are zero-indexed.
Minimum number of instances each child must have after a split. Default value is 1. If a split causes the left or right child to have fewer than minInstancesPerNode instances, the split is not considered valid.
Minimum information gain a split must get. Default value is 0.0. If a split has less information gain than minInfoGain, this split will not be considered as a valid split.
Maximum memory in MB allocated to histogram aggregation. Default value is 256 MB.
Fraction of the training data used for learning the decision tree.
If this is true, instead of passing trees to executors, the algorithm will maintain a separate RDD that caches the node Id for each row.
If the node Id cache is used, it will help to checkpoint the node Id cache periodically. This is the checkpoint directory to be used for the node Id cache.
How often to checkpoint when the node Id cache gets updated. E.g. 10 means that the cache will get checkpointed every 10 updates.
Learning goal.
A map storing information about the categorical variables and the number of discrete values they take. For example, an entry (n -> k) means that feature n is categorical with k categories 0, 1, 2, ... , k-1. Note that features are zero-indexed.
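The (n -> k) convention can be illustrated with a plain java.util.Map. This is a minimal sketch, not part of the Spark API: the class CategoricalFeaturesInfoSketch and the helper isValidValue are hypothetical names, and the sketch assumes that a feature absent from the map is treated as continuous.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class for illustration only; not part of Spark.
final class CategoricalFeaturesInfoSketch {

    // Returns true if `value` is acceptable for the zero-indexed feature `featureIndex`.
    // A categorical feature with k categories accepts only the whole numbers 0 .. k-1.
    // Assumption: a feature that is absent from the map is continuous and accepts any value.
    static boolean isValidValue(Map<Integer, Integer> info, int featureIndex, double value) {
        Integer k = info.get(featureIndex);
        if (k == null) {
            return true; // continuous feature
        }
        return value == Math.floor(value) && value >= 0 && value < k;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> info = new HashMap<>();
        info.put(0, 2);   // feature 0 is categorical with categories 0, 1
        info.put(4, 10);  // feature 4 is categorical with categories 0, 1, ..., 9

        System.out.println(isValidValue(info, 4, 9.0));  // last valid category of feature 4
        System.out.println(isValidValue(info, 4, 10.0)); // out of range: categories stop at k-1
        System.out.println(isValidValue(info, 1, 3.7));  // feature 1 is not in the map
    }
}
```

A map of this shape is what the Java-friendly constructor and the Java Map setter described in this section accept.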
If the node Id cache is used, it will help to checkpoint the node Id cache periodically. This is the checkpoint directory to be used for the node Id cache.
How often to checkpoint when the node Id cache gets updated. E.g. 10 means that the cache will get checkpointed every 10 updates.
Returns a shallow copy of this instance.
Criterion used for information gain calculation. Supported for Classification: org.apache.spark.mllib.tree.impurity.Gini, org.apache.spark.mllib.tree.impurity.Entropy. Supported for Regression: org.apache.spark.mllib.tree.impurity.Variance.
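For orientation, the three criteria can be sketched with their textbook definitions. This is an illustrative sketch only: the class ImpuritySketch and its methods are hypothetical and do not reproduce Spark's implementation, and the use of base-2 logarithms for entropy is an assumption of the sketch.

```java
// Hypothetical helper class for illustration only; not part of Spark.
final class ImpuritySketch {

    // Gini impurity from per-class counts: 1 - sum(p_i^2).
    static double gini(double[] classCounts) {
        double total = 0.0;
        for (double c : classCounts) total += c;
        if (total == 0.0) return 0.0;
        double impurity = 1.0;
        for (double c : classCounts) {
            double p = c / total;
            impurity -= p * p;
        }
        return impurity;
    }

    // Entropy from per-class counts: -sum(p_i * log2(p_i)), skipping empty classes.
    static double entropy(double[] classCounts) {
        double total = 0.0;
        for (double c : classCounts) total += c;
        if (total == 0.0) return 0.0;
        double impurity = 0.0;
        for (double c : classCounts) {
            if (c > 0.0) {
                double p = c / total;
                impurity -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return impurity;
    }

    // Population variance of regression labels: mean squared deviation from the mean.
    static double variance(double[] labels) {
        if (labels.length == 0) return 0.0;
        double sum = 0.0;
        for (double y : labels) sum += y;
        double mean = sum / labels.length;
        double squared = 0.0;
        for (double y : labels) squared += (y - mean) * (y - mean);
        return squared / labels.length;
    }

    public static void main(String[] args) {
        System.out.println(gini(new double[] {5, 5}));     // evenly split binary node
        System.out.println(entropy(new double[] {5, 5}));  // evenly split binary node
        System.out.println(variance(new double[] {1, 3})); // two regression labels
    }
}
```

A pure node (all instances in one class, or all labels equal) has impurity 0 under all three definitions.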
Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
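The depth examples generalize to an upper bound for a binary tree: a tree of depth d has at most 2^d leaf nodes and 2^(d+1) - 1 nodes in total. A minimal sketch of that arithmetic, using the hypothetical class TreeSizeSketch (not part of Spark):

```java
// Hypothetical helper class for illustration only; not part of Spark.
final class TreeSizeSketch {

    // Upper bound on the number of leaf nodes in a binary tree of the given depth.
    static long maxLeaves(int maxDepth) {
        checkDepth(maxDepth);
        return 1L << maxDepth;
    }

    // Upper bound on the total number of nodes (internal + leaf) for the given depth.
    static long maxNodes(int maxDepth) {
        checkDepth(maxDepth);
        return (1L << (maxDepth + 1)) - 1;
    }

    // Keeps the shifts within the range of a long; the bound is specific to this sketch.
    private static void checkDepth(int maxDepth) {
        if (maxDepth < 0 || maxDepth > 61) {
            throw new IllegalArgumentException("maxDepth must be in [0, 61]: " + maxDepth);
        }
    }

    public static void main(String[] args) {
        System.out.println(maxNodes(0)); // prints 1 (a single leaf)
        System.out.println(maxNodes(1)); // prints 3 (1 internal node + 2 leaf nodes)
        System.out.println(maxNodes(5)); // prints 63
    }
}
```

These are upper bounds; a trained tree can be smaller when splits are rejected or nodes become pure before the maximum depth is reached.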
Maximum memory in MB allocated to histogram aggregation. Default value is 256 MB.
Minimum information gain a split must get. Default value is 0.0. If a split has less information gain than minInfoGain, this split will not be considered as a valid split.
Minimum number of instances each child must have after a split. Default value is 1. If a split causes the left or right child to have fewer than minInstancesPerNode instances, the split is not considered valid.
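Taken together, minInstancesPerNode and minInfoGain act as two filters on candidate splits. The sketch below restates those two rules as described in this section; the class SplitValiditySketch and the method isValidSplit are hypothetical names, and the code is not Spark's own implementation.

```java
// Hypothetical helper class for illustration only; not part of Spark.
final class SplitValiditySketch {

    // A candidate split is rejected if either child would receive fewer than
    // minInstancesPerNode instances, or if its information gain is less than minInfoGain.
    static boolean isValidSplit(long leftCount, long rightCount, double infoGain,
                                int minInstancesPerNode, double minInfoGain) {
        if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) {
            return false;
        }
        return infoGain >= minInfoGain;
    }

    public static void main(String[] args) {
        // With the documented defaults (minInstancesPerNode = 1, minInfoGain = 0.0):
        System.out.println(isValidSplit(1, 5, 0.1, 1, 0.0)); // both children non-empty, gain >= 0
        System.out.println(isValidSplit(0, 6, 0.1, 1, 0.0)); // left child would be empty
        // With stricter settings:
        System.out.println(isValidSplit(4, 6, 0.01, 5, 0.05)); // left child too small
    }
}
```

Raising either threshold makes the tree more conservative: fewer splits qualify, so the resulting tree tends to be smaller.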
Number of classes for classification. (Ignored for regression.) Default value is 2 (binary classification).
Algorithm for calculating quantiles. Supported: org.apache.spark.mllib.tree.configuration.QuantileStrategy.Sort
Sets Algorithm using a String.
Sets categoricalFeaturesInfo using a Java Map.
Fraction of the training data used for learning the decision tree.
If this is true, instead of passing trees to executors, the algorithm will maintain a separate RDD that caches the node Id for each row.
:: Experimental :: Stores all the configuration options for tree construction