Aggregates on the entire Dataset without groups.
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(max($"age"), avg($"salary")) ds.groupBy().agg(max($"age"), avg($"salary"))
2.0.0
(Java-specific) Aggregates on the entire Dataset without groups.
(Java-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(Map("age" -> "max", "salary" -> "avg")) ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
2.0.0
(Scala-specific) Aggregates on the entire Dataset without groups.
(Scala-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(Map("age" -> "max", "salary" -> "avg")) ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
2.0.0
(Scala-specific) Aggregates on the entire Dataset without groups.
(Scala-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg("age" -> "max", "salary" -> "avg") ds.groupBy().agg("age" -> "max", "salary" -> "avg")
2.0.0
(Scala-specific) Returns a new Dataset with an alias set.
(Scala-specific) Returns a new Dataset with an alias set. Same as as
.
2.0.0
Returns a new Dataset with an alias set.
Returns a new Dataset with an alias set. Same as as
.
2.0.0
Selects column based on the column name and return it as a Column.
Selects column based on the column name and return it as a Column.
Note that the column name can also reference to a nested column like a.b
.
2.0.0
(Scala-specific) Returns a new Dataset with an alias set.
(Scala-specific) Returns a new Dataset with an alias set.
2.0.0
Returns a new Dataset with an alias set.
Returns a new Dataset with an alias set.
1.6.0
:: Experimental :: Returns a new Dataset where each record has been mapped on to the specified type.
:: Experimental ::
Returns a new Dataset where each record has been mapped on to the specified type. The
method used to map columns depend on the type of U
:
U
is a class, fields for the class will be mapped to columns of the same name
(case sensitivity is determined by spark.sql.caseSensitive
).U
is a tuple, the columns will be be mapped by ordinal (i.e. the first column will
be assigned to _1
).U
is a primitive type (i.e. String, Int, etc), then the first column of the
DataFrame will be used.If the schema of the Dataset does not match the desired U
type, you can use select
along with alias
or as
to rearrange or rename as required.
1.6.0
Persist this Dataset with the default storage level (MEMORY_AND_DISK
).
Persist this Dataset with the default storage level (MEMORY_AND_DISK
).
1.6.0
Returns a new Dataset that has exactly numPartitions
partitions.
Returns a new Dataset that has exactly numPartitions
partitions.
Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g.
if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
the 100 new partitions will claim 10 of the current partitions.
1.6.0
Selects column based on the column name and return it as a Column.
Selects column based on the column name and return it as a Column.
Note that the column name can also reference to a nested column like a.b
.
2.0.0
Returns an array that contains all of Rows in this Dataset.
Returns an array that contains all of Rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.
1.6.0
Returns a Java list that contains all of Rows in this Dataset.
Returns all column names as an array.
Returns all column names as an array.
1.6.0
Returns the number of rows in the Dataset.
Returns the number of rows in the Dataset.
1.6.0
Creates a temporary view using the given name.
Creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
2.0.0
Creates a temporary view using the given name.
Creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
2.0.0
AnalysisException
if the view name already exists
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group. ds.cube("department", "group").avg() // Compute the max age and average salary, cubed by department and gender. ds.cube($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
2.0.0
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group. ds.cube($"department", $"group").avg() // Compute the max age and average salary, cubed by department and gender. ds.cube($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
2.0.0
Computes statistics for numeric columns, including count, mean, stddev, min, and max.
Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting Dataset. If you want to
programmatically compute summary statistics, use the agg
function instead.
ds.describe("age", "height").show() // output: // summary age height // count 10.0 10.0 // mean 53.3 178.05 // stddev 11.6 15.7 // min 18.0 163.0 // max 92.0 192.0
1.6.0
Returns a new Dataset that contains only the unique rows from this Dataset.
Returns a new Dataset with a column dropped.
Returns a new Dataset with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.
2.0.0
Returns a new Dataset with columns dropped.
Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).
2.0.0
Returns a new Dataset with a column dropped.
Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain column name.
2.0.0
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
2.0.0
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
2.0.0
Returns a new Dataset that contains only the unique rows from this Dataset.
Returns all column names and their data types as an array.
Returns all column names and their data types as an array.
1.6.0
Returns a new Dataset containing rows in this Dataset but not in another Dataset.
Returns a new Dataset containing rows in this Dataset but not in another Dataset.
This is equivalent to EXCEPT
in SQL.
Note that, equality checking is performed directly on the encoded representation of the data
and thus is not affected by a custom equals
function defined on T
.
2.0.0
Prints the physical plan to the console for debugging purposes.
Prints the physical plan to the console for debugging purposes.
1.6.0
Prints the plans (logical and physical) to the console for debugging purposes.
Prints the plans (logical and physical) to the console for debugging purposes.
1.6.0
:: Experimental :: (Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function.
:: Experimental ::
(Scala-specific) Returns a new Dataset where a single column has been expanded to zero
or more rows by the provided function. This is similar to a LATERAL VIEW
in HiveQL. All
columns of the input row are implicitly joined with each value that is output by the function.
ds.explode("words", "word") {words: String => words.split(" ")}
2.0.0
:: Experimental :: (Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function.
:: Experimental ::
(Scala-specific) Returns a new Dataset where each row has been expanded to zero or more
rows by the provided function. This is similar to a LATERAL VIEW
in HiveQL. The columns of
the input row are implicitly joined with each row that is output by the function.
The following example uses this function to count the number of books which contain a given word:
case class Book(title: String, words: String) val ds: Dataset[Book] case class Word(word: String) val allWords = ds.explode('words) { case Row(words: String) => words.split(" ").map(Word(_)) } val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
2.0.0
:: Experimental ::
(Java-specific)
Returns a new Dataset that only contains elements where func
returns true
.
:: Experimental ::
(Java-specific)
Returns a new Dataset that only contains elements where func
returns true
.
1.6.0
:: Experimental ::
(Scala-specific)
Returns a new Dataset that only contains elements where func
returns true
.
:: Experimental ::
(Scala-specific)
Returns a new Dataset that only contains elements where func
returns true
.
1.6.0
Filters rows using the given SQL expression.
Filters rows using the given SQL expression.
peopleDs.filter("age > 15")
1.6.0
Filters rows using the given condition.
Filters rows using the given condition.
// The following are equivalent: peopleDs.filter($"age" > 15) peopleDs.where($"age" > 15)
1.6.0
Returns the first row.
Returns the first row. Alias for head().
1.6.0
:: Experimental :: (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
:: Experimental :: (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
(Java-specific)
Runs func
on each element of this Dataset.
(Java-specific)
Runs func
on each element of this Dataset.
1.6.0
Applies a function f
to all rows.
Applies a function f
to all rows.
1.6.0
(Java-specific)
Runs func
on each partition of this Dataset.
(Java-specific)
Runs func
on each partition of this Dataset.
1.6.0
Applies a function f
to each partition of this Dataset.
Applies a function f
to each partition of this Dataset.
1.6.0
Groups the Dataset using the specified columns, so that we can run aggregation on them.
Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department. ds.groupBy("department").avg() // Compute the max age and average salary, grouped by department and gender. ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
2.0.0
Groups the Dataset using the specified columns, so we can run aggregation on them.
Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department. ds.groupBy($"department").avg() // Compute the max age and average salary, grouped by department and gender. ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
2.0.0
:: Experimental ::
(Java-specific)
Returns a KeyValueGroupedDataset where the data is grouped by the given key func
.
:: Experimental ::
(Java-specific)
Returns a KeyValueGroupedDataset where the data is grouped by the given key func
.
2.0.0
:: Experimental ::
(Scala-specific)
Returns a KeyValueGroupedDataset where the data is grouped by the given key func
.
:: Experimental ::
(Scala-specific)
Returns a KeyValueGroupedDataset where the data is grouped by the given key func
.
2.0.0
Returns the first row.
Returns the first row.
1.6.0
Returns the first n
rows.
Returns the first n
rows.
1.6.0
this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Returns a best-effort snapshot of the files that compose this Dataset.
Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
2.0.0
Returns a new Dataset containing rows only in both this Dataset and another Dataset.
Returns a new Dataset containing rows only in both this Dataset and another Dataset.
This is equivalent to INTERSECT
in SQL.
Note that, equality checking is performed directly on the encoded representation of the data
and thus is not affected by a custom equals
function defined on T
.
1.6.0
Returns true if the collect
and take
methods can be run locally
(without any Spark executors).
Returns true if the collect
and take
methods can be run locally
(without any Spark executors).
1.6.0
Returns true if this Dataset contains one or more sources that continuously return data as it arrives.
Returns true if this Dataset contains one or more sources that continuously
return data as it arrives. A Dataset that reads data from a streaming source
must be executed as a ContinuousQuery using the startStream()
method in
DataFrameWriter. Methods that return a single answer, e.g. count()
or
collect()
, will throw an AnalysisException when there is a streaming
source present.
2.0.0
Converts a JavaRDD to a PythonRDD.
Converts a JavaRDD to a PythonRDD.
Join with another DataFrame, using the given join expression.
Join with another DataFrame, using the given join expression. The following performs
a full outer join between df1
and df2
.
// Scala: import org.apache.spark.sql.functions._ df1.join(df2, $"df1Key" === $"df2Key", "outer") // Java: import static org.apache.spark.sql.functions.*; df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
Right side of the join.
Join expression.
One of: inner
, outer
, left_outer
, right_outer
, leftsemi
.
2.0.0
Inner join with another DataFrame, using the given join expression.
Inner join with another DataFrame, using the given join expression.
// The following two are equivalent: df1.join(df2, $"df1Key" === $"df2Key") df1.join(df2).where($"df1Key" === $"df2Key")
2.0.0
Equi-join with another DataFrame using the given columns.
Equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Right side of the join operation.
Names of the columns to join on. This columns must exist on both sides.
One of: inner
, outer
, left_outer
, right_outer
, leftsemi
.
2.0.0
Inner equi-join with another DataFrame using the given columns.
Inner equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
// Joining df1 and df2 using the columns "user_id" and "user_name" df1.join(df2, Seq("user_id", "user_name"))
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Right side of the join operation.
Names of the columns to join on. This columns must exist on both sides.
2.0.0
Inner equi-join with another DataFrame using the given column.
Inner equi-join with another DataFrame using the given column.
Different from other join functions, the join column will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
// Joining df1 and df2 using the column "user_id" df1.join(df2, "user_id")
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Right side of the join operation.
Name of the column to join on. This column must exist on both sides.
2.0.0
Cartesian join with another DataFrame.
Cartesian join with another DataFrame.
Note that cartesian joins are very expensive without an extra filter that can be pushed down.
Right side of the join operation.
2.0.0
:: Experimental ::
Using inner equi-join to join this Dataset returning a Tuple2 for each pair
where condition
evaluates to true.
:: Experimental ::
Using inner equi-join to join this Dataset returning a Tuple2 for each pair
where condition
evaluates to true.
Right side of the join.
Join expression.
1.6.0
:: Experimental ::
Joins this Dataset returning a Tuple2 for each pair where condition
evaluates to
true.
:: Experimental ::
Joins this Dataset returning a Tuple2 for each pair where condition
evaluates to
true.
This is similar to the relation join
function with one important difference in the
result schema. Since joinWith
preserves objects present on either side of the join, the
result schema is similarly nested into a tuple under the column names _1
and _2
.
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
Right side of the join.
Join expression.
One of: inner
, outer
, left_outer
, right_outer
, leftsemi
.
1.6.0
Returns a new Dataset by taking the first n
rows.
:: Experimental ::
(Java-specific)
Returns a new Dataset that contains the result of applying func
to each element.
:: Experimental ::
(Java-specific)
Returns a new Dataset that contains the result of applying func
to each element.
1.6.0
:: Experimental ::
(Scala-specific)
Returns a new Dataset that contains the result of applying func
to each element.
:: Experimental ::
(Scala-specific)
Returns a new Dataset that contains the result of applying func
to each element.
1.6.0
:: Experimental ::
(Java-specific)
Returns a new Dataset that contains the result of applying f
to each partition.
:: Experimental ::
(Java-specific)
Returns a new Dataset that contains the result of applying f
to each partition.
1.6.0
:: Experimental ::
(Scala-specific)
Returns a new Dataset that contains the result of applying func
to each partition.
:: Experimental ::
(Scala-specific)
Returns a new Dataset that contains the result of applying func
to each partition.
1.6.0
Returns a DataFrameNaFunctions for working with missing data.
Returns a DataFrameNaFunctions for working with missing data.
// Dropping rows containing any null values.
ds.na.drop()
1.6.0
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions.
This is an alias of the sort
function.
2.0.0
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions.
This is an alias of the sort
function.
2.0.0
Persist this Dataset with the given storage level.
Persist this Dataset with the given storage level.
One of: MEMORY_ONLY
, MEMORY_AND_DISK
, MEMORY_ONLY_SER
,
MEMORY_AND_DISK_SER
, DISK_ONLY
, MEMORY_ONLY_2
,
MEMORY_AND_DISK_2
, etc.
1.6.0
Persist this Dataset with the default storage level (MEMORY_AND_DISK
).
Persist this Dataset with the default storage level (MEMORY_AND_DISK
).
1.6.0
Prints the schema to the console in a nice tree format.
Prints the schema to the console in a nice tree format.
1.6.0
Randomly splits this Dataset with the provided weights.
Randomly splits this Dataset with the provided weights.
weights for splits, will be normalized if they don't sum to 1.
2.0.0
Randomly splits this Dataset with the provided weights.
Randomly splits this Dataset with the provided weights.
weights for splits, will be normalized if they don't sum to 1.
Seed for sampling. For Java API, use randomSplitAsList.
2.0.0
Returns a Java list that contains randomly split Dataset with the provided weights.
Returns a Java list that contains randomly split Dataset with the provided weights.
weights for splits, will be normalized if they don't sum to 1.
Seed for sampling.
2.0.0
Represents the content of the Dataset as an RDD of T.
Represents the content of the Dataset as an RDD of T.
1.6.0
:: Experimental :: (Java-specific) Reduces the elements of this Dataset using the specified binary function.
:: Experimental ::
(Java-specific)
Reduces the elements of this Dataset using the specified binary function. The given func
must be commutative and associative or the result may be non-deterministic.
1.6.0
:: Experimental :: (Scala-specific) Reduces the elements of this Dataset using the specified binary function.
:: Experimental ::
(Scala-specific)
Reduces the elements of this Dataset using the specified binary function. The given func
must be commutative and associative or the result may be non-deterministic.
1.6.0
Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions.
Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions.
The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
2.0.0
Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
.
Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
2.0.0
Returns a new Dataset that has exactly numPartitions
partitions.
Returns a new Dataset that has exactly numPartitions
partitions.
1.6.0
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolluped by department and group. ds.rollup("department", "group").avg() // Compute the max age and average salary, rolluped by department and gender. ds.rollup($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
2.0.0
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns rolluped by department and group. ds.rollup($"department", $"group").avg() // Compute the max age and average salary, rolluped by department and gender. ds.rollup($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
2.0.0
Returns a new Dataset by sampling a fraction of rows, using a random seed.
Returns a new Dataset by sampling a fraction of rows, using a random seed.
Sample with replacement or not.
Fraction of rows to generate.
1.6.0
Returns a new Dataset by sampling a fraction of rows.
Returns a new Dataset by sampling a fraction of rows.
Sample with replacement or not.
Fraction of rows to generate.
Seed for sampling.
1.6.0
Returns the schema of this Dataset.
Returns the schema of this Dataset.
1.6.0
:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
:: Experimental :: Returns a new Dataset by computing the given Column expression for each element.
Selects a set of columns.
Selects a set of columns. This is a variant of select
that can only select
existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent: ds.select("colA", "colB") ds.select($"colA", $"colB")
2.0.0
Selects a set of column based expressions.
Selects a set of column based expressions.
ds.select($"colA", $"colB" + 1)
2.0.0
Selects a set of SQL expressions.
Selects a set of SQL expressions. This is a variant of select
that accepts
SQL expressions.
// The following are equivalent: ds.selectExpr("colA", "colB as newName", "abs(colC)") ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
2.0.0
Internal helper function for building typed selects that return tuples.
Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close) 1980 12 0.503218 0.595103 1981 01 0.523289 0.570307 1982 02 0.436504 0.475256 1983 03 0.410516 0.442194 1984 04 0.450090 0.483521
Number of rows to show
Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right
1.6.0
Displays the top 20 rows of Dataset in a tabular form.
Displays the top 20 rows of Dataset in a tabular form.
Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right
1.6.0
Displays the top 20 rows of Dataset in a tabular form.
Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.
1.6.0
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:
year month AVG('Adj Close) MAX('Adj Close) 1980 12 0.503218 0.595103 1981 01 0.523289 0.570307 1982 02 0.436504 0.475256 1983 03 0.410516 0.442194 1984 04 0.450090 0.483521
Number of rows to show
1.6.0
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions. For example:
ds.sort($"col1", $"col2".desc)
2.0.0
Returns a new Dataset sorted by the specified column, all in ascending order.
Returns a new Dataset sorted by the specified column, all in ascending order.
// The following 3 are equivalent ds.sort("sortcol") ds.sort($"sortcol") ds.sort($"sortcol".asc)
2.0.0
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
2.0.0
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
2.0.0
Returns a DataFrameStatFunctions for working statistic functions support.
Returns a DataFrameStatFunctions for working statistic functions support.
// Finding frequent items in column with name 'a'. ds.stat.freqItems(Seq("a"))
1.6.0
Returns the first n
rows in the Dataset.
Returns the first n
rows in the Dataset.
Running take requires moving data into the application's driver process, and doing so with
a very large n
can crash the driver process with OutOfMemoryError.
1.6.0
Returns the first n
rows in the Dataset as a list.
Returns the first n
rows in the Dataset as a list.
Running take requires moving data into the application's driver process, and doing so with
a very large n
can crash the driver process with OutOfMemoryError.
1.6.0
Converts this strongly typed collection of data to generic DataFrame
with columns renamed.
Converts this strongly typed collection of data to generic DataFrame
with columns renamed.
This can be quite convenient in conversion from a RDD of tuples into a DataFrame with
meaningful names. For example:
val rdd: RDD[(Int, String)] = ... rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2` rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
2.0.0
Converts this strongly typed collection of data to generic Dataframe.
Converts this strongly typed collection of data to generic Dataframe. In contrast to the strongly typed objects that Dataset operations work on, a Dataframe returns generic Row objects that allow fields to be accessed by ordinal or name.
1.6.0
Returns the content of the Dataset as a Dataset of JSON strings.
Returns the content of the Dataset as a Dataset of JSON strings.
2.0.0
Return an iterator that contains all of Rows in this Dataset.
Return an iterator that contains all of Rows in this Dataset.
The iterator will consume as much memory as the largest partition in this Dataset.
Note: this results in multiple Spark jobs, and if the input Dataset is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input Dataset should be cached first.
2.0.0
Concise syntax for chaining custom transformations.
Concise syntax for chaining custom transformations.
def featurize(ds: Dataset[T]): Dataset[U] = ...
ds
.transform(featurize)
.transform(...)
1.6.0
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
1.6.0
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Whether to block until all blocks are deleted.
1.6.0
Filters rows using the given SQL expression.
Filters rows using the given SQL expression.
peopleDs.where("age > 15")
1.6.0
Filters rows using the given condition.
Filters rows using the given condition. This is an alias for filter
.
// The following are equivalent: peopleDs.filter($"age" > 15) peopleDs.where($"age" > 15)
1.6.0
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
2.0.0
Returns a new Dataset with a column renamed.
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.
2.0.0
:: Experimental :: Interface for saving the content of the Dataset out into external storage or streams.
:: Experimental :: Interface for saving the content of the Dataset out into external storage or streams.
1.6.0
Registers this Dataset as a temporary table using the given name.
Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this Dataset.
(Since version 2.0.0) Use createOrReplaceTempView(viewName) instead.
1.6.0
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL
in SQL.
To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
(Since version 2.0.0) use union()
2.0.0
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (
groupBy
). Example actions count, show, or writing data out to file systems.Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the
explain
function.To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain specific type
T
to Spark's internal type system. For example, given a classPerson
with two fields,name
(string) andage
(int), an encoder is used to tell Spark to generate code at runtime to serialize thePerson
object into a binary structure. This binary structure often has much lower memory footprint as well as are optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use theschema
function.There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the
read
function available on aSparkSession
.Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use
apply
method in Scala andcol
in Java.Note that the Column type can also be manipulated through its various functions.
A more concrete example in Scala:
and in Java:
1.6.0