Spark Core¶
Public Classes¶
|
Main entry point for Spark functionality. |
|
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. |
|
A broadcast variable created with |
|
A shared variable that can be accumulated, i.e., has a commutative and associative “add” operation. |
Helper object that defines how to accumulate values of a given type. |
|
|
Configuration for a Spark application. |
Resolves paths to files added through |
|
|
Flags for controlling the storage of an RDD. |
Contextual information about a task which can be read or mutated during execution. |
|
|
Wraps an RDD in a barrier stage, which forces Spark to launch tasks of this stage together. |
A |
|
|
Carries all task infos of a barrier task. |
|
Thread that is recommended to be used in PySpark instead of |
Provides utility method to determine Spark versions with given input string. |
Spark Context APIs¶
|
Create an |
|
Add an archive to be downloaded with this Spark job on every node. |
|
Add a file to be downloaded with this Spark job on every node. |
|
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. |
A unique identifier for the Spark application. |
|
|
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. |
|
Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant. |
|
Broadcast a read-only variable to the cluster, returning a |
Cancel all jobs that have been scheduled or are running. |
|
|
Cancel active jobs for the specified group. |
Default min number of partitions for Hadoop RDDs when not given by user |
|
Default level of parallelism to use when not given by user (e.g. |
|
Dump the profile stats into directory path |
|
Create an |
|
Return the directory where RDDs are checkpointed. |
|
Return a copy of this SparkContext’s configuration |
|
Get a local property set in this thread, or null if it is missing. |
|
|
Get or instantiate a |
|
Read an ‘old’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. |
|
Read an ‘old’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. |
Returns a list of archive paths that are added to resources. |
|
Returns a list of file paths that are added to resources. |
|
|
Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. |
|
Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. |
|
Distribute a local Python collection to form an RDD. |
|
Load an RDD previously saved using |
|
Create a new RDD of int containing elements from start to end (exclusive), increased by step every element. |
Return the resource information of this |
|
|
Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements. |
|
Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. |
|
Set the directory under which RDDs are going to be checkpointed. |
Set a human readable description of the current job. |
|
|
Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared. |
|
Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool. |
|
Control our logLevel. |
|
Set a Java system property, such as spark.executor.memory. |
Print the profile stats to stdout |
|
Get SPARK_USER for user who is running SparkContext. |
|
Return the epoch time when the |
|
Return |
|
Shut down the |
|
|
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. |
Return the URL of the SparkUI instance started by this |
|
|
Build the union of a list of RDDs. |
The version of Spark on which this application is running. |
|
|
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. |
RDD APIs¶
|
Aggregate the elements of each partition, and then the results for all the partitions, using a given combine functions and a neutral “zero value.” |
|
Aggregate the values of each key, using given combine functions and a neutral “zero value”. |
Marks the current stage as a barrier stage, where Spark must launch all tasks together. |
|
Persist this RDD with the default storage level (MEMORY_ONLY). |
|
|
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements |
Mark this RDD for checkpointing. |
|
|
Removes an RDD’s shuffles and it’s non-persisted ancestors. |
|
Return a new RDD that is reduced into numPartitions partitions. |
|
For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other. |
Return a list that contains all the elements in this RDD. |
|
Return the key-value pairs in this RDD to the master as a dictionary. |
|
|
When collect rdd, use this method to specify job group. |
|
Generic function to combine the elements for each key using a custom set of aggregation functions. |
The |
|
Return the number of elements in this RDD. |
|
|
Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished. |
|
Return approximate number of distinct elements in the RDD. |
Count the number of elements for each key, and return the result to the master as a dictionary. |
|
Return the count of each unique value in this RDD as a dictionary of (value, count) pairs. |
|
|
Return a new RDD containing the distinct elements in this RDD. |
|
Return a new RDD containing only the elements that satisfy a predicate. |
Return the first element in this RDD. |
|
|
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. |
Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD’s partitioning. |
|
|
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value.” |
|
Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.). |
|
Applies a function to all elements of this RDD. |
Applies a function to each partition of this RDD. |
|
|
Perform a right outer join of self and other. |
Gets the name of the file to which this RDD was checkpointed |
|
Returns the number of partitions in RDD |
|
Get the |
|
Get the RDD’s current storage level. |
|
|
Return an RDD created by coalescing all elements within each partition into a list. |
|
Return an RDD of grouped items. |
|
Group the values for each key in the RDD into a single sequence. |
|
Alias for cogroup but with support for multiple RDDs. |
|
Compute a histogram using the provided buckets. |
|
A unique ID for this RDD (within its SparkContext). |
|
Return the intersection of this RDD and another one. |
Return whether this RDD is checkpointed and materialized, either reliably or locally. |
|
Returns true if and only if the RDD contains no elements at all. |
|
Return whether this RDD is marked for local checkpointing. |
|
|
Return an RDD containing all pairs of elements with matching keys in self and other. |
|
Creates tuples of the elements in this RDD by applying f. |
|
Return an RDD with the keys of each tuple. |
|
Perform a left outer join of self and other. |
Mark this RDD for local checkpointing using Spark’s existing caching layer. |
|
|
Return the list of values in the RDD for key key. |
|
Return a new RDD by applying a function to each element of this RDD. |
|
Return a new RDD by applying a function to each partition of this RDD. |
|
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. |
|
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. |
Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning. |
|
|
Find the maximum item in this RDD. |
|
Compute the mean of this RDD’s elements. |
|
Approximate operation to return the mean within a timeout or meet the confidence. |
|
Find the minimum item in this RDD. |
|
Return the name of this RDD. |
|
Return a copy of the RDD partitioned using the specified partitioner. |
|
Set this RDD’s storage level to persist its values across operations after the first time it is computed. |
|
Return an RDD created by piping elements to a forked external process. |
|
Randomly splits this RDD with the provided weights. |
|
Reduces the elements of this RDD using the specified commutative and associative binary operator. |
|
Merge the values for each key using an associative and commutative reduce function. |
|
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary. |
|
Return a new RDD that has exactly numPartitions partitions. |
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. |
|
|
Perform a right outer join of self and other. |
|
Return a sampled subset of this RDD. |
|
Return a subset of this RDD sampled by key (via stratified sampling). |
Compute the sample standard deviation of this RDD’s elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N). |
|
Compute the sample variance of this RDD’s elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N). |
|
|
Output a Python RDD of key-value pairs (of form |
|
Output a Python RDD of key-value pairs (of form |
|
Output a Python RDD of key-value pairs (of form |
|
Output a Python RDD of key-value pairs (of form |
|
Save this RDD as a SequenceFile of serialized objects. |
|
Output a Python RDD of key-value pairs (of form |
|
Save this RDD as a text file, using string representations of elements. |
|
Assign a name to this RDD. |
|
Sorts this RDD by the given keyfunc |
|
Sorts this RDD, which is assumed to consist of (key, value) pairs. |
Return a |
|
Compute the standard deviation of this RDD’s elements. |
|
|
Return each value in self that is not contained in other. |
|
Return each (key, value) pair in self that has no pair with matching key in other. |
|
Add up the elements in this RDD. |
|
Approximate operation to return the sum within a timeout or meet the confidence. |
|
Take the first num elements of the RDD. |
|
Get the N elements from an RDD ordered in ascending order or as specified by the optional key function. |
|
Return a fixed-size sampled subset of this RDD. |
A description of this RDD and its recursive dependencies for debugging. |
|
|
Return an iterator that contains all of the elements in this RDD. |
|
Get the top N elements from an RDD. |
|
Aggregates the elements of this RDD in a multi-level tree pattern. |
|
Reduces the elements of this RDD in a multi-level tree pattern. |
|
Return the union of this RDD and another one. |
|
Mark the RDD as non-persistent, and remove all blocks for it from memory and disk. |
Return an RDD with the values of each tuple. |
|
Compute the variance of this RDD’s elements. |
|
|
Specify a |
|
Zips this RDD with another one, returning key-value pairs with the first element in each RDD second element in each RDD, etc. |
Zips this RDD with its element indices. |
|
Zips this RDD with generated unique Long ids. |
Broadcast and Accumulator¶
|
Destroy all data and metadata related to this broadcast variable. |
|
Write a pickled representation of value to the open file or socket. |
|
Read a pickled representation of value from the open file or socket. |
|
Read the pickled representation of an object from the open file and return the reconstituted object hierarchy specified therein. |
|
Delete cached copies of this broadcast on the executors. |
Return the broadcasted value |
|
|
Adds a term to this accumulator’s value |
Get the accumulator’s value; only usable in driver program |
|
|
Add two values of the accumulator’s data type, returning a new value; for efficiency, can also update value1 in place and return it. |
|
Provide a “zero value” for the type, compatible in dimensions with the provided value (e.g., a zero vector) |
Management¶
Return thread target wrapper which is recommended to be used in PySpark when the pinned thread mode is enabled. |
|
|
Does this configuration contain a given key? |
|
Get the configured value for some key, or return a default otherwise. |
Get all values as a list of key-value pairs. |
|
|
Set a configuration property. |
|
Set multiple parameters, passed as a list of key-value pairs. |
|
Set application name. |
|
Set an environment variable to be passed to executors. |
|
Set a configuration property, if not already set. |
|
Set master URL to connect to. |
|
Set path where Spark is installed on worker nodes. |
Returns a printable version of the configuration, as a list of key=value pairs, one per line. |
|
|
Get the absolute path of a file added through |
Get the root directory that contains files added through |
|
How many times this task has been attempted. |
|
CPUs allocated to the task. |
|
Return the currently active |
|
Get a local property set upstream in the driver, or None if it is missing. |
|
The ID of the RDD partition that is computed by this task. |
|
Resources allocated to the task. |
|
The ID of the stage that this task belong to. |
|
An ID that is unique to this task attempt (within the same |
|
|
Returns a new RDD by applying a function to each partition of the wrapped RDD, where tasks are launched together in a barrier stage. |
Returns a new RDD by applying a function to each partition of the wrapped RDD, while tracking the index of the original partition. |
|
|
This function blocks until all tasks in the same stage have reached this routine. |
How many times this task has been attempted. |
|
Sets a global barrier and waits until all tasks in this stage hit this barrier. |
|
CPUs allocated to the task. |
|
Return the currently active |
|
Get a local property set upstream in the driver, or None if it is missing. |
|
Returns |
|
The ID of the RDD partition that is computed by this task. |
|
Resources allocated to the task. |
|
The ID of the stage that this task belong to. |
|
An ID that is unique to this task attempt (within the same |
|
|
Given a Spark version string, return the (major version number, minor version number). |