public class SubtractedRDD<K,V,W> extends RDD<scala.Tuple2<K,V>>
It is possible to implement this operation with just `cogroup`, but that is less efficient because all of the entries from `rdd2`, for both matching and non-matching values in `rdd1`, are kept in a `java.util.HashMap` until the end. With this implementation, only the entries from `rdd1` are kept in memory, and the entries from `rdd2` are essentially streamed: each one is touched only once, to decide whether a value needs to be removed. This is particularly helpful when `rdd1` is much smaller than `rdd2`, as you can use `rdd1`'s partitioner/partition size without worrying about running out of memory because of the size of `rdd2`.
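For context, `subtractByKey` on a pair RDD is the usual entry point that produces a `SubtractedRDD` under the hood. A minimal sketch in Java (the class name, app name, `local[2]` master, and sample data are illustrative assumptions, not part of this API):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SubtractByKeyExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SubtractByKeyExample").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaPairRDD<String, Integer> rdd1 = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("c", 3)));
    JavaPairRDD<String, Integer> rdd2 = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("b", 99)));

    // Keep only the entries of rdd1 whose keys do not appear in rdd2.
    // Spark evaluates this with a SubtractedRDD: rdd2 is streamed while
    // only rdd1's entries are buffered per partition.
    JavaPairRDD<String, Integer> result = rdd1.subtractByKey(rdd2);

    System.out.println(result.collect()); // [(a,1), (c,3)]
    sc.stop();
  }
}
```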
| Constructor and Description |
| --- |
| `SubtractedRDD(RDD<? extends scala.Product2<K,V>> rdd1, RDD<? extends scala.Product2<K,W>> rdd2, Partitioner part, scala.reflect.ClassTag<K> evidence$1, scala.reflect.ClassTag<V> evidence$2, scala.reflect.ClassTag<W> evidence$3)` |
| Modifier and Type | Method and Description |
| --- | --- |
| `void` | `clearDependencies()` Clears the dependencies of this RDD. |
| `scala.collection.Iterator<scala.Tuple2<K,V>>` | `compute(Partition p, TaskContext context)` :: DeveloperApi :: Implemented by subclasses to compute a given partition. |
| `scala.collection.Seq<Dependency<?>>` | `getDependencies()` Implemented by subclasses to return how this RDD depends on parent RDDs. |
| `Partition[]` | `getPartitions()` Implemented by subclasses to return the set of partitions in this RDD. |
| `scala.Some<Partitioner>` | `partitioner()` Optionally overridden by subclasses to specify how they are partitioned. |
| `Object` | `rdd1()` |
| `Object` | `rdd2()` |
| `SubtractedRDD<K,V,W>` | `setSerializer(Serializer serializer)` Set a serializer for this RDD's shuffle, or null to use the default (spark.serializer). |
Methods inherited from class org.apache.spark.rdd.RDD:
aggregate, cache, cartesian, checkpoint, checkpointData, coalesce, collect, collect, collectPartitions, computeOrReadCheckpoint, conf, context, count, countApprox, countApproxDistinct, countApproxDistinct, countByValue, countByValueApprox, creationSite, dependencies, distinct, distinct, doCheckpoint, doubleRDDToDoubleRDDFunctions, elementClassTag, filter, filterWith, first, flatMap, flatMapWith, fold, foreach, foreachPartition, foreachWith, getCheckpointFile, getCreationSite, getNarrowAncestors, getStorageLevel, glom, groupBy, groupBy, groupBy, id, intersection, intersection, intersection, isCheckpointed, isEmpty, iterator, keyBy, map, mapPartitions, mapPartitionsWithContext, mapPartitionsWithIndex, mapPartitionsWithSplit, mapWith, markCheckpointed, max, min, name, numericRDDToDoubleRDDFunctions, partitions, persist, persist, pipe, pipe, pipe, preferredLocations, randomSplit, rddToAsyncRDDActions, rddToOrderedRDDFunctions, rddToPairRDDFunctions, rddToSequenceFileRDDFunctions, reduce, repartition, retag, retag, sample, saveAsObjectFile, saveAsTextFile, saveAsTextFile, setName, sortBy, sparkContext, subtract, subtract, subtract, take, takeOrdered, takeSample, toArray, toDebugString, toJavaRDD, toLocalIterator, top, toString, treeAggregate, treeReduce, union, unpersist, zip, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipWithIndex, zipWithUniqueId
Methods inherited from interface org.apache.spark.Logging:
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public Object rdd1()

public Object rdd2()

public SubtractedRDD<K,V,W> setSerializer(Serializer serializer)
Set a serializer for this RDD's shuffle, or null to use the default (spark.serializer).
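If you construct the RDD directly (a DeveloperApi-level use), `setSerializer` lets you route this RDD's shuffle through a specific serializer. A hedged sketch, assuming the `rdd1`, `rdd2`, and `sc` from the example above; the `HashPartitioner(4)` and the choice of Kryo are illustrative assumptions:

```java
import org.apache.spark.HashPartitioner;
import org.apache.spark.rdd.SubtractedRDD;
import org.apache.spark.serializer.KryoSerializer;

import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// The constructor requires ClassTags; from Java they can be obtained like this.
ClassTag<String> kTag = ClassTag$.MODULE$.apply(String.class);
ClassTag<Integer> vTag = ClassTag$.MODULE$.apply(Integer.class);
ClassTag<Integer> wTag = ClassTag$.MODULE$.apply(Integer.class);

SubtractedRDD<String, Integer, Integer> subtracted =
    new SubtractedRDD<String, Integer, Integer>(
            rdd1.rdd(), rdd2.rdd(), new HashPartitioner(4), kTag, vTag, wTag)
        // Use Kryo for this RDD's shuffle instead of the spark.serializer default.
        .setSerializer(new KryoSerializer(sc.getConf()));
```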
public scala.collection.Seq<Dependency<?>> getDependencies()
Implemented by subclasses to return how this RDD depends on parent RDDs.
Overrides: getDependencies in class RDD<scala.Tuple2<K,V>>

public Partition[] getPartitions()
Implemented by subclasses to return the set of partitions in this RDD.
Specified by: getPartitions in class RDD<scala.Tuple2<K,V>>

public scala.Some<Partitioner> partitioner()
Optionally overridden by subclasses to specify how they are partitioned.
Overrides: partitioner in class RDD<scala.Tuple2<K,V>>

public scala.collection.Iterator<scala.Tuple2<K,V>> compute(Partition p, TaskContext context)
:: DeveloperApi ::
Implemented by subclasses to compute a given partition.
Specified by: compute in class RDD<scala.Tuple2<K,V>>

public void clearDependencies()
Clears the dependencies of this RDD.
Overrides: clearDependencies in class RDD<scala.Tuple2<K,V>>
See Also: UnionRDD for an example.