public class SubtractedRDD<K,V,W> extends RDD<scala.Tuple2<K,V>>
It is possible to implement this operation with just `cogroup`, but that is less efficient because all of the entries from `rdd2`, for both matching and non-matching values in `rdd1`, are kept in a `java.util.HashMap` until the end. With this implementation, only the entries from `rdd1` are kept in memory, and the entries from `rdd2` are essentially streamed: each one is touched only once, to decide whether a value needs to be removed. This is particularly helpful when `rdd1` is much smaller than `rdd2`, as you can use `rdd1`'s partitioner/partition size without worrying about running out of memory because of the size of `rdd2`.
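For context, `subtractByKey` on a pair RDD is the usual entry point that produces a `SubtractedRDD` under the hood. A minimal sketch in Java (the class name, app name, `local[2]` master, and sample data are illustrative assumptions, not part of this API):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SubtractByKeyExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SubtractByKeyExample").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaPairRDD<String, Integer> rdd1 = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("c", 3)));
    JavaPairRDD<String, Integer> rdd2 = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("b", 99)));

    // Keep only the entries of rdd1 whose keys do not appear in rdd2.
    // Spark evaluates this with a SubtractedRDD: rdd2 is streamed while
    // only rdd1's entries are buffered per partition.
    JavaPairRDD<String, Integer> result = rdd1.subtractByKey(rdd2);

    System.out.println(result.collect()); // [(a,1), (c,3)]
    sc.stop();
  }
}
```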
| Constructor and Description |
| --- |
| `SubtractedRDD(RDD<? extends scala.Product2<K,V>> rdd1, RDD<? extends scala.Product2<K,W>> rdd2, Partitioner part, scala.reflect.ClassTag<K> evidence$1, scala.reflect.ClassTag<V> evidence$2, scala.reflect.ClassTag<W> evidence$3)` |
| Modifier and Type | Method and Description |
| --- | --- |
| `void` | `clearDependencies()` Clears the dependencies of this RDD. |
| `scala.collection.Iterator<scala.Tuple2<K,V>>` | `compute(Partition p, TaskContext context)` :: DeveloperApi :: Implemented by subclasses to compute a given partition. |
| `scala.collection.Seq<Dependency<?>>` | `getDependencies()` Implemented by subclasses to return how this RDD depends on parent RDDs. |
| `Partition[]` | `getPartitions()` Implemented by subclasses to return the set of partitions in this RDD. |
| `scala.Some<Partitioner>` | `partitioner()` Optionally overridden by subclasses to specify how they are partitioned. |
| `Object` | `rdd1()` |
| `Object` | `rdd2()` |
| `SubtractedRDD<K,V,W>` | `setSerializer(Serializer serializer)` Set a serializer for this RDD's shuffle, or null to use the default (spark.serializer). |
Methods inherited from class org.apache.spark.rdd.RDD:
aggregate, cache, cartesian, checkpoint, checkpointData, coalesce, collect, collect, collectPartitions, computeOrReadCheckpoint, conf, context, count, countApprox, countApproxDistinct, countApproxDistinct, countByValue, countByValueApprox, creationSite, dependencies, distinct, distinct, doCheckpoint, doubleRDDToDoubleRDDFunctions, elementClassTag, filter, filterWith, first, flatMap, flatMapWith, fold, foreach, foreachPartition, foreachWith, getCheckpointFile, getCreationSite, getNarrowAncestors, getStorageLevel, glom, groupBy, groupBy, groupBy, id, intersection, intersection, intersection, isCheckpointed, isEmpty, iterator, keyBy, map, mapPartitions, mapPartitionsWithContext, mapPartitionsWithIndex, mapPartitionsWithSplit, mapWith, markCheckpointed, max, min, name, numericRDDToDoubleRDDFunctions, partitions, persist, persist, pipe, pipe, pipe, preferredLocations, randomSplit, rddToAsyncRDDActions, rddToOrderedRDDFunctions, rddToPairRDDFunctions, rddToSequenceFileRDDFunctions, reduce, repartition, retag, retag, sample, saveAsObjectFile, saveAsTextFile, saveAsTextFile, setName, sortBy, sparkContext, subtract, subtract, subtract, take, takeOrdered, takeSample, toArray, toDebugString, toJavaRDD, toLocalIterator, top, toString, treeAggregate, treeReduce, union, unpersist, zip, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipWithIndex, zipWithUniqueId
Methods inherited from interface org.apache.spark.Logging:
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public Object rdd1()

public Object rdd2()

public SubtractedRDD<K,V,W> setSerializer(Serializer serializer)
Set a serializer for this RDD's shuffle, or null to use the default (spark.serializer).
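If you construct the RDD directly (a DeveloperApi-level use), `setSerializer` lets you route this RDD's shuffle through a specific serializer. A hedged sketch, assuming the `rdd1`, `rdd2`, and `sc` from the example above; the `HashPartitioner(4)` and the choice of Kryo are illustrative assumptions:

```java
import org.apache.spark.HashPartitioner;
import org.apache.spark.rdd.SubtractedRDD;
import org.apache.spark.serializer.KryoSerializer;

import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// The constructor requires ClassTags; from Java they can be obtained like this.
ClassTag<String> kTag = ClassTag$.MODULE$.apply(String.class);
ClassTag<Integer> vTag = ClassTag$.MODULE$.apply(Integer.class);
ClassTag<Integer> wTag = ClassTag$.MODULE$.apply(Integer.class);

SubtractedRDD<String, Integer, Integer> subtracted =
    new SubtractedRDD<String, Integer, Integer>(
            rdd1.rdd(), rdd2.rdd(), new HashPartitioner(4), kTag, vTag, wTag)
        // Use Kryo for this RDD's shuffle instead of the spark.serializer default.
        .setSerializer(new KryoSerializer(sc.getConf()));
```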
public scala.collection.Seq<Dependency<?>> getDependencies()
Implemented by subclasses to return how this RDD depends on parent RDDs.
Overrides: getDependencies in class RDD<scala.Tuple2<K,V>>

public Partition[] getPartitions()
Implemented by subclasses to return the set of partitions in this RDD.
Specified by: getPartitions in class RDD<scala.Tuple2<K,V>>

public scala.Some<Partitioner> partitioner()
Optionally overridden by subclasses to specify how they are partitioned.
Overrides: partitioner in class RDD<scala.Tuple2<K,V>>

public scala.collection.Iterator<scala.Tuple2<K,V>> compute(Partition p, TaskContext context)
:: DeveloperApi ::
Implemented by subclasses to compute a given partition.
Specified by: compute in class RDD<scala.Tuple2<K,V>>

public void clearDependencies()
Clears the dependencies of this RDD.
Overrides: clearDependencies in class RDD<scala.Tuple2<K,V>>
See Also: UnionRDD for an example.