LDAModel

class pyspark.mllib.clustering.LDAModel(java_model: py4j.java_gateway.JavaObject)
A clustering model derived from the LDA method.

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology
- “word” = “term”: an element of the vocabulary
- “token”: instance of a term appearing in a document
- “topic”: multinomial distribution over words representing some concept
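The difference between a term and a token can be illustrated with a small pure-Python sketch. It assumes a toy corpus that is already tokenized; the names `documents`, `vocabulary`, and `term_counts` are hypothetical and not part of the PySpark API. The resulting count vectors have one slot per vocabulary term, which is the kind of per-document vector that `LDA.train` is given (paired with a document ID) in the Examples section.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
documents = [
    ["spark", "lda", "spark"],  # 3 tokens, but only 2 distinct terms
    ["topic", "lda"],
]

# The vocabulary holds each distinct term ("word") once, in a fixed order.
vocabulary = sorted({token for doc in documents for token in doc})


def term_counts(doc):
    """Return a dense count vector with one slot per vocabulary term."""
    counts = Counter(doc)
    return [float(counts[term]) for term in vocabulary]


# One count vector per document; the entries of a vector sum to the
# number of tokens in that document.
vectors = [term_counts(doc) for doc in documents]
```

Here the first document contains three tokens drawn from two terms, so its vector sums to 3.0 while only two of its entries are non-zero.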
New in version 1.5.0.

Notes

See the original LDA paper (journal version) [1].

[1] Blei, D. et al. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (2003): 993-1022. https://www.jmlr.org/papers/v3/blei03a
Examples

>>> # `sc` is assumed to be an existing SparkContext
>>> from numpy import array
>>> from numpy.testing import assert_almost_equal, assert_equal
>>> from pyspark.mllib.clustering import LDA, LDAModel
>>> from pyspark.mllib.linalg import SparseVector, Vectors
>>> data = [
...     [1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
>>> model.vocabSize()
2
>>> model.describeTopics()
[([1, 0], [0.5..., 0.49...]), ([0, 1], [0.5..., 0.49...])]
>>> model.describeTopics(1)
[([1], [0.5...]), ([0], [0.5...])]

>>> topics = model.topicsMatrix()
>>> topics_expect = array([[0.5, 0.5], [0.5, 0.5]])
>>> assert_almost_equal(topics, topics_expect, 1)

>>> import os, tempfile
>>> from shutil import rmtree
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = LDAModel.load(sc, path)
>>> assert_equal(sameModel.topicsMatrix(), model.topicsMatrix())
>>> sameModel.vocabSize() == model.vocabSize()
True
>>> try:
...     rmtree(path)
... except OSError:
...     pass

Methods

- call(name, *a): Call method of java_model
- describeTopics([maxTermsPerTopic]): Return the topics described by weighted terms.
- load(sc, path): Load the LDAModel from disk.
- save(sc, path): Save this model to the given path.
- topicsMatrix(): Inferred topics, where each topic is represented by a distribution over terms.
- vocabSize(): Vocabulary size (number of terms in the vocabulary)

Methods Documentation
call(name: str, *a: Any) → Any
Call a method of java_model by name.
describeTopics(maxTermsPerTopic: Optional[int] = None) → List[Tuple[List[int], List[float]]]
Return the topics described by weighted terms.

New in version 1.6.0.

Warning: If vocabSize and k are large, this can return a large object!

Parameters
- maxTermsPerTopic : int, optional
  Maximum number of terms to collect for each topic. (default: vocabulary size)

Returns
- list
  Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic’s terms are sorted in order of decreasing weight.
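The return structure described above is a list of `(term indices, term weights)` pairs, so turning it into readable output only requires a lookup from index to term. The following is a minimal sketch under stated assumptions: `label_topics` and `vocabulary` are hypothetical names, the model itself does not store term strings, and the `described` value is a hand-written stand-in rather than real model output.

```python
from typing import List, Tuple


def label_topics(
    described: List[Tuple[List[int], List[float]]],
    vocabulary: List[str],
) -> List[List[Tuple[str, float]]]:
    """Pair each term index with its term, keeping the order as given."""
    labelled = []
    for term_indices, term_weights in described:
        labelled.append(
            [(vocabulary[i], w) for i, w in zip(term_indices, term_weights)]
        )
    return labelled


# Hand-written stand-in for a describeTopics() result (not real model output).
described = [([1, 0], [0.6, 0.4]), ([0, 1], [0.7, 0.3])]
vocabulary = ["apple", "banana"]  # hypothetical index -> term mapping

labelled = label_topics(described, vocabulary)
```

Because each topic's terms already arrive sorted by decreasing weight, the helper preserves the incoming order instead of sorting again.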
 
 

classmethod load(sc: pyspark.context.SparkContext, path: str) → pyspark.mllib.clustering.LDAModel
Load the LDAModel from disk.

New in version 1.5.0.

Parameters
- sc : pyspark.SparkContext
- path : str
  Path to where the model is stored.

save(sc: pyspark.context.SparkContext, path: str) → None
Save this model to the given path.

New in version 1.3.0.