Question

I am using Apache Spark to perform sentiment analysis, with a Naive Bayes model to classify the text. I don't know how to find the probability of the labels. I would be grateful for a Python snippet that shows how to get the label probabilities.


Solution

Once the model is trained, the probabilities for the test dataset can be found by transforming it: if your trained Naive Bayes model is model, then model.transform(test) contains a probability column. The code below walks through the probability column (and other useful columns) for the iris dataset.
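The walkthrough below assumes irisdf is already a Spark DataFrame with the four measurement columns and a Species string column. A minimal sketch of how it might be loaded (the file path and CSV layout are assumptions, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iris-nb").getOrCreate()

# Read a local iris CSV; the path and header layout are assumptions
irisdf = spark.read.csv("iris.csv", header=True, inferSchema=True) \
    .toDF("SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species")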

First, partition the dataset randomly into training and test sets, setting a seed for reproducibility:

(trainingData, testData) = irisdf.randomSplit([0.7, 0.3], seed = 100)

trainingData.cache()
testData.cache()

print(trainingData.count())
print(testData.count())

Output:

103
47

Next, we will use VectorAssembler to merge our feature columns into a single vector column, which we will then pass to the Naive Bayes model. We do not transform the dataset just yet, because the VectorAssembler will be one of the stages in our ML Pipeline.

from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"], outputCol="features")
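The pipeline below also uses a labelIndexer stage to convert the Species string into the numeric label column that appears in the schema later on. A minimal sketch of that stage, assuming the Species column name from above:

from pyspark.ml.feature import StringIndexer

# Map the Species string ("setosa", "versicolor", "virginica") to a numeric label
labelIndexer = StringIndexer(inputCol="Species", outputCol="label")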

The iris dataset has three classes: setosa, versicolor and virginica. So let's create a multiclass Naive Bayes classifier using the pyspark.ml library.

from pyspark.ml.classification import NaiveBayes
from pyspark.ml import Pipeline

# Train a NaiveBayes model
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# Chain labelIndexer, vecAssembler and NBmodel in a pipeline
pipeline = Pipeline(stages=[labelIndexer, vecAssembler, nb])

# Run stages in pipeline and train model
model = pipeline.fit(trainingData)

Now analyse the fitted model by making predictions on the test data:

predictions = model.transform(testData)
# Display what results we can view
predictions.printSchema()

Output

root
 |-- SepalLength: double (nullable = true)
 |-- SepalWidth: double (nullable = true)
 |-- PetalLength: double (nullable = true)
 |-- PetalWidth: double (nullable = true)
 |-- Species: string (nullable = true)
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)
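
Each value in the probability column is a vector of per-class probabilities, ordered by the numeric label index produced by the labelIndexer. A small sketch of how you could recover which species each position corresponds to (this assumes the labelIndexer is the first stage of the pipeline, as above):

# The fitted StringIndexerModel is the first stage of the PipelineModel
labelIndexerModel = model.stages[0]

# labels[i] is the Species string for label index i,
# i.e. for position i in every probability vector
print(labelIndexerModel.labels)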

You can also select particular columns to view:

# Display selected columns only (display() is a Databricks notebook helper)
display(predictions.select("label", "prediction", "probability"))

The above will show the results in tabular format.
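If you want the raw probability values outside of a Databricks notebook, show() works anywhere, and on Spark 3.0+ you can unpack the probability vector into a plain array column. A hedged sketch (vector_to_array requires Spark 3.0 or later):

from pyspark.ml.functions import vector_to_array

# Print a few rows together with their full probability vectors
predictions.select("label", "prediction", "probability").show(5, truncate=False)

# Spark 3.0+: turn the probability vector into an ordinary array column
# so individual class probabilities can be selected or exported
probs = predictions.withColumn("prob_array", vector_to_array("probability"))
probs.select("prediction", "prob_array").show(5, truncate=False)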
