Apache Spark is very easy to apply machine learning algorithms to your own data analysis work. Its pipeline design references from scikit-learn, one of the famous Python ML Ecosystem Projects(which also includes NumPy, pandas, matplotlib, IPython). And Spark’s other features can also be found on above list.

Here, I may give brief to its ML pipeline module with some example codes, which are referenced by Berkeley open class in edX: BerkeleyX: CS110x Big Data Analysis with Apache Spark.

In order to finish a process of machine learning, we need to prepare two parts:

  • Data, which is used to train and test.
  • Model, which is the meat of the pattern underlying.

And the data part, for the supervised machine learning, is constituted by features and label. Spark’s ML pipeline will separate these parts to give more flexibility for users combining different data, model situations.


Specified features format, Vectors, using pyspark.ml.feature.VectorAssemble() method. E.g.

from pyspark.ml.feature import VectorAssembler
datasetDF = sqlContext.table("power_plant")
vectorizer = VectorAssembler()
vectorizer.setInputCols(["AT", "V", "AP", "RH"])

Test Data Applying

Use ML pipeline for test data switch

# Apply our LR model to the test data and predict power output

predictionsAndLabelsDF = lrModel.transform(testSetDF).select("AT", "V", "AP", "RH", "PE", "Predicted_PE")


Model handle in Spark, the evaluator:

# Now let's compute an evaluation metric for our test dataset
from pyspark.ml.evaluation import RegressionEvaluator

# Create an RMSE evaluator using the label and predicted columns
regEval = RegressionEvaluator(predictionCol="Predicted_PE", labelCol="PE", metricName="rmse")

# Run the evaluator on the DataFrame
rmse = regEval.evaluate(predictionsAndLabelsDF)

# Now let's compute another evaluation metric for our test dataset
r2 = regEval.evaluate(predictionsAndLabelsDF, {regEval.metricName: "r2"})

Model tuning and selection

Model selection is an important part to determine which pattern you may use to work. For example, we often use cross-validation to determine our linear regression model. In Spark, use specified format to store alternative parameters set.

  1. Get the tool, e.g. CrossValidator().
  2. Set the alternative parameters data set.
  3. Set the parameter Grid by using above alternative parameters data set.
  4. Use the model selection tool to find and return the best model.
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# We can reuse the RegressionEvaluator, regEval, to judge the model based on the best Root Mean Squared Error
# Let's create our CrossValidator with 3 fold cross validation
crossval = CrossValidator(estimator=lrPipeline, evaluator=regEval, numFolds=3)

# Let's tune over our regularization parameter from 0.01 to 0.10
regParam = [x / 100.0 for x in range(1, 11)]

# We'll create a paramter grid using the ParamGridBuilder, and add the grid to the CrossValidator
paramGrid = (ParamGridBuilder()
                 .addGrid(lr.regParam, regParam)

# Now let's find and return the best model
cvModel = crossval.fit(trainingSetDF).bestModel