splicemachine.stats module

This module contains statistical functions to help with Machine Learning and data analysis.

class DecisionTreeVisualizer[source]

Visualize a decision tree, either in a code-like format or with graphviz

static add_node(dot, parent, node_hash, root, realroot=False)[source]

Traverse through the .debugString json and generate a graphviz tree

Parameters
  • dot – dot file object

  • parent – not used currently

  • node_hash – unique node id

  • root – the root of tree

  • realroot – whether or not it is the real root, or a recursive root

Returns

static feature_importance(spark, model, dataset, featuresCol='features')[source]

Return a dataframe containing the relative importance of each feature

Parameters
  • model – The Spark Machine Learning model

  • dataset – Spark Dataframe

  • featuresCol – (str) the column containing the feature vector

Returns

dataframe containing importance

static parse(lines)[source]

Parse the lines of a tree debug string into nested JSON blocks

Parameters

lines – lines of the debug string

Returns

block json

static replacer(string, bad, good)[source]

Replace every string in “bad” with the corresponding string in “good”

Parameters
  • string – string to replace in

  • bad – array of strings to replace

  • good – array of strings to replace with

Returns

the string with every replacement applied
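
The pairwise replacement described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the name replace_all and the length check are assumptions made for the example.

```python
def replace_all(string, bad, good):
    """Replace each entry of `bad` with the entry of `good` at the same index.

    Hypothetical stand-in for `replacer`; the length check is an assumption.
    """
    if len(bad) != len(good):
        raise ValueError("bad and good must have the same length")
    for old, new in zip(bad, good):
        # Replacements run in list order, so earlier pairs can affect later ones.
        string = string.replace(old, new)
    return string


renamed = replace_all(
    "feature 0 > 3 and feature 1 <= 2",
    ["feature 0", "feature 1"],
    ["age", "income"],
)
```

Because the replacements run sequentially, a pattern such as "feature 1" would also match inside "feature 10"; ordering longer patterns first is one way to avoid that.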

static tree_json(tree)[source]

Generate a JSON representation of a decision tree

Parameters

tree – tree debug string

Returns

json
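
As a rough sketch of how a tree debug string might be turned into nested JSON, the example below parses If / Else / Predict lines recursively. The sample string, the helper names (parse_debug_lines, tree_to_json) and the choice to wrap the result under the header line are all assumptions for illustration; the library's own output shape may differ.

```python
import json

# Hypothetical debug string in the style printed by Spark decision tree models.
SAMPLE_DEBUG_STRING = """DecisionTreeClassificationModel of depth 2 with 5 nodes
  If (feature 0 <= 0.5)
   Predict: 0.0
  Else (feature 0 > 0.5)
   If (feature 1 <= 1.0)
    Predict: 1.0
   Else (feature 1 > 1.0)
    Predict: 0.0
"""


def _condition(line):
    # Drop the leading "If"/"Else" keyword and the surrounding parentheses.
    return line.split(" ", 1)[1].strip("()")


def parse_debug_lines(lines):
    """Consume `lines` (already stripped) and return a list of nested dicts."""
    block = []
    while lines:
        if lines[0].startswith("If"):
            name = _condition(lines.pop(0))
            block.append({"name": name, "children": parse_debug_lines(lines)})
            # The matching Else branch, if present, follows the If subtree.
            if lines and lines[0].startswith("Else"):
                name = _condition(lines.pop(0))
                block.append({"name": name, "children": parse_debug_lines(lines)})
        elif lines[0].startswith("Else"):
            # This Else belongs to an enclosing If, so return to the caller.
            break
        else:
            # Leaf line such as "Predict: 0.0".
            block.append({"name": lines.pop(0)})
    return block


def tree_to_json(debug_string):
    lines = [ln.strip() for ln in debug_string.splitlines() if ln.strip()]
    header, body = lines[0], lines[1:]
    return json.dumps({"name": header, "children": parse_debug_lines(body)}, indent=2)


tree = json.loads(tree_to_json(SAMPLE_DEBUG_STRING))
```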

static visualize(model, feature_column_names, label_names, size=None, horizontal=False, tree_name='tree', visual=False)[source]

Visualize a decision tree, either in a code-like format or with graphviz

Parameters
  • model – the fitted decision tree classifier

  • feature_column_names – (List[str]) column names for the features. You can get these names from your VectorAssembler (in PySpark) by calling its .getInputCols() function

  • label_names – (List[str]) labels vector (e.g. below avg, above avg)

  • size – tuple(int,int) The size of the graph. If unspecified, graphviz will automatically assign a size

  • horizontal – (Bool) if the tree should be rendered horizontally

  • tree_name – the name you would like to call the tree

  • visual – (bool) True if you want a graphviz PDF containing your tree

Returns

dot – the graphviz object

class IndReconstructer(inputCol=None, outputCol=None)[source]

Transformer to reconstruct String Index from OneHotDummy Columns. This can be used as a part of a Pipeline object.

Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers

Parameters
  • Transformer – Inherited Class

  • HasInputCol – Inherited Class

  • HasOutputCol – Inherited Class

Returns

Transformed PySpark Dataframe With Original String Indexed Variables

class OneHotDummies(inputCol=None, outputCol=None)[source]

Transformer to generate dummy columns for categorical variables as a part of a preprocessing pipeline

Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers

Parameters
  • Transformer – Inherited Classes

  • HasInputCol – Inherited Classes

  • HasOutputCol – Inherited Classes

Returns

pyspark DataFrame
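
To make the relationship between OneHotDummies and IndReconstructer concrete, here is a small plain-Python sketch of the round trip for a single row. The column naming scheme (prefix_index) and both function names are hypothetical; the real transformers operate on PySpark DataFrame columns.

```python
def one_hot_dummies(index, num_categories, prefix):
    """Expand a category index into one dummy column per category."""
    return {
        f"{prefix}_{i}": 1.0 if i == index else 0.0
        for i in range(num_categories)
    }


def reconstruct_index(dummies, prefix):
    """Recover the category index from the dummy column that is set to 1.0."""
    for name, value in dummies.items():
        if name.startswith(prefix + "_") and value == 1.0:
            return int(name[len(prefix) + 1:])
    return None  # no dummy column was set


row = one_hot_dummies(2, 4, "color")
original = reconstruct_index(row, "color")
```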

class OverSampleCrossValidator(estimator, estimatorParamMaps, evaluator, numFolds=3, seed=None, parallelism=3, collectSubModels=False, labelCol='label', altEvaluators=None, overSample=True)[source]

Class to perform Cross Validation model evaluation while over-sampling minority labels.

Example
>>> from pyspark.sql.session import SparkSession
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
...     MulticlassClassificationEvaluator)
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.tuning import ParamGridBuilder
>>> from splicemachine.stats.stats import OverSampleCrossValidator
>>> spark = SparkSession.builder.getOrCreate()
>>> dataset = spark.createDataFrame(
...      [(Vectors.dense([0.0]), 0.0),
...       (Vectors.dense([0.5]), 0.0),
...       (Vectors.dense([0.4]), 1.0),
...       (Vectors.dense([0.6]), 1.0),
...       (Vectors.dense([1.0]), 1.0)] * 10,
...      ["features", "label"])
>>> lr = LogisticRegression()
>>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>>> PRevaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR')
>>> AUCevaluator = BinaryClassificationEvaluator(metricName = 'areaUnderROC')
>>> ACCevaluator = MulticlassClassificationEvaluator(metricName="accuracy")
>>> cv = OverSampleCrossValidator(estimator=lr, estimatorParamMaps=grid,
...     evaluator=AUCevaluator, altEvaluators=[PRevaluator, ACCevaluator],
...     parallelism=2, seed=1234)
>>> cvModel = cv.fit(dataset)
>>> print(cvModel.avgMetrics)
[(0.5, [0.5888888888888888, 0.3888888888888889]), (0.806878306878307,
    [0.8556863149300125, 0.7055555555555556])]
>>> print(AUCevaluator.evaluate(cvModel.transform(dataset)))
0.8333333333333333

class OverSampler(labelCol=None, strategy='auto', randomState=None)[source]

Transformer to oversample datapoints with minority labels

Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers

Parameters
  • Transformer – Inherited Class

  • HasInputCol – Inherited Class

  • HasOutputCol – Inherited Class

Returns

PySpark Dataframe with labels in approximately equal ratios

Example
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.session import SparkSession
>>> from pyspark.ml.linalg import Vectors
>>> from splicemachine.stats.stats import OverSampler
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...      [(Vectors.dense([0.0]), 0.0),
...       (Vectors.dense([0.5]), 0.0),
...       (Vectors.dense([0.4]), 1.0),
...       (Vectors.dense([0.6]), 1.0),
...       (Vectors.dense([1.0]), 1.0)] * 10,
...      ["features", "Class"])
>>> df.groupBy(F.col("Class")).count().orderBy("count").show()
+-----+-----+
|Class|count|
+-----+-----+
|  0.0|   20|
|  1.0|   30|
+-----+-----+
>>> oversampler = OverSampler(labelCol = "Class", strategy = "auto")
>>> oversampler.transform(df).groupBy("Class").count().show()
+-----+-----+
|Class|count|
+-----+-----+
|  0.0|   29|
|  1.0|   30|
+-----+-----+
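
The idea behind oversampling can be sketched without Spark. The example below uses one common strategy, random resampling with replacement until every label matches the largest class; whether OverSampler uses this exact strategy is an assumption, and the function name and arguments are hypothetical. Unlike the approximate ratio in the output above, this sketch balances the counts exactly.

```python
import random
from collections import defaultdict


def oversample(rows, label_of, seed=None):
    """Duplicate randomly chosen minority rows until all labels are equally frequent."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the missing rows with replacement; k is 0 for the majority label.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rows = [("a", 0.0)] * 20 + [("b", 1.0)] * 30
balanced = oversample(rows, label_of=lambda r: r[1], seed=1234)
```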

class Rounder(predictionCol='prediction', labelCol='label', clipPreds=True, maxLabel=None, minLabel=None)[source]

Transformer to round predictions for ordinal regression

Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers

Parameters
  • Transformer – Inherited Class

  • HasInputCol – Inherited Class

  • HasOutputCol – Inherited Class

Returns

Transformed Dataframe with rounded predictionCol

Example
>>> from pyspark.sql.session import SparkSession
>>> from splicemachine.stats.stats import Rounder
>>> spark = SparkSession.builder.getOrCreate()
>>> dataset = spark.createDataFrame(
...      [(0.2, 0.0),
...       (1.2, 1.0),
...       (1.6, 2.0),
...       (1.1, 0.0),
...       (3.1, 0.0)],
...      ["prediction", "label"])
>>> dataset.show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.2|  0.0|
|       1.2|  1.0|
|       1.6|  2.0|
|       1.1|  0.0|
|       3.1|  0.0|
+----------+-----+
>>> rounder = Rounder(predictionCol = "prediction", labelCol = "label",
...     clipPreds = True)
>>> rounder.transform(dataset).show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       1.0|  1.0|
|       2.0|  2.0|
|       1.0|  0.0|
|       2.0|  0.0|
+----------+-----+
>>> rounderNoClip = Rounder(predictionCol = "prediction", labelCol = "label",
...     clipPreds = False)
>>> rounderNoClip.transform(dataset).show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       1.0|  1.0|
|       2.0|  2.0|
|       1.0|  0.0|
|       3.0|  0.0|
+----------+-----+
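
The rounding and clipping behaviour shown in the example can be approximated in plain Python. This is a sketch under assumptions: the function name is hypothetical, the clip bounds default to the smallest and largest observed labels, and Python's round() resolves exact .5 ties to the nearest even integer, which may differ from the transformer.

```python
def round_predictions(predictions, labels, clip=True, min_label=None, max_label=None):
    """Round each prediction and optionally clip it to the label range."""
    lo = min(labels) if min_label is None else min_label
    hi = max(labels) if max_label is None else max_label
    rounded = []
    for p in predictions:
        r = float(round(p))
        if clip:
            # Keep the rounded value inside [lo, hi].
            r = min(max(r, lo), hi)
        rounded.append(r)
    return rounded


preds = [0.2, 1.2, 1.6, 1.1, 3.1]
labels = [0.0, 1.0, 2.0, 0.0, 0.0]
clipped = round_predictions(preds, labels, clip=True)
unclipped = round_predictions(preds, labels, clip=False)
```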

class SpliceBaseEvaluator(spark, evaluator, supported_metrics, predictionCol='prediction', labelCol='label')[source]

Base ModelEvaluator

get_results(as_dict=False)[source]

Get Results

Parameters

as_dict – whether to get results in a dict or not

Returns

dictionary

input(predictions_dataframe)[source]

Input a dataframe

Parameters
  • ev – evaluator class

  • predictions_dataframe – input df

Returns

none

class SpliceBinaryClassificationEvaluator(spark, predictionCol='prediction', labelCol='label', confusion_matrix=True)[source]

A Splice Machine evaluator for Spark Binary Classification models. Implements functions from SpliceBaseEvaluator.

input(predictions_dataframe)[source]

Evaluate actual vs. predicted values in a dataframe

Parameters

predictions_dataframe – the dataframe containing the label and the prediction

plotROC(fittedEstimator, ax)[source]

Plots the receiver operating characteristic curve for the trained classifier

Parameters
  • fittedEstimator – fitted logistic regression model

  • ax – matplotlib axis object

Returns

axis with ROC plot
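
For readers unfamiliar with how an ROC curve is built, the sketch below computes the curve's points and the area under it from raw scores and binary labels. It is illustrative only: plotROC itself takes a fitted estimator and a matplotlib axis, and the function names here are hypothetical. Both classes must be present in the labels, otherwise the rates are undefined.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, highest score first."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last row of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under consecutive (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


perfect = area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
mixed = area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```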

class SpliceMultiClassificationEvaluator(spark, predictionCol='prediction', labelCol='label')[source]

A Splice Machine evaluator for Spark MultiClass models. Implements functions from SpliceBaseEvaluator.

class SpliceRegressionEvaluator(spark, predictionCol='prediction', labelCol='label')[source]

A Splice Machine evaluator for Spark Regression models. Implements functions from SpliceBaseEvaluator.

best_fit_distribution(data, col_name, bins, ax)[source]

Model data by finding best fit distribution to data

Parameters
  • data – DataFrame with one column containing the feature whose distribution is to be investigated

  • col_name – column name for feature

  • bins – number of bins to use in generating the histogram of this data

  • ax – axis to plot histogram on

Returns

(best_distribution.name, best_params, best_sse)

  • best_distribution.name: string of the best distribution name

  • best_params: parameters for this distribution

  • best_sse: sum of squared errors for this distribution against the empirical pdf
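
The selection rule described above (lowest sum of squared errors against the empirical pdf) can be illustrated with the standard library alone. This sketch assumes only two candidate distributions, normal and uniform, and fits them with simple moment or range estimates; the candidate set, the fitting method and all function names are assumptions, not the library's implementation. It also assumes the data is not constant, so the histogram bins have nonzero width.

```python
from statistics import NormalDist, fmean, pstdev


def histogram_density(data, bins):
    """Return bin centers and the normalized histogram (empirical pdf)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # The maximum value falls into the last bin.
        counts[min(int((x - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(data) * width) for c in counts]
    return centers, density


def fit_normal(data):
    mu, sigma = fmean(data), pstdev(data)
    return (mu, sigma), NormalDist(mu, sigma).pdf


def fit_uniform(data):
    lo, hi = min(data), max(data)
    return (lo, hi), lambda x: 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def best_fit(data, bins, candidates):
    """Return (name, params, sse) for the candidate with the lowest SSE."""
    centers, density = histogram_density(data, bins)
    best = None
    for name, fit in candidates.items():
        params, pdf = fit(data)
        sse = sum((d - pdf(c)) ** 2 for c, d in zip(centers, density))
        if best is None or sse < best[2]:
            best = (name, params, sse)
    return best


# Deterministic, normally distributed sample built from evenly spaced quantiles.
sample = [NormalDist().inv_cdf((i + 0.5) / 1000) for i in range(1000)]
result = best_fit(sample, bins=20, candidates={"normal": fit_normal, "uniform": fit_uniform})
```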

estimateCovariance(df, features_col='features')[source]

Compute the covariance matrix for a given dataframe.

Parameters
  • df – PySpark dataframe

  • features_col – name of the column with the features, defaults to ‘features’

Returns

np.ndarray: A multi-dimensional array where the number of rows and columns both equal the length of the arrays in the input dataframe.

Note

The multi-dimensional covariance array should be calculated using outer products. Don’t forget to normalize the data by first subtracting the mean.
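
The note above can be read as the following recipe, shown here in plain Python rather than on a PySpark dataframe: subtract the per-feature mean from every row, take the outer product of each centered row with itself, and average the results. Dividing by the number of rows (rather than by n - 1) is an assumption of this sketch, as is the function name.

```python
def estimate_covariance(rows):
    """Covariance matrix of `rows` as the mean of outer products of centered rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(d)]

    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        centered = [row[j] - mean[j] for j in range(d)]
        # Accumulate the outer product centered * centered^T.
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j]

    # Assumption: normalize by n (population covariance).
    return [[value / n for value in line] for line in cov]


cov = estimate_covariance([[1.0, 2.0], [3.0, 6.0]])
```

For these two rows the mean is [2, 4], the centered rows are [-1, -2] and [1, 2], and the result is a 2 x 2 matrix, matching the shape described under Returns.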

get_confusion_matrix(spark, TP, TN, FP, FN)[source]

Creates and returns a confusion matrix

Parameters
  • TP – True Positives

  • TN – True Negatives

  • FP – False Positives

  • FN – False Negatives

Returns

Spark DataFrame

get_string_pipeline(df, cols_to_exclude, steps=['StringIndexer', 'OneHotEncoder', 'OneHotDummies'])[source]

Generates a list of preprocessing stages

Parameters
  • df – DataFrame including only the training data

  • cols_to_exclude – Column names we don’t want to include in the preprocessing (e.g. SUBJECT / target column)

  • steps – preprocessing steps to take

Returns

(stages, Numeric_Columns)

  • stages: list of pipeline stages to be used in preprocessing

  • Numeric_Columns: list of columns that contain numeric features

inspectTable(spliceMLCtx, sql, topN=5)[source]

Inspect the values of the columns of the table (dataframe) returned from the sql query

Parameters
  • spliceMLCtx – SpliceMLContext

  • sql – sql string to execute

  • topN – the number of most frequent elements of a column to return, defaults to 5

make_pdf(dist, params, size=10000)[source]

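Generate a distribution’s probability density function

Parameters
  • dist – the distribution to evaluate

  • params – parameters for this distribution

  • size – number of points to generate, defaults to 10000

Returns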

series of probability density function for this distribution

pca_with_scores(df, k=10)[source]

Computes the top k principal components, corresponding scores, and all eigenvalues.

Parameters
  • df – A Spark dataframe with a ‘features’ column whose values are DenseVectors.

  • k – The number of principal components to return, defaults to 10

Returns

(eigenvectors, RDD of scores, eigenvalues)

  • Eigenvectors: multi-dimensional array where the number of rows equals the length of the arrays in the input RDD and the number of columns equals k.

  • RDD of scores: has the same number of rows as data and consists of arrays of length k.

  • Eigenvalues: array of length d (the number of features).

Note

All eigenvalues should be returned in sorted order (largest to smallest). eigh returns each eigenvector as a column. This function should also return eigenvectors as columns.

postprocessing_pipeline(df, cols_to_exclude)[source]

Assemble postprocessing pipeline to reconstruct original categorical indexed values from OneHotDummy Columns

Parameters
  • df – DataFrame Including the original string Columns

  • cols_to_exclude – list of columns to exclude

Returns

(reconstructers, String_Columns)

  • reconstructers: list of IndReconstructer stages

  • String_Columns: list of columns that are being reconstructed

reconstructPCA(sql, df, pc, mean, std, originalColumns, fits, pcaColumn='pcaFeatures')[source]

Reconstruct data from lower dimensional space after performing PCA

Parameters
  • sql – SQLContext

  • df – PySpark DataFrame: inputted PySpark DataFrame

  • pc – numpy.ndarray: principal components projected onto

  • mean – numpy.ndarray: mean of original columns

  • std – numpy.ndarray: standard deviation of original columns

  • originalColumns – list: original column names

  • fits – fits of features returned from best_fit_distribution

  • pcaColumn – column in df that contains PCA features, defaults to ‘pcaFeatures’

Returns

dataframe containing reconstructed data
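
The core of a PCA reconstruction can be sketched as a back-projection followed by undoing the standardization. This sketch assumes the data was standardized as (x - mean) / std before PCA and that pc holds one principal component per column (d rows, k columns); it ignores the fits argument entirely, and the function name is hypothetical.

```python
def reconstruct(scores, pc, mean, std):
    """Map k-dimensional scores back to d original features.

    Assumes x_standardized is approximately scores @ pc^T and
    x = x_standardized * std + mean.
    """
    d, k = len(pc), len(pc[0])
    rows = []
    for s in scores:
        # Back-project: one value per original feature.
        standardized = [sum(s[j] * pc[i][j] for j in range(k)) for i in range(d)]
        # Undo the standardization.
        rows.append([z * std[i] + mean[i] for i, z in enumerate(standardized)])
    return rows


full = reconstruct([[1.0, -1.0]], [[1.0, 0.0], [0.0, 1.0]], mean=[10.0, 20.0], std=[2.0, 4.0])
reduced = reconstruct([[2.0]], [[1.0], [0.0]], mean=[1.0, 1.0], std=[1.0, 1.0])
```

With fewer components than features (k < d), the reconstruction is only an approximation of the original data.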

varianceExplained(df, k=10)[source]

Returns the proportion of variance explained by k principal components. Calls the PCA procedure above (pca_with_scores).

Parameters
  • df – PySpark DataFrame

  • k – number of principal components, defaults to 10

Returns

(proportion, principal_components, scores, eigenvalues)
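
The proportion itself is the sum of the k largest eigenvalues divided by the sum of all eigenvalues. A minimal sketch, assuming the eigenvalues are already available as a list of non-negative numbers (the function name is hypothetical):

```python
def variance_explained(eigenvalues, k):
    """Share of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


top_two = variance_explained([4.0, 3.0, 2.0, 1.0], k=2)
everything = variance_explained([4.0, 3.0, 2.0, 1.0], k=4)
```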

vector_assembler_pipeline(df, columns, doPCA=False, k=10)[source]

After preprocessing String Columns, this function can be used to assemble a feature vector to be used for learning. It creates the following stages: VectorAssembler -> StandardScaler [-> PCA]

Parameters
  • df – DataFrame containing preprocessed Columns

  • columns – list of Column names of the preprocessed columns

  • doPCA – Do you want to do PCA as part of the vector assembler? defaults to False

  • k – Number of Principal Components to use, defaults to 10

Returns

List of vector assembling stages