splicemachine.stats module¶

This module contains statistical functions to help with Machine Learning and data analysis.

class DecisionTreeVisualizer[source]¶

Visualize a decision tree, either in a code-like format or with graphviz

static add_node(dot, parent, node_hash, root, realroot=False)[source]¶

Traverse through the .debugString json and generate a graphviz tree

Parameters

dot – dot file object
parent – not used currently
node_hash – unique node id
root – the root of tree
realroot – whether or not it is the real root, or a recursive root

Returns

static feature_importance(spark, model, dataset, featuresCol='features')[source]¶

Return a dataframe containing the relative importance of each feature

Parameters

model – The Spark Machine Learning model
dataset – Spark DataFrame
featuresCol – (str) the column containing the feature vector, defaults to ‘features’

Returns

dataframe containing importance

static parse(lines)[source]¶

Parse the lines of a tree debug string into nested blocks

Parameters: lines – lines of the debug string
Returns: block json

static replacer(string, bad, good)[source]¶

Replace every string in “bad” with the corresponding string in “good”

Parameters

string – string to replace in
bad – array of strings to replace
good – array of strings to replace with

Returns
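
The pairwise replacement described above can be sketched in plain Python. This is a minimal illustration, assuming bad and good are equal-length lists whose entries are matched by position; replace_all is a hypothetical stand-in and not the library's implementation.

```python
def replace_all(string, bad, good):
    """Replace each entry of `bad` with the entry of `good` at the same index."""
    if len(bad) != len(good):
        raise ValueError("bad and good must have the same length")
    # Pairs are applied one after another, so a later pair also sees
    # the text produced by earlier replacements.
    for old, new in zip(bad, good):
        string = string.replace(old, new)
    return string


cleaned = replace_all(
    "If (feature 0 <= 0.5)",
    ["feature 0", "<="],
    ["age", "is at most"],
)
print(cleaned)  # If (age is at most 0.5)
```

Because the pairs are applied in order, the result can depend on the order of the entries when one replacement produces text that a later pair matches.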

static tree_json(tree)[source]¶

Generate a JSON representation of a decision tree

Parameters: tree – tree debug string
Returns: json

static visualize(model, feature_column_names, label_names, size=None, horizontal=False, tree_name='tree', visual=False)[source]¶

Visualize a decision tree, either in a code-like format or with graphviz

Parameters

model – the fitted decision tree classifier
feature_column_names – (List[str]) column names for features. You can access these feature names by calling .getInputCols() on your VectorAssembler (in PySpark)
label_names – (List[str]) names of the labels, for example (below avg, above avg)
size – tuple(int,int) The size of the graph. If unspecified, graphviz will automatically assign a size
horizontal – (Bool) if the tree should be rendered horizontally
tree_name – the name you would like to call the tree
visual – (bool) True if you want a graphviz PDF of the tree

Returns

dot – the graphviz object

class IndReconstructer(inputCol=None, outputCol=None)[source]¶

Transformer to reconstruct String Index from OneHotDummy Columns. This can be used as a part of a Pipeline object

Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers

Parameters

Transformer – Inherited Class
HasInputCol – Inherited Class
HasOutputCol – Inherited Class

Returns

Transformed PySpark Dataframe With Original String Indexed Variables

class OneHotDummies(inputCol=None, outputCol=None)[source]¶

Transformer to generate dummy columns for categorical variables as a part of a preprocessing pipeline

Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers

Parameters

Transformer – Inherited Classes
HasInputCol – Inherited Classes
HasOutputCol – Inherited Classes

Returns

pyspark DataFrame

class OverSampleCrossValidator(estimator, estimatorParamMaps, evaluator, numFolds=3, seed=None, parallelism=3, collectSubModels=False, labelCol='label', altEvaluators=None, overSample=True)[source]¶

Class to perform Cross Validation model evaluation while over-sampling minority labels.

Example

>>> from pyspark.sql.session import SparkSession
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
...     MulticlassClassificationEvaluator)
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.tuning import ParamGridBuilder
>>> from splicemachine.stats.stats import OverSampleCrossValidator
>>> spark = SparkSession.builder.getOrCreate()
>>> dataset = spark.createDataFrame(
...      [(Vectors.dense([0.0]), 0.0),
...       (Vectors.dense([0.5]), 0.0),
...       (Vectors.dense([0.4]), 1.0),
...       (Vectors.dense([0.6]), 1.0),
...       (Vectors.dense([1.0]), 1.0)] * 10,
...      ["features", "label"])
>>> lr = LogisticRegression()
>>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>>> PRevaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR')
>>> AUCevaluator = BinaryClassificationEvaluator(metricName = 'areaUnderROC')
>>> ACCevaluator = MulticlassClassificationEvaluator(metricName="accuracy")
>>> cv = OverSampleCrossValidator(estimator=lr, estimatorParamMaps=grid,
...     evaluator=AUCevaluator, altEvaluators=[PRevaluator, ACCevaluator],
...     parallelism=2, seed=1234)
>>> cvModel = cv.fit(dataset)
>>> print(cvModel.avgMetrics)
[(0.5, [0.5888888888888888, 0.3888888888888889]), (0.806878306878307,
    [0.8556863149300125, 0.7055555555555556])]
>>> print(AUCevaluator.evaluate(cvModel.transform(dataset)))
0.8333333333333333

class OverSampler(labelCol=None, strategy='auto', randomState=None)[source]¶

Transformer to oversample datapoints with minority labels

Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers

Parameters

Transformer – Inherited Class
HasInputCol – Inherited Class
HasOutputCol – Inherited Class

Returns

PySpark Dataframe with labels in approximately equal ratios

Example

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.session import SparkSession
>>> from pyspark.ml.linalg import Vectors
>>> from splicemachine.stats.stats import OverSampler
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...      [(Vectors.dense([0.0]), 0.0),
...       (Vectors.dense([0.5]), 0.0),
...       (Vectors.dense([0.4]), 1.0),
...       (Vectors.dense([0.6]), 1.0),
...       (Vectors.dense([1.0]), 1.0)] * 10,
...      ["features", "Class"])
>>> df.groupBy(F.col("Class")).count().orderBy("count").show()
+-----+-----+
|Class|count|
+-----+-----+
|  0.0|   20|
|  1.0|   30|
+-----+-----+
>>> oversampler = OverSampler(labelCol = "Class", strategy = "auto")
>>> oversampler.transform(df).groupBy("Class").count().show()
+-----+-----+
|Class|count|
+-----+-----+
|  0.0|   29|
|  1.0|   30|
+-----+-----+
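
The idea behind oversampling can be sketched without Spark. The following is a minimal illustration that assumes random resampling with replacement; the function name oversample and its arguments are hypothetical, and the library's own strategy may differ (its output above is only approximately balanced, while this sketch balances exactly).

```python
import random
from collections import Counter


def oversample(rows, label_index, seed=None):
    """Resample minority-label rows with replacement until every label
    has as many rows as the most frequent label."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_index], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the shortfall from the existing rows of this label.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Same label distribution as the example above: 20 rows of 0.0, 30 rows of 1.0.
rows = [(0.0, 0.0), (0.5, 0.0), (0.4, 1.0), (0.6, 1.0), (1.0, 1.0)] * 10
balanced = oversample(rows, label_index=1, seed=1234)
counts = Counter(row[1] for row in balanced)
print(counts[0.0], counts[1.0])  # 30 30
```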

class Rounder(predictionCol='prediction', labelCol='label', clipPreds=True, maxLabel=None, minLabel=None)[source]¶

Transformer to round predictions for ordinal regression

Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers

Parameters

Transformer – Inherited Class
HasInputCol – Inherited Class
HasOutputCol – Inherited Class

Returns

Transformed Dataframe with rounded predictionCol

Example

>>> from pyspark.sql.session import SparkSession
>>> from splicemachine.stats.stats import Rounder
>>> spark = SparkSession.builder.getOrCreate()
>>> dataset = spark.createDataFrame(
...      [(0.2, 0.0),
...       (1.2, 1.0),
...       (1.6, 2.0),
...       (1.1, 0.0),
...       (3.1, 0.0)],
...      ["prediction", "label"])
>>> dataset.show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.2|  0.0|
|       1.2|  1.0|
|       1.6|  2.0|
|       1.1|  0.0|
|       3.1|  0.0|
+----------+-----+
>>> rounder = Rounder(predictionCol="prediction", labelCol="label",
...     clipPreds=True)
>>> rounder.transform(dataset).show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       1.0|  1.0|
|       2.0|  2.0|
|       1.0|  0.0|
|       2.0|  0.0|
+----------+-----+
>>> rounderNoClip = Rounder(predictionCol="prediction", labelCol="label",
...     clipPreds=False)
>>> rounderNoClip.transform(dataset).show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       1.0|  1.0|
|       2.0|  2.0|
|       1.0|  0.0|
|       3.0|  0.0|
+----------+-----+
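
The two outputs above differ only in the last row, where 3.1 is clipped to the largest label. A minimal plain-Python sketch of that round-then-clip logic follows. It assumes that, when maxLabel and minLabel are not given, the bounds are taken from the observed labels; round_predictions is a hypothetical helper, not the library's code.

```python
def round_predictions(predictions, labels, clip=True, max_label=None, min_label=None):
    """Round each prediction to the nearest integer and optionally clip it
    to the range of valid labels."""
    # Assumption: missing bounds default to the extremes of the label column.
    upper = max(labels) if max_label is None else max_label
    lower = min(labels) if min_label is None else min_label
    rounded = []
    for prediction in predictions:
        # Python's round() resolves exact .5 ties to the even integer.
        value = float(round(prediction))
        if clip:
            value = min(max(value, lower), upper)
        rounded.append(value)
    return rounded


predictions = [0.2, 1.2, 1.6, 1.1, 3.1]
labels = [0.0, 1.0, 2.0, 0.0, 0.0]
clipped = round_predictions(predictions, labels, clip=True)
unclipped = round_predictions(predictions, labels, clip=False)
print(clipped)    # [0.0, 1.0, 2.0, 1.0, 2.0]
print(unclipped)  # [0.0, 1.0, 2.0, 1.0, 3.0]
```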

class SpliceBaseEvaluator(spark, evaluator, supported_metrics, predictionCol='prediction', labelCol='label')[source]¶

Base ModelEvaluator

get_results(as_dict=False)[source]¶

Get Results

Parameters: as_dict – whether to get results in a dict or not
Returns: dictionary

input(predictions_dataframe)[source]¶

Input a dataframe

Parameters

ev – evaluator class
predictions_dataframe – input df

Returns

none

class SpliceBinaryClassificationEvaluator(spark, predictionCol='prediction', labelCol='label', confusion_matrix=True)[source]¶

A Splice Machine evaluator for Spark Binary Classification models. Implements functions from SpliceBaseEvaluator.

input(predictions_dataframe)[source]¶

Evaluate actual vs Predicted in a dataframe

Parameters: predictions_dataframe – the dataframe containing the label and the prediction

plotROC(fittedEstimator, ax)[source]¶

Plots the receiver operating characteristic curve for the trained classifier

Parameters

fittedEstimator – fitted logistic regression model
ax – matplotlib axis object

Returns

axis with ROC plot

class SpliceMultiClassificationEvaluator(spark, predictionCol='prediction', labelCol='label')[source]¶

A Splice Machine evaluator for Spark MultiClass models. Implements functions from SpliceBaseEvaluator.

class SpliceRegressionEvaluator(spark, predictionCol='prediction', labelCol='label')[source]¶

A Splice Machine evaluator for Spark Regression models. Implements functions from SpliceBaseEvaluator.

best_fit_distribution(data, col_name, bins, ax)[source]¶

Model data by finding best fit distribution to data

Parameters

data – DataFrame with one column containing the feature whose distribution is to be investigated
col_name – column name for feature
bins – number of bins to use in generating the histogram of this data
ax – axis to plot histogram on

Returns

(best_distribution.name, best_params, best_sse)

best_distribution.name: string of the best distribution name
best_params: parameters for this distribution
best_sse: sum of squared errors for this distribution against the empirical pdf

estimateCovariance(df, features_col='features')[source]¶

Compute the covariance matrix for a given dataframe.

Parameters

df – PySpark dataframe
features_col – name of the column with the features, defaults to ‘features’

Returns

np.ndarray: A multi-dimensional array where the number of rows and columns both equal the length of the arrays in the input dataframe.

Note

The multi-dimensional covariance array should be calculated using outer products. Don’t forget to normalize the data by first subtracting the mean.
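
The note above can be illustrated with a small plain-Python sketch that subtracts the mean and accumulates outer products. It works on a list of lists rather than a Spark dataframe, and it assumes division by the number of rows n (population covariance); both the helper name estimate_covariance and that normalization are assumptions, not confirmed library behavior.

```python
def estimate_covariance(rows):
    """Covariance matrix of equal-length feature rows, computed as the
    mean of the outer products of the mean-centered rows."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        centered = [row[j] - means[j] for j in range(d)]
        # Add the outer product centered * centered^T to the running sum.
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j]
    # Assumption: normalize by n, not n - 1.
    return [[value / n for value in cov_row] for cov_row in cov]


data = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
cov = estimate_covariance(data)
print(cov)
```

The result is a d x d matrix, matching the documented return shape, and it is symmetric because each outer product is symmetric.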

get_confusion_matrix(spark, TP, TN, FP, FN)[source]¶

Creates and returns a confusion matrix

Parameters

TP – True Positives
TN – True Negatives
FP – False Positives
FN – False Negatives

Returns

Spark DataFrame
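
For readers who need to obtain the four counts first, here is a minimal plain-Python sketch that derives TP, TN, FP and FN from label and prediction lists and arranges them in a 2x2 layout. The helper confusion_counts and the row/column order of the matrix are assumptions made for illustration; the layout of the DataFrame returned by the library is not specified here.

```python
def confusion_counts(labels, predictions, positive=1.0):
    """Count true/false positives and negatives for a binary problem."""
    tp = tn = fp = fn = 0
    for label, prediction in zip(labels, predictions):
        if prediction == positive:
            if label == positive:
                tp += 1
            else:
                fp += 1
        else:
            if label == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


labels = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
predictions = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
tp, tn, fp, fn = confusion_counts(labels, predictions)

# Assumed layout: rows are actual classes, columns are predicted classes.
matrix = [[tp, fn],
          [fp, tn]]
print(matrix)  # [[2, 1], [1, 2]]
```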

get_string_pipeline(df, cols_to_exclude, steps=['StringIndexer', 'OneHotEncoder', 'OneHotDummies'])[source]¶

Generates a list of preprocessing stages

Parameters

df – DataFrame including only the training data
cols_to_exclude – Column names we don’t want to include in the preprocessing (e.g. SUBJECT / target column)
steps – preprocessing steps to take

Returns

(stages, Numeric_Columns)

stages: list of pipeline stages to be used in preprocessing
Numeric_Columns: list of columns that contain numeric features

inspectTable(spliceMLCtx, sql, topN=5)[source]¶

Inspect the values of the columns of the table (dataframe) returned from the sql query

Parameters

spliceMLCtx – SpliceMLContext
sql – sql string to execute
topN – the number of most frequent elements of a column to return, defaults to 5
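
The per-column "most frequent elements" summary controlled by topN can be sketched with the standard library. This illustration works on in-memory tuples instead of a query result, and top_n_per_column is a hypothetical helper; it is not how the library inspects a Spark dataframe.

```python
from collections import Counter


def top_n_per_column(rows, column_names, top_n=5):
    """Return the top_n most frequent values (with counts) for each column."""
    summary = {}
    for index, name in enumerate(column_names):
        counts = Counter(row[index] for row in rows)
        summary[name] = counts.most_common(top_n)
    return summary


rows = [("NY", 1), ("CA", 2), ("NY", 2), ("TX", 2), ("NY", 3)]
summary = top_n_per_column(rows, ["state", "visits"], top_n=2)
for name, values in summary.items():
    print(name, values)
```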

make_pdf(dist, params, size=10000)[source]¶

Generate a distribution’s probability density function

Parameters

dist – scipy.stats distribution object: https://docs.scipy.org/doc/scipy/reference/stats.html
params – distribution parameters
size – how many data points to generate, defaults to 10000

Returns

series of probability density function for this distribution

pca_with_scores(df, k=10)[source]¶

Computes the top k principal components, corresponding scores, and all eigenvalues.

Parameters

df – A Spark dataframe with a ‘features’ column, which (column) consists of DenseVectors.
k – The number of principal components to return, defaults to 10

Returns

(eigenvectors, RDD of scores, eigenvalues)

Eigenvectors: multi-dimensional array where the number of rows equals the length of the arrays in the input RDD and the number of columns equals k.
RDD of scores: has the same number of rows as the data and consists of arrays of length k.
Eigenvalues: array of length d (the number of features).

Note

All eigenvalues should be returned in sorted order (largest to smallest). eigh returns each eigenvector as a column. This function should also return eigenvectors as columns.

postprocessing_pipeline(df, cols_to_exclude)[source]¶

Assemble postprocessing pipeline to reconstruct original categorical indexed values from OneHotDummy Columns

Parameters

df – DataFrame Including the original string Columns
cols_to_exclude – list of columns to exclude

Returns

(reconstructers, String_Columns)

reconstructers: list of IndReconstructer stages
String_Columns: list of columns that are being reconstructed

reconstructPCA(sql, df, pc, mean, std, originalColumns, fits, pcaColumn='pcaFeatures')[source]¶

Reconstruct data from lower dimensional space after performing PCA

Parameters

sql – SQLContext
df – PySpark DataFrame: inputted PySpark DataFrame
pc – numpy.ndarray: principal components projected onto
mean – numpy.ndarray: mean of original columns
std – numpy.ndarray: standard deviation of original columns
originalColumns – list: original column names
fits – fits of features returned from best_fit_distribution
pcaColumn – column in df that contains PCA features, defaults to ‘pcaFeatures’

Returns

dataframe containing reconstructed data

varianceExplained(df, k=10)[source]¶

Returns the proportion of variance explained by k principal components. Calls the above PCA procedure.

Parameters

df – PySpark DataFrame
k – number of principal components, defaults to 10

Returns

(proportion, principal_components, scores, eigenvalues)
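
The proportion itself is conventionally the sum of the k largest eigenvalues divided by the sum of all eigenvalues. A minimal sketch of that calculation, assuming the eigenvalues are already available as a plain list (variance_explained is a hypothetical helper, not the library function):

```python
def variance_explained(eigenvalues, k):
    """Fraction of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


eigenvalues = [4.0, 2.0, 1.0, 1.0]
proportion = variance_explained(eigenvalues, k=2)
print(proportion)  # 0.75
```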

vector_assembler_pipeline(df, columns, doPCA=False, k=10)[source]¶

After preprocessing string columns, this function can be used to assemble a feature vector to be used for learning. It creates the following stages: VectorAssembler -> StandardScaler [-> PCA]

Parameters

df – DataFrame containing preprocessed Columns
columns – list of Column names of the preprocessed columns
doPCA – whether to add a PCA stage after the vector assembler and scaler, defaults to False
k – Number of Principal Components to use, defaults to 10

Returns

List of vector assembling stages
