splicemachine.stats module¶
This module contains statistical functions to help with Machine Learning and data analysis.
-
class
DecisionTreeVisualizer[source]¶ Visualize a decision tree, either in a code-like format or with graphviz
-
static
add_node(dot, parent, node_hash, root, realroot=False)[source]¶ Traverse through the .debugString json and generate a graphviz tree
- Parameters
dot – dot file object
parent – not used currently
node_hash – unique node id
root – the root of tree
realroot – whether or not it is the real root, or a recursive root
- Returns
-
static
feature_importance(spark, model, dataset, featuresCol='features')[source]¶ Return a dataframe containing the relative importance of each feature
- Parameters
model – The Spark Machine Learning model
dataset – Spark DataFrame
featuresCol – (str) the column containing the feature vector
- Returns
dataframe containing importance
-
static
replacer(string, bad, good)[source]¶ Replace every string in “bad” with the corresponding string in “good”
- Parameters
string – string to replace in
bad – array of strings to replace
good – array of strings to replace with
- Returns
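The pairwise replacement described for replacer can be sketched in plain Python as follows. This is an illustration only, not the library's implementation: the name replace_all and the length check are assumptions, and replacements are applied one after another, so their order can matter.

```python
def replace_all(string, bad, good):
    """Replace each substring in `bad` with the string at the same position in `good`."""
    if len(bad) != len(good):
        raise ValueError("bad and good must have the same length")
    # Apply the replacements sequentially, in list order.
    for old, new in zip(bad, good):
        string = string.replace(old, new)
    return string


cleaned = replace_all("If (feature 0 <= 1.5)", ["feature 0", "If"], ["age", "if"])
print(cleaned)  # if (age <= 1.5)
```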
-
static
tree_json(tree)[source]¶ Generate a JSON representation of a decision tree
- Parameters
tree – tree debug string
- Returns
json
-
static
visualize(model, feature_column_names, label_names, size=None, horizontal=False, tree_name='tree', visual=False)[source]¶ Visualize a decision tree, either in a code-like format or with graphviz
- Parameters
model – the fitted decision tree classifier
feature_column_names – (List[str]) column names for features. You can access these feature names by calling the .getInputCols() function of your VectorAssembler (in PySpark)
label_names – (List[str]) labels vector (below avg, above avg)
size – tuple(int,int) The size of the graph. If unspecified, graphviz will automatically assign a size
horizontal – (Bool) if the tree should be rendered horizontally
tree_name – the name you would like to call the tree
visual – (bool) True if you want a graphviz PDF containing your tree
- Returns
dot – the graphviz object
-
class
IndReconstructer(inputCol=None, outputCol=None)[source]¶ Transformer to reconstruct the string index from OneHotDummy columns. This can be used as part of a Pipeline object
Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers
- Parameters
Transformer – Inherited Class
HasInputCol – Inherited Class
HasOutputCol – Inherited Class
- Returns
Transformed PySpark Dataframe With Original String Indexed Variables
-
class
OneHotDummies(inputCol=None, outputCol=None)[source]¶ Transformer to generate dummy columns for categorical variables as a part of a preprocessing pipeline
Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers
- Parameters
Transformer – Inherited Classes
HasInputCol – Inherited Classes
HasOutputCol – Inherited Classes
- Returns
pyspark DataFrame
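The relationship between dummy columns and the category index that IndReconstructer recovers can be illustrated on a single row in plain Python. This is a conceptual sketch only: the helper names to_dummies and from_dummies are hypothetical, and it assumes one dummy value per category, which may differ from how the transformers name or lay out their columns.

```python
def to_dummies(index, num_categories):
    """Expand a category index into one 0/1 dummy value per category."""
    return [1 if i == index else 0 for i in range(num_categories)]


def from_dummies(dummies):
    """Recover the category index from a row of 0/1 dummy values."""
    if sum(dummies) != 1:
        raise ValueError("expected exactly one active dummy column")
    return dummies.index(1)


row = to_dummies(2, num_categories=4)
print(row)                # [0, 0, 1, 0]
print(from_dummies(row))  # 2
```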
-
class
OverSampleCrossValidator(estimator, estimatorParamMaps, evaluator, numFolds=3, seed=None, parallelism=3, collectSubModels=False, labelCol='label', altEvaluators=None, overSample=True)[source]¶ Class to perform Cross Validation model evaluation while over-sampling minority labels.
- Example
>>> from pyspark.sql.session import SparkSession
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.tuning import ParamGridBuilder
>>> from splicemachine.stats.stats import OverSampleCrossValidator
>>> spark = SparkSession.builder.getOrCreate()
>>> dataset = spark.createDataFrame(
...     [(Vectors.dense([0.0]), 0.0),
...      (Vectors.dense([0.5]), 0.0),
...      (Vectors.dense([0.4]), 1.0),
...      (Vectors.dense([0.6]), 1.0),
...      (Vectors.dense([1.0]), 1.0)] * 10,
...     ["features", "label"])
>>> lr = LogisticRegression()
>>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>>> PRevaluator = BinaryClassificationEvaluator(metricName='areaUnderPR')
>>> AUCevaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')
>>> ACCevaluator = MulticlassClassificationEvaluator(metricName='accuracy')
>>> cv = OverSampleCrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=AUCevaluator,
...                               altEvaluators=[PRevaluator, ACCevaluator], parallelism=2, seed=1234)
>>> cvModel = cv.fit(dataset)
>>> print(cvModel.avgMetrics)
[(0.5, [0.5888888888888888, 0.3888888888888889]), (0.806878306878307, [0.8556863149300125, 0.7055555555555556])]
>>> print(AUCevaluator.evaluate(cvModel.transform(dataset)))
0.8333333333333333
-
class
OverSampler(labelCol=None, strategy='auto', randomState=None)[source]¶ Transformer to oversample datapoints with minority labels
Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers
- Parameters
Transformer – Inherited Class
HasInputCol – Inherited Class
HasOutputCol – Inherited Class
- Returns
PySpark Dataframe with labels in approximately equal ratios
- Example
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.session import SparkSession
>>> from pyspark.ml.linalg import Vectors
>>> from splicemachine.stats.stats import OverSampler
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     [(Vectors.dense([0.0]), 0.0),
...      (Vectors.dense([0.5]), 0.0),
...      (Vectors.dense([0.4]), 1.0),
...      (Vectors.dense([0.6]), 1.0),
...      (Vectors.dense([1.0]), 1.0)] * 10,
...     ["features", "Class"])
>>> df.groupBy(F.col("Class")).count().orderBy("count").show()
+-----+-----+
|Class|count|
+-----+-----+
|  0.0|   20|
|  1.0|   30|
+-----+-----+
>>> oversampler = OverSampler(labelCol="Class", strategy="auto")
>>> oversampler.transform(df).groupBy("Class").count().show()
+-----+-----+
|Class|count|
+-----+-----+
|  0.0|   29|
|  1.0|   30|
+-----+-----+
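The idea behind oversampling a minority label can be sketched without Spark. The function below is a hypothetical, dependency-free illustration, not the OverSampler implementation: it draws extra minority rows with replacement until every label matches the largest group exactly, whereas the transformer's output in the example above is only approximately balanced (29 vs 30).

```python
import random
from collections import Counter


def oversample(rows, label_of, seed=None):
    """Return rows plus resampled minority rows so all labels have equal counts."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra rows with replacement until the group reaches the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rows = [(0.0, 0.0), (0.5, 0.0), (0.4, 1.0), (0.6, 1.0), (1.0, 1.0)] * 10
balanced = oversample(rows, label_of=lambda r: r[1], seed=1234)
counts = Counter(label for _, label in balanced)
print(counts[0.0], counts[1.0])  # 30 30
```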
-
class
Rounder(predictionCol='prediction', labelCol='label', clipPreds=True, maxLabel=None, minLabel=None)[source]¶ Transformer to round predictions for ordinal regression
Follows: https://spark.apache.org/docs/latest/ml-pipeline.html#transformers
- Parameters
Transformer – Inherited Class
HasInputCol – Inherited Class
HasOutputCol – Inherited Class
- Returns
Transformed Dataframe with rounded predictionCol
- Example
>>> from pyspark.sql.session import SparkSession
>>> from splicemachine.stats.stats import Rounder
>>> spark = SparkSession.builder.getOrCreate()
>>> dataset = spark.createDataFrame(
...     [(0.2, 0.0),
...      (1.2, 1.0),
...      (1.6, 2.0),
...      (1.1, 0.0),
...      (3.1, 0.0)],
...     ["prediction", "label"])
>>> dataset.show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.2|  0.0|
|       1.2|  1.0|
|       1.6|  2.0|
|       1.1|  0.0|
|       3.1|  0.0|
+----------+-----+
>>> rounder = Rounder(predictionCol="prediction", labelCol="label", clipPreds=True)
>>> rounder.transform(dataset).show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       1.0|  1.0|
|       2.0|  2.0|
|       1.0|  0.0|
|       2.0|  0.0|
+----------+-----+
>>> rounderNoClip = Rounder(predictionCol="prediction", labelCol="label", clipPreds=False)
>>> rounderNoClip.transform(dataset).show()
+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       1.0|  1.0|
|       2.0|  2.0|
|       1.0|  0.0|
|       3.0|  0.0|
+----------+-----+
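The rounding and clipping shown in the example above can be reproduced on plain lists. This is a sketch under assumptions, not the transformer's code: the function name round_predictions is hypothetical, ties are rounded half up, and when no bounds are given the clip range is taken from the smallest and largest observed labels.

```python
import math


def round_predictions(predictions, labels, clip=True, min_label=None, max_label=None):
    """Round each prediction to the nearest integer and optionally clip to the label range."""
    lo = min(labels) if min_label is None else min_label
    hi = max(labels) if max_label is None else max_label
    rounded = []
    for p in predictions:
        r = float(math.floor(p + 0.5))  # round half up (assumed tie-breaking rule)
        if clip:
            r = min(max(r, lo), hi)
        rounded.append(r)
    return rounded


predictions = [0.2, 1.2, 1.6, 1.1, 3.1]
labels = [0.0, 1.0, 2.0, 0.0, 0.0]
clipped = round_predictions(predictions, labels, clip=True)
unclipped = round_predictions(predictions, labels, clip=False)
print(clipped)    # [0.0, 1.0, 2.0, 1.0, 2.0]
print(unclipped)  # [0.0, 1.0, 2.0, 1.0, 3.0]
```

The last prediction (3.1) rounds to 3.0, which exceeds the largest label (2.0), so only the clipped variant maps it to 2.0.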
-
class
SpliceBaseEvaluator(spark, evaluator, supported_metrics, predictionCol='prediction', labelCol='label')[source]¶ Base ModelEvaluator
-
class
SpliceBinaryClassificationEvaluator(spark, predictionCol='prediction', labelCol='label', confusion_matrix=True)[source]¶ A Splice Machine evaluator for Spark Binary Classification models. Implements functions from SpliceBaseEvaluator.
-
class
SpliceMultiClassificationEvaluator(spark, predictionCol='prediction', labelCol='label')[source]¶ A Splice Machine evaluator for Spark MultiClass models. Implements functions from SpliceBaseEvaluator.
-
class
SpliceRegressionEvaluator(spark, predictionCol='prediction', labelCol='label')[source]¶ A Splice Machine evaluator for Spark Regression models. Implements functions from SpliceBaseEvaluator.
-
best_fit_distribution(data, col_name, bins, ax)[source]¶ Model data by finding best fit distribution to data
- Parameters
data – DataFrame with one column containing the feature whose distribution is to be investigated
col_name – column name for feature
bins – number of bins to use in generating the histogram of this data
ax – axis to plot histogram on
- Returns
(best_distribution.name, best_params, best_sse)
best_distribution.name: string of the best distribution name
best_params: parameters for this distribution
best_sse: sum of squared errors for this distribution against the empirical pdf
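The selection criterion described above, the sum of squared errors between a candidate density and the empirical histogram, can be sketched with the standard library alone. Everything here is illustrative: the helper names, the two candidates (a fitted normal and a uniform over the data range), and the synthetic data are assumptions, and the sketch does not reproduce which distributions or fitting routines best_fit_distribution actually uses.

```python
import random
from statistics import NormalDist, mean, stdev


def histogram_density(values, bins):
    """Return bin centers and a histogram normalized so that its area is 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # put the maximum into the last bin
        counts[i] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(values) * width) for c in counts]
    return centers, density


def sse(density, pdf_values):
    """Sum of squared errors between the empirical density and a candidate pdf."""
    return sum((d - p) ** 2 for d, p in zip(density, pdf_values))


rng = random.Random(0)
values = [rng.gauss(10.0, 2.0) for _ in range(5000)]  # synthetic sample
centers, density = histogram_density(values, bins=30)

normal = NormalDist(mean(values), stdev(values))
lo, hi = min(values), max(values)
candidates = {
    "norm": [normal.pdf(x) for x in centers],
    "uniform": [1.0 / (hi - lo)] * len(centers),
}
errors = {name: sse(density, pdf) for name, pdf in candidates.items()}
best_name = min(errors, key=errors.get)
print(best_name)  # norm
```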
-
estimateCovariance(df, features_col='features')[source]¶ Compute the covariance matrix for a given dataframe.
- Parameters
df – PySpark dataframe
features_col – name of the column with the features, defaults to ‘features’
- Returns
np.ndarray: A multi-dimensional array where the number of rows and columns both equal the length of the arrays in the input dataframe.
- Note
The multi-dimensional covariance array should be calculated using outer products. Don’t forget to normalize the data by first subtracting the mean.
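The note above (subtract the mean, then accumulate outer products) can be sketched on a small in-memory dataset. This is a dependency-free illustration rather than the Spark implementation: it returns nested lists instead of an np.ndarray, the name estimate_covariance is hypothetical, and dividing by n (rather than n - 1) is an assumption.

```python
def estimate_covariance(rows):
    """Covariance matrix as the mean of outer products of mean-centered rows."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        centered = [x - m for x, m in zip(row, means)]
        # Add the outer product of the centered row with itself.
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j]
    # Normalize by the number of rows (assumed divisor: n).
    return [[value / n for value in cov_row] for cov_row in cov]


rows = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
cov = estimate_covariance(rows)
print(cov)
```

The result has as many rows and columns as each input vector has entries, matching the shape described in the return value.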
-
get_confusion_matrix(spark, TP, TN, FP, FN)[source]¶ Creates and returns a confusion matrix
- Parameters
TP – True Positives
TN – True Negatives
FP – False Positives
FN – False Negatives
- Returns
Spark DataFrame
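For orientation, the four counts can be arranged into a 2x2 table and turned into the usual summary metrics. The layout below (actual classes as rows, predicted classes as columns) and the helper names are assumptions made for illustration; the documentation does not specify the column layout of the returned Spark DataFrame.

```python
def confusion_matrix_rows(TP, TN, FP, FN):
    """Arrange the counts as rows of (actual class, predicted positive, predicted negative)."""
    return [
        ("actual positive", TP, FN),
        ("actual negative", FP, TN),
    ]


def summary_metrics(TP, TN, FP, FN):
    """Derive accuracy, precision and recall; a ratio with a zero denominator is reported as 0.0."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(TP + TN, TP + TN + FP + FN),
        "precision": ratio(TP, TP + FP),
        "recall": ratio(TP, TP + FN),
    }


matrix = confusion_matrix_rows(TP=40, TN=45, FP=5, FN=10)
metrics = summary_metrics(TP=40, TN=45, FP=5, FN=10)
for row in matrix:
    print(row)
print(metrics)
```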
-
get_string_pipeline(df, cols_to_exclude, steps=['StringIndexer', 'OneHotEncoder', 'OneHotDummies'])[source]¶ Generates a list of preprocessing stages
- Parameters
df – DataFrame including only the training data
cols_to_exclude – column names we don’t want to include in the preprocessing (e.g. SUBJECT/target column)
steps – preprocessing steps to take
- Returns
(stages, Numeric_Columns)
stages: list of pipeline stages to be used in preprocessing
Numeric_Columns: list of columns that contain numeric features
-
inspectTable(spliceMLCtx, sql, topN=5)[source]¶ Inspect the values of the columns of the table (dataframe) returned from the sql query
- Parameters
spliceMLCtx – SpliceMLContext
sql – sql string to execute
topN – the number of most frequent elements of a column to return, defaults to 5
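What "the topN most frequent elements of a column" means can be shown on rows held in memory. This sketch is an assumption about the idea only: the function name top_n_per_column and the dictionary result are hypothetical, and it says nothing about how inspectTable queries the database or presents its results.

```python
from collections import Counter


def top_n_per_column(rows, columns, top_n=5):
    """Map each column name to its top_n most frequent values and their counts."""
    result = {}
    for position, name in enumerate(columns):
        counts = Counter(row[position] for row in rows)
        result[name] = counts.most_common(top_n)
    return result


rows = [("NY", 1), ("NY", 2), ("CA", 1), ("TX", 1), ("NY", 1)]
summary = top_n_per_column(rows, ["state", "visits"], top_n=2)
for column, values in summary.items():
    print(column, values)
```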
-
make_pdf(dist, params, size=10000)[source]¶ Generate a distribution’s probability density function
- Parameters
dist – scipy.stats distribution object: https://docs.scipy.org/doc/scipy/reference/stats.html
params – distribution parameters
size – how many data points to generate, defaults to 10000
- Returns
series of probability density function values for this distribution
-
pca_with_scores(df, k=10)[source]¶ Computes the top k principal components, corresponding scores, and all eigenvalues.
- Parameters
df – A Spark dataframe with a ‘features’ column, which (column) consists of DenseVectors.
k – The number of principal components to return, defaults to 10
- Returns
(eigenvectors, RDD of scores, eigenvalues)
Eigenvectors: multi-dimensional array where the number of rows equals the length of the arrays in the input RDD and the number of columns equals k.
RDD of scores: has the same number of rows as data and consists of arrays of length k.
Eigenvalues is an array of length d (the number of features).
- Note
All eigenvalues should be returned in sorted order (largest to smallest). eigh returns each eigenvector as a column. This function should also return eigenvectors as columns.
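The ordering and column conventions in the note above can be illustrated with NumPy on a small local array. This is a sketch, not the Spark implementation: the function name pca_with_scores_local is hypothetical, the covariance is normalized by n, and projecting the mean-centered data to obtain the scores is an assumption.

```python
import numpy as np


def pca_with_scores_local(data, k=2):
    """Return (top-k eigenvectors as columns, scores, all eigenvalues sorted descending)."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    # eigh returns eigenvalues in ascending order and eigenvectors as columns.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # indices from largest to smallest
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order[:k]]
    scores = centered @ components  # one row of length k per input row
    return components, scores, eigenvalues


data = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0],
    [4.0, 1.0, 0.0],
    [1.0, 3.0, 2.0],
    [3.0, 2.0, 5.0],
])
components, scores, eigenvalues = pca_with_scores_local(data, k=2)
print(components.shape, scores.shape, eigenvalues.shape)  # (3, 2) (5, 2) (3,)
```

The sign of each eigenvector is arbitrary, so components and scores may differ by a factor of -1 per column between runs or libraries.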
-
postprocessing_pipeline(df, cols_to_exclude)[source]¶ Assemble postprocessing pipeline to reconstruct original categorical indexed values from OneHotDummy Columns
- Parameters
df – DataFrame Including the original string Columns
cols_to_exclude – list of columns to exclude
- Returns
(reconstructers, String_Columns)
reconstructers: list of IndReconstructer stages
String_Columns: list of columns that are being reconstructed
-
reconstructPCA(sql, df, pc, mean, std, originalColumns, fits, pcaColumn='pcaFeatures')[source]¶ Reconstruct data from lower dimensional space after performing PCA
- Parameters
sql – SQLContext
df – PySpark DataFrame: inputted PySpark DataFrame
pc – numpy.ndarray: principal components projected onto
mean – numpy.ndarray: mean of original columns
std – numpy.ndarray: standard deviation of original columns
originalColumns – list: original column names
fits – fits of features returned from best_fit_distribution
pcaColumn – column in df that contains PCA features, defaults to ‘pcaFeatures’
- Returns
dataframe containing reconstructed data
-
varianceExplained(df, k=10)[source]¶ Returns the proportion of variance explained by k principal components. Calls the above PCA procedure
- Parameters
df – PySpark DataFrame
k – number of principal components, defaults to 10
- Returns
(proportion, principal_components, scores, eigenvalues)
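The proportion in the return value is the share of total variance carried by the k largest eigenvalues. A minimal sketch of that arithmetic follows; the function name variance_explained and the sample eigenvalues are made up for illustration.

```python
def variance_explained(eigenvalues, k):
    """Fraction of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


proportion = variance_explained([4.0, 2.5, 1.5, 1.0, 1.0], k=2)
print(proportion)  # 0.65
```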
-
vector_assembler_pipeline(df, columns, doPCA=False, k=10)[source]¶ After preprocessing string columns, this function can be used to assemble a feature vector for learning. It creates the following stages: VectorAssembler -> StandardScaler [-> PCA]
- Parameters
df – DataFrame containing preprocessed Columns
columns – list of Column names of the preprocessed columns
doPCA – whether to include PCA as part of the vector assembling stages, defaults to False
k – Number of Principal Components to use, defaults to 10
- Returns
List of vector assembling stages