splicemachine.features package¶

Submodules¶

splicemachine.features.feature_store module¶

This module contains the classes and APIs for interacting with the Splice Machine Feature Store.

class FeatureStore(splice_ctx: Optional[splicemachine.spark.context.PySpliceContext] = None)[source]¶

Bases: object

alter_feature_set(schema_name: str, table_name: str, primary_keys: Optional[Dict[str, str]] = None, desc: Optional[str] = None, version: Optional[Union[str, int]] = None) → splicemachine.features.feature_set.FeatureSet [source]¶

Alters the specified (or default latest) version of a feature set, if that version is not yet deployed. Use this method when you want to make changes to an undeployed version of a feature set, or when you want to change version-independent metadata, such as description.

Parameters

schema_name – The schema under which to create the feature set table
table_name – The table name for this feature set
primary_keys – The primary key column(s) of this feature set
desc – The (optional) description
version – The version you wish to alter (number or ‘latest’). If None, will default to the latest undeployed version

Returns

FeatureSet

alter_training_view(name: str, sql: Optional[str] = None, primary_keys: Optional[List[str]] = None, join_keys: Optional[List[str]] = None, ts_col: Optional[str] = None, label_col: Optional[str] = None, desc: Optional[str] = None, version: Optional[Union[str, int]] = None) → None[source]¶

Alters an existing version of a training view. Use this method when you want to make changes to a version of a training view that has no dependencies, or when you want to change version-independent metadata, such as description.

Parameters

name – The training set name. This must be unique to other existing training sets unless replace is True
sql –
(str) a SELECT statement that includes:
- the primary key column(s) - uniquely identifying a training row/case
- the inference timestamp column - timestamp column with which to join features (temporal join timestamp)
- join key(s) - the references to the other feature tables’ primary keys (e.g. customer_id, location_id)
- (optionally) the label expression - defining what the training set is trying to predict
primary_keys – (List[str]) The list of columns from the training SQL that identify the training row
ts_col – The timestamp column of the training SQL that identifies the inference timestamp
label_col – (Optional[str]) The optional label column from the training SQL.
join_keys – (List[str]) A list of join keys in the sql that are used to get the desired features in get_training_set
desc – (Optional[str]) An optional description of the training set
version – The version you wish to alter (number or ‘latest’). If None, will default to the latest version

Returns

None

create_aggregation_feature_set_from_source(source_name: str, schema_name: str, table_name: str, start_time: datetime.datetime, schedule_interval: str, aggregations: List[splicemachine.features.pipelines.feature_aggregation.FeatureAggregation], backfill_start_time: Optional[datetime.datetime] = None, backfill_interval: Optional[str] = None, description: Optional[str] = None, run_backfill: Optional[bool] = True)[source]¶

Creates a temporal aggregation feature set by creating a pipeline linking a Source to a feature set. Sources are created with features.FeatureStore.create_source(). The provided aggregations generate the features for the feature set, so this call creates both the feature set and the aggregation calculations that populate its features.

Parameters

source_name – The name of the source created via create_source
schema_name – The schema name of the feature set
table_name – The table name of the feature set
start_time – The start time for the pipeline to run
schedule_interval – The frequency with which to run the pipeline.
aggregations – The list of FeatureAggregations to apply to the column names of the source SQL statement
backfill_start_time – The datetime representing the earliest point in time to get data from when running backfill
backfill_interval – The “sliding window” interval to increase each timepoint by when performing backfill
run_backfill – Whether or not to run backfill when calling this function. Default True. If this is True, backfill_start_time and backfill_interval must be set

Returns

(FeatureSet) the created Feature Set
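The run_backfill constraint described in the parameters can be checked before the call is made. The following is a minimal sketch, not part of splicemachine: check_backfill_args is a hypothetical helper that encodes only the rule stated above, and the '5d' interval string used in the usage note is a placeholder.

```python
from datetime import datetime
from typing import Optional


def check_backfill_args(run_backfill: bool,
                        backfill_start_time: Optional[datetime],
                        backfill_interval: Optional[str]) -> None:
    """Hypothetical pre-flight check: raise ValueError if backfill is requested without its settings."""
    if not run_backfill:
        return  # nothing to validate when backfill is disabled
    missing = [name for name, value in (
        ("backfill_start_time", backfill_start_time),
        ("backfill_interval", backfill_interval),
    ) if value is None]
    if missing:
        raise ValueError("run_backfill=True requires: " + ", ".join(missing))
```

Calling check_backfill_args(True, None, '5d') raises a ValueError naming backfill_start_time, while check_backfill_args(False, None, None) returns silently.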

Example

from splicemachine.features.pipelines import AggWindow, FeatureAgg, FeatureAggregation
from datetime import datetime
source_name = 'CUSTOMER_RFM'
fs.create_source(
    name=source_name,
    sql='SELECT * FROM RETAIL_RFM.CUSTOMER_CATEGORY_ACTIVITY',
    event_ts_column='INVOICEDATE',
    update_ts_column='LAST_UPDATE_TS',
    primary_keys=['CUSTOMERID']
)
start_time = datetime.today()
schedule_interval = AggWindow.get_window(5, AggWindow.DAY)
backfill_start = datetime.strptime('2002-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
backfill_interval = schedule_interval
fs.create_aggregation_feature_set_from_source(
    source_name, 'RETAIL_FS', 'AUTO_RFM', start_time=start_time,
    schedule_interval=schedule_interval, backfill_start_time=backfill_start,
    backfill_interval=backfill_interval,
    aggregations = [
        FeatureAggregation(feature_name_prefix = 'AR_CLOTHING_QTY',     column_name = 'CLOTHING_QTY',     agg_functions=['sum','max'],   agg_windows=['1d','2d','90d'], agg_default_value = 0.0 ),
        FeatureAggregation(feature_name_prefix = 'AR_DELICATESSEN_QTY', column_name = 'DELICATESSEN_QTY', agg_functions=['avg'],         agg_windows=['1d','2d', '2w'], agg_default_value = 11.5 ),
        FeatureAggregation(feature_name_prefix = 'AR_GARDEN_QTY' ,      column_name = 'GARDEN_QTY',       agg_functions=['count','avg'], agg_windows=['30d','90d', '1q'], agg_default_value = 8 )
    ]
)

This will create, deploy and return a FeatureSet called ‘RETAIL_FS.AUTO_RFM’. The Feature Set will have 15 features:

6 for the AR_CLOTHING_QTY prefix (sum & max over provided agg windows)
3 for the AR_DELICATESSEN_QTY prefix (avg over provided agg windows)
6 for the AR_GARDEN_QTY prefix (count & avg over provided agg windows)

A Pipeline is also created and scheduled in Airflow that feeds the feature set every 5 days from the Source CUSTOMER_RFM. Backfill will also occur, reading data from the source as of ‘2002-01-01 00:00:00’ with a 5-day window.
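The feature count in the example above is the number of aggregation functions multiplied by the number of windows, summed over the prefixes. The sketch below reproduces that arithmetic in plain Python. The generated names are an assumed, illustrative scheme; the names the Feature Store actually produces may differ.

```python
from itertools import product

# Aggregation specs copied from the example above: (prefix, functions, windows).
aggregations = [
    ("AR_CLOTHING_QTY", ["sum", "max"], ["1d", "2d", "90d"]),
    ("AR_DELICATESSEN_QTY", ["avg"], ["1d", "2d", "2w"]),
    ("AR_GARDEN_QTY", ["count", "avg"], ["30d", "90d", "1q"]),
]


def expand(prefix, functions, windows):
    # Hypothetical naming scheme, used here only to enumerate the combinations.
    return [f"{prefix}_{fn}_{window}".upper() for fn, window in product(functions, windows)]


feature_names = [name for spec in aggregations for name in expand(*spec)]
counts = {prefix: len(expand(prefix, fns, wins)) for prefix, fns, wins in aggregations}

print(counts)              # 6, 3 and 6 features per prefix
print(len(feature_names))  # 15
```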

create_feature(schema_name: str, table_name: str, name: str, feature_data_type: str, feature_type: str, desc: Optional[str] = None, tags: Optional[List[str]] = None, attributes: Optional[Dict[str, str]] = None)[source]¶

Add a feature to a feature set

Parameters

schema_name – The feature set schema
table_name – The feature set table name to add the feature to
name – The feature name
feature_data_type – The datatype of the feature. Must be a valid SQL datatype
feature_type –
splicemachine.features.FeatureType of the feature. The available types are from the FeatureType class: FeatureType.[categorical, ordinal, continuous]. You can see available feature types by running
```
from splicemachine.features import FeatureType
print(FeatureType.get_valid())
```
desc – The (optional) feature description (default None)
tags – (optional) List of (str) tag words (default None)
attributes – (optional) Dict of (str) attribute key/value pairs (default None)

Returns

Feature created

create_feature_set(schema_name: str, table_name: str, primary_keys: Dict[str, str], desc: Optional[str] = None, features: Optional[List[splicemachine.features.feature.Feature]] = None) → splicemachine.features.feature_set.FeatureSet [source]¶

Creates and returns a new feature set

Parameters

schema_name – The schema under which to create the feature set table
table_name – The table name for this feature set
primary_keys – The primary key column(s) of this feature set
desc – The (optional) description
features – An optional list of features. If provided, the Features will be created with the Feature Set

Example

from splicemachine.features import FeatureType, Feature
f1 = Feature(
    name='my_first_feature',
    description='the first feature',
    feature_data_type='INT',
    feature_type=FeatureType.ordinal,
    tags=['good_feature','a new tag', 'ordinal'],
    attributes={'quality':'awesome'}
)
f2 = Feature(
    name='my_second_feature',
    description='the second feature',
    feature_data_type='FLOAT',
    feature_type=FeatureType.continuous,
    tags=['not_as_good_feature','a new tag'],
    attributes={'quality':'not as awesome'}
)
feats = [f1, f2]
feature_set = fs.create_feature_set(
    schema_name='splice',
    table_name='foo',
    primary_keys={'MOMENT_KEY':"INT"},
    desc='test fset',
    features=feats
)

Returns

FeatureSet

create_source(name: str, sql: str, event_ts_column: datetime.datetime, update_ts_column: datetime.datetime, primary_keys: List[str])[source]¶

Creates, validates, and stores a source in the Feature Store that can be used to create a Pipeline that feeds a feature set

Example

fs.create_source(
    name='CUSTOMER_RFM',
    sql='SELECT * FROM RETAIL_RFM.CUSTOMER_CATEGORY_ACTIVITY',
    event_ts_column='INVOICEDATE',
    update_ts_column='LAST_UPDATE_TS',
    primary_keys=['CUSTOMERID']
)

Parameters

name – The name of the source. This must be unique across the feature store
sql – the SQL statement that returns the base result set to be used in future aggregation pipelines
event_ts_column – The column of the source query that determines the time of the event (row) being described. This is not necessarily the time the record was recorded, but the time the event itself occurred.
update_ts_column – The column that indicates the time when the record was last updated. When scheduled pipelines run, they will filter on this column to get only the records that have not been queried before.
primary_keys – The list of columns in the source SQL that uniquely identifies each row. These become the primary keys of the feature set(s) that is/are eventually created from this source.

create_training_view(name: str, sql: str, primary_keys: List[str], join_keys: List[str], ts_col: str, label_col: Optional[str] = None, desc: Optional[str] = None) → None[source]¶

Registers a training view for use in generating training SQL

Parameters

name – The training set name. This must be unique to other existing training sets
sql –
(str) a SELECT statement that includes:
- the primary key column(s) - uniquely identifying a training row/case
- the inference timestamp column - timestamp column with which to join features (temporal join timestamp)
- join key(s) - the references to the other feature tables’ primary keys (e.g. customer_id, location_id)
- (optionally) the label expression - defining what the training set is trying to predict
primary_keys – (List[str]) The list of columns from the training SQL that identify the training row
ts_col – The timestamp column of the training SQL that identifies the inference timestamp
label_col – (Optional[str]) The optional label column from the training SQL.
replace – (Optional[bool]) Whether to replace an existing training view
join_keys – (List[str]) A list of join keys in the sql that are used to get the desired features in get_training_set
desc – (Optional[str]) An optional description of the training set
verbose – Whether or not to print the SQL before execution (default False)

Returns

None

deploy_feature_set(schema_name: str, table_name: str, version: Union[str, int] = 'latest', migrate: bool = False)[source]¶

Deploys a feature set to the database. This persists the feature set’s existence. As of now, once deployed you cannot delete the feature set or add/delete features. The feature set must have already been created with create_feature_set().

Parameters

schema_name – The schema of the created feature set
table_name – The table of the created feature set
version – The version of the feature set to deploy
migrate – Whether or not to migrate data from a past version of this feature set

describe_feature_set(schema_name: str, table_name: str) → None[source]¶

Prints out a description of a given feature set, with all features in the feature set and whether the feature set is deployed

Parameters

schema_name – feature set schema name
table_name – feature set table name

Returns

None

describe_feature_sets() → None[source]¶

Prints out a description of all feature sets, with all features in the feature sets and whether each feature set is deployed

Returns: None

describe_training_view(training_view: str, version: Union[int, str] = 'latest') → None[source]¶

Prints out a description of a given training view, the ID, name, description and optional label

Parameters

training_view – The training view name
version – The training view version

Returns

None

describe_training_views() → None[source]¶

Prints out a description of all training views, the ID, name, description and optional label

Returns: None

display_feature_search(pandas_profile=True)[source]¶

Returns an interactive feature search that enables users to search for features and profiles the selected Feature. Two forms of this search exist: one for use inside the managed Splice Machine notebook environment, and one for standard Jupyter. This is because the managed Splice Jupyter environment has extra functionality that would not be present outside of it. The search will be automatically rendered depending on the environment.

Parameters: pandas_profile – Whether to use pandas / spark to profile the feature. If pandas is selected but the dataset is too large, it will fall back to Spark. Default Pandas.

display_model_drift(schema_name: str, table_name: str, time_intervals: int, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None)[source]¶

Displays as many as time_intervals plots showing the distribution of the model prediction within each time period. Time periods are equal periods of time where predictions are present in the model table schema_name.table_name. Model predictions are first filtered to only those occurring after start_time if specified and before end_time if specified.

Parameters

schema_name – schema where the model table resides
table_name – name of the model table
time_intervals – number of time intervals to plot
start_time – if specified, filters to only show predictions occurring after this date/time
end_time – if specified, filters to only show predictions occurring before this date/time
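The splitting into equal time periods can be pictured with a small stand-alone sketch. It assumes that the range runs from the earliest to the latest prediction remaining after the start_time/end_time filters and that empty periods are not plotted. bucket_predictions is a hypothetical helper, not part of the library, and the real boundary handling may differ.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def bucket_predictions(timestamps: List[datetime], time_intervals: int,
                       start_time: Optional[datetime] = None,
                       end_time: Optional[datetime] = None) -> Dict[int, List[datetime]]:
    """Group timestamps into up to `time_intervals` equal-width buckets; empty buckets are omitted."""
    kept = sorted(t for t in timestamps
                  if (start_time is None or t > start_time)
                  and (end_time is None or t < end_time))
    if not kept or time_intervals < 1:
        return {}
    lo, hi = kept[0], kept[-1]
    width = (hi - lo) / time_intervals
    buckets: Dict[int, List[datetime]] = {}
    for t in kept:
        if width == timedelta(0):
            idx = 0  # all predictions share one timestamp
        else:
            # The latest timestamp sits on the upper edge, so clamp it into the last bucket.
            idx = min(int((t - lo) / width), time_intervals - 1)
        buckets.setdefault(idx, []).append(t)
    return buckets
```

Because empty buckets are dropped, the result can contain fewer than time_intervals groups, which matches the "as many as" wording above.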

display_model_feature_drift(schema_name: str, table_name: str)[source]¶

Displays feature by feature comparison between the training set of the deployed model and the input feature values used with the model since deployment.

Parameters

schema_name – name of database schema where model table is deployed
table_name – name of the model table

Returns

None

feature_exists(name: str) → bool[source]¶

Returns whether a feature exists

Parameters: name – The feature name
Returns: bool True if the feature exists, False otherwise

feature_set_exists(schema: str, table: str) → bool[source]¶

Returns whether a feature set exists

Parameters

schema – The feature set schema
table – The feature set table

Returns

bool True if the feature set exists, False otherwise

get_backfill_intervals(schema_name: str, table_name: str) → List[datetime.datetime][source]¶

Gets the backfill intervals necessary for the parameterized backfill SQL obtained from the features.FeatureStore.get_backfill_sql() function. This function will likely not be necessary as you can perform backfill at the time of feature set creation automatically.

Parameters

schema_name – The schema name of the feature set
table_name – The table name of the feature set

Returns

The list of datetimes necessary to parameterize the backfill SQL
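To make the idea of backfill timepoints concrete, the sketch below generates evenly spaced datetimes from a start time using a fixed step, mirroring the backfill_start_time and backfill_interval parameters described earlier. It is an illustration under assumptions: backfill_timepoints is a hypothetical helper, and the list returned by get_backfill_intervals may use different boundaries.

```python
from datetime import datetime, timedelta
from typing import List


def backfill_timepoints(start: datetime, end: datetime, step: timedelta) -> List[datetime]:
    """Hypothetical illustration: evenly spaced timepoints from start up to and including end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    points = []
    current = start
    while current <= end:
        points.append(current)
        current += step
    return points


# Five timepoints: January 1, 6, 11, 16 and 21 of 2002.
points = backfill_timepoints(datetime(2002, 1, 1), datetime(2002, 1, 21), timedelta(days=5))
```

Each timepoint would then be bound, one at a time, to the parameterized statement returned by get_backfill_sql().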

get_backfill_sql(schema_name: str, table_name: str)[source]¶

Returns the necessary parameterized SQL statement to perform backfill on an Aggregate Feature Set. The Feature Set must have been deployed using the features.FeatureStore.create_aggregation_feature_set_from_source() function, meaning there must be a Source and a Pipeline associated with it. This function will likely not be necessary as you can perform backfill at the time of feature set creation automatically.

This SQL will be parameterized and need a timestamp to execute. You can get those timestamps with features.FeatureStore.get_backfill_intervals(), called with the same parameters.

Parameters

schema_name – The schema name of the feature set
table_name – The table name of the feature set

Returns

The parameterized Backfill SQL

get_deployments(schema_name: Optional[str] = None, table_name: Optional[str] = None, training_set: Optional[str] = None, feature: Optional[str] = None, feature_set: Optional[str] = None, version: Optional[Union[str, int]] = None)[source]¶

Returns a list of all (or specified) available deployments

Parameters

schema_name – model schema name
table_name – model table name
training_set – training set name
feature – passing this in will return all deployments that used this feature
feature_set – passing this in will return all deployments that used this feature set
version – the version of the feature set parameter, if used

Returns

List[Deployment] the list of Deployments as dicts

get_feature_description()[source]¶

get_feature_details(name: str) → splicemachine.features.feature.Feature [source]¶

Returns a Feature and its detailed information

Parameters: name – The feature name
Returns: Feature

get_feature_primary_keys(features: List[str]) → Dict[str, List[str]][source]¶

Returns a dictionary mapping each individual feature to its primary key(s). This function is not yet implemented.

Parameters: features – (List[str]) The list of features to get primary keys for
Returns: Dict[str, List[str]] A mapping of {feature name: [pk1, pk2, etc]}

get_feature_sets(feature_set_names: Optional[List[str]] = None) → List[splicemachine.features.feature_set.FeatureSet][source]¶

Returns a list of available feature sets

Parameters: feature_set_names – A list of feature set names in the format ‘{schema_name}.{table_name}’. If None, all FeatureSets are returned
Returns: List[FeatureSet] the list of Feature Sets
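The ‘{schema_name}.{table_name}’ format expected by feature_set_names can be built and checked with two small helpers. Both functions are hypothetical conveniences written for this illustration and are not part of splicemachine.

```python
from typing import Tuple


def qualified_name(schema_name: str, table_name: str) -> str:
    """Build a feature set name in the '{schema_name}.{table_name}' format."""
    return f"{schema_name}.{table_name}"


def split_qualified_name(name: str) -> Tuple[str, str]:
    """Split a qualified name back into (schema_name, table_name), rejecting malformed input."""
    schema_name, sep, table_name = name.partition(".")
    if not sep or not schema_name or not table_name or "." in table_name:
        raise ValueError(f"expected '{{schema_name}}.{{table_name}}', got {name!r}")
    return schema_name, table_name


names = [qualified_name("RETAIL_FS", "AUTO_RFM"), qualified_name("splice", "foo")]
```

A list such as names could then be passed as feature_set_names.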

get_feature_vector(features: List[Union[str, splicemachine.features.feature.Feature]], join_key_values: Dict[str, str], return_primary_keys=True, return_sql=False) → Union[str, pandas.core.frame.DataFrame][source]¶

Gets a feature vector given a list of Features and primary key values for their corresponding Feature Sets

Parameters

features – List of str Feature names or Features
join_key_values – (dict) join key values to get the proper Feature values formatted as {join_key_column_name: join_key_value}
return_primary_keys – Whether to return the Feature Set primary keys in the vector. Default True
return_sql – Whether to return the SQL needed to get the vector or the values themselves. Default False

Returns

Pandas Dataframe or str (SQL statement)

get_feature_vector_sql_from_training_view(training_view: str, features: List[Union[str, splicemachine.features.feature.Feature]]) → str[source]¶

Returns the parameterized feature retrieval SQL used for online model serving.

Parameters

training_view – (str) The name of the registered training view
features –
(List[str]) the list of features from the feature store to be included in the training
NOTE
This function will error if the view SQL is missing a join key required to retrieve the desired features

Returns

(str) the parameterized feature vector SQL

get_features_by_name(names: Optional[List[str]] = None, as_list=False) → Union[List[splicemachine.features.feature.Feature], pandas.core.frame.DataFrame][source]¶

Returns a dataframe or list of features whose names are provided

Parameters

names – The list of feature names
as_list – Whether or not to return a list of features. Default False

Returns

SparkDF or List[Feature] The list of Feature objects or Spark Dataframe of features and their metadata. Note, this is not the Feature values, simply the describing metadata about the features. To create a training dataset with Feature values, see features.FeatureStore.get_training_set() or features.FeatureStore.get_feature_dataset()

get_features_from_feature_set(schema_name: str, table_name: str) → List[splicemachine.features.feature.Feature][source]¶

Returns either a pandas DF of feature details or a List of features for a specified feature set. You can get features from multiple feature sets by concatenating the results of this call. For example, to get features from 2 feature sets, foo.bar1 and foo2.bar4:

features = fs.get_features_from_feature_set('foo','bar1') + fs.get_features_from_feature_set('foo2','bar4')

If you want a list of just the Feature names (i.e. a List[str]) you can simply run:

features = fs.get_features_from_feature_set('foo','bar1') + fs.get_features_from_feature_set('foo2','bar4')
feature_names = [f.name for f in features]

Parameters

schema_name – Feature Set schema name
table_name – Feature Set table name

Returns

List of Features

get_pipeline_sql(schema_name: str, table_name: str)[source]¶

Returns the incremental pipeline SQL that feeds a feature set from a source (thus creating a pipeline). Pipelines are managed for you by default by Splice Machine via Airflow, but if you opt out of using the managed pipelines you can use this function to get the incremental SQL.

This SQL will be parameterized and need a timestamp to execute. You can get those timestamps with features.FeatureStore.get_backfill_intervals(), called with the same parameters.

Parameters

schema_name – The schema name of the feature set
table_name – The table name of the feature set

Returns

The incremental Pipeline SQL

get_summary() → Dict[str, str][source]¶

This function returns a summary of the feature store including:

Number of feature sets
Number of deployed feature sets
Number of features
Number of deployed features
Number of training sets
Number of training views
Number of associated models - this is a count of the MLManager.RUNS table where the splice.model_name tag is set and the splice.feature_store.training_set parameter is set
Number of active (deployed) models (that have used the feature store for training)
Number of pending feature sets - this will require a new table featurestore.pending_feature_set_deployments and it will be a count of that

get_training_set(features: Union[List[splicemachine.features.feature.Feature], List[str]], current_values_only: bool = False, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None, label: Optional[str] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_type: str = 'spark', return_sql: bool = False, save_as: Optional[str] = None) → pyspark.sql.dataframe.DataFrame[source]¶

Gets a set of feature values across feature sets that is not time dependent (i.e. for non-time-series clustering). This feature dataset will be treated and tracked implicitly the same way a training_dataset is tracked from features.FeatureStore.get_training_set(). The dataset’s metadata and features used will be tracked in mlflow automatically (see get_training_set for more details).

NOTE

The way point-in-time correctness is guaranteed here is by choosing one of the Feature Sets as the "anchor" dataset.
This means that the points in time that the query is based off of will be the points in time in which the anchor
Feature Set recorded changes. The anchor Feature Set is the Feature Set that contains the superset of all primary key
columns across all Feature Sets from all Features provided. If more than 1 Feature Set has the superset of
all Feature Sets, the Feature Set with the most primary keys is selected. If more than 1 Feature Set has the same
maximum number of primary keys, the Feature Set is chosen by alphabetical order (schema_name, table_name).
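The anchor selection rules in the note above can be written out as a short function. This is a sketch of the stated rules only, not the library's implementation. choose_anchor and its input shape (a mapping from (schema_name, table_name) to a set of primary key columns) are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Tuple

FeatureSetKey = Tuple[str, str]  # (schema_name, table_name)


def choose_anchor(primary_keys: Dict[FeatureSetKey, FrozenSet[str]]) -> FeatureSetKey:
    """Pick the anchor Feature Set using the tie-breaking rules described in the note."""
    # Every primary key column that appears in any of the Feature Sets.
    all_keys = frozenset().union(*primary_keys.values())
    # Candidates are the Feature Sets whose primary keys cover all of those columns.
    candidates = [fs for fs, pks in primary_keys.items() if pks >= all_keys]
    if not candidates:
        raise ValueError("no Feature Set contains every primary key column")
    # Prefer the most primary keys, then alphabetical order of (schema_name, table_name).
    most = max(len(primary_keys[fs]) for fs in candidates)
    return min(fs for fs in candidates if len(primary_keys[fs]) == most)
```

For example, given RETAIL.CUSTOMER keyed on CUSTOMERID and both RETAIL.CUSTOMER_STORE and RETAIL.BASKET keyed on CUSTOMERID and STOREID, the function selects RETAIL.BASKET, because the two covering sets tie on key count and BASKET sorts first.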

Parameters

features –

List of Features or strings of feature names

NOTE

The Feature Sets which the list of Features come from must have common join keys,
otherwise the function will fail. If there is no common join key, it is recommended to
create a Training View to specify the join conditions.

current_values_only – If you only want the most recent values of the features, set this to true. Otherwise, all history will be returned. Default False
start_time – How far back in history you want Feature values. If not specified (and current_values_only is False), all history will be returned. This parameter only takes effect if current_values_only is False.
end_time – The most recent values for each selected Feature. This will be the cutoff time, such that any Feature values that were updated after this point in time won’t be selected. If not specified (and current_values_only is False), Feature values up to the moment in time you call the function (now) will be retrieved. This parameter only takes effect if current_values_only is False.
label – An optional label to specify for the training set. If specified, the feature set of that feature will be used as the “anchor” feature set, meaning all point in time joins will be made to the timestamps of that feature set. This feature will also be recorded as a “label” feature for this particular training set (but not others in the future, unless this label is again specified).
return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)
return_ts_col – bool Whether or not the returned sql should include the timestamp column
return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe
save_as – Whether or not to save this Training Set (metadata) in the feature store for reproducibility. This enables you to version and persist the metadata for a training set of a specific model development. If you are using the Splice Machine managed MLFlow Service, this will be fully automated and managed for you upon model deployment, however you can still use this parameter to customize the name of the training set (it will default to the run id). If you are NOT using Splice Machine’s mlflow service, this is a useful way to link specific modeling experiments to the exact training sets used. This DOES NOT persist the training set itself, rather the metadata required to reproduce the identical training set.

Returns

Spark DF or SQL statement necessary to generate the Training Set

get_training_set_by_name(name, version: Optional[int] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_sql=False, return_type: str = 'spark')[source]¶

Returns a Spark DF (or SQL) of an EXISTING Training Set (one that was saved with the save_as parameter in get_training_set() or get_training_set_from_view()). This is useful if you’ve deployed a model with a Training Set and

Parameters

name – Training Set name
version – The version of this training set. If not set, it will grab the newest version
return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)
return_ts_col – bool Whether or not the returned sql should include the timestamp column
return_sql – [DEPRECATED] (Optional[bool]) Return the SQL statement (str) instead of the Spark DF. Defaults False
return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe

Returns

Spark DF or SQL

get_training_set_features(training_set: Optional[str] = None)[source]¶

Returns a list of all features from an available Training Set, as well as details about that Training Set

Parameters: training_set – training set name
Returns: TrainingSet as dict

get_training_set_from_deployment(schema_name: str, table_name: str, label: Optional[str] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_type: str = 'spark')[source]¶

Reads Feature Store metadata to rebuild the original training data set used for the given deployed model.

Parameters

schema_name – model schema name
table_name – model table name
label – An optional label to specify for the training set. If specified, the feature set of that feature will be used as the “anchor” feature set, meaning all point in time joins will be made to the timestamps of that feature set. This feature will also be recorded as a “label” feature for this particular training set (but not others in the future, unless this label is again specified).
return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)
return_ts_col – bool Whether or not the returned sql should include the timestamp column
return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe

Returns

SparkDF the Training Frame

get_training_set_from_view(training_view: str, features: Optional[Union[List[splicemachine.features.feature.Feature], List[str]]] = None, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_sql: bool = False, return_type: str = 'spark', save_as: Optional[str] = None) → pyspark.sql.dataframe.DataFrame[source]¶

Returns the training set as a Spark Dataframe from a Training View. When a user calls this function (assuming they have registered the feature store with mlflow using register_feature_store()) the training dataset’s metadata will be tracked in mlflow automatically.

The following will be tracked:

Training View
Selected features
Start time
End time

This tracking will occur in the current run (if there is an active run) or in the next run that is started after calling this function (if no run is currently active).

Parameters

training_view – (str) The name of the registered training view
features –
(List[str] OR List[Feature]) the list of features from the feature store to be included in the training. If a list of strings is passed in it will be converted to a list of Feature. If not provided will return all available features.
NOTE
This function will error if the view SQL is missing a join key required to retrieve the desired features
start_time –
(Optional[datetime]) The start time of the query (how far back in the data to start). Default None
NOTE
If start_time is None, query will start from beginning of history
end_time –
(Optional[datetime]) The end time of the query (the most recent point in the data to include). Default None
NOTE
If end_time is None, query will get most recently available data
return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)
return_ts_col – bool Whether or not the returned sql should include the timestamp column
return_sql – [DEPRECATED] (Optional[bool]) Return the SQL statement (str) instead of the Spark DF. Defaults False
return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe
save_as – Whether or not to save this Training Set (metadata) in the feature store for reproducibility. This enables you to version and persist the metadata for a training set of a specific model development. If you are using the Splice Machine managed MLFlow Service, this will be fully automated and managed for you upon model deployment, however you can still use this parameter to customize the name of the training set (it will default to the run id). If you are NOT using Splice Machine’s mlflow service, this is a useful way to link specific modeling experiments to the exact training sets used. This DOES NOT persist the training set itself, rather the metadata required to reproduce the identical training set.

Returns

Optional[SparkDF, str] The Spark dataframe of the training set or the SQL that is used to generate it (for debugging)
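
The time-window and return-type parameters above can be checked before the call is made. The following is a minimal sketch and not part of the library: build_training_set_kwargs is a hypothetical helper, and the view and feature names are placeholders. It validates the documented return_type values and the ordering of start_time and end_time, then returns a dict of keyword arguments that could be passed to the method documented above.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional

# Return types listed in the parameter documentation above.
VALID_RETURN_TYPES = {"spark", "pandas", "json", "sql"}


def build_training_set_kwargs(
    training_view: str,
    features: Optional[List[str]] = None,
    start_time: Optional[datetime] = None,
    end_time: Optional[datetime] = None,
    return_type: str = "spark",
) -> Dict[str, Any]:
    """Hypothetical helper: validate and collect keyword arguments."""
    if return_type not in VALID_RETURN_TYPES:
        raise ValueError(f"return_type must be one of {sorted(VALID_RETURN_TYPES)}")
    # Both bounds are optional; only compare them when both are given.
    if start_time is not None and end_time is not None and start_time > end_time:
        raise ValueError("start_time must not be later than end_time")
    return {
        "training_view": training_view,
        "features": features,
        "start_time": start_time,
        "end_time": end_time,
        "return_type": return_type,
    }


kwargs = build_training_set_kwargs(
    "my_view",  # placeholder training view name
    features=["my_first_feature"],
    start_time=datetime(2021, 1, 1),
    return_type="sql",  # ask for the generating SQL instead of a dataframe
)
```

Leaving end_time as None matches the documented default, where the query runs through the most recently available data.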

get_training_view(training_view: str, version: Union[int, str] = 'latest') → splicemachine.features.training_view.TrainingView [source]¶

Gets a training view by name

Parameters

training_view – Training view name
version – Training view version

Returns

TrainingView

get_training_view_features(training_view: str, version: Union[int, str] = 'latest') → List[splicemachine.features.feature.Feature][source]¶

Returns the available features for the given training view name

Parameters

training_view – The name of the training view
version – The version of the training view

Returns

A list of available Feature objects

get_training_view_id(name: str) → int[source]¶

Returns the unique view ID from a name

Parameters: name – The training view name
Returns: The training view id

get_training_views(_filter: Optional[Dict[str, Union[int, str]]] = None) → List[splicemachine.features.training_view.TrainingView][source]¶

Returns a list of all available training views with an optional filter

Parameters: _filter – Dictionary containing the filter keyword (label, description, etc.) and the value to filter on. If None, all TrainingViews are returned
Returns: List[TrainingView]
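
The shape of the _filter argument can be illustrated with plain dictionaries. This is a sketch under stated assumptions, not the library's implementation: it assumes each filter value is matched by exact equality, which the text above does not specify, and it represents training views as dicts with made-up names.

```python
from typing import Dict, List, Optional, Union

FilterDict = Dict[str, Union[int, str]]


def filter_views(views: List[dict], _filter: Optional[FilterDict] = None) -> List[dict]:
    """Keep views whose fields equal every value in _filter (assumed semantics)."""
    if _filter is None:
        # No filter: return everything, as documented for _filter=None.
        return list(views)
    return [v for v in views if all(v.get(k) == val for k, val in _filter.items())]


# Hypothetical view metadata, represented as plain dicts for illustration.
views = [
    {"name": "churn_view", "label": "churned", "description": "customer churn"},
    {"name": "fraud_view", "label": "is_fraud", "description": "card fraud"},
]
matches = filter_views(views, {"label": "churned"})
```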

link_training_set_to_mlflow(features: Union[List[splicemachine.features.feature.Feature], List[str]], create_time: datetime.datetime, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None, tvw: Optional[splicemachine.features.training_view.TrainingView] = None, current_values_only: bool = False, training_set_id: Optional[int] = None, training_set_version: Optional[int] = None, training_set_name: Optional[str] = None)[source]¶

list_training_sets() → Dict[str, Optional[str]][source]¶

Returns a dictionary of the available training sets, mapping name -> description. If there is no description, the value will be an empty string

Returns: Dict[str, Optional[str]]

login_fs(username, password)[source]¶

Logs in to the Feature Store using basic auth. The username and password correspond to your Splice Machine database user and password. If you are running outside of the managed Splice Machine Cloud Service, you must authenticate before calling any feature store functions: call either this function or set_token, or set the SPLICE_JUPYTER_USER and SPLICE_JUPYTER_PASSWORD environment variables before creating your FeatureStore object.

Parameters

username – Username
password – Password
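
As a rough illustration of the environment-variable route described above, the two variables can be set from Python before the FeatureStore object is created. set_basic_auth_env is a hypothetical helper, and the credentials are placeholders.

```python
import os


def set_basic_auth_env(username: str, password: str) -> None:
    """Hypothetical helper: export the two variables named in the text above."""
    os.environ["SPLICE_JUPYTER_USER"] = username
    os.environ["SPLICE_JUPYTER_PASSWORD"] = password


# Placeholder credentials; in practice, read these from a secret store.
set_basic_auth_env("my_db_user", "my_db_password")
# A FeatureStore object created after this point would see both variables.
```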

register_splice_context(splice_ctx: splicemachine.spark.context.PySpliceContext) → None[source]¶

remove_feature(name: str)[source]¶

Removes a feature. This runs two checks:

1. The feature exists.
2. The feature does not belong to a feature set that has already been deployed.

If either check fails, this function raises an error explaining which check failed.

Parameters: name – The feature name
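
The two checks can be pictured with in-memory metadata. This is only a sketch: the dictionaries, names, and exception types are assumptions made for illustration and do not reflect how the Feature Store stores or validates features.

```python
from typing import Dict, Set


def check_feature_removable(
    name: str,
    feature_to_set: Dict[str, str],
    deployed_sets: Set[str],
) -> None:
    """Raise an error naming the failed check; return None if both pass."""
    key = name.upper()  # assumption: names are compared case-insensitively
    # Check 1: the feature must exist.
    if key not in feature_to_set:
        raise LookupError(f"Feature {name} does not exist")
    # Check 2: its feature set must not already be deployed.
    if feature_to_set[key] in deployed_sets:
        raise RuntimeError(
            f"Feature {name} belongs to deployed feature set {feature_to_set[key]}"
        )


# Hypothetical metadata: feature name -> owning feature set.
feature_to_set = {"AGE": "SPLICE.CUSTOMERS", "SPEND": "SPLICE.ORDERS"}
deployed_sets = {"SPLICE.ORDERS"}

check_feature_removable("age", feature_to_set, deployed_sets)  # passes both checks
```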

remove_feature_set(schema_name: str, table_name: str, version: Optional[Union[str, int]] = None, purge: bool = False) → None[source]¶

Deletes a feature set if appropriate. A feature set can currently be deleted in two scenarios: 1. The feature set has not been deployed 2. The feature set has been deployed, but is not linked to any training sets

If neither of these conditions holds, this will fail.

Optionally set purge=True to force delete the feature set and all of the associated Training Sets using the Feature Set. ONLY USE IF YOU KNOW WHAT YOU ARE DOING. This will delete Training Sets, but will still fail if there is an active deployment with this feature set. That check cannot be overridden

Parameters

schema_name – The Feature Set Schema
table_name – The Feature Set Table
version – The Feature Set Version
purge – Whether to force delete training sets that use the feature set (that are not used in deployments)
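
The deletion rules above reduce to a small decision function. The sketch below restates them for illustration only; the function and its parameters are hypothetical and are not part of the FeatureStore API.

```python
def can_remove_feature_set(
    deployed: bool,
    linked_training_sets: int,
    used_by_active_deployment: bool,
    purge: bool = False,
) -> bool:
    """Return True if removal would be allowed under the rules described above."""
    # An active deployment blocks removal, even with purge=True.
    if used_by_active_deployment:
        return False
    # Scenario 1: the feature set has not been deployed.
    if not deployed:
        return True
    # Scenario 2: deployed, but not linked to any training sets.
    if linked_training_sets == 0:
        return True
    # Deployed and linked: only a purge (which also deletes the training sets) succeeds.
    return purge


allowed = can_remove_feature_set(deployed=True, linked_training_sets=2,
                                 used_by_active_deployment=False, purge=True)
```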

remove_source(name: str)[source]¶

Removes a Source by name. You cannot remove a Source that has child dependencies (Feature Sets). If there is a Feature Set that is deployed and a Pipeline that is feeding it, you cannot delete the Source until you remove the Feature Set (which in turn removes the Pipeline)

Parameters: name – The Source name

remove_training_view(name: str, version: Union[str, int] = 'latest')[source]¶

This removes a training view if it is not being used by any currently deployed models. NOTE: Once this training view is removed, you will not be able to deploy any models that were trained using this view

Parameters

name – The view name
version – The view version

run_feature_elimination(df, features: List[Union[str, splicemachine.features.feature.Feature]], label: str = 'label', n: int = 10, verbose: int = 0, model_type: str = 'classification', step: int = 1, log_mlflow: bool = False, mlflow_run_name: Optional[str] = None, return_importances: bool = False)[source]¶

Runs feature elimination using a Spark decision tree on the dataframe passed in. Optionally logs results to mlflow

Parameters

df – The dataframe with features and label
features – The list of feature names (or Feature objects) to run elimination on
label – The label column name
n – The number of features desired. Default 10
verbose – The level of verbosity. 0 indicates no printing. 1 indicates printing remaining features after each round. 2 indicates printing features and relative importances after each round. Default 0
model_type – Whether the model to test with will be a regression or classification model. Default classification
log_mlflow – Whether or not to log results to mlflow as nested runs. Default false
mlflow_run_name – The name of the parent run under which all subsequent runs will live. The child run names will be {mlflow_run_name}_{num_features}_features, e.g. testrun_5_features, testrun_4_features, etc.
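
To make the child run naming pattern concrete, the sketch below generates the names that would follow it. It rests on assumptions not confirmed by the text above: that one child run is logged per round, that each round drops step features, and that rounds stop once n features remain. child_run_names is a hypothetical helper.

```python
from typing import List


def child_run_names(mlflow_run_name: str, num_features: int, n: int, step: int = 1) -> List[str]:
    """List names following the {mlflow_run_name}_{num_features}_features pattern."""
    if step < 1:
        raise ValueError("step must be at least 1")
    names = []
    remaining = num_features
    # Assumed schedule: drop `step` features per round until n remain.
    while remaining >= n:
        names.append(f"{mlflow_run_name}_{remaining}_features")
        remaining -= step
    return names


names = child_run_names("testrun", num_features=5, n=3)
# names == ['testrun_5_features', 'testrun_4_features', 'testrun_3_features']
```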

Returns

set_feature_description()[source]¶

set_feature_store_url(url: str)[source]¶

Sets the Feature Store URL. You must call this before calling any feature store functions, or set the FS_URL environment variable before creating your Feature Store object

Parameters: url – The Feature Store URL

set_token(token)[source]¶

Logs in to the Feature Store using JWT. The token corresponds to your Splice Machine database user’s JWT token. If you are running outside of the managed Splice Machine Cloud Service, you must authenticate before calling any feature store functions: call either this function or login_fs, or set the SPLICE_JUPYTER_TOKEN environment variable before creating your FeatureStore object.

Parameters: token – JWT Token

training_view_exists(name: str) → bool[source]¶

Returns whether a training view exists

Parameters: name – The training view name
Returns: bool True if the training view exists, False otherwise
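
One possible use of this check is to guard a removal so that it only runs when the view is present. The sketch below calls only training_view_exists and remove_training_view as documented in this section; FakeFeatureStore is a stand-in written for this example so the snippet runs without a Feature Store connection, and remove_view_if_present is a hypothetical helper.

```python
class FakeFeatureStore:
    """Stand-in exposing only the two documented methods used below."""

    def __init__(self, views):
        self._views = set(views)

    def training_view_exists(self, name: str) -> bool:
        return name in self._views

    def remove_training_view(self, name: str, version="latest"):
        self._views.discard(name)


def remove_view_if_present(fs, name: str) -> bool:
    """Remove the view only if it exists; report whether anything was removed."""
    if fs.training_view_exists(name):
        fs.remove_training_view(name)
        return True
    return False


fs = FakeFeatureStore({"churn_view"})  # placeholder view name
first = remove_view_if_present(fs, "churn_view")
second = remove_view_if_present(fs, "churn_view")
```

With a real FeatureStore instance in place of the stand-in, the helper would be called the same way.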

update_feature_metadata(name: str, desc: Optional[str] = None, tags: Optional[List[str]] = None, attributes: Optional[Dict[str, str]] = None)[source]¶

Update the metadata of a feature

Parameters

name – The feature name
desc – The (optional) feature description (default None)
tags – (optional) List of (str) tag words (default None)
attributes – (optional) Dict of (str) attribute key/value pairs (default None)

Returns

updated Feature

update_feature_set(schema_name: str, table_name: str, primary_keys: Dict[str, str], desc: Optional[str] = None, features: Optional[List[splicemachine.features.feature.Feature]] = None) → splicemachine.features.feature_set.FeatureSet [source]¶

Creates and returns a new version of an existing feature set. Use this method when you want to make changes to a deployed feature set.

Parameters

schema_name – The schema under which to create the feature set table
table_name – The table name for this feature set
primary_keys – The primary key column(s) of this feature set
desc – The (optional) description
features – An optional list of features. If provided, any nonexistent Features will be created with the Feature Set

Example

from splicemachine.features import FeatureType, Feature
f1 = Feature(
    name='my_first_feature',
    description='the first feature',
    feature_data_type='INT',
    feature_type=FeatureType.ordinal,
    tags=['good_feature','a new tag', 'ordinal'],
    attributes={'quality':'awesome'}
)
f2 = Feature(
    name='my_second_feature',
    description='the second feature',
    feature_data_type='FLOAT',
    feature_type=FeatureType.continuous,
    tags=['not_as_good_feature','a new tag'],
    attributes={'quality':'not as awesome'}
)
feats = [f1, f2]
# fs is an existing FeatureStore instance
feature_set = fs.update_feature_set(
    schema_name='splice',
    table_name='foo',
    primary_keys={'MOMENT_KEY':"INT"},
    desc='test fset',
    features=feats
)

Returns

FeatureSet

update_training_view(name: str, sql: str, primary_keys: List[str], join_keys: List[str], ts_col: str, label_col: Optional[str] = None, desc: Optional[str] = None) → None[source]¶

Creates and returns a new version of a training view for use in generating training SQL. Use this function when you want to make changes to a training view without affecting its dependencies

Parameters

name – The training view name.
sql –
(str) a SELECT statement that includes:
- the primary key column(s) - uniquely identifying a training row/case
- the inference timestamp column - timestamp column with which to join features (temporal join timestamp)
- join key(s) - the references to the other feature tables’ primary keys (e.g. customer_id, location_id)
- (optionally) the label expression - defining what the training set is trying to predict
primary_keys – (List[str]) The list of columns from the training SQL that identify the training row
ts_col – The timestamp column of the training SQL that identifies the inference timestamp
label_col – (Optional[str]) The optional label column from the training SQL.
replace – (Optional[bool]) Whether to replace an existing training view
join_keys – (List[str]) A list of join keys in the sql that are used to get the desired features in get_training_set
desc – (Optional[str]) An optional description of the training view
verbose – Whether or not to print the SQL before execution (default False)
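
The relationship between the SELECT statement and the column arguments can be sketched as follows. The table and column names are invented for illustration, and missing_columns is a hypothetical helper that does a plain text search; it does not parse SQL and is not part of the library.

```python
import re
from typing import List, Optional


def missing_columns(
    sql: str,
    primary_keys: List[str],
    join_keys: List[str],
    ts_col: str,
    label_col: Optional[str] = None,
) -> List[str]:
    """Return the named columns that never appear as a whole word in the SQL text."""
    required = list(primary_keys) + list(join_keys) + [ts_col]
    if label_col:
        required.append(label_col)
    required = list(dict.fromkeys(required))  # drop duplicates, keep order
    return [
        c for c in required
        if not re.search(rf"\b{re.escape(c)}\b", sql, re.IGNORECASE)
    ]


# Hypothetical training SQL: primary key, join key, timestamp, and label expression.
sql = """
SELECT order_id, customer_id, order_ts,
       CASE WHEN amount > 100 THEN 1 ELSE 0 END AS big_order
FROM orders
"""
missing = missing_columns(
    sql,
    primary_keys=["order_id"],
    join_keys=["customer_id"],
    ts_col="order_ts",
    label_col="big_order",
)
```

An empty result means every named column appears somewhere in the statement; it does not guarantee the SQL is valid.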

Returns

splicemachine.features.feature_set¶

This describes the Python representation of a Feature Set. A feature set is a database table that contains Features and their metadata. The Feature Set class is mostly used internally but can be used by the user to see the available Features in the given Feature Set, to see the table and schema name it is deployed to (if it is deployed), and to deploy the feature set (which can also be done directly through the Feature Store). Feature Sets are unique by their schema.table name, as they exist in the Splice Machine database as a SQL table. They are case insensitive. To see the full contents of your Feature Set, you can print, return, or .__dict__ your Feature Set object.
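
Because Feature Sets are identified by their schema.table name without regard to case, two spellings of the same name refer to the same Feature Set. A minimal sketch of such a comparison follows; normalizing to upper case is an assumption made for illustration, and feature_set_key is a hypothetical helper.

```python
def feature_set_key(schema_name: str, table_name: str) -> str:
    """Build a case-insensitive identifier from schema and table name."""
    return f"{schema_name}.{table_name}".upper()


same = feature_set_key("splice", "foo") == feature_set_key("SPLICE", "Foo")
```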

class FeatureSet(*, splice_ctx: Optional[splicemachine.spark.context.PySpliceContext] = None, table_name, schema_name, description, primary_keys: Dict[str, str], feature_set_id=None, deployed: bool = False, **kwargs)[source]¶

Bases: object

is_deployed()[source]¶: Returns whether this Feature Set has been deployed (that is, whether the schema.table has been created in the database). Returns: (bool) True if the Feature Set is deployed

splicemachine.features.Feature¶

This describes the Python representation of a Feature. A Feature is a column of a Feature Set table with particular metadata. A Feature is the smallest unit in the Feature Store, and each Feature within a Feature Set is individually tracked for changes to enable full time travel and point-in-time consistent training datasets. Features’ names are unique and case insensitive. To see the full contents of your Feature, you can print, return, or .__dict__ your Feature object.

class Feature(*, name, description, feature_data_type, feature_type, tags, attributes, feature_set_id=None, feature_id=None, **kwargs)[source]¶

Bases: object

is_categorical()[source]¶: Returns whether the type of this feature is categorical

is_continuous()[source]¶: Returns whether the type of this feature is continuous

is_ordinal()[source]¶: Returns whether the type of this feature is ordinal

splicemachine.features.training_view¶

This describes the Python representation of a Training View. A Training View is a SQL statement defining an event of interest, and metadata around how to create a training dataset with that view. To see the full contents of your Training View, you can print, return, or .__dict__ your Training View object.

class TrainingView(*, pk_columns: List[str], ts_column, label_column, sql_text, name, description, view_id=None, view_version=None, **kwargs)[source]¶: Bases: object

Splice MLManager documentation

Module contents¶