splicemachine.features package

Submodules

splicemachine.features.feature_store module

This module contains the classes and APIs for interacting with the Splice Machine Feature Store.

class FeatureStore(splice_ctx: Optional[splicemachine.spark.context.PySpliceContext] = None)[source]

Bases: object

alter_feature_set(schema_name: str, table_name: str, primary_keys: Optional[Dict[str, str]] = None, desc: Optional[str] = None, version: Optional[Union[str, int]] = None)splicemachine.features.feature_set.FeatureSet[source]

Alters the specified (or default latest) version of a feature set, if that version is not yet deployed. Use this method when you want to make changes to an undeployed version of a feature set, or when you want to change version-independent metadata, such as description.

Parameters
  • schema_name – The schema under which to create the feature set table

  • table_name – The table name for this feature set

  • primary_keys – The primary key column(s) of this feature set

  • desc – The (optional) description

  • version – The version you wish to alter (number or ‘latest’). If None, will default to the latest undeployed version

Returns

FeatureSet

alter_training_view(name: str, sql: Optional[str] = None, primary_keys: Optional[List[str]] = None, join_keys: Optional[List[str]] = None, ts_col: Optional[str] = None, label_col: Optional[str] = None, desc: Optional[str] = None, version: Optional[Union[str, int]] = None)None[source]

Alters an existing version of a training view. Use this method when you want to make changes to a version of a training view that has no dependencies, or when you want to change version-independent metadata, such as description.

Parameters
  • name – The training set name. This must be unique to other existing training sets unless replace is True

  • sql

    (str) a SELECT statement that includes:

    • the primary key column(s) - uniquely identifying a training row/case

    • the inference timestamp column - timestamp column with which to join features (temporal join timestamp)

    • join key(s) - the references to the other feature tables’ primary keys (e.g. customer_id, location_id)

    • (optionally) the label expression - defining what the training set is trying to predict

  • primary_keys – (List[str]) The list of columns from the training SQL that identify the training row

  • ts_col – The timestamp column of the training SQL that identifies the inference timestamp

  • label_col – (Optional[str]) The optional label column from the training SQL.

  • join_keys – (List[str]) A list of join keys in the sql that are used to get the desired features in get_training_set

  • desc – (Optional[str]) An optional description of the training set

  • version – The version you wish to alter (number or ‘latest’). If None, will default to the latest version

Returns

None

create_aggregation_feature_set_from_source(source_name: str, schema_name: str, table_name: str, start_time: datetime.datetime, schedule_interval: str, aggregations: List[splicemachine.features.pipelines.feature_aggregation.FeatureAggregation], backfill_start_time: Optional[datetime.datetime] = None, backfill_interval: Optional[str] = None, description: Optional[str] = None, run_backfill: Optional[bool] = True)[source]

Creates a temporal aggregation feature set by creating a pipeline linking a Source to a feature set. Sources are created with features.FeatureStore.create_source(). The provided aggregations generate the features for the feature set, so this call creates the feature set together with the aggregation calculations that populate its features.

Parameters
  • source_name – The name of the source created via create_source

  • schema_name – The schema name of the feature set

  • table_name – The table name of the feature set

  • start_time – The start time for the pipeline to run

  • schedule_interval – The frequency with which to run the pipeline.

  • aggregations – The list of FeatureAggregations to apply to the column names of the source SQL statement

  • backfill_start_time – The datetime representing the earliest point in time to get data from when running backfill

  • backfill_interval – The “sliding window” interval to increase each timepoint by when performing backfill

  • run_backfill – Whether or not to run backfill when calling this function. Default True (per the signature above). If this is True, backfill_start_time and backfill_interval must be set

Returns

(FeatureSet) the created Feature Set

Example
from splicemachine.features.pipelines import AggWindow, FeatureAgg, FeatureAggregation
from datetime import datetime
source_name = 'CUSTOMER_RFM'
fs.create_source(
    name=source_name,
    sql='SELECT * FROM RETAIL_RFM.CUSTOMER_CATEGORY_ACTIVITY',
    event_ts_column='INVOICEDATE',
    update_ts_column='LAST_UPDATE_TS',
    primary_keys=['CUSTOMERID']
)
start_time = datetime.today()
schedule_interval = AggWindow.get_window(5, AggWindow.DAY)
backfill_start = datetime.strptime('2002-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
backfill_interval = schedule_interval
fs.create_aggregation_feature_set_from_source(
    source_name, 'RETAIL_FS', 'AUTO_RFM', start_time=start_time,
    schedule_interval=schedule_interval, backfill_start_time=backfill_start,
    backfill_interval=backfill_interval,
    aggregations = [
        FeatureAggregation(feature_name_prefix = 'AR_CLOTHING_QTY',     column_name = 'CLOTHING_QTY',     agg_functions=['sum','max'],   agg_windows=['1d','2d','90d'], agg_default_value = 0.0 ),
        FeatureAggregation(feature_name_prefix = 'AR_DELICATESSEN_QTY', column_name = 'DELICATESSEN_QTY', agg_functions=['avg'],         agg_windows=['1d','2d', '2w'], agg_default_value = 11.5 ),
        FeatureAggregation(feature_name_prefix = 'AR_GARDEN_QTY' ,      column_name = 'GARDEN_QTY',       agg_functions=['count','avg'], agg_windows=['30d','90d', '1q'], agg_default_value = 8 )
    ]
)

This will create, deploy and return a FeatureSet called ‘RETAIL_FS.AUTO_RFM’. The Feature Set will have 15 features:

  • 6 for the AR_CLOTHING_QTY prefix (sum & max over provided agg windows)

  • 3 for the AR_DELICATESSEN_QTY prefix (avg over provided agg windows)

  • 6 for the AR_GARDEN_QTY prefix (count & avg over provided agg windows)

A Pipeline is also created and scheduled in Airflow that feeds it every 5 days from the Source CUSTOMER_RFM. Backfill will also occur, reading data from the source as of ‘2002-01-01 00:00:00’ with a 5 day window.
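The feature count above follows from multiplying, for each aggregation, the number of functions by the number of windows. Below is a minimal sketch of that arithmetic in plain Python. The generated names use a hypothetical `{prefix}_{FUNCTION}_{window}` pattern chosen only for illustration; the names the Feature Store actually assigns may differ.

```python
from itertools import product

# Mirrors the aggregations in the example above as plain dicts
# (no Feature Store connection is needed for the arithmetic).
aggregations = [
    {"prefix": "AR_CLOTHING_QTY", "functions": ["sum", "max"], "windows": ["1d", "2d", "90d"]},
    {"prefix": "AR_DELICATESSEN_QTY", "functions": ["avg"], "windows": ["1d", "2d", "2w"]},
    {"prefix": "AR_GARDEN_QTY", "functions": ["count", "avg"], "windows": ["30d", "90d", "1q"]},
]


def expand_feature_names(aggs):
    """Return one name per (prefix, function, window) combination.

    The name pattern is an assumption made for this sketch only.
    """
    names = []
    for agg in aggs:
        for func, window in product(agg["functions"], agg["windows"]):
            names.append(f"{agg['prefix']}_{func.upper()}_{window}")
    return names


feature_names = expand_feature_names(aggregations)
print(len(feature_names))  # 6 + 3 + 6 = 15
```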

create_feature(schema_name: str, table_name: str, name: str, feature_data_type: str, feature_type: str, desc: Optional[str] = None, tags: Optional[List[str]] = None, attributes: Optional[Dict[str, str]] = None)[source]

Add a feature to a feature set

Parameters
  • schema_name – The feature set schema

  • table_name – The feature set table name to add the feature to

  • name – The feature name

  • feature_data_type – The datatype of the feature. Must be a valid SQL datatype

  • feature_type

    splicemachine.features.FeatureType of the feature. The available types are from the FeatureType class: FeatureType.[categorical, ordinal, continuous]. You can see available feature types by running

    from splicemachine.features import FeatureType
    print(FeatureType.get_valid())
    

  • desc – The (optional) feature description (default None)

  • tags – (optional) List of (str) tag words (default None)

  • attributes – (optional) Dict of (str) attribute key/value pairs (default None)

Returns

Feature created

create_feature_set(schema_name: str, table_name: str, primary_keys: Dict[str, str], desc: Optional[str] = None, features: Optional[List[splicemachine.features.feature.Feature]] = None)splicemachine.features.feature_set.FeatureSet[source]

Creates and returns a new feature set

Parameters
  • schema_name – The schema under which to create the feature set table

  • table_name – The table name for this feature set

  • primary_keys – The primary key column(s) of this feature set

  • desc – The (optional) description

  • features – An optional list of features. If provided, the Features will be created with the Feature Set

Example
from splicemachine.features import FeatureType, Feature
f1 = Feature(
    name='my_first_feature',
    description='the first feature',
    feature_data_type='INT',
    feature_type=FeatureType.ordinal,
    tags=['good_feature','a new tag', 'ordinal'],
    attributes={'quality':'awesome'}
)
f2 = Feature(
    name='my_second_feature',
    description='the second feature',
    feature_data_type='FLOAT',
    feature_type=FeatureType.continuous,
    tags=['not_as_good_feature','a new tag'],
    attributes={'quality':'not as awesome'}
)
feats = [f1, f2]
feature_set = fs.create_feature_set(
    schema_name='splice',
    table_name='foo',
    primary_keys={'MOMENT_KEY':"INT"},
    desc='test fset',
    features=feats
)
Returns

FeatureSet

create_source(name: str, sql: str, event_ts_column: datetime.datetime, update_ts_column: datetime.datetime, primary_keys: List[str])[source]

Creates, validates, and stores a source in the Feature Store that can be used to create a Pipeline that feeds a feature set

Example
fs.create_source(
    name='CUSTOMER_RFM',
    sql='SELECT * FROM RETAIL_RFM.CUSTOMER_CATEGORY_ACTIVITY',
    event_ts_column='INVOICEDATE',
    update_ts_column='LAST_UPDATE_TS',
    primary_keys=['CUSTOMERID']
)
Parameters
  • name – The name of the source. This must be unique across the feature store

  • sql – The SQL statement that returns the base result set to be used in future aggregation pipelines

  • event_ts_column – The column of the source query that determines the time of the event (row) being described. This is not necessarily the time the record was recorded, but the time the event itself occurred.

  • update_ts_column – The column that indicates the time when the record was last updated. When scheduled pipelines run, they will filter on this column to get only the records that have not been queried before.

  • primary_keys – The list of columns in the source SQL that uniquely identifies each row. These become the primary keys of the feature set(s) that is/are eventually created from this source.
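To illustrate the role of the update timestamp column, here is a minimal sketch of an incremental filter in plain Python. It is not the pipeline's actual implementation: the helper `rows_updated_since` and the sample rows are hypothetical, and a real pipeline would apply the equivalent filter in SQL.

```python
from datetime import datetime


def rows_updated_since(rows, last_run, update_ts_column="LAST_UPDATE_TS"):
    """Keep only rows whose update timestamp is later than the previous run."""
    return [row for row in rows if row[update_ts_column] > last_run]


# Hypothetical source rows, using the column names from the example above.
rows = [
    {"CUSTOMERID": 1, "INVOICEDATE": datetime(2021, 1, 1), "LAST_UPDATE_TS": datetime(2021, 1, 2)},
    {"CUSTOMERID": 2, "INVOICEDATE": datetime(2021, 1, 3), "LAST_UPDATE_TS": datetime(2021, 1, 9)},
]

# Only customer 2 was updated after the previous run on 2021-01-05.
new_rows = rows_updated_since(rows, last_run=datetime(2021, 1, 5))
```

Note that the event time (INVOICEDATE) plays no part in this filter: a late-arriving correction to an old event is still picked up because its update timestamp is recent.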

create_training_view(name: str, sql: str, primary_keys: List[str], join_keys: List[str], ts_col: str, label_col: Optional[str] = None, desc: Optional[str] = None)None[source]

Registers a training view for use in generating training SQL

Parameters
  • name – The training set name. This must be unique to other existing training sets

  • sql

    (str) a SELECT statement that includes:

    • the primary key column(s) - uniquely identifying a training row/case

    • the inference timestamp column - timestamp column with which to join features (temporal join timestamp)

    • join key(s) - the references to the other feature tables’ primary keys (e.g. customer_id, location_id)

    • (optionally) the label expression - defining what the training set is trying to predict

  • primary_keys – (List[str]) The list of columns from the training SQL that identify the training row

  • ts_col – The timestamp column of the training SQL that identifies the inference timestamp

  • label_col – (Optional[str]) The optional label column from the training SQL.

  • replace – (Optional[bool]) Whether to replace an existing training view

  • join_keys – (List[str]) A list of join keys in the sql that are used to get the desired features in get_training_set

  • desc – (Optional[str]) An optional description of the training set

  • verbose – Whether or not to print the SQL before execution (default False)
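As a rough illustration of how the arguments relate to one another, the sketch below assembles a set of keyword arguments and checks that every declared column appears in the SELECT statement. The view name, table, and column names are hypothetical, and the substring check is deliberately naive; it is not the validation the Feature Store performs.

```python
# Hypothetical training view definition; all names are placeholders.
training_view = {
    "name": "customer_churn_view",
    "sql": (
        "SELECT c.customer_id, c.event_ts, c.location_id, c.churned "
        "FROM retail.customer_events c"
    ),
    "primary_keys": ["customer_id"],
    "join_keys": ["customer_id", "location_id"],
    "ts_col": "event_ts",
    "label_col": "churned",
}


def missing_columns(view):
    """Return the declared columns that do not appear in the view SQL.

    Uses a simple case-insensitive substring match, which is enough for a
    sanity check but is not a SQL parser.
    """
    required = view["primary_keys"] + view["join_keys"] + [view["ts_col"]]
    if view.get("label_col"):
        required.append(view["label_col"])
    sql = view["sql"].lower()
    return sorted({col for col in required if col.lower() not in sql})


print(missing_columns(training_view))  # [] when every column is present
```

The dictionary keys match the parameter names in the signature above, so with a connected FeatureStore object `fs` they could be passed as `fs.create_training_view(**training_view)`.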

Returns

None

deploy_feature_set(schema_name: str, table_name: str, version: Union[str, int] = 'latest', migrate: bool = False)[source]

Deploys a feature set to the database. This persists the feature set’s existence. As of now, once deployed you cannot delete the feature set or add/delete features. The feature set must have already been created with create_feature_set()

Parameters
  • schema_name – The schema of the created feature set

  • table_name – The table of the created feature set

  • version – The version of the feature set to deploy

  • migrate – Whether or not to migrate data from a past version of this feature set

describe_feature_set(schema_name: str, table_name: str)None[source]

Prints out a description of a given feature set, with all features in the feature set and whether the feature set is deployed

Parameters
  • schema_name – feature set schema name

  • table_name – feature set table name

Returns

None

describe_feature_sets()None[source]

Prints out a description of all feature sets, with all features in the feature sets and whether each feature set is deployed

Returns

None

describe_training_view(training_view: str, version: Union[int, str] = 'latest')None[source]

Prints out a description of a given training view, the ID, name, description and optional label

Parameters
  • training_view – The training view name

  • version – The training view version

Returns

None

describe_training_views()None[source]

Prints out a description of all training views, the ID, name, description and optional label

Parameters

training_view – The training view name

Returns

None

Returns an interactive feature search that enables users to search for features and profiles the selected Feature. Two forms of this search exist: one for use inside of the managed Splice Machine notebook environment, and one for standard Jupyter. This is because the managed Splice Jupyter environment has extra functionality that would not be present outside of it. The search will be automatically rendered depending on the environment.

Parameters

pandas_profile – Whether to use pandas / spark to profile the feature. If pandas is selected but the dataset is too large, it will fall back to Spark. Default Pandas.

display_model_drift(schema_name: str, table_name: str, time_intervals: int, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None)[source]

Displays as many as time_intervals plots showing the distribution of the model prediction within each time period. Time periods are equal periods of time where predictions are present in the model table schema_name.table_name. Model predictions are first filtered to only those occurring after start_time if specified and before end_time if specified.

Parameters
  • schema_name – schema where the model table resides

  • table_name – name of the model table

  • time_intervals – number of time intervals to plot

  • start_time – if specified, filters to only show predictions occurring after this date/time

  • end_time – if specified, filters to only show predictions occurring before this date/time
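The notion of equal time periods can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's plotting logic: the helper `equal_time_periods` is hypothetical, and it simply divides the span between two datetimes into `time_intervals` equal parts.

```python
from datetime import datetime


def equal_time_periods(start, end, time_intervals):
    """Split [start, end] into `time_intervals` equal (period_start, period_end) pairs."""
    if time_intervals < 1:
        raise ValueError("time_intervals must be at least 1")
    if end <= start:
        raise ValueError("end must be later than start")
    step = (end - start) / time_intervals
    # Use `end` itself as the final edge to avoid accumulated rounding.
    edges = [start + i * step for i in range(time_intervals)] + [end]
    return list(zip(edges[:-1], edges[1:]))


# Four one-day periods between 2021-01-01 and 2021-01-05.
periods = equal_time_periods(datetime(2021, 1, 1), datetime(2021, 1, 5), 4)
for period_start, period_end in periods:
    print(period_start, "->", period_end)
```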

display_model_feature_drift(schema_name: str, table_name: str)[source]

Displays feature by feature comparison between the training set of the deployed model and the input feature values used with the model since deployment.

Parameters
  • schema_name – name of database schema where model table is deployed

  • table_name – name of the model table

Returns

None

feature_exists(name: str)bool[source]

Returns whether a feature exists

Parameters

name – The feature name

Returns

bool True if the feature exists, False otherwise

feature_set_exists(schema: str, table: str)bool[source]

Returns whether a feature set exists

Parameters
  • schema – The feature set schema

  • table – The feature set table

Returns

bool True if the feature set exists, False otherwise

get_backfill_intervals(schema_name: str, table_name: str)List[datetime.datetime][source]

Gets the backfill intervals necessary for the parameterized backfill SQL obtained from the features.FeatureStore.get_backfill_sql() function. This function will likely not be necessary as you can perform backfill at the time of feature set creation automatically.

Parameters
  • schema_name – The schema name of the feature set

  • table_name – The table name of the feature set

Returns

The list of datetimes necessary to parameterize the backfill SQL
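For intuition about what such a list of datetimes might look like, the sketch below steps from a backfill start time to a stop time in fixed increments. This is an assumption-laden illustration, not the function's implementation: the helper `backfill_timepoints` is hypothetical, and the real intervals are derived from the feature set's pipeline metadata.

```python
from datetime import datetime, timedelta


def backfill_timepoints(backfill_start, stop, interval):
    """Return the datetimes from backfill_start up to and including stop, stepping by interval."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    points = []
    current = backfill_start
    while current <= stop:
        points.append(current)
        current += interval
    return points


# A 5-day sliding window starting on 2002-01-01, as in the earlier example.
points = backfill_timepoints(
    datetime(2002, 1, 1), datetime(2002, 1, 21), timedelta(days=5)
)
print(len(points))  # 5 timepoints: Jan 1, 6, 11, 16 and 21
```

Each datetime in such a list would be used once as the parameter for the backfill SQL.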

get_backfill_sql(schema_name: str, table_name: str)[source]

Returns the necessary parameterized SQL statement to perform backfill on an Aggregate Feature Set. The Feature Set must have been deployed using the features.FeatureStore.create_aggregation_feature_set_from_source() function, meaning there must be a Source and a Pipeline associated with it. This function will likely not be necessary as you can perform backfill at the time of feature set creation automatically.

This SQL will be parameterized and need a timestamp to execute. You can get those timestamps with features.FeatureStore.get_backfill_intervals() using the same parameters

Parameters
  • schema_name – The schema name of the feature set

  • table_name – The table name of the feature set

Returns

The parameterized Backfill SQL

get_deployments(schema_name: Optional[str] = None, table_name: Optional[str] = None, training_set: Optional[str] = None, feature: Optional[str] = None, feature_set: Optional[str] = None, version: Optional[Union[str, int]] = None)[source]

Returns a list of all (or specified) available deployments

Parameters
  • schema_name – model schema name

  • table_name – model table name

  • training_set – training set name

  • feature – passing this in will return all deployments that used this feature

  • feature_set – passing this in will return all deployments that used this feature set

  • version – the version of the feature set parameter, if used

Returns

List[Deployment] the list of Deployments as dicts

get_feature_description()[source]
get_feature_details(name: str)splicemachine.features.feature.Feature[source]

Returns a Feature and its detailed information

Parameters

name – The feature name

Returns

Feature

get_feature_primary_keys(features: List[str])Dict[str, List[str]][source]

Returns a dictionary mapping each individual feature to its primary key(s). This function is not yet implemented.

Parameters

features – (List[str]) The list of features to get primary keys for

Returns

Dict[str, List[str]] A mapping of {feature name: [pk1, pk2, etc]}

get_feature_sets(feature_set_names: Optional[List[str]] = None)List[splicemachine.features.feature_set.FeatureSet][source]

Returns a list of available feature sets

Parameters

feature_set_names – A list of feature set names in the format ‘{schema_name}.{table_name}’. If None, all FeatureSets are returned

Returns

List[FeatureSet] the list of Feature Sets
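The ‘{schema_name}.{table_name}’ format can be built and split with two small helpers. Both `qualified_name` and `split_qualified_name` are hypothetical names introduced for this sketch; they are not part of the Feature Store API.

```python
def qualified_name(schema_name, table_name):
    """Build a '{schema_name}.{table_name}' feature set name."""
    return f"{schema_name}.{table_name}"


def split_qualified_name(name):
    """Split a qualified feature set name back into (schema_name, table_name)."""
    schema_name, sep, table_name = name.partition(".")
    if not sep or not schema_name or not table_name:
        raise ValueError(f"expected 'schema.table', got {name!r}")
    return schema_name, table_name


# Names taken from the examples earlier in this document.
feature_set_names = [
    qualified_name("RETAIL_FS", "AUTO_RFM"),
    qualified_name("splice", "foo"),
]
print(feature_set_names)  # ['RETAIL_FS.AUTO_RFM', 'splice.foo']
```

A list built this way is in the shape the feature_set_names parameter describes.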

get_feature_vector(features: List[Union[str, splicemachine.features.feature.Feature]], join_key_values: Dict[str, str], return_primary_keys=True, return_sql=False)Union[str, pandas.core.frame.DataFrame][source]

Gets a feature vector given a list of Features and primary key values for their corresponding Feature Sets

Parameters
  • features – List of str Feature names or Features

  • join_key_values – (dict) join key values to get the proper Feature values formatted as {join_key_column_name: join_key_value}

  • return_primary_keys – Whether to return the Feature Set primary keys in the vector. Default True

  • return_sql – Whether to return the SQL needed to get the vector or the values themselves. Default False

Returns

Pandas Dataframe or str (SQL statement)

get_feature_vector_sql_from_training_view(training_view: str, features: List[Union[str, splicemachine.features.feature.Feature]])str[source]

Returns the parameterized feature retrieval SQL used for online model serving.

Parameters
  • training_view – (str) The name of the registered training view

  • features

    (List[str]) the list of features from the feature store to be included in the training

    NOTE
    This function will error if the view SQL is missing a view key required to retrieve the desired features
    

Returns

(str) the parameterized feature vector SQL

get_features_by_name(names: Optional[List[str]] = None, as_list=False)Union[List[splicemachine.features.feature.Feature], pandas.core.frame.DataFrame][source]

Returns a dataframe or list of features whose names are provided

Parameters
  • names – The list of feature names

  • as_list – Whether or not to return a list of features. Default False

Returns

SparkDF or List[Feature] The list of Feature objects or Spark Dataframe of features and their metadata. Note, this is not the Feature values, simply the describing metadata about the features. To create a training dataset with Feature values, see features.FeatureStore.get_training_set() or features.FeatureStore.get_feature_dataset()

get_features_from_feature_set(schema_name: str, table_name: str)List[splicemachine.features.feature.Feature][source]

Returns either a pandas DF of feature details or a List of features for a specified feature set. You can get features from multiple feature sets by concatenating the results of this call. For example, to get features from 2 feature sets, foo.bar1 and foo2.bar4:

features = fs.get_features_from_feature_set('foo','bar1') + fs.get_features_from_feature_set('foo2','bar4')

If you want a list of just the Feature NAMES (ie a List[str]) you can simply run:

features = fs.get_features_from_feature_set('foo','bar1') + fs.get_features_from_feature_set('foo2','bar4')
feature_names = [f.name for f in features]
Parameters
  • schema_name – Feature Set schema name

  • table_name – Feature Set table name

Returns

List of Features

get_pipeline_sql(schema_name: str, table_name: str)[source]

Returns the incremental pipeline SQL that feeds a feature set from a source (thus creating a pipeline). Pipelines are managed for you by default by Splice Machine via Airflow, but if you opt out of using the managed pipelines you can use this function to get the incremental SQL.

This SQL will be parameterized and need a timestamp to execute. You can get those timestamps with features.FeatureStore.get_backfill_intervals() using the same parameters

Parameters
  • schema_name – The schema name of the feature set

  • table_name – The table name of the feature set

Returns

The incremental Pipeline SQL

get_summary()Dict[str, str][source]

This function returns a summary of the feature store including:

  • Number of feature sets

  • Number of deployed feature sets

  • Number of features

  • Number of deployed features

  • Number of training sets

  • Number of training views

  • Number of associated models - this is a count of the MLManager.RUNS table where the splice.model_name tag is set and the splice.feature_store.training_set parameter is set

  • Number of active (deployed) models (that have used the feature store for training)

  • Number of pending feature sets - this will require a new table featurestore.pending_feature_set_deployments and it will be a count of that

get_training_set(features: Union[List[splicemachine.features.feature.Feature], List[str]], current_values_only: bool = False, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None, label: Optional[str] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_type: str = 'spark', return_sql: bool = False, save_as: Optional[str] = None)pyspark.sql.dataframe.DataFrame[source]

Gets a set of feature values across feature sets that is not time dependent (ie for non time series clustering). This feature dataset will be treated and tracked implicitly the same way a training_dataset is tracked from features.FeatureStore.get_training_set() . The dataset’s metadata and features used will be tracked in mlflow automatically (see get_training_set for more details).

NOTE
The way point-in-time correctness is guaranteed here is by choosing one of the Feature Sets as the "anchor" dataset.
This means that the points in time that the query is based off of will be the points in time in which the anchor
Feature Set recorded changes. The anchor Feature Set is the Feature Set that contains the superset of all primary key
columns across all Feature Sets from all Features provided. If more than 1 Feature Set has the superset of
all Feature Sets, the Feature Set with the most primary keys is selected. If more than 1 Feature Set has the same
maximum number of primary keys, the Feature Set is chosen by alphabetical order (schema_name, table_name).
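The anchor selection rule in the note above can be sketched in plain Python. This is an illustrative reading of the rule as written, not the library's code: the helper `choose_anchor` and the sample feature sets are hypothetical.

```python
def choose_anchor(feature_sets):
    """Pick the anchor feature set.

    feature_sets maps (schema_name, table_name) to its list of primary key
    columns. Candidates must contain every primary key column used by any
    feature set; ties go to the most primary keys, then alphabetical order.
    """
    all_keys = set().union(*feature_sets.values())
    candidates = [
        (name, keys) for name, keys in feature_sets.items() if all_keys <= set(keys)
    ]
    if not candidates:
        raise ValueError("no feature set contains every primary key column")
    # Most primary keys first, then alphabetical by (schema_name, table_name).
    name, _ = min(candidates, key=lambda item: (-len(item[1]), item[0]))
    return name


# Hypothetical feature sets: two of them hold the full key superset.
feature_sets = {
    ("RETAIL_FS", "CUSTOMER"): ["CUSTOMERID"],
    ("RETAIL_FS", "CUSTOMER_STORE"): ["CUSTOMERID", "STOREID"],
    ("RETAIL_FS", "ACTIVITY"): ["STOREID", "CUSTOMERID"],
}
anchor = choose_anchor(feature_sets)
print(anchor)  # ('RETAIL_FS', 'ACTIVITY'), chosen alphabetically among the tied candidates
```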
Parameters
  • features

    List of Features or strings of feature names

    NOTE
    The Feature Sets which the list of Features come from must have common join keys,
    otherwise the function will fail. If there is no common join key, it is recommended to
    create a Training View to specify the join conditions.
    

  • current_values_only – If you only want the most recent values of the features, set this to true. Otherwise, all history will be returned. Default False

  • start_time – How far back in history you want Feature values. If not specified (and current_values_only is False), all history will be returned. This parameter only takes effect if current_values_only is False.

  • end_time – The most recent values for each selected Feature. This will be the cutoff time, such that any Feature values that were updated after this point in time won’t be selected. If not specified (and current_values_only is False), Feature values up to the moment in time you call the function (now) will be retrieved. This parameter only takes effect if current_values_only is False.

  • label – An optional label to specify for the training set. If specified, the feature set of that feature will be used as the “anchor” feature set, meaning all point in time joins will be made to the timestamps of that feature set. This feature will also be recorded as a “label” feature for this particular training set (but not others in the future, unless this label is again specified).

  • return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)

  • return_ts_col – bool Whether or not the returned sql should include the timestamp column

  • return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe

  • save_as – Whether or not to save this Training Set (metadata) in the feature store for reproducibility. This enables you to version and persist the metadata for a training set of a specific model development. If you are using the Splice Machine managed MLFlow Service, this will be fully automated and managed for you upon model deployment, however you can still use this parameter to customize the name of the training set (it will default to the run id). If you are NOT using Splice Machine’s mlflow service, this is a useful way to link specific modeling experiments to the exact training sets used. This DOES NOT persist the training set itself, rather the metadata required to reproduce the identical training set.
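The interaction between current_values_only, start_time and end_time described above can be sketched over an in-memory history. This is a simplified illustration, not the SQL the Feature Store generates: the helper `select_values` and the `(key, timestamp, value)` row layout are assumptions made for the example.

```python
from datetime import datetime


def select_values(history, current_values_only=False, start_time=None, end_time=None):
    """Filter a list of (key, timestamp, value) rows.

    With current_values_only, keep the most recent row per key and ignore
    the time bounds. Otherwise keep rows inside [start_time, end_time],
    treating a missing bound as open-ended.
    """
    if current_values_only:
        latest = {}
        for key, ts, value in history:
            if key not in latest or ts > latest[key][1]:
                latest[key] = (key, ts, value)
        return sorted(latest.values())
    return [
        row
        for row in history
        if (start_time is None or row[1] >= start_time)
        and (end_time is None or row[1] <= end_time)
    ]


history = [
    (1, datetime(2021, 1, 1), 10.0),
    (1, datetime(2021, 2, 1), 12.0),
    (2, datetime(2021, 1, 15), 7.0),
]
current = select_values(history, current_values_only=True)
windowed = select_values(
    history, start_time=datetime(2021, 1, 10), end_time=datetime(2021, 1, 31)
)
```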

Returns

Spark DF or SQL statement necessary to generate the Training Set

get_training_set_by_name(name, version: Optional[int] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_sql=False, return_type: str = 'spark')[source]

Returns a Spark DF (or SQL) of an EXISTING Training Set (one that was saved with the save_as parameter in get_training_set() or get_training_set_from_view()). This is useful if you’ve deployed a model with a Training Set and

Parameters
  • name – Training Set name

  • version – The version of this training set. If not set, it will grab the newest version

  • return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)

  • return_ts_col – bool Whether or not the returned sql should include the timestamp column

  • return_sql – [DEPRECATED] (Optional[bool]) Return the SQL statement (str) instead of the Spark DF. Defaults False

  • return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe

Returns

Spark DF or SQL

get_training_set_features(training_set: Optional[str] = None)[source]

Returns a list of all features from an available Training Set, as well as details about that Training Set

Parameters

training_set – training set name

Returns

TrainingSet as dict

get_training_set_from_deployment(schema_name: str, table_name: str, label: Optional[str] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_type: str = 'spark')[source]

Reads Feature Store metadata to rebuild the original training data set used for the given deployed model.

Parameters
  • schema_name – model schema name

  • table_name – model table name

  • label – An optional label to specify for the training set. If specified, the feature set of that feature will be used as the “anchor” feature set, meaning all point in time joins will be made to the timestamps of that feature set. This feature will also be recorded as a “label” feature for this particular training set (but not others in the future, unless this label is again specified).

  • return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)

  • return_ts_col – bool Whether or not the returned sql should include the timestamp column

  • return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe

Returns

SparkDF the Training Frame

get_training_set_from_view(training_view: str, features: Optional[Union[List[splicemachine.features.feature.Feature], List[str]]] = None, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_sql: bool = False, return_type: str = 'spark', save_as: Optional[str] = None)pyspark.sql.dataframe.DataFrame[source]

Returns the training set as a Spark Dataframe from a Training View. When a user calls this function (assuming they have registered the feature store with mlflow using register_feature_store() ) the training dataset’s metadata will be tracked in mlflow automatically.

The following will be tracked:

  • Training View

  • Selected features

  • Start time

  • End time

This tracking will occur in the current run (if there is an active run) or in the next run that is started after calling this function (if no run is currently active).

Parameters
  • training_view – (str) The name of the registered training view

  • features

    (List[str] OR List[Feature]) the list of features from the feature store to be included in the training. If a list of strings is passed in it will be converted to a list of Feature. If not provided will return all available features.

    NOTE
    This function will error if the view SQL is missing a join key required to retrieve the
    desired features
    

  • start_time

    (Optional[datetime]) The start time of the query (how far back in the data to start). Default None

    NOTE
    If start_time is None, query will start from beginning of history
    

  • end_time

    (Optional[datetime]) The end time of the query (how recent the returned data may be). Default None

    NOTE
    If end_time is None, query will get most recently available data
    

  • return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)

  • return_ts_col – bool Whether or not the returned sql should include the timestamp column

  • return_sql – [DEPRECATED] (Optional[bool]) Return the SQL statement (str) instead of the Spark DF. Defaults False

  • return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe

  • save_as – Whether or not to save this Training Set (metadata) in the feature store for reproducibility. This enables you to version and persist the metadata for a training set of a specific model development. If you are using the Splice Machine managed MLFlow Service, this will be fully automated and managed for you upon model deployment, however you can still use this parameter to customize the name of the training set (it will default to the run id). If you are NOT using Splice Machine’s mlflow service, this is a useful way to link specific modeling experiments to the exact training sets used. This DOES NOT persist the training set itself, rather the metadata required to reproduce the identical training set.

Returns

Optional[SparkDF, str] The Spark dataframe of the training set or the SQL that is used to generate it (for debugging)
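Because return_sql is deprecated in favor of return_type, callers may want to reason about how the two interact. The sketch below is a hypothetical helper, resolve_return_type, that is not part of the library; it assumes an explicit return_type takes precedence over return_sql and that the default is ‘spark’, which the parameter descriptions suggest but do not state outright.

```python
from typing import Optional

# The four values listed for return_type in the parameter description.
VALID_RETURN_TYPES = ("spark", "pandas", "json", "sql")


def resolve_return_type(return_type: Optional[str] = None,
                        return_sql: bool = False) -> str:
    """Hypothetical helper: pick the effective return type.

    Assumptions of this sketch: an explicit return_type wins over the
    deprecated return_sql flag, and the default is 'spark'.
    """
    if return_type is not None:
        if return_type not in VALID_RETURN_TYPES:
            raise ValueError(f"return_type must be one of {VALID_RETURN_TYPES}")
        return return_type
    return "sql" if return_sql else "spark"
```

With these assumptions, resolve_return_type(return_sql=True) and resolve_return_type('sql') are equivalent, so new code can use return_type alone.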

get_training_view(training_view: str, version: Union[int, str] = 'latest')splicemachine.features.training_view.TrainingView[source]

Gets a training view by name

Parameters
  • training_view – Training view name

  • version – Training view version

Returns

TrainingView

get_training_view_features(training_view: str, version: Union[int, str] = 'latest')List[splicemachine.features.feature.Feature][source]

Returns the available features for the given training view name

Parameters
  • training_view – The name of the training view

  • version – The version of the training view

Returns

A list of available Feature objects

get_training_view_id(name: str)int[source]

Returns the unique view ID from a name

Parameters

name – The training view name

Returns

The training view id

get_training_views(_filter: Optional[Dict[str, Union[int, str]]] = None)List[splicemachine.features.training_view.TrainingView][source]

Returns a list of all available training views with an optional filter

Parameters

_filter – Dictionary containing the filter keyword (label, description etc) and the value to filter on. If None, will return all TrainingViews

Returns

List[TrainingView]
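To make the shape of the _filter dictionary concrete, here is a client-side sketch of keyword/value filtering over plain dictionaries. It does not call the Feature Store: apply_filter and the sample views are hypothetical, and exact-match semantics with all keys required to match is an assumption.

```python
from typing import Any, Dict, List, Optional


def apply_filter(views: List[Dict[str, Any]],
                 _filter: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep views whose fields equal every filter value.

    A filter of None (or an empty dict) returns all views, mirroring the
    documented behavior of get_training_views.
    """
    if not _filter:
        return list(views)
    return [v for v in views
            if all(v.get(key) == value for key, value in _filter.items())]


# Made-up training view metadata, for illustration only.
views = [
    {"name": "churn_view", "label": "churned", "description": "customer churn"},
    {"name": "fraud_view", "label": "is_fraud", "description": "card fraud"},
]

matches = apply_filter(views, {"label": "is_fraud"})
```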

list_training_sets()Dict[str, Optional[str]][source]

Returns a dictionary of the available training sets, mapping name -> description. If there is no description, the value will be an empty string

Returns

Dict[str, Optional[str]]
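A short sketch of consuming a name -> description mapping of this shape follows. The helper format_training_sets and the sample dictionary are hypothetical; the sketch handles both an empty string and None as "no description", since the return annotation allows None.

```python
from typing import Dict, Optional


def format_training_sets(training_sets: Dict[str, Optional[str]]) -> str:
    """Hypothetical helper: render one 'name: description' line per set."""
    lines = []
    for name in sorted(training_sets):
        # Both '' and None are treated as a missing description.
        desc = training_sets[name] or "(no description)"
        lines.append(f"{name}: {desc}")
    return "\n".join(lines)


# Made-up result, shaped like the documented return value.
sample = {"run_42": "baseline churn model", "adhoc_set": ""}
listing = format_training_sets(sample)
```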

login_fs(username, password)[source]

Logs in to the Feature Store using basic auth. The credentials correspond to your Splice Machine database user and password. If you are running outside of the managed Splice Machine Cloud Service, you must either call this function or set_token before calling any functions in the feature store, or set the SPLICE_JUPYTER_USER and SPLICE_JUPYTER_PASSWORD environment variables before creating your FeatureStore object.

Parameters
  • username – Username

  • password – Password

register_splice_context(splice_ctx: splicemachine.spark.context.PySpliceContext)None[source]

remove_feature(name: str)[source]

Removes a feature. This will run 2 checks:
  1. See if the feature exists.

  2. See if the feature belongs to a feature set that has already been deployed.

If the feature does not exist, or if it belongs to a feature set that has already been deployed, this function will throw an error explaining which check has failed.

Parameters

name – The feature name

remove_feature_set(schema_name: str, table_name: str, version: Optional[Union[str, int]] = None, purge: bool = False)None[source]

Deletes a feature set if appropriate. You can currently delete a feature set in two scenarios:
  1. The feature set has not been deployed

  2. The feature set has been deployed, but not linked to any training sets

If both of these conditions are false, this will fail.

Optionally set purge=True to force delete the feature set and all of the associated Training Sets using the Feature Set. ONLY USE IF YOU KNOW WHAT YOU ARE DOING. This will delete Training Sets, but will still fail if there is an active deployment with this feature set. That restriction cannot be overridden.

Parameters
  • schema_name – The Feature Set Schema

  • table_name – The Feature Set Table

  • version – The Feature Set Version

  • purge – Whether to force delete training sets that use the feature set (that are not used in deployments)
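The deletion rules above can be summarized as a small decision function. This is a sketch of the documented conditions only, not the library's code: can_remove_feature_set and its arguments are hypothetical names introduced for illustration.

```python
def can_remove_feature_set(deployed: bool,
                           linked_training_sets: int,
                           used_in_active_deployment: bool,
                           purge: bool = False) -> bool:
    """Hypothetical helper encoding the documented removal rules."""
    # An active deployment blocks removal; purge cannot override this.
    if used_in_active_deployment:
        return False
    # Scenario 1: the feature set has not been deployed.
    if not deployed:
        return True
    # Scenario 2: deployed, but not linked to any training sets.
    if linked_training_sets == 0:
        return True
    # Deployed and linked: only allowed when purge=True, which also
    # deletes the associated training sets.
    return purge
```

For example, a deployed feature set with two linked training sets is removable only with purge=True, and not at all if a model deployment is still using it.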

remove_source(name: str)[source]

Removes a Source by name. You cannot remove a Source that has child dependencies (Feature Sets). If there is a Feature Set that is deployed and a Pipeline that is feeding it, you cannot delete the Source until you remove the Feature Set (which in turn removes the Pipeline)

Parameters

name – The Source name

remove_training_view(name: str, version: Union[str, int] = 'latest')[source]

This removes a training view if it is not being used by any currently deployed models. NOTE: Once this training view is removed, you will not be able to deploy any models that were trained using this view

Parameters
  • name – The view name

  • version – The view version

run_feature_elimination(df, features: List[Union[str, splicemachine.features.feature.Feature]], label: str = 'label', n: int = 10, verbose: int = 0, model_type: str = 'classification', step: int = 1, log_mlflow: bool = False, mlflow_run_name: Optional[str] = None, return_importances: bool = False)[source]

Runs feature elimination using a Spark decision tree on the dataframe passed in. Optionally logs results to mlflow

Parameters
  • df – The dataframe with features and label

  • features – The list of feature names (or Feature objects) to run elimination on

  • label – The label column name

  • n – The number of features desired. Default 10

  • verbose – The level of verbosity. 0 indicates no printing. 1 indicates printing remaining features after each round. 2 indicates printing features and relative importances after each round. Default 0

  • model_type – Whether the model to test with will be a regression or classification model. Default classification

  • log_mlflow – Whether or not to log results to mlflow as nested runs. Default False

  • mlflow_run_name – The name of the parent run under which all subsequent runs will live. The child run names will be {mlflow_run_name}_{num_features}_features, e.g. testrun_5_features, testrun_4_features etc
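The child run naming pattern can be illustrated with a short helper. This sketch only builds names; it does not call mlflow or run elimination. The function child_run_names is hypothetical, and it assumes one child run per feature count, from the starting count down to n inclusive.

```python
from typing import List


def child_run_names(mlflow_run_name: str, num_features: int, n: int = 10) -> List[str]:
    """Hypothetical helper: names following {mlflow_run_name}_{num_features}_features.

    Assumes one child run for each feature count from num_features
    down to n (or a single run if num_features is already <= n).
    """
    stop = min(n, num_features)
    return [f"{mlflow_run_name}_{k}_features"
            for k in range(num_features, stop - 1, -1)]


names = child_run_names("testrun", num_features=5, n=3)
```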

Returns

set_feature_description()[source]

set_feature_store_url(url: str)[source]

Sets the Feature Store URL. You must call this before calling any feature store functions, or set the FS_URL environment variable before creating your Feature Store object

Parameters

url – The Feature Store URL

set_token(token)[source]

Logs in to the Feature Store using a JWT. This corresponds to your Splice Machine database user’s JWT token. If you are running outside of the managed Splice Machine Cloud Service, you must either call this function or login_fs before calling any functions in the feature store, or set the SPLICE_JUPYTER_TOKEN environment variable before creating your FeatureStore object.

Parameters

token – JWT Token
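The environment variables mentioned for login_fs and set_token can be inspected before creating a FeatureStore object. The sketch below is a hypothetical helper, resolve_auth, that only reads the documented variable names; preferring the token over basic auth when both are present is an assumption, not documented behavior.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_auth(env: Mapping[str, str] = os.environ) -> Optional[Tuple[str, ...]]:
    """Hypothetical helper: report which credentials are available.

    Returns ('token', jwt), ('basic', user, password), or None.
    Token-first precedence is an assumption of this sketch.
    """
    token = env.get("SPLICE_JUPYTER_TOKEN")
    if token:
        return ("token", token)
    user = env.get("SPLICE_JUPYTER_USER")
    password = env.get("SPLICE_JUPYTER_PASSWORD")
    if user and password:
        return ("basic", user, password)
    # Nothing usable found: call login_fs or set_token explicitly.
    return None
```

A None result signals that neither mechanism is configured through the environment, so one of the two login functions would need to be called directly.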

training_view_exists(name: str)bool[source]

Returns whether a training view exists

Parameters

name – The training view name

Returns

(bool) True if the training view exists, False otherwise

update_feature_metadata(name: str, desc: Optional[str] = None, tags: Optional[List[str]] = None, attributes: Optional[Dict[str, str]] = None)[source]

Update the metadata of a feature

Parameters
  • name – The feature name

  • desc – The (optional) feature description (default None)

  • tags – (optional) List of (str) tag words (default None)

  • attributes – (optional) Dict of (str) attribute key/value pairs (default None)

Returns

updated Feature

update_feature_set(schema_name: str, table_name: str, primary_keys: Dict[str, str], desc: Optional[str] = None, features: Optional[List[splicemachine.features.feature.Feature]] = None)splicemachine.features.feature_set.FeatureSet[source]

Creates and returns a new version of an existing feature set. Use this method when you want to make changes to a deployed feature set.

Parameters
  • schema_name – The schema under which to create the feature set table

  • table_name – The table name for this feature set

  • primary_keys – The primary key column(s) of this feature set

  • desc – The (optional) description

  • features – An optional list of features. If provided, any nonexistent Features will be created with the Feature Set

Example

# fs is assumed to be an existing FeatureStore instance
from splicemachine.features import FeatureType, Feature
f1 = Feature(
    name='my_first_feature',
    description='the first feature',
    feature_data_type='INT',
    feature_type=FeatureType.ordinal,
    tags=['good_feature', 'a new tag', 'ordinal'],
    attributes={'quality': 'awesome'}
)
f2 = Feature(
    name='my_second_feature',
    description='the second feature',
    feature_data_type='FLOAT',
    feature_type=FeatureType.continuous,
    tags=['not_as_good_feature', 'a new tag'],
    attributes={'quality': 'not as awesome'}
)
feats = [f1, f2]
# Creates a new version of the existing feature set splice.foo
feature_set = fs.update_feature_set(
    schema_name='splice',
    table_name='foo',
    primary_keys={'MOMENT_KEY': 'INT'},
    desc='test fset',
    features=feats
)

Returns

FeatureSet

update_training_view(name: str, sql: str, primary_keys: List[str], join_keys: List[str], ts_col: str, label_col: Optional[str] = None, desc: Optional[str] = None)None[source]

Creates a new version of a training view for use in generating training SQL. Use this function when you want to make changes to a training view without affecting its dependencies.

Parameters
  • name – The training view name.

  • sql

    (str) a SELECT statement that includes:

    • the primary key column(s) - uniquely identifying a training row/case

    • the inference timestamp column - timestamp column with which to join features (temporal join timestamp)

    • join key(s) - the references to the other feature tables’ primary keys (e.g. customer_id, location_id)

    • (optionally) the label expression - defining what the training set is trying to predict

  • primary_keys – (List[str]) The list of columns from the training SQL that identify the training row

  • ts_col – The timestamp column of the training SQL that identifies the inference timestamp

  • label_col – (Optional[str]) The optional label column from the training SQL.

  • replace – (Optional[bool]) Whether to replace an existing training view

  • join_keys – (List[str]) A list of join keys in the sql that are used to get the desired features in get_training_set

  • desc – (Optional[str]) An optional description of the training view

  • verbose – Whether or not to print the SQL before execution (default False)
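Since the view SQL must expose the primary keys, the timestamp column, the join keys, and (optionally) the label, it can help to check a column list before registering a view. The sketch below is a hypothetical pre-check, not part of the library: missing_view_columns works on a plain list of selected column names, and comparing names case-insensitively is an assumption.

```python
from typing import List, Optional


def missing_view_columns(selected_columns: List[str],
                         primary_keys: List[str],
                         join_keys: List[str],
                         ts_col: str,
                         label_col: Optional[str] = None) -> List[str]:
    """Hypothetical helper: required columns absent from the SELECT list."""
    available = {c.lower() for c in selected_columns}
    required = list(primary_keys) + list(join_keys) + [ts_col]
    if label_col is not None:
        required.append(label_col)
    missing: List[str] = []
    for col in required:
        # Report each missing column once, in the order it was required.
        if col.lower() not in available and col not in missing:
            missing.append(col)
    return missing


# Made-up column names, for illustration only.
gaps = missing_view_columns(
    selected_columns=["customer_id", "event_ts", "churned"],
    primary_keys=["customer_id"],
    join_keys=["customer_id", "location_id"],
    ts_col="event_ts",
    label_col="churned",
)
```

Here the view would be missing the location_id join key, the kind of gap that later causes get_training_set to error when features keyed on that column are requested.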

Returns

None

splicemachine.features.feature_set

This describes the Python representation of a Feature Set. A feature set is a database table that contains Features and their metadata. The Feature Set class is mostly used internally but can be used by the user to see the available Features in the given Feature Set, to see the table and schema name it is deployed to (if it is deployed), and to deploy the feature set (which can also be done directly through the Feature Store). Feature Sets are unique by their schema.table name, as they exist in the Splice Machine database as a SQL table. They are case insensitive. To see the full contents of your Feature Set, you can print, return, or .__dict__ your Feature Set object.
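The uniqueness rule for Feature Sets (unique by schema.table, case insensitive) can be illustrated with a normalized lookup key. The helper feature_set_key is hypothetical, and normalizing to upper case is an assumption of this sketch rather than a statement about how the library stores names.

```python
def feature_set_key(schema_name: str, table_name: str) -> str:
    """Hypothetical helper: a case-insensitive identity for a feature set.

    Upper-casing is an arbitrary normalization chosen for this sketch.
    """
    return f"{schema_name}.{table_name}".upper()


# Two spellings that refer to the same feature set under this rule.
same = feature_set_key("splice", "foo") == feature_set_key("SPLICE", "Foo")
```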

class FeatureSet(*, splice_ctx: Optional[splicemachine.spark.context.PySpliceContext] = None, table_name, schema_name, description, primary_keys: Dict[str, str], feature_set_id=None, deployed: bool = False, **kwargs)[source]

Bases: object

is_deployed()[source]

Returns whether or not this Feature Set has been deployed (the schema.table has been created in the database)

Returns

(bool) True if the Feature Set is deployed

splicemachine.features.Feature

This describes the Python representation of a Feature. A Feature is a column of a Feature Set table with particular metadata. A Feature is the smallest unit in the Feature Store, and each Feature within a Feature Set is individually tracked for changes to enable full time travel and point-in-time consistent training datasets. Features’ names are unique and case insensitive. To see the full contents of your Feature, you can print, return, or .__dict__ your Feature object.

class Feature(*, name, description, feature_data_type, feature_type, tags, attributes, feature_set_id=None, feature_id=None, **kwargs)[source]

Bases: object

is_categorical()[source]

Returns whether the type of this feature is categorical

is_continuous()[source]

Returns whether the type of this feature is continuous

is_ordinal()[source]

Returns whether the type of this feature is ordinal
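One common use of these three predicates is splitting features by type before preprocessing. The sketch below uses stand-in classes, SketchFeatureType and SketchFeature, which are hypothetical and defined only so the example runs without the library; they imitate the method names above but are not the real Feature or FeatureType.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class SketchFeatureType(Enum):
    """Stand-in enum for this sketch only (not the library's FeatureType)."""
    categorical = "categorical"
    ordinal = "ordinal"
    continuous = "continuous"


@dataclass
class SketchFeature:
    """Stand-in for Feature exposing the same three predicates."""
    name: str
    feature_type: SketchFeatureType

    def is_categorical(self) -> bool:
        return self.feature_type is SketchFeatureType.categorical

    def is_ordinal(self) -> bool:
        return self.feature_type is SketchFeatureType.ordinal

    def is_continuous(self) -> bool:
        return self.feature_type is SketchFeatureType.continuous


def group_by_type(features: List[SketchFeature]) -> Dict[str, List[str]]:
    """Group feature names by type using only the is_* predicates."""
    groups: Dict[str, List[str]] = {"categorical": [], "ordinal": [], "continuous": []}
    for f in features:
        if f.is_categorical():
            groups["categorical"].append(f.name)
        elif f.is_ordinal():
            groups["ordinal"].append(f.name)
        elif f.is_continuous():
            groups["continuous"].append(f.name)
    return groups


groups = group_by_type([
    SketchFeature("my_first_feature", SketchFeatureType.ordinal),
    SketchFeature("my_second_feature", SketchFeatureType.continuous),
])
```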

splicemachine.features.training_view

This describes the Python representation of a Training View. A Training View is a SQL statement defining an event of interest, and metadata around how to create a training dataset with that view. To see the full contents of your Training View, you can print, return, or .__dict__ your Training View object.

class TrainingView(*, pk_columns: List[str], ts_column, label_column, sql_text, name, description, view_id=None, view_version=None, **kwargs)[source]

Bases: object

Module contents