splicemachine.features package¶
Submodules¶
splicemachine.features.feature_store module¶
This Module contains the classes and APIs for interacting with the Splice Machine Feature Store.
-
class FeatureStore(splice_ctx: Optional[splicemachine.spark.context.PySpliceContext] = None)[source]¶
Bases: object
-
create_aggregation_feature_set_from_source(source_name: str, schema_name: str, table_name: str, start_time: datetime.datetime, schedule_interval: str, aggregations: List[splicemachine.features.pipelines.feature_aggregation.FeatureAggregation], backfill_start_time: Optional[datetime.datetime] = None, backfill_interval: Optional[str] = None, description: Optional[str] = None, run_backfill: Optional[bool] = True)[source]¶ Creates a temporal aggregation feature set by creating a pipeline linking a Source to a feature set. Sources are created with
features.FeatureStore.create_source(). Provided aggregations will generate the features for the feature set. This will create the feature set along with the aggregation calculations that create its features.
- Parameters
source_name – The name of the source created via create_source
schema_name – The schema name of the feature set
table_name – The table name of the feature set
start_time – The start time for the pipeline to run
schedule_interval – The frequency with which to run the pipeline.
aggregations – The list of FeatureAggregations to apply to the column names of the source SQL statement
backfill_start_time – The datetime representing the earliest point in time to get data from when running backfill
backfill_interval – The “sliding window” interval to increase each timepoint by when performing backfill
run_backfill – Whether or not to run backfill when calling this function. Default True (per the signature above). If this is True, backfill_start_time and backfill_interval MUST BE SET
- Returns
(FeatureSet) the created Feature Set
- Example
    from splicemachine.features.pipelines import AggWindow, FeatureAgg, FeatureAggregation
    from datetime import datetime

    # fs is an existing FeatureStore instance
    source_name = 'CUSTOMER_RFM'
    fs.create_source(
        name=source_name,
        sql='SELECT * FROM RETAIL_RFM.CUSTOMER_CATEGORY_ACTIVITY',
        event_ts_column='INVOICEDATE',
        update_ts_column='LAST_UPDATE_TS',
        primary_keys=['CUSTOMERID']
    )

    start_time = datetime.today()
    schedule_interval = AggWindow.get_window(5, AggWindow.DAY)
    backfill_start = datetime.strptime('2002-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
    backfill_interval = schedule_interval

    fs.create_aggregation_feature_set_from_source(
        source_name, 'RETAIL_FS', 'AUTO_RFM',
        start_time=start_time,
        schedule_interval=schedule_interval,
        backfill_start_time=backfill_start,
        backfill_interval=backfill_interval,
        aggregations=[
            FeatureAggregation(feature_name_prefix='AR_CLOTHING_QTY', column_name='CLOTHING_QTY',
                               agg_functions=['sum', 'max'], agg_windows=['1d', '2d', '90d'],
                               agg_default_value=0.0),
            FeatureAggregation(feature_name_prefix='AR_DELICATESSEN_QTY', column_name='DELICATESSEN_QTY',
                               agg_functions=['avg'], agg_windows=['1d', '2d', '2w'],
                               agg_default_value=11.5),
            FeatureAggregation(feature_name_prefix='AR_GARDEN_QTY', column_name='GARDEN_QTY',
                               agg_functions=['count', 'avg'], agg_windows=['30d', '90d', '1q'],
                               agg_default_value=8)
        ]
    )
This will create, deploy and return a FeatureSet called ‘RETAIL_FS.AUTO_RFM’. The Feature Set will have 15 features:
6 for the AR_CLOTHING_QTY prefix (sum & max over provided agg windows)
3 for the AR_DELICATESSEN_QTY prefix (avg over provided agg windows)
6 for the AR_GARDEN_QTY prefix (count & avg over provided agg windows)
A Pipeline is also created and scheduled in Airflow that feeds the Feature Set every 5 days from the Source CUSTOMER_RFM. Backfill will also occur, reading data from the source as of ‘2002-01-01 00:00:00’ with a 5 day window.
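The feature count in the example above is the number of aggregation functions multiplied by the number of windows, summed over the prefixes. The sketch below reproduces that arithmetic in plain Python. The `AggSpec` tuple and the generated name pattern `<PREFIX>_<FUNCTION>_<WINDOW>` are illustrative assumptions, not the library's actual classes or naming scheme.

```python
from itertools import product
from typing import List, NamedTuple


class AggSpec(NamedTuple):
    """Hypothetical stand-in for FeatureAggregation (illustration only)."""
    prefix: str
    functions: List[str]
    windows: List[str]


def expand(spec: AggSpec) -> List[str]:
    # One feature per (function, window) pair; the name pattern is assumed.
    return [f"{spec.prefix}_{fn}_{w}".upper()
            for fn, w in product(spec.functions, spec.windows)]


specs = [
    AggSpec("AR_CLOTHING_QTY", ["sum", "max"], ["1d", "2d", "90d"]),
    AggSpec("AR_DELICATESSEN_QTY", ["avg"], ["1d", "2d", "2w"]),
    AggSpec("AR_GARDEN_QTY", ["count", "avg"], ["30d", "90d", "1q"]),
]

# Features per prefix, and the total across the whole feature set.
counts = {s.prefix: len(expand(s)) for s in specs}
total = sum(counts.values())
```

Here `counts` holds 6, 3 and 6 for the three prefixes, which gives the 15 features listed above.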
-
create_feature(schema_name: str, table_name: str, name: str, feature_data_type: str, feature_type: str, desc: Optional[str] = None, tags: Optional[List[str]] = None, attributes: Optional[Dict[str, str]] = None)[source]¶ Add a feature to a feature set
- Parameters
schema_name – The feature set schema
table_name – The feature set table name to add the feature to
name – The feature name
feature_data_type – The datatype of the feature. Must be a valid SQL datatype
feature_type –
splicemachine.features.FeatureType of the feature. The available types are from the FeatureType class: FeatureType.[categorical, ordinal, continuous]. You can see available feature types by running
    from splicemachine.features import FeatureType
    print(FeatureType.get_valid())
desc – The (optional) feature description (default None)
tags – (optional) List of (str) tag words (default None)
attributes – (optional) Dict of (str) attribute key/value pairs (default None)
- Returns
Feature created
-
create_feature_set(schema_name: str, table_name: str, primary_keys: Dict[str, str], desc: Optional[str] = None, features: Optional[List[splicemachine.features.feature.Feature]] = None) → splicemachine.features.feature_set.FeatureSet[source]¶ Creates and returns a new feature set
- Parameters
schema_name – The schema under which to create the feature set table
table_name – The table name for this feature set
primary_keys – The primary key column(s) of this feature set
desc – The (optional) description
features – An optional list of features. If provided, the Features will be created with the Feature Set
- Example
    from splicemachine.features import FeatureType, Feature

    f1 = Feature(
        name='my_first_feature',
        description='the first feature',
        feature_data_type='INT',
        feature_type=FeatureType.ordinal,
        tags=['good_feature', 'a new tag', 'ordinal'],
        attributes={'quality': 'awesome'}
    )
    f2 = Feature(
        name='my_second_feature',
        description='the second feature',
        feature_data_type='FLOAT',
        feature_type=FeatureType.continuous,
        tags=['not_as_good_feature', 'a new tag'],
        attributes={'quality': 'not as awesome'}
    )
    feats = [f1, f2]

    # fs is an existing FeatureStore instance
    feature_set = fs.create_feature_set(
        schema_name='splice',
        table_name='foo',
        primary_keys={'MOMENT_KEY': "INT"},
        desc='test fset',
        features=feats
    )
- Returns
FeatureSet
-
create_source(name: str, sql: str, event_ts_column: datetime.datetime, update_ts_column: datetime.datetime, primary_keys: List[str])[source]¶ Creates, validates, and stores a source in the Feature Store that can be used to create a Pipeline that feeds a feature set
- Example
    fs.create_source(
        name='CUSTOMER_RFM',
        sql='SELECT * FROM RETAIL_RFM.CUSTOMER_CATEGORY_ACTIVITY',
        event_ts_column='INVOICEDATE',
        update_ts_column='LAST_UPDATE_TS',
        primary_keys=['CUSTOMERID']
    )
- Parameters
name – The name of the source. This must be unique across the feature store
sql – The SQL statement that returns the base result set to be used in future aggregation pipelines
event_ts_column – The column of the source query that determines the time of the event (row) being described. This is not necessarily the time the record was recorded, but the time the event itself occurred.
update_ts_column – The column that indicates the time when the record was last updated. When scheduled pipelines run, they will filter on this column to get only the records that have not been queried before.
primary_keys – The list of columns in the source SQL that uniquely identifies each row. These become the primary keys of the feature set(s) that is/are eventually created from this source.
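To illustrate how update_ts_column supports incremental runs, the sketch below filters in-memory rows down to those updated since the previous run. It is plain Python over made-up rows. The helper `rows_since` and the strict "greater than the last run time" comparison are assumptions for illustration; the managed pipeline performs this filtering in SQL.

```python
from datetime import datetime
from typing import Dict, List


def rows_since(rows: List[Dict], update_ts_column: str, last_run: datetime) -> List[Dict]:
    # Keep only records updated after the previous pipeline run (assumed strict >).
    return [r for r in rows if r[update_ts_column] > last_run]


# Hypothetical source rows: the event time and the last-update time can differ.
rows = [
    {"CUSTOMERID": 1, "INVOICEDATE": datetime(2021, 1, 1), "LAST_UPDATE_TS": datetime(2021, 1, 2)},
    {"CUSTOMERID": 2, "INVOICEDATE": datetime(2021, 1, 3), "LAST_UPDATE_TS": datetime(2021, 1, 9)},
]

new_rows = rows_since(rows, "LAST_UPDATE_TS", last_run=datetime(2021, 1, 5))
```

Only the second row is picked up, because it was updated after the assumed last run on 2021-01-05.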
-
create_training_view(name: str, sql: str, primary_keys: List[str], join_keys: List[str], ts_col: str, label_col: Optional[str] = None, replace: Optional[bool] = False, desc: Optional[str] = None, verbose=False) → None[source]¶ Registers a training view for use in generating training SQL
- Parameters
name – The training view name. This must be unique among existing training views unless replace is True
sql –
(str) a SELECT statement that includes:
the primary key column(s) - uniquely identifying a training row/case
the inference timestamp column - timestamp column with which to join features (temporal join timestamp)
join key(s) - the references to the other feature tables’ primary keys (ie customer_id, location_id)
(optionally) the label expression - defining what the training set is trying to predict
primary_keys – (List[str]) The list of columns from the training SQL that identify the training row
ts_col – The timestamp column of the training SQL that identifies the inference timestamp
label_col – (Optional[str]) The optional label column from the training SQL.
replace – (Optional[bool]) Whether to replace an existing training view
join_keys – (List[str]) A list of join keys in the sql that are used to get the desired features in get_training_set
desc – (Optional[str]) An optional description of the training set
verbose – Whether or not to print the SQL before execution (default False)
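The inference timestamp (ts_col) drives the temporal join between training rows and feature values. One common reading of such a point-in-time join is that each training row receives the latest feature value recorded at or before its timestamp. The sketch below shows that lookup in plain Python. The helper `value_as_of` and the sample history are hypothetical, and the SQL the Feature Store actually generates may differ.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def value_as_of(history: List[Tuple[datetime, float]], ts: datetime) -> Optional[float]:
    """Return the latest value recorded at or before ts (history sorted by time)."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)  # number of entries with time <= ts
    return history[i - 1][1] if i else None


# Hypothetical feature history for one join-key value (e.g. one customer_id).
history = [
    (datetime(2021, 1, 1), 10.0),
    (datetime(2021, 2, 1), 12.5),
    (datetime(2021, 3, 1), 9.0),
]

feb_value = value_as_of(history, datetime(2021, 2, 15))
before_any = value_as_of(history, datetime(2020, 12, 31))
```

A row with an inference timestamp in mid-February gets the value written on 1 February, and a row that predates all history gets no value.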
- Returns
None
-
deploy_feature_set(schema_name: str, table_name: str)[source]¶ Deploys a feature set to the database. This persists the feature set’s existence. As of now, once deployed you cannot delete the feature set or add/delete features. The feature set must have already been created with
create_feature_set().
- Parameters
schema_name – The schema of the created feature set
table_name – The table of the created feature set
-
describe_feature_set(schema_name: str, table_name: str) → None[source]¶ Prints out a description of a given feature set, with all features in the feature set and whether the feature set is deployed
- Parameters
schema_name – feature set schema name
table_name – feature set table name
- Returns
None
-
describe_feature_sets() → None[source]¶ Prints out a description of all feature sets, with all features in the feature sets and whether each feature set is deployed
- Returns
None
-
describe_training_view(training_view: str) → None[source]¶ Prints out a description of a given training view, the ID, name, description and optional label
- Parameters
training_view – The training view name
- Returns
None
-
describe_training_views() → None[source]¶ Prints out a description of all training views, the ID, name, description and optional label
- Returns
None
-
display_feature_search(pandas_profile=True)[source]¶ Returns an interactive feature search that enables users to search for features and profiles the selected Feature. Two forms of this search exist: one for use inside the managed Splice Machine notebook environment, and one for standard Jupyter. This is because the managed Splice Jupyter environment has extra functionality that would not be present outside of it. The search will be automatically rendered depending on the environment.
- Parameters
pandas_profile – Whether to use pandas (True) or Spark (False) to profile the feature. If pandas is selected but the dataset is too large, it will fall back to Spark. Default True (pandas).
-
display_model_drift(schema_name: str, table_name: str, time_intervals: int, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None)[source]¶ Displays as many as time_intervals plots showing the distribution of the model prediction within each time period. Time periods are equal periods of time where predictions are present in the model table schema_name.table_name. Model predictions are first filtered to only those occurring after start_time if specified and before end_time if specified.
- Parameters
schema_name – schema where the model table resides
table_name – name of the model table
time_intervals – number of time intervals to plot
start_time – if specified, filters to only show predictions occurring after this date/time
end_time – if specified, filters to only show predictions occurring before this date/time
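As a rough illustration of how a time range can be divided into equal periods for plotting, the sketch below splits the span between two datetimes into time_intervals contiguous windows. The helper `equal_periods` is hypothetical, and how the library chooses period boundaries from the predictions actually present in the model table is not specified here.

```python
from datetime import datetime
from typing import List, Tuple


def equal_periods(start: datetime, end: datetime,
                  time_intervals: int) -> List[Tuple[datetime, datetime]]:
    """Split [start, end] into time_intervals equal, contiguous periods."""
    if time_intervals < 1:
        raise ValueError("time_intervals must be at least 1")
    step = (end - start) / time_intervals  # timedelta divided by int
    return [(start + i * step, start + (i + 1) * step) for i in range(time_intervals)]


periods = equal_periods(datetime(2021, 1, 1), datetime(2021, 1, 5), 4)
```

With a four-day range and four intervals, each period covers exactly one day.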
-
display_model_feature_drift(schema_name: str, table_name: str)[source]¶ Displays feature by feature comparison between the training set of the deployed model and the input feature values used with the model since deployment.
- Parameters
schema_name – name of database schema where model table is deployed
table_name – name of the model table
- Returns
None
-
feature_exists(name: str) → bool[source]¶ Returns if a feature exists or not
- Parameters
name – The feature name
- Returns
bool True if the feature exists, False otherwise
-
feature_set_exists(schema: str, table: str) → bool[source]¶ Returns if a feature set exists or not
- Parameters
schema – The feature set schema
table – The feature set table
- Returns
bool True if the feature set exists, False otherwise
-
get_backfill_intervals(schema_name: str, table_name: str) → List[datetime.datetime][source]¶ Gets the backfill intervals necessary for the parameterized backfill SQL obtained from the
features.FeatureStore.get_backfill_sql() function. This function will likely not be necessary as you can perform backfill at the time of feature set creation automatically.
- Parameters
schema_name – The schema name of the feature set
table_name – The table name of the feature set
- Returns
The list of datetimes necessary to parameterize the backfill SQL
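To make the idea of backfill timepoints concrete, the sketch below steps from a backfill start time toward the pipeline start time in fixed increments. It is a plain-Python approximation: the helper `backfill_timepoints`, the use of `timedelta` for the interval, and the exclusive upper bound are all assumptions, and the list returned by get_backfill_intervals may be computed differently.

```python
from datetime import datetime, timedelta
from typing import List


def backfill_timepoints(backfill_start: datetime, pipeline_start: datetime,
                        interval: timedelta) -> List[datetime]:
    """Timepoints from backfill_start up to (not including) pipeline_start."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    points = []
    current = backfill_start
    while current < pipeline_start:
        points.append(current)
        current += interval  # slide the window forward by one interval
    return points


points = backfill_timepoints(datetime(2002, 1, 1), datetime(2002, 1, 21), timedelta(days=5))
```

Each resulting timepoint would then be bound to the timestamp parameter of the backfill SQL, one execution per timepoint.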
-
get_backfill_sql(schema_name: str, table_name: str)[source]¶ Returns the necessary parameterized SQL statement to perform backfill on an Aggregate Feature Set. The Feature Set must have been deployed using the
features.FeatureStore.create_aggregation_feature_set_from_source() function, meaning there must be a Source and a Pipeline associated with it. This function will likely not be necessary as you can perform backfill at the time of feature set creation automatically. This SQL will be parameterized and need a timestamp to execute. You can get those timestamps with
features.FeatureStore.get_backfill_intervals(), called with the same parameters.
- Parameters
schema_name – The schema name of the feature set
table_name – The table name of the feature set
- Returns
The parameterized Backfill SQL
-
get_deployments(schema_name: Optional[str] = None, table_name: Optional[str] = None, training_set: Optional[str] = None, feature: Optional[str] = None, feature_set: Optional[str] = None)[source]¶ Returns a list of all (or specified) available deployments
- Parameters
schema_name – model schema name
table_name – model table name
training_set – training set name
feature – passing this in will return all deployments that used this feature
feature_set – passing this in will return all deployments that used this feature set
- Returns
List[Deployment] the list of Deployments as dicts
-
get_feature_details(name: str) → splicemachine.features.feature.Feature[source]¶ Returns a Feature and its detailed information
- Parameters
name – The feature name
- Returns
Feature
-
get_feature_primary_keys(features: List[str]) → Dict[str, List[str]][source]¶ Returns a dictionary mapping each individual feature to its primary key(s). This function is not yet implemented.
- Parameters
features – (List[str]) The list of features to get primary keys for
- Returns
Dict[str, List[str]] A mapping of {feature name: [pk1, pk2, etc]}
-
get_feature_sets(feature_set_names: Optional[List[str]] = None) → List[splicemachine.features.feature_set.FeatureSet][source]¶ Returns a list of available feature sets
- Parameters
feature_set_names – A list of feature set names in the format ‘{schema_name}.{table_name}’. If None, all FeatureSets are returned
- Returns
List[FeatureSet] the list of Feature Sets
-
get_feature_vector(features: List[Union[str, splicemachine.features.feature.Feature]], join_key_values: Dict[str, str], return_primary_keys=True, return_sql=False) → Union[str, pandas.core.frame.DataFrame][source]¶ Gets a feature vector given a list of Features and primary key values for their corresponding Feature Sets
- Parameters
features – List of str Feature names or Features
join_key_values – (dict) join key values to get the proper Feature values formatted as {join_key_column_name: join_key_value}
return_primary_keys – Whether to return the Feature Set primary keys in the vector. Default True
return_sql – Whether to return the SQL needed to get the vector or the values themselves. Default False
- Returns
Pandas Dataframe or str (SQL statement)
-
get_feature_vector_sql_from_training_view(training_view: str, features: List[Union[str, splicemachine.features.feature.Feature]]) → str[source]¶ Returns the parameterized feature retrieval SQL used for online model serving.
- Parameters
training_view – (str) The name of the registered training view
features –
(List[str]) the list of features from the feature store to be included in the training
- NOTE
This function will error if the view SQL is missing a join key required to retrieve the desired features
- Returns
(str) the parameterized feature vector SQL
-
get_features_by_name(names: Optional[List[str]] = None, as_list=False) → Union[List[splicemachine.features.feature.Feature], pandas.core.frame.DataFrame][source]¶ Returns a dataframe or list of features whose names are provided
- Parameters
names – The list of feature names
as_list – Whether or not to return a list of features. Default False
- Returns
pandas DataFrame or List[Feature] The list of Feature objects or a pandas DataFrame of features and their metadata (as given in the signature above). Note, these are not the Feature values, simply the describing metadata about the features. To create a training dataset with Feature values, see
features.FeatureStore.get_training_set() or features.FeatureStore.get_feature_dataset()
-
get_features_from_feature_set(schema_name: str, table_name: str) → List[splicemachine.features.feature.Feature][source]¶ Returns either a pandas DF of feature details or a List of features for a specified feature set. You can get features from multiple feature sets by concatenating the results of this call. For example, to get features from 2 feature sets, foo.bar1 and foo2.bar4:
features = fs.get_features_from_feature_set('foo','bar1') + fs.get_features_from_feature_set('foo2','bar4')
If you want a list of just the Feature NAMES (ie a List[str]) you can simply run:
    features = fs.get_features_from_feature_set('foo','bar1') + fs.get_features_from_feature_set('foo2','bar4')
    feature_names = [f.name for f in features]
- Parameters
schema_name – Feature Set schema name
table_name – Feature Set table name
- Returns
List of Features
-
get_pipeline_sql(schema_name: str, table_name: str)[source]¶ Returns the incremental pipeline SQL that feeds a feature set from a source (thus creating a pipeline). Pipelines are managed for you by default by Splice Machine via Airflow, but if you opt out of using the managed pipelines you can use this function to get the incremental SQL.
This SQL will be parameterized and need a timestamp to execute. You can get those timestamps with
features.FeatureStore.get_backfill_intervals(), called with the same parameters.
- Parameters
schema_name – The schema name of the feature set
table_name – The table name of the feature set
- Returns
The incremental Pipeline SQL
-
get_summary() → Dict[str, str][source]¶ This function returns a summary of the feature store including:
Number of feature sets
Number of deployed feature sets
Number of features
Number of deployed features
Number of training sets
Number of training views
Number of associated models - this is a count of the MLManager.RUNS table where the splice.model_name tag is set and the splice.feature_store.training_set parameter is set
Number of active (deployed) models (that have used the feature store for training)
Number of pending feature sets - this will require a new table featurestore.pending_feature_set_deployments and it will be a count of that
-
get_training_set(features: Union[List[splicemachine.features.feature.Feature], List[str]], current_values_only: bool = False, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None, label: Optional[str] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_type: str = 'spark', return_sql: bool = False, save_as: Optional[str] = None) → pyspark.sql.dataframe.DataFrame[source]¶ Gets a set of feature values across feature sets that is not time dependent (ie for non time series clustering). This feature dataset will be treated and tracked implicitly the same way a training_dataset is tracked from
features.FeatureStore.get_training_set(). The dataset’s metadata and features used will be tracked in mlflow automatically (see get_training_set for more details).
- NOTE
The way point-in-time correctness is guaranteed here is by choosing one of the Feature Sets as the "anchor" dataset. This means that the points in time that the query is based off of will be the points in time in which the anchor Feature Set recorded changes. The anchor Feature Set is the Feature Set that contains the superset of all primary key columns across all Feature Sets from all Features provided. If more than 1 Feature Set has the superset of all Feature Sets, the Feature Set with the most primary keys is selected. If more than 1 Feature Set has the same maximum number of primary keys, the Feature Set is chosen by alphabetical order (schema_name, table_name).
- Parameters
features –
List of Features or strings of feature names
- NOTE
The Features Sets which the list of Features come from must have common join keys, otherwise the function will fail. If there is no common join key, it is recommended to create a Training View to specify the join conditions.
current_values_only – If you only want the most recent values of the features, set this to true. Otherwise, all history will be returned. Default False
start_time – How far back in history you want Feature values. If not specified (and current_values_only is False), all history will be returned. This parameter only takes effect if current_values_only is False.
end_time – The most recent values for each selected Feature. This will be the cutoff time, such that any Feature values that were updated after this point in time won’t be selected. If not specified (and current_values_only is False), Feature values up to the moment in time you call the function (now) will be retrieved. This parameter only takes effect if current_values_only is False.
label – An optional label to specify for the training set. If specified, the feature set of that feature will be used as the “anchor” feature set, meaning all point in time joins will be made to the timestamps of that feature set. This feature will also be recorded as a “label” feature for this particular training set (but not others in the future, unless this label is again specified).
return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)
return_ts_col – bool Whether or not the returned sql should include the timestamp column
return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe.
save_as – Whether or not to save this Training Set (metadata) in the feature store for reproducibility. This enables you to version and persist the metadata for a training set of a specific model development. If you are using the Splice Machine managed MLFlow Service, this will be fully automated and managed for you upon model deployment, however you can still use this parameter to customize the name of the training set (it will default to the run id). If you are NOT using Splice Machine’s mlflow service, this is a useful way to link specific modeling experiments to the exact training sets used. This DOES NOT persist the training set itself, rather the metadata required to reproduce the identical training set.
- Returns
Spark DF or SQL statement necessary to generate the Training Set
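The anchor selection rule described in the note for this function can be sketched as a small pure-Python function. The input shape (a dict from (schema_name, table_name) to that feature set's primary-key columns) and the helper name `pick_anchor` are assumptions for illustration, and this is not the library's implementation.

```python
from typing import Dict, FrozenSet, Optional, Tuple

FeatureSetKey = Tuple[str, str]  # (schema_name, table_name)


def pick_anchor(feature_sets: Dict[FeatureSetKey, FrozenSet[str]]) -> Optional[FeatureSetKey]:
    """Pick the anchor feature set given each set's primary-key columns."""
    # Union of every primary-key column across all feature sets involved.
    all_keys = frozenset().union(*feature_sets.values())
    # Candidates are the feature sets whose keys cover that whole union.
    candidates = [name for name, pks in feature_sets.items() if pks >= all_keys]
    if not candidates:
        return None  # no feature set covers all keys
    # Most primary keys first, then alphabetical by (schema_name, table_name).
    # In this simplified model every candidate has the same key count, so the
    # alphabetical rule is what decides ties.
    return min(candidates, key=lambda name: (-len(feature_sets[name]), name))


feature_sets = {
    ("RETAIL_FS", "CUSTOMER"): frozenset({"CUSTOMERID"}),
    ("RETAIL_FS", "CUSTOMER_STORE"): frozenset({"CUSTOMERID", "STOREID"}),
    ("RETAIL_FS", "BASKET"): frozenset({"CUSTOMERID", "STOREID"}),
}
anchor = pick_anchor(feature_sets)
```

In this made-up example two feature sets cover both key columns, so the alphabetical tie-break selects RETAIL_FS.BASKET.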
-
get_training_set_by_name(name, version: Optional[int] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_sql=False, return_type: str = 'spark')[source]¶ Returns a Spark DF (or SQL) of an EXISTING Training Set (one that was saved with the save_as parameter in
get_training_set() or get_training_set_from_view()). This is useful if you’ve deployed a model with a Training Set and
- Parameters
name – Training Set name
version – The version of this training set. If not set, it will grab the newest version
return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)
return_ts_col – bool Whether or not the returned sql should include the timestamp column
return_sql – [DEPRECATED] (Optional[bool]) Return the SQL statement (str) instead of the Spark DF. Defaults False
return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe.
- Returns
Spark DF or SQL
-
get_training_set_features(training_set: Optional[str] = None)[source]¶ Returns a list of all features from an available Training Set, as well as details about that Training Set
- Parameters
training_set – training set name
- Returns
TrainingSet as dict
-
get_training_set_from_deployment(schema_name: str, table_name: str, label: Optional[str] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_type: str = 'spark')[source]¶ Reads Feature Store metadata to rebuild the original training data set used for the given deployed model.
- Parameters
schema_name – model schema name
table_name – model table name
label – An optional label to specify for the training set. If specified, the feature set of that feature will be used as the “anchor” feature set, meaning all point in time joins will be made to the timestamps of that feature set. This feature will also be recorded as a “label” feature for this particular training set (but not others in the future, unless this label is again specified).
return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)
return_ts_col – bool Whether or not the returned sql should include the timestamp column
return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe.
- Returns
SparkDF the Training Frame
-
get_training_set_from_view(training_view: str, features: Optional[Union[List[splicemachine.features.feature.Feature], List[str]]] = None, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None, return_pk_cols: bool = False, return_ts_col: bool = False, return_sql: bool = False, return_type: str = 'spark', save_as: Optional[str] = None) → pyspark.sql.dataframe.DataFrame[source]¶ Returns the training set as a Spark Dataframe from a Training View. When a user calls this function (assuming they have registered the feature store with mlflow using
register_feature_store()) the training dataset’s metadata will be tracked in mlflow automatically. The following will be tracked:
Training View
Selected features
Start time
End time
This tracking will occur in the current run (if there is an active run) or in the next run that is started after calling this function (if no run is currently active).
- Parameters
training_view – (str) The name of the registered training view
features –
(List[str] OR List[Feature]) the list of features from the feature store to be included in the training. If a list of strings is passed in it will be converted to a list of Feature. If not provided will return all available features.
- NOTE
This function will error if the view SQL is missing a join key required to retrieve the desired features
start_time –
(Optional[datetime]) The start time of the query (how far back in the data to start). Default None
- NOTE
If start_time is None, query will start from beginning of history
end_time –
(Optional[datetime]) The end time of the query (how far recent in the data to get). Default None
- NOTE
If end_time is None, query will get most recently available data
return_pk_cols – bool Whether or not the returned sql should include the primary key column(s)
return_ts_col – bool Whether or not the returned sql should include the timestamp column
return_sql – [DEPRECATED] (Optional[bool]) Return the SQL statement (str) instead of the Spark DF. Defaults False
return_type – How the data should be returned. If not specified, a Spark DF will be returned. Available arguments are: ‘spark’, ‘pandas’, ‘json’, ‘sql’. ‘sql’ will return the SQL necessary to generate the dataframe.
save_as – Whether or not to save this Training Set (metadata) in the feature store for reproducibility. This enables you to version and persist the metadata for a training set of a specific model development. If you are using the Splice Machine managed MLFlow Service, this will be fully automated and managed for you upon model deployment, however you can still use this parameter to customize the name of the training set (it will default to the run id). If you are NOT using Splice Machine’s mlflow service, this is a useful way to link specific modeling experiments to the exact training sets used. This DOES NOT persist the training set itself, rather the metadata required to reproduce the identical training set.
- Returns
Optional[SparkDF, str] The Spark dataframe of the training set or the SQL that is used to generate it (for debugging)
-
get_training_view(training_view: str) → splicemachine.features.training_view.TrainingView[source]¶ Gets a training view by name
- Parameters
training_view – Training view name
- Returns
TrainingView
-
get_training_view_features(training_view: str) → List[splicemachine.features.feature.Feature][source]¶ Returns the available features for the given training view name
- Parameters
training_view – The name of the training view
- Returns
A list of available Feature objects
-
get_training_view_id(name: str) → int[source]¶ Returns the unique view ID from a name
- Parameters
name – The training view name
- Returns
The training view id
-
get_training_views(_filter: Optional[Dict[str, Union[int, str]]] = None) → List[splicemachine.features.training_view.TrainingView][source]¶ Returns a list of all available training views with an optional filter
- Parameters
_filter – Dictionary containing the filter keyword (label, description etc) and the value to filter on. If None, will return all TrainingViews
- Returns
List[TrainingView]
-
link_training_set_to_mlflow(features: Union[List[splicemachine.features.feature.Feature], List[str]], create_time: datetime.datetime, start_time: Optional[datetime.datetime] = None, end_time: Optional[datetime.datetime] = None, tvw: Optional[splicemachine.features.training_view.TrainingView] = None, current_values_only: bool = False, training_set_id: Optional[int] = None, training_set_version: Optional[int] = None, training_set_name: Optional[str] = None)[source]¶
-
list_training_sets() → Dict[str, Optional[str]][source]¶ Returns a dictionary of the available training sets, mapping name -> description. If there is no description, the value will be an empty string
- Returns
Dict[str, Optional[str]]
-
login_fs(username, password)[source]¶ Logs in to the Feature Store using basic auth. The username and password correspond to your Splice Machine database user and password. If you are running outside of the managed Splice Machine Cloud Service, you must either call this function or set_token before calling any other function in the feature store, or set the SPLICE_JUPYTER_USER and SPLICE_JUPYTER_PASSWORD environment variables before creating your FeatureStore object.
- Parameters
username – Username
password – Password
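When credentials are supplied through the environment, it can help to check that both variables are present before creating the FeatureStore object. The sketch below does only that. The helper `credentials_from_env` is hypothetical and not part of the library, and only the two variable names come from the description above.

```python
import os
from typing import Mapping, Optional, Tuple


def credentials_from_env(env: Mapping[str, str] = os.environ) -> Optional[Tuple[str, str]]:
    """Return (user, password) if both variables are set and non-empty, else None."""
    user = env.get("SPLICE_JUPYTER_USER")
    password = env.get("SPLICE_JUPYTER_PASSWORD")
    if user and password:
        return user, password
    return None


# Demonstrated with explicit dicts so the example does not depend on the real environment.
creds = credentials_from_env({"SPLICE_JUPYTER_USER": "alice", "SPLICE_JUPYTER_PASSWORD": "s3cret"})
missing = credentials_from_env({"SPLICE_JUPYTER_USER": "alice"})
```

If the helper returns a pair, the same values could instead be passed explicitly as `fs.login_fs(*creds)`.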
-
register_splice_context(splice_ctx: splicemachine.spark.context.PySpliceContext) → None[source]¶
-
remove_feature(name: str)[source]¶ Removes a feature. This will run 2 checks:
See if the feature exists.
See if the feature belongs to a feature set that has already been deployed.
If the feature does not exist, or it belongs to a deployed feature set, this function will throw an error explaining which check has failed.
- Parameters
name – The feature name
-
remove_feature_set(schema_name: str, table_name: str, purge: bool = False) → None[source]¶ Deletes a feature set if appropriate. You can currently delete a feature set in two scenarios:
1. The feature set has not been deployed
2. The feature set has been deployed, but not linked to any training sets
If neither of these conditions holds, this will fail.
Optionally set purge=True to force delete the feature set and all of the associated Training Sets using the Feature Set. ONLY USE IF YOU KNOW WHAT YOU ARE DOING. This will delete Training Sets, but will still fail if there is an active deployment with this feature set. That cannot be overridden.
- Parameters
schema_name – The Feature Set Schema
table_name – The Feature Set Table
purge – Whether to force delete training sets that use the feature set (that are not used in deployments)
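- Example
The removal rules above can be read as a small decision function. This is only a sketch of the documented behavior, not the library's implementation; the function name and its arguments are hypothetical and exist purely for illustration.

```python
def can_remove_feature_set(
    deployed: bool,
    linked_training_sets: int,
    in_active_deployment: bool,
    purge: bool = False,
) -> bool:
    """Return True if the documented rules would allow removal (illustrative)."""
    # Scenario 1: the feature set has not been deployed.
    if not deployed:
        return True
    # Scenario 2: deployed, but not linked to any training sets.
    if linked_training_sets == 0:
        return True
    # An active deployment blocks removal even with purge=True.
    if in_active_deployment:
        return False
    # Deployed and linked to training sets: only a purge may proceed.
    return purge
```

Under this reading, `purge=True` only widens the set of removable feature sets to those whose training sets are not used in a deployment.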
-
remove_source(name: str)[source]¶ Removes a Source by name. You cannot remove a Source that has child dependencies (Feature Sets). If there is a Feature Set that is deployed and a Pipeline that is feeding it, you cannot delete the Source until you remove the Feature Set (which in turn removes the Pipeline)
- Parameters
name – The Source name
-
remove_training_view(name: str)[source]¶ This removes a training view if it is not being used by any currently deployed models. NOTE: Once this training view is removed, you will not be able to deploy any models that were trained using this view
- Parameters
name – The view name
-
run_feature_elimination(df, features: List[Union[str, splicemachine.features.feature.Feature]], label: str = 'label', n: int = 10, verbose: int = 0, model_type: str = 'classification', step: int = 1, log_mlflow: bool = False, mlflow_run_name: Optional[str] = None, return_importances: bool = False)[source]¶ Runs feature elimination using a Spark decision tree on the dataframe passed in. Optionally logs results to mlflow
- Parameters
df – The dataframe with features and label
features – The list of feature names (or Feature objects) to run elimination on
label – The label column name
n – The number of features desired. Default 10
verbose – The level of verbosity. 0 indicates no printing. 1 indicates printing remaining features after each round. 2 indicates printing features and relative importances after each round. Default 0
model_type – Whether the model to test with will be a regression or classification model. Default classification
log_mlflow – Whether or not to log results to mlflow as nested runs. Default False
mlflow_run_name – The name of the parent run under which all subsequent runs will live. The children run names will be {mlflow_run_name}_{num_features}_features, e.g. testrun_5_features, testrun_4_features, etc.
- Returns
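- Example
A minimal sketch of the general elimination loop in plain Python, not the library's Spark implementation. The `score` callback is a hypothetical stand-in for fitting a decision tree and reading its feature importances. Recording one child run name per round, starting from the full feature count, is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Tuple


def eliminate_features(
    features: List[str],
    score: Callable[[List[str]], Dict[str, float]],
    n: int = 10,
    step: int = 1,
    run_name: str = "testrun",
) -> List[Tuple[str, List[str]]]:
    """Drop the least important features until at most `n` remain.

    Returns one (child_run_name, remaining_features) pair per round,
    using the {mlflow_run_name}_{num_features}_features naming pattern.
    """
    if step < 1:
        raise ValueError("step must be at least 1")

    remaining = list(features)
    rounds = []
    while True:
        rounds.append((f"{run_name}_{len(remaining)}_features", list(remaining)))
        if len(remaining) <= n:
            break
        # Hypothetical scoring step: importance per remaining feature.
        importances = score(remaining)
        # Never drop below the requested number of features.
        drop_count = min(step, len(remaining) - n)
        ranked = sorted(remaining, key=lambda f: importances[f])
        to_drop = set(ranked[:drop_count])
        remaining = [f for f in remaining if f not in to_drop]
    return rounds
```

Each round removes the `step` lowest-scoring features, so a larger `step` trades precision for fewer model fits.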
-
set_feature_store_url(url: str)[source]¶ Sets the Feature Store URL. You must call this before calling any feature store functions, or set the FS_URL environment variable before creating your Feature Store object
- Parameters
url – The Feature Store URL
-
set_token(token)[source]¶ Function to log in to the Feature Store using JWT. This corresponds to your Splice Machine database user’s JWT token. If you are running outside of the managed Splice Machine Cloud Service, you must do one of the following before calling any functions in the feature store: call this function, call login_fs, or set the SPLICE_JUPYTER_TOKEN environment variable before creating your FeatureStore object.
- Parameters
token – JWT Token
-
training_view_exists(name: str) → bool[source]¶ Returns whether or not a training view exists
- Parameters
name – The training view name
- Returns
(bool) True if the training view exists, False otherwise
-
update_feature_metadata(name: str, desc: Optional[str] = None, tags: Optional[List[str]] = None, attributes: Optional[Dict[str, str]] = None)[source]¶ Update the metadata of a feature
- Parameters
name – The feature name
desc – The (optional) feature description (default None)
tags – (optional) List of (str) tag words (default None)
attributes – (optional) Dict of (str) attribute key/value pairs (default None)
- Returns
updated Feature
-
splicemachine.features.feature_set¶
This describes the Python representation of a Feature Set. A feature set is a database table that contains Features and their metadata. The Feature Set class is mostly used internally but can be used by the user to see the available Features in the given Feature Set, to see the table and schema name it is deployed to (if it is deployed), and to deploy the feature set (which can also be done directly through the Feature Store). Feature Sets are unique by their schema.table name, as they exist in the Splice Machine database as a SQL table. They are case insensitive. To see the full contents of your Feature Set, you can print, return, or .__dict__ your Feature Set object.
-
class
FeatureSet(*, splice_ctx: Optional[splicemachine.spark.context.PySpliceContext] = None, table_name, schema_name, description, primary_keys: Dict[str, str], feature_set_id=None, deployed: bool = False, **kwargs)[source]¶ Bases:
object
splicemachine.features.Feature¶
This describes the Python representation of a Feature. A Feature is a column of a Feature Set table with particular metadata. A Feature is the smallest unit in the Feature Store, and each Feature within a Feature Set is individually tracked for changes to enable full time travel and point-in-time consistent training datasets. Features’ names are unique and case insensitive. To see the full contents of your Feature, you can print, return, or .__dict__ your Feature object.
splicemachine.features.training_view¶
This describes the Python representation of a Training View. A Training View is a SQL statement defining an event of interest, and metadata around how to create a training dataset with that view. To see the full contents of your Training View, you can print, return, or .__dict__ your Training View object.