Configuration

The constructor of the SapientML class accepts different parameters depending on which plugins are installed. This section lists the parameters you can pass to the constructor for each model_type.

Model types

sapientml provides a plugin mechanism for generating source code that uses machine learning models and preprocessing components differently from the original sapientml algorithm. Each plugin has a unique model_type, and users select one by passing it as a parameter to the constructor of the SapientML class. The default value of model_type is sapientml, which is provided by the sapientml_core plugin.

The available model_type values and their installation commands are listed below.

  • sapientml: pip install sapientml
  • fujitsu-automl: pip install --extra-index-url https://sapientml.azure-api.net/pypi fujitsu-automl
  • timeseries-forcast: pip install --extra-index-url https://sapientml.azure-api.net/pypi fujitsu-automl-timeseries
  • timeseries-static-classification: pip install --extra-index-url https://sapientml.azure-api.net/pypi fujitsu-automl-timeseries
  • timeseries-regression: pip install --extra-index-url https://sapientml.azure-api.net/pypi fujitsu-automl-timeseries
  • model-selector: pip install --extra-index-url https://sapientml.azure-api.net/pypi fujitsu-automl-selector
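
As a small illustration of the list above, the mapping from model_type to its installation command can be kept in a lookup table. This is only a sketch built from the entries listed here; the helper name install_command is hypothetical and is not part of sapientml.

```python
# Lookup table copied from the list of available model_type values above.
EXTRA_INDEX = "https://sapientml.azure-api.net/pypi"

PACKAGES = {
    "sapientml": "sapientml",
    "fujitsu-automl": "fujitsu-automl",
    "timeseries-forcast": "fujitsu-automl-timeseries",
    "timeseries-static-classification": "fujitsu-automl-timeseries",
    "timeseries-regression": "fujitsu-automl-timeseries",
    "model-selector": "fujitsu-automl-selector",
}


def install_command(model_type: str) -> str:
    """Return the pip command that installs the plugin for model_type."""
    if model_type not in PACKAGES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    package = PACKAGES[model_type]
    if package == "sapientml":
        # The default plugin is installed from the regular package index.
        return "pip install sapientml"
    return f"pip install --extra-index-url {EXTRA_INDEX} {package}"


print(install_command("timeseries-regression"))
```

Note that the three timeseries model types share a single package, so installing it once makes all of them available.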

Parameters for sapientml

  • target_columns (list[str]):
    • Names of target columns.
  • task_type ('classification', 'regression', or None) = None:
    • Task type, either classification or regression. If set to None, the task type is suggested automatically.
  • adaptation_metric (str) = 'f1' if task_type is 'classification', 'r2' if 'regression':
    • Metric for evaluation. f1, auc, ROC_AUC, accuracy, Gini, LogLoss, MCC (Matthews correlation coefficient), QWK (Quadratic weighted kappa) are available for classification. r2, RMSLE, RMSE, MAE are available for regression.
  • split_method ('random', 'time', or 'group') = 'random':
    • Method of train-test split. random splits the rows randomly. time requires split_column_name; it sorts the rows by that column and then splits the data. group requires split_column_name; it splits the data so that rows with the same value of split_column_name are not placed in both training and test data.
  • split_seed (int) = 17:
    • Random seed for train-test split. Ignored when split_method='time'.
  • split_train_size (float) = 0.75:
    • The ratio of training size to input data. Ignored when split_method='time'.
  • split_column_name (str or None) = None:
    • Name of the column used to split. Ignored when split_method='random'.
  • time_split_num (int) = 5:
    • Passed to n_splits of TimeSeriesSplit. Valid only when split_method='time'.
  • time_split_index (int) = 4:
    • The index of the split from TimeSeriesSplit. Valid only when split_method='time'.
  • split_stratification (bool or None) = None:
    • Whether to perform stratification in the train-test split. Valid only when task_type='classification'.
  • initial_timeout (int) = 600:
    • Time limit to execute each generated script. Ignored when hyperparameter_tuning=True and hyperparameter_tuning_timeout is set.
  • timeout_for_test (int) = 0:
    • Time limit to execute the test script (final_script) and visualization.
  • cancel (CancellationToken or None) = None:
    • Object to interrupt evaluations.
  • project_name (str or None) = None:
    • Project name.
  • debug (bool) = False:
    • Whether to enable debug mode.
  • use_pos_list (list[str]) = ["名詞", "動詞", "助動詞", "形容詞", "副詞"]:
    • List of parts of speech to be used during text analysis. This parameter is used for Japanese text analysis. Select from the following parts of speech: "名詞" (noun), "動詞" (verb), "形容詞" (adjective), "形容動詞" (adjectival noun), "副詞" (adverb).
  • use_word_stemming (bool) = True:
    • Whether to use word stemming. This parameter is used for Japanese text analysis.
  • n_models (int) = 3:
    • Number of output models to be tried.
  • seed_for_model (int) = 42:
    • Random seed for models such as RandomForestClassifier.
  • id_columns_for_prediction (list[str] or None) = None:
    • Names of the dataframe columns to output together with the prediction result.
  • use_word_list (list[str], dict[str, list[str]], or None) = None:
    • List of words to be used as features when generating explanatory variables from text. If dict type is specified, key must be a column name and value must be a list of words.
  • hyperparameter_tuning (bool) = False:
    • Whether to enable hyperparameter tuning.
  • hyperparameter_tuning_n_trials (int) = 10:
    • The number of trials of hyperparameter tuning.
  • hyperparameter_tuning_timeout (int) = 0:
    • Time limit for hyperparameter tuning in each generated script. Ignored when hyperparameter_tuning is False.
  • hyperparameter_tuning_random_state (int) = 1023:
    • Random seed for hyperparameter tuning.
  • predict_option ('default' or 'probability') = 'default':
    • Prediction method to use: default calls predict(), probability calls predict_proba().
  • permutation_importance (bool) = True:
    • Whether to output the code that calculates permutation importance.
  • add_explanation (bool) = False:
    • If True, outputs ipynb files including EDA and explanation.
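
The split-related rules above can be checked before the constructor is called. The following is a minimal sketch, assuming the class is importable as `from sapientml import SapientML`; the helper check_split_settings and the column names are hypothetical and serve only to illustrate which parameters depend on each other.

```python
def check_split_settings(split_method="random", split_column_name=None):
    """Raise ValueError if the split settings contradict the rules listed above."""
    if split_method not in ("random", "time", "group"):
        raise ValueError("split_method must be 'random', 'time', or 'group'")
    # 'time' and 'group' both require split_column_name.
    if split_method in ("time", "group") and split_column_name is None:
        raise ValueError(f"split_method={split_method!r} requires split_column_name")


# Keyword arguments for a classification task with a group-based split.
# "survived" and "family_id" are made-up column names.
kwargs = {
    "target_columns": ["survived"],
    "task_type": "classification",
    "adaptation_metric": "f1",
    "split_method": "group",
    "split_column_name": "family_id",
    "split_train_size": 0.75,
}
check_split_settings(kwargs["split_method"], kwargs["split_column_name"])
print("settings are consistent")

# With sapientml installed, the same dict could be passed on (assumed import path):
# from sapientml import SapientML
# generator = SapientML(**kwargs)
```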

Parameters for fujitsu-automl

  • target_columns (list[str]):
    • Names of target columns.
  • task_type ('classification', 'regression', or None) = None:
    • Task type, either classification or regression. If set to None, the task type is suggested automatically.
  • adaptation_metric (str) = 'f1' if task_type is 'classification', 'r2' if 'regression':
    • Metric for evaluation. f1, auc, ROC_AUC, accuracy, Gini, LogLoss, MCC (Matthews correlation coefficient), QWK (Quadratic weighted kappa) are available for classification. r2, RMSLE, RMSE, MAE are available for regression.
  • split_method ('random', 'time', or 'group') = 'random':
    • Method of train-test split. random splits the rows randomly. time requires split_column_name; it sorts the rows by that column and then splits the data. group requires split_column_name; it splits the data so that rows with the same value of split_column_name are not placed in both training and test data.
  • split_seed (int) = 17:
    • Random seed for train-test split. Ignored when split_method='time'.
  • split_train_size (float) = 0.75:
    • The ratio of training size to input data. Ignored when split_method='time'.
  • split_column_name (str or None) = None:
    • Name of the column used to split. Ignored when split_method='random'.
  • time_split_num (int) = 5:
    • Passed to n_splits of TimeSeriesSplit. Valid only when split_method='time'.
  • time_split_index (int) = 4:
    • The index of the split from TimeSeriesSplit. Valid only when split_method='time'.
  • split_stratification (bool or None) = None:
    • Whether to perform stratification in the train-test split. Valid only when task_type='classification'.
  • initial_timeout (int) = 600:
    • Time limit to execute each generated script. Ignored when hyperparameter_tuning=True and hyperparameter_tuning_timeout is set.
  • timeout_for_test (int) = 0:
    • Time limit to execute the test script (final_script) and visualization.
  • cancel (CancellationToken or None) = None:
    • Object to interrupt evaluations.
  • project_name (str or None) = None:
    • Project name.
  • debug (bool) = False:
    • Whether to enable debug mode.
  • use_pos_list (list[str]) = ["名詞", "動詞", "助動詞", "形容詞", "副詞"]:
    • List of parts of speech to be used during text analysis. This parameter is used for Japanese text analysis. Select from the following parts of speech: "名詞" (noun), "動詞" (verb), "形容詞" (adjective), "形容動詞" (adjectival noun), "副詞" (adverb).
  • use_word_stemming (bool) = True:
    • Whether to use word stemming. This parameter is used for Japanese text analysis.
  • n_models (int) = 3:
    • Number of output models to be tried.
  • seed_for_model (int) = 42:
    • Random seed for models such as RandomForestClassifier.
  • id_columns_for_prediction (list[str] or None) = None:
    • Names of the dataframe columns to output together with the prediction result.
  • use_word_list (list[str], dict[str, list[str]], or None) = None:
    • List of words to be used as features when generating explanatory variables from text. If dict type is specified, key must be a column name and value must be a list of words.
  • hyperparameter_tuning (bool) = False:
    • Whether to enable hyperparameter tuning.
  • hyperparameter_tuning_n_trials (int) = 10:
    • The number of trials of hyperparameter tuning.
  • hyperparameter_tuning_timeout (int) = 0:
    • Time limit for hyperparameter tuning in each generated script. Ignored when hyperparameter_tuning is False.
  • hyperparameter_tuning_random_state (int) = 1023:
    • Random seed for hyperparameter tuning.
  • predict_option ('default' or 'probability') = 'default':
    • Prediction method to use: default calls predict(), probability calls predict_proba().
  • permutation_importance (bool) = True:
    • Whether to output the code that calculates permutation importance.
  • add_explanation (bool) = False:
    • If True, outputs ipynb files including EDA and explanation.
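
To make time_split_num and time_split_index concrete, the sketch below reproduces in plain Python how scikit-learn's TimeSeriesSplit partitions row positions with its default settings (no gap, no max_train_size, default test_size), and then picks one split by index. How the plugin applies the chosen split internally is an assumption here, and select_time_split is a hypothetical helper.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train, test) index lists like TimeSeriesSplit(n_splits) with defaults."""
    if n_splits + 1 > n_samples:
        raise ValueError("too few samples for the requested number of splits")
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Training rows are everything before the test block.
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


def select_time_split(n_samples, time_split_num=5, time_split_index=4):
    """Return the split at position time_split_index (0-based)."""
    splits = list(time_series_splits(n_samples, time_split_num))
    return splits[time_split_index]


# 12 rows, assumed to be already sorted by the split column.
train, test = select_time_split(12)
print(train)  # rows 0 to 9
print(test)   # rows 10 and 11
```

With the default values time_split_num=5 and time_split_index=4, the last of the five splits is selected, so the most recent rows form the test data.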

Parameters for timeseries-forcast

  • target_columns (list[str]):
    • Names of target columns.
  • id_columns (list[str]):
    • ID column(s) to distinguish different series.
  • time_column (str):
    • Name of the time column.
  • series_columns (list[str]):
    • Column(s) describing values in time series.
  • prediction_frequency (Union[str, pd.DateOffset, pd.Timedelta] or None) = None:
    • Frequency of prediction time. If prediction_frequency is not set, test_dataframe is used to set prediction_frequency. At least one of test_dataframe or prediction_frequency must be set.
  • earliest_prediction_time (Union[str, datetime] or None) = None:
    • Earliest prediction time, used to remove data that is too old.
  • target_range_begin (Union[str, pd.DateOffset, pd.Timedelta] or None) = None:
    • The offset when target_time begins from prediction_time (inclusive).
  • target_range_end (Union[str, pd.DateOffset, pd.Timedelta] or None) = None:
    • The offset when target_time ends from prediction_time (exclusive).
  • feature_specs_list (list[dict] or None) = None:
    • Feature specification. Each dict can include the following keys.
    • windows (list[str]):
      • Windows to calculate features. Each string has the format "{time}" or "{time}-{time}", such as "3d" and "1y-2y". Each time has the format "{int}{unit}", where the unit is 'y', 'mon', 'w', 'd', 'h', 'min', 's', or 'ms'. For "{time1}-{time2}", the window is [prediction_time-time2, prediction_time-time1]. For "{time}", the window is [prediction_time-time, prediction_time]. The left value of a window is inclusive and the right value is exclusive. If not specified, default windows are used.
    • functions (list[str]) = ["max", "min", "mean"]:
      • Aggregation functions. If not specified, ["max", "min", "mean"] is used.
    • groupby_columns_list (list[list[str or (str, bool)]]):
      • Column name list for groupby. If not specified, [id_col] is used. The columns must be included in both task_df and data_df. The bool is set to True if the features are to be expanded to columns. If the bool is omitted, False is used. Example: [['item_id', 'store_id'], [('item_id', False), ('store_state', True)]]
  • feature_selection (bool) = True:
    • Whether or not to perform feature selection.
  • reference_prediction (bool):
    • Whether to forecast by autocorrelation as reference information.
  • time_split_num (int) = 5:
    • Passed to n_splits of TimeSeriesSplit.
  • time_split_index (int) = 4:
    • The index of the split from TimeSeriesSplit.
  • initial_timeout (int) = 600:
    • Time limit to execute each generated script. Ignored when hyperparameter_tuning=True and hyperparameter_tuning_timeout is set.
  • timeout_for_test (int) = 0:
    • Time limit to execute the test script (final_script) and visualization.
  • cancel (CancellationToken or None) = None:
    • Object to interrupt evaluations.
  • project_name (str or None) = None:
    • Project name.
  • debug (bool) = False:
    • Whether to enable debug mode.
  • use_pos_list (list[str]) = ["名詞", "動詞", "助動詞", "形容詞", "副詞"]:
    • List of parts of speech to be used during text analysis. This parameter is used for Japanese text analysis. Select from the following parts of speech: "名詞" (noun), "動詞" (verb), "形容詞" (adjective), "形容動詞" (adjectival noun), "副詞" (adverb).
  • use_word_stemming (bool) = True:
    • Whether to use word stemming. This parameter is used for Japanese text analysis.
  • n_models (int) = 3:
    • Number of output models to be tried.
  • seed_for_model (int) = 42:
    • Random seed for models such as RandomForestClassifier.
  • id_columns_for_prediction (list[str] or None) = None:
    • Names of the dataframe columns to output together with the prediction result.
  • use_word_list (list[str], dict[str, list[str]], or None) = None:
    • List of words to be used as features when generating explanatory variables from text. If dict type is specified, key must be a column name and value must be a list of words.
  • hyperparameter_tuning (bool) = False:
    • Whether to enable hyperparameter tuning.
  • hyperparameter_tuning_n_trials (int) = 10:
    • The number of trials of hyperparameter tuning.
  • hyperparameter_tuning_timeout (int) = 0:
    • Time limit for hyperparameter tuning in each generated script. Ignored when hyperparameter_tuning is False.
  • hyperparameter_tuning_random_state (int) = 1023:
    • Random seed for hyperparameter tuning.
  • predict_option ('default' or 'probability') = 'default':
    • Prediction method to use: default calls predict(), probability calls predict_proba().
  • permutation_importance (bool) = True:
    • Whether to output the code that calculates permutation importance.
  • add_explanation (bool) = False:
    • If True, outputs ipynb files including EDA and explanation.
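
The window format described for feature_specs_list can be validated with a small parser. The sketch below is illustrative only: parse_time and parse_window are hypothetical helpers, not part of the plugin, and the column names in the example spec are made up.

```python
import re

# "{int}{unit}" with the units allowed by the format description.
TIME = re.compile(r"(\d+)(mon|min|ms|y|w|d|h|s)")


def parse_time(text):
    """Parse "{int}{unit}" into (amount, unit), e.g. "3d" -> (3, "d")."""
    match = TIME.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid time: {text!r}")
    return int(match.group(1)), match.group(2)


def parse_window(text):
    """Parse a window string into (near, far) offsets before prediction_time.

    "{time1}-{time2}" covers prediction_time - time2 up to prediction_time - time1.
    "{time}" covers prediction_time - time up to prediction_time; near is None then.
    """
    if "-" in text:
        near, far = text.split("-", 1)
        return parse_time(near), parse_time(far)
    return None, parse_time(text)


# Example specification using the keys described for feature_specs_list.
feature_specs_list = [
    {
        "windows": ["3d", "1y-2y"],
        "functions": ["max", "min", "mean"],
        "groupby_columns_list": [["item_id", "store_id"]],
    }
]

for spec in feature_specs_list:
    for window in spec["windows"]:
        print(window, parse_window(window))
```

Checking the strings up front gives an early error for values such as "3x" instead of a failure during feature generation.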

Parameters for timeseries-static-classification

  • target_columns (list[str]):
    • Names of target columns.
  • id_columns (list[str]):
    • ID column(s) to distinguish different series.
  • time_column (str):
    • Name of the time column. The column type must be int, float, or a str that represents a number.
  • series_columns (list[str]):
    • Column(s) describing values in time series.
  • static_columns (list[str] or None) = None:
    • Constant value column names.
  • to_datetime_unit (str) = 's':
    • Unit of time used when converting to datetime.
  • window_width (float) = 3.0:
    • The size of the window for extracting features.
  • window_stride (float) = 3.0:
    • The stride of the window.
  • functions (list[str] or None) = None:
    • Functions for computing time series features, e.g. "min", "max", "mean".
  • padding (float) = 0.0:
    • A value padded to the end of short timeseries in a dataset. After padding, their length will be the same as the longest one in the dataset.
  • task_type (str) = 'classification':
    • Must be 'classification'.
  • adaptation_metric (str) = 'f1':
    • Metric for evaluation. f1, auc, ROC_AUC, accuracy, Gini, LogLoss, MCC (Matthews correlation coefficient), QWK (Quadratic weighted kappa) are available.
  • split_method ('random', 'time', or 'group') = 'random':
    • Method of train-test split. random splits the rows randomly. time requires split_column_name; it sorts the rows by that column and then splits the data. group requires split_column_name; it splits the data so that rows with the same value of split_column_name are not placed in both training and test data.
  • split_seed (int) = 17:
    • Random seed for train-test split. Ignored when split_method='time'.
  • split_train_size (float) = 0.75:
    • The ratio of training size to input data. Ignored when split_method='time'.
  • split_column_name (str or None) = None:
    • Name of the column used to split. Ignored when split_method='random'.
  • time_split_num (int) = 5:
    • Passed to n_splits of TimeSeriesSplit. Valid only when split_method='time'.
  • time_split_index (int) = 4:
    • The index of the split from TimeSeriesSplit. Valid only when split_method='time'.
  • split_stratification (bool or None) = None:
    • Whether to perform stratification in the train-test split. Valid only when task_type='classification'.
  • initial_timeout (int) = 600:
    • Time limit to execute each generated script. Ignored when hyperparameter_tuning=True and hyperparameter_tuning_timeout is set.
  • timeout_for_test (int) = 0:
    • Time limit to execute the test script (final_script) and visualization.
  • cancel (CancellationToken or None) = None:
    • Object to interrupt evaluations.
  • project_name (str or None) = None:
    • Project name.
  • debug (bool) = False:
    • Whether to enable debug mode.
  • use_pos_list (list[str]) = ["名詞", "動詞", "助動詞", "形容詞", "副詞"]:
    • List of parts of speech to be used during text analysis. This parameter is used for Japanese text analysis. Select from the following parts of speech: "名詞" (noun), "動詞" (verb), "形容詞" (adjective), "形容動詞" (adjectival noun), "副詞" (adverb).
  • use_word_stemming (bool) = True:
    • Whether to use word stemming. This parameter is used for Japanese text analysis.
  • n_models (int) = 3:
    • Number of output models to be tried.
  • seed_for_model (int) = 42:
    • Random seed for models such as RandomForestClassifier.
  • id_columns_for_prediction (list[str] or None) = None:
    • Names of the dataframe columns to output together with the prediction result.
  • use_word_list (list[str], dict[str, list[str]], or None) = None:
    • List of words to be used as features when generating explanatory variables from text. If dict type is specified, key must be a column name and value must be a list of words.
  • hyperparameter_tuning (bool) = False:
    • Whether to enable hyperparameter tuning.
  • hyperparameter_tuning_n_trials (int) = 10:
    • The number of trials of hyperparameter tuning.
  • hyperparameter_tuning_timeout (int) = 0:
    • Time limit for hyperparameter tuning in each generated script. Ignored when hyperparameter_tuning is False.
  • hyperparameter_tuning_random_state (int) = 1023:
    • Random seed for hyperparameter tuning.
  • predict_option ('default' or 'probability') = 'default':
    • Prediction method to use: default calls predict(), probability calls predict_proba().
  • permutation_importance (bool) = True:
    • Whether to output the code that calculates permutation importance.
  • add_explanation (bool) = False:
    • If True, outputs ipynb files including EDA and explanation.
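
The effect of the padding parameter can be pictured with a few lines of plain Python. This is a sketch of the behaviour described above, not the plugin's implementation; pad_series is a hypothetical name.

```python
def pad_series(series_list, padding=0.0):
    """Append `padding` to the end of short series until all match the longest one."""
    longest = max((len(series) for series in series_list), default=0)
    return [
        list(series) + [padding] * (longest - len(series))
        for series in series_list
    ]


raw = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
padded = pad_series(raw, padding=0.0)
print(padded)  # [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0], [5.0, 6.0, 0.0]]
```

Because the padded values enter the feature windows, a padding value that cannot be confused with real measurements may be preferable to the default 0.0 for some datasets.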

Parameters for timeseries-regression

  • target_columns (list[str]):
    • Names of target columns.
  • id_columns (list[str]):
    • ID column(s) to distinguish different series.
  • time_column (str):
    • Name of the time column.
  • series_columns (list[str]):
    • Column(s) describing values in time series.
  • apply_log1p (bool):
    • Whether to apply log1p to series_columns.
  • diff_orders (Union[str, list, None]):
    • Orders of differencing to remove stochastic trend. If "auto", the orders are determined automatically using modules in pmdarima. If list[int], the given orders are applied to all series_columns. If dict such as {"ID1": {"col1": [1, 2], "col2": [6, 12]}, "ID2": {...}}, the given orders are applied to the specified columns of the specified IDs. If None, no differencing is applied.
  • alpha (float):
    • Level of significance used in conducting the ADF test and Ljung-Box test.
  • time_split_num (int) = 5:
    • Passed to n_splits of TimeSeriesSplit.
  • time_split_index (int) = 4:
    • The index of the split from TimeSeriesSplit.
  • initial_timeout (int) = 600:
    • Time limit to execute each generated script. Ignored when hyperparameter_tuning=True and hyperparameter_tuning_timeout is set.
  • timeout_for_test (int) = 0:
    • Time limit to execute the test script (final_script) and visualization.
  • cancel (CancellationToken or None) = None:
    • Object to interrupt evaluations.
  • project_name (str or None) = None:
    • Project name.
  • debug (bool) = False:
    • Whether to enable debug mode.
  • use_pos_list (list[str]) = ["名詞", "動詞", "助動詞", "形容詞", "副詞"]:
    • List of parts of speech to be used during text analysis. This parameter is used for Japanese text analysis. Select from the following parts of speech: "名詞" (noun), "動詞" (verb), "形容詞" (adjective), "形容動詞" (adjectival noun), "副詞" (adverb).
  • use_word_stemming (bool) = True:
    • Whether to use word stemming. This parameter is used for Japanese text analysis.
  • n_models (int) = 3:
    • Number of output models to be tried.
  • seed_for_model (int) = 42:
    • Random seed for models such as RandomForestClassifier.
  • id_columns_for_prediction (list[str] or None) = None:
    • Names of the dataframe columns to output together with the prediction result.
  • use_word_list (list[str], dict[str, list[str]], or None) = None:
    • List of words to be used as features when generating explanatory variables from text. If dict type is specified, key must be a column name and value must be a list of words.
  • hyperparameter_tuning (bool) = False:
    • Whether to enable hyperparameter tuning.
  • hyperparameter_tuning_n_trials (int) = 10:
    • The number of trials of hyperparameter tuning.
  • hyperparameter_tuning_timeout (int) = 0:
    • Time limit for hyperparameter tuning in each generated script. Ignored when hyperparameter_tuning is False.
  • hyperparameter_tuning_random_state (int) = 1023:
    • Random seed for hyperparameter tuning.
  • predict_option ('default' or 'probability') = 'default':
    • Prediction method to use: default calls predict(), probability calls predict_proba().
  • permutation_importance (bool) = True:
    • Whether to output the code that calculates permutation importance.
  • add_explanation (bool) = False:
    • If True, outputs ipynb files including EDA and explanation.
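
The accepted shapes of diff_orders (None, a list, or a dict keyed by ID and column) can be summarised by a small resolver. This is a sketch under stated assumptions: orders_for is a hypothetical helper, the "auto" case is left to the plugin, and treating IDs or columns that are missing from the dict as "no differencing" is an assumption rather than documented behaviour.

```python
def orders_for(diff_orders, series_id, column):
    """Return the differencing orders that apply to one column of one series."""
    if diff_orders is None:
        return []  # no differencing
    if isinstance(diff_orders, str):
        if diff_orders == "auto":
            raise ValueError('"auto" is resolved by the plugin, not by this sketch')
        raise ValueError(f"unsupported diff_orders: {diff_orders!r}")
    if isinstance(diff_orders, dict):
        # Assumption: IDs or columns missing from the dict get no differencing.
        return list(diff_orders.get(series_id, {}).get(column, []))
    # A plain list applies to every column of every series.
    return list(diff_orders)


per_series = {"ID1": {"col1": [1, 2], "col2": [6, 12]}}
print(orders_for([1], "ID1", "col1"))         # [1]
print(orders_for(per_series, "ID1", "col2"))  # [6, 12]
print(orders_for(per_series, "ID2", "col1"))  # []
print(orders_for(None, "ID1", "col1"))        # []
```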