Configuration
Attention: Some methods and attributes are not yet implemented and may not work as documented.
Parameters for WideLearnerClient
- preproc (str or tuple[str, …] or dict[str, str or tuple[str, …]], default=('qscale', 'onehot')):
- Preprocessing methods to be applied.
Each preprocessing method is numerical or categorical. Numerical methods work only on numeric input columns, and categorical methods work only on non-numeric input columns.
If a single string is given, the method for the other column type falls back to the default ('qscale' for numeric, 'onehot' for categorical). For example, 'ecut5' is equivalent to specifying ('ecut5', 'onehot'), and 'onecold' is equivalent to specifying ('qscale', 'onecold').
If the methods are listed in a tuple, each column is processed using all methods that can handle it. Columns that cannot be processed by any method are ignored.
When a dict mapping columns to methods is given, column-specific conversions are performed: the columns in the dict are processed using the corresponding methods and the rest are ignored. If more than one method is listed for a column, that column can be converted in more than one way. This format forces categorical methods to be applied even to numeric columns. Applying a numerical method to a non-numeric column results in an error.
-
Numerical methods:
-
'scale' – Scale the values linearly to real numbers from 0 to 1. Two columns are generated: '↗' represents a positive correlation with the original value, and '↘' represents a negative correlation.
-
'qscale' – Scale the values to real numbers that follow a uniform distribution from 0 to 1. Two columns are generated: '⇧' represents a positive correlation with the original value, and '⇩' represents a negative correlation.
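The idea behind 'qscale' can be sketched with numpy (an illustrative sketch of quantile-rank scaling, not the library's internals; ties are broken by sort order here):

```python
import numpy as np

def qscale(x):
    """Map values to [0, 1] by their empirical quantile rank."""
    x = np.asarray(x, dtype=float)
    # Rank each value among all values (ties broken arbitrarily by sort order).
    ranks = np.argsort(np.argsort(x))
    up = ranks / (len(x) - 1)      # '⇧': increases with the original value
    down = 1.0 - up                # '⇩': decreases with the original value
    return up, down

up, down = qscale([10, 200, 30, 4000])
# up follows the sort order of the input: [0, 2/3, 1/3, 1]
```

Note how the output depends only on the order of the values, not their magnitudes, which is what makes the result uniformly distributed.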
-
'bin2', 'bin3', … – Divide the range into k intervals (bins) of approximately equal width. One binary column is generated for each bin.
-
'cut2', 'cut3', … – Divide the range into k intervals (bins) of approximately equal width. Two binary columns, '<' and '≥', are generated for each cutpoint.
-
'qbin2', 'qbin3', … – Divide the range into k intervals (bins) that contain approximately the same number of samples. One binary column is generated for each bin.
-
'qcut2', 'qcut3', … – Divide the range into k intervals (bins) that contain approximately the same number of samples. Two binary columns, '<' and '≥', are generated for each cutpoint.
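As a rough illustration of quantile-based cutpoints in the spirit of 'qcut3' (a numpy sketch with hypothetical column naming, not the actual implementation):

```python
import numpy as np

def qcut_columns(x, k=3):
    """Binarize x at quantile cutpoints: one '<' and one '>=' column per cutpoint."""
    x = np.asarray(x, dtype=float)
    # k bins need k-1 interior cutpoints at evenly spaced quantiles.
    cuts = np.quantile(x, [i / k for i in range(1, k)])
    cols = {}
    for c in cuts:
        cols[f"<{c:g}"] = (x < c).astype(int)
        cols[f">={c:g}"] = (x >= c).astype(int)
    return cols

cols = qcut_columns(np.arange(12), k=3)  # cutpoints at the 1/3 and 2/3 quantiles
```

Because the cutpoints are quantiles, each of the k bins receives roughly the same number of samples regardless of how the values are distributed.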
-
'ebin2', 'ebin3', … – Divide the range into k intervals (bins) by a greedy algorithm to minimize entropy. One binary column is generated for each bin.
-
'ecut2', 'ecut3', … – Divide the range into k intervals (bins) by a greedy algorithm to minimize entropy. Two binary columns, '<' and '≥', are generated for each cutpoint.
-
'pass' – No conversion.
-
'flip' – Convert value x to 1 - x.
-
Categorical methods:
-
'onehot' – One-hot binary encoding. A column is converted to multiple binary columns corresponding to its values. A value of 1 indicates that the original column has that value.
-
'onecold' – One-cold binary encoding. A column is converted to multiple binary columns corresponding to its values. A value of 1 indicates that the original column does not have that value.
Note that if the column has fewer than three unique values, all methods except 'pass' and 'flip' convert the column in the same way as 'onehot'.
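The contrast between 'onehot' and 'onecold' fits in a few lines (illustrative only; the library's own preprocessor handles this internally):

```python
import pandas as pd

s = pd.Series(["red", "green", "red", "blue"], name="color")
onehot = pd.get_dummies(s).astype(int)   # 1 where the row HAS that value
onecold = 1 - onehot                     # 1 where the row does NOT have that value

# Each row of `onehot` sums to 1; each row of `onecold` sums to n_values - 1.
```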
-
miner ({'chunky', 'copula', 'fair', 'closed', 'auto'}, default='auto'):
- Knowledge chunk mining engine to be used: 'chunky' for ChunkyMiner, 'copula' for CopulaMiner, 'fair' for FairMiner, and 'closed' for ClosedPatternMiner. 'chunky' is the fastest, but supports only binary datasets without sample_weight. Select 'fair' or 'closed' to use fairness-related constraints. 'closed' has a different mining strategy from the others and does not support max_len. If 'auto', an appropriate engine is selected automatically.
-
chunk_mode ({'all', 'minimal', 'supp', 'conf', 'chi2', 'nmi', 'auto'}, default='conf'):
- Criterion for selecting output chunks from all chunks that satisfy the constraints.
-
'all' – All chunks.
-
'minimal' or 'supp' – Chunks that do not contain other chunks.
-
'conf' – Chunks that do not contain other chunks with the same or higher confidence.
-
'chi2' – Chunks that do not contain other chunks with the same or higher chi-squared value.
-
'nmi' – Chunks that do not contain other chunks with the same or higher mutual information.
-
'auto' – Chunks that do not contain other chunks with the same or higher ranking in chunk_score.
-
It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
chunk_score ({'supp', 'conf', 'chi2', 'nmi'}, default='nmi'):
- Criterion for ranking the chunks.
-
'supp' – Ratio of positive hit samples to all positive samples.
-
'conf' – Ratio of positive hit samples to all hit samples.
-
'chi2' – Chi-squared value.
-
'nmi' – Mutual information.
-
It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
chunk_limit (int, default=10000):
- Upper limit of the number of chunks. For a given number k, approximately the top k chunks in chunk_score are enumerated. Chunks with tied scores may be enumerated beyond this limit unless use_exact_limit is set. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
use_exact_limit (bool or int, default=False):
- If True, the number of chunks is exactly capped by chunk_limit. If a positive integer is given, it is exactly limited by that number.
-
max_len (int or None or list[int|None], default=None):
- Upper limit of the chunk length. If multiple values are specified in the list, multiple runs are performed with each limit value and the results are merged. One recommended setting is [None, 1], which is intended to improve the accuracy of the classifier by not missing very basic chunks of length 1. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
min_npos (int or None, default=None):
- Lower limit of the number of positive hit samples for a chunk. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
max_nneg (int or None, default=None):
- Upper limit of the number of negative hit samples for a chunk. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
min_conf (float or None, default=None):
- Lower limit of the ratio of positive hit samples to all hit samples for a chunk. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
min_chi2 (float or None, default=None):
- Lower limit of the chi-squared value for a chunk. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
min_nmi (float or None, default=None):
- Lower limit of the normalized mutual information for a chunk, in the range [0, 1]. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
-
min_supp (float or None, default=0.001):
- Lower limit of the ratio of positive hit samples to all positive samples for a chunk, which is an alternative way to specify min_npos. The range [0, 1] of min_supp corresponds linearly to the range [0, total_npos] of min_npos. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
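As a worked example of this linear correspondence (the numbers here are made up):

```python
# With 2000 positive training samples, min_supp=0.001 corresponds to
# requiring at least 0.001 * 2000 = 2 positive hit samples per chunk.
total_npos = 2000
min_supp = 0.001
min_npos = min_supp * total_npos   # -> 2.0
```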
-
conf_ratio (float or None, default=None):
- Alternative way to specify min_conf. For positive chunks, range [0, 1] of conf_ratio corresponds linearly to the range [total_npos/(total_npos+total_nneg), 1] of min_conf, and similarly for negative chunks. It can be a single value described above or a dict in the form {class1: value1, class2: value2, …}.
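The conf_ratio mapping for positive chunks can be written out explicitly (a worked sketch with made-up counts, following the definition above):

```python
# Suppose 300 positive and 700 negative samples.
total_npos, total_nneg = 300, 700
base = total_npos / (total_npos + total_nneg)   # chance-level confidence, 0.3

# conf_ratio interpolates linearly between the chance level and 1:
conf_ratio = 0.5
min_conf = base + conf_ratio * (1.0 - base)     # -> 0.65
```

Anchoring the lower end at the chance level means conf_ratio=0 accepts any chunk at least as confident as random guessing, while conf_ratio=1 demands perfect confidence.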
-
fair_supp (float or None, default=None):
- Lower limit of the ratio of hit samples to all samples in a fairness group, common to all groups.
-
fair_ratio (float or array-like of shape (n_fairgroups,) or dict[object,float] or None, default=None):
- Lower limits of p/q for positive rules and (1-p)/(1-q) for negative rules, where p is the ratio of hit samples to all samples in the corresponding group and q is the ratio of hit samples to all samples in any other group. A scalar value defines the constraint common to all groups. Constraints for each group can be represented by a vector in the corresponding order or by a dictionary with the group IDs as keys.
-
solver ({'glmnet', 'liblinear', 'saga', None}, default='glmnet'):
- Algorithm to use in the optimization problem. 'glmnet' ensures that all chunks are positively or negatively weighted to be consistent with their own target classes. 'liblinear' and 'saga' do not support cut_point, max_n_weighted, and n_lambda parameters. If None, no solver is executed and all positive and negative chunks are weighted to 1 and -1, respectively.
-
class_weight (dict or None or 'balanced', default=None):
- Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. If 'balanced', weights are adjusted inversely proportional to class frequencies in the training data as n_samples / (n_classes * np.bincount(y)).
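The 'balanced' heuristic is the standard scikit-learn formula, which can be checked directly:

```python
import numpy as np

y = np.array([0, 0, 0, 1])                        # 3 samples of class 0, 1 of class 1
n_samples, n_classes = len(y), 2
weights = n_samples / (n_classes * np.bincount(y))
# class 1 (the rarer class) gets three times the weight of class 0
```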
-
fit_intercept (bool, default=True):
- Whether to include an intercept term in the model. It corresponds to weighting an unconditional chunk or empty itemset.
-
l1_ratio (float or list[float], default=1.0):
- Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Set 0 for ridge, 1 for lasso. If multiple values are specified, the one with the best cross-validation score is selected. 'liblinear' solver supports only the values 0 and 1.
-
C (float or list[float] or None, default=None):
- Inverse of regularization strength λ; must be a positive float. The smaller the value, the stronger the regularization. If multiple values are specified, the one with the best cross-validation score is selected. If None, λ is determined fully automatically.
-
n_lambda (int, default=100):
- Maximum number of λ values to compute when C=None. Smaller values speed up the solver at the cost of coarser optimization. Only supported by the 'glmnet' solver.
-
cv (int or cross-validation generator, default=5):
- Cross-validation generator to be used for tuning λ and l1_ratio. If int, sklearn.model_selection.StratifiedKFold with the given number of folds will be used.
-
cv_score (str or callable or None, default='neg_log_loss'):
- Strategy to evaluate the performance of the cross-validated model. Valid options include 'accuracy', 'roc_auc', 'f1', 'precision', and 'recall'. If None, use the default classification score. See the description of the scoring parameter in the scikit-learn user guide.
-
cut_point (float, default=0):
- The cut point to use for selecting the best λ: the selected value is the largest λ satisfying cv_score(λ) >= cv_score(λ_max) - cut_point * standard_error(λ_max), where λ_max is the λ with the best cross-validation score. Only supported by the 'glmnet' solver.
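This generalizes glmnet's "one-standard-error" rule: among all λ whose CV score is within cut_point standard errors of the best, pick the largest (most regularized). A small sketch with made-up numbers:

```python
import numpy as np

lambdas = np.array([1.0, 0.3, 0.1, 0.03, 0.01])            # descending, as glmnet orders them
cv_scores = np.array([-0.70, -0.55, -0.50, -0.49, -0.52])  # e.g. neg_log_loss
se = 0.02                                                  # standard error at the best λ

def pick_lambda(lambdas, cv_scores, se, cut_point):
    best = cv_scores.max()
    ok = cv_scores >= best - cut_point * se    # within the tolerance band
    return lambdas[ok].max()                   # most regularized acceptable λ

pick_lambda(lambdas, cv_scores, se, cut_point=0)  # -> 0.03 (the best score itself)
pick_lambda(lambdas, cv_scores, se, cut_point=1)  # -> 0.1 (one-standard-error rule)
```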
-
max_n_weighted (int or None, default=None):
- Maximum number of chunks with nonzero weights. Only supported by the 'glmnet' solver.
-
max_iter (int, default=1000000):
- Maximum passes over the data when fitting the model.
-
n_jobs (int, default=None):
- Maximum number of CPU cores the solver will use to fit the model. None means 1 unless in a joblib.parallel_backend context. If -1, all CPUs can be used.
-
random_state (int, RandomState instance or None, default=None):
- Seed for the random number generator, used for determining the CV folds.
-
use_sparse_matrix (bool, default=True):
- Whether to use the sparse matrix form in the fit method for fast computation.
-
verbose (int, default=0):
- Verbosity level of messages.
-
pbar (bool, default=None):
- Whether to show a progress bar. If None, it is automatically determined by the execution environment.
Attributes
n_features_in_ (int):
- The number of features passed to the fit() method.
-
feature_names_in_ (ndarray of shape (n_features,)):
- Names of features seen during fit.
-
preproc_ (WidePreprocessor or None):
- Preprocessor object.
-
features_ (array of shape (n_features,)):
- Names of features after preprocessing.
-
classes_ (array of shape (n_classes,)):
- The distinct class labels found in y.
-
class_ratio_ (array of shape (n_classes,)):
- Ratio of the target class samples to all samples in fitting each model.
-
fairgroups_ (array of shape (n_fairgroups,)):
- The distinct fairness group IDs.
-
miner_ ({'chunky', 'copula', 'fair', 'closed'}):
- Knowledge chunk mining engine actually used.
-
all_chunks_ (list[tuple[ChunkSet, ChunkSet]] of length n_classes or 1):
- Pair of the set of all positive chunks and the set of all negative chunks for each class.
-
chunks_ (list[tuple[ChunkSet, ChunkSet]] of length n_classes or 1):
- Pair of the set of weighted positive chunks and the set of weighted negative chunks for each class.
-
weight_ (list[tuple[array, array]] of length n_classes or 1):
- Pair of positive and negative coefficient vectors for each class.
-
chunkset_ (ChunkSet):
- Set of positive and negative weighted chunks of all classes, merged into a ChunkSet object of the first class for convenience. Empty itemsets are not included and the same chunk will not appear more than once.
-
coef_ (ndarray of shape (1, n_chunks) or (n_classes, n_chunks)):
- Coefficients of the chunks in the decision function, ordered according to their order in chunkset_.
-
intercept_ (ndarray of shape (1,) or (n_classes,)):
- Intercept (a.k.a. bias) added to the decision function.
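Together, coef_ and intercept_ define a linear decision function over the chunk values produced by transform. A conceptual numpy sketch with hypothetical fitted values (the actual predict logic may differ in detail):

```python
import numpy as np

# Hypothetical fitted values: 3 classes, 3 chunks.
coef = np.array([[ 1.5, -0.5,  0.0],
                 [-1.0,  0.8,  0.3],
                 [ 0.2,  0.2, -0.4]])
intercept = np.array([0.2, -0.1, 0.0])

chunk_values = np.array([[1, 0, 1]])       # one sample after transform()
scores = chunk_values @ coef.T + intercept # per-class decision scores
pred = scores.argmax(axis=1)               # index into classes_
```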
-
lambda_ (list[float] of length n_classes or 1):
- Regularization strength parameter for each class.
-
l1_ratio_ (list[float] of length n_classes or 1):
- Elastic-Net mixing parameter for each class.
-
cv_score_ (list[float] of length n_classes or 1):
- Cross-validation score for each class.
Methods
Implemented Methods
| method | explanation |
|---|---|
| WideLearner.__init__([preproc, miner, ...]) | Initialize the estimator with the given parameters. |
| WideLearner.chunkdata([columns, all, sort]) | Generate a table of knowledge chunks. |
| WideLearner.fit(X, y[, sample_weight, ...]) | Fit the model according to the given training data. |
| WideLearner.predict(X) | Predict class labels for samples in X. |
| WideLearner.predict_proba(X) | Probability estimates. |
| WideLearner.transform(X, *[, all]) | Convert samples in X to vectors of chunk values. |
Unimplemented Methods
| method | explanation |
|---|---|
| WideLearner.chunkmatrix([all]) | Generate a matrix representing the relation of chunks and features. |
| WideLearner.chunks([all]) | Get an iterator of all chunks. |
| WideLearner.classifier() | Get trained linear classifier that takes chunk values as input. |
| WideLearner.fit_transform(X[, y]) | Fit to data, then transform it. |
| WideLearner.get_params([deep]) | Get parameters for this estimator. |
| WideLearner.n_chunks([all]) | Get the total number of chunks. |
| WideLearner.plot_chunks([plotter, all]) | Plot weighted chunks on the 2-d space of positive and negative support values. |
| WideLearner.score(X, y[, sample_weight]) | Return the mean accuracy on the given test data and labels. |
| WideLearner.set_output(*[, transform]) | Set output container. |
| WideLearner.set_params(**params) | Set the parameters of this estimator. |