Usage
Attention: Some methods and attributes are not yet implemented and may not work as documented.
Wide Learning is an AI technology that simulates the scientific discovery process.
The "discovery" process in science involves repeating the following steps:
1. Think of a hypothesis.
2. Use data from observations and experiments to verify the validity of the hypothesis.
3. If the hypothesis is not correct, consider a new one (return to step 1).
After repeating this procedure, the surviving hypothesis is considered to be a "discovery", a new theory in science.
Users can experiment with Wide Learning just by learning how to use the WideLearnerClient class.
WideLearnerClient implements the classifier interface of scikit-learn. Users do not have to worry about the internal pipeline and can perform training with fit() and prediction with predict() as usual.
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> test = [49, 99, 149]
>>> X_train, X_test = X.drop(test), X.iloc[test]
>>> y_train, y_test = y.drop(test), y.iloc[test]
>>> y_test.values
array([0, 1, 2])
>>> from widelearning.api import WideLearnerClient
>>> wl = WideLearnerClient(random_state=0)
>>> wl.fit(X_train, y_train)
WideLearnerClient(random_state=0)
>>> wl.predict(X_test)
array([0, 1, 2])
>>> wl.score(X_test, y_test)
1.0
To view the results with meaningful names, pass a pandas.DataFrame with column names (instead of a plain numpy array) as the parameter X of fit(X, y) and predict(X). Use chunkdata() after fit() to get the table of KCs, and plot_chunks() to visualize the KCs with Plotly or Seaborn.
WideLearnerClient executes the following pipeline internally:
- Preprocessing with WidePreprocessor
- KC enumeration with ChunkyMiner or CopulaMiner
- Weight optimization with LogitNet in python-glmnet or LogisticRegressionCV in scikit-learn
1. Preprocessing
The input dataset for the WideLearnerClient needs to store numeric features in numeric columns, and categorical features in string columns. The role of preprocessing is to convert this data into numerical columns in the range of 0 to 1 that can be handled in the subsequent pipeline (KC enumeration). You can specify preprocessing settings in the first argument preproc of widelearning.api.WideLearnerClient.
To specify individual preprocessing methods for each column, pass a dictionary object that uses the column name as a key and the name of the preprocessing method as a value to preproc. We will describe the choices for preprocessing methods in the following subsections.
If a is the preprocessing method for numerical features and b is the preprocessing method for categorical features, specifying preproc=(a, b) will apply a to all numerical features and b to all categorical features[1]. The default is preproc=('qscale', 'onehot'). preproc=(a, 'onehot') can be abbreviated as preproc=a, and similarly, preproc=('qscale', b) can be abbreviated as preproc=b.
[1] The order in the tuple is not important, and specifying preproc=(b, a) will yield the same results. In addition, it is possible to apply multiple preprocessing methods to the same type of feature. For instance, it is permissible to specify preproc=('onehot', 'onecold', 'cut5', 'qcut4').
1.1 Methods for numerical features
Preprocessing of numerical features can be classified into scaling and binarization. In general, scaling is better when there is a monotonic relationship between the number and the target label, such as “The larger the number, the higher the probability of label 1.” or “The smaller the number, the higher the probability of label 1.”, and binarization is better for extracting more complex relationships.
Scaling
Normalizes the numeric values to the range of [0, 1]. In order for Wide Learning to discover both the property that appears stronger as the value increases and the property that appears stronger as the value decreases, it generates two normalized numerical features from the original feature: one that has a positive correlation with the original (minimum mapped to 0, maximum mapped to 1) and the other that has a negative correlation (minimum mapped to 1, maximum mapped to 0).
There are two scaling options: 'scale', which converts values linearly with MinMaxScaler(clip=True), and 'qscale', which converts values into a uniform distribution with QuantileTransformer. Choose 'scale' to reflect the original values directly in the classification model, or 'qscale' to reduce the effects of outliers.
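As a rough illustration of the difference, here is a sketch using scikit-learn's MinMaxScaler and QuantileTransformer directly (not WidePreprocessor itself, whose internals may differ), including the negatively correlated twin feature described above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

x = np.array([[1.0], [2.0], [3.0], [10.0]])  # 10.0 acts as an outlier

# 'scale'-like: linear mapping of [min, max] onto [0, 1].
scaled = MinMaxScaler(clip=True).fit_transform(x)
# 'qscale'-like: map values onto their (approximately uniform) ranks.
qscaled = QuantileTransformer(n_quantiles=4).fit_transform(x)
# The negatively correlated variant maps min -> 1 and max -> 0.
scaled_down = 1.0 - scaled

print(scaled.ravel())   # the outlier compresses the other values near 0
print(qscaled.ravel())  # roughly evenly spaced regardless of the outlier
```

With 'scale' the outlier 10.0 squeezes the values 1–3 into a narrow band near 0, while 'qscale' spreads the four values evenly, matching the advice above.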
>>> X_train.columns
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')
>>> wl = WideLearnerClient('scale').fit(X_train, y_train)
>>> wl.features_
array(['sepal length (cm)↗', 'sepal length (cm)↘', 'sepal width (cm)↗',
       'sepal width (cm)↘', 'petal length (cm)↗', 'petal length (cm)↘',
       'petal width (cm)↗', 'petal width (cm)↘'], dtype=object)
>>> wl = WideLearnerClient(preproc='qscale').fit(X_train, y_train)
>>> wl.features_
array(['sepal length (cm)⇧', 'sepal length (cm)⇩', 'sepal width (cm)⇧',
       'sepal width (cm)⇩', 'petal length (cm)⇧', 'petal length (cm)⇩',
       'petal width (cm)⇧', 'petal width (cm)⇩'], dtype=object)
Binarization
Generates a binary vector of {0, 1} values by discretizing the numbers into k intervals according to the specified discretization method and the integer k ≥ 2, then encoding the membership of each interval in the “cut” or “bin” manner. “cut” generates at most (k − 1) × 2 binary features that indicate whether the number is less than or not less than each cut point, and “bin” generates at most k binary features that are one-hot representations of whether the number belongs to each interval[2]. In Wide Learning, “cut” is more commonly used because it allows feature combinations to express various intervals.
[2] The final number of intervals may be less than k because the intervals without training samples are merged into adjacent ones.
“cutk” ('cut2', 'cut3', …) and “bink” ('bin2', 'bin3', …) divide the range of values in the column into k intervals of the same width[3]. “qcutk” ('qcut2', 'qcut3', …) and “qbink” ('qbin2', 'qbin3', …) divide the range of values in the column into k intervals that contain approximately the same number of training samples. The relationship between “cutk” and “qcutk” (“bink” and “qbink”) is similar to the relationship between 'scale' and 'qscale'. “cutk” and “bink” are suitable for features that should be focused on the original values themselves, and “qcutk” and “qbink” are suitable for features that should be focused on the relative magnitude of values.
[3] The final intervals will not have exactly the same width because the cutpoints are rounded to appropriate values between the values of the nearest upper and lower training samples.
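The difference between equal-width (“cutk”-style) and equal-frequency (“qcutk”-style) cutpoints can be sketched with pandas (an analogy only; the library's cutpoint rounding described in the footnote is not reproduced):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 skews the value range

# Equal-width intervals ("cutk"-like): the outlier stretches the bins.
width_bins = pd.cut(s, bins=3)
# Equal-frequency intervals ("qcutk"-like): each bin holds ~the same count.
freq_bins = pd.qcut(s, q=3)

print(width_bins.value_counts().sort_index())
print(freq_bins.value_counts().sort_index())
```

Here equal-width binning puts five of the six values into the first bin, while equal-frequency binning assigns two values to each bin.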
“ecutk” ('ecut2', 'ecut3', …) and “ebink” ('ebin2', 'ebin3', …) are supervised discretization methods, which determine the cutpoints by calculating the entropy so that the correlation between the discretized feature and the target is maintained. Note that interactions with other features are not taken into account, but this method often helps improve the accuracy of the classification model.
>>> wl = WideLearnerClient(
...     preproc={
...         'sepal length (cm)': 'cut3',
...         'sepal width (cm)': 'qcut3',
...         'petal length (cm)': 'ecut3',
...         'petal width (cm)': 'ebin3',
...     }
... ).fit(X_train, y_train)
>>> wl.features_
array(['sepal length (cm)<5.5', 'sepal length (cm)≥5.5',
       'sepal length (cm)<6.7', 'sepal length (cm)≥6.7',
       'sepal width (cm)<2.9', 'sepal width (cm)≥2.9',
       'sepal width (cm)<3.2', 'sepal width (cm)≥3.2',
       'petal length (cm)<2.0', 'petal length (cm)≥2.0',
       'petal length (cm)<4.8', 'petal length (cm)≥4.8',
       'petal width (cm)<0.8', '0.8≤petal width (cm)<1.8',
       'petal width (cm)≥1.8'], dtype=object)
1.2 Methods for categorical features
The standard preprocessing method for a categorical feature would be 'onehot', which generates a binary vector that represents the category in one-hot code. This method is simple and easy to interpret, and has the effect of reducing the computational cost of enumeration and optimization in later steps of the pipeline.
However, when applied to a feature with a large number of categories, it generates only items with small support, and Wide Learning may not produce a good analysis. One way to solve this problem with manual feature engineering is to reduce the number of categories by grouping them. For better analysis, it is a good idea to group the categories in different ways to create multiple features from one original feature. Then, Wide Learning will automatically find useful combinations of the groups.
Another preprocessing method, 'onecold', has the effect of automating the grouping of categories at the expense of the computational cost of the KC enumeration. It generates a binary vector which swaps 0 and 1 from the result of 'onehot'[4]. That is, each binary feature indicates that the sample value is not in a specific category. It allows feature combinations to represent any subset of the categorical values.
[4] Features with no more than two distinct values are processed in the same way as with 'onehot'.
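A hedged sketch of the two encodings using pandas.get_dummies (not WidePreprocessor itself; the special case for binary features in footnote [4] is not reproduced):

```python
import pandas as pd

s = pd.Series(['a0', 'a1', 'a2', 'a0'], name='A')

# 'onehot'-like: one binary column per category.
onehot = pd.get_dummies(s, prefix='A').astype(int)
# 'onecold'-like: swap 0 and 1, so each column means "not this category".
onecold = 1 - onehot

# A conjunction of 'onecold' columns can express any subset of categories,
# e.g. (A != 'a1') AND (A != 'a2') selects exactly the group {a0}.
print(onehot)
print(onecold)
```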
>>> import random
>>> import pandas as pd
>>> pd.set_option('display.width', 999)
>>> pd.set_option('display.max_colwidth', 999)
>>> pd.set_option('display.max_columns', 999)
>>> X = pd.DataFrame(
...     {
...         'A': random.choices(['a0', 'a1', 'a2'], k=100),  # categorical
...         'B': random.choices(['b0', 'b1'], k=100),  # categorical
...         'C': random.choices([0, 1, 2], k=100),  # numeric
...         'D': random.choices(['0', '1', '2'], k=100),  # categorical
...     }
... )
>>> X.dtypes
A object
B object
C int64
D object
dtype: object
>>> y = [0] * 50 + [1] * 50
>>> wl = WideLearnerClient().fit(X, y)
>>> wl.features_
array(["A='a0'", "A='a1'", "A='a2'", "B='b0'", "B='b1'", 'C⇧', 'C⇩',
       "D='0'", "D='1'", "D='2'"], dtype=object)
>>> wl = WideLearnerClient(preproc=('cut3', 'onecold')).fit(X, y)
>>> wl.features_
array(["A≠'a0'", "A≠'a1'", "A≠'a2'", "B='b0'", "B='b1'", 'C<1', 'C≥1',
       'C<2', 'C≥2', "D≠'0'", "D≠'1'", "D≠'2'"], dtype=object)
2. KC enumeration
In this step, the subsets of preprocessed features (itemsets) are enumerated as knowledge chunks (KCs). For a binary classification problem with labels C0 and C1, KCs are enumerated for the two target classes “C0” and “C1”. For a classification problem with three or more labels, KCs are enumerated for the two target classes “C” and “¬C (other than C)” for each label C.
A KC is an itemset that:
- satisfies all user-specified constraints,
- has a higher confidence score than its subsets, and
- has a top-k mutual information score among the itemsets satisfying the above conditions.
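To make these scores concrete, here is a toy sketch computing support, confidence, and normalized mutual information for a single itemset with numpy and scikit-learn (an illustration only; the exact score definitions used by the miner may differ):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Binary item values for two preprocessed features and the target (toy data).
A = np.array([1, 1, 1, 0, 0, 1, 0, 1])
B = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y = np.array([1, 0, 1, 0, 0, 1, 0, 0])

itemset = A & B  # the itemset {A, B} holds where both items are 1

supp = itemset.mean()          # support: fraction of samples covered
conf = y[itemset == 1].mean()  # confidence: P(target | itemset holds)
nmi = normalized_mutual_info_score(y, itemset)

print(supp, conf, nmi)
```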
The number of KCs in each target class can be approximately limited using chunk_limit[5]. You can set this to a small value to get a simple model. Setting a larger value may improve the accuracy of the classification model, but it may also blow up the processing time of the optimization step.
[5] Exact top-k computation is omitted for speedup.
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> # using solver=None to skip the optimization step
>>> wl = WideLearnerClient('cut7', max_len=None, chunk_limit=1, solver=None).fit(X, y)
>>> wl.n_chunks(all=True)
13
>>> wl.chunkdata(all=True)
weight len npos nneg supp conf chi2 nmi
label chunk
0 petal width (cm)<0.8 1.0 1 50 0 1.00 1.000000 150.000000 1.000000
petal length (cm)<2.0 1.0 1 50 0 1.00 1.000000 150.000000 1.000000
¬0 petal width (cm)≥0.8 -1.0 1 100 0 1.00 1.000000 150.000000 1.000000
petal length (cm)≥2.0 -1.0 1 100 0 1.00 1.000000 150.000000 1.000000
petal length (cm)≥1.8 ∧ petal width (cm)≥0.5 -1.0 2 100 0 1.00 1.000000 150.000000 1.000000
1 petal length (cm)<5.3 ∧ petal width (cm)≥0.8 ∧ petal width (cm)<1.9 1.0 2 50 8 1.00 0.862069 118.965517 0.756287
petal length (cm)≥2.0 ∧ petal length (cm)<5.3 ∧ petal width (cm)<1.9 1.0 2 50 8 1.00 0.862069 118.965517 0.756287
petal length (cm)≥1.8 ∧ petal length (cm)<5.3 ∧ petal width (cm)≥0.5 ∧ petal width (cm)<1.9 1.0 2 50 8 1.00 0.862069 118.965517 0.756287
sepal length (cm)≥4.9 ∧ sepal width (cm)<3.8 ∧ petal length (cm)≥1.8 ∧ petal length (cm)<5.3 ∧ petal width (cm)<1.9 1.0 4 50 8 1.00 0.862069 118.965517 0.756287
¬1 petal width (cm)<0.8 -1.0 1 50 0 0.50 1.000000 37.500000 0.274018
petal length (cm)<2.0 -1.0 1 50 0 0.50 1.000000 37.500000 0.274018
2 petal length (cm)≥4.4 ∧ petal width (cm)≥1.5 1.0 2 49 14 0.98 0.777778 96.551724 0.593289
¬2 petal length (cm)<5.3 ∧ petal width (cm)<1.9 -1.0 2 100 8 1.00 0.925926 116.666667 0.701315
>>> wl = WideLearnerClient('cut7', max_len=None, chunk_limit=100, solver=None).fit(X, y)
>>> wl.n_chunks(all=True)
548
Parameters for optional constraints include max_len, min_npos, max_neg, min_supp, min_conf, min_chi2, min_nmi. See the API reference of WideLearnerClient for more information.
3. Weight optimization
The classification model of WideLearnerClient consists of one or more linear functions of the KC values[6]. Weight optimization is the step of determining the coefficients (weights) and intercepts[7] in those linear functions.
[6] The value of a KC is the product of the values of the items that make it up. For example, for a sample with item values A = 0.5 and B = 0.8, a KC of items {A, B} has a value of 0.5 × 0.8 = 0.4. When all items in a KC are binary, this is the same as a logical AND of the binary values. For example, a KC of binary items {C, D} has a value of 1 if and only if C ∧ D. The value of an empty KC {} is always 1.
[7] In chunkdata() and plot_chunks(), the intercept is displayed as the weight of an empty KC {}. Note that the empty KC does not satisfy the enumeration conditions in the previous section.
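The product rule for KC values is a one-liner; a minimal numpy sketch:

```python
import numpy as np

# Per-sample item values after preprocessing (columns: items A and B).
items = np.array([
    [0.5, 0.8],  # fractional values: the KC value is their product
    [1.0, 1.0],  # binary values: the product acts as a logical AND
    [1.0, 0.0],
])

kc_value = items.prod(axis=1)
print(kc_value)  # per-sample KC values: 0.4, 1.0, 0.0
```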
Setting fit_intercept=False results in making a linear model where the intercept is fixed at zero.
Parameter l1_ratio takes a value in [0, 1] and specifies the mixing ratio of L1 and L2 regularization in Elastic-Net. l1_ratio=0 is equivalent to Ridge (L2 only) and l1_ratio=1 is equivalent to Lasso (L1 only). Therefore, the closer the value is to 1, the stronger the effect of selecting one KC from a group of highly correlated KCs.
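The effect of l1_ratio on sparsity can be illustrated with scikit-learn's plain LogisticRegression (a hedged analogy: per the pipeline description, WideLearnerClient actually delegates to LogitNet or LogisticRegressionCV, and fits KC values rather than raw features):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # keep features in [0, 1]

# Count exactly-zero coefficients for Ridge-like, mixed, and Lasso-like fits.
n_zero = {}
for ratio in (0.0, 0.5, 1.0):
    clf = LogisticRegression(
        penalty='elasticnet', solver='saga', l1_ratio=ratio,
        C=0.1, max_iter=5000,
    ).fit(X, y)
    n_zero[ratio] = int((clf.coef_ == 0).sum())

print(n_zero)  # the Lasso-like fit zeroes out the most coefficients
```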
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> y_0 = y == 0
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y_0, test_size=0.33, stratify=y_0, random_state=0
... )
>>> # Lasso with intercept
>>> wl = WideLearnerClient(
...     'scale', fit_intercept=True, l1_ratio=1, random_state=1
... ).fit(X_train, y_train)
>>> wl.score(X_test, y_test)
1.0
>>> wl.chunkdata()
weight len npos nneg supp conf chi2 nmi
label chunk
True petal length (cm)↘ × petal width (cm)↘ 23.015085 2 28.439972 9.099576 0.861817 0.757600 49.701686 0.414764
sepal width (cm)↗ × petal length (cm)↘ 4.604287 2 18.795904 7.536723 0.569573 0.713788 23.812898 0.181879
False -14.374686 0 67.000000 33.000000 1.000000 0.670000 0.000000 0.000000
>>> # Lasso without intercept
>>> wl = WideLearnerClient(
...     'scale', fit_intercept=False, l1_ratio=1, random_state=1
... ).fit(X_train, y_train)
>>> wl.score(X_test, y_test)
1.0
>>> wl.chunkdata()
weight len npos nneg supp conf chi2 nmi
label chunk
True petal length (cm)↘ × petal width (cm)↘ 13.001155 2 28.439972 9.099576 0.861817 0.757600 49.701686 0.414764
False petal width (cm)↗ -1.867848 1 44.041667 2.166667 0.657338 0.953111 31.140801 0.283293
sepal width (cm)↘ -6.425330 1 42.833333 12.583333 0.639303 0.772932 5.956385 0.047081
petal length (cm)↗ -10.316972 1 44.457627 2.576271 0.663547 0.945225 30.422906 0.272887
>>> # Elastic-Net without intercept
>>> wl = WideLearnerClient(
...     'scale', fit_intercept=False, l1_ratio=0.5, random_state=1
... ).fit(X_train, y_train)
>>> wl.score(X_test, y_test)
1.0
>>> wl.chunkdata()
weight len npos nneg supp conf chi2 nmi
label chunk
True petal length (cm)↘ × petal width (cm)↘ 3.141001 2 28.439972 9.099576 0.861817 0.757600 49.701686 0.414764
sepal length (cm)↘ × petal length (cm)↘ × petal width (cm)↘ 3.005075 3 22.800912 5.027521 0.690937 0.819339 41.759278 0.329134
sepal width (cm)↗ × petal length (cm)↘ × petal width (cm)↘ 1.562555 3 17.523717 2.768597 0.531022 0.863564 32.780539 0.253875
sepal width (cm)↗ × petal length (cm)↘ 1.160689 2 18.795904 7.536723 0.569573 0.713788 23.812898 0.181879
sepal length (cm)↘ × petal length (cm)↘ 1.121336 2 24.354905 11.407807 0.738027 0.681014 31.024387 0.246821
sepal length (cm)↘ × sepal width (cm)↗ × petal length (cm)↘ × petal width (cm)↘ 1.110712 4 13.657534 1.377991 0.413865 0.908351 26.771670 0.208692
sepal width (cm)↗ × petal width (cm)↘ 1.107409 2 19.024306 7.493056 0.576494 0.717428 24.498501 0.187346
sepal length (cm)↘ × petal width (cm)↘ 1.047620 2 24.698232 11.502525 0.748431 0.682257 31.844430 0.254197
sepal length (cm)↘ × sepal width (cm)↗ × petal length (cm)↘ 0.865064 3 14.625535 3.440699 0.443198 0.809551 22.934289 0.173656
sepal length (cm)↘ × sepal width (cm)↗ × petal width (cm)↘ 0.835394 3 14.818550 3.396991 0.449047 0.813511 23.550275 0.178560
petal length (cm)↘ 0.509627 1 30.423729 22.542373 0.921931 0.574400 30.422906 0.272887
petal width (cm)↘ 0.358236 1 30.833333 22.958333 0.934343 0.573199 31.140801 0.283293
sepal length (cm)↘ × sepal width (cm)↗ 0.185292 2 15.877525 9.618687 0.481137 0.622741 13.263995 0.100278
sepal width (cm)↗ 0.099418 1 20.416667 24.166667 0.618687 0.457944 5.956385 0.047081
False sepal length (cm)↗ × sepal width (cm)↘ × petal width (cm)↗ -0.106405 3 15.789247 0.137574 0.235660 0.991362 8.848554 0.100453
sepal length (cm)↗ × sepal width (cm)↘ × petal length (cm)↗ -0.141141 3 16.110448 0.168464 0.240454 0.989651 8.985762 0.101108
sepal length (cm)↗ × petal width (cm)↗ -0.466383 2 26.241162 0.470960 0.391659 0.982369 16.085091 0.168928
sepal length (cm)↗ × petal length (cm)↗ -0.524704 2 26.562404 0.537237 0.396454 0.980176 16.175589 0.168501
sepal width (cm)↘ × petal length (cm)↗ × petal width (cm)↗ -0.903704 3 18.693385 0.062735 0.279006 0.996655 11.141477 0.128453
sepal length (cm)↗ × sepal width (cm)↘ -1.060094 2 23.148990 2.066919 0.345507 0.918031 9.381863 0.087559
sepal length (cm)↗ -1.138718 1 37.696970 6.606061 0.562641 0.850889 11.771678 0.098219
petal length (cm)↗ × petal width (cm)↗ -1.537311 2 30.598870 0.182910 0.456700 0.994058 21.121739 0.227261
sepal width (cm)↘ × petal width (cm)↗ -3.110567 2 27.368056 0.774306 0.408478 0.972486 16.207266 0.164727
sepal width (cm)↘ × petal length (cm)↗ -3.227391 2 27.827684 0.955508 0.415339 0.966803 16.102934 0.161124
sepal width (cm)↘ -3.787949 1 42.833333 12.583333 0.639303 0.772932 5.956385 0.047081
petal width (cm)↗ -4.165566 1 44.041667 2.166667 0.657338 0.953111 31.140801 0.283293
petal length (cm)↗ -4.331671 1 44.457627 2.576271 0.663547 0.945225 30.422906 0.272887
The regularization strength λ is automatically optimized with cross-validation. The cross-validation generator and the number of folds can be customized with the parameter cv. The cross-validation score can be specified in cv_score.
Pass a dict object to class_weight to specify the training sample weight for each class. Given the string 'balanced', weights are adjusted inversely proportional to the class frequencies in the training samples.
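scikit-learn's compute_class_weight helper implements the same 'balanced' formula, n_samples / (n_classes * count(class)), and can be used to inspect the weights (a sketch only, not part of the widelearning API):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # imbalanced labels

# 'balanced' weight per class: n_samples / (n_classes * count(class))
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority class gets the larger weight
```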
Optimization results are nondeterministic. For reproducible results, fix the random seed by setting random_state.