Usage
Attention: Some methods and attributes are not yet implemented and may not work as documented.
Wide Learning is an AI technology that simulates the scientific discovery process.
The "discovery" process in science involves repeating the following steps:
1. Think of a hypothesis.
2. Use data from observations and experiments to verify the validity of the hypothesis.
3. If the hypothesis is not correct, consider a new one (return to step 1).
After repeating this procedure, the surviving hypothesis is considered to be a "discovery", a new theory in science.
Users can experiment with Wide Learning just by learning how to use the WideLearnerClient class.
WideLearnerClient implements the classifier interface of scikit-learn. Users do not have to worry about the internal pipeline and can perform training with fit() and prediction with predict() as usual.
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> test = [49, 99, 149]
>>> X_train, X_test = X.drop(test), X.iloc[test]
>>> y_train, y_test = y.drop(test), y.iloc[test]
>>> y_test.values
array([0, 1, 2])
>>> from widelearning.api import WideLearnerClient
>>> wl = WideLearnerClient(random_state=0)
>>> wl.fit(X_train, y_train)
WideLearnerClient(random_state=0)
>>> wl.predict(X_test)
array([0, 1, 2])
>>> wl.score(X_test, y_test)
1.0
To view the results with meaningful names, pass a pandas.DataFrame with column names (instead of a plain numpy array) as the parameter X of fit(X, y) and predict(X). Use chunkdata() after fit() to get the table of KCs, and plot_chunks() to visualize the KCs with Plotly or Seaborn.
WideLearnerClient executes the following pipeline internally:
- Preprocessing with WidePreprocessor
- KC enumeration with ChunkyMiner or CopulaMiner
- Weight optimization with LogitNet in python-glmnet or LogisticRegressionCV in scikit-learn
1. Preprocessing
The input dataset for the WideLearnerClient needs to store numeric features in numeric columns, and categorical features in string columns. The role of preprocessing is to convert this data into numerical columns in the range of 0 to 1 that can be handled in the subsequent pipeline (KC enumeration). You can specify preprocessing settings in the first argument preproc of widelearning.api.WideLearnerClient.
To specify individual preprocessing methods for each column, pass a dictionary object that uses the column name as a key and the name of the preprocessing method as a value to preproc. We will describe the choices for preprocessing methods in the following subsections.
If a is the preprocessing method for numerical features and b is the preprocessing method for categorical features, specifying preproc=(a, b) will apply a to all numerical features and b to all categorical features[1]. The default is preproc=('qscale', 'onehot'). preproc=(a, 'onehot') can be abbreviated as preproc=a, and similarly, preproc=('qscale', b) can be abbreviated as preproc=b.
[1] The order in the tuple is not important, and specifying preproc=(b, a) will yield the same results. In addition, it is possible to apply multiple preprocessing methods to the same type of feature. For instance, it is permissible to specify preproc=('onehot', 'onecold', 'cut5', 'qcut4').
1.1 Methods for numerical features
Preprocessing of numerical features can be classified into scaling and binarization. In general, scaling is better when there is a monotonic relationship between the number and the target label, such as “The larger the number, the higher the probability of label 1.” or “The smaller the number, the higher the probability of label 1.”, and binarization is better for extracting more complex relationships.
Scaling
Normalizes the numeric values to the range of [0, 1]. In order for Wide Learning to discover both the property that appears stronger as the value increases and the property that appears stronger as the value decreases, it generates two normalized numerical features from the original feature: one that has a positive correlation with the original (minimum mapped to 0, maximum mapped to 1) and the other that has a negative correlation (minimum mapped to 1, maximum mapped to 0).
There are two scaling options: 'scale', which converts values linearly with MinMaxScaler(clip=True), and 'qscale', which converts values into a uniform distribution with QuantileTransformer. Choose 'scale' to reflect the original values directly in the classification model, or 'qscale' to reduce the effects of outliers.
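As a rough illustration of the difference, here is a sketch using scikit-learn's MinMaxScaler and QuantileTransformer directly (not WidePreprocessor itself, whose internals may differ), including the negatively correlated twin feature described above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

x = np.array([[1.0], [2.0], [3.0], [10.0]])  # 10.0 acts as an outlier

# 'scale'-like: linear mapping of [min, max] onto [0, 1].
scaled = MinMaxScaler(clip=True).fit_transform(x)
# 'qscale'-like: map values onto their (approximately uniform) ranks.
qscaled = QuantileTransformer(n_quantiles=4).fit_transform(x)
# The negatively correlated variant maps min -> 1 and max -> 0.
scaled_down = 1.0 - scaled

print(scaled.ravel())   # the outlier compresses the other values near 0
print(qscaled.ravel())  # roughly evenly spaced regardless of the outlier
```

With 'scale' the outlier 10.0 squeezes the values 1–3 into a narrow band near 0, while 'qscale' spreads the four values evenly, matching the advice above.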
>>> X_train.columns
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')
>>> wl = WideLearnerClient('scale').fit(X_train, y_train)
>>> wl.features_
array(['sepal length (cm)↗', 'sepal length (cm)↘', 'sepal width (cm)↗',
       'sepal width (cm)↘', 'petal length (cm)↗', 'petal length (cm)↘',
       'petal width (cm)↗', 'petal width (cm)↘'], dtype=object)
>>> wl = WideLearnerClient(preproc='qscale').fit(X_train, y_train)
>>> wl.features_
array(['sepal length (cm)⇧', 'sepal length (cm)⇩', 'sepal width (cm)⇧',
       'sepal width (cm)⇩', 'petal length (cm)⇧', 'petal length (cm)⇩',
       'petal width (cm)⇧', 'petal width (cm)⇩'], dtype=object)
Binarization
Generates a binary vector of {0, 1} values by discretizing the numbers into k intervals according to the specified discretization method and the integer k ≥ 2, then encoding the membership of each interval in the “cut” or “bin” manner. “cut” generates at most (k − 1) × 2 binary features that indicate whether the number is less than or not less than each cut point, and “bin” generates at most k binary features that are one-hot representations of whether the number belongs to each interval[2]. In Wide Learning, “cut” is more commonly used because it allows feature combinations to express various intervals.
[2] The final number of intervals may be less than k because the intervals without training samples are merged into adjacent ones.
“cutk” ('cut2', 'cut3', …) and “bink” ('bin2', 'bin3', …) divide the range of values in the column into k intervals of the same width[3]. “qcutk” ('qcut2', 'qcut3', …) and “qbink” ('qbin2', 'qbin3', …) divide the range of values in the column into k intervals that contain approximately the same number of training samples. The relationship between “cutk” and “qcutk” (“bink” and “qbink”) is similar to the relationship between 'scale' and 'qscale'. “cutk” and “bink” are suitable for features that should be focused on the original values themselves, and “qcutk” and “qbink” are suitable for features that should be focused on the relative magnitude of values.
[3] The final intervals will not have exactly the same width because the cutpoints are rounded to appropriate values between the values of the nearest upper and lower training samples.
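The difference between equal-width (“cutk”-style) and equal-frequency (“qcutk”-style) cutpoints can be sketched with pandas (an analogy only; the library's cutpoint rounding described in the footnote is not reproduced):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 skews the value range

# Equal-width intervals ("cutk"-like): the outlier stretches the bins.
width_bins = pd.cut(s, bins=3)
# Equal-frequency intervals ("qcutk"-like): each bin holds ~the same count.
freq_bins = pd.qcut(s, q=3)

print(width_bins.value_counts().sort_index())
print(freq_bins.value_counts().sort_index())
```

Here equal-width binning puts five of the six values into the first bin, while equal-frequency binning assigns two values to each bin.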
“ecutk” ('ecut2', 'ecut3', …) and “ebink” ('ebin2', 'ebin3', …) are supervised discretization methods, which determine the cutpoints by calculating the entropy so that the correlation between the discretized feature and the target is maintained. Note that interactions with other features are not taken into account, but this method often helps improve the accuracy of the classification model.
>>> wl = WideLearnerClient(
...     preproc={
...         'sepal length (cm)': 'cut3',
...         'sepal width (cm)': 'qcut3',
...         'petal length (cm)': 'ecut3',
...         'petal width (cm)': 'ebin3',
...     }
... ).fit(X_train, y_train)
>>> wl.features_
array(['sepal length (cm)<5.5', 'sepal length (cm)≥5.5',
       'sepal length (cm)<6.7', 'sepal length (cm)≥6.7',
       'sepal width (cm)<2.9', 'sepal width (cm)≥2.9',
       'sepal width (cm)<3.2', 'sepal width (cm)≥3.2',
       'petal length (cm)<2.0', 'petal length (cm)≥2.0',
       'petal length (cm)<4.8', 'petal length (cm)≥4.8',
       'petal width (cm)<0.8', '0.8≤petal width (cm)<1.8',
       'petal width (cm)≥1.8'], dtype=object)
1.2 Methods for categorical features
The standard preprocessing method for a categorical feature would be 'onehot', which generates a binary vector that represents the category in one-hot code. This method is simple and easy to interpret, and has the effect of reducing the computational cost of enumeration and optimization in later steps of the pipeline.
However, when applied to a feature with a large number of categories, it generates only items with small support, and Wide Learning may not produce a good analysis. One way to solve this problem with manual feature engineering is to reduce the number of categories by grouping them. For better analysis, it is a good idea to group the categories in different ways to create multiple features from one original feature. Then, Wide Learning will automatically find useful combinations of the groups.
Another preprocessing method, 'onecold', has the effect of automating the grouping of categories at the expense of the computational cost of the KC enumeration. It generates a binary vector which swaps 0 and 1 from the result of 'onehot'[4]. That is, each binary feature indicates that the sample value is not in a specific category. It allows feature combinations to represent any subset of the categorical values.
[4] Features with no more than two distinct values are processed in the same way as with 'onehot'.
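A hedged sketch of the two encodings using pandas.get_dummies (not WidePreprocessor itself; the special case for binary features in footnote [4] is not reproduced):

```python
import pandas as pd

s = pd.Series(['a0', 'a1', 'a2', 'a0'], name='A')

# 'onehot'-like: one binary column per category.
onehot = pd.get_dummies(s, prefix='A').astype(int)
# 'onecold'-like: swap 0 and 1, so each column means "not this category".
onecold = 1 - onehot

# A conjunction of 'onecold' columns can express any subset of categories,
# e.g. (A != 'a1') AND (A != 'a2') selects exactly the group {a0}.
print(onehot)
print(onecold)
```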
>>> import random
>>> import pandas as pd
>>> pd.set_option('display.width', 999)
>>> pd.set_option('display.max_colwidth', 999)
>>> pd.set_option('display.max_columns', 999)
>>> X = pd.DataFrame(
...     {
...         'A': random.choices(['a0', 'a1', 'a2'], k=100),  # categorical
...         'B': random.choices(['b0', 'b1'], k=100),  # categorical
...         'C': random.choices([0, 1, 2], k=100),  # numeric
...         'D': random.choices(['0', '1', '2'], k=100),  # categorical
...     }
... )
>>> X.dtypes
A object
B object
C int64
D object
dtype: object
>>> y = [0] * 50 + [1] * 50
>>> wl = WideLearnerClient().fit(X, y)
>>> wl.features_
array(["A='a0'", "A='a1'", "A='a2'", "B='b0'", "B='b1'", 'C⇧', 'C⇩',
       "D='0'", "D='1'", "D='2'"], dtype=object)
>>> wl = WideLearnerClient(preproc=('cut3', 'onecold')).fit(X, y)
>>> wl.features_
array(["A≠'a0'", "A≠'a1'", "A≠'a2'", "B='b0'", "B='b1'", 'C<1', 'C≥1',
       'C<2', 'C≥2', "D≠'0'", "D≠'1'", "D≠'2'"], dtype=object)
2. KC enumeration
In this step, the subsets of preprocessed features (itemsets) are enumerated as knowledge chunks (KCs). For a binary classification problem with labels C0 and C1, KCs are enumerated for the two target classes “C0” and “C1”. For a classification problem with three or more labels, KCs are enumerated for the two target classes “C” and “¬C (other than C)” for each label C.
A KC is an itemset that:
- satisfies all user-specified constraints,
- has a higher confidence score than its subsets, and
- has a top-k mutual information score among the itemsets satisfying the above conditions.
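To make these scores concrete, here is a toy sketch computing support, confidence, and normalized mutual information for a single itemset with numpy and scikit-learn (an illustration only; the exact score definitions used by the miner may differ):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Binary item values for two preprocessed features and the target (toy data).
A = np.array([1, 1, 1, 0, 0, 1, 0, 1])
B = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y = np.array([1, 0, 1, 0, 0, 1, 0, 0])

itemset = A & B  # the itemset {A, B} holds where both items are 1

supp = itemset.mean()          # support: fraction of samples covered
conf = y[itemset == 1].mean()  # confidence: P(target | itemset holds)
nmi = normalized_mutual_info_score(y, itemset)

print(supp, conf, nmi)
```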
The number of KCs in each target class can be approximately limited using chunk_limit[5]. You can set this to a small value to get a simple model. Setting a larger value may improve the accuracy of the classification model, but it may also blow up the processing time of the optimization step.
[5] Exact top-k computation is omitted for speedup.
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> # using solver=None to skip the optimization step
>>> wl = WideLearnerClient('cut7', max_len=None, chunk_limit=1, solver=None).fit(X, y)
>>> wl.n_chunks(all=True)
13
>>> wl.chunkdata(all=True)
weight len npos nneg supp conf chi2 nmi
label chunk
0 petal width (cm)<0.8 1.0 1 50 0 1.00 1.000000 150.000000 1.000000
petal length (cm)<2.0 1.0 1 50 0 1.00 1.000000 150.000000 1.000000
¬0 petal width (cm)≥0.8 -1.0 1 100 0 1.00 1.000000 150.000000 1.000000
petal length (cm)≥2.0 -1.0 1 100 0 1.00 1.000000 150.000000 1.000000
petal length (cm)≥1.8 ∧ petal width (cm)≥0.5 -1.0 2 100 0 1.00 1.000000 150.000000 1.000000
1 petal length (cm)<5.3 ∧ petal width (cm)≥0.8 ∧ petal width (cm)<1.9 1.0 2 50 8 1.00 0.862069 118.965517 0.756287
petal length (cm)≥2.0 ∧ petal length (cm)<5.3 ∧ petal width (cm)<1.9 1.0 2 50 8 1.00 0.862069 118.965517 0.756287
petal length (cm)≥1.8 ∧ petal length (cm)<5.3 ∧ petal width (cm)≥0.5 ∧ petal width (cm)<1.9 1.0 2 50 8 1.00 0.862069 118.965517 0.756287
sepal length (cm)≥4.9 ∧ sepal width (cm)<3.8 ∧ petal length (cm)≥1.8 ∧ petal length (cm)<5.3 ∧ petal width (cm)<1.9 1.0 4 50 8 1.00 0.862069 118.965517 0.756287
¬1 petal width (cm)<0.8 -1.0 1 50 0 0.50 1.000000 37.500000 0.274018
petal length (cm)<2.0 -1.0 1 50 0 0.50 1.000000 37.500000 0.274018
2 petal length (cm)≥4.4 ∧ petal width (cm)≥1.5 1.0 2 49 14 0.98 0.777778 96.551724 0.593289
¬2 petal length (cm)<5.3 ∧ petal width (cm)<1.9 -1.0 2 100 8 1.00 0.925926 116.666667 0.701315
>>> wl = WideLearnerClient('cut7', max_len=None, chunk_limit=100, solver=None).fit(X, y)
>>> wl.n_chunks(all=True)
548
Parameters for optional constraints include max_len, min_npos, max_neg, min_supp, min_conf, min_chi2, min_nmi. See the API reference of WideLearnerClient for more information.
3. Weight optimization
The classification model of WideLearnerClient consists of one or more linear functions of the KC values[6]. Weight optimization is the step of determining the coefficients (weights) and intercepts[7] in those linear functions.
[6] The value of a KC is the product of the values of the items that make it up. For example, for a sample with item values A = 0.5 and B = 0.8, a KC of items {A, B} has a value of 0.5 × 0.8 = 0.4. When all items in a KC are binary, this is the same as a logical AND of the binary values. For example, a KC of binary items {C, D} has a value of 1 if and only if C ∧ D. The value of an empty KC {} is always 1.
[7] In chunkdata() and plot_chunks(), the intercept is displayed as the weight of an empty KC {}. Note that the empty KC does not satisfy the enumeration conditions in the previous section.
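The product rule for KC values is a one-liner; a minimal numpy sketch:

```python
import numpy as np

# Per-sample item values after preprocessing (columns: items A and B).
items = np.array([
    [0.5, 0.8],  # fractional values: the KC value is their product
    [1.0, 1.0],  # binary values: the product acts as a logical AND
    [1.0, 0.0],
])

kc_value = items.prod(axis=1)
print(kc_value)  # per-sample KC values: 0.4, 1.0, 0.0
```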
Setting fit_intercept=False results in making a linear model where the intercept is fixed at zero.
Parameter l1_ratio takes a value in [0, 1] and specifies the mixing ratio of L1 and L2 regularization in Elastic-Net. l1_ratio=0 is equivalent to Ridge (L2 only) and l1_ratio=1 is equivalent to Lasso (L1 only). Therefore, the closer the value is to 1, the stronger the effect of selecting one KC from a group of highly correlated KCs.
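The effect of l1_ratio on sparsity can be illustrated with scikit-learn's plain LogisticRegression (a hedged analogy: per the pipeline description, WideLearnerClient actually delegates to LogitNet or LogisticRegressionCV, and fits KC values rather than raw features):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # keep features in [0, 1]

# Count exactly-zero coefficients for Ridge-like, mixed, and Lasso-like fits.
n_zero = {}
for ratio in (0.0, 0.5, 1.0):
    clf = LogisticRegression(
        penalty='elasticnet', solver='saga', l1_ratio=ratio,
        C=0.1, max_iter=5000,
    ).fit(X, y)
    n_zero[ratio] = int((clf.coef_ == 0).sum())

print(n_zero)  # the Lasso-like fit zeroes out the most coefficients
```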
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> y_0 = y == 0
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y_0, test_size=0.33, stratify=y_0, random_state=0
... )
>>> # Lasso with intercept
>>> wl = WideLearnerClient(
...     'scale', fit_intercept=True, l1_ratio=1, random_state=1
... ).fit(X_train, y_train)
>>> wl.score(X_test, y_test)
1.0
>>> wl.chunkdata()
weight len npos nneg supp conf chi2 nmi
label chunk
True petal length (cm)↘ × petal width (cm)↘ 23.015085 2 28.439972 9.099576 0.861817 0.757600 49.701686 0.414764
sepal width (cm)↗ × petal length (cm)↘ 4.604287 2 18.795904 7.536723 0.569573 0.713788 23.812898 0.181879
False -14.374686 0 67.000000 33.000000 1.000000 0.670000 0.000000 0.000000
>>> # Lasso without intercept
>>> wl = WideLearnerClient(
...     'scale', fit_intercept=False, l1_ratio=1, random_state=1
... ).fit(X_train, y_train)
>>> wl.score(X_test, y_test)
1.0
>>> wl.chunkdata()
weight len npos nneg supp conf chi2 nmi
label chunk
True petal length (cm)↘ × petal width (cm)↘ 13.001155 2 28.439972 9.099576 0.861817 0.757600 49.701686 0.414764
False petal width (cm)↗ -1.867848 1 44.041667 2.166667 0.657338 0.953111 31.140801 0.283293
sepal width (cm)↘ -6.425330 1 42.833333 12.583333 0.639303 0.772932 5.956385 0.047081
petal length (cm)↗ -10.316972 1 44.457627 2.576271 0.663547 0.945225 30.422906 0.272887
>>> # Elastic-Net without intercept
>>> wl = WideLearnerClient(
...     'scale', fit_intercept=False, l1_ratio=0.5, random_state=1
... ).fit(X_train, y_train)
>>> wl.score(X_test, y_test)
1.0
>>> wl.chunkdata()
weight len npos nneg supp conf chi2 nmi
label chunk
True petal length (cm)↘ × petal width (cm)↘ 3.141001 2 28.439972 9.099576 0.861817 0.757600 49.701686 0.414764
sepal length (cm)↘ × petal length (cm)↘ × petal width (cm)↘ 3.005075 3 22.800912 5.027521 0.690937 0.819339 41.759278 0.329134
sepal width (cm)↗ × petal length (cm)↘ × petal width (cm)↘ 1.562555 3 17.523717 2.768597 0.531022 0.863564 32.780539 0.253875
sepal width (cm)↗ × petal length (cm)↘ 1.160689 2 18.795904 7.536723 0.569573 0.713788 23.812898 0.181879
sepal length (cm)↘ × petal length (cm)↘ 1.121336 2 24.354905 11.407807 0.738027 0.681014 31.024387 0.246821
sepal length (cm)↘ × sepal width (cm)↗ × petal length (cm)↘ × petal width (cm)↘ 1.110712 4 13.657534 1.377991 0.413865 0.908351 26.771670 0.208692
sepal width (cm)↗ × petal width (cm)↘ 1.107409 2 19.024306 7.493056 0.576494 0.717428 24.498501 0.187346
sepal length (cm)↘ × petal width (cm)↘ 1.047620 2 24.698232 11.502525 0.748431 0.682257 31.844430 0.254197
sepal length (cm)↘ × sepal width (cm)↗ × petal length (cm)↘ 0.865064 3 14.625535 3.440699 0.443198 0.809551 22.934289 0.173656
sepal length (cm)↘ × sepal width (cm)↗ × petal width (cm)↘ 0.835394 3 14.818550 3.396991 0.449047 0.813511 23.550275 0.178560
petal length (cm)↘ 0.509627 1 30.423729 22.542373 0.921931 0.574400 30.422906 0.272887
petal width (cm)↘ 0.358236 1 30.833333 22.958333 0.934343 0.573199 31.140801 0.283293
sepal length (cm)↘ × sepal width (cm)↗ 0.185292 2 15.877525 9.618687 0.481137 0.622741 13.263995 0.100278
sepal width (cm)↗ 0.099418 1 20.416667 24.166667 0.618687 0.457944 5.956385 0.047081
False sepal length (cm)↗ × sepal width (cm)↘ × petal width (cm)↗ -0.106405 3 15.789247 0.137574 0.235660 0.991362 8.848554 0.100453
sepal length (cm)↗ × sepal width (cm)↘ × petal length (cm)↗ -0.141141 3 16.110448 0.168464 0.240454 0.989651 8.985762 0.101108
sepal length (cm)↗ × petal width (cm)↗ -0.466383 2 26.241162 0.470960 0.391659 0.982369 16.085091 0.168928
sepal length (cm)↗ × petal length (cm)↗ -0.524704 2 26.562404 0.537237 0.396454 0.980176 16.175589 0.168501
sepal width (cm)↘ × petal length (cm)↗ × petal width (cm)↗ -0.903704 3 18.693385 0.062735 0.279006 0.996655 11.141477 0.128453
sepal length (cm)↗ × sepal width (cm)↘ -1.060094 2 23.148990 2.066919 0.345507 0.918031 9.381863 0.087559
sepal length (cm)↗ -1.138718 1 37.696970 6.606061 0.562641 0.850889 11.771678 0.098219
petal length (cm)↗ × petal width (cm)↗ -1.537311 2 30.598870 0.182910 0.456700 0.994058 21.121739 0.227261
sepal width (cm)↘ × petal width (cm)↗ -3.110567 2 27.368056 0.774306 0.408478 0.972486 16.207266 0.164727
sepal width (cm)↘ × petal length (cm)↗ -3.227391 2 27.827684 0.955508 0.415339 0.966803 16.102934 0.161124
sepal width (cm)↘ -3.787949 1 42.833333 12.583333 0.639303 0.772932 5.956385 0.047081
petal width (cm)↗ -4.165566 1 44.041667 2.166667 0.657338 0.953111 31.140801 0.283293
petal length (cm)↗ -4.331671 1 44.457627 2.576271 0.663547 0.945225 30.422906 0.272887
The regularization strength λ is automatically optimized with cross-validation. The cross-validation generator and the number of folds can be customized with the parameter cv. The cross-validation score can be specified in cv_score.
Pass a dict object to class_weight to specify the training sample weight for each class. Given the string 'balanced', weights are adjusted inversely proportional to the class frequencies in the training samples.
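scikit-learn's compute_class_weight helper implements the same 'balanced' formula, n_samples / (n_classes * count(class)), and can be used to inspect the weights (a sketch only, not part of the widelearning API):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # imbalanced labels

# 'balanced' weight per class: n_samples / (n_classes * count(class))
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority class gets the larger weight
```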
Optimization results are nondeterministic. For reproducible results, fix the random seed by setting random_state.