Usage
fujitsu-automl is provided as one of the plugins of sapientml. The usage of Fujitsu AutoML is almost the same as that of SapientML, except for the authentication required when calling APIs.
sapientml generates source code to train and predict a machine learning model from a CSV-formatted dataset and requirements of a machine learning task to be solved.
SapientML class
sapientml provides the SapientML class, which is the top-level API of SapientML. In the constructor of SapientML, you first need to set target_columns as a requirement of the task. target_columns specifies which columns the task is to predict. Second, you can set task_type to classification or regression as the type of machine learning task. You can also skip setting task_type, in which case SapientML automatically suggests a task type by looking into the values of the target columns.
from sapientml import SapientML
cls = SapientML(
    model_type="fujitsu-automl",
    target_columns=["survived"],
    task_type=None,  # suggested automatically from the target columns
)
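To make the idea of an automatic suggestion concrete, here is a minimal sketch of one way a task type could be inferred from target values. The function suggest_task_type and its max_classes threshold are hypothetical and are not SapientML's actual logic, which this page does not describe.

```python
from typing import Sequence


def suggest_task_type(values: Sequence[object], max_classes: int = 20) -> str:
    """Hypothetical heuristic, NOT SapientML's real implementation.

    Non-numeric targets, or numeric targets with only a few distinct
    integer-like values, are treated as classification; everything else
    is treated as regression.
    """
    distinct = set(values)
    # bool is a subclass of int, so exclude it and treat it as categorical.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if not numeric:
        return "classification"
    integer_like = all(float(v).is_integer() for v in distinct)
    if integer_like and len(distinct) <= max_classes:
        return "classification"
    return "regression"


print(suggest_task_type([0, 1, 1, 0]))         # classification
print(suggest_task_type([7.25, 71.28, 8.05]))  # regression
```

If the suggestion does not match your intent, setting task_type explicitly in the constructor avoids relying on any such inference.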
Like model classes in other well-known libraries such as scikit-learn, SapientML provides fit and predict to train a model and make predictions using the generated code.
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
train_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
train_data, test_data = train_test_split(train_data)
y_true = test_data["survived"].reset_index(drop=True)
test_data = test_data.drop(columns=["survived"])  # avoid in-place edits on a split result
cls.fit(train_data, output_dir="./outputs")
y_pred = cls.predict(test_data)
print(f"F1 score: {f1_score(y_true, y_pred)}")
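For readers unfamiliar with the metric printed above, the following is a minimal pure-Python sketch of the binary F1 score, F1 = 2TP / (2TP + FP + FN). The helper binary_f1 is illustrative only; the example above uses scikit-learn's f1_score, whose default settings also treat label 1 as the positive class.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Illustrative binary F1 score; returns 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two true positives, one false positive, one false negative.
print(binary_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```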
Generated source code
After calling fit, you can find the generated source code in the ./outputs folder. Here is an example of the files generated by fit:
outputs
├── 1_script.py
├── 2_script.py
├── 3_script.py
├── final_predict.py
├── final_script.out.json
├── final_script.py
├── final_train.py
└── lib
    └── sample_dataset.py
1_script.py, 2_script.py, and 3_script.py are scripts for hold-out validation using the preprocessors and the top-3 most plausible models.
final_script.py is the script that selects the model that actually achieved the highest score among the top-3 models, and final_script.out.json contains its score.
final_train.py is the script for training the selected model, and final_predict.py is the script for prediction using the model trained by final_train.py.
The lib folder contains modules that the above scripts use.
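A quick way to check what fit produced is to list the output folder and load the score file. The sketch below uses only the standard library; the helper name inspect_outputs is hypothetical, and because the layout of final_script.out.json is not documented here, the JSON is returned as-is without assuming any keys.

```python
import json
from pathlib import Path
from typing import Any, List, Optional, Tuple


def inspect_outputs(output_dir: str) -> Tuple[List[str], Optional[Any]]:
    """List generated files and load final_script.out.json if it exists."""
    root = Path(output_dir)
    if not root.is_dir():
        return [], None
    # Relative POSIX-style paths, sorted for stable output.
    files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    score_file = root / "final_script.out.json"
    score = None
    if score_file.exists():
        # The schema is not assumed; the parsed JSON is returned unchanged.
        score = json.loads(score_file.read_text(encoding="utf-8"))
    return files, score


if __name__ == "__main__":
    generated, score = inspect_outputs("./outputs")
    for name in generated:
        print(name)
    print("score file content:", score)
```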
Using generated code as a model
After calling fit, you can also get cls.model, which is a GeneratedModel instance that contains the generated source code and the .pkl files of the preprocessors and the actual machine learning model. The instance also acts as a usual model providing fit and predict.
cls.fit(train_data)
model = cls.model # obtains GeneratedModel instance
You can get the set of source code and .pkl files by referring to model.files or by looking into the ./model folder after calling model.save("./model"). Here is an example of the files contained in GeneratedModel:
model
├── final_predict.py
├── final_train.py
├── lib
│   └── sample_dataset.py
├── model.pkl
├── ordinalEncoder.pkl
├── simpleimputer-numeric.pkl
└── simpleimputer-string.pkl
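As an illustration of how a set of in-memory files can end up as the folder above, here is a minimal sketch. It assumes, without confirmation from this page, that the files are held as a mapping from relative file name to bytes content; save_files is a hypothetical stand-in and not the implementation of model.save.

```python
from pathlib import Path
from typing import Dict


def save_files(files: Dict[str, bytes], output_dir: str) -> None:
    """Write each in-memory file under output_dir, creating subfolders as needed.

    Assumption: keys are relative paths such as "lib/sample_dataset.py".
    """
    root = Path(output_dir)
    for name, content in files.items():
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
```

With the real library, calling model.save("./model") is the documented way to obtain these files.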
The actual behavior of model.fit is to run final_train.py in a subprocess.
Beware that model.fit(another_train_data) does not retrain the existing model but builds a new one.
model.predict likewise creates a subprocess that executes final_predict.py.
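The subprocess pattern described above can be sketched with the standard library as follows. The helper run_script is hypothetical; how GeneratedModel actually launches the scripts and passes data to them is not stated here, so this only shows the general technique of running a generated script in a child Python process.

```python
import subprocess
import sys
from pathlib import Path


def run_script(script: Path, workdir: Path) -> str:
    """Run a Python script in a child process and return its standard output.

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        [sys.executable, str(script.resolve())],  # same interpreter as the parent
        cwd=workdir,           # relative paths in the script resolve from here
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Because each call starts a fresh process, nothing is kept in memory between runs, which is consistent with fit building a new model rather than updating the existing one.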
SapientML provides a utility function to restore a SapientML instance from a generated model.
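import pickle

cls.fit(train_data)
with open("model.pkl", "wb") as f:
    pickle.dump(cls.model, f)  # save the GeneratedModel held by the fitted instance
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
sml = SapientML.from_pretrained(model)  # restore a SapientML instance from the model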