IMDB
This notebook shows how to train a green-tsetlin Tsetlin Machine (TM) on the IMDB sentiment dataset.
[ ]:
import numpy as np
seed = 42
rng = np.random.default_rng(seed)
With scikit-learn's CountVectorizer, we can transform the reviews into a binary bag-of-words representation.
[ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import datasets
imdb = datasets.load_dataset('imdb')
x, y = imdb['train']['text'], imdb['train']['label']
vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True, lowercase=True, max_features=5000)
vectorizer.fit(x)
x_bin = vectorizer.transform(x).toarray().astype(np.uint8)
y = np.array(y).astype(np.uint32)
shuffle_index = [i for i in range(len(x))]
rng.shuffle(shuffle_index)
x_bin = x_bin[shuffle_index]
y = y[shuffle_index]
x_bin = x_bin[:2500]
y = y[:2500]
train_x_bin, val_x_bin, train_y, val_y = train_test_split(x_bin, y, test_size=0.2, random_state=seed, shuffle=True)
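To make the binary bag-of-words encoding more concrete, here is a minimal pure-Python sketch of the idea on two toy sentences. It is an illustration only, not scikit-learn's implementation: the helpers fit_vocabulary and binary_bow are hypothetical names, and the tokenizer simply mimics CountVectorizer's default behaviour of lowercasing and keeping tokens of two or more word characters.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters


def fit_vocabulary(docs, max_features=None):
    """Hypothetical helper: map each kept term to a column index."""
    counts = {}
    for doc in docs:
        for token in TOKEN_PATTERN.findall(doc.lower()):
            counts[token] = counts.get(token, 0) + 1
    # Keep the most frequent terms, breaking ties alphabetically.
    terms = sorted(counts, key=lambda t: (-counts[t], t))
    if max_features is not None:
        terms = terms[:max_features]
    # Column order is alphabetical over the kept terms.
    return {term: index for index, term in enumerate(sorted(terms))}


def binary_bow(docs, vocab):
    """Hypothetical helper: 1 if the term occurs in the document, else 0."""
    rows = []
    for doc in docs:
        present = set(TOKEN_PATTERN.findall(doc.lower()))
        rows.append([1 if term in present else 0 for term in vocab])
    return rows


docs = ["A great, great movie", "A terrible movie"]
vocab = fit_vocabulary(docs, max_features=5000)
features = binary_bow(docs, vocab)
print(vocab)
print(features)
```

Each row has one bit per vocabulary term, which is the kind of Boolean input a Tsetlin Machine works on. Note that "great" occurring twice still yields a single 1, mirroring binary=True.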
The green-tsetlin library comes in a CPU-heavy and a less CPU-heavy implementation, giving systems with older CPUs a plug-and-play version of the library.
- pip install green-tsetlin
- pip install green-tsetlin[cpu]
The TM has a number of parameters to set. We can optimize them with the built-in Optuna-based optimizer, green_tsetlin.hpsearch.HyperparameterSearch().
HyperparameterSearch:
- Search spaces: Set a desired search space for each parameter. Set the search space to a tuple, e.g. (1, 4) will search between 1 and 4, or to a single value, e.g. 4 will only search on 4. For example, clause_space=(50, 250) or clause_space=125.
- Literal budget: Optimize for a minimum literal budget by setting minimize_literal_budget=True.
- Cross validation: Set k_folds=k to an integer \(k > 2\) to run cross validation k times on each trial.
HyperparameterSearch.optimize:
- Run the optimization over n_trials trials and store the results in a database, e.g. "sqlite:///my_database.db".
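As a rough illustration of the tuple-versus-single-value convention for search spaces, the sketch below draws one candidate value from a space. The helper sample_from_space is hypothetical and is not part of green-tsetlin or Optuna. It also uses plain uniform sampling from Python's random module, whereas Optuna selects values with its own samplers.

```python
import random


def sample_from_space(space, rng):
    """Hypothetical helper: draw one value from a search space.

    A tuple (low, high) is treated as a range to search in;
    any other value is treated as fixed and returned unchanged.
    """
    if isinstance(space, tuple):
        low, high = space
        if isinstance(low, int) and isinstance(high, int):
            return rng.randint(low, high)  # integer range, both ends included
        return rng.uniform(low, high)      # floating-point range
    return space


rng = random.Random(42)
clauses = sample_from_space((50, 250), rng)    # some integer between 50 and 250
fixed_clauses = sample_from_space(125, rng)    # always 125
s_value = sample_from_space((2.0, 20.0), rng)  # some float between 2.0 and 20.0
print(clauses, fixed_clauses, s_value)
```

Passing a single value is a convenient way to pin one parameter while the others are still being searched.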
See the Optuna documentation for details.
[ ]:
from green_tsetlin.hpsearch import HyperparameterSearch
hpsearch = HyperparameterSearch(s_space=(2.0, 20.0),
                                clause_space=(100, 1000),
                                threshold_space=(100, 1500),
                                max_epoch_per_trial=15,
                                literal_budget=(5, 10),
                                k_folds=2,
                                n_jobs=5,
                                seed=42,
                                minimize_literal_budget=False)
hpsearch.set_train_data(train_x_bin, train_y)
hpsearch.set_test_data(val_x_bin, val_y)
hpsearch.optimize(n_trials=30, study_name="IMDB hpsearch", show_progress_bar=True, storage=None)
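The k_folds setting in the cell above asks for cross validation on each trial. The following is a minimal sketch of how k-fold index splitting works in general. It makes no claim about how green-tsetlin implements it internally, and k_fold_indices is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_indices = list(range(start, stop))
        train_indices = list(range(0, start)) + list(range(stop, n_samples))
        yield train_indices, val_indices
        start = stop


folds = list(k_fold_indices(10, 2))
for train_indices, val_indices in folds:
    print("train:", train_indices, "val:", val_indices)
```

Every sample is used for validation exactly once, so a trial's score can be averaged over the k folds instead of depending on a single split.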
We get the results from the best_trials attribute of the search object, hpsearch.best_trials.
See the Optuna documentation for details.
[ ]:
params = hpsearch.best_trials[0].params
performance = hpsearch.best_trials[0].values
print("best parameters: ", params)
print("best score: ", performance)