IMDB

This notebook shows how the green-tsetlin Tsetlin Machine trains on the IMDB sentiment dataset.

[ ]:

import numpy as np

seed = 42
rng = np.random.default_rng(seed)

With scikit-learn's CountVectorizer, we can transform the text into a binary bag-of-words representation.

For example, given the vocabulary [“dogs”, “love”, “ocean”, “swimming”, “biking”, “movie”], the input text “I love swimming in the ocean” is transformed to [0, 1, 1, 1, 0, 0]: each position is 1 if the corresponding vocabulary word occurs in the text and 0 otherwise.
We obtain the vocabulary by fitting the vectorizer on the data, which collects the words / tokens that occur in it. Setting max_features=5000 keeps only the 5000 most frequent tokens.
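The binary bag-of-words encoding can be sketched in plain Python. This is only an illustration of the idea, not how CountVectorizer is implemented; the helper name binary_bag_of_words is hypothetical, and the regular expression mirrors CountVectorizer's default behaviour of keeping tokens with two or more word characters.

```python
import re


def binary_bag_of_words(text, vocabulary):
    """Hypothetical helper: return a 0/1 vector marking which vocabulary words occur in text."""
    # Lowercase the text and keep tokens of two or more word characters,
    # so the single-letter token "I" is dropped.
    tokens = set(re.findall(r"\b\w\w+\b", text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]


vocabulary = ["dogs", "love", "ocean", "swimming", "biking", "movie"]
vector = binary_bag_of_words("I love swimming in the ocean", vocabulary)
print(vector)  # [0, 1, 1, 1, 0, 0]
```

Words that are not in the vocabulary, such as “in” and “the” here, are simply ignored.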

[ ]:

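from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import datasets

# Load the IMDB reviews and their sentiment labels.
imdb = datasets.load_dataset('imdb')
x, y = imdb['train']['text'], imdb['train']['label']

# Fit a binary unigram bag-of-words vocabulary limited to the 5000 most frequent tokens.
vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True, lowercase=True, max_features=5000)
vectorizer.fit(x)

# The TM expects dense binary features, so convert the sparse matrix to uint8.
x_bin = vectorizer.transform(x).toarray().astype(np.uint8)
y = np.array(y).astype(np.uint32)

# Shuffle features and labels with the same seeded index.
shuffle_index = np.arange(len(x))
rng.shuffle(shuffle_index)

x_bin = x_bin[shuffle_index]
y = y[shuffle_index]

# Keep a subset of 2500 samples to keep the hyperparameter search fast.
x_bin = x_bin[:2500]
y = y[:2500]

# Hold out 20% of the subset for validation.
train_x_bin, val_x_bin, train_y, val_y = train_test_split(x_bin, y, test_size=0.2, random_state=seed, shuffle=True)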

The green-tsetlin library offers both a CPU-heavy and a less CPU-heavy implementation, giving systems with older CPUs a plug-and-play version of the library.
- pip install green-tsetlin
- pip install green-tsetlin[cpu]

The TM has a number of hyperparameters to set. We can tune them with the built-in Optuna-based optimizer, green_tsetlin.hpsearch.HyperparameterSearch.

HyperparameterSearch:

Search spaces: Set a desired search space for each parameter. A tuple, e.g. (1, 4), searches between 1 and 4; a single value, e.g. 4, fixes the parameter at 4. For example, clause_space=(50, 250) or clause_space=125.
Literal budget: Optimize for a minimal literal budget by setting minimize_literal_budget=True.
Cross validation: Set k_folds=k to an integer \(k > 2\) to run cross validation with k folds on each trial.
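The tuple-or-single-value convention for search spaces can be illustrated with a small sketch. The helper sample_from_space is hypothetical and is not part of green-tsetlin; it also assumes integer parameters and that both ends of the range are inclusive, which the library may handle differently.

```python
import random


def sample_from_space(space, rng):
    """Hypothetical helper: draw one value from a (low, high) tuple, or return a fixed value."""
    if isinstance(space, tuple):
        low, high = space
        # Assumption: both bounds are inclusive.
        return rng.randint(low, high)
    # A single value means the parameter is not searched at all.
    return space


sampler = random.Random(42)
searched = sample_from_space((50, 250), sampler)  # some value between 50 and 250
fixed = sample_from_space(125, sampler)           # always 125
print(searched, fixed)
```

Fixing a parameter to a single value shrinks the search, which can help when only a few trials are affordable.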

HyperparameterSearch.optimize:

Runs the optimization for n_trials trials. The results can be stored in a database by passing a storage URL, e.g. "sqlite:///my_database.db".

See the Optuna documentation for details.

[ ]:

from green_tsetlin.hpsearch import HyperparameterSearch

# Each search space is a (low, high) tuple; a single value would fix the parameter.
hpsearch = HyperparameterSearch(s_space=(2.0, 20.0),
                                clause_space=(100, 1000),
                                threshold_space=(100, 1500),
                                max_epoch_per_trial=15,
                                literal_budget=(5, 10),
                                k_folds=2,
                                n_jobs=5,
                                seed=seed,
                                minimize_literal_budget=False)

hpsearch.set_train_data(train_x_bin, train_y)
hpsearch.set_test_data(val_x_bin, val_y)

# storage=None does not use a database; pass e.g. "sqlite:///my_database.db" to store the study.
hpsearch.optimize(n_trials=30, study_name="IMDB hpsearch", show_progress_bar=True, storage=None)

We get the results from the best_trials attribute of the HyperparameterSearch instance.

See the Optuna documentation for details on trials.

[ ]:

# Take the first of the best trials found by the search.
best_trial = hpsearch.best_trials[0]
params = best_trial.params
performance = best_trial.values

print("best parameters: ", params)
print("best score: ", performance)