Contextual Bandits

This is the documentation page for the Python package contextualbandits. For more details, see the project’s GitHub page:

https://www.github.com/david-cortes/contextualbandits/

Installation

The package is available on PyPI and can be installed with:

pip install contextualbandits

If installation fails because the C code cannot be compiled, an earlier pure-Python version can be installed with:

pip install contextualbandits==0.1.8.5

Getting started

You can find user guides with detailed examples in the following links:

Online Contextual Bandits

Off-policy Learning in Contextual Bandits

Policy Evaluation in Contextual Bandits

Online Contextual Bandits

Hint: if in doubt about where to start or which method to choose, the safest bet is BootstrappedUCB.

Policy classes - the first one from each group is the recommended one to use:

ActiveExplorer

class contextualbandits.online.ActiveExplorer(base_algorithm, nchoices, f_grad_norm='auto', case_one_class='auto', active_choice='weighted', explore_prob=0.15, decay=0.9997, beta_prior='auto', smoothing=None, noise_to_smooth=True, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, random_state=None, njobs=-1)

Active Explorer

Selects a proportion of actions according to an active learning heuristic based on gradient. Works only for differentiable and preferably smooth functions.

Note

Here, for the predictions that are made according to an active learning heuristic (these are selected at random, just like in Epsilon-Greedy), the guiding heuristic is the gradient that the observation would produce on each arm’s classifier under either label (either weighted by the estimated probability, or taking the maximum or minimum), given the current coefficients for that model. This of course requires being able to calculate gradients - the package comes with pre-defined gradient functions for linear and logistic regression, and allows passing custom functions for other models.
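
For illustration, here is a minimal sketch of what such a custom gradient function could look like for a logistic-regression base classifier. It assumes the ‘pred’ argument holds the predicted probability of the positive class (an assumption about the contract - the package’s pre-defined ‘auto’ functions are the reference), and the function name is hypothetical:

import numpy as np

def logistic_grad_norm(base_algorithm, X, pred):
    # For log-loss, the per-sample gradient w.r.t. (coefficients, intercept)
    # is (p - y) * [x, 1], so its norm factors as |p - y| * sqrt(||x||^2 + 1).
    row_norms = np.sqrt(np.einsum("ij,ij->i", X, X) + 1.0)
    out = np.empty((X.shape[0], 2))
    out[:, 0] = pred * row_norms          # gradient norm if the label were negative
    out[:, 1] = (1.0 - pred) * row_norms  # gradient norm if the label were positive
    return out

Such a function would be passed as f_grad_norm=logistic_grad_norm in the constructor described below.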

Parameters
  • base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1

    2. A ‘decision_function’ method with unbounded outputs (n_samples,) to which it will apply a sigmoid function.

    3. A ‘predict’ method with outputs (n_samples,) with values in [0,1].

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • f_grad_norm (str ‘auto’ or function(base_algorithm, X, pred) -> array (n_samples, 2)) – Function that calculates the row-wise norm of the gradient from observations in X if their class were negative (first column) or positive (second column). Can also use different functions for each arm, in which case it accepts them as a list of functions with length equal to nchoices. The option ‘auto’ will only work with scikit-learn’s ‘LogisticRegression’, ‘SGDClassifier’ (log-loss only), and ‘RidgeClassifier’; with stochQN’s ‘StochasticLogisticRegression’; and with this package’s ‘LinearRegression’.

  • case_one_class (str ‘auto’, ‘zero’, None, or function(X, n_pos, n_neg, rng) -> array(n_samples, 2)) – If some arm/choice/class has only rewards of one type, many models will fail to fit, and consequently the gradients will be undefined. Likewise, if the model has not been fit, the gradient might also be undefined, and this requires a workaround.

    • If passing ‘None’, will assume that base_algorithm can be fit to data of only-positive or only-negative class without problems, and that it can calculate gradients and predictions with a base_algorithm object that has not been fitted. Be aware that the methods ‘predict’, ‘predict_proba’, and ‘decision_function’ in base_algorithm might be overwritten with another method that wraps it in a try-catch block, so don’t rely on it producing errors when unfitted.

    • If passing a function, will take the output of it as the row-wise gradient norms when it compares them against other arms/classes, with the first column having the values if the observations were of negative class, and the second column if they were positive class. The other inputs to this function are the number of positive and negative examples that have been observed, and a Generator object from NumPy to use for generating random numbers.

    • If passing a list, will assume each entry is a function as described above, to be used with each corresponding arm.

    • If passing ‘auto’, will generate random numbers:

      • negative: ~ Gamma(log10(n_features) / (n_pos+1)/(n_pos+n_neg+2), log10(n_features)).

      • positive: ~ Gamma(log10(n_features) * (n_pos+1)/(n_pos+n_neg+2), log10(n_features)).

    • If passing ‘zero’, it will output zero whenever models have not been fitted.

    Note that the theoretically correct approach for a logistic regression would be to assume models with all-zero coefficients, in which case the gradient is defined in the absence of any data, but this tends to produce bad end results.

  • active_choice (str in {‘min’, ‘max’, ‘weighted’}) – How to calculate the gradient that an observation would have on the loss function for each classifier, given that it could be either class (positive or negative) for the classifier that predicts each arm. If weighted, they are weighted by the same probability estimates from the base algorithm.

  • explore_prob (float (0,1)) – Probability of selecting an action according to active learning criteria.

  • decay (float (0,1)) – After each prediction, the probability of selecting an arm according to active learning criteria is set to p = p*decay

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are less than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’.

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.
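
As a usage sketch (with purely synthetic data - the variable names are only illustrative), the typical cycle is to fit the policy to logged actions and rewards, then let it choose actions for new observations:

import numpy as np
from sklearn.linear_model import LogisticRegression
from contextualbandits.online import ActiveExplorer

rng = np.random.default_rng(123)
nchoices, n_features = 5, 10
X = rng.standard_normal((1000, n_features))
a = rng.integers(nchoices, size=1000)   # actions chosen by some earlier policy
r = rng.integers(2, size=1000)          # binary rewards observed for those actions

policy = ActiveExplorer(LogisticRegression(), nchoices=nchoices,
                        explore_prob=0.15, random_state=123)
policy.fit(X, a, r)                     # fit to the partially-labeled history
actions = policy.predict(rng.standard_normal((10, n_features)))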

References

1. Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with a different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’ documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ in here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object
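
A short sketch of named arms with these two methods (the arm names are hypothetical):

from sklearn.linear_model import LogisticRegression
from contextualbandits.online import ActiveExplorer

# passing a list as 'nchoices' names the arms; 'predict' then outputs these names
policy = ActiveExplorer(LogisticRegression(), nchoices=["a", "b", "c"])
# ... after fitting to data as usual, the choice set can be reshaped:
policy.drop_arm("b")           # drop arm "b" by name
policy.add_arm(arm_name="d")   # add a new arm "d", started from 'base_algorithm'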

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj
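
Continuing from the usage sketch above (same synthetic rng, X, a, r and policy), an incremental refit with continue_from_last could look like this:

import numpy as np

X_more = rng.standard_normal((200, n_features))   # new rows appended to the history
a_more = rng.integers(nchoices, size=200)
r_more = rng.integers(2, size=200)

# only the models for arms present in the appended rows get refit
policy.fit(np.vstack([X, X_more]),
           np.concatenate([a, a_more]),
           np.concatenate([r, r_more]),
           continue_from_last=True)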

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj
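
A streaming sketch under the above requirement, using scikit-learn’s SGDClassifier with log-loss (the loss name "log_loss" assumes a recent scikit-learn; older versions call it "log") and simulated rewards standing in for real feedback:

import numpy as np
from sklearn.linear_model import SGDClassifier
from contextualbandits.online import ActiveExplorer

rng = np.random.default_rng(456)
nchoices, n_features = 5, 10
policy = ActiveExplorer(SGDClassifier(loss="log_loss"), nchoices=nchoices,
                        batch_train=True, refit_buffer=50, random_state=456)

for _ in range(20):                           # simulated stream of batches
    X_batch = rng.standard_normal((50, n_features))
    actions = policy.predict(X_batch)         # choose actions for the batch
    rewards = rng.integers(2, size=50)        # stand-in for observed rewards
    policy.partial_fit(X_batch, actions, rewards)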

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))
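
Reusing ‘policy’ and ‘X_batch’ from the streaming sketch above, the dictionary output looks like this:

res = policy.predict(X_batch, output_score=True)
chosen = res["choice"]   # arms selected by the policy, shape (n_samples,)
scores = res["score"]    # score each chosen arm got under this policy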

reset_active_choice(active_choice='weighted')

Set the active gradient criteria to a custom form

Parameters

active_choice (str in {‘min’, ‘max’, ‘weighted’}) – How to calculate the gradient that an observation would have on the loss function for each classifier, given that it could be either class (positive or negative) for the classifier that predicts each arm. If weighted, they are weighted by the same probability estimates from the base algorithm.

Returns

self – This object

Return type

obj

reset_explore_prob(explore_prob=0.2)

Set the active exploration probability to a custom number

Parameters

explore_prob (float between 0 and 1) – The new exploration probability. Note that it will still apply decay on it after being reset.

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)
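
Again reusing the objects from the earlier sketches:

top3 = policy.topN(X_batch, n=3)   # array (n_samples, 3): top-ranked arms per row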

AdaptiveGreedy

class contextualbandits.online.AdaptiveGreedy(base_algorithm, nchoices, window_size=500, percentile=30, decay=0.9998, decay_type='percentile', initial_thr='auto', beta_prior='auto', smoothing=None, noise_to_smooth=True, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, active_choice=None, f_grad_norm='auto', case_one_class='auto', random_state=None, njobs=-1)

Adaptive Greedy

Takes the action with the highest estimated reward, unless that estimate falls below a certain threshold, in which case it takes an action either at random or according to an active learning heuristic (in the same way as ActiveExplorer).

Note

The hyperparameters here can make a large impact on the quality of the choices. Be sure to tune the threshold (or percentile), decay, and prior (or smoothing parameters).

Note

The threshold for the reward probabilities can be set to a hard-coded number, or to be calculated dynamically by keeping track of the predictions it makes, and taking a fixed percentile of that distribution to be the threshold. In the second case, these are calculated in separate batches rather than in a sliding window.

Can also be set to make choices in the same way as ‘ActiveExplorer’ rather than at random (see the ‘active_choice’ parameter).
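
As a construction sketch of the two threshold modes described above (the numbers are arbitrary illustrations):

from sklearn.linear_model import LogisticRegression
from contextualbandits.online import AdaptiveGreedy

# dynamic threshold: the 30th percentile of the last 500 predictions
adaptive_dyn = AdaptiveGreedy(LogisticRegression(), nchoices=5,
                              window_size=500, percentile=30,
                              decay_type="percentile")

# fixed starting threshold, decayed after each prediction
adaptive_fix = AdaptiveGreedy(LogisticRegression(), nchoices=5,
                              percentile=None, initial_thr=0.2,
                              decay=0.9998, decay_type="threshold")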

Parameters
  • base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1

    2. A ‘decision_function’ method with unbounded outputs (n_samples,) to which it will apply a sigmoid function.

    3. A ‘predict’ method with outputs (n_samples,) with values in [0,1].

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • window_size (int) – Number of predictions after which the threshold will be updated to the desired percentile.

  • percentile (int in [0,100] or None) – Percentile of the predictions sample to set as threshold, below which actions are random. If None, will not take percentiles, will instead use the initial threshold and apply decay to it.

  • decay (float (0,1) or None) –

    After each prediction, either the threshold or the percentile gets adjusted to:

    val_{t+1} = val_t*decay

  • decay_type (str, either ‘percentile’ or ‘threshold’) – Whether to decay the threshold itself or the percentile of the predictions to take after each prediction. Ignored when using ‘decay=None’. If passing ‘percentile=None’ and ‘decay_type=percentile’, will be forced to ‘threshold’.

  • initial_thr (str ‘auto’ or float (0,1)) – Initial threshold for the prediction below which a random action is taken. If set to ‘auto’, will be calculated as initial_thr = 1 / (2 * sqrt(nchoices)). Note that if ‘base_algorithm’ has a ‘decision_function’ method, it will first apply a sigmoid function to the output, and then compare it to the threshold, so the threshold should lie between zero and one.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are less than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((3/nchoices, 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Note that the default value for AdaptiveGreedy is different from that of the other methods in this module, and it’s recommended to experiment with different values of this hyperparameter. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). This will not work well with non-probabilistic classifiers such as SVM, in which case you might want to define a class that embeds it with some recalibration built-in. Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’.

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • active_choice (None or str in {‘min’, ‘max’, ‘weighted’}) – How to select arms when predictions are below the threshold. If passing None, selects them at random (default). If passing ‘min’, ‘max’ or ‘weighted’, selects them in the same way as ‘ActiveExplorer’. Non-random active selection requires being able to calculate gradients (gradients for logistic regression and linear regression (from this package) are already defined with an option ‘auto’ below).

  • f_grad_norm (None, str ‘auto’, list, or function(base_algorithm, X, pred) -> array (n_samples, 2)) – (When passing active_choice) Function that calculates the row-wise norm of the gradient from observations in X if their class were negative (first column) or positive (second column). Can also use different functions for each arm, in which case it accepts them as a list of functions with length equal to nchoices. The option ‘auto’ will only work with scikit-learn’s ‘LogisticRegression’, ‘SGDClassifier’, and ‘RidgeClassifier’; with stochQN’s ‘StochasticLogisticRegression’; and with this package’s ‘LinearRegression’.

  • case_one_class (str ‘auto’, ‘zero’, None, list, or function(X, n_pos, n_neg, rng) -> array(n_samples, 2)) – (When passing active_choice) If some arm/choice/class has only rewards of one type, many models will fail to fit, and consequently the gradients will be undefined. Likewise, if the model has not been fit, the gradient might also be undefined, and this requires a workaround.

    • If passing ‘None’, will assume that base_algorithm can be fit to data of only-positive or only-negative class without problems, and that it can calculate gradients and predictions with a base_algorithm object that has not been fitted. Be aware that the methods ‘predict’, ‘predict_proba’, and ‘decision_function’ in base_algorithm might be overwritten with another method that wraps it in a try-catch block, so don’t rely on it producing errors when unfitted.

    • If passing a function, will take the output of it as the row-wise gradient norms when it compares them against other arms/classes, with the first column having the values if the observations were of negative class, and the second column if they were positive class. The inputs to this function (signature described above) are the number of positive and negative examples that have been observed, and a Generator object from NumPy to use for generating random numbers.

    • If passing a list, will assume each entry is a function as described above, to be used with each corresponding arm.

    • If passing ‘auto’, will generate random numbers:

      • negative: ~ Gamma(log10(n_features) / (n_pos+1)/(n_pos+n_neg+2), log10(n_features)).

      • positive: ~ Gamma(log10(n_features) * (n_pos+1)/(n_pos+n_neg+2), log10(n_features)).

    • If passing ‘zero’, it will output zero whenever models have not been fitted.

    Note that the theoretically correct approach for a logistic regression would be to assume models with all-zero coefficients, in which case the gradient is defined in the absence of any data, but this tends to produce bad end results.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.

References

1. Chakrabarti, Deepayan, et al. “Mortal multi-armed bandits.” Advances in Neural Information Processing Systems. 2009.

2. Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with a different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’ documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ in here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

Returns

pred – Actions chosen by the policy.

Return type

array (n_samples,)

reset_active_choice(active_choice='weighted')

Set the active gradient criteria to a custom form

Parameters

active_choice (str in {‘min’, ‘max’, ‘weighted’}) – How to calculate the gradient that an observation would have on the loss function for each classifier, given that it could be either class (positive or negative) for the classifier that predicts each arm. If weighted, they are weighted by the same probability estimates from the base algorithm.

Returns

self – This object

Return type

obj

reset_percentile(percentile=30)

Set the moving percentile to a custom number

Parameters

percentile (int between 0 and 100) – The new percentile to set. Note that it will still apply decay to it after being set through this method.

Returns

self – This object

Return type

obj

reset_threshold(threshold='auto')

Set the adaptive threshold to a custom number

Parameters

threshold (float or “auto”) – New threshold to use. If passing “auto”, will set it to 1.5/nchoices. Note that this threshold will still be decayed if the object was initialized with decay_type="threshold", and will still be updated if initialized with percentile != None.

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

BootstrappedTS

class contextualbandits.online.BootstrappedTS(base_algorithm, nchoices, nsamples=10, beta_prior='auto', smoothing=None, noise_to_smooth=True, sample_unique=True, sample_weighted=False, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, batch_sample_method='gamma', random_state=None, njobs_arms=-1, njobs_samples=1)

Bootstrapped Thompson Sampling

Performs Thompson Sampling by fitting several models per class on bootstrapped samples, then making predictions by taking one of them at random for each class.

Note

When fitting the algorithm to data in batches (online), it’s not possible to take an exact bootstrapped sample, as the sample is not known in advance. In theory, as the sample size grows to infinity, the number of times that an observation appears in a bootstrapped sample is distributed \(\sim Poisson(1)\). However, assigning random gamma-distributed weights to observations produces a more stable effect, so it also has the option to assign weights randomly \(\sim Gamma(1,1)\).
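
A small standalone illustration of the two weighting schemes (this is not the package’s internal code, just the idea behind batch_sample_method):

import numpy as np

rng = np.random.default_rng(789)
n = 8
w_poisson = rng.poisson(1.0, size=n)    # integer "times resampled" per observation
w_gamma = rng.gamma(1.0, 1.0, size=n)   # strictly positive Gamma(1,1) weights
print(w_poisson)   # some observations dropped (0), some repeated (2, 3, ...)
print(w_gamma)     # every observation kept, with a smoother random weight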

Note

If you plan to make only one call to ‘predict’ between calls to ‘fit’ and have sample_unique=False, you can pass nsamples=1 without losing any precision.

Parameters
  • base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1

    2. A ‘decision_function’ method with unbounded outputs (n_samples,) to which it will apply a sigmoid function.

    3. A ‘predict’ method with outputs (n_samples,) with values in [0,1].

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • nsamples (int) – Number of bootstrapped samples per class to take.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are less than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). This will not work well with non-probabilistic classifiers such as SVM, in which case you might want to define a class that embeds it with some recalibration built-in. Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • sample_unique (bool) – Whether to use a different bootstrapped classifier per row at each arm when calling ‘predict’. If passing ‘False’, will take the same bootstrapped classifier within an arm for all the rows passed in a single call to ‘predict’. Passing ‘False’ is a faster alternative, but the theoretically correct way is using a different one per row. Forced to ‘True’ when passing sample_weighted=True.

  • sample_weighted (bool) – Whether to take a weighted average from the predictions from each bootstrapped classifier at a given arm, with random weights. This will make the predictions more variable (i.e. more randomness in exploration). The alternative (and default) is to take a prediction from a single classifier each time.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’.

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • batch_sample_method (str, either ‘gamma’ or ‘poisson’) – How to simulate bootstrapped samples when training in batch mode (online). See Note.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs_arms (int or None) – Number of parallel jobs to run (for dividing work across arms). If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The total number of parallel jobs will be njobs_arms * njobs_samples. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.

  • njobs_samples (int or None) – Number of parallel jobs to run (for dividing work across samples within one arm). If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. The total number of parallel jobs will be njobs_arms * njobs_samples. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.
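
A minimal construction sketch with synthetic data (variable names are only illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from contextualbandits.online import BootstrappedTS

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
a = rng.integers(5, size=500)   # logged actions
r = rng.integers(2, size=500)   # logged binary rewards

policy = BootstrappedTS(LogisticRegression(), nchoices=5,
                        nsamples=10, sample_unique=True, random_state=1)
policy.fit(X, a, r)
actions = policy.predict(rng.standard_normal((10, 10)))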

References

1. Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).

2. Chapelle, Olivier, and Lihong Li. “An empirical evaluation of Thompson sampling.” Advances in Neural Information Processing Systems. 2011.

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with a different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’ documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ in here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

BootstrappedUCB

class contextualbandits.online.BootstrappedUCB(base_algorithm, nchoices, nsamples=10, percentile=80, beta_prior='auto', smoothing=None, noise_to_smooth=True, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, batch_sample_method='gamma', random_state=None, njobs_arms=-1, njobs_samples=1)

Bootstrapped Upper Confidence Bound

Obtains an upper confidence bound by taking the percentile of the predictions from a set of classifiers, all fit with different bootstrapped samples (multiple samples per arm).

Note

When fitting the algorithm to data in batches (online), it’s not possible to take an exact bootstrapped sample, as the sample is not known in advance. In theory, as the sample size grows to infinity, the number of times that an observation appears in a bootstrapped sample is distributed \(\sim Poisson(1)\). However, assigning random gamma-distributed weights to observations produces a more stable effect, so it also has the option to assign weights randomly \(\sim Gamma(1,1)\).

Parameters
  • base_algorithm (obj or list) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1

    2. A ‘decision_function’ method with unbounded outputs (n_samples,) to which it will apply a sigmoid function.

    3. A ‘predict’ method with outputs (n_samples,) with values in [0,1].

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • nsamples (int) – Number of bootstrapped samples per class to take.

  • percentile (int [0,100]) – Percentile of the predictions sample to take

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are less than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((3/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. Note that it will only generate one random number per arm, so the ‘a’ parameter should be higher than for other methods. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). This will not work well with non-probabilistic classifiers such as SVM, in which case you might want to define a class that embeds it with some recalibration built-in. Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’.

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • batch_sample_method (str, either ‘gamma’ or ‘poisson’) – How to simulate bootstrapped samples when training in batch mode (online). See Note.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs_arms (int or None) – Number of parallel jobs to run (for dividing work across arms). If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The total number of parallel jobs will be njobs_arms * njobs_samples. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.

  • njobs_samples (int or None) – Number of parallel jobs to run (for dividing work across samples within one arm). If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. The total number of parallel jobs will be njobs_arms * njobs_samples. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.
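
To illustrate the smoothing formula referenced above, here is a standalone numeric sketch (the values of yhat, n, a and b are made up):

    # Hypothetical values: an arm chosen n=4 times, raw model
    # prediction yhat=0.9, smoothing parameters (a, b) = (1, 2).
    yhat, n, a, b = 0.9, 4, 1.0, 2.0

    yhat_smooth = (yhat * n + a) / (n + b)
    print(yhat_smooth)  # 0.7666..., pulled towards the prior mean a/b = 0.5

    # With few observations the prior dominates; as n grows,
    # yhat_smooth converges back to the raw prediction yhat.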

References

1

Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).
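
A minimal usage sketch (the data below is randomly generated, so the rewards carry no real signal; any binary classifier with a scikit-learn-style interface should work as base_algorithm):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.online import BootstrappedUCB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))    # contexts/covariates
    a = rng.integers(0, 5, size=500)  # arms that were played
    r = rng.integers(0, 2, size=500)  # binary rewards observed

    policy = BootstrappedUCB(LogisticRegression(), nchoices=5,
                             nsamples=10, percentile=80)
    policy.fit(X, a, r)
    policy.predict(rng.normal(size=(3, 10)))  # chosen arm per new row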

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, a new one will be started from the same ‘base_algorithm’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’s documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ in here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm received under this policy from the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))
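
For instance, continuing from a fitted policy such as the sketch earlier in this section, and assuming X_new is an array of shape (n_samples, n_features):

    out = policy.predict(X_new, output_score=True)
    out["choice"]  # shape (n_samples,): arm chosen for each row
    out["score"]   # shape (n_samples,): score that arm got under this policy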

reset_percentile(percentile=80)

Set the upper confidence bound percentile to a custom number

Parameters

percentile (int [0,100]) – Percentile of the confidence interval to take.

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output.

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

EpsilonGreedy

class contextualbandits.online.EpsilonGreedy(base_algorithm, nchoices, explore_prob=0.2, decay=0.9999, beta_prior='auto', smoothing=None, noise_to_smooth=True, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, random_state=None, njobs=-1)

Epsilon Greedy

Takes a random action with probability p, or the action with highest estimated reward with probability 1-p.

Parameters
  • base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1

    2. A ‘decision_function’ method with unbounded outputs (n_samples,) to which it will apply a sigmoid function.

    3. A ‘predict’ method with outputs (n_samples,) with values in [0,1].

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • explore_prob (float (0,1)) – Probability of taking a random action at each round.

  • decay (float (0,1)) –

    After each prediction, the explore probability reduces to

    p = p*decay

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. The impact of beta_prior for EpsilonGreedy is not as high as for other policies in this module. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). This will not work well with non-probabilistic classifiers such as SVM, in which case you might want to define a class that embeds it with some recalibration built-in. Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’.

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.

References

1

Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).

2

Yue, Yisong, et al. “The k-armed dueling bandits problem.” Journal of Computer and System Sciences 78.5 (2012): 1538-1556.
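
A minimal usage sketch (random data for illustration only):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.online import EpsilonGreedy

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 10))
    a = rng.integers(0, 5, size=500)
    r = rng.integers(0, 2, size=500)

    policy = EpsilonGreedy(LogisticRegression(), nchoices=5,
                           explore_prob=0.2, decay=0.9999)
    policy.fit(X, a, r)
    policy.predict(rng.normal(size=(3, 10)))

    # The explore probability shrinks with each prediction:
    # after 10,000 predictions, p = 0.2 * 0.9999**10000 ≈ 0.0736.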

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, a new one will be started from the same ‘base_algorithm’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’s documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ in here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm received under this policy from the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

reset_epsilon(explore_prob=0.2)

Set the exploration probability to a custom number

Parameters

explore_prob (float between 0 and 1) – The exploration probability to set. Note that it will still apply the decay after resetting it.

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output.

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

ExploreFirst

class contextualbandits.online.ExploreFirst(base_algorithm, nchoices, explore_rounds=2500, prob_active_choice=0.0, active_choice='weighted', f_grad_norm='auto', case_one_class='auto', beta_prior=None, smoothing=None, noise_to_smooth=True, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, random_state=None, njobs=-1)

Explore First, a.k.a. Explore-Then-Exploit

Selects random actions for the first N predictions, after which it selects the best arm only, according to its estimates.

Parameters
  • base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1

    2. A ‘decision_function’ method with unbounded outputs (n_samples,) to which it will apply a sigmoid function.

    3. A ‘predict’ method with outputs (n_samples,) with values in [0,1].

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • explore_rounds (int) – Number of rounds to wait before exploitation mode. Will switch after making N predictions.

  • prob_active_choice (float (0, 1)) – Probability of choosing explore-mode actions according to active learning criteria. Pass zero for choosing everything at random.

  • active_choice (str, one of ‘weighted’, ‘max’ or ‘min’) – How to calculate the gradient that an observation would have on the loss function for each classifier, given that it could be of either class (positive or negative) for the classifier that predicts each arm. If ‘weighted’, the gradients for the two classes are weighted by the probability estimates from the base algorithm.

  • f_grad_norm (None, str ‘auto’ or function(base_algorithm, X, pred) -> array (n_samples, 2)) – (When passing active_choice) Function that calculates the row-wise norm of the gradient from observations in X if their class were negative (first column) or positive (second column). Can also use different functions for each arm, in which case it accepts them as a list of functions with length equal to nchoices. The option ‘auto’ will only work with scikit-learn’s ‘LogisticRegression’, ‘SGDClassifier’ (log-loss only), and ‘RidgeClassifier’; with stochQN’s ‘StochasticLogisticRegression’; and with this package’s ‘LinearRegression’. Ignored when passing prob_active_choice=0.

  • case_one_class (str ‘auto’, ‘zero’, None, list, or function(X, n_pos, n_neg, rng) -> array(n_samples, 2)) – (When passing active_choice) If some arm/choice/class has only rewards of one type, many models will fail to fit, and consequently the gradients will be undefined. Likewise, if the model has not been fit, the gradient might also be undefined, and this requires a workaround.

    • If passing ‘None’, will assume that base_algorithm can be fit to data of only-positive or only-negative class without problems, and that it can calculate gradients and predictions with a base_algorithm object that has not been fitted. Be aware that the methods ‘predict’, ‘predict_proba’, and ‘decision_function’ in base_algorithm might be overwritten with another method that wraps it in a try-except block, so don’t rely on it producing errors when unfitted.

    • If passing a function, will take the output of it as the row-wise gradient norms when it compares them against other arms/classes, with the first column having the values if the observations were of negative class, and the second column if they were positive class. The inputs to this function (signature described above) are the covariates ‘X’, the number of positive and negative examples that have been observed, and a Generator object from NumPy to use for generating random numbers.

    • If passing a list, will assume each entry is a function as described above, to be used with each corresponding arm.

    • If passing ‘auto’, will generate random numbers:

      • negative: ~ Gamma(log10(n_features) / (n_pos+1)/(n_pos+n_neg+2), log10(n_features)).

      • positive: ~ Gamma(log10(n_features) * (n_pos+1)/(n_pos+n_neg+2), log10(n_features)).

    • If passing ‘zero’, it will output zero whenever models have not been fitted.

    Note that the theoretically correct approach for a logistic regression would be to assume models with all-zero coefficients, in which case the gradient is defined in the absence of any data, but this tends to produce bad end results. Ignored when passing prob_active_choice=0. A sketch of a custom function with this signature is shown after the references below.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). This will not work well with non-probabilistic classifiers such as SVM, in which case you might want to define a class that embeds it with some recalibration built-in. Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’.

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.

References

1

Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).
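
A usage sketch with a custom case_one_class function, as announced above (the constant-norm workaround below is a made-up illustration of the expected signature, not a recommended heuristic):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.online import ExploreFirst

    # Hypothetical workaround: while an arm has seen only one class,
    # report a small constant gradient norm for either label.
    def constant_case_one_class(X, n_pos, n_neg, rng):
        return np.full((X.shape[0], 2), 1e-3)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 10))
    a = rng.integers(0, 5, size=500)
    r = rng.integers(0, 2, size=500)

    policy = ExploreFirst(LogisticRegression(), nchoices=5,
                          explore_rounds=2500,
                          prob_active_choice=0.25,
                          case_one_class=constant_case_one_class)
    policy.fit(X, a, r)
    policy.predict(rng.normal(size=(3, 10)))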

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, a new one will be started from the same ‘base_algorithm’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’s documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ in here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

Returns

pred – Actions chosen by the policy.

Return type

array (n_samples,)

reset_active_choice(active_choice='weighted')

Set the active gradient criteria to a custom form

Parameters

active_choice (str in {‘min’, ‘max’, ‘weighted’}) – How to calculate the gradient that an observation would have on the loss function for each classifier, given that it could be of either class (positive or negative) for the classifier that predicts each arm. If ‘weighted’, the gradients for the two classes are weighted by the probability estimates from the base algorithm.

Returns

self – This object

Return type

obj

reset_count()

Resets the counter for exploitation mode

Return type

self

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output.

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

LinTS

class contextualbandits.online.LinTS(nchoices, lambda_=1.0, fit_intercept=True, v_sq=1.0, sample_from='coef', n_presampled=None, sample_unique=True, use_float=False, method='chol', beta_prior=None, smoothing=None, noise_to_smooth=True, assume_unique_reward=False, random_state=None, njobs=1)

Linear Thompson Sampling

Note

This strategy requires each fitted model to store a square matrix with dimension equal to the number of features. Thus, memory consumption can grow very high with this method.

Note

The ‘X’ data (covariates) should ideally be centered before passing them to ‘fit’, ‘partial_fit’, ‘predict’.

Note

Be aware that sampling coefficients is an operation that scales poorly with the number of columns/features/variables. For wide datasets, it might be slower than a bootstrapped approach, especially when using sample_unique=True.

Parameters
  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • lambda_ (float > 0) – Regularization parameter. References assumed this would always be equal to 1, but this implementation allows changing it.

  • fit_intercept (bool) – Whether to add an intercept term to the coefficients.

  • v_sq (float) – Parameter by which to multiply the covariance matrix (more means higher variance). It is recommended to decrease it from the default value of 1.

  • sample_from (str, one of “coef”, “ci”) – Whether to make predictions by sampling the model coefficients or by sampling the predicted value from an interval centered around the coefficients. If sampling from the coefficients, it’s highly recommended to use method="chol" as it will be faster and more precise.

  • n_presampled (None or int) – If sampling from coefficients, this denotes a number of coefficients to pre-sample after calling ‘fit’ and/or ‘partial_fit’, which will be used later in the predictions. Pre-sampling a large number of coefficients can help to speed up predictions at the expense of longer fitting times, and is recommended if there is a large number of predictions between calls to ‘fit’ or ‘partial_fit’. If passing ‘None’ (the default), will not pre-sample a finite number of the coefficients at fitting time, but will rather sample (different) coefficients in calls to ‘predict’. Ignored when passing sample_from="ci".

  • sample_unique (bool) – Whether to sample different coefficients each time a prediction is to be made. If passing ‘False’, when calling ‘predict’, it will sample the same coefficients for all the observations in the same call to ‘predict’, whereas if passing ‘True’, will use a different set of coefficients for each observation. Passing ‘False’ leads to an approach which is theoretically wrong, but as sampling coefficients can be very slow, using ‘False’ can provide a reasonable speed up without much of a performance penalty. Ignored when passing sample_from="ci" or n_presampled.

  • use_float (bool) – Whether to use C ‘float’ type for the required matrices. If passing ‘False’, will use C ‘double’. Be aware that memory usage for this model can grow very large, and that it is more prone to suffer from numeric precision problems compared to its UCB counterpart.

  • method (str, one of ‘chol’ or ‘sm’) – Method used to fit the model. Options are:

    'chol':

    Uses the Cholesky decomposition to solve the linear system from the least-squares closed-form each time ‘fit’ or ‘partial_fit’ is called. This is likely to be faster when fitting the model to a large number of observations at once, and is able to better exploit multi-threading.

    'sm':

    Starts with an inverse diagonal matrix and updates it as each new observation comes using the Sherman-Morrison formula, thus never explicitly solving the linear system, nor needing to calculate a matrix inverse. This is likely to be faster when fitting the model to small batches of observations. Be aware that with this method, it will add regularization to the intercept if passing ‘fit_intercept=True’.

    Note that, even when using “sm” here, if sampling from the coefficients, it will need to calculate eigenvalues of the covariance or inverse covariance matrix after each update, so it won’t be as fast as LinUCB.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). Recommended to use only one of beta_prior or smoothing. Note that it is technically incorrect to apply smoothing like this (because the predictions from models are not bounded between zero and one), but if neither beta_prior, nor smoothing are passed, the policy can get stuck in situations in which it will only choose actions from the first batch of observations to which it is fit.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Be aware that the algorithm will use BLAS function calls, and if these have multi-threading enabled, it might result in a slow-down as both functions compete for available threads.

References

1

Agrawal, Shipra, and Navin Goyal. “Thompson sampling for contextual bandits with linear payoffs.” International Conference on Machine Learning. 2013.
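
A minimal usage sketch (random data; the centering step follows the note above, and v_sq is decreased from its default of 1 as the parameter description recommends):

    import numpy as np
    from contextualbandits.online import LinTS

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 10))
    X -= X.mean(axis=0)               # center the covariates
    a = rng.integers(0, 5, size=500)
    r = rng.integers(0, 2, size=500)

    policy = LinTS(nchoices=5, v_sq=0.25, method="chol",
                   beta_prior="auto")
    policy.fit(X, a, r)
    policy.predict(rng.normal(size=(3, 10)))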

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, a new one will be started from the same ‘base_algorithm’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’s documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ in here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm received under this policy from the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

reset_alpha(alpha=1.0)

Set the upper confidence bound parameter to a custom number

Note

This method is only for LinUCB, not for LinTS.

Parameters

alpha (float) – Parameter to control the upper confidence bound (more is higher).

Returns

self – This object

Return type

obj

reset_v_sq(v_sq=1.0)

Set the covariance multiplier to a custom number

Parameters

v_sq (float) – Parameter by which to multiply the covariance matrix (more means higher variance).

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output.

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

LinUCB

class contextualbandits.online.LinUCB(nchoices, alpha=1.0, lambda_=1.0, fit_intercept=True, use_float=True, method='sm', ucb_from_empty=True, beta_prior=None, smoothing=None, noise_to_smooth=True, assume_unique_reward=False, random_state=None, njobs=1)

Linear Upper Confidence Bound

Note

This strategy requires each fitted model to store a square matrix with dimension equal to the number of features. Thus, memory consumption can grow very high with this method.

Note

The ‘X’ data (covariates) should ideally be centered before passing them to ‘fit’, ‘partial_fit’, ‘predict’.

Note

The default hyperparameters here are meant to match the original reference, but it’s recommended to change them. Particularly: use beta_prior instead of ucb_from_empty, decrease alpha, and maybe increase lambda_.

Parameters
  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • alpha (float) – Parameter to control the upper confidence bound (more is higher).

  • lambda_ (float > 0) – Regularization parameter. The references assume it is always equal to 1, but this implementation allows changing it.

  • fit_intercept (bool) – Whether to add an intercept term to the coefficients.

  • use_float (bool) – Whether to use C ‘float’ type for the required matrices. If passing ‘False’, will use C ‘double’. Be aware that memory usage for this model can grow very large.

  • method (str, one of ‘chol’ or ‘sm’) – Method used to fit the model. Options are:

    'chol':

    Uses the Cholesky decomposition to solve the linear system from the least-squares closed-form each time ‘fit’ or ‘partial_fit’ is called. This is likely to be faster when fitting the model to a large number of observations at once, and is able to better exploit multi-threading.

    'sm':

Starts with an inverse diagonal matrix and updates it with the Sherman-Morrison formula as each new observation arrives, thus never explicitly solving the linear system, nor needing to calculate a matrix inverse. This is likely to be faster when fitting the model to small batches of observations. Be aware that with this method, it will add regularization to the intercept if passing ‘fit_intercept=True’.

  • ucb_from_empty (bool) – Whether to make upper confidence bounds on arms with no observations according to the formula, as suggested in the references (ties are broken at random for them). Choosing this option leads to policies that usually start making random predictions until having sampled from all arms, and as such, it’s not recommended when the number of arms is large relative to the number of rounds. Instead, it’s recommended to use beta_prior, which acts in the same way as for the other policies in this library.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((3/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Ignored when passing ucb_from_empty=True. (A worked example of the ‘auto’ formula follows this parameter list.)

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). Recommended to use only one of beta_prior or smoothing. Note that it is technically incorrect to apply smoothing like this (because the predictions from models are not bounded between zero and one), but if neither beta_prior nor smoothing is passed, the policy can get stuck in situations in which it will only choose actions from the first batch of observations to which it is fit (if using ucb_from_empty=False), or only from the first arms that show rewards (if using ucb_from_empty=True).

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation across re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Be aware that the algorithm will use BLAS function calls, and if these have multi-threading enabled, it might result in a slow-down as the two compete for the available threads.
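
As a worked example of the ‘auto’ formula above, assuming a hypothetical nchoices=8:

    from math import log2

    nchoices = 8
    a = 3 / log2(nchoices)      # = 1.0
    beta_prior = ((a, 4), 2)    # ((1.0, 4), 2): scores for an arm are drawn from
                                # Beta(1, 4) until that arm has 2 samples both
                                # with and without a reward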

References

1

Chu, Wei, et al. “Contextual bandits with linear payoff functions.” Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011.

2

Li, Lihong, et al. “A contextual-bandit approach to personalized news article recommendation.” Proceedings of the 19th international conference on World wide web. ACM, 2010.
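
A minimal usage sketch following the recommendation above (beta_prior instead of ucb_from_empty, smaller alpha); the data is synthetic and the hyperparameter values are illustrative, not tuned:

    import numpy as np
    from contextualbandits.online import LinUCB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    a = rng.integers(0, 10, size=1000)   # arms that were played
    r = rng.integers(0, 2, size=1000)    # binary rewards observed

    linucb = LinUCB(nchoices=10, alpha=0.1, ucb_from_empty=False,
                    beta_prior="auto", random_state=1)
    linucb.fit(X, a, r)
    actions = linucb.predict(rng.normal(size=(5, 20)))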

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’ documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object
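
A short sketch with hypothetical arm names (any names work when ‘nchoices’ was given as a list):

    from contextualbandits.online import LinUCB

    policy = LinUCB(nchoices=["banner_a", "banner_b", "banner_c"])
    policy.add_arm(arm_name="banner_d")   # grow the pool of choices
    policy.drop_arm("banner_a")           # remove an arm by name
    print(policy.choice_names)            # remaining arms (see 'drop_arm' below)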

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to ‘fit’. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to bypass the policy’s exploration and just choose the arm with the highest expected reward according to the current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

reset_alpha(alpha=1.0)

Set the upper confidence bound parameter to a custom number

Note

This method is only for LinUCB, not for LinTS.

Parameters

alpha (float) – Parameter to control the upper confidence bound (more is higher).

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. the policy makes random choices for some observations, the resulting ranks will be random as well.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

LogisticTS

class contextualbandits.online.LogisticTS(nchoices, sample_from='ci', ci_from_empty=False, multiplier=0.25, n_presampled=None, fit_intercept=True, lambda_=1.0, sample_unique=True, beta_prior='auto', smoothing=None, noise_to_smooth=True, assume_unique_reward=False, random_state=None, njobs=-1)

Logistic Regression with Thompson Sampling

Logistic regression classifier which either samples its coefficients using the variance-covariance matrix of the fitted non-sampled coefficients, or, as a faster alternative, samples predicted values from a confidence interval built from that same variance-covariance matrix.

Note

This strategy is implemented for comparison purposes only and it’s not recommended to rely on it, particularly not for large datasets. Its performance tends to be much worse than that of the other methods provided here.

Note

This strategy does not support fitting the data in batches (‘partial_fit’ will not be available), nor does it support using any other classifier. See ‘BootstrappedTS’ for a more generalizable version.

Note

This strategy requires each fitted model to store a square matrix with dimension equal to the number of features. Thus, memory consumption can grow very high with this method.

Note

Be aware that sampling coefficients is an operation that scales poorly with the number of columns/features/variables. For wide datasets, it might be slower than a bootstrapped approach, especially when using sample_unique=True.

Parameters
  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • sample_from (str, one of “coef”, “ci”) – Whether to make predictions by sampling the model coefficients or by sampling the predicted value from a confidence interval around the best-fit coefficients.

  • ci_from_empty (bool) – Whether to construct a confidence interval on arms with no observations according to a variance-covariance matrix given by the regularization parameter alone. Ignored when passing sample_from='coef'.

  • multiplier (float) – Multiplier for the covariance matrix. Pass 1 to take it as-is. Ignored when passing sample_from='ci'.

  • n_presampled (None or int) – If sampling from coefficients, this denotes a number of coefficients to pre-sample after calling ‘fit’, which will be used later in the predictions. Pre-sampling a large number of coefficients can help to speed up predictions at the expense of longer fitting times, and is recommended if there is a large number of predictions between calls to ‘fit’. If passing ‘None’ (the default), will not pre-sample a finite number of the coefficients at fitting time, but will rather sample (different) coefficients in calls to ‘predict’. Ignored when passing sample_from="ci".

  • fit_intercept (bool) – Whether to add an intercept term to the models.

  • lambda_ (float) – Strength of the L2 regularization. Must be greater than zero.

  • sample_unique (bool) – Whether to sample different coefficients each time a prediction is to be made. If passing ‘False’, when calling ‘predict’, it will sample the same coefficients for all the observations in the same call to ‘predict’, whereas if passing ‘True’, will use a different set of coefficients for each observation/row. Passing ‘False’ leads to an approach which is theoretically wrong, but as sampling coefficients can be very slow, using ‘False’ can provide a reasonable speed up without much of a performance penalty. Ignored when passing sample_from='ci' or n_presampled.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Recommended to use only one of beta_prior, smoothing, ci_from_empty. Ignored when passing ci_from_empty=True.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). Recommended to use only one of beta_prior, smoothing, ci_from_empty.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation across re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Be aware that the algorithm will use BLAS function calls, and if these have multi-threading enabled, it might result in a slow-down as the two compete for the available threads.

References

1

Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).
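
A minimal sketch with synthetic data; sample_from="ci" (the default) samples predicted values from a confidence interval, which tends to be the faster option:

    import numpy as np
    from contextualbandits.online import LogisticTS

    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 15))
    a = rng.integers(0, 4, size=800)
    r = rng.integers(0, 2, size=800)

    lts = LogisticTS(nchoices=4, sample_from="ci", random_state=2)
    lts.fit(X, a, r)              # note: no 'partial_fit' for this class
    actions = lts.predict(rng.normal(size=(10, 15)))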

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’ documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to ‘fit’. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to bypass the policy’s exploration and just choose the arm with the highest expected reward according to the current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. the policy makes random choices for some observations, the resulting ranks will be random as well.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

LogisticUCB

class contextualbandits.online.LogisticUCB(nchoices, percentile=80, fit_intercept=True, lambda_=1.0, ucb_from_empty=False, beta_prior='auto', smoothing=None, noise_to_smooth=True, assume_unique_reward=False, random_state=None, njobs=-1)

Logistic Regression with Confidence Interval

Logistic regression classifier which constructs an upper bound on the predicted probabilities through a confidence interval calculated from the variance-covariance matrix of the fitted coefficients.

Note

This strategy is implemented for comparison purposes only and it’s not recommended to rely on it, particularly not for large datasets.

Note

This strategy does not support fitting the data in batches (‘partial_fit’ will not be available), nor does it support using any other classifier. See ‘BootstrappedUCB’ for a more generalizable version.

Note

This strategy requires each fitted classifier to store a square matrix with dimension equal to the number of features. Thus, memory consumption can grow very high with this method.

Parameters
  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • percentile (int [0,100]) – Percentile of the confidence interval to take.

  • fit_intercept (bool) – Whether to add an intercept term to the models.

  • lambda_ (float) – Strength of the L2 regularization. Must be greater than zero.

  • ucb_from_empty (bool) – Whether to make upper confidence bounds on arms with no observations according to the formula (ties are broken at random for them). Choosing this option leads to policies that usually start making random predictions until having sampled from all arms, and as such, it’s not recommended when the number of arms is large relative to the number of rounds. Instead, it’s recommended to use beta_prior, which acts in the same way as for the other policies in this library.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((3/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Note that this method calculates upper bounds rather than expectations, so the ‘a’ parameter should be higher than for other methods. Recommended to use only one of beta_prior or smoothing. Ignored when passing ucb_from_empty=True.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation across re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Be aware that the algorithm will use BLAS function calls, and if these have multi-threading enabled, it might result in a slow-down as the two compete for the available threads.

References

1

Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).
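
A minimal sketch with synthetic data; the percentile can later be changed without refitting via ‘reset_percentile’:

    import numpy as np
    from contextualbandits.online import LogisticUCB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 15))
    a = rng.integers(0, 4, size=800)
    r = rng.integers(0, 2, size=800)

    lucb = LogisticUCB(nchoices=4, percentile=90, random_state=3)
    lucb.fit(X, a, r)
    lucb.reset_percentile(70)     # take a less optimistic upper bound
    actions = lucb.predict(rng.normal(size=(10, 15)))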

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’ documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to ‘fit’. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to bypass the policy’s exploration and just choose the arm with the highest expected reward according to the current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

reset_percentile(percentile=80)

Set the upper confidence bound percentile to a custom number

Parameters

percentile (int [0,100]) – Percentile of the confidence interval to take.

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. the policy makes random choices for some observations, the resulting ranks will be random as well.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

ParametricTS

class contextualbandits.online.ParametricTS(base_algorithm, nchoices, beta_prior=None, beta_prior_ts=(0.0, 0.0), smoothing=None, noise_to_smooth=True, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, random_state=None, njobs=-1)

Parametric Thompson Sampling

Performs Thompson sampling using a beta distribution, with parameters given by the predicted probability from the base algorithm multiplied by the number of observations seen from each arm.

Parameters
  • base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1

    2. A ‘decision_function’ method with unbounded outputs (n_samples,) to which it will apply a sigmoid function.

    3. A ‘predict’ method with outputs (n_samples,) with values in [0,1].

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Recommended to use only one of beta_prior or smoothing.

  • beta_prior_ts (tuple(float, float)) – Beta prior used for the distribution from which to draw probabilities given the base algorithm’s estimates. This is independent of beta_prior, and they will not be used together under the same arm. Pass ‘(0,0)’ for no prior.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). This will not work well with non-probabilistic classifiers such as SVM, in which case you might want to wrap the classifier in a class that adds probability recalibration. Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’. (A streaming sketch follows this parameter list.)

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation across re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The parallelization uses shared memory, thus you will only see a speed-up if your base classifier releases the Python GIL; otherwise it will result in slower runs.
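
A streaming sketch with synthetic data: batch_train=True plus a refit_buffer keeps a per-arm reserve of observations for ‘partial_fit’. The base classifier and its loss name (“log_loss” in recent scikit-learn versions) are assumptions for illustration:

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from contextualbandits.online import ParametricTS

    rng = np.random.default_rng(0)
    pts = ParametricTS(base_algorithm=SGDClassifier(loss="log_loss"),
                       nchoices=5, batch_train=True, refit_buffer=50,
                       random_state=4)

    # Warm-up batch with randomly chosen arms:
    X0 = rng.normal(size=(200, 12))
    pts.fit(X0, rng.integers(0, 5, size=200), rng.integers(0, 2, size=200))

    for _ in range(10):                       # simulated streaming rounds
        X = rng.normal(size=(100, 12))
        a = pts.predict(X).astype(int)        # arms chosen by the policy
        r = rng.integers(0, 2, size=100)      # stand-in for observed rewards
        pts.partial_fit(X, a, r)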

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’ documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to ‘fit’. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to bypass the policy’s exploration and just choose the arm with the highest expected reward according to the current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

reset_beta_prior_ts(beta_prior_ts=(0.0, 0.0))

Set the Thompson prior to a custom tuple

Parameters

beta_prior_ts (tuple(float, float)) – Beta prior used for the distribution from which to draw probabilities given the base algorithm’s estimates. This is independent of beta_prior, and they will not be used together under the same arm. Pass ‘(0,0)’ for no prior.

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. the policy makes random choices for some observations, the resulting ranks will be random as well.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

PartitionedTS

class contextualbandits.online.PartitionedTS(nchoices, beta_prior=((1, 1), 1), smoothing=None, noise_to_smooth=True, assume_unique_reward=False, random_state=None, njobs=-1, *args, **kwargs)

Tree-partitioned Thompson Sampling

Fits decision trees, with a non-contextual multi-armed Thompson-sampling bandit at each leaf.

This corresponds to the ‘TreeHeuristic’ in the reference paper.

Note

This method fits only one tree per arm. As such, it’s not recommended for high-dimensional data.

Note

The default values for the beta prior are as suggested in the reference paper. It is nonetheless recommended to change them.

Parameters
  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • beta_prior (str ‘auto’, tuple ((a,b), n), or list[tuple((a,b), n)]) – When there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If passing ‘auto’ (which is not the default), will use the same default as for the other policies in this library:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. Additionally, will use (a,b) as prior when sampling from the MAB at a given node.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). Not recommended for this method.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly.

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that it will not achieve a large degree of parallelization, as it requires many Python-level computations with shared memory and no GIL release.

  • *args (tuple) – Additional arguments to pass to the decision tree model (this policy uses scikit-learn’s DecisionTreeClassifier - see their docs for more details). Note that passing random_state for DecisionTreeClassifier will have no effect as it will be set independently.

  • **kwargs (dict) – Additional keyword arguments to pass to the decision tree model (this policy uses scikit-learn’s DecisionTreeClassifier - see their docs for more details). Note that passing random_state for DecisionTreeClassifier will have no effect as it will be set independently.

References

1

Elmachtoub, Adam N., et al. “A practical method for solving contextual bandit problems using decision trees.” arXiv preprint arXiv:1706.04687 (2017).

2

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
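
A minimal sketch with synthetic data; ‘max_depth’ here is an illustrative keyword forwarded to the underlying DecisionTreeClassifier:

    import numpy as np
    from contextualbandits.online import PartitionedTS

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 8))
    a = rng.integers(0, 3, size=600)
    r = rng.integers(0, 2, size=600)

    ptree = PartitionedTS(nchoices=3, random_state=5, max_depth=4)
    ptree.fit(X, a, r)
    actions = ptree.predict(rng.normal(size=(10, 8)))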

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with a different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’s documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

PartitionedUCB

class contextualbandits.online.PartitionedUCB(nchoices, percentile=80, ucb_prior=(1, 1), beta_prior='auto', smoothing=None, noise_to_smooth=True, assume_unique_reward=False, random_state=None, njobs=-1, *args, **kwargs)

Tree-partitioned Upper Confidence Bound

Fits decision trees with non-contextual multi-armed UCB bandits at each leaf. Uses the standard normal approximation for the confidence interval of a proportion: mean + c * sqrt(mean * (1 - mean) / n).

This is similar to the ‘TreeHeuristic’ in the reference paper, but uses UCB as a MAB policy instead of Thompson sampling.

Note

This method fits only one tree per arm. As such, it’s not recommended for high-dimensional data.

Parameters
  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • percentile (int [0,100]) – Percentile of the confidence interval to take.

  • ucb_prior (tuple(float, float)) – Prior for the upper confidence bounds generated at each tree leaf. First number will be added to the number of positives, and second number to the number of negatives. If passing beta_prior=None, will use these alone to generate an upper confidence bound and will break ties at random.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((3/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Note that this method calculates upper bounds rather than expectations, so the ‘a’ parameter should be higher than for other methods. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). Not recommended for this method.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly.

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that it will not achieve a large degree of parallelization due to needing many Python computations with shared memory and no GIL releasing.

  • *args (tuple) – Additional arguments to pass to the decision tree model (this policy uses scikit-learn’s DecisionTreeClassifier - see their docs for more details). Note that passing random_state for DecisionTreeClassifier will have no effect as it will be set independently.

  • **kwargs (dict) – Additional keyword arguments to pass to the decision tree model (this policy uses scikit-learn’s DecisionTreeClassifier - see their docs for more details). Note that passing random_state for DecisionTreeClassifier will have no effect as it will be set independently.

References

1. Elmachtoub, Adam N., et al. “A practical method for solving contextual bandit problems using decision trees.” arXiv preprint arXiv:1706.04687 (2017).

2. https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
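A short construction sketch for this class (the synthetic data and all parameter values below are illustrative assumptions):

    import numpy as np
    from contextualbandits.online import PartitionedUCB

    nchoices = 5
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    a = rng.integers(0, nchoices, size=500)
    r = rng.binomial(1, 0.3, size=500)

    policy = PartitionedUCB(nchoices=nchoices, percentile=80,
                            ucb_prior=(1, 1), random_state=0)
    policy.fit(X, a, r)
    scores = policy.decision_function(X[:5])   # upper-confidence scores per arm
    actions = policy.predict(X[:5])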

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with a different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’s documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

reset_percentile(percentile=80)

Set the upper confidence bound percentile to a custom number

Parameters

percentile (int [0,100]) – Percentile of the confidence interval to take.

Returns

self – This object

Return type

obj

reset_ucb_prior(ucb_prior=(1, 1))

Set the upper confidence bound prior to a custom tuple

Parameters

ucb_prior (tuple(float, float)) – Prior for the upper confidence bounds generated at each tree leaf. First number will be added to the number of positives, and second number to the number of negatives. If passing beta_prior=None, will use these alone to generate an upper confidence bound and will break ties at random.

Returns

self – This object

Return type

obj
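For example, exploration can be narrowed as data accumulates by lowering the percentile and strengthening the prior. A sketch, assuming ‘policy’ is the fitted PartitionedUCB object from the earlier snippet (the particular values are arbitrary):

    policy.reset_percentile(70)      # take a lower percentile of the interval
    policy.reset_ucb_prior((2, 2))   # stronger prior for the leaf-level bounds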

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

SeparateClassifiers

class contextualbandits.online.SeparateClassifiers(base_algorithm, nchoices, beta_prior=None, smoothing=None, noise_to_smooth=True, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, random_state=None, njobs=-1)

Separate Classifiers per arm

Fits one classifier per arm using only the data on which that arm was chosen. Predicts as One-Vs-Rest, plus the usual metaheuristics from beta_prior and smoothing.

Parameters
  • base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1

    2. A ‘decision_function’ method with unbounded outputs (n_samples,) to which it will apply a sigmoid function.

    3. A ‘predict’ method with outputs (n_samples,) with values in [0,1].

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). This will not work well with non-probabilistic classifiers such as SVM, in which case you might want to define a class that embeds it with some recalibration built-in. Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’.

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.

References

1. Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).
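A streaming sketch under assumed synthetic data. It relies on the base classifier having ‘partial_fit’ and on the policy being usable before any data has arrived thanks to beta_prior; note that older versions of scikit-learn name the loss ‘log’ rather than ‘log_loss’:

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from contextualbandits.online import SeparateClassifiers

    nchoices = 5
    rng = np.random.default_rng(1)

    # Base classifier with 'partial_fit', fit to the data in batches
    policy = SeparateClassifiers(
        base_algorithm=SGDClassifier(loss="log_loss"),
        nchoices=nchoices, beta_prior="auto",
        batch_train=True, refit_buffer=50)

    for _ in range(20):                            # simulated stream
        X_batch = rng.normal(size=(100, 10))
        a_batch = policy.predict(X_batch)          # choose actions
        r_batch = rng.binomial(1, 0.2, size=100)   # stand-in observed rewards
        policy.partial_fit(X_batch, a_batch, r_batch)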

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with a different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’s documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

decision_function_std(X)

Get the predicted “probabilities” from each arm from the classifier that predicts it, standardized to sum up to 1 (note that these are no longer probabilities).

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

predict_proba_separate(X)

Get the predicted probabilities from each arm from the classifier that predicts it.

Note

Classifiers are all fit on different data, so the probabilities will not add up to 1.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)
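These per-arm outputs can serve as reward estimates for the off-policy and evaluation modules. A small sketch, assuming ‘policy’ is the fitted SeparateClassifiers object from the earlier snippet:

    import numpy as np
    rng = np.random.default_rng(6)
    X_new = rng.normal(size=(10, 10))

    # Rows of 'est' need not sum to 1: each classifier was fit on different data
    est = policy.predict_proba_separate(X_new)      # shape (n_samples, n_choices)
    est_std = policy.decision_function_std(X_new)   # rescaled to sum to 1 per row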

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

SoftmaxExplorer

class contextualbandits.online.SoftmaxExplorer(base_algorithm, nchoices, multiplier=1.0, inflation_rate=1.0004, beta_prior='auto', smoothing=None, noise_to_smooth=True, batch_train=False, refit_buffer=None, deep_copy_buffer=True, assume_unique_reward=False, random_state=None, njobs=-1)

SoftMax Explorer

Selects an action according to probabilities determined by a softmax transformation on the scores from the decision function that predicts each class.

Note

Will apply an inverse sigmoid (logit) transformation to the probabilities that come from the base algorithm before applying the softmax function.

Parameters
  • base_algorithm (obj) – Base binary classifier for which each sample for each class will be fit. Will look for, in this order:

    1. A ‘predict_proba’ method with outputs (n_samples, 2), values in [0,1], rows summing to 1, to which it will apply an inverse sigmoid function.

    2. A ‘decision_function’ method with unbounded outputs (n_samples,).

    3. A ‘predict’ method outputting (n_samples,), values in [0,1], to which it will apply an inverse sigmoid function.

    Can also pass a list with a different (or already-fit) classifier for each arm.

  • nchoices (int or list-like) – Number of arms/labels to choose from. Can also pass a list, array, or Series with arm names, in which case the outputs from predict will follow these names and arms can be dropped by name, and new ones added with a custom name.

  • multiplier (float or None) – Number by which to multiply the outputs from the base algorithm before applying the softmax function (i.e. will take softmax(yhat * multiplier)).

  • inflation_rate (float or None) – Number by which to multiply the multiplier after every prediction; i.e. after making ‘t’ predictions, the multiplier will be ‘multiplier_t = multiplier * inflation_rate^t’.

  • beta_prior (str ‘auto’, None, tuple ((a,b), n), or list[tuple((a,b), n)]) – If not ‘None’, when there are fewer than ‘n’ samples with and without a reward from a given arm, it will predict the score for that class as a random number drawn from a beta distribution with the prior specified by ‘a’ and ‘b’. If set to “auto”, will be calculated as:

    beta_prior = ((2/log2(nchoices), 4), 2)

    Can also pass different priors per arm, in which case they should be passed as a list of tuples. This parameter can have a very large impact on the end results, and it’s recommended to tune it accordingly - scenarios with low expected reward rates should have priors that result in drawing small random numbers, whereas scenarios with large expected reward rates should have stronger priors and tend towards larger random numbers. Also, the more arms there are, the smaller the optimal expected value for these random numbers. Recommended to use only one of beta_prior or smoothing.

  • smoothing (None, tuple (a,b), or list) – If not None, predictions will be smoothed as yhat_smooth = (yhat*n + a)/(n + b), where ‘n’ is the number of times each arm was chosen in the training data. Can also pass it as a list of tuples with different ‘a’ and ‘b’ parameters for each arm (e.g. if there are arm features, these parameters can be determined through a different model). This will not work well with non-probabilistic classifiers such as SVM, in which case you might want to define a class that embeds it with some recalibration built-in. Recommended to use only one of beta_prior or smoothing.

  • noise_to_smooth (bool) – If passing smoothing, whether to add a small amount of random noise \(\sim Uniform(0, 10^{-12})\) in order to break ties at random instead of choosing the smallest arm index. Ignored when passing smoothing=None.

  • batch_train (bool) – Whether the base algorithm will be fit to the data in batches as it comes (streaming), or to the whole dataset each time it is refit. Requires a classifier with a ‘partial_fit’ method.

  • refit_buffer (int or None) – Number of observations per arm to keep as a reserve for passing to ‘partial_fit’. If passing it, up until the moment there are at least this number of observations for a given arm, that arm will keep the observations when calling ‘fit’ and ‘partial_fit’, and will translate calls to ‘partial_fit’ to calls to ‘fit’ with the new plus stored observations. After the reserve number is reached, calls to ‘partial_fit’ will enlarge the data batch with the stored observations, and old stored observations will be gradually replaced with the new ones (at random, not on a FIFO basis). This technique can greatly enhance the performance when fitting the data in batches, but memory consumption can grow quite large. If passing sparse CSR matrices as input to ‘fit’ and ‘partial_fit’, these will be converted to dense once they go into this reserve, and then converted back to CSR to augment the new data. Calls to ‘fit’ will override this reserve. Ignored when passing ‘batch_train=False’.

  • deep_copy_buffer (bool) – Whether to make deep copies of the data that is stored in the reserve for refit_buffer. If passing ‘False’, when the reserve is not yet full, these will only store shallow copies of the data, which is faster but will not let Python’s garbage collector free memory after deleting the data, and if the original data is overwritten, so will this buffer. Ignored when not using refit_buffer.

  • assume_unique_reward (bool) – Whether to assume that only one arm has a reward per observation. If set to ‘True’, whenever an arm receives a reward, the classifiers for all other arms will be fit to that observation too, having negative label.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. While this controls random number generation for this metaheuristic, there can still be other sources of variation upon re-runs, such as data aggregations in parallel (e.g. from OpenMP or BLAS functions).

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both. The parallelization uses shared memory, thus you will only see a speed up if your base classifier releases the Python GIL, and will otherwise result in slower runs.

References

1. Cortes, David. “Adapting multi-armed bandits policies to contextual bandits scenarios.” arXiv preprint arXiv:1811.04383 (2018).
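A usage sketch with synthetic data (all values are illustrative); multiplier and inflation_rate jointly control how quickly the policy turns greedy:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.online import SoftmaxExplorer

    nchoices = 5
    rng = np.random.default_rng(2)
    X = rng.normal(size=(1000, 10))
    a = rng.integers(0, nchoices, size=1000)
    r = rng.binomial(1, 0.25, size=1000)

    # inflation_rate > 1 sharpens the softmax with every prediction made
    policy = SoftmaxExplorer(base_algorithm=LogisticRegression(),
                             nchoices=nchoices, multiplier=1.0,
                             inflation_rate=1.0004, random_state=2)
    policy.fit(X, a, r)
    actions = policy.predict(rng.normal(size=(10, 10)))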

add_arm(arm_name=None, fitted_classifier=None, n_w_rew=0, n_wo_rew=0, smoothing=None, beta_prior=None, refit_buffer_X=None, refit_buffer_r=None, f_grad_norm=None, case_one_class=None)

Adds a new arm to the pool of choices

Parameters
  • arm_name (object) – Name for this arm. Only applicable when using named arms. If None, will use the name of the last arm plus 1 (will only work when the names are integers).

  • fitted_classifier (object) – If a classifier has already been fit to rewards coming from this arm, you can pass it here; otherwise, it will be started from the same ‘base_classifier’ as the initial arms. If using bootstrapped methods or methods from this module which do not accept arbitrary classifiers as input, don’t pass a classifier here (unless using classes such as utils._BootstrappedClassifierBase). If the constructor was called with a different base_algorithm per arm, must pass a base classifier here. Not applicable for the classes that do not take a base_algorithm.

  • n_w_rew (int) – Number of trials/rounds with rewards coming from this arm (only used when using a beta prior or smoothing).

  • n_wo_rew (int) – Number of trials/rounds without rewards coming from this arm (only used when using a beta prior or smoothing).

  • smoothing (None, tuple (a,b), or list) – Smoothing parameters to use for this arm (see documentation of the class constructor for details). If None and if the smoothing passed to the constructor didn’t have separate entries per arm, will use the same smoothing as was passed in the constructor. If no smoothing was passed to the constructor, the smoothing here will be ignored. Must pass a smoothing here if the constructor was passed a smoothing with different entries per arm.

  • beta_prior (None or tuple((a,b), n)) – Beta prior to use for this arm. See the class’s documentation for details. Must be passed if the constructor was provided different beta priors per arm. If None and the constructor had a single beta_prior, will use that same beta_prior for this new arm. Note that n_w_rew and n_wo_rew will be counted towards the threshold ‘n’ here. Cannot be passed if the constructor did not have a beta_prior.

  • refit_buffer_X (array(m, n) or None) – Refit buffer of ‘X’ data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • refit_buffer_r (array(m,) or None) – Refit buffer of rewards data to use for the new arm. Ignored when using ‘batch_train=False’ or ‘refit_buffer=None’.

  • f_grad_norm (function) – Gradient calculation function to use for this arm. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

  • case_one_class (function) – Gradient workaround function for single-class data. This is only for the policies that make choices according to active learning criteria, and only for situations in which the policy was passed different functions for each arm.

Returns

self – This object

Return type

object

decision_function(X, output_score=False, apply_sigmoid_score=True)

Get the scores for each arm following this policy’s action-choosing criteria.

Parameters

X (array (n_samples, n_features)) – Data for which to obtain decision function scores for each arm.

Returns

scores – Scores following this policy for each arm.

Return type

array (n_samples, n_choices)

drop_arm(arm_name)

Drop an arm/choice

Drops (removes/deletes) an arm from the set of available choices to the policy.

Note

The available arms, if named, are stored in attribute ‘choice_names’.

Parameters

arm_name (int or object) – Arm to drop. If passing an integer, will drop at that index (starting at zero). Otherwise, will drop the arm matching this name (argument must be of the same type as the individual entries passed to ‘nchoices’ in the initialization).

Returns

self – This object

Return type

object

fit(X, a, r, continue_from_last=False)

Fits the base algorithm (one per class [and per sample if bootstrapped]) to partially labeled data.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • continue_from_last (bool) – If the policy was previously fit to data, whether to assume that this new call to ‘fit’ will continue from the exact same dataset as before plus new rows appended at the end of ‘X’, ‘a’, ‘r’. In this case, will only refit the models that have new data according to ‘a’. Note that the bootstrapped policies will still benefit from extra refits. This option should not be used when there are calls to ‘partial_fit’ between calls to fit. Ignored if using assume_unique_reward=True.

Returns

self – This object

Return type

obj

partial_fit(X, a, r)

Fits the base algorithm (one per class) to partially labeled data in batches.

Note

In order to use this method, the base classifier must have a ‘partial_fit’ method, such as ‘sklearn.linear_model.SGDClassifier’. This method is not available for ‘LogisticUCB’, ‘LogisticTS’, ‘PartitionedUCB’, ‘PartitionedTS’.

Parameters
  • X (array(n_samples, n_features) or CSR(n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array(n_samples, ), int type) – Arms or actions that were chosen for each observation.

  • r (array(n_samples, ), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

Returns

self – This object

Return type

obj

predict(X, exploit=False, output_score=False)

Selects actions according to this policy for new data.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to choose an action according to this policy.

  • exploit (bool) – Whether to make a prediction according to the policy, or to just choose the arm with the highest expected reward according to current models.

  • output_score (bool) – Whether to output the score that this method predicted, in case it is desired to use it with this package’s offpolicy and evaluation modules.

Returns

pred – Actions chosen by the policy. If passing output_score=True, it will be a dictionary with the chosen arm and the score that the arm got following this policy with the classifiers used.

Return type

array (n_samples,) or dict(“choice” : array(n_samples,), “score” : array(n_samples,))

reset_multiplier(multiplier=1.0)

Set the multiplier to a custom number

Parameters

multiplier (float) – New multiplier for the numbers going to the softmax function. Note that the inflation rate will still be applied after this parameter is reset.

Returns

self – This object

Return type

obj

topN(X, n)

Get top-N ranked actions for each observation

Note

This method will rank choices/arms according to what the policy dictates - it is not an exploitation-mode rank, so if e.g. there are random choices for some observations, there will be random ranks in here.

Parameters
  • X (array (n_samples, n_features)) – New observations for which to rank actions according to this policy.

  • n (int) – Number of top-ranked actions to output

Returns

topN – The top-ranked actions for each observation

Return type

array(n_samples, n)

Off-policy learning

Hint: if in doubt, use OffsetTree or SeparateClassifiers (the latter is from the online module)

DoublyRobustEstimator

class contextualbandits.offpolicy.DoublyRobustEstimator(base_algorithm, reward_estimator, nchoices, method='rovr', handle_invalid=True, random_state=1, c=None, pmin=1e-05, beta_prior=None, smoothing=(1.0, 2.0), njobs=-1, **kwargs_costsens)

Doubly-Robust Estimator

Estimates the expected reward for each arm, applies a correction for the actions that were chosen, and converts the problem to cost-sensitive classification, on which the base algorithm is then fit.

Note

If these docs are followed to the letter regarding what to pass under each argument, this implementation will be theoretically incorrect, as this library doesn’t follow the paradigm of producing probabilities of choosing actions, nor of estimating the probabilities of a previous policy. Instead, it uses estimated expected rewards (that is, the rows of the estimations don’t sum to 1); nevertheless, this is likely to still produce an improvement over a naive approach. One may still supply post-hoc estimated probabilities if feasible.

Note

This technique converts the problem into a cost-sensitive classification problem by calculating a matrix of expected rewards and turning it into costs. The base algorithm is then fit to this data, using either the Weighted All-Pairs approach, which requires a binary classifier that supports sample weights as base algorithm, or the Regression One-Vs-Rest approach, which requires a regressor as base algorithm.

In the Weighted All-Pairs approach, this technique will fail if there are actions that were never taken by the exploration policy, as it cannot construct a model for them.

The expected rewards are estimated with the imputer algorithm passed here, which should output a number in the range \([0,1]\).

This technique is meant for the case of continuous rewards in the \([0,1]\) interval, but here it is used for the case of discrete rewards \(\{0,1\}\), under which it performs poorly. Its use is not recommended; it is provided for comparison purposes only.

Also important: this method requires forming reward estimates of all arms for each observation. In order to do so, you can either provide estimates as an array (see Parameters), or pass a model.

One method to obtain reward estimates is to fit a model to the data and use its predictions as reward estimates. You can do so by passing an object of class contextualbandits.online.SeparateClassifiers which should already be fitted, or by passing a classifier with a ‘predict_proba’ method, which will be put into a ‘SeparateClassifiers’ object and fit to the same data passed to this function to obtain reward estimates.

The estimates can make invalid predictions if there are some arms which, every time they were chosen, resulted in a reward, or never resulted in a reward. In such cases, this function includes the option to impute the “predictions” for them (which would otherwise always be exactly zero or one regardless of the context) by replacing them with random numbers \(\sim \text{Beta}(3,1)\) or \(\sim \text{Beta}(1,3)\) for the always-rewarded and never-rewarded cases, respectively.

This is just a wild idea though, and doesn’t guarantee reasonable results in such situations.

Note that, if you are using the ‘SeparateClassifiers’ class from the online module in this same package, it comes with a method ‘predict_proba_separate’ that can be used to get reward estimates. It still can suffer from the same problem of always-one and always-zero predictions though.

Parameters
  • base_algorithm (obj) – Base algorithm to be used for cost-sensitive classification.

  • reward_estimator (obj or array (n_samples, n_choices)) –

    One of the following:
    • An array with the first column corresponding to the reward estimates for the action chosen by the new policy, and the second column corresponding to the reward estimates for the action chosen in the data (see Note for details).

    • An already-fit object of class ‘contextualbandits.online.SeparateClassifiers’, which will be used to make predictions on the actions chosen and the actions that the new policy would choose.

    • A classifier with a ‘predict_proba’ method, which will be fit to the same test data passed here in order to obtain reward estimates (see Note 2 for details).

  • nchoices (int) – Number of arms/labels to choose from. Only used when passing a classifier object to ‘reward_estimator’.

  • method (str, either ‘rovr’ or ‘wap’) – Whether to use Regression One-Vs-Rest or Weighted All-Pairs (see Note 1)

  • handle_invalid (bool) – Whether to replace 0/1 estimated rewards with randomly-generated numbers (see Note 2)

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. This is used when passing handle_invalid=True or beta_prior != None.

  • c (None or float) – Constant by which to multiply all scores from the exploration policy.

  • pmin (None or float) – Scores (from the exploration policy) will be set to the maximum between pmin and the original estimate, i.e. pmin acts as a lower bound on the propensity scores.

  • beta_prior (tuple((a, b), n), str “auto”, or None) – Beta prior to pass to ‘SeparateClassifiers’. Only used when passing to ‘reward_estimator’ a classifier with ‘predict_proba’. See the documentation of ‘SeparateClassifiers’ for details about it.

  • smoothing (tuple(a, b), list, or None) – Smoothing parameter to pass to SeparateClassifiers. Only used when passing to ‘reward_estimator’ a classifier with ‘predict_proba’. See the documentation of SeparateClassifiers for details.

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores.

  • kwargs_costsens – Additional keyword arguments to pass to the cost-sensitive classifier.

References

1. Dudík, Miroslav, John Langford, and Lihong Li. “Doubly robust policy evaluation and learning.” arXiv preprint arXiv:1103.4601 (2011).

2. Dudík, Miroslav, et al. “Doubly robust policy evaluation and optimization.” Statistical Science 29.4 (2014): 485-511.
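A sketch of the typical call pattern, assuming data logged by a uniformly-random policy (so the propensities are known exactly); the regressor and classifier choices are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression, Ridge
    from contextualbandits.offpolicy import DoublyRobustEstimator

    nchoices = 5
    rng = np.random.default_rng(3)
    X = rng.normal(size=(2000, 10))
    a = rng.integers(0, nchoices, size=2000)   # actions logged at random
    r = rng.binomial(1, 0.2, size=2000)
    p = np.full(2000, 1.0 / nchoices)          # logging propensities

    # 'rovr' requires a regressor as base algorithm; the reward estimator
    # here is a classifier with 'predict_proba', which will be wrapped
    # into a 'SeparateClassifiers' object and fit to the same data
    dr = DoublyRobustEstimator(base_algorithm=Ridge(),
                               reward_estimator=LogisticRegression(),
                               nchoices=nchoices, method='rovr')
    dr.fit(X, a, r, p)
    pred = dr.predict(X[:10])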

decision_function(X)

Get the score distribution for the arms’ rewards

Note

For details on how this is calculated, see the documentation of the RegressionOneVsRest and WeightedAllPairs classes in the costsensitive package.

Parameters

X (array (n_samples, n_features)) – New observations for which to evaluate actions.

Returns

pred – Score assigned to each arm for each observation (see Note).

Return type

array (n_samples, n_choices)

fit(X, a, r, p)

Fits the Doubly-Robust estimator to partially-labeled data collected from a different policy.

Parameters
  • X (array (n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array (n_samples), int type) – Arms or actions that were chosen for each observation.

  • r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • p (array (n_samples)) – Reward estimates for the actions that were chosen by the policy.

    Note that, in theory, this should be an estimate of the probabilities that the actions in a would have been taken under the policy that chose these actions, but passing reward estimates in its place might still produce reasonable results.

predict(X)

Predict best arm for new data.

Parameters

X (array (n_samples, n_features)) – New observations for which to choose an action.

Returns

pred – Actions chosen by this technique.

Return type

array (n_samples,)

OffsetTree

class contextualbandits.offpolicy.OffsetTree(base_algorithm, nchoices, c=None, pmin=1e-05, random_state=1, njobs=-1)

Offset Tree

Parameters
  • base_algorithm (obj) – Binary classifier to be used for each classification sub-problem in the tree.

  • nchoices (int) – Number of arms/labels to choose from.

  • c (None or float) – Constant by which to multiply all scores from the exploration policy.

  • pmin (None or float) – Scores (from the exploration policy) will be set to the maximum between pmin and the original estimate, i.e. pmin acts as a lower bound on the propensity scores.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. This is used when predictions need to be done for an arm with no data.

  • njobs (int or None) – Number of parallel jobs to run. If passing None will set it to 1. If passing -1 will set it to the number of CPU cores. Note that if the base algorithm is itself parallelized, this might result in a slowdown as both compete for available threads, so don’t set parallelization in both.

References

1. Beygelzimer, Alina, and John Langford. “The offset tree for learning with partial labels.” Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009.
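A minimal sketch under the same assumed uniform logging policy as in the previous example (synthetic data, illustrative values):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.offpolicy import OffsetTree

    nchoices = 5
    rng = np.random.default_rng(4)
    X = rng.normal(size=(2000, 10))
    a = rng.integers(0, nchoices, size=2000)
    r = rng.binomial(1, 0.2, size=2000)
    p = np.full(2000, 1.0 / nchoices)   # propensities of the logging policy

    ot = OffsetTree(base_algorithm=LogisticRegression(), nchoices=nchoices)
    ot.fit(X, a, r, p)
    pred = ot.predict(X[:10])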

fit(X, a, r, p)

Fits the Offset Tree estimator to partially-labeled data collected from a different policy.

Parameters
  • X (array (n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array (n_samples), int type) – Arms or actions that were chosen for each observation.

  • r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • p (array (n_samples)) – Reward estimates for the actions that were chosen by the policy.

predict(X)

Predict best arm for new data.

Note

While in theory making predictions from this algorithm should be faster than from others, the implementation here uses a Python loop over observations, which is slow compared to NumPy array lookups, so predictions will be slower to compute than those from other algorithms.

Parameters

X (array (n_samples, n_features)) – New observations for which to choose an action.

Returns

pred – Actions chosen by this technique.

Return type

array (n_samples,)
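
A short usage sketch with synthetic exploration data (actions chosen uniformly at random, so the scores are all 1/nchoices) and an arbitrary base classifier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.offpolicy import OffsetTree

    # Exploration data: 5 arms, binary rewards, uniform logging policy
    rng = np.random.default_rng(123)
    n, k, nchoices = 10000, 20, 5
    X = rng.standard_normal((n, k))
    a = rng.integers(nchoices, size=n)
    r = rng.integers(2, size=n)
    p = np.full(n, 1.0 / nchoices)

    ot = OffsetTree(base_algorithm=LogisticRegression(), nchoices=nchoices)
    ot.fit(X, a, r, p)
    actions = ot.predict(X[:10])   # chosen arm for each of 10 rows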

Policy Evaluation

evaluateRejectionSampling

class contextualbandits.evaluation.evaluateRejectionSampling(policy, X, a, r, online=True, partial_fit=False, start_point_online='random', random_state=1, batch_size=10)

Evaluate a policy using rejection sampling on test data.

Note

In order for this method to be unbiased, the actions on the test sample must have been collected at random and not according to some other policy.

Parameters
  • policy (obj) – Policy to be evaluated (already fitted to data). Must have a ‘predict’ method. If it is an online policy, it must also have a ‘fit’ method.

  • X (array (n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array (n_samples), int type) – Arms or actions that were chosen for each observation.

  • r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • online (bool) – Whether this is an online policy to be evaluated by refitting it to the data as it makes choices on it.

  • partial_fit (bool) – Whether to use ‘partial_fit’ when fitting the policy to more data. Ignored if passing online=False.

  • start_point_online (either str ‘random’ or int in [0, n_samples-1]) – Point at which to start evaluating cases in the sample. Only used when passing online=True.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. This is only used when passing start_point_online='random'.

  • batch_size (int) – Size of the batches of data to take for making predictions and adding observations to the history. Note that most of the samples are usually rejected, thus the actual size of the batches to which the models are refit is usually smaller than this number. Only used when passing online=True.

Returns

result – Estimated mean reward and number of observations taken.

Return type

tuple (float, int)

References

[1] Li, Lihong, et al. “A contextual-bandit approach to personalized news article recommendation.” Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
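
A short usage sketch, with synthetic data in which actions were chosen uniformly at random (as this estimator requires) and BootstrappedUCB as the policy under evaluation:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.online import BootstrappedUCB
    from contextualbandits.evaluation import evaluateRejectionSampling

    # Test data where actions were chosen uniformly at random,
    # which this method requires in order to be unbiased
    rng = np.random.default_rng(1)
    n, k, nchoices = 5000, 10, 4
    X = rng.standard_normal((n, k))
    a = rng.integers(nchoices, size=n)
    r = rng.integers(2, size=n)

    policy = BootstrappedUCB(LogisticRegression(), nchoices=nchoices)
    mean_reward, n_taken = evaluateRejectionSampling(policy, X, a, r,
                                                     online=True)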

evaluateDoublyRobust

class contextualbandits.evaluation.evaluateDoublyRobust(pred, X, a, r, p, reward_estimator, nchoices=None, handle_invalid=True, c=None, pmin=1e-05, random_state=1)

Doubly-Robust Policy Evaluation

Evaluates rewards of arm choices of a policy from data collected by another policy, using a reward estimator along with the historical probabilities (hence the name).

Note

This method requires forming reward estimates both for the arms that were chosen and for the arms that the policy being evaluated would choose. You can either provide these estimates as an array (see Parameters), or pass a model.

One method to obtain reward estimates is to fit a model to both the training and test data and use its predictions as the estimates. You can do so by passing an already-fitted object of class contextualbandits.online.SeparateClassifiers.

Another method is to fit a model to the test data, in which case you can pass a classifier with a ‘predict_proba’ method here, which will be fit to the same test data passed to this function to obtain reward estimates.

The last two options can suffer from invalid predictions if there are arms that yielded a reward every single time they were chosen, or that never yielded a reward. In such cases, this function includes the option to impute the “predictions” for them (which would otherwise always be exactly zero or one regardless of the context) by replacing them with random numbers ~Beta(3,1) or ~Beta(1,3) for the always-rewarded and never-rewarded cases, respectively.

This is only a heuristic though, and doesn’t guarantee reasonable results in such situations.

Note that, if you are using the ‘SeparateClassifiers’ class from the online module in this same package, it comes with a method ‘predict_proba_separate’ that can be used to get reward estimates. It still can suffer from the same problem of always-one and always-zero predictions though.

Parameters
  • pred (array (n_samples,)) – Arms that would be chosen by the policy to evaluate.

  • X (array (n_samples, n_features)) – Matrix of covariates for the available data.

  • a (array (n_samples), int type) – Arms or actions that were chosen for each observation.

  • r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions. Must be binary rewards 0/1.

  • p (array (n_samples)) – Scores or reward estimates from the policy that generated the data for the actions that were chosen by it.

  • reward_estimator (obj or array (n_samples, 2)) –

    One of the following:
    • An array with the first column corresponding to the reward estimates for the action chosen by the new policy, and the second column corresponding to the reward estimates for the action chosen in the data (see Note for details).

    • An already-fit object of class ‘contextualbandits.online.SeparateClassifiers’, which will be used to make predictions on the actions chosen and the actions that the new policy would choose.

    • A classifier with a ‘predict_proba’ method, which will be fit to the same test data passed here in order to obtain reward estimates (see Note for details).

  • nchoices (int) – Number of arms/labels to choose from. Only used when passing a classifier object to ‘reward_estimator’.

  • handle_invalid (bool) – Whether to replace 0/1 estimated rewards with randomly-generated numbers (see Note).

  • c (None or float) – Constant by which to multiply all scores from the exploration policy.

  • pmin (None or float) – Scores (from the exploration policy) will be clipped below at pmin, i.e. replaced with the maximum between pmin and the original estimate.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly.

Returns

est – The estimated mean reward that the new policy would obtain on the ‘X’ data.

Return type

float

References

[1] Dudík, Miroslav, John Langford, and Lihong Li. “Doubly robust policy evaluation and learning.” arXiv preprint arXiv:1103.4601 (2011).
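
A short usage sketch with synthetic data, passing a classifier with ‘predict_proba’ as the reward estimator; the ‘pred’ array here is a stand-in for the actions that an already-fit policy would choose:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.evaluation import evaluateDoublyRobust

    rng = np.random.default_rng(0)
    n, k, nchoices = 5000, 10, 4
    X = rng.standard_normal((n, k))
    a = rng.integers(nchoices, size=n)
    r = rng.integers(2, size=n)
    p = np.full(n, 1.0 / nchoices)   # scores from the logging policy

    # Stand-in for new_policy.predict(X)
    pred = rng.integers(nchoices, size=n)

    # The classifier will be fit to this same test data to obtain
    # reward estimates (see Note above)
    est = evaluateDoublyRobust(pred, X, a, r, p,
                               reward_estimator=LogisticRegression(),
                               nchoices=nchoices)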

evaluateFullyLabeled

class contextualbandits.evaluation.evaluateFullyLabeled(policy, X, y_onehot, online=False, shuffle=True, update_freq=50, random_state=1)

Evaluates a policy on fully-labeled data

Parameters
  • policy (obj) – Policy to be evaluated (already fitted to data). Must have a ‘predict’ method. If it is an online policy, it must also have a ‘fit’ method.

  • X (array (n_samples, n_features)) – Covariates for each observation.

  • y_onehot (array (n_samples, n_arms)) – Labels (zero or one) for each class for each observation.

  • online (bool) – Whether the algorithm should be fit to batches of data with a ‘partial_fit’ method, or to all historical data each time.

  • shuffle (bool) – Whether to shuffle the data (X and y_onehot) before passing through it. Be aware that the data is shuffled in-place.

  • update_freq (int) – Batch size - how many observations to predict before refitting the model.

  • random_state (int, None, RandomState, or Generator) – Either an integer which will be used as seed for initializing a Generator object for random number generation, a RandomState object (from NumPy) from which to draw an integer, or a Generator object (from NumPy), which will be used directly. This is used when shuffling and when selecting actions at random for the first batch.

Returns

mean_rew – Mean reward obtained at each batch.

Return type

array (n_batches,)
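
A short usage sketch with synthetic fully-labeled data; the policy is left unfitted here for brevity, relying on the random action selection for the first batch described under random_state:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from contextualbandits.online import BootstrappedUCB
    from contextualbandits.evaluation import evaluateFullyLabeled

    # Fully-labeled data: one binary label per arm per observation
    rng = np.random.default_rng(5)
    n, k, nchoices = 3000, 10, 4
    X = rng.standard_normal((n, k))
    y_onehot = rng.integers(2, size=(n, nchoices))

    policy = BootstrappedUCB(LogisticRegression(), nchoices=nchoices)
    rewards_per_batch = evaluateFullyLabeled(policy, X, y_onehot,
                                             update_freq=100)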

evaluateNCIS

class contextualbandits.evaluation.evaluateNCIS(est, r, p, cmin=1e-08, cmax=1000.0)

Normalized Capped Importance Sampling

Evaluates rewards of arm choices of a policy from data collected by another policy, making corrections according to the difference between the estimations of the new and old policy over the actions that were chosen.

Note

This implementation is theoretically incorrect, since this library doesn’t follow the paradigm of producing probabilities of choosing actions (that would be theoretically possible for many of the methods in the online section, but computationally inefficient and not supported by the library). Instead, it uses estimated expected rewards (that is, the rows of the estimations don’t sum to 1), which is not what this method expects. Nevertheless, the ratio of these estimates between the old and new policies should be closely related to the ratio of the probabilities of choosing those actions, so this function is likely to still produce an improvement over a naive average of the expected rewards across the actions that were chosen by a different policy.

Note

Unlike the other functions in this module, this one doesn’t take the indices of the chosen actions, but rather takes the predictions directly (see the ‘Parameters’ section for details).

Parameters
  • est (array (n_samples,)) – Scores or reward estimates from the policy being evaluated on the actions that were chosen by the old policy for each row of ‘X’.

  • r (array (n_samples), {0,1}) – Rewards that were observed for the chosen actions.

  • p (array (n_samples)) – Scores or reward estimates from the policy that generated the data for the actions that were chosen by it. Must be in the same scale as ‘est’.

  • cmin (float) – Minimum value for the ratio between estimations to assign to observations. If any ratio is below this number, it will be assigned this value (i.e. will be clipped).

  • cmax (float) – Maximum value of the ratio between estimations that will be taken. Observations with ratios higher than this will be discarded rather than clipped.

Returns

est – The estimated mean reward that the new policy would obtain on the ‘X’ data.

Return type

float

References

[1] Gilotte, Alexandre, et al. “Offline A/B testing for recommender systems.” Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 2018.
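
The correction described above boils down to a weighted average of the observed rewards with capped importance ratios. A rough NumPy sketch of that computation (illustrative only, not the library’s own implementation; names are made up):

    import numpy as np

    def ncis_value(est, r, p, cmin=1e-8, cmax=1e3):
        # Ratio between the new policy's and the old policy's scores
        w = np.asarray(est, dtype=float) / np.asarray(p, dtype=float)
        keep = w <= cmax                    # ratios above cmax are discarded
        w = np.clip(w[keep], cmin, None)    # ratios below cmin are clipped up
        r = np.asarray(r, dtype=float)[keep]
        # Normalized (self-weighted) average of the observed rewards
        return np.sum(w * r) / np.sum(w)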

Linear Regression

The package offers non-stochastic linear regression procedures with exact ‘partial_fit’ solutions, which are recommended for use alongside the online policies for better incremental updates.

Linear Regression

class contextualbandits.linreg.LinearRegression(lambda_=1.0, fit_intercept=True, method='sm', calc_inv=True, precompute_ts=False, precompute_ts_multiplier=1.0, n_presampled=None, rng_presample=None, use_float=True)

Linear Regression

Typical Linear Regression model, which keeps track of the aggregated data needed to obtain the closed-form solution in a way that calling ‘partial_fit’ multiple times would be equivalent to a single call to ‘fit’ with all the data. This is an exact method rather than a stochastic optimization procedure.

Also provides functionality for making predictions according to upper confidence bound (UCB) and to Thompson sampling criteria.

Note

Doing linear regression this way requires both memory and computation time which scale quadratically with the number of columns/features/variables. As such, the class will by default use C ‘float’ types (typically np.float32) instead of C ‘double’ (np.float64), in order to save memory.

Parameters
  • lambda_ (float) – Strength of the L2 regularization.

  • fit_intercept (bool) – Whether to add an intercept term to the formula. If passing ‘True’, it will be the last entry in the coefficients.

  • method (str, one of ‘chol’ or ‘sm’) – Method used to fit the model. Options are:

    'chol':

    Uses the Cholesky decomposition to solve the linear system from the least-squares closed-form each time ‘fit’ or ‘partial_fit’ is called. This is likely to be faster when fitting the model to a large number of observations at once, and is able to better exploit multi-threading.

    'sm':

    Starts with an inverse diagonal matrix and updates it with the Sherman-Morrison formula as each new observation arrives, thus never explicitly solving the linear system nor needing to calculate a matrix inverse. This is likely to be faster when fitting the model to small batches of observations. Be aware that with this method, regularization will also be applied to the intercept if passing ‘fit_intercept=True’.

    Note that it is possible to change the method after the object has already been fit (e.g. if you want a non-regularized intercept with fast online updates, you might use Cholesky first and then switch to Sherman-Morrison).

  • calc_inv (bool) – When using method='chol', whether to also produce a matrix inverse, which is required for using the LinUCB prediction mode. Ignored when passing method='sm' (the default). Note that it is possible to change this option after the object has already been fit.

  • precompute_ts (bool) – Whether to pre-compute the necessary matrices to accelerate the Thompson sampling prediction mode (method predict_thompson). If you plan to use predict_thompson, it’s recommended to pass “True”. Note that this will make the Sherman-Morrison updates (method="sm") much slower as it will calculate eigenvalues after every update. Can be changed after the object is already initialized or fitted.

  • precompute_ts_multiplier (float) – Multiplier for the covariance matrix to use with precompute_ts. Calling predict_thompson with this same multiplier will be faster than with a different one. Calling it with a different multiplier will still be faster with precompute_ts than without it, unless also using n_presampled. Ignored when passing precompute_ts=False.

  • n_presampled (None or int) – When passing precompute_ts, this denotes a number of coefficients to pre-sample after calling ‘fit’ and/or ‘partial_fit’, which will be used later when calling predict_thompson with the same multiplier as in precompute_ts_multiplier. Pre-sampling a large number of coefficients can help to speed up Thompson-sampled predictions at the expense of longer fitting times, and is recommended if there is a large number of predictions between calls to ‘fit’ or ‘partial_fit’. If passing ‘None’ (the default), will not pre-sample a finite number of the coefficients at fitting time, but will rather sample (different) coefficients in calls to predict_thompson. The pre-sampled coefficients will not be used if calling predict_thompson with a different multiplier than what was passed to precompute_ts_multiplier.

  • rng_presample (None, int, RandomState, or Generator) – Random number generator to use for pre-sampling coefficients. If passing an integer, will use it as a random seed for initialization. If passing a RandomState, will use it to draw an integer to use as seed. If passing a Generator, will use it directly. If passing ‘None’, will initialize a Generator without random seed. Ignored if passing precompute_ts=False or n_presampled=None (the defaults).

  • use_float (bool) – Whether to use C ‘float’ type for the required matrices. If passing ‘False’, will use C ‘double’. Be aware that memory usage for this model can grow very large. Can be changed after initialization.

Variables

coef (array(n) or array(n+1)) – The obtained coefficients. If passing ‘fit_intercept=True’, the intercept will be at the last entry.
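
A short sketch illustrating the fit/partial_fit equivalence described above (synthetic data; the comparison tolerance is loosened to account for the default float32 storage):

    import numpy as np
    from contextualbandits.linreg import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 8))
    y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(1000)

    # One batch fit vs. the same data fed in ten chunks
    lr_batch = LinearRegression().fit(X, y)
    lr_incr = LinearRegression()
    for idx in np.array_split(np.arange(X.shape[0]), 10):
        lr_incr.partial_fit(X[idx], y[idx])

    # Both solutions should agree up to float32 precision
    assert np.allclose(lr_batch.coef, lr_incr.coef, atol=1e-3)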

property calc_inv
fit(X, y, sample_weight=None)

Fit model to data

Note

Calling ‘fit’ will reset whatever previous data was there. For fitting the model incrementally to new data, use ‘partial_fit’ instead.

Parameters
  • X (array(m,n) or CSR matrix(m, n)) – The covariates.

  • y (array-like(m)) – The target variable.

  • sample_weight (None or array-like(m)) – Observation weights for each row.

Return type

self

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

property method
property n_presampled
partial_fit(X, y, sample_weight=None, *args, **kwargs)

Fit model incrementally to new data

Parameters
  • X (array(m,n) or CSR matrix(m, n)) – The covariates.

  • y (array-like(m)) – The target variable.

  • sample_weight (None or array-like(m)) – Observation weights for each row.

Return type

self

property precompute_ts
property precompute_ts_multiplier
predict(X)

Make predictions on new data

Parameters

X (array(m,n) or CSR matrix(m, n)) – The covariates.

Returns

y_hat – The predicted values given ‘X’.

Return type

array(m)

predict_thompson(X, v_sq=1.0, sample_unique=False, random_state=None)

Make a guess prediction on new data

Make a prediction on new data with coefficients sampled from their estimated distribution.

Note

If using this method, it’s recommended to center the ‘X’ data passed to ‘fit’ and ‘partial_fit’. If not centered, it’s advisable to use a lower v_sq value.

Parameters
  • X (array(m,n) or CSR matrix(m, n)) – The covariates.

  • v_sq (float > 0) – The multiplier for the covariance matrix. Larger values lead to more variable results.

  • sample_unique (bool) – Whether to sample different coefficients each time a prediction is to be made. If passing ‘False’, when calling ‘predict’, it will sample the same coefficients for all the observations in the same call to ‘predict’, whereas if passing ‘True’, it will use a different set of coefficients for each observation. Passing ‘False’ leads to an approach which is theoretically wrong, but as sampling coefficients can be very slow, it can provide a reasonable speed-up without much of a performance penalty.

  • random_state (None, np.random.Generator, or np.random.RandomState) – A NumPy ‘Generator’ or ‘RandomState’ object instance to use for generating random numbers. If passing ‘None’, will use NumPy’s random module directly (which can be made reproducible through np.random.seed).

Returns

y_hat – The predicted guess on ‘y’ given ‘X’ and v_sq.

Return type

array(m)

References

[1] Agrawal, Shipra, and Navin Goyal. “Thompson sampling for contextual bandits with linear payoffs.” International Conference on Machine Learning. 2013.
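
A short usage sketch (synthetic, already-centered data): coefficients are re-sampled on each call, so the same rows will generally receive different sampled predictions:

    import numpy as np
    from contextualbandits.linreg import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 8))    # centered, as recommended
    y = X @ rng.standard_normal(8)

    lr = LinearRegression().fit(X, y)

    # Two Thompson-sampled draws over the same five rows
    draw1 = lr.predict_thompson(X[:5], v_sq=1.0,
                                random_state=np.random.default_rng(1))
    draw2 = lr.predict_thompson(X[:5], v_sq=1.0,
                                random_state=np.random.default_rng(2))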

predict_ucb(X, alpha=1.0, add_unfit_noise=False, random_state=None)

Make an upper-bound prediction on new data

Make a prediction on new data with an upper bound given by the LinUCB formula (be aware that it’s not probabilistic like a regular CI).

Note

If using this method, it’s recommended to center the ‘X’ data passed to ‘fit’ and ‘partial_fit’. If not centered, it’s advisable to use a lower alpha value.

Parameters
  • X (array(m,n) or CSR matrix(m, n)) – The covariates.

  • alpha (float > 0 or array(m, ) > 0) – The multiplier for the width of the bound. Can also pass an array with different values for each row.

  • add_unfit_noise (bool) – When making predictions with an unfit model (in this case they are given by empty zero matrices except for the inverse diagonal matrix based on the regularization parameter), whether to add a very small amount of random noise ~ Uniform(0, 10^-12) to it. This is useful in order to break ties at random when using multiple models.

  • random_state (None, np.random.Generator, or np.random.RandomState) – A NumPy ‘Generator’ or ‘RandomState’ object instance to use for generating random numbers. If passing ‘None’, will use NumPy’s random module directly (which can be made reproducible through np.random.seed). Only used when passing add_unfit_noise=True and calling this method on a model that has not been fit to data.

Returns

y_hat – The predicted upper bound on ‘y’ given ‘X’ and alpha.

Return type

array(m)

References

[1] Chu, Wei, et al. “Contextual bandits with linear payoff functions.” Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011.
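
A short usage sketch (synthetic, already-centered data), fitting with method='chol' and calc_inv=True so that the matrix inverse needed by this prediction mode is available:

    import numpy as np
    from contextualbandits.linreg import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 8))    # centered, as recommended
    y = X @ rng.standard_normal(8)

    # 'chol' needs calc_inv=True to keep the inverse that LinUCB requires
    lr = LinearRegression(method='chol', calc_inv=True).fit(X, y)

    point = lr.predict(X[:5])
    upper = lr.predict_ucb(X[:5], alpha=1.0)
    # The upper bounds should not fall below the point predictions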

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

property use_float

ElasticNet

class contextualbandits.linreg.ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=True, l1=None, l2=None, use_float=True)

ElasticNet Regression

ElasticNet regression (with penalization on the l1 and l2 norms of the coefficients), which keeps track of the aggregated data needed to obtain the optimal coefficients in such a way that calling ‘partial_fit’ multiple times would be equivalent to a single call to ‘fit’ with all the data. This is an exact method rather than a stochastic optimization procedure.

Note

This ElasticNet regression is fit through a reduction to non-negative least squares with twice the number of variables, which is in turn solved through a coordinate descent procedure. This is typically slower than the lasso paths used in GLMNet and SciKit-Learn, and scales much worse with the number of features/columns, but allows for faster incremental updates through ‘partial_fit’, which will give the same result as calls to fit.

Note

This model will not standardize the input data in any way.

Note

By default, will set the l1 and l2 regularization in the same way as GLMNet and SciKit-Learn - that is, the regularizations increase along with the number of rows in the data, which means they will be different after each call to ‘fit’ or ‘partial_fit’. It is nevertheless possible to specify the l1 and l2 regularization directly, and both will remain constant that way, but be careful about the choice for such hyperparameters.

Note

Doing regression this way requires both memory and computation time which scale quadratically with the number of columns/features/variables. As such, the class will by default use C ‘float’ types (typically np.float32) instead of C ‘double’ (np.float64), in order to save memory.

Parameters
  • alpha (float) – Strength of the regularization.

  • l1_ratio (float [0,1]) – Proportion of the regularization that will be applied to the l1 norm of the coefficients (remainder will be applied to the l2 norm). Must be a number between zero and one. If passing l1_ratio=0, it’s recommended instead to use the LinearRegression class which uses more efficient procedures.

    Using higher l1 regularization is more likely to result in some of the obtained coefficients being exactly zero, which is oftentimes desirable.

  • fit_intercept (bool) – Whether to add an intercept term to the formula. If passing ‘True’, it will be the last entry in the coefficients.

  • l1 (None or float) – Strength of the l1 regularization. If passing it, it will bypass the values set through alpha and l1_ratio, and will remain constant in between calls to fit and partial_fit. If passing this, you should also pass l2, as it will otherwise be assumed to be zero.

  • l2 (None or float) – Strength of the l2 regularization. If passing it, it will bypass the values set through alpha and l1_ratio, and will remain constant in between calls to fit and partial_fit. If passing this, you should also pass l1, as it will otherwise be assumed to be zero.

  • use_float (bool) – Whether to use C ‘float’ type for the required matrices. If passing ‘False’, will use C ‘double’. Be aware that memory usage for this model can grow very large. Can be changed after initialization.

Variables

coef (array(n) or array(n+1)) – The obtained coefficients. If passing ‘fit_intercept=True’, the intercept will be at the last entry.

References

[1] Franc, Vojtech, Vaclav Hlavac, and Mirko Navara. “Sequential coordinate-wise algorithm for the non-negative least squares problem.” International Conference on Computer Analysis of Images and Patterns. Springer, Berlin, Heidelberg, 2005.
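
A short usage sketch with synthetic data in which only the first three features are informative; a high l1_ratio makes it likely that some of the uninformative coefficients end up exactly zero:

    import numpy as np
    from contextualbandits.linreg import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 8))
    # Only the first three features actually drive the target
    y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(1000)

    enet = ElasticNet(alpha=0.1, l1_ratio=0.9).fit(X, y)
    # With fit_intercept=True, the intercept is the last entry of coef
    n_zeros = np.sum(enet.coef[:-1] == 0.0)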

fit(X, y, sample_weight=None)

Fit model to data

Note

Calling ‘fit’ will reset whatever previous data was there. For fitting the model incrementally to new data, use ‘partial_fit’ instead.

Parameters
  • X (array(m,n) or CSR matrix(m, n)) – The covariates.

  • y (array-like(m)) – The target variable.

  • sample_weight (None or array-like(m)) – Observation weights for each row.

Return type

self

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

partial_fit(X, y, sample_weight=None, *args, **kwargs)

Fit model incrementally to new data

Parameters
  • X (array(m,n) or CSR matrix(m, n)) – The covariates.

  • y (array-like(m)) – The target variable.

  • sample_weight (None or array-like(m)) – Observation weights for each row.

Return type

self

predict(X)

Make predictions on new data

Parameters

X (array(m,n) or CSR matrix(m, n)) – The covariates.

Returns

y_hat – The predicted values given ‘X’.

Return type

array(m)

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

property use_float
