Solution to Assignment 4 (Approach 2)#

import pymc as pm
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()

# Use only the first two features (sepal length and sepal width)
X = iris.data[:,:2]
y = iris.target

# Binary target: 1 for setosa (class 0), 0 otherwise
species_idx = 0
y_binary = (y == species_idx).astype(np.float64)

print(X.shape,y.shape)
(150, 2) (150,)
plt.scatter(X[:,0],X[:,1],c=y)
plt.title('The Data')
[Figure: scatter plot of the first two iris features, coloured by species]
# Standardize the features to zero mean and unit variance
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_normalized = (X - X_mean) / X_std

Question 1#

# Register the "idx" coordinate used by the dims below
with pm.Model(coords={"idx": np.arange(X_normalized.shape[0])}) as model_reg:
    l = pm.Gamma('l', alpha=2, beta=0.5)         # lengthscale prior
    eta = pm.HalfNormal('eta', sigma=1.0)        # output-scale prior
    cov = eta ** 2 * pm.gp.cov.ExpQuad(2, ls=l)  # squared-exponential kernel
    gp = pm.gp.Latent(cov_func=cov)              # latent GP for the non-Gaussian likelihood
    f = gp.prior(name="f", X=X_normalized, dims="idx")
    p = pm.Deterministic(name="p", var=pm.math.invlogit(f), dims="idx")
    likelihood = pm.Bernoulli(name="likelihood", p=p, dims="idx", observed=y_binary)

    trace_reg = pm.sample(1000, return_inferencedata=True, target_accept=0.95, cores=1)
100.00% [2000/2000 06:05<00:00 Sampling chain 0, 0 divergences]
100.00% [2000/2000 06:09<00:00 Sampling chain 1, 0 divergences]
import arviz as az
az.plot_trace(trace_reg,compact=True)
plt.subplots_adjust(hspace=0.5)
[Figure: ArviZ trace plots of the model parameters]

Question 2#

# Build a prediction grid over the (normalized) feature space
x_min, x_max = X_normalized[:, 0].min() - 1, X_normalized[:, 0].max() + 1
y_min, y_max = X_normalized[:, 1].min() - 1, X_normalized[:, 1].max() + 1
h = 0.25  # grid step
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
X_new = np.c_[xx.ravel(), yy.ravel()]
print(X_new.shape)
(806, 2)
with model_reg:
    # Condition the GP on the posterior and predict over the grid
    f_pred = gp.conditional("f_pred", X_new)
    p_pred = pm.Deterministic(name="p_pred", var=pm.math.invlogit(f_pred))
    likelihood_pred = pm.Bernoulli(name="likelihood_pred", p=p_pred)
    pred_samples = pm.sample_posterior_predictive(trace=trace_reg, var_names=["f_pred", "p_pred", "likelihood_pred"])
100.00% [2000/2000 29:49<00:00]
pred_samples.posterior_predictive
<xarray.Dataset>
Dimensions:                (chain: 2, draw: 1000, f_pred_dim_2: 806,
                            p_pred_dim_2: 806, likelihood_pred_dim_2: 806)
Coordinates:
  * chain                  (chain) int64 0 1
  * draw                   (draw) int64 0 1 2 3 4 5 ... 994 995 996 997 998 999
  * f_pred_dim_2           (f_pred_dim_2) int64 0 1 2 3 4 ... 802 803 804 805
  * p_pred_dim_2           (p_pred_dim_2) int64 0 1 2 3 4 ... 802 803 804 805
  * likelihood_pred_dim_2  (likelihood_pred_dim_2) int64 0 1 2 3 ... 803 804 805
Data variables:
    f_pred                 (chain, draw, f_pred_dim_2) float64 -1.948 ... -1.775
    p_pred                 (chain, draw, p_pred_dim_2) float64 0.1248 ... 0.145
    likelihood_pred        (chain, draw, likelihood_pred_dim_2) int64 0 1 ... 0
Attributes:
    created_at:                 2024-04-06T01:16:58.506766
    arviz_version:              0.15.1
    inference_library:          pymc
    inference_library_version:  5.10.4
# Contour plot of the posterior mean probability over the grid
plt.figure(figsize=(8, 6))
s = pred_samples.posterior_predictive['p_pred'].values.mean(0).mean(0)
plt.contourf(xx, yy, s.reshape(xx.shape), cmap='coolwarm')
plt.scatter(X_normalized[:,0], X_normalized[:,1], c=y, marker='o', edgecolor='k', cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")
plt.title("GPC Predicted Probabilities for Class 1")
plt.show()
[Figure: contour plot of predicted probabilities for class 1 with data overlaid]
# Posterior mean probabilities at the true setosa points
p_setosa = trace_reg.posterior['p'].values.mean(0).mean(0)[np.where(y_binary == 1.0)[0]]
plt.hist(p_setosa, bins=100, histtype='step', color='red')
plt.title('Probability Distribution over Setosa')
plt.xlabel(r'$p_{setosa}$')
plt.ylabel('Counts')
plt.show()
print("Number of Setosa with p > 0.5: ", len(np.where(p_setosa > 0.5)[0]), "/", len(p_setosa))
[Figure: histogram of posterior mean probabilities for the setosa points]
Number of Setosa with p > 0.5:  49 / 50

Question 4#

# Contour plot of the posterior standard deviation of the probability
plt.figure(figsize=(8, 6))
s = pred_samples.posterior_predictive['p_pred'].values.mean(0).std(0)
plt.contourf(xx, yy, s.reshape(xx.shape), cmap='coolwarm')
plt.scatter(X_normalized[:,0], X_normalized[:,1], c=y, marker='o', edgecolor='k', cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")
plt.title("GPC Classification Uncertainties")
plt.show()
[Figure: contour plot of classification uncertainties with data overlaid]

Question 5#

Using the cut $(p - 0.5) > 3\sigma(p)$ we find that all points are retained except one. This point lies in the lower left-hand corner of all the plots (both above and below). From the contour plots it can be seen that this point is missed by our GPC.

original_p = trace_reg.posterior['p'].values.mean(0).mean(0)
uncertainty_p = trace_reg.posterior['p'].values.mean(0).std(0)

# Retain points whose probability exceeds 0.5 by more than 3 sigma
p_sig_test = original_p - 0.5
resulting_setosa = np.where(p_sig_test > 3*uncertainty_p)[0]
resulting_setosa_p = original_p[resulting_setosa]
true_idx = np.where(y_binary == 1.)[0]

plt.scatter(X_normalized[true_idx,0], X_normalized[true_idx,1], c='red', alpha=1.0)
plt.scatter(X_normalized[resulting_setosa,0], X_normalized[resulting_setosa,1], c=resulting_setosa_p)
plt.colorbar(label='Probability')
plt.xlim(X_normalized[:,0].min() - 0.5, X_normalized[:,0].max() + 0.5)
plt.ylim(X_normalized[:,1].min() - 0.5, X_normalized[:,1].max() + 0.5)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title(r'$3\sigma$ Uncertainty Cut - 1 Point Missed')
plt.show()
[Figure: retained setosa points after the 3-sigma cut, coloured by probability]

Overall classification performance:#

Note that this figure is unchanged under the uncertainty cuts (a quick check follows the accuracy computation below).

predictions_ = (original_p > 0.5).astype('int')
print("Accuracy: ", np.sum(predictions_ == y_binary) / len(y_binary))
Accuracy:  0.9933333333333333
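
As a check of that claim, here is a minimal sketch reusing `original_p`, `uncertainty_p`, and `predictions_` from above; the symmetric cut $|p - 0.5| > 3\sigma(p)$ is my own choice (Question 5 used the one-sided version):

# Keep only points whose probability is more than 3 sigma away from 0.5
retained = np.where(np.abs(original_p - 0.5) > 3 * uncertainty_p)[0]
print("Retained points: ", len(retained), "/", len(y_binary))
print("Accuracy on retained points: ", np.sum(predictions_[retained] == y_binary[retained]) / len(retained))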

Bonus Question 1#

Yes, entirely. From the contour plots we can see that both the predicted probability and the model's confidence increase as we move towards a 'central' mass, i.e., the centroid of class 0 or 1 in feature space. The model's uncertainty increases in low-density regions, which is desirable.
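
As a rough numerical check of this, here is a minimal sketch reusing `X_new` and `pred_samples` from Question 2; comparing the grid points nearest to and farthest from the training data is my own choice of reference points:

# Posterior standard deviation of p at each grid point
unc = pred_samples.posterior_predictive['p_pred'].values.mean(0).std(0)
# Distance from each grid point to its nearest training point
dists = np.linalg.norm(X_new[:, None, :] - X_normalized[None, :, :], axis=2).min(axis=1)
print("Uncertainty at the grid point nearest the data:   ", unc[dists.argmin()])
print("Uncertainty at the grid point farthest from data: ", unc[dists.argmax()])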

Bonus Question 2#

I actually opted for a more 'optimal' configuration by default in the model construction, by directly using a latent-variable GP (pm.gp.Latent). When using the initial model I noticed that the probabilities had a lower bound of ~0.5, i.e., the model was not really learning the distribution of the other class (0). This is directly related to the initial discretization of the GP output: as the target encoding moves away from (0, 1) towards (-10, 10), the performance of the marginal model improves.
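
That initial model is not shown above; based on the description, it was presumably a marginal-likelihood GP regression on encoded labels, something like the following reconstruction (my sketch, not the original code; the ±10 encoding mirrors the (-10, 10) discretization mentioned above):

# Hypothetical reconstruction of the initial marginal-likelihood model
y_enc = np.where(y_binary == 1, 10.0, -10.0)  # widen the target encoding from (0, 1)
with pm.Model() as model_marginal:
    l_m = pm.Gamma('l_m', alpha=2, beta=0.5)
    eta_m = pm.HalfNormal('eta_m', sigma=1.0)
    cov_m = eta_m ** 2 * pm.gp.cov.ExpQuad(2, ls=l_m)
    gp_m = pm.gp.Marginal(cov_func=cov_m)
    eps = pm.HalfNormal('eps', sigma=1.0)  # Gaussian observation noise
    y_ = gp_m.marginal_likelihood('y', X=X_normalized, y=y_enc, sigma=eps)

Probabilities would then be recovered by squashing the posterior mean of the latent function through a sigmoid, which only saturates towards 0 and 1 when the target encoding is wide enough.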

It is true that the results depend on:

  • the kernel function

  • the GP output discretization

  • the scaling factor on the sigmoid

The kernel function we choose defines how we model the covariance terms and is therefore directly tied to the performance of the model. The discretization is very important, as mentioned above. The scaling factor on the sigmoid will just make the class probability distributions more multimodal, i.e., two distributions centred around p = 0.5 (see the sketch below).
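
For concreteness, a scaling factor would enter the latent model as below (a minimal sketch; the fixed `scale` value is my illustration and not part of the model used above):

with pm.Model(coords={"idx": np.arange(X_normalized.shape[0])}) as model_scaled:
    l_s = pm.Gamma('l_s', alpha=2, beta=0.5)
    eta_s = pm.HalfNormal('eta_s', sigma=1.0)
    cov_s = eta_s ** 2 * pm.gp.cov.ExpQuad(2, ls=l_s)
    gp_s = pm.gp.Latent(cov_func=cov_s)
    f_s = gp_s.prior(name="f_s", X=X_normalized, dims="idx")
    scale = 5.0  # illustrative; larger values steepen the sigmoid
    p_s = pm.Deterministic("p_s", pm.math.invlogit(scale * f_s), dims="idx")
    lk_s = pm.Bernoulli("lk_s", p=p_s, observed=y_binary, dims="idx")

Note that a fixed scale is partially redundant with eta, since both control the amplitude of the latent function entering the sigmoid.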

What did I learn?

These are useful techniques that offer a lot of additional information beyond a bare classification label. They provide uncertainty estimates, including uncertainty driven by low-density regions. I consider two cases where uncertainty should be high:

  • on the boundary of two classes

  • low density regions

I am not aware of other methods that provide both.

Bonus Question 3#

Sure. Let's reuse a Bayesian logistic regression model from an earlier assignment and see what we get.

with pm.Model() as lgr:
    # Bayesian logistic regression: linear predictor through a sigmoid
    alpha = pm.Normal('alpha', mu=0, sigma=1)
    betas = pm.Normal('betas', mu=0, sigma=1, shape=X_normalized.shape[1])
    mu = alpha + pm.math.dot(X_normalized, betas)
    theta = pm.Deterministic('theta', pm.math.sigmoid(mu))

    lkd = pm.Bernoulli('lkd', p=theta, observed=y_binary)

    lr_posterior = pm.sample(1000, tune=2000, return_inferencedata=True, target_accept=0.95)
100.00% [3000/3000 00:04<00:00 Sampling chain 0, 0 divergences]
100.00% [3000/3000 00:04<00:00 Sampling chain 1, 0 divergences]
plr = lr_posterior.posterior['theta'].values.mean(0).mean(0)
pred_lr = (plr > 0.5).astype('int')
print("Accuracy from BLR: ", np.sum(pred_lr == y_binary) / len(y_binary))
Accuracy from BLR:  0.9933333333333333

So we see that the performance of the two models is identical. This is to be expected given the distribution of the features: there is a very easily identifiable decision boundary for the BLR to find. However, in cases where the decision boundary is highly non-linear, a GPC would outperform it; the GPC is similar in operation to other classification methods that use radial basis functions.
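
As a quick illustration of that last point, here is a minimal sketch using sklearn's reference implementations rather than PyMC (purely to keep the runtime short); make_moons is a standard non-linearly separable toy dataset:

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.gaussian_process import GaussianProcessClassifier

# Two interleaved half-moons: no linear decision boundary separates them well
X_m, y_m = make_moons(n_samples=200, noise=0.2, random_state=0)
print("Logistic regression accuracy:", LogisticRegression().fit(X_m, y_m).score(X_m, y_m))
print("GP classifier accuracy:      ", GaussianProcessClassifier().fit(X_m, y_m).score(X_m, y_m))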