Logistic Regression#

Ref: Chap 4 of Mar18

https://cfteach.github.io/brds/referencesmd.html

!pip install pymc3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymc3
  Downloading pymc3-3.11.5-py3-none-any.whl (872 kB)
     |████████████████████████████████| 872 kB 12.8 MB/s 
?25hRequirement already satisfied: numpy<1.22.2,>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from pymc3) (1.21.6)
Requirement already satisfied: dill in /usr/local/lib/python3.7/dist-packages (from pymc3) (0.3.6)
Collecting semver>=2.13.0
  Downloading semver-2.13.0-py2.py3-none-any.whl (12 kB)
Requirement already satisfied: arviz>=0.11.0 in /usr/local/lib/python3.7/dist-packages (from pymc3) (0.12.1)
Requirement already satisfied: cachetools>=4.2.1 in /usr/local/lib/python3.7/dist-packages (from pymc3) (5.2.0)
Requirement already satisfied: fastprogress>=0.2.0 in /usr/local/lib/python3.7/dist-packages (from pymc3) (1.0.3)
Collecting theano-pymc==1.1.2
  Downloading Theano-PyMC-1.1.2.tar.gz (1.8 MB)
     |████████████████████████████████| 1.8 MB 30.9 MB/s 
?25hRequirement already satisfied: typing-extensions>=3.7.4 in /usr/local/lib/python3.7/dist-packages (from pymc3) (4.1.1)
Collecting deprecat
  Downloading deprecat-2.1.1-py2.py3-none-any.whl (9.8 kB)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.7/dist-packages (from pymc3) (0.5.3)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.7/dist-packages (from pymc3) (1.3.5)
Requirement already satisfied: scipy<1.8.0,>=1.7.3 in /usr/local/lib/python3.7/dist-packages (from pymc3) (1.7.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from theano-pymc==1.1.2->pymc3) (3.8.0)
Requirement already satisfied: setuptools>=38.4 in /usr/local/lib/python3.7/dist-packages (from arviz>=0.11.0->pymc3) (57.4.0)
Requirement already satisfied: xarray>=0.16.1 in /usr/local/lib/python3.7/dist-packages (from arviz>=0.11.0->pymc3) (0.20.2)
Requirement already satisfied: xarray-einstats>=0.2 in /usr/local/lib/python3.7/dist-packages (from arviz>=0.11.0->pymc3) (0.2.2)
Requirement already satisfied: matplotlib>=3.0 in /usr/local/lib/python3.7/dist-packages (from arviz>=0.11.0->pymc3) (3.2.2)
Requirement already satisfied: netcdf4 in /usr/local/lib/python3.7/dist-packages (from arviz>=0.11.0->pymc3) (1.6.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from arviz>=0.11.0->pymc3) (21.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0->arviz>=0.11.0->pymc3) (1.4.4)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0->arviz>=0.11.0->pymc3) (3.0.9)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0->arviz>=0.11.0->pymc3) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0->arviz>=0.11.0->pymc3) (0.11.0)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->pymc3) (2022.6)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from patsy>=0.5.1->pymc3) (1.15.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from xarray>=0.16.1->arviz>=0.11.0->pymc3) (4.13.0)
Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.7/dist-packages (from deprecat->pymc3) (1.14.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->xarray>=0.16.1->arviz>=0.11.0->pymc3) (3.10.0)
Requirement already satisfied: cftime in /usr/local/lib/python3.7/dist-packages (from netcdf4->arviz>=0.11.0->pymc3) (1.6.2)
Building wheels for collected packages: theano-pymc
  Building wheel for theano-pymc (setup.py) ... ?25l?25hdone
  Created wheel for theano-pymc: filename=Theano_PyMC-1.1.2-py3-none-any.whl size=1529964 sha256=defebffb659cb5ed16c8f722ef3988e92655cced11359b7b4bebf871f2b9ba9f
  Stored in directory: /root/.cache/pip/wheels/f3/af/8c/5dd7553522d74c52a7813806fc7ee1a9caa20a3f7c8fd850d5
Successfully built theano-pymc
Installing collected packages: theano-pymc, semver, deprecat, pymc3
Successfully installed deprecat-2.1.1 pymc3-3.11.5 semver-2.13.0 theano-pymc-1.1.2

import pymc3 as pm
import numpy as np
import pandas as pd
import theano.tensor as tt
import seaborn as sns
import scipy.stats as stats
from scipy.special import expit as logistic
import matplotlib.pyplot as plt
import arviz as az
import requests
import io 

az.style.use('arviz-darkgrid')

z = np.linspace(-8, 8)
plt.plot(z, 1 / (1 + np.exp(-z)))
plt.xlabel('z')
plt.ylabel('logistic(z)')

Text(0, 0.5, 'logistic(z)')

_images/mod2_part2b_Bayesian_Logistic_Regression_5_1.png

The Iris Dataset#

target_url = 'https://raw.githubusercontent.com/cfteach/brds/main/datasets/iris.csv' 

download = requests.get(target_url).content
iris = pd.read_csv(io.StringIO(download.decode('utf-8')))

iris.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

#using stripplot function from seaborn 

sns.stripplot(x="species", y="sepal_length", data=iris, jitter=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f5290a34350>

_images/mod2_part2b_Bayesian_Logistic_Regression_8_1.png

sns.pairplot(iris, hue='species', diag_kind='kde')

/usr/local/lib/python3.7/dist-packages/seaborn/axisgrid.py:88: UserWarning: This figure was using constrained_layout==True, but that is incompatible with subplots_adjust and or tight_layout: setting constrained_layout==False. 
  self._figure.tight_layout(*args, **kwargs)

<seaborn.axisgrid.PairGrid at 0x7f52909aeb10>

_images/mod2_part2b_Bayesian_Logistic_Regression_9_2.png

The logistic model applied to the iris dataset#

df = iris.query("species == ('setosa', 'versicolor')")

df

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
95	5.7	3.0	4.2	1.2	versicolor
96	5.7	2.9	4.2	1.3	versicolor
97	6.2	2.9	4.3	1.3	versicolor
98	5.1	2.5	3.0	1.1	versicolor
99	5.7	2.8	4.1	1.3	versicolor

100 rows × 5 columns

y_0 = pd.Categorical(df['species']).codes

print(y_0, len(y_0))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] 100

# let's select one variate
x_n = 'sepal_length'
x_0 = df[x_n].values
#print(x_0)

# let's center our dataset, as we have done in other exercises
x_c = x_0 - x_0.mean()
print(x_c)

[-0.371 -0.571 -0.771 -0.871 -0.471 -0.071 -0.871 -0.471 -1.071 -0.571
 -0.071 -0.671 -0.671 -1.171  0.329  0.229 -0.071 -0.371  0.229 -0.371
 -0.071 -0.371 -0.871 -0.371 -0.671 -0.471 -0.471 -0.271 -0.271 -0.771
 -0.671 -0.071 -0.271  0.029 -0.571 -0.471  0.029 -0.571 -1.071 -0.371
 -0.471 -0.971 -1.071 -0.471 -0.371 -0.671 -0.371 -0.871 -0.171 -0.471
  1.529  0.929  1.429  0.029  1.029  0.229  0.829 -0.571  1.129 -0.271
 -0.471  0.429  0.529  0.629  0.129  1.229  0.129  0.329  0.729  0.129
  0.429  0.629  0.829  0.629  0.929  1.129  1.329  1.229  0.529  0.229
  0.029  0.029  0.329  0.529 -0.071  0.529  1.229  0.829  0.129  0.029
  0.029  0.629  0.329 -0.471  0.129  0.229  0.229  0.729 -0.371  0.229]

with pm.Model() as model_0:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=10)
    
    μ = α + pm.math.dot(x_c, β)    
    θ = pm.Deterministic('θ', pm.math.sigmoid(μ))
    bd = pm.Deterministic('bd', -α/β)
    
    yl = pm.Bernoulli('yl', p=θ, observed=y_0)

    trace_0 = pm.sample(2000, tune = 2000, return_inferencedata=True)

100.00% [4000/4000 00:05<00:00 Sampling chain 0, 0 divergences]

100.00% [4000/4000 00:02<00:00 Sampling chain 1, 0 divergences]

varnames = ['α', 'β', 'bd']
res = az.summary(trace_0)
#print(res)
az.summary(trace_0)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
α	0.309	0.351	-0.345	0.955	0.006	0.005	2972.0	2475.0	1.0
β	5.409	1.047	3.469	7.348	0.021	0.015	2655.0	2333.0	1.0
θ[0]	0.164	0.060	0.060	0.277	0.001	0.001	3071.0	2506.0	1.0
θ[1]	0.068	0.037	0.011	0.135	0.001	0.000	2948.0	2406.0	1.0
θ[2]	0.027	0.021	0.001	0.064	0.000	0.000	2867.0	2277.0	1.0
...	...	...	...	...	...	...	...	...	...
θ[96]	0.815	0.067	0.689	0.930	0.001	0.001	2727.0	2186.0	1.0
θ[97]	0.980	0.018	0.947	0.999	0.000	0.000	2609.0	2140.0	1.0
θ[98]	0.164	0.060	0.060	0.277	0.001	0.001	3071.0	2506.0	1.0
θ[99]	0.815	0.067	0.689	0.930	0.001	0.001	2727.0	2186.0	1.0
bd	-0.056	0.065	-0.170	0.070	0.001	0.001	3016.0	2591.0	1.0

103 rows × 9 columns

theta_post= trace_0.posterior['θ'] 
print(np.shape(theta_post))

(2, 2000, 100)

theta = trace_0.posterior['θ'].mean(axis=0).mean(axis=0)
idx = np.argsort(x_c)

np.random.seed(123)

plt.plot(x_c[idx], theta[idx], color='C2', lw=3)

plt.vlines(trace_0.posterior['bd'].mean(), 0, 1, color='k')

bd_hpd = az.hdi(trace_0.posterior['bd'])

plt.fill_betweenx([0, 1], bd_hpd.bd[0].values, bd_hpd.bd[1].values, color='k', alpha=0.5)

plt.scatter(x_c, np.random.normal(y_0, 0.02),
            marker='.', color=[f'C{x}' for x in y_0])

az.plot_hdi(x_c, trace_0.posterior['θ'], color='C2')  #green band 

plt.xlabel(x_n)
plt.ylabel('θ', rotation=0)
# use original scale for xticks
locs, _ = plt.xticks()
plt.xticks(locs, np.round(locs + x_0.mean(), 1))

([<matplotlib.axis.XTick at 0x7f5280db0250>,
  <matplotlib.axis.XTick at 0x7f5280db0090>,
  <matplotlib.axis.XTick at 0x7f5280e10fd0>,
  <matplotlib.axis.XTick at 0x7f5280d0c5d0>,
  <matplotlib.axis.XTick at 0x7f5280d0cdd0>,
  <matplotlib.axis.XTick at 0x7f5280d0cd50>,
  <matplotlib.axis.XTick at 0x7f5280d0c590>,
  <matplotlib.axis.XTick at 0x7f5280d11810>],
 [Text(0, 0, '4.0'),
  Text(0, 0, '4.5'),
  Text(0, 0, '5.0'),
  Text(0, 0, '5.5'),
  Text(0, 0, '6.0'),
  Text(0, 0, '6.5'),
  Text(0, 0, '7.0'),
  Text(0, 0, '7.5')])

_images/mod2_part2b_Bayesian_Logistic_Regression_19_1.png

Multiple logistic regression#

df = iris.query("species == ('setosa', 'versicolor')") 
y_1 = pd.Categorical(df['species']).codes 
x_n = ['sepal_length', 'sepal_width'] 
x_1 = df[x_n].values

print(np.shape(x_1), type(x_1))
t_l = x_1.tolist()
print(np.shape(t_l), type(t_l))

(100, 2) <class 'numpy.ndarray'>
(100, 2) <class 'list'>

with pm.Model() as model_1: 
    α = pm.Normal('α', mu=0, sd=10) 
    β = pm.Normal('β', mu=0, sd=2, shape=len(x_n)) 

    x__1 = pm.Data('x', x_1)
    y__1 = pm.Data('y', y_1)
     
    μ = α + pm.math.dot(x__1, β)
    
    θ = pm.Deterministic('θ', 1 / (1 + pm.math.exp(-μ))) 
    bd = pm.Deterministic('bd', -α/β[1] - β[0]/β[1] * x__1[:,0])
     
    yl = pm.Bernoulli('yl', p=θ, observed=y__1) 
 
    trace_1 = pm.sample(2000, tune=4000, return_inferencedata=True, target_accept=0.9)

100.00% [6000/6000 00:29<00:00 Sampling chain 0, 0 divergences]

100.00% [6000/6000 00:29<00:00 Sampling chain 1, 0 divergences]

varnames = ['α', 'β'] 
az.plot_forest(trace_1, var_names=varnames);

_images/mod2_part2b_Bayesian_Logistic_Regression_24_0.png

idx = np.argsort(x_1[:,0]) 

bd_mean = trace_1.posterior['bd'].mean(axis=0).mean(axis=0)

plt.scatter(x_1[:,0], x_1[:,1], c=[f'C{x}' for x in y_0]) 

bd = bd_mean[idx]

plt.plot(x_1[:,0][idx], bd, color='k'); 

az.plot_hdi(x_1[:,0], trace_1.posterior['bd'], color='k')
 
plt.xlabel(x_n[0]) 
plt.ylabel(x_n[1])

Text(0, 0.5, 'sepal_width')

_images/mod2_part2b_Bayesian_Logistic_Regression_25_1.png

Prediction on unseen data#

print(np.shape(x_1), type(x_1))

udata = np.array(((5.2,2.0),(7.0,2.0),(4.5,5.0),(5.5,3.2)))

print(np.shape(udata), type(udata))


#dic = {'x_1': udata[:,0].tolist(), 'sepal_width': udata[:,1].tolist()}
tmp_dic = {'x': udata}
#a=udata[:,0].tolist()
#print(type(a))
#print(a)

(100, 2) <class 'numpy.ndarray'>
(4, 2) <class 'numpy.ndarray'>

with model_1:
  pm.set_data(tmp_dic)
  y_test = pm.sample_posterior_predictive(trace_1, 200, model_1)

/usr/local/lib/python3.7/dist-packages/pymc3/sampling.py:1709: UserWarning: samples parameter is smaller than nchains times ndraws, some draws and/or chains may not be represented in the returned posterior predictive sample
  "samples parameter is smaller than nchains times ndraws, some draws "

100.00% [200/200 00:04<00:00]

print(np.shape(y_test['yl']), type(y_test['yl']))
res = y_test['yl'].mean(axis=0)
#res = np.round(y_test['yl'].mean(axis=0), decimals=0).astype(int)


for i,k in enumerate(udata):
  print('sepal_length, sepal_width, category: {:}, {:}'.format(udata[i], res[i]))

(200, 4) <class 'numpy.ndarray'>
sepal_length, sepal_width, category: [5.2 2. ], 0.99
sepal_length, sepal_width, category: [7. 2.], 1.0
sepal_length, sepal_width, category: [4.5 5. ], 0.0
sepal_length, sepal_width, category: [5.5 3.2], 0.425

Question in class (from D. Zhu): why is the decision boundary linear?#

with pm.Model() as model_2: 
    α = pm.Normal('α', mu=0, sd=10) 
    β = pm.Normal('β', mu=0, sd=2, shape=len(x_n)) 
    𝛾 = pm.Normal('𝛾', mu=0, sd=2) 
     
    μ = α + pm.math.dot(x_1, β) + pm.math.dot(x_1[:,0]**2, 𝛾)
    
    θ = pm.Deterministic('θ', 1 / (1 + pm.math.exp(-μ))) 
    bd = pm.Deterministic('bd', -α/β[1] - β[0]/β[1] * x_1[:,0] - 𝛾/β[1] * x_1[:,0]**2)
     
    yl = pm.Bernoulli('yl', p=θ, observed=y_1) 
 
    trace_2 = pm.sample(2000, tune=4000, return_inferencedata=True, target_accept=0.95)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [𝛾, β, α]

100.00% [24000/24000 00:19<00:00 Sampling 4 chains, 0 divergences]

/Users/cfanelli/Desktop/teaching/BRDS/jupynb_env_new/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf
  return _boost._beta_ppf(q, a, b)
/Users/cfanelli/Desktop/teaching/BRDS/jupynb_env_new/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf
  return _boost._beta_ppf(q, a, b)
/Users/cfanelli/Desktop/teaching/BRDS/jupynb_env_new/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf
  return _boost._beta_ppf(q, a, b)
/Users/cfanelli/Desktop/teaching/BRDS/jupynb_env_new/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf
  return _boost._beta_ppf(q, a, b)
Sampling 4 chains for 4_000 tune and 2_000 draw iterations (16_000 + 8_000 draws total) took 26 seconds.

idx = np.argsort(x_1[:,0]) 

bd_mean = trace_2.posterior['bd'].mean(axis=0).mean(axis=0)

plt.scatter(x_1[:,0], x_1[:,1], c=[f'C{x}' for x in y_0]) 

bd = bd_mean[idx]

plt.plot(x_1[:,0][idx], bd, color='k'); 

az.plot_hdi(x_1[:,0], trace_2.posterior['bd'], color='k')
 
plt.xlabel(x_n[0]) 
plt.ylabel(x_n[1])

Text(0, 0.5, 'sepal_width')

_images/mod2_part2b_Bayesian_Logistic_Regression_32_1.png

y_pred_2 = pm.sample_posterior_predictive(trace_2, 200, model_2) 

/Users/cfanelli/Desktop/teaching/BRDS/jupynb_env_new/lib/python3.9/site-packages/pymc3/sampling.py:1708: UserWarning: samples parameter is smaller than nchains times ndraws, some draws and/or chains may not be represented in the returned posterior predictive sample
  warnings.warn(

100.00% [200/200 00:00<00:00]

How to plot a variate distributed according to a PyMC model#

with pm.Model() as model_testv:
    testv = pm.Beta('testv',1.,1.)
    


syn1 = testv.random(size=10000)
plt.hist(syn1, bins=25);

syn2 = testv.random(size=1000)
plt.hist(syn2, bins=25);

_images/mod2_part2b_Bayesian_Logistic_Regression_35_0.png

Bayesian Reasoning in Data Science

Logistic Regression

Contents