DATA 644 (Bayesian Reasoning in Data Science): Assignment 1#
%pip install pymc pytensor
Requirement already satisfied: pymc (5.7.2) and pytensor (2.14.2), plus their dependencies.
%matplotlib inline
import warnings
import arviz as az
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
#import theano.tensor as tt
from pymc import Model, Normal, Slice, sample, HalfNormal
from pymc.distributions import Interpolated
from scipy import stats
#from theano import as_op
import pickle
from google.colab import files
import pickle
uploaded = files.upload()
Saving my_arrays.pkl to my_arrays.pkl
with open("my_arrays.pkl", 'rb') as file:
data = pickle.load(file)
import numpy as np
np.shape(data['X1'])
(1000,)
X1 = data['X1']
X2 = data['X2']
Y = data['Y']
Visual inspection#
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
# Create a DataFrame from the arrays
data = pd.DataFrame({
    'X1': X1,
    'X2': X2,
    'Y': Y
})
# Prepare the features and target variable
X = data[['X1', 'X2']]
X = sm.add_constant(X) # Adds a constant term to the predictor
Y = data['Y']
# Visualization
plt.figure(figsize=(12, 6))
# Plot Y vs X1
plt.subplot(1, 2, 1)
sns.scatterplot(x='X1', y='Y', data=data)
plt.title('Y vs X1')
# Plot Y vs X2
plt.subplot(1, 2, 2)
sns.scatterplot(x='X2', y='Y', data=data)
plt.title('Y vs X2')
plt.tight_layout()
plt.show()
# Perform an ordinary least-squares (OLS) fit for comparison
model = sm.OLS(Y, X).fit()
# Print the summary to inspect the coefficients and model performance
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 0.984
Model: OLS Adj. R-squared: 0.984
Method: Least Squares F-statistic: 3.081e+04
Date: Tue, 20 Feb 2024 Prob (F-statistic): 0.00
Time: 15:21:11 Log-Likelihood: -1395.6
No. Observations: 1000 AIC: 2797.
Df Residuals: 997 BIC: 2812.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0115 0.031 161.944 0.000 4.951 5.072
X1 7.0540 0.031 229.558 0.000 6.994 7.114
X2 12.8148 0.155 82.457 0.000 12.510 13.120
==============================================================================
Omnibus: 0.260 Durbin-Watson: 2.042
Prob(Omnibus): 0.878 Jarque-Bera (JB): 0.335
Skew: -0.027 Prob(JB): 0.846
Kurtosis: 2.929 Cond. No. 5.10
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
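For a side-by-side comparison with the Bayesian posterior summaries below, the OLS point estimates and 95% confidence intervals can be collected from the fitted results object. A minimal sketch using the standard statsmodels results attributes (params, bse, conf_int):
# Collect OLS point estimates, standard errors, and 95% confidence intervals
# into a single table for later comparison with the Bayesian posterior.
ols_ci = model.conf_int()
ols_table = pd.DataFrame({
    'coef': model.params,
    'std err': model.bse,
    'ci 2.5%': ols_ci[0],
    'ci 97.5%': ols_ci[1],
})
print(ols_table)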
Building the Model#
basic_model = Model()
with basic_model:
    # Priors for unknown model parameters
    alpha = Normal("alpha", mu=0, sigma=10)
    beta0 = Normal("beta0", mu=0, sigma=10)
    beta1 = Normal("beta1", mu=0, sigma=10)
    sigma = HalfNormal("sigma", sigma=5)

    # Expected value of the outcome
    mu = alpha + beta0 * X1 + beta1 * X2

    # Likelihood (sampling distribution) of the observations
    Y_obs = Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)

    # Draw 5,000 posterior samples
    trace_in = sample(5000)
100.00% [6000/6000 00:07<00:00 Sampling chain 0, 0 divergences]
100.00% [6000/6000 00:06<00:00 Sampling chain 1, 0 divergences]
az.plot_posterior(trace_in)
Posterior density panels for alpha, beta0, beta1, and sigma.
az.plot_trace(trace_in, compact=True)
Trace and density panels for alpha, beta0, beta1, and sigma.
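The trace plots can be cross-checked numerically. A minimal sketch, assuming the default NUTS sample_stats layout where divergent transitions are recorded under "diverging":
# Count divergent transitions and compute the R-hat convergence diagnostic;
# values of R-hat close to 1.0 indicate the two chains have mixed well.
n_divergent = int(trace_in.sample_stats["diverging"].sum())
print(f"Divergent transitions: {n_divergent}")
print(az.rhat(trace_in))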
np.shape(X1)
(1000,)
plt.hist(X1)
(array([ 18., 55., 122., 230., 248., 194., 95., 29., 6., 3.]),
array([-2.83591463, -2.18122708, -1.52653953, -0.87185197, -0.21716442,
0.43752313, 1.09221068, 1.74689823, 2.40158578, 3.05627333,
3.71096089]),
<BarContainer object of 10 artists>)
az.summary(trace_in,hdi_prob=0.94)
|       | mean   | sd    | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|-------|--------|-------|--------|---------|-----------|---------|----------|----------|-------|
| alpha | 5.011  | 0.031 | 4.953  | 5.069   | 0.000     | 0.000   | 17161.0  | 7772.0   | 1.0   |
| beta0 | 7.054  | 0.031 | 6.995  | 7.114   | 0.000     | 0.000   | 15873.0  | 8126.0   | 1.0   |
| beta1 | 12.811 | 0.154 | 12.525 | 13.103  | 0.001     | 0.001   | 14779.0  | 8785.0   | 1.0   |
| sigma | 0.980  | 0.022 | 0.939  | 1.021   | 0.000     | 0.000   | 11961.0  | 8264.0   | 1.0   |
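As a complementary check on model adequacy, we can simulate new outcomes from the posterior and compare them with the observed data. A minimal sketch, assuming pm.sample_posterior_predictive with extend_inferencedata=True (available in recent PyMC 5 releases):
# Posterior predictive check: simulate Y_obs from the fitted model and
# overlay the simulated distributions on the observed outcomes.
with basic_model:
    pm.sample_posterior_predictive(trace_in, extend_inferencedata=True)

az.plot_ppc(trace_in, num_pp_samples=200)
plt.show()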
What happens when using only 10 observations?#
# Generate 10 random unique indices from the range of your data length
random_indices = np.random.choice(len(X1), 10, replace=False)
# Select the corresponding elements from X1, X2, and Y
selected_X1 = X1[random_indices]
selected_X2 = X2[random_indices]
selected_Y = Y.values[random_indices]
selected_model = Model()
with selected_model:
    # Priors for unknown model parameters
    alpha = Normal("alpha", mu=0, sigma=10)
    beta0 = Normal("beta0", mu=0, sigma=10)
    beta1 = Normal("beta1", mu=0, sigma=10)
    sigma = HalfNormal("sigma", sigma=5)

    # Expected value of the outcome
    mu = alpha + beta0 * selected_X1 + beta1 * selected_X2

    # Likelihood (sampling distribution) of the observations
    Y_obs = Normal("Y_obs", mu=mu, sigma=sigma, observed=selected_Y)

    # Draw 5,000 posterior samples
    trace_selected = sample(5000)
100.00% [6000/6000 00:08<00:00 Sampling chain 0, 1 divergences]
100.00% [6000/6000 00:09<00:00 Sampling chain 1, 0 divergences]
az.plot_posterior(trace_selected)
Posterior density panels for alpha, beta0, beta1, and sigma (10-observation model).
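To see the effect of the smaller sample directly, both posteriors can be overlaid in a single forest plot; a minimal sketch using az.plot_forest, which accepts a list of traces:
# Overlay the full-data and 10-observation posteriors; the wider intervals
# of the small-data model show the loss of information at a glance.
az.plot_forest(
    [trace_in, trace_selected],
    model_names=["1000 observations", "10 observations"],
    var_names=["alpha", "beta0", "beta1", "sigma"],
    combined=True,
)
plt.show()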
az.summary(trace_selected,hdi_prob=0.94)
|       | mean   | sd    | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|-------|--------|-------|--------|---------|-----------|---------|----------|----------|-------|
| alpha | 4.283  | 0.385 | 3.581  | 5.033   | 0.005     | 0.004   | 6071.0   | 4790.0   | 1.0   |
| beta0 | 6.620  | 0.453 | 5.761  | 7.487   | 0.006     | 0.004   | 6391.0   | 4463.0   | 1.0   |
| beta1 | 12.654 | 1.270 | 10.052 | 14.876  | 0.017     | 0.012   | 5998.0   | 4707.0   | 1.0   |
| sigma | 1.006  | 0.370 | 0.493  | 1.653   | 0.006     | 0.004   | 3700.0   | 4565.0   | 1.0   |
The HDI intervals become broader, reflecting greater uncertainty: fewer observations mean less information for the model to leverage.
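A quick way to quantify this is to compare the 94% HDI widths from the two summaries; a minimal sketch reusing the traces computed above:
# Compare 94% HDI widths between the full-data and 10-observation posteriors;
# the ratio shows how much wider each interval becomes with only 10 points.
full = az.summary(trace_in, hdi_prob=0.94)
small = az.summary(trace_selected, hdi_prob=0.94)
widths = pd.DataFrame({
    'width_1000_obs': full['hdi_97%'] - full['hdi_3%'],
    'width_10_obs': small['hdi_97%'] - small['hdi_3%'],
})
widths['ratio'] = widths['width_10_obs'] / widths['width_1000_obs']
print(widths)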