DATA 644 (Bayesian Reasoning in Data Science): Assignment 1#

%pip install pymc pytensor
%matplotlib inline
import pickle
import warnings

import arviz as az
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm

from pymc import Model, Normal, Slice, sample, HalfNormal
from pymc.distributions import Interpolated
from scipy import stats

from google.colab import files

uploaded = files.upload()
Saving my_arrays.pkl to my_arrays.pkl
with open("my_arrays.pkl", 'rb') as file:
    data = pickle.load(file)

np.shape(data['X1'])
(1000,)
X1 = data['X1']
X2 = data['X2']
Y = data['Y']

Visual inspection#

import pandas as pd
import statsmodels.api as sm
import seaborn as sns


# Create a DataFrame from the arrays
data = pd.DataFrame({
    'X1': X1,
    'X2': X2,
    'Y': Y
})

# Prepare the features and target variable
X = data[['X1', 'X2']]
X = sm.add_constant(X)  # Adds a constant term to the predictor
Y = data['Y']


# Visualization
plt.figure(figsize=(12, 6))

# Plot Y vs X1
plt.subplot(1, 2, 1)
sns.scatterplot(x='X1', y='Y', data=data)
plt.title('Y vs X1')

# Plot Y vs X2
plt.subplot(1, 2, 2)
sns.scatterplot(x='X2', y='Y', data=data)
plt.title('Y vs X2')

plt.show()


# Linear fit (OLS) for comparison with the Bayesian model below

# Perform linear regression
model = sm.OLS(Y, X).fit()

# Print the summary to inspect the coefficients and model performance
print(model.summary())
[Figure: scatter plots of Y vs X1 and Y vs X2]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                 3.081e+04
Date:                Tue, 20 Feb 2024   Prob (F-statistic):               0.00
Time:                        15:21:11   Log-Likelihood:                -1395.6
No. Observations:                1000   AIC:                             2797.
Df Residuals:                     997   BIC:                             2812.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0115      0.031    161.944      0.000       4.951       5.072
X1             7.0540      0.031    229.558      0.000       6.994       7.114
X2            12.8148      0.155     82.457      0.000      12.510      13.120
==============================================================================
Omnibus:                        0.260   Durbin-Watson:                   2.042
Prob(Omnibus):                  0.878   Jarque-Bera (JB):                0.335
Skew:                          -0.027   Prob(JB):                        0.846
Kurtosis:                       2.929   Cond. No.                         5.10
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
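
For a later side-by-side comparison with the Bayesian posterior, the OLS point estimates and 95% confidence intervals can be pulled out of the fitted results object. This is an optional sketch (not part of the original assignment code) that assumes the fitted model from the cell above.

# Optional: extract OLS point estimates and 95% confidence intervals
# for comparison with the Bayesian posterior means and HDIs later on
ols_coefs = model.params             # point estimates for const, X1, X2
ols_ci = model.conf_int(alpha=0.05)  # 95% confidence intervals (lower, upper)

print(ols_coefs)
print(ols_ci)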

Building the Model#

basic_model = Model()

with basic_model:
    # Priors for unknown model parameters
    alpha = Normal("alpha", mu=0, sigma=10)
    beta0 = Normal("beta0", mu=0, sigma=10)
    beta1 = Normal("beta1", mu=0, sigma=10)
    sigma = HalfNormal("sigma", sigma=5)


    # Expected value of outcome
    mu = alpha + beta0 * X1 + beta1 * X2

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)

    # draw 5,000 posterior samples per chain
    trace_in = sample(5000)
100.00% [6000/6000 00:07<00:00 Sampling chain 0, 0 divergences]
100.00% [6000/6000 00:06<00:00 Sampling chain 1, 0 divergences]
az.plot_posterior(trace_in)
[Figure: posterior distributions of alpha, beta0, beta1, and sigma]
az.plot_trace(trace_in, compact=True)
[Figure: trace plots (marginal posteriors and sampling traces) for alpha, beta0, beta1, and sigma]
np.shape(X1)
(1000,)
plt.hist(X1)
[Figure: histogram of X1, roughly bell-shaped and centered near 0]
az.summary(trace_in, hdi_prob=0.94)
         mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
alpha   5.011  0.031   4.953    5.069      0.000    0.000   17161.0    7772.0    1.0
beta0   7.054  0.031   6.995    7.114      0.000    0.000   15873.0    8126.0    1.0
beta1  12.811  0.154  12.525   13.103      0.001    0.001   14779.0    8785.0    1.0
sigma   0.980  0.022   0.939    1.021      0.000    0.000   11961.0    8264.0    1.0
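
As an optional diagnostic (not part of the original assignment), a posterior predictive check can verify that draws from the fitted model reproduce the observed Y. A minimal sketch, assuming basic_model and trace_in from above:

with basic_model:
    # draw replicated datasets from the posterior predictive distribution
    ppc = pm.sample_posterior_predictive(trace_in)

# attach the posterior predictive group to the trace and compare against the observed Y
trace_in.extend(ppc)
az.plot_ppc(trace_in, num_pp_samples=100)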

What happens if we use only 10 observations?#

# Generate 10 random unique indices from the range of your data length
random_indices = np.random.choice(len(X1), 10, replace=False)


# Select the corresponding elements from X1, X2, and Y
selected_X1 = X1[random_indices]
selected_X2 = X2[random_indices]
selected_Y = Y.values[random_indices]
selected_model = Model()

with selected_model:
    # Priors for unknown model parameters
    alpha = Normal("alpha", mu=0, sigma=10)
    beta0 = Normal("beta0", mu=0, sigma=10)
    beta1 = Normal("beta1", mu=0, sigma=10)
    sigma = HalfNormal("sigma", sigma=5)


    # Expected value of outcome
    mu = alpha + beta0 * selected_X1 + beta1 * selected_X2

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal("Y_obs", mu=mu, sigma=sigma, observed=selected_Y)

    # draw 5,000 posterior samples per chain
    trace_selected = sample(5000)
100.00% [6000/6000 00:08<00:00 Sampling chain 0, 1 divergences]
100.00% [6000/6000 00:09<00:00 Sampling chain 1, 0 divergences]
az.plot_posterior(trace_selected)
[Figure: posterior distributions of alpha, beta0, beta1, and sigma for the 10-observation model]
az.summary(trace_selected, hdi_prob=0.94)
         mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
alpha   4.283  0.385   3.581    5.033      0.005    0.004    6071.0    4790.0    1.0
beta0   6.620  0.453   5.761    7.487      0.006    0.004    6391.0    4463.0    1.0
beta1  12.654  1.270  10.052   14.876      0.017    0.012    5998.0    4707.0    1.0
sigma   1.006  0.370   0.493    1.653      0.006    0.004    3700.0    4565.0    1.0

The HDI intervals become broader, highlighting greater uncertainty: fewer observations mean less information for the model to leverage.
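
To make the broadening concrete, the 94% HDI widths from the two runs can be compared directly. A rough sketch (not part of the original solution), assuming trace_in and trace_selected from the cells above:

# Compare 94% HDI widths between the full-data and 10-observation posteriors
summary_full = az.summary(trace_in, hdi_prob=0.94)
summary_small = az.summary(trace_selected, hdi_prob=0.94)

width_full = summary_full["hdi_97%"] - summary_full["hdi_3%"]
width_small = summary_small["hdi_97%"] - summary_small["hdi_3%"]

# Ratios well above 1 confirm much wider intervals with only 10 observations
print(width_small / width_full)

# Side-by-side interval plot of the two posteriors
az.plot_forest([trace_in, trace_selected], model_names=["1000 obs", "10 obs"], combined=True)

From the summaries above, for example, the alpha HDI widens from about 0.12 with 1000 observations to about 1.45 with 10, roughly a twelvefold increase.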