Gaussian Inferences, Student's t, and Updating Beliefs#

%pip install pymc pytensor
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
import pandas as pd
import seaborn as sns
import arviz as az
import requests
import pymc as pm
az.style.use('arviz-darkgrid')

%config InlineBackend.figure_format = 'retina'
print(f"Running on PyMC v{pm.__version__}")
print(f"Running on ArviZ v{az.__version__}")
Running on PyMC v5.7.2
Running on ArviZ v0.15.1

The reference for this exercise is [1] (one of the reference textbooks for this course), from which the exercise is taken.

We use a dataset that comes from Nuclear Magnetic Resonance (NMR), a powerful technique used to study molecules. NMR allows us to measure chemical shifts from the nuclei of certain types of atoms. In quantum chemistry, a chemical shift corresponds to a shift of the signal from a reference value in the chemical environment that we are studying.

We could just as well have measured the heights of a group of people or the average time it takes to travel back home… the analysis would be the same.

We do not need our data to be truly Gaussian (or any other distribution): we only require that a Gaussian distribution be a reasonable approximation of our data. For this example, we have 47 chemical shift values that we can load into a NumPy array and plot using the following code:

import io
import csv


target_url = 'https://raw.githubusercontent.com/cfteach/brds/main/datasets/chemical_shifts.csv'

download = requests.get(target_url).content
df = pd.read_csv(io.StringIO(download.decode('utf-8')))


data=df.iloc[:,0].values

print(data)
[55.12 53.73 50.24 52.05 56.4  48.45 52.34 55.65 51.49 51.86 63.43 53.
 56.09 51.93 52.31 52.33 57.48 57.44 55.14 53.93 54.62 56.09 68.58 51.36
 55.47 50.73 51.94 54.95 50.39 52.91 51.5  52.68 47.72 49.73 51.82 54.99
 52.84 53.19 54.52 51.46 53.73 51.61 49.81 52.42 54.3  53.84 53.16]

The KDE plot of this dataset shows a Gaussian-like distribution, except for two data points that are far away from the mean:

az.plot_kde(data, rug=True)

#rug plots https://en.wikipedia.org/wiki/Rug_plot

plt.yticks([0], alpha=0.5)
([<matplotlib.axis.YTick at 0x7ff47cc83fa0>], [Text(0, 0, '0')])
_images/Gaussian_Inferences_7_1.png

Let’s forget about those two “outliers” for a moment, and assume that a Gaussian distribution is a proper description of the data.

Since we do not know the mean and standard deviation, we must set priors for both of them.

Let's build a reasonable model; this could be:

\(\mu \sim U(l,u)\)

\(\sigma \sim |\mathcal{N}(0,\sigma_{\sigma})|\)

\(y \sim \mathcal{N}(\mu,\sigma)\)

Question: What is the meaning of this model?

If we do not know the possible values of \(\mu\) and \(\sigma\), we can set priors on them that reflect our ignorance about these parameters.

Using PyMC, we can write the model as follows:

with pm.Model() as model_g:

    μ = pm.Uniform('μ', lower = 40, upper = 70)
    σ = pm.HalfNormal('σ', sigma = 10)
    y = pm.Normal('y', mu=μ, sigma =σ, observed=data)
    idata_g = pm.sample(4000)

az.plot_trace(idata_g)

100.00% [5000/5000 00:10<00:00 Sampling chain 0, 0 divergences]
100.00% [5000/5000 00:05<00:00 Sampling chain 1, 0 divergences]
array([[<Axes: title={'center': 'μ'}>, <Axes: title={'center': 'μ'}>],
       [<Axes: title={'center': 'σ'}>, <Axes: title={'center': 'σ'}>]],
      dtype=object)
_images/Gaussian_Inferences_12_5.png
az.plot_pair(idata_g, kind='kde', marginals=True)
array([[<Axes: >, None],
       [<Axes: xlabel='μ', ylabel='σ'>, <Axes: >]], dtype=object)
_images/Gaussian_Inferences_13_1.png
az.summary(idata_g)
       mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
μ    53.539  0.526  52.531   54.482      0.006    0.004    7741.0    5147.0    1.0
σ     3.567  0.378   2.911    4.290      0.004    0.003    7673.0    5922.0    1.0

Remember (each of these diagnostics can also be computed individually, as shown in the short sketch after this list):

  • mcse: Monte Carlo Standard Error (MCSE), a measure of the accuracy of the chains. It is defined as the standard deviation of the chains divided by their effective sample size.

  • ess: effective sample size. ESS takes into account the autocorrelation within the chain, indicating the equivalent number of independent samples.

  • ess_bulk: a measure of sampling efficiency in the bulk of the distribution. The rule of thumb is for this value to be greater than 100 per chain on average; since we ran N chains, we need ess_bulk to be greater than N*100 for each parameter.

  • ess_tail: an effective sample size estimate for the tails of the distribution. The rule of thumb is again a value greater than 100 per chain on average.

  • r_hat: a diagnostic that tests for lack of convergence by comparing the variance between multiple chains to the variance within each chain. It converges to unity when each of the traces is a sample from the target posterior; values greater than one indicate that one or more chains have not yet converged.
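
The same diagnostics reported by az.summary can also be computed one at a time with standard ArviZ functions (a minimal sketch; the numbers should match the table above):

print(az.ess(idata_g))                  # bulk effective sample size per parameter
print(az.ess(idata_g, method="tail"))   # tail effective sample size
print(az.rhat(idata_g))                 # R-hat convergence diagnostic
print(az.mcse(idata_g))                 # Monte Carlo standard error of the posterior mean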

Predictive Checks#

We can use the posterior to generate predictions, e.g., simulated datasets, each of the same size as the data, and check their consistency with the data. This is a self-consistency check allowed by Bayesian reasoning: it is possible to generate predictions \(\tilde{y}\), based on the data \(y\) and the estimated parameters \(\theta\):

\(p(\tilde{y}|y) = \int p(\tilde{y}|\theta)\, p(\theta|y)\, d\theta\)

We can then compare the predicted \(\tilde{y}\) to our data \(y\) to criticize our model. Models should always be checked.

Ref: https://www.pymc-labs.com/blog-posts/out-of-model-predictions-with-pymc/

# generate posterior predictive samples (one simulated dataset per posterior draw)
y_pred_g = pm.sample_posterior_predictive(idata_g, model=model_g)
100.00% [8000/8000 00:00<00:00]

Let's take a look at the dimensions:

y_pred_g
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:  (chain: 2, draw: 4000, y_dim_2: 47)
      Coordinates:
        * chain    (chain) int64 0 1
        * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 3993 3994 3995 3996 3997 3998 3999
        * y_dim_2  (y_dim_2) int64 0 1 2 3 4 5 6 7 8 9 ... 38 39 40 41 42 43 44 45 46
      Data variables:
          y        (chain, draw, y_dim_2) float64 48.04 49.55 52.38 ... 50.62 53.6
      Attributes:
          created_at:                 2024-02-13T16:32:53.178306
          arviz_version:              0.15.1
          inference_library:          pymc
          inference_library_version:  5.7.2

    • <xarray.Dataset>
      Dimensions:  (y_dim_0: 47)
      Coordinates:
        * y_dim_0  (y_dim_0) int64 0 1 2 3 4 5 6 7 8 9 ... 38 39 40 41 42 43 44 45 46
      Data variables:
          y        (y_dim_0) float64 55.12 53.73 50.24 52.05 ... 54.3 53.84 53.16
      Attributes:
          created_at:                 2024-02-13T16:32:53.179899
          arviz_version:              0.15.1
          inference_library:          pymc
          inference_library_version:  5.7.2

y_pred_g.posterior_predictive
<xarray.Dataset>
Dimensions:  (chain: 2, draw: 4000, y_dim_2: 47)
Coordinates:
  * chain    (chain) int64 0 1
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 3993 3994 3995 3996 3997 3998 3999
  * y_dim_2  (y_dim_2) int64 0 1 2 3 4 5 6 7 8 9 ... 38 39 40 41 42 43 44 45 46
Data variables:
    y        (chain, draw, y_dim_2) float64 48.04 49.55 52.38 ... 50.62 53.6
Attributes:
    created_at:                 2024-02-13T16:32:53.178306
    arviz_version:              0.15.1
    inference_library:          pymc
    inference_library_version:  5.7.2
pred_y_ppc = y_pred_g.posterior_predictive['y'].values

shape_ppc = np.shape(pred_y_ppc)

Compare these to the number of chains, the number of draws per chain, and the length of the original dataset. We are essentially creating multiple datasets “equivalent” to the original one, but generated from our posterior.

print(f"n. chains: {shape_ppc[0]}, chain size: {shape_ppc[1]}, sample size: {shape_ppc[2]}")

print(np.shape(pred_y_ppc[0][3999]))

print(pred_y_ppc[0][3999])

print(pred_y_ppc[1][3999])
n. chains: 2, chain size: 4000, sample size: 47
(47,)
[54.56196887 45.84818176 56.91440002 51.78632963 50.42868409 58.65872661
 52.82409523 49.67824552 53.92985227 55.1898878  56.10804772 57.8379738
 51.28465596 53.86983622 50.88621908 50.18986171 52.98852019 53.48973493
 49.15396679 54.42875    51.89225678 50.29473982 53.99299552 51.78988797
 59.10511896 52.68648569 59.11030329 58.76436386 50.26512947 50.79346666
 52.08797544 47.40867226 46.78969642 56.74736867 53.08161053 51.90731529
 56.73095881 52.27390547 48.64009168 48.63594795 45.8177419  50.60600493
 44.9153624  50.60576934 55.79019098 53.98290113 51.63944768]
[52.58859073 53.96614959 50.52412449 56.63637902 53.09461248 55.12478532
 54.37482564 53.20338366 55.09473991 55.31263759 51.72874779 49.32984181
 56.51255649 53.40142983 50.4395486  53.70319971 52.55587006 57.54516864
 53.33227444 53.96576249 53.58650251 56.46331526 51.54540899 45.70885982
 58.71974489 51.44987348 54.7688283  55.25775929 53.29048747 51.79087913
 56.86680489 51.10209826 51.51555691 48.63858945 54.62597439 55.62236731
 58.49437685 54.98267225 58.2867231  56.91457822 56.84101943 57.46817311
 57.94384817 49.20132862 51.74914696 50.6242746  53.60380525]

ax = az.plot_ppc(y_pred_g, figsize=(12,6), mean=True)
ax.legend(fontsize=15)
<matplotlib.legend.Legend at 0x7ff453c18910>
_images/Gaussian_Inferences_24_1.png
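
The visual check above can be complemented with a simple numerical posterior predictive check (a small sketch added here, not part of the original exercise): compare the observed sample mean and standard deviation to their distributions across the simulated datasets. Probabilities very close to 0 or 1 would signal a mismatch between model and data.

# flatten chains and draws: shape (n_chains * n_draws, n_observations)
flat_ppc = pred_y_ppc.reshape(-1, pred_y_ppc.shape[-1])
sim_means = flat_ppc.mean(axis=1)
sim_stds = flat_ppc.std(axis=1)

print(f"observed mean = {data.mean():.2f}, "
      f"P(simulated mean >= observed) = {(sim_means >= data.mean()).mean():.2f}")
print(f"observed std  = {data.std():.2f}, "
      f"P(simulated std >= observed)  = {(sim_stds >= data.std()).mean():.2f}")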
  Note

You may argue that the two outliers at large values of the chemical shifts are influencing the results, i.e., making the Gaussian have a larger standard deviation. You could remove those outliers... but Bayesians generally prefer to change the model, using different priors and likelihoods, rather than resorting to ad hoc heuristics such as outlier removal.

One very useful option is to replace the Gaussian likelihood with a Student's t-distribution.

Student’s t Distribution#

https://en.wikipedia.org/wiki/Student%27s_t-distribution

  • Fun fact: origin of the name

  • Distribution

plt.figure(figsize=(7, 4))
x_values = np.linspace(-10, 10, 500)
for nu in [1, 2, 30]:
    distri = stats.t(nu)
    x_pdf = distri.pdf(x_values)
    plt.plot(x_values, x_pdf, label=fr'$\nu = {nu}$', lw=3)

x_pdf = stats.norm.pdf(x_values)
plt.plot(x_values, x_pdf, 'k--', label=r'$\nu = \infty$') #normal
plt.xlabel('x')
plt.yticks([])
plt.legend()
plt.xlim(-5, 5)
(-5.0, 5.0)
_images/Gaussian_Inferences_28_1.png

For the t-distribution we now have three parameters to consider: \(\mu\), \(\sigma\), and \(\nu\). The parameter \(\nu\) (the degrees of freedom, sometimes called the normality parameter) controls how heavy the tails are: small values give very heavy tails, while as \(\nu \to \infty\) the Student's t approaches a Gaussian. Please see notes in class.
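
A quick numerical illustration of the heavy tails (a small sketch added here for clarity): the probability of landing more than three standard units away from the center is much larger for small \(\nu\) than for a Gaussian.

# probability of |x| > 3 for a standard Student's t with various ν, vs. a Gaussian
for nu in [1, 2, 30]:
    print(f"ν = {nu:>2}: P(|x| > 3) = {2 * stats.t(nu).sf(3):.4f}")
print(f"Gaussian: P(|x| > 3) = {2 * stats.norm.sf(3):.4f}")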

This will reflect in our model according to:

\(\mu \sim U(l,u)\) — uniform (prior on \(\mu\))

\(\sigma \sim |\mathcal{N}(0,\sigma_{\sigma})|\) — half-normal, i.e. positive (prior on \(\sigma\))

\(\nu \sim Exp(\lambda)\) — exponential (prior on \(\nu\))

\(y \sim \mathcal{T}(\mu,\sigma,\nu)\) — Student's t (likelihood)

with pm.Model() as model_t:

    μ = pm.Uniform('μ', 40, 75)
    σ = pm.HalfNormal('σ', sigma=10)
    ν = pm.Exponential('ν', 1/30)
    y = pm.StudentT('y', mu=μ, sigma=σ, nu=ν, observed=data)
    idata_t = pm.sample(1000, tune=5000, return_inferencedata=True)

# tune=5000: the first 5000 draws per chain are used for sampler adaptation
# (step-size tuning / "warmup") and discarded; 1000 draws per chain are kept.
# https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.Exponential.html
100.00% [6000/6000 00:21<00:00 Sampling chain 0, 0 divergences]
100.00% [6000/6000 00:16<00:00 Sampling chain 1, 0 divergences]
# Define rate parameters for the exponential distribution
rate_params = [1/30, 1/10, 1]

# Generate x values
x = np.linspace(0, 10, 1000)

# Plot the PDF for each rate parameter
plt.figure(figsize=(5, 3))
for rate in rate_params:
    scale = 1 / rate  # Convert rate (lambda) to scale parameter for scipy.stats
    pdf_values = stats.expon.pdf(x, scale=scale)
    plt.plot(x, pdf_values, label=f'λ = {rate}')

plt.title('PDF of Exponential Distribution for Various λ')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()
_images/Gaussian_Inferences_31_0.png
az.plot_trace(idata_t)
array([[<Axes: title={'center': 'μ'}>, <Axes: title={'center': 'μ'}>],
       [<Axes: title={'center': 'σ'}>, <Axes: title={'center': 'σ'}>],
       [<Axes: title={'center': 'ν'}>, <Axes: title={'center': 'ν'}>]],
      dtype=object)
_images/Gaussian_Inferences_32_1.png
az.summary(idata_t)
       mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
μ    53.074  0.399  52.276   53.793      0.009    0.007    1765.0    1021.0    1.0
σ     2.181  0.401   1.476    2.927      0.013    0.009     919.0    1034.0    1.0
ν     4.496  3.708   1.243    9.300      0.112    0.079    1068.0     996.0    1.0
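
To make the effect of the heavier-tailed likelihood explicit, we can put the two fits side by side (a small sketch added here): the Student's t model reports a noticeably smaller \(\sigma\) than the Gaussian model, since the two outliers no longer force the likelihood to widen.

# compare the posteriors of μ and σ under the Gaussian and Student's t models
az.plot_forest([idata_g, idata_t], model_names=["Gaussian", "Student's t"],
               var_names=["μ", "σ"], combined=True, figsize=(10, 3))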
y_pred_t = pm.sample_posterior_predictive(idata_t, model_t)


ax = az.plot_ppc(y_pred_t, figsize=(12,6), mean=True)
ax.legend(fontsize=15)
plt.xlim(40, 70)
plt.xlabel('y')
100.00% [2000/2000 00:00<00:00]
Text(0.5, 0, 'y')
_images/Gaussian_Inferences_34_3.png

Updating beliefs#

During the last lecture we discussed how to update beliefs with new observations.

The posterior calculated by our model from a previous set of observations becomes the new prior for our model, and this time the likelihood uses the new observations…

First, we need to capture the posterior. The following function will be very useful.

from pymc.distributions import Interpolated

def from_posterior(param, samples):
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 1000)
    y = stats.gaussian_kde(samples)(x)

    # what was never sampled should have a small probability but not 0;
    # if needed, the domain could be extended with a linear fall-off of the
    # density (disabled here):
    #x = np.concatenate([[x[0] - 1.5 * width], x, [x[-1] + 1.5 * width]])
    #y = np.concatenate([[0], y, [0]])
    return Interpolated(param, x, y)

Now let's assume we have two additional observations; for example, suppose they are:

new_observations = [53, 81]

One data point falls in the core of the distribution; the other lies in the tail.

with pm.Model() as model_t:

    μ = from_posterior("μ", np.asarray(idata_t.posterior["μ"]).reshape(-1))
    σ = from_posterior("σ", np.asarray(idata_t.posterior["σ"]).reshape(-1))
    ν = from_posterior("ν", np.asarray(idata_t.posterior["ν"]).reshape(-1))

    y = pm.StudentT('y', mu=μ, sigma=σ, nu=ν, observed=new_observations)
    idata_t_update = pm.sample(1000,tune=3000,return_inferencedata=True)

100.00% [4000/4000 00:10<00:00 Sampling chain 0, 0 divergences]
100.00% [4000/4000 00:09<00:00 Sampling chain 1, 0 divergences]
az.plot_trace(idata_t_update, compact= True)
array([[<Axes: title={'center': 'μ'}>, <Axes: title={'center': 'μ'}>],
       [<Axes: title={'center': 'σ'}>, <Axes: title={'center': 'σ'}>],
       [<Axes: title={'center': 'ν'}>, <Axes: title={'center': 'ν'}>]],
      dtype=object)
_images/Gaussian_Inferences_42_1.png
az.plot_posterior(idata_t_update)
array([<Axes: title={'center': 'μ'}>, <Axes: title={'center': 'σ'}>,
       <Axes: title={'center': 'ν'}>], dtype=object)
_images/Gaussian_Inferences_43_1.png
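
To visualize how the two new observations change our beliefs, we can overlay the posterior for \(\mu\) before and after the update (a small sketch added here; the same can be done for \(\sigma\) and \(\nu\)):

fig, ax = plt.subplots(figsize=(8, 4))
az.plot_kde(np.asarray(idata_t.posterior["μ"]).reshape(-1),
            plot_kwargs={"color": "C0"}, label="before update", ax=ax)
az.plot_kde(np.asarray(idata_t_update.posterior["μ"]).reshape(-1),
            plot_kwargs={"color": "C1"}, label="after update", ax=ax)
ax.set_xlabel("μ")
ax.legend()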