Chapter 2 - Information and Conditioning
Notes on information theory and conditional probability.
Information and $\sigma$-algebras
Idea: we must specify what position we will take in the underlying security at each future time contingent on how the uncertainty between the present time and that future time is resolved.
In the experiment of the coin toss, we do not know the true $\omega$ precisely, but we can make a list of sets that are sure to contain it and other sets that are sure not to contain it $\rightarrow$ these are the sets that are resolved by the information.
The $\emptyset$ and whole space $\Omega$ are always resolved, even without information; the true $\omega$ does not belong to $\emptyset$ and does belong to $\Omega$.
Take the following 4 $\sigma$-algebras:
- $\mathcal{F}_0 = \{\emptyset, \Omega\}$ is the trivial $\sigma$-algebra, containing only these two sets (no information).
- $\mathcal{F}_1 = \{\emptyset, \Omega, A_H, A_T\}$ contains the information learned by observing the first coin toss: we know the outcome of the first toss and nothing more.
- $\mathcal{F}_2 = \{\emptyset, \Omega, A_H, A_T, A_{HH}, A_{HT}, A_{TH}, A_{TT}, \dots, A_{HT} \cup A_{TT}\} \rightarrow$ contains 16 resolved sets: the information learned by observing the first two coin tosses.
- $\mathcal{F}_3$ is the set of all subsets of $\Omega$, which contains 256 sets.
These four $\sigma$-algebras are indexed by time; as time moves forward, we obtain finer resolution, i.e., if $n \le m$ then $\mathcal{F}_m$ contains every set in $\mathcal{F}_n$ and even more.
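To make the counting above concrete, here is a small Python sketch (my own illustration, not from the text) that enumerates $\mathcal{F}_0$ through $\mathcal{F}_3$ on the three-toss space by taking all unions of the atoms determined by the first $n$ tosses, and checks that the $\sigma$-algebras are nested.

```python
# Sketch: enumerate the sigma-algebras F_n resolved by the first n of 3 coin tosses.
from itertools import product, combinations

omega = [''.join(t) for t in product('HT', repeat=3)]  # the 8 outcomes of Omega

def atoms(n):
    """Atoms of F_n: outcomes grouped by their first n tosses."""
    groups = {}
    for w in omega:
        groups.setdefault(w[:n], set()).add(w)
    return list(groups.values())

def sigma_algebra(atom_list):
    """All unions of atoms (the empty union gives the empty set)."""
    return {frozenset().union(*combo)
            for r in range(len(atom_list) + 1)
            for combo in combinations(atom_list, r)}

F0, F1, F2, F3 = (sigma_algebra(atoms(n)) for n in range(4))
print(len(F0), len(F1), len(F2), len(F3))  # 2 4 16 256
print(F0 <= F1 <= F2 <= F3)                # True: finer resolution as time moves forward
```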
Definition: Filtration $\Omega$ is a non-empty set, and we fix $T$ and assume that for each $t \in [0, T]$ there is a $\sigma$-algebra $\mathcal{F}(t)$. Assume that if $s \le t$ then every set in $\mathcal{F}(s)$ is also in $\mathcal{F}(t)$. Then we call the collection of $\sigma$-algebras $\mathcal{F}(t)$ for $0 \le t \le T$, a filtration.
e.g., take $\Omega = C_0[0, T]$; the set of continuous functions defined on $[0, T]$ taking the value zero at time zero. Suppose $\overline{\omega}$ is chosen at random, and we get to observe it up to $t$ for $0 \le t \le T$.
We know $\overline{\omega}(s)$ for $0 \le s \le t$ but do not know its value for $t < s \le T$.
Thus, we see that the sets that are resolved by time $t$ are just those sets that can be described in terms of the path of $\overline{\omega}$ up to time $t$. The only sets resolved at time $0$ are $\emptyset$ and $\Omega$, so we have $\mathcal{F}(0) = \{\emptyset, \Omega\}$.
From this, we can see that the $\sigma$-algebras in a filtration can be built by taking unions and complements of certain fundamental sets in the way $\mathcal{F}_2$ was constructed. It suffices to work with indivisible sets in the $\sigma$-algebra (atoms) and not consider all the other sets. However, for the continuous case, it’s a little different.
Let’s say we choose a continuous function $f(u)$ defined only for $0 \le u \le t$ and satisfying $f(0) = 0$. The set of $\omega \in C_0[0, T]$ that agree with $f$ on $[0, t]$ and are free to take any values on $(t, T]$ forms an atom of $\mathcal{F}(t)$, which we can write as:
\[\{\omega \in C_0[0, T]; \omega(u) = f(u) \text{ for all } u \in [0, t]\}\]Another way to observe the evolution of information over time is to let $X$ be a random variable for which we know a formula ahead of time. We are only waiting to learn the value of $\omega$ to substitute into the formula and evaluate $X(\omega)$.
Assume, now, that we are only told $X(\omega)$. This resolves certain sets: every set of the form $\{X \in B\}$, where $B$ is a subset of $\mathbb{R}$, is resolved. We restrict attention to subsets $B$ that are Borel measurable.
Definition: Let $X$ be a random variable. The $\sigma$-algebra generated by $X$, denoted $\sigma(X)$, is the collection of subsets of $\Omega$ of the form $\{X \in B\}$, where $B$ ranges over the Borel subsets of $\mathbb{R}$.
Definition: Let $X$ be a random variable defined on a non-empty sample space $\Omega$, and let $\mathcal{G}$ be a $\sigma$-algebra of subsets of $\Omega$. If every set in $\sigma(X)$ is also in $\mathcal{G}$, then we say that $X$ is $\mathcal{G}$-measurable.
- $X$ is $\mathcal{G}$-measurable if and only if the information in $\mathcal{G}$ is sufficient to determine the value of $X$.
- If $X$ is $\mathcal{G}$-measurable, then $f(X)$ is also $\mathcal{G}$-measurable for any Borel-measurable function $f$; if the information in $\mathcal{G}$ is sufficient to determine the value of $X$, it will also determine the value of $f(X)$.
- If $X$ and $Y$ are $\mathcal{G}$-measurable, then $f(X, Y)$ is $\mathcal{G}$-measurable for any Borel-measurable function $f$ of two variables.
Definition: Let $\Omega$ be equipped with a filtration $\mathcal{F}(t)$, $0 \le t \le T$. Let $X(t)$ be a collection of random variables indexed by $t \in [0, T]$. This collection of random variables is an adapted stochastic process if for each $t$, the random variable $X(t)$ is $\mathcal{F}(t)$-measurable.
- A portfolio position $\Delta(t)$ taken at time $t$ must be $\mathcal{F}(t)$-measurable; it can depend only on information available to the investor at time $t$.
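As a quick finite-space illustration of measurability and adaptedness (my own sketch, not from the text): on the three-toss space, a random variable is $\mathcal{F}(n)$-measurable exactly when it is constant on each atom of $\mathcal{F}(n)$, i.e., when it depends only on the first $n$ tosses. The binomial stock price used later in these notes (assuming $S_0 = 4$, $u = 2$, $d = \tfrac{1}{2}$) is adapted in this sense.

```python
# Sketch: S_n depends only on the first n tosses, so it is F(n)-measurable,
# i.e. the process S_0, S_1, S_2, S_3 is adapted to the coin-toss filtration.
from itertools import product

S0, u, d = 4.0, 2.0, 0.5                      # assumed binomial-model parameters
omega = [''.join(t) for t in product('HT', repeat=3)]

def S(n, w):
    """Stock price after the first n tosses of outcome w."""
    heads = w[:n].count('H')
    return S0 * u**heads * d**(n - heads)

# F(2)-measurability: S_2 takes a single value on each atom of F(2)
for prefix in ('HH', 'HT', 'TH', 'TT'):
    values = {S(2, w) for w in omega if w.startswith(prefix)}
    print(prefix, values)                      # one value per atom

# S_3 is NOT F(2)-measurable: it takes two values on the atom {HHH, HHT}
print({S(3, w) for w in omega if w.startswith('HH')})
```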
Independence
When we have a $\sigma$-algebra $\mathcal{G}$ and a random variable $X$ that is neither measurable with respect to $\mathcal{G}$ nor independent of $\mathcal{G}$, the information in $\mathcal{G}$ is not sufficient to evaluate $X$, but we can estimate $X$ based on the information $\mathcal{G}$.
We need a probability measure in order to talk about independence. Independence can be affected by changes of probability measure; measurability is not.
e.g., on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, we say that two sets $A$ and $B$ in $\mathcal{F}$ are independent if:
\[\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)\]e.g., take the two-toss space $\Omega = \{HH, HT, TH, TT\}$ with $0 \le p \le 1$ and $q = 1 - p$,
\[\mathbb{P}(HH) = p^2, \quad \mathbb{P}(HT) = pq, \quad \mathbb{P}(TH) = pq, \quad \mathbb{P}(TT) = q^2\]With $A = \{HH, HT\}$ (head on the first toss) and $B = \{HH, TH\}$ (head on the second toss), we have $\mathbb{P}(A \cap B) = p^2 = \mathbb{P}(A) \cdot \mathbb{P}(B)$, so the sets $A, B$ are independent: knowing that the outcome $\omega$ is in $A$ does not change our estimation of the probability that it is in $B$. Similarly, for independent random variables: if $\omega$ occurs and we know the value of $X(\omega)$, then our estimation of the distribution of $Y$ is the same as when we did not know the value of $X(\omega)$.
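A quick numeric check of this example (a sketch of my own, with an arbitrary choice $p = 0.6$):

```python
# Sketch: verify P(A ∩ B) = P(A) P(B) on the two-toss space for a chosen p.
p = 0.6                          # arbitrary choice for illustration
q = 1 - p
prob = {'HH': p*p, 'HT': p*q, 'TH': q*p, 'TT': q*q}

A = {'HH', 'HT'}                 # head on the first toss
B = {'HH', 'TH'}                 # head on the second toss

P = lambda event: sum(prob[w] for w in event)
print(P(A & B), P(A) * P(B))     # both equal p*p = 0.36
```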
Definition: Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $\mathcal{G}$ and $\mathcal{H}$ be sub-$\sigma$-algebras of $\mathcal{F}$. We say that these two $\sigma$-algebras are independent if,
\[\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B), \quad \forall A \in \mathcal{G}, B \in \mathcal{H}\]Let $X, Y$ be random variables on the probability space. We say these two random variables are independent if the $\sigma$-algebras they generate, $\sigma(X)$ and $\sigma(Y)$, are independent. The random variable $X$ is said to be independent of the $\sigma$-algebra $\mathcal{G}$ if $\sigma(X)$ and $\mathcal{G}$ are independent. In particular, $X$ and $Y$ are independent if and only if
\[\mathbb{P} \{X \in C \text{ and } Y \in D\} = \mathbb{P}\{X \in C\} \cdot \mathbb{P}\{Y \in D\}\]for all Borel subsets $C$ and $D$ of $\mathbb{R}$.
e.g., let’s say we have $\Omega_3$ as our probability space of three independent coin tosses, on which the stock price random variables are defined (with $S_0 = 4$, up factor $u = 2$, and down factor $d = \frac{1}{2}$, so $S_n = S_0 u^{\#H} d^{\#T}$ over the first $n$ tosses).
\[\mathbb{P}(HHH) = p^3, \quad \mathbb{P}(HHT) = p^2q, \quad \mathbb{P}(HTH) = p^2q, \quad \mathbb{P}(HTT) = pq^2, \quad \dots, \quad \mathbb{P}(TTT) = q^3\]We see that $S_2$ and $S_3$ are not independent: if we know $S_2$ takes the value 16, then $S_3$ is either 8 or 32, and is not 2 or 0.50.
However, if $p = 1$, then after learning that $S_2 = 16$, we do not revise our estimate of the distribution of $S_3$. If $p = 0$, then $S_2$ cannot be 16, and we do not have to worry about revising our estimate of the distribution of $S_3$ if this occurs because it will not occur.
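A small sketch (my own, assuming the binomial parameters $S_0 = 4$, $u = 2$, $d = \tfrac{1}{2}$ and an arbitrary $p = 0.5$) that exhibits the dependence: the conditional distribution of $S_3$ given $S_2 = 16$ differs from the unconditional distribution of $S_3$.

```python
# Sketch: S_2 and S_3 are not independent -- conditioning on S_2 = 16 changes
# the distribution of S_3 from {32, 8, 2, 0.5} down to {32, 8}.
from itertools import product
from collections import defaultdict

S0, u, d, p = 4.0, 2.0, 0.5, 0.5            # assumed model parameters
outcomes = [''.join(t) for t in product('HT', repeat=3)]
prob = {w: p**w.count('H') * (1-p)**w.count('T') for w in outcomes}

def S(n, w):
    heads = w[:n].count('H')
    return S0 * u**heads * d**(n - heads)

def distribution(events):
    dist, total = defaultdict(float), sum(prob[w] for w in events)
    for w in events:
        dist[S(3, w)] += prob[w] / total
    return dict(dist)

print(distribution(outcomes))                                # unconditional law of S_3
print(distribution([w for w in outcomes if S(2, w) == 16]))  # given S_2 = 16: only 32 and 8
```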
Definition: Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $\mathcal{G}_1, \mathcal{G}_2, \dots$ be a sequence of sub-$\sigma$-algebras of $\mathcal{F}$. For a fixed $n$, we say that the $n$ $\sigma$-algebras $\mathcal{G}_1, \dots, \mathcal{G}_n$ are independent if
\[\mathbb{P}(A_1 \cap A_2 \cap \dots \cap A_n) = \mathbb{P}(A_1) \cdot \mathbb{P}(A_2) \cdot \dots \cdot \mathbb{P}(A_n)\]for all $A_1 \in \mathcal{G_1}, \dots, A_n \in \mathcal{G_n}$.
- Similarly, let $X_1, X_2, \dots$ be a sequence of random variables on some probability space.
- The $n$ random variables $X_1, \dots, X_n$ are independent if the $\sigma$-algebras $\sigma(X_1), \dots, \sigma(X_n)$ are independent. The full sequence of random variables is independent if, for every positive integer $n$, the $n$ random variables $X_1, \dots, X_n$ are independent.
e.g., the infinite independent coin-toss space exhibits the kind of independence described above. Let $\mathcal{G}_k$ be the $\sigma$-algebra of information associated with the $k$-th toss. $\mathcal{G}_k$ essentially comprises the sets $\emptyset$, $\Omega_\infty$ and the atoms:
\[\{\omega \in \Omega_{\infty}; \omega_k = H\} \quad\text{and}\quad \{\omega \in \Omega_\infty; \omega_k = T\}\]Theorem: Let $X, Y$ be independent random variables, and $f$ and $g$ be Borel-measurable functions on $\mathbb{R}$. Then $f(X), g(Y)$ are independent random variables.
The proof follows by letting $A$ be a set in $\sigma(f(X))$, the $\sigma$-algebra generated by $f(X)$; every such set is of the form $\{\omega \in \Omega; f(X(\omega)) \in C\}$, where $C$ is a Borel subset of $\mathbb{R}$. Defining $D = \{x \in \mathbb{R}; f(x) \in C\}$, which is Borel because $f$ is Borel-measurable, we can write $A$ as:
\[A = \{\omega \in \Omega; f(X(\omega)) \in C\} = \{\omega \in \Omega; X(\omega) \in D\}\]so $A \in \sigma(X)$. Similarly, every set $B$ in $\sigma(g(Y))$, the $\sigma$-algebra generated by $g(Y)$, belongs to $\sigma(Y)$. Since $X$ and $Y$ are independent, we conclude $\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)$, so $f(X)$ and $g(Y)$ are independent.
Definition: Let $X, Y$ be random variables. The pair $(X, Y)$ takes values in the plane $\mathbb{R}^2$, and the joint distribution measure of $(X, Y)$ is given by:
\[\mu_{X, Y}(C) = \mathbb{P}\{(X, Y) \in C\}, \quad\text{for all Borel Sets } C \subset \mathbb{R}^2\]i.e., assigning measure between 0 and 1 to subsets of $\mathbb{R}^2$ so that $\mu_{X, Y}(\mathbb{R}^2) = 1$
- countable additivity should also be satisfied

The joint cumulative distribution function of $(X, Y)$ is:
\[F_{X, Y}(a, b) = \mu_{X, Y}((-\infty, a] \times (-\infty, b]) = \mathbb{P}\{X \le a \text{ and } Y \le b\}, \quad \text{for all } a \in \mathbb{R}, b \in \mathbb{R}\]
A non-negative Borel-measurable function $f_{X, Y} (x, y)$ is a joint density for the pair of random variables $(X, Y)$ if:
\[\mu_{X, Y}(C) = \int_{-\infty}^\infty \int_{-\infty}^\infty \mathbb{I}_C(x, y) f_{X, Y}(x, y) dy dx \quad\text{for all Borel Sets } C \subset \mathbb{R}^2\]which holds if and only if:
\[F_{X, Y}(a, b) = \int_{-\infty}^a \int_{-\infty}^b f_{X, Y}(x, y) dy dx \quad \text{for all } a \in \mathbb{R}, b \in \mathbb{R}\]The distribution measures of $X$ and $Y$, often referred to as marginal distribution measures, are:
\[\begin{align*} \mu_X(A) &= \mathbb{P}\{X \in A\} = \mu_{X, Y}(A \times \mathbb{R}), \text{ for all Borel subsets } A \subset \mathbb{R}\\ \mu_Y(B) &= \mathbb{P}\{Y \in B\} = \mu_{X, Y}(\mathbb{R} \times B), \text{ for all Borel subsets } B \subset \mathbb{R} \end{align*}\]Thus, the marginal CDFs are:
\[\begin{align*} F_X(a) = \mu_X(-\infty, a] = \mathbb{P}\{X \le a\}, \text{ for all } a \in \mathbb{R}\\ F_Y(b) = \mu_Y(-\infty, b] = \mathbb{P}\{Y \le b\}, \text{ for all } b \in \mathbb{R} \end{align*}\]If the joint density $f_{X, Y}$ exists, then the marginal densities exist and are given by:
\[f_X(x) = \int_{-\infty}^\infty f_{X, Y}(x, y) dy \quad\text{and}\quad f_Y(y) = \int_{-\infty}^\infty f_{X, Y}(x, y) dx\]
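As a numeric sanity check (my own sketch, using the independent standard normal joint density that appears later in these notes as a test case), integrating out $y$ recovers the standard normal marginal:

```python
# Sketch: recover the marginal density f_X(x) by numerically integrating the
# joint density f_{X,Y}(x, y) over y, and compare with the closed form.
import numpy as np
from scipy.integrate import quad

def f_XY(x, y):
    """Joint density of two independent standard normals."""
    return np.exp(-0.5 * (x**2 + y**2)) / (2 * np.pi)

def f_X(x):
    """Standard normal marginal density."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

for x0 in (-1.0, 0.0, 0.7):
    marginal, _ = quad(lambda y: f_XY(x0, y), -np.inf, np.inf)
    print(x0, marginal, f_X(x0))   # the two values agree to integration tolerance
```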
Theorem: Let $X, Y$ be random variables. The following conditions are equivalent:
- $X$ and $Y$ are independent
- The joint distribution measure factors:
\[\mu_{X, Y}(A \times B) = \mu_X(A) \cdot \mu_Y(B)\]for all Borel subsets $A \subset \mathbb{R}$ and $B \subset \mathbb{R}$.
- The joint cumulative distribution function factors:
\[F_{X, Y}(a, b) = F_X(a) \cdot F_Y(b)\]for all $a \in \mathbb{R}, b \in \mathbb{R}$.
- The joint moment-generating function factors:
\[\mathbb{E}e^{uX + vY} = \mathbb{E}e^{uX} \cdot \mathbb{E}e^{vY}\]for all $u, v \in \mathbb{R}$ for which the expectations are finite.
- The joint density factors (provided it exists):
\[f_{X, Y}(x, y) = f_X(x) \cdot f_Y(y)\]for almost every $x, y \in \mathbb{R}$.

The conditions above imply but are not equivalent to:
- The expectation factors:
\[\mathbb{E}[XY] = \mathbb{E}X \cdot \mathbb{E}Y,\]provided that $\mathbb{E}|XY| < \infty$.
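A Monte Carlo sketch of my own (using two independent normal samples with arbitrary parameters) of three of these factorizations: the joint CDF, the moment-generating function, and the expectation of the product. The paired quantities agree up to simulation error.

```python
# Sketch: for independent X ~ N(0,1) and Y ~ N(1,2), check numerically that
# the joint CDF, the MGF and E[XY] all factor.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(0.0, 1.0, n)
Y = rng.normal(1.0, 2.0, n)          # independent of X by construction

# joint CDF factors at a test point (a, b)
a, b = 0.5, 1.5
print(np.mean((X <= a) & (Y <= b)), np.mean(X <= a) * np.mean(Y <= b))

# moment-generating function factors at a test point (u, v)
u, v = 0.3, 0.2
print(np.mean(np.exp(u*X + v*Y)), np.mean(np.exp(u*X)) * np.mean(np.exp(v*Y)))

# the expectation factors
print(np.mean(X * Y), np.mean(X) * np.mean(Y))
```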
Proof: (i) $\implies$ (ii) Assume that $X$ and $Y$ are independent, then
\[\begin{align*} \mu_{X, Y}(A \times B) &= \mathbb{P}\{X \in A \text{ and } Y \in B\}\\ &= \mathbb{P}(\{X \in A\} \cap \{Y \in B\})\\ &= \mathbb{P}\{X \in A\} \cdot \mathbb{P}\{Y \in B\}\\ &= \mu_X(A) \cdot \mu_Y(B) \end{align*}\](ii) $\implies$ (i) follows by reading the same chain of equalities in reverse.
(ii) $\implies$ (iii) Assume that $\mu_{X, Y}(A \times B) = \mu_X(A) \cdot \mu_Y(B)$. Then
\[\begin{align*} F_{X, Y}(a, b) &= \mu_{X, Y} ((-\infty, a] \times (-\infty, b])\\ &= \mu_X(-\infty, a] \cdot \mu_Y(-\infty, b]\\ &= F_X(a) \cdot F_Y(b) \end{align*}\](iii) $\implies$ (ii): condition (iii) is exactly (ii) for sets of the form $A = (-\infty, a]$ and $B = (-\infty, b]$, and a standard measure-theoretic argument shows that agreement on sets of this form suffices to establish (ii) for all Borel sets $A, B$.
(iii) $\implies$ (v) If there is a joint density, then (iii) implies
\[\int_{-\infty}^a \int_{-\infty}^b f_{X, Y}(x, y) dy dx = \int_{-\infty}^a f_X(x) dx \cdot \int_{-\infty}^b f_Y(y) dy,\]and differentiating both sides first with respect to $a$ and then with respect to $b$ gives $f_{X, Y}(a, b) = f_X(a) \cdot f_Y(b)$ for almost every $(a, b)$. The other implications follow similarly.
For example, (v) $\implies$ (iii): if the joint density factors, integrating both sides gives
\[F_{X, Y}(a, b) = \int_{-\infty}^a \int_{-\infty}^b f_{X, Y}(x, y) dy dx = \int_{-\infty}^a f_X(x) dx \cdot \int_{-\infty}^b f_Y(y) dy = F_X(a) \cdot F_Y(b)\](i) $\implies$ (iv) We can use the standard machine, starting with the case when $h$ is the indicator function of a Borel subset of $\mathbb{R}^2$, to show that for every real-valued Borel-measurable function $h(x, y)$ on $\mathbb{R}^2$ we have the following:
\[\mathbb{E}|h(X, Y)| = \int_{\mathbb{R}^2} |h(x, y)| d \mu_{X, Y} (x, y)\]If this quantity is finite, then we have:
\[\mathbb{E}h(X, Y) = \int_{\mathbb{R}^2} h(x, y) d\mu_{X, Y} (x, y)\]This holds for any pair of random variables $X, Y$, whether or not they are independent. If they are independent, then their joint distribution $\mu_{X, Y}$ is the product of the marginal distributions, and the integral can be written as an iterated integral:
\[\mathbb{E}h(X, Y) = \int_{-\infty}^\infty \int_{-\infty}^\infty h(x, y) d \mu_{Y}(y) d \mu_X (x)\]Taking $h(x, y) = e^{ux + vy}$, we get:
\[\begin{align*} \mathbb{E}e^{uX + vY} &= \int_{-\infty}^\infty \int_{-\infty}^\infty e^{ux + vy} d\mu_Y(y) d\mu_X(x)\\ &= \int_{-\infty}^\infty e^{ux} d \mu_X(x) \cdot \int_{-\infty}^\infty e^{vy} d\mu_Y(y)\\ &= \mathbb{E}e^{uX} \cdot \mathbb{E}e^{vY} \end{align*}\]which is (iv). Taking $h(x, y) = xy$ instead gives the factoring of the expectation:
\[\mathbb{E}[XY] = \int_{-\infty}^{\infty} x \, d\mu_X(x) \cdot \int_{-\infty}^\infty y \, d\mu_Y(y) = \mathbb{E}X \cdot \mathbb{E}Y\]e.g., Independent Normal Random Variables: $X, Y$ are independent and standard normal if they have the joint density defined as:
\[f_{X, Y} (x, y) = \frac{1}{2 \pi} e^{-\frac{1}{2}(x^2 + y^2)} \quad\text{for all }x \in \mathbb{R}, y \in \mathbb{R}\]which is the product of the marginal densities:
\[f_X(x) = \frac{1}{\sqrt{2 \pi}}e^{-\frac{1}{2}x^2} \quad\text{and}\quad f_Y(y) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\]We use the notation,
\[N(a) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^a e^{-\frac{1}{2}x^2} dx\]for the standard normal cumulative distribution function. The joint cumulative distribution function for $(X, Y)$ factors:
\[\begin{align*} F_{X, Y}(a,b) &= \int_{-\infty}^a \int_{-\infty}^b f_X(x) f_Y(y) dy dx\\ &= \int_{-\infty}^a f_X(x) dx \cdot \int_{-\infty}^b f_Y(y) dy\\ &= N(a) \cdot N(b) \end{align*}\]Thus the joint distribution $\mu_{X, Y}$ is the probability measure on $\mathbb{R}^2$ that assigns to each Borel set $C \subset \mathbb{R}^2$ a measure equal to the integral of $f_{X, Y}(x, y)$ over $C$. If $C = A \times B$, where $A \in \mathcal{B}(\mathbb{R})$ and $B \in \mathcal{B}(\mathbb{R})$, then $\mu_{X, Y}$ factors:
\[\begin{align*} \mu_{X, Y}(A \times B) &= \int_A \int_B f_X(x) f_Y(y) dy dx\\ &= \int_A f_X(x) dx \cdot \int_B f_Y(y) dy\\ &= \mu_X(A) \cdot \mu_Y (B) \end{align*}\]Definition (Review of Variance and Covariance): the variance $Var(X)$ of a random variable $X$ whose expected value is defined is given by,
\[Var(X) = \mathbb{E}[(X - \mathbb{E}X)^2]\]The quantity inside the square brackets is non-negative, so $Var(X)$ is always defined, although it might be infinite; $\sigma_X = \sqrt{Var(X)}$ is the standard deviation.
By linearity of expectation, $Var(X) = \mathbb{E}[X^2] - (\mathbb{E}X)^2$, which is trivially proven in a 2nd-year course like STAT 230. Let $Y$ be another random variable, and assume $\mathbb{E}X$, $Var(X)$, $\mathbb{E}Y$ and $Var(Y)$ are all finite; then
\[Cov(X, Y) = \mathbb{E}[(X - \mathbb{E}X)(Y - \mathbb{E}Y)]\]is the covariance of $X$ and $Y$, which can also be written as,
\[Cov(X, Y) = \mathbb{E}[XY] - \mathbb{E}X \cdot \mathbb{E}Y\]In particular, $\mathbb{E}[XY] = \mathbb{E}X \cdot \mathbb{E}Y$ if and only if $Cov(X, Y) = 0$.
The correlation coefficient of $X, Y$ can be defined as:
\[\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) \cdot Var(Y)}}\]If $\rho(X, Y) = 0$, or equivalently $Cov(X, Y) = 0$, then we say that $X, Y$ are uncorrelated.
e.g., Uncorrelated, Dependent Normal Random Variables: let $X$ be a standard normal random variable, and let $Z$ be independent of $X$ with $\mathbb{P}\{Z = 1\} = \mathbb{P}\{Z = -1\} = \frac{1}{2}$. Set $Y = ZX$. We show that $Y$ is also standard normal and that $X, Y$ are uncorrelated but not independent (in fact, the pair $(X, Y)$ does not even have a joint density). First, the CDF of $Y$:
\[\begin{align*} F_Y(b) &= \mathbb{P}\{Y \le b\}\\ &= \mathbb{P}\{Y \le b \text{ and } Z = 1\} + \mathbb{P}\{Y \le b \text{ and } Z = -1\}\\ &= \mathbb{P}\{X \le b \text{ and } Z = 1\} + \mathbb{P}\{-X \le b \text{ and } Z = -1\} \end{align*}\]Since $X, Z$ are independent, we have that
\[\begin{align*} \mathbb{P}\{X \le b \text{ and } Z = 1\} &+ \mathbb{P}\{-X \le b \text{ and } Z = -1\}\\ &= \mathbb{P}\{Z = 1\} \cdot \mathbb{P}\{X \le b\} + \mathbb{P}\{Z = -1\} \cdot \mathbb{P}\{-X \le b\}\\ &= \frac{1}{2} \cdot \mathbb{P}\{X \le b\} + \frac{1}{2} \cdot \mathbb{P}\{-X \le b\} \end{align*}\]Since $X$ is standard normal, so is $-X$; thus $\mathbb{P}\{X \le b\} = \mathbb{P}\{-X \le b\} = N(b)$, and so $F_Y(b) = N(b)$: $Y$ is a standard normal random variable. Since $\mathbb{E}X = \mathbb{E}Y = 0$, the covariance of $X$ and $Y$ is
\[Cov(X,Y) = \mathbb{E}[XY] = \mathbb{E}[ZX^2]\]Since $Z$ and $X$ are independent, so are $Z$ and $X^2$,
\[\mathbb{E}[Z X^2] = \mathbb{E}Z \cdot \mathbb{E}[X^2] = 0 \cdot 1 = 0\]Therefore, $X, Y$ are uncorrelated.
$X, Y$ cannot be independent: if they were, then $|X|$ and $|Y|$ would also be independent, and we would have $\mathbb{E}[|XY|] = \mathbb{E}|X| \cdot \mathbb{E}|Y|$. But $|Y| = |X|$, so $\mathbb{E}[|XY|] = \mathbb{E}[X^2] = 1$ while $\mathbb{E}|X| \cdot \mathbb{E}|Y| = (\mathbb{E}|X|)^2 = \frac{2}{\pi}$, which are not equal, as they would be for independent random variables.
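A simulation sketch of this example (my own): the sample correlation of $X$ and $Y = ZX$ is near zero, while $\mathbb{E}|XY|$ and $\mathbb{E}|X|\,\mathbb{E}|Y|$ clearly disagree, confirming dependence.

```python
# Sketch: Y = Z*X is standard normal and uncorrelated with X, yet dependent on X.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.normal(size=n)
Z = rng.choice([-1.0, 1.0], size=n)   # independent sign, P(Z=1) = P(Z=-1) = 1/2
Y = Z * X

print(np.mean(Y), np.var(Y))                       # ~0 and ~1: Y is standard normal
print(np.corrcoef(X, Y)[0, 1])                     # ~0: uncorrelated
print(np.mean(np.abs(X * Y)),                      # ~1 (= E[X^2])
      np.mean(np.abs(X)) * np.mean(np.abs(Y)))     # ~2/pi ~ 0.64: not independent
```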
Definition: two random variables $X$ and $Y$ are said to be jointly normal if they have the joint density as defined by,
\[f_{X, Y}(x, y) = \frac{1}{2 \pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp \left\{- \frac{1}{2 (1 - \rho^2)} \left[\frac{(x - \mu_1)^2}{\sigma_1^2} - \frac{2\rho (x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} + \frac{(y - \mu_2)^2}{\sigma_2^2} \right] \right\}\]where $\sigma_1 > 0$, $\sigma_2 > 0$, $|\rho| < 1$, and $\mu_1, \mu_2$ are real numbers. More generally, a random column vector $\mathcal{X} = (X_1, \dots, X_n)^T$ is jointly normal if it has joint density
\[f_{\mathcal{X}}(x) = \frac{1}{(2\pi)^{n/2} (\det C)^{1/2}} \exp \left\{-\frac{1}{2} (x - \mu) C^{-1} (x - \mu)^T \right\},\]where $x = (x_1, \dots, x_n)$ is a row vector of dummy variables, $\mu = (\mu_1, \dots, \mu_n)$ is the row vector of expectations, and $C$ is the positive definite matrix of covariances.
- Linear combinations of jointly normal random variables are jointly normal
- Independent normal random variables are jointly normal, thus, a general method for creating jointly normal random variables is to begin with a set of independent normal random variables and take linear combinations.
- Any set of jointly normal random variables can be reduced to linear combinations of independent normal random variables
The example below carries out this reduction for a pair of correlated normal random variables.
e.g., let $(X, Y)$ be a jointly normal pair of random variables with the density defined above. Define $W = Y - \frac{\rho \sigma_2}{\sigma_1} X$; then $X$ and $W$ are independent. We verify this by showing that $X$ and $W$ have covariance zero: since they are jointly normal (each is a linear combination of the jointly normal pair $(X, Y)$), zero covariance implies independence.
\[\begin{align*} Cov(X, W) &= \mathbb{E}[(X - \mathbb{E}X)(W - \mathbb{E}W)] \\ &= \mathbb{E}[(X - \mathbb{E}X)(Y - \mathbb{E}Y)] - \mathbb{E} \left[\frac{\rho \sigma_2}{\sigma_1} (X - \mathbb{E}X)^2 \right]\\ &= Cov(X, Y) - \frac{\rho \sigma_2}{\sigma_1} \sigma_1^2\\ &= \rho \sigma_1 \sigma_2 - \rho \sigma_1 \sigma_2 = 0 \end{align*}\]Thus, the expectation of $W$ is $\mu_3 = \mathbb{E}W = \mu_2 - \frac{\rho \sigma_2}{\sigma_1} \mu_1$ and its variance is,
\[\begin{align*} \sigma_3^2 &= \mathbb{E}[(W - \mathbb{E}W)^2]\\ &= \mathbb{E}[(Y - \mathbb{E}Y)^2] - \frac{2 \rho \sigma_2}{\sigma_1} \mathbb{E}[(X - \mathbb{E}X)(Y - \mathbb{E}Y)] + \frac{\rho^2 \sigma_2^2}{\sigma_1^2} \mathbb{E}[(X - \mathbb{E}X)^2]\\ &= (1 - \rho^2) \sigma_2^2 \end{align*}\]Thus, the joint density of $X$ and $W$ can be written as,
\[f_{X, W}(x, w) = \frac{1}{2 \pi \sigma_1 \sigma_3} \exp \left\{-\frac{(x - \mu_1)^2}{2 \sigma_1^2} - \frac{(w - \mu_3)^2}{2\sigma_3^2} \right\}\]Since this joint density factors into a function of $x$ times a function of $w$, the random variables $X$ and $W$ are independent, and writing $Y = \frac{\rho \sigma_2}{\sigma_1} X + W$ decomposes $Y$ into a linear combination of the pair of independent normal random variables $X$ and $W$.
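A simulation sketch (my own, with arbitrary parameters $\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$): generate a jointly normal pair $(X, Y)$, form $W = Y - \frac{\rho\sigma_2}{\sigma_1}X$, and check that $Cov(X, W) \approx 0$ and $Var(W) \approx (1-\rho^2)\sigma_2^2$.

```python
# Sketch: W = Y - (rho*sigma2/sigma1) * X is (nearly) uncorrelated with X in the
# sample, with variance close to (1 - rho^2) * sigma2^2.
import numpy as np

rng = np.random.default_rng(2)
mu1, mu2, s1, s2, rho = 1.0, -0.5, 2.0, 3.0, 0.7   # arbitrary illustrative parameters

cov = [[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]]
X, Y = rng.multivariate_normal([mu1, mu2], cov, size=1_000_000).T

W = Y - (rho * s2 / s1) * X
print(np.cov(X, W)[0, 1])               # ~ 0
print(np.var(W), (1 - rho**2) * s2**2)  # both ~ 4.59
```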
General Conditional Expectations
If $X$ is $\mathcal{G}$-measurable, then the information in $\mathcal{G}$ is sufficient to determine the value of $X$; if $X$ is independent of $\mathcal{G}$, then the information in $\mathcal{G}$ provides no help in determining the value of $X$. In the intermediate case, we can use the information in $\mathcal{G}$ to estimate but not evaluate $X$, and the conditional expectation of $X$ given $\mathcal{G}$ is such an estimate.
Definition: Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$ and let $X$ be a random variable that is either non-negative or integrable. The conditional expectation of $X$ given $\mathcal{G}$, denoted $\mathbb{E}[X | \mathcal{G}]$, is any random variable that satisfies:
- Measurability: $\mathbb{E}[X | \mathcal{G}]$ is $\mathcal{G}$-measurable, and
- Partial Averaging:
\[\int_A \mathbb{E}[X | \mathcal{G}](\omega) d\mathbb{P}(\omega) = \int_A X(\omega) d\mathbb{P}(\omega) \quad \text{for all } A \in \mathcal{G}\]
If $\mathcal{G}$ is the $\sigma$-algebra generated by some other random variable $W$, then we write $\mathbb{E}[X | W]$ rather than $\mathbb{E}[X | \sigma(W)]$. The estimate of $X$ based on the information in $\mathcal{G}$ is itself a random variable: the value of the estimate $\mathbb{E}[X | \mathcal{G}]$ can be determined from the information in $\mathcal{G}$.
Property (i) guarantees that the estimate $\mathbb{E}[X | \mathcal{G}]$ of $X$ is based on the information in $\mathcal{G}$. Property (ii) ensures that $\mathbb{E}[X | \mathcal{G}]$ is indeed an estimate of $X$: it gives the same averages as $X$ over all the sets in $\mathcal{G}$. If $\mathcal{G}$ has many sets, the partial-averaging property over the small sets in $\mathcal{G}$ forces $\mathbb{E}[X | \mathcal{G}]$ to be a good estimator of $X$.
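A finite-space sketch of my own (assuming the three-toss space with $p = \tfrac{2}{3}$ and the binomial stock price $S_3$): when $\mathcal{G} = \mathcal{F}(1)$ is generated by finitely many atoms, $\mathbb{E}[X | \mathcal{G}]$ is simply the probability-weighted average of $X$ over each atom, and the partial-averaging property can be checked directly.

```python
# Sketch: compute E[S_3 | F(1)] on the 3-toss space as atom-wise averages and
# verify the partial-averaging property on a set A in F(1).
from itertools import product

p, q = 2/3, 1/3
S0, u, d = 4.0, 2.0, 0.5
outcomes = [''.join(t) for t in product('HT', repeat=3)]
prob = {w: p**w.count('H') * q**w.count('T') for w in outcomes}
S3 = {w: S0 * u**w.count('H') * d**w.count('T') for w in outcomes}

def cond_exp_given_first_toss(w):
    """E[S_3 | F(1)](w): average of S_3 over the atom containing w."""
    atom = [v for v in outcomes if v[0] == w[0]]
    mass = sum(prob[v] for v in atom)
    return sum(S3[v] * prob[v] for v in atom) / mass

# measurability: the estimate depends only on the first toss
print({w[0]: cond_exp_given_first_toss(w) for w in outcomes})

# partial averaging over A = {first toss is H}, a set in F(1)
A = [w for w in outcomes if w[0] == 'H']
print(sum(cond_exp_given_first_toss(w) * prob[w] for w in A),
      sum(S3[w] * prob[w] for w in A))      # equal, up to floating-point rounding
```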
The conditional expectation is unique up to sets of probability zero: suppose $Y$ and $Z$ both satisfy (i) and (ii). Because both $Y$ and $Z$ are $\mathcal{G}$-measurable, their difference $Y - Z$ is as well, so the set $A = \{Y - Z > 0\}$ is in $\mathcal{G}$, and by partial averaging:
\[\int_A Y(\omega) d\mathbb{P}(\omega) = \int_A X(\omega) d\mathbb{P}(\omega) = \int_A Z(\omega) d\mathbb{P}(\omega) \implies \int_A (Y(\omega) - Z(\omega)) d\mathbb{P}(\omega) = 0\]Since $Y - Z > 0$ on $A$, this forces $\mathbb{P}(A) = 0$; by the same argument $\mathbb{P}\{Z - Y > 0\} = 0$, so $Y = Z$ almost surely.

Theorem:
- Linearity of Conditional Expectations: if $X, Y$ are integrable random variables and $c_1, c_2$ are constants, then:
\[\mathbb{E}[c_1 X + c_2 Y | \mathcal{G}] = c_1 \mathbb{E}[X | \mathcal{G}] + c_2 \mathbb{E}[Y | \mathcal{G}]\]
- Taking Out What Is Known: if $X$ is $\mathcal{G}$-measurable and $Y$ and $XY$ are integrable, then:
\[\mathbb{E}[XY | \mathcal{G}] = X \cdot \mathbb{E}[Y | \mathcal{G}]\]
- Iterated Conditioning: if $\mathcal{H}$ is a sub-$\sigma$-algebra of $\mathcal{G}$ and $X$ is integrable, then:
\[\mathbb{E}\big[\mathbb{E}[X | \mathcal{G}] \,\big|\, \mathcal{H}\big] = \mathbb{E}[X | \mathcal{H}]\]
- Independence: if $X$ is integrable and independent of $\mathcal{G}$, then:
\[\mathbb{E}[X | \mathcal{G}] = \mathbb{E}X\]
- Conditional Jensen’s Inequality: if $\phi(x)$ is a convex function of a dummy variable $x$ and $X$ is integrable, then:
\[\mathbb{E}[\phi(X) | \mathcal{G}] \ge \phi\big(\mathbb{E}[X | \mathcal{G}]\big)\]
The proof of this theorem is in Stochastic Calculus for Finance II by Shreve; go read it if you have time to spare. The discussion and proofs follow quite similarly from the discrete case.
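A small sketch of my own that checks the iterated-conditioning property exactly on the three-toss space (same assumed binomial parameters as before, with $p = \tfrac{2}{3}$): $\mathbb{E}\big[\mathbb{E}[S_3|\mathcal{F}(2)]\,\big|\,\mathcal{F}(1)\big] = \mathbb{E}[S_3|\mathcal{F}(1)]$.

```python
# Sketch: verify iterated conditioning E[ E[S_3 | F(2)] | F(1) ] = E[S_3 | F(1)]
# exactly on the three-toss space.
from itertools import product

p, q = 2/3, 1/3
S0, u, d = 4.0, 2.0, 0.5
outcomes = [''.join(t) for t in product('HT', repeat=3)]
prob = {w: p**w.count('H') * q**w.count('T') for w in outcomes}
S3 = {w: S0 * u**w.count('H') * d**w.count('T') for w in outcomes}

def cond_exp(X, n):
    """E[X | F(n)] as a dict over outcomes: average X over each atom of F(n)."""
    out = {}
    for w in outcomes:
        atom = [v for v in outcomes if v[:n] == w[:n]]
        mass = sum(prob[v] for v in atom)
        out[w] = sum(X[v] * prob[v] for v in atom) / mass
    return out

lhs = cond_exp(cond_exp(S3, 2), 1)   # E[ E[S_3 | F(2)] | F(1) ]
rhs = cond_exp(S3, 1)                # E[ S_3 | F(1) ]
print(all(abs(lhs[w] - rhs[w]) < 1e-12 for w in outcomes))  # True
```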
e.g., let $X, Y$ be a pair of jointly normal random variables with the density given above. Define
\[W = Y - \frac{\rho \sigma_2}{\sigma_1} X\]Then $X, W$ are independent, and we can write
\[Y = \frac{\rho \sigma_2}{\sigma_1} X + W\]Let us take the conditioning $\sigma$-algebra to be $\mathcal{G} = \sigma(W)$. We estimate $Y$ based on $X$ using the above, and properties, and get the linear regression equation,
\[\mathbb{E}[Y | X] = \frac{\rho \sigma_2}{\sigma_1} X + \mathbb{E}W = \frac{\rho \sigma_2}{\sigma_1} (X - \mu_1) + \mu_2\]The right-hand side is random but $\sigma(X)$-measurable: once we know the value of $X$, we can evaluate $\mathbb{E}[Y | X]$. The error made by the estimator is
\[Y - \mathbb{E}[Y | X] = W - \mathbb{E}W\]
The error is random with expected value zero and is independent of the estimate $\mathbb{E}[Y | X]$, because $\mathbb{E}[Y | X]$ is $\sigma(X)$-measurable and $W$ is independent of $\sigma(X)$. The independence between the error and the conditioning random variable $X$ is a consequence of the joint normality in this example.
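A simulation sketch (my own, reusing the arbitrary parameters from the earlier decomposition): an ordinary least-squares fit of $Y$ on $X$ recovers a slope close to $\frac{\rho\sigma_2}{\sigma_1}$, and the residual $Y - \mathbb{E}[Y|X]$ has mean near zero and is (sample-)uncorrelated with $X$.

```python
# Sketch: the conditional expectation E[Y|X] for jointly normal (X, Y) is the
# linear regression line with slope rho*sigma2/sigma1; check by least squares.
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2, s1, s2, rho = 1.0, -0.5, 2.0, 3.0, 0.7    # arbitrary illustrative parameters
cov = [[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]]
X, Y = rng.multivariate_normal([mu1, mu2], cov, size=1_000_000).T

slope, intercept = np.polyfit(X, Y, 1)               # least-squares fit Y ~ slope*X + intercept
print(slope, rho * s2 / s1)                           # both ~ 1.05

residual = Y - (rho * s2 / s1) * (X - mu1) - mu2      # Y - E[Y|X] = W - E[W]
print(np.mean(residual), np.cov(X, residual)[0, 1])   # both ~ 0
```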
Lemma (Independence): Suppose that $\mathcal{G}$ is a sub-$\sigma$-algebra of $\mathcal{F}$, the random variables $X_1, \dots, X_K$ are $\mathcal{G}$-measurable, and the random variables $Y_1, \dots, Y_L$ are independent of $\mathcal{G}$. Let $f(x_1, \dots, x_K, y_1, \dots, y_L)$ be a function of the dummy variables $x_1, \dots, x_K$ and $y_1, \dots, y_L$, and define the following:
\[g(x_1, \dots, x_K) = \mathbb{E}f(x_1, \dots, x_K, Y_1, \dots, Y_L)\]Then we can write,
\[\mathbb{E}[f(X_1, \dots, X_K, Y_1, \dots, Y_L) | \mathcal{G}] = g(X_1, \dots, X_K)\]Since the information in $\mathcal{G}$ is sufficient to determine the values of $X_1, \dots, X_K$, we should hold these random variables constant when estimating $f(X_1, \dots, X_K, Y_1, \dots, Y_L)$.
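A Monte Carlo sketch of the lemma (my own, with $K = L = 1$, an arbitrary choice $f(x, y) = \sin(x + y)$, and independent standard normal $X$ and $Y$): the conditional average of $f(X, Y)$ over samples with $X \approx x_0$ agrees with $g(x_0) = \mathbb{E} f(x_0, Y)$, which for this $f$ has the closed form $\sin(x_0) e^{-1/2}$.

```python
# Sketch: for X independent of Y, E[f(X, Y) | X] = g(X) where g(x) = E[f(x, Y)].
import numpy as np

rng = np.random.default_rng(4)
n = 2_000_000
X = rng.normal(size=n)
Y = rng.normal(size=n)                      # independent of X
f = lambda x, y: np.sin(x + y)

FXY = f(X, Y)
g = lambda x: np.sin(x) * np.exp(-0.5)      # closed form of E[f(x, Y)] for Y ~ N(0,1)

for x0 in (-1.0, 0.0, 1.5):
    near = np.abs(X - x0) < 0.01            # crude conditioning on {X ~ x0}
    print(x0, FXY[near].mean(), g(x0))      # conditional average approximates g(x0)
```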
Definition: Let $T$ be a fixed positive number, and let $\mathcal{F}(t)$, $0 \le t \le T$, be a filtration of sub-$\sigma$-algebras of $\mathcal{F}$. Consider an adapted stochastic process $M(t)$, $0 \le t \le T$.
- The process is a martingale (no tendency to rise or fall) if:
\[\mathbb{E}[M(t) | \mathcal{F}(s)] = M(s) \quad \text{for all } 0 \le s \le t \le T\]
- The process is a submartingale (no tendency to fall, but possibly a tendency to rise) if:
\[\mathbb{E}[M(t) | \mathcal{F}(s)] \ge M(s) \quad \text{for all } 0 \le s \le t \le T\]
- The process is a supermartingale (no tendency to rise, but possibly a tendency to fall) if:
\[\mathbb{E}[M(t) | \mathcal{F}(s)] \le M(s) \quad \text{for all } 0 \le s \le t \le T\]
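A sketch of my own on the discrete-time analogue of these definitions, in the binomial model (assuming $u = 2$, $d = \tfrac{1}{2}$): one step of the stock price satisfies $\mathbb{E}[S_{n+1} | \mathcal{F}(n)] = (pu + qd)\,S_n$, so the price is a martingale exactly when $pu + qd = 1$ (here $p = \tfrac{1}{3}$) and a submartingale when $pu + qd > 1$ (e.g. $p = \tfrac{1}{2}$).

```python
# Sketch: E[S_{n+1} | F(n)] = (p*u + q*d) * S_n in the binomial model, so the
# stock price is a martingale iff p*u + q*d = 1, and a submartingale if > 1.
u, d, S_n = 2.0, 0.5, 16.0           # assumed up/down factors; S_n is any F(n)-measurable value

def one_step_conditional_expectation(p, s):
    q = 1 - p
    return p * u * s + q * d * s     # average of the two possible next values

for p in (1/3, 1/2):
    print(p, one_step_conditional_expectation(p, S_n), S_n)
# p = 1/3: conditional expectation 16.0 = S_n  -> martingale
# p = 1/2: conditional expectation 20.0 > S_n  -> submartingale
```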
Definition: Let $T$ be a fixed positive number, and let $\mathcal{F}(t)$, $0 \le t \le T$, be a filtration of sub-$\sigma$-algebras of $\mathcal{F}$. Consider an adapted stochastic process $X(t)$, $0 \le t \le T$. Assume that for all $0 \le s \le t \le T$ and for every non-negative, Borel-measurable function $f$, there is another Borel-measurable function $g$ such that,
\[\mathbb{E}[f(X(t)) | \mathcal{F}(s)] = g(X(s))\]Then we say that $X$ is a Markov process.
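To close, a finite-space sketch of my own (same assumed binomial parameters, with $p = \tfrac{1}{2}$ and $f(s) = s^2$): $\mathbb{E}[f(S_3) | \mathcal{F}(2)]$ takes the same value on the two distinct atoms of $\mathcal{F}(2)$ beginning $HT$ and $TH$, because both have $S_2 = 4$; the conditional expectation depends on $\omega$ only through $S_2(\omega)$, which is the (discrete-time) Markov property.

```python
# Sketch: E[f(S_3) | F(2)] depends on omega only through S_2 -- the discrete-time
# Markov property of the binomial stock price.
from itertools import product

p, q = 0.5, 0.5
S0, u, d = 4.0, 2.0, 0.5
outcomes = [''.join(t) for t in product('HT', repeat=3)]
prob = {w: p**w.count('H') * q**w.count('T') for w in outcomes}

def S(n, w):
    heads = w[:n].count('H')
    return S0 * u**heads * d**(n - heads)

f = lambda s: s**2

def cond_exp_f_S3_given_F2(w):
    atom = [v for v in outcomes if v[:2] == w[:2]]        # atom of F(2) containing w
    mass = sum(prob[v] for v in atom)
    return sum(f(S(3, v)) * prob[v] for v in atom) / mass

# HT.. and TH.. are different atoms of F(2) but share S_2 = 4:
print(S(2, 'HTH'), S(2, 'THH'))                                      # 4.0 4.0
print(cond_exp_f_S3_given_F2('HTH'), cond_exp_f_S3_given_F2('THH'))  # equal: g(S_2) = g(4)
```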