Probability For Machine Learning (Part 1)

Random variables, distributions, and moments

Random variables

  • A random variable is a function which assigns a number to events in the sample space. (A better name might be “random-valued function on the sample space.”)
\[X = \{1:heads,0:tail\}\]
  • We describe the probability of an outcome in terms of the probability of a random variable taking a given value:
\[P(X = 1) = 1/2, P(X^2 = 1) = 1\]

Continuous random variables

  • Consider choosing a random number between 0 and 1, where all values are equally likely.
  • Since there are an (uncountably) infinite number of values, the probability of any given value is zero:\(Prob(X = x_0) = 0\)
  • Does not mean event is impossible. Points on a line have “measure zero
  • Instead, ask for probability to lie within a given range, e.g.,
\[Prob(a < X<b) = \int_a^b p(x)dx\]

Probability distributions

  • More generally, a probability distribution satisfies:
    • $P(x) \geq 0$
    • $\sum_k p(x_k) = 1, x_k$ discrete
    • $\int_{-\infty}^{\infty}p(x)dx=1, x$ continuous

Cumulative distribution function

  • The cumulative distribution function, or CDF, gives the probability that X is less than or equal to a given value.
\[F(x) = Prob(X<x) = \int_{-\infty}^x p(x^{'})dx^{'}\]
  • It contains (nearly) the same information as the probability density, since
\[p(x) = \frac{dF(x)}{dx}\]
  • The CDF is often easier to approximate from empirical data and it is useful since
\[Prob(a < X < b) = F(b) - F(a)\]

Change of variable

  • Suppose we wish to change variables, shift the distribution, or consider functions of the random variable. If $x \rightarrow y = y(x)$ then the density in terms of the new variable is given by
\[p(x)dx = g(y)dy\]

which preserves the normalization condition. (This assumes y is an increasing function of x. If not, an absolute value is needed for positivity; and further care is needed if y has critical points.)

\(p(x)dx = g(y)dy\) \(g(y) = \frac{p(x)}{|dy / dx|}\)

Expectations and moments

  • The probability distribution defines weighted averages over the sample space, where the weight of each event is equal to its probability. These are called expected values.

  • For the discrete case,

\[E[f(X)] = \sum_{k=1}^n f(x_k)p(x_k)\]
  • while for the continuous case,
\[E[f] = \int_{-\infty}^{\infty}f(x)p(x)dx\]

Mean of a distribution

  • The mean of the distribution is simply the expectation of the random variable itself:
\[\mu = E[X] = \overline{X} = \langle X \rangle = \left\{ \begin{array}{rcl} \sum_kx_kp(x_k) \\ \int xp(x)dx \end{array}\right.\]
  • In the case of an infinite sample space, whether continuous or discrete, the mean is not guaranteed to exist since the integral or the sum might not converge.

Moments of a distribution

  • The moments of a distribution are the expectation of powers of the random variable itself.
\[\mu_l \equiv E[X^l] \equiv \langle X^l \rangle = \left\{ \begin{array}{rcl} \sum_kx_k^lp(x_k) \\ \int x^lp(x)dx \end{array}\right.\]
  • If all the moments are known – and if they exist – they can be used to get the expectation of other functions using the linearity of the expectation operator
\[E[cf(X)] = cE[f(x)]\] \[E[f(x) + g(X)] = E[f(x)] + E[g(x)]\]

Variance and standard deviation

  • Of particular interest is the second moment, in combination with the mean, defining the variance:
\[\sigma^2 = Var(X) = E[(X-\mu)^2] = E[X^2] - E[X]^2\]
  • The standard deviation, which is the square root of the variance, has the same units as the random variable (e.g., rate of return, dollars, etc.)

Higher moments characterize properties of a distribution

  • Variance – dispersion measure based on second moment
\[\sigma^2 \equiv E[(X-\mu)^2] = \int (x-\mu)^2p(x)dx\]
  • Skewness – asymmetry parameter based on 3rd moments; dimensionless – normalized cumulant
\[s \equiv \frac{E[(X-\mu)^3]}{\sigma^3} = E[(\frac{X-\mu}{\sigma})^3]\]
  • Kurtosis – measure of tail “weights” in terms of 4th moments; zero for Gaussian, bounded below by -1.
\[\kappa \equiv \frac{E[(X-\mu)^4]}{\sigma^4} - 3\]

Covariance and correlation

  • For any two random variables, not necessarily independent or identically distributed, their covariance is defined as
\[Cov(X,Y) \equiv E[(X - \mu_x)(Y - \mu_y)] = E[XY] - \mu_x \mu_y\]
  • The correlation is proportional to the covariance,
\[\rho(X,Y) = Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} = E[(\frac{X-\mu_x}{\sigma_x})(\frac{Y - \mu_y}{\sigma_y})]\]
  • Dividing the covariance by the standard deviations makes the correlation a pure number, and
\[-1 \leq \rho(X,Y) \leq +1\]


  • Random variables” are functions that assign a number to events in the sample space. They can be discrete, continuous, or a mix of both.
  • The probability distribution is positive and sums to one.
  • Expected values, or expectations, are weighted averages that use the probabilities as the weights.
  • The moments of a distribution, such as the mean, are expectations of various powers of a random variable. They are numbers, not functions.
  • Variance, skewness, and kurtosis are simple functions of the moments that characterize the shape of the probability distribution.
  • When there are multiple random variables, their covariance and correlation are also computed as expectations.

(To be Continued!)

This post is licensed under CC BY 4.0 by the author.

Trending Tags