Introduction to Complex Systems


Random Variables & Probability

Motivation

This script covers some basics about probability, variability, fluctuations, random walks, random forces, noise, and stochastic processes. Some of this is required to understand topics that we will cover in class, but don’t expect a deep dive into the subject. Many important aspects will be omitted, but a few references are given below.

Variation and fluctuations: Often things vary, like the circles here in size.

Why is understanding randomness, random processes, or probabilistic concepts important in complex systems research? There are a few reasons. First, natural systems are often subject to variation that cannot be explained, or doesn’t need to be explained because it isn’t in the focus of the question at hand. For instance, if you want to study a population of birds, you’ll find that individuals vary, in size or other traits. So sometimes we need to model or analyze systems that exhibit variation, and we need to do basic statistics, like computing mean values, standard deviations, and so on.

And then there’s random motion, which is key to understanding many complex dynamical processes in which, e.g., interacting agents move randomly, or which can be captured by a model that employs some sort of random motion component. This is particularly true for biology: sometimes, if we want to understand intra-cellular processes, we may need to model molecules that move around randomly inside the cell, or we wish to model interactions of individuals in a spatially extended predator-prey system and want to capture that individuals move around their habitat in a seemingly random way.

Random motion: Understanding random processes is important for understanding random motion like diffusion and related processes. Vary the speed or the noise to see how that impacts the geometry of the paths.

Third, we often have situations in which we can capture the dynamics by a deterministic model like $$ \dot{x}=f(x) $$ but need to account for the fact that, in addition to the deterministic part of the dynamics, there is also a random force that, as time progresses, adds little random changes to the state. This implies that even for the same initial condition there’s an entire family of trajectories generated by the model, all of which are possible observations. So we would like to understand models of the form $$ \dot{x}=f(x)+\text{“Noise”} $$

The influence of noise: Particles are subject to a force that attracts them to the center at long range and repels them at short range. A noisy force makes them wiggle. Eventually the crowd will be distributed along a circle.

and we need to figure out how to handle the noisy force in a systematic way. This naturally leads to stochastic differential equations, or Langevin equations.
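To make this concrete, here is a minimal numerical sketch of such a noisy dynamical system, integrated with the simplest scheme (Euler-Maruyama). Python with NumPy, the toy force $f(x)=-x$, and all parameter values are assumptions for illustration, not something prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy example (assumed, not from the text): dx = f(x) dt + s dW
# with a linear restoring force f(x) = -x, Euler-Maruyama scheme.
f = lambda x: -x
s = 0.5                     # noise strength (assumed)
dt = 1e-3                   # time step
steps = 10_000

x = np.empty(steps)
x[0] = 1.0
for n in range(steps - 1):
    kick = s * np.sqrt(dt) * rng.normal()   # random force accumulated over one step
    x[n + 1] = x[n] + f(x[n]) * dt + kick
```

Running this repeatedly from the same initial condition $x_0=1$ produces a different trajectory each time, which is exactly the “family of trajectories” mentioned above.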

Finally, there are situations in which events occur at random times, for instance the firing of action potentials by a neuron, the collision of particles in a container, or the encounter of two animals in a habitat. In many situations like this we need a way to model time sequences $t_{n}<t_{n+1}$ that are ordered but random. A prime example of such a random process is the Poisson process, which we will discuss in detail below.

Random event sequences: In each small time interval $\Delta t$, a spike is generated with a small probability $\alpha\Delta t$, where $\alpha$ is the probability rate.
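This rule, that in each small interval $\Delta t$ an event occurs with probability $\alpha\Delta t$, can be simulated directly. Here is a minimal sketch (Python with NumPy assumed purely for illustration; the parameter values are made up). It also checks a known property of the Poisson process: the inter-event intervals are exponentially distributed with mean $1/\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 5.0      # probability rate (assumed value, events per unit time)
dt = 1e-3        # small time interval
T = 100.0        # total observation time

# in each small interval dt an event ("spike") occurs with probability alpha * dt
n_steps = int(T / dt)
spike = rng.random(n_steps) < alpha * dt
event_times = np.nonzero(spike)[0] * dt

# the inter-event intervals should be exponentially distributed with mean 1/alpha
intervals = np.diff(event_times)
print(f"mean interval: {intervals.mean():.3f}, theory: {1 / alpha:.3f}")
```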

Measurements, Random Numbers, Probabilities

Notation

Let’s set all of what follows on an empirical footing. Let’s say we’ve got a quantity $X$ that can have values denoted by $x$. What’s up with using two letters here for the same thing? It turns out it’s a good habit, in the beginning, to distinguish between the quantity ($X$, for instance the location) and the values it can have, which we denote by $x$. So in a way we can think of $X$ as a measurement, like an instrument that spits out numbers, and we denote the numbers by $x$. This distinction is important because we are dealing with randomness, and a quantity can have different values each time we measure it.

Expectation values

So let’s say we have $n=1,\dots,N$ such measurements and a sequence of values $x_{n}$; we usually first compute the mean $$ \left\langle X\right\rangle = \frac{1}{N}\sum_{n = 1}^{N}x_{n} $$ In most cases, if we are lucky, the number we get when we compute this approaches a limiting value as $N$ gets large; we then call it the expectation value. Why? Because it’s what we expect from the measurement. The panel below shows the behavior of $\left\langle X\right\rangle $ as a function of sample size $N$ for three different sequences of measurements $x_{n}$, each drawn from a different random variable $X$:

  1. exponentially distributed,
  2. uniformly distributed, and
  3. algebraically distributed.

We can see that in the first two examples the mean converges to a fixed value, but in the third example it doesn’t. We will discuss this phenomenon in more detail below.

The law of large numbers: The curve that is generated is the mean $\left\langle X\right\rangle = \frac{1}{N}\sum_{n = 1}^{N}x_{n}$ as a function of sample size $N$. You can select different distributions from which the individual random values are drawn. The dashed line represents the theoretical expectation value, if it exists.
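A minimal sketch of this numerical experiment (Python with NumPy assumed; the three distributions and their parameters are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

samples = {
    "exponential": rng.exponential(scale=1.0, size=N),   # <X> = 1
    "uniform": rng.uniform(0.0, 1.0, size=N),            # <X> = 1/2
    "algebraic": rng.pareto(a=0.5, size=N),               # heavy tail, <X> diverges
}

for name, x in samples.items():
    # the running mean <X> = (1/N) sum x_n as a function of sample size N
    running_mean = np.cumsum(x) / np.arange(1, N + 1)
    print(name, running_mean[[99, 9_999, N - 1]])
```

For the first two samples the running mean settles down as $N$ grows; for the heavy-tailed “algebraic” sample it keeps jumping around no matter how large $N$ gets.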

The expectation value $\left\langle X\right\rangle $ tells me something about the typical value of the variable $X$. We can also compute the expected difference between $X$ and its typical value. We square the difference to get a non-negative number, and get the variance $$ \left\langle \left(X-\mu\right)^{2}\right\rangle =\frac{1}{N}\sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2} $$ where we used the symbol $\mu$ for the expectation value, so $\mu=\left\langle X\right\rangle $. The variance, for which we use the symbol $\sigma^{2}$, is never negative:

$$ \sigma^{2}=\left\langle \left(X-\mu\right)^{2}\right\rangle \geq0 $$ It’s called variance because it measures how far away from the mean a measurement typically is (squared). The square-root of it, the quantity

$$ \sigma=\sqrt{\left\langle \left(X-\mu\right)^{2}\right\rangle } $$ is called the standard deviation. Now when we look at this more closely we find that

$$ 0\leq\left\langle \left(X-\mu\right)^{2}\right\rangle =\left\langle X^{2}-2\mu X+\mu^{2}\right\rangle $$ $$ =\left\langle X^{2}\right\rangle -\left\langle 2\mu X\right\rangle +\left\langle \mu^{2}\right\rangle $$ $$ =\left\langle X ^{2}\right\rangle -2\mu ^{2}+\mu ^{2} $$ $$ =\left\langle X^{2}\right\rangle -\left\langle X\right\rangle ^{2} $$ which implies that $$ \left\langle X^{2}\right\rangle \geq\left\langle X\right\rangle ^{2} $$ which is always true and a special case of Jensen’s inequality.

We can also compute the expectation value of any function of $X$, say of $f(X)$: $$ \left\langle f(X)\right\rangle =\frac{1}{N}\sum_{n = 1} ^{N}f\left(x_{n}\right) $$ all we do is take the values $x_{n}$, plug them into the function $f$, add the results, and divide by $N$. The variance is an example for which $f(x)=(x-\mu)^{2}.$
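In code this recipe is a one-liner: apply $f$ to the sample, then average. A small sketch (Python with NumPy assumed), using the variance as the example $f$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100_000)   # the measured values x_n

mu = x.mean()                        # <X>
f = lambda v: (v - mu) ** 2          # f(x) = (x - mu)^2
print(f(x).mean())                   # <f(X)>: plug in, add up, divide by N
print(x.var())                       # agrees with the built-in variance
```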

Probability and Distribution Functions

The next thing we need to understand is the concept of probability. Given a set of measured values $x_{n}$, $n=1,\dots,N$, and assuming that $N$ is sufficiently large, we can estimate the probability that $X$ is smaller than some chosen reference value $x$ by the fraction of values $x_{n}$ that are smaller than $x$: $$ P(X < x)=\frac{M_{x}}{N} $$ where we denote by $M_{x}$ the number of values $x_{n}$ that are smaller than $x$. We can also express this as $$ P(X < x)=\frac{1}{N}\sum_{n=1}^{N}\theta(x-x_{n}) $$ where the function $\theta(x)$ is the Heaviside step function, which equals $1$ if its argument is positive and $0$ otherwise.

The Heaviside step function $\theta(x)$

So the expression inside the sum counts all the terms for which $x > x_{n}$. The distribution function $P(X< x)$ is a function of $x$, so we often write $$ P(x)=P(X < x). $$ Say the domain of $x$ is the entire real axis; then we will have $$ P(-\infty)=0 $$ and $$ P(\infty)=1 $$ For finite $N$ the function is a sequence of steps, but if we let $N\rightarrow\infty$ we can expect it to become a smooth function. $P(x)$ is a non-decreasing, typically sigmoid function. We can also express the function $P(x)$ as an expectation value $$ P(x)=\left\langle \theta(x-X)\right\rangle $$

In some situations it’s more interesting to look at the probability that a random variable is larger than a value $x$, which is simply $$ Q(x)=P(X>x)=1-P(x). $$

Finally, the most abundantly used quantity, because it is the most intuitive, is the probability density function $p(x)$, which is the “concentration” of probability at point $x$. It’s related to the fraction of measurements $x_{n}$ that fall into the interval $[ x,x+\Delta x ] $, which is the fraction of measurements that are less than $x+\Delta x$ minus the fraction that are smaller than $x$, so $$ \Delta x\,p(x)=P(X < x+\Delta x)-P(X < x) $$ The factor $\Delta x$ implies that the function $p(x)$ is a density. If we divide by $\Delta x$ we get, in the limit $\Delta x\rightarrow0$, $$ p(x)=P^{\prime}(x) $$ However, the limit $\Delta x\rightarrow0$ implies that we have an infinite number $N$ of measurements. In reality we have to deal with a “binsize” $\Delta x$ that, if we make it too small, leaves eventually only single points in each bin, so we cannot estimate the probability per bin.

The probability density function $p(x)$: The point cloud represents the sample, i.e. the values $x_n$. Here we have a sample size of $N=1000$. The red bars are estimates of the probability density function $p(x)$. You can vary the binsize $\Delta x$. If the binsize is too small, the functional form of the pdf is lost; if it’s too large, the resolution is too coarse.
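A minimal sketch of this binning procedure (Python with NumPy assumed; the Gaussian sample and the three binsizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)             # a sample of N = 1000 measurements

for dx in (0.01, 0.25, 2.0):          # too small / reasonable / too coarse
    edges = np.arange(x.min(), x.max() + dx, dx)
    counts, _ = np.histogram(x, bins=edges)
    p = counts / (len(x) * dx)        # fraction per bin divided by the binsize
    print(f"dx = {dx}: {len(counts)} bins, peak p(x) = {p.max():.2f}")
```

With $\Delta x=0.01$ most bins contain zero or one point and the estimate is pure noise; with $\Delta x=2$ the bell shape is washed out.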

Although the pdf is much easier to “read” as a function, because it can tell you where probability might be concentrated, the probability distribution $P(x)$ is much more stable to estimate and doesn’t change as much as the sample size is changed.

Estimating the Probability

In fact there’s an easy way to estimate the function $P(x)$ given a set of measurements $x_{n}$. All you need to do is first sort the measurements such that $$ x_{n-1} < x_{n}< x_{n+1} $$ (assuming that there are no exactly identical measurements, an assumption that isn’t necessary but avoids some minor technical complications) and then plot the dots $$ \left(x_{n},n/N\right) $$ in the $x$-$y$-plane, i.e. use the values $y_{n}=n/N$. This estimates the function $P(x)$.
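A sketch of this sort-and-plot recipe (Python with NumPy assumed; the actual plotting is only indicated in a comment):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(size=200)              # unsorted measurements x_n

x_sorted = np.sort(x)                      # x_1 < x_2 < ... < x_N
y = np.arange(1, len(x) + 1) / len(x)      # y_n = n / N

# the dots (x_sorted[n], y[n]) estimate P(X < x); with matplotlib,
# plt.step(x_sorted, y) would draw the empirical distribution function
```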

The cumulative probability distribution function $P(x)$: As you vary the sample size the curve becomes smoother and estimates the probability $P(X < x)$. You can vary $N$ in the range $[10,1000]$.

Experiment and theory

Often we have an experiment, and therefore a sequence of measured values $x_{n}$, and we look at an empirical $P(x)$ or a histogram $\Delta x\,h(x)$ to develop an idea of what theoretical fit to these functions could be used. For instance, let’s say we have a set of measurements and cook up a distribution $P(x)$ or a pdf $p(x)$ that describes the data well. In this case we have a continuous curve, say $p(x)$, and can infer quantities of interest based on that function.

Model pdfs

For instance if we have a pdf $p(x)$ it must be normalized

$$ \int_{-\infty}^{\infty}dx\,p(x)=P(\infty)-P(-\infty)=1 $$ With $p(x)$ we can compute the expectation value of $X$ like this $$ \left\langle X\right\rangle =\int_{-\infty}^{\infty}dx\,x\,p(x) $$ or in fact any expectation value like so: $$ \left\langle f(X)\right\rangle =\int_{-\infty}^{\infty}dx\,f(x)\,p(x) $$ In applications we often encounter different types of pdfs, some more often than others. Frequently encountered pdfs (see the sketch after this list) are

  • the exponential $$ p(x)=\alpha e^{-\alpha x} $$ with an expectation value of $$ \left\langle X\right\rangle =1/\alpha $$ and a domain $[0,\infty)$,

  • the Gaussian $$ p(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left({-(x-\mu)^ 2 / 2\sigma^2 }\right) $$ with a mean of $\mu$ and a variance of $\sigma^{2}$, defined on the entire real axis.

There are many other important distributions.
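The normalization and the stated moments of these two pdfs can be checked by numerical integration. A small sketch (Python with SciPy’s quad assumed; the parameter values are made up):

```python
import numpy as np
from scipy.integrate import quad

alpha, mu, sigma = 2.0, 1.0, 0.5   # assumed parameter values

p_exp = lambda x: alpha * np.exp(-alpha * x)
p_gauss = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(quad(p_exp, 0, np.inf)[0])                           # normalization -> 1
print(quad(lambda x: x * p_exp(x), 0, np.inf)[0])          # <X> -> 1/alpha

print(quad(p_gauss, -np.inf, np.inf)[0])                   # normalization -> 1
print(quad(lambda x: x * p_gauss(x), -np.inf, np.inf)[0])  # <X> -> mu
print(quad(lambda x: (x - mu)**2 * p_gauss(x), -np.inf, np.inf)[0])  # variance -> sigma^2
```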

Discrete random variables

What we discussed above all applied to random variables $X$ with a continuous domain. We also often have discrete random variables, for instance a random variable that can only generate values among, say, the natural numbers $0,1,2,3,\dots$. Typically this is encountered when we have event sequences or particle or agent numbers. Most of what we discussed above translates directly to discrete variables; all we have to do is replace integrals by sums over the domain of possible values.
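For instance, the normalization and the expectation value become sums over the domain. A small sketch (Python with NumPy assumed; the Poisson distribution $p_{k}=\lambda^{k}e^{-\lambda}/k!$ is used here merely as an illustrative discrete pmf, it is not derived in the text above):

```python
import numpy as np
from math import exp, factorial

lam = 3.0                            # assumed rate parameter
k = np.arange(60)                    # discrete domain 0, 1, 2, ...
p = np.array([lam**n * exp(-lam) / factorial(n) for n in range(60)])  # Poisson pmf

print(p.sum())                       # normalization: sum_k p_k -> 1
print((k * p).sum())                 # <X> = sum_k k * p_k -> lam
```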

Two random variables

What we discussed above was about a single random variable or quantity $X$. Of course we can have situations where we have multiple quantities or variables. Let’s discuss the situation where we have two of them, call them $X$ and $Y$; generalizations to more than two variables are straightforward. In this case we have a two-dimensional pdf $p(x,y)$, and $$ p(x,y)\Delta x\Delta y $$ is the probability that the combined measurement of $X$ and $Y$ will yield values in the little square $\left[x,x+\Delta x\right]\times\left[y,y+\Delta y\right]$.

Now it turns out that $X$ and $Y$ may well have an impact on one another. In one extreme the value of $X$ has no influence on that of $Y$ and vice versa. In order to describe this we define conditional probabilities like $$ p(x|y)\Delta x $$ which is the probability of measuring a value in $[x,x+\Delta x]$ given that $Y=y$. Likewise we can define $$ p(y|x)\Delta y $$ the probability of measuring a value in $[y,y+\Delta y]$ given that $X=x$. We must have $$ \int dx\,p(x|y)=1\quad\int dy\,p(y|x)=1 $$ The probability of measuring $X=x$ and $Y=y$ is $$ p(x,y)\Delta x\Delta y=p(x|y)\Delta x\times p(y)\Delta y $$ so that $$ p(x,y)=p(x|y)p(y) $$ and likewise $$ p(x,y)=p(y|x)p(x) $$ so that $$ p(y|x)p(x)=p(x|y)p(y) $$

Now if $X$ and $Y$ have no impact on each other, then the “condition” $X=x$ does not have an impact on measuring a value $y$ and vice versa, so the condition can be dropped: $$ p(y|x)=p(y)\qquad p(x|y)=p(x) $$ so that $$ p(x,y)=p(x)p(y) $$ which means that the combined pdf factorizes. This also means that $$ \left\langle f(X)g(Y)\right\rangle =\left\langle f(X)\right\rangle \left\langle g(Y)\right\rangle $$ so expectation values also factorize.
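The factorization of expectation values is easy to check numerically. A minimal sketch with independently drawn samples (Python with NumPy assumed; the particular choices of $f$ and $g$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1_000_000

x = rng.normal(size=N)               # X and Y drawn independently
y = rng.exponential(size=N)

f = lambda v: v**2
g = lambda v: np.cos(v)

print((f(x) * g(y)).mean())          # <f(X) g(Y)>
print(f(x).mean() * g(y).mean())     # <f(X)> <g(Y)>, agrees up to sampling error
```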

Covariance and Correlations

When dealing with two variables, like $X$ and $Y$, we can quantify the degree to which they impact one another by the covariance $$ \left\langle (X-\left\langle X\right\rangle )(Y-\left\langle Y\right\rangle )\right\rangle $$ which is zero if the variables are independent. But be careful: if the covariance is zero, that does not imply that $X$ and $Y$ are independent. Independence is a stronger condition. We can normalize the covariance by the standard deviation of each of the variables to obtain the correlation coefficient: $$ c=\frac{\left\langle (X-\left\langle X\right\rangle )(Y-\left\langle Y\right\rangle )\right\rangle }{\sqrt{\left\langle (X-\left\langle X\right\rangle )^{2}\right\rangle \left\langle (Y-\left\langle Y\right\rangle )^{2}\right\rangle }} $$ But…

…be careful

Normally the covariance or correlation coefficient is used to get an idea about potential cause-effect links between two quantities, or about possible mechanisms that link them. Yet sometimes the ways that $X$ and $Y$ depend on one another are not captured by the covariance, as illustrated by the computation below and by the following video.
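A small numerical sketch of both points (Python with NumPy assumed; the constructed samples are illustrative, not the data from the video):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 1_000_000

x = rng.normal(size=N)
y = 0.8 * x + 0.6 * rng.normal(size=N)     # Y correlated with X, c = 0.8

cov = ((x - x.mean()) * (y - y.mean())).mean()
c = cov / (x.std() * y.std())              # correlation coefficient
print(c, np.corrcoef(x, y)[0, 1])          # both approx 0.8

# zero covariance does not imply independence: Y = X^2 is fully
# determined by X, yet cov(X, X^2) = <X^3> = 0 for symmetric X
y2 = x**2
print(((x - x.mean()) * (y2 - y2.mean())).mean())   # approx 0
```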

Covariance: This is the datasaurus, an illustration showing that the covariance, standard deviations, and means of two random variables do not capture all the structure potentially hidden in the statistics of the two variables.