A primer on filtering

Here, I discuss the core of the filtering idea in a relatively simple language. I will not introduce particle filters here but at the end you should have a really solid idea about what they are aiming at. In the following, I assume some familiarity with probability densities and fundamental rules (e.g. marginalisation or conditional independence or difference between a random variable and its realisation).


A simple bound for optimisation using a grid

If I give you a function on $[0,1]$ and a computer and want you to find the minimum, what would you do? Since you have the computer, you can be lazy: Just compute a grid on $[0,1]$, evaluate the grid points and take the minimum. This will give you something close to the true minimum. But how much?


An $L_2$ bound for Perfect Monte Carlo

Monte Carlo methods are widely used for estimating expectations of complicated probability distributions. Here I provide the well-known $L_2$ bound.


Matrix Factorisation with Linear Filters (and discussion)

I submitted a preprint on matrix factorisations and linear filters. I managed to derive some factorisation algorithms as linear filtering algorithms.

In the paper, I left a discussion to here, estimating parameters via maximising marginal likelihood. So here it is.

In the paper, I consider the following probabilistic model, \begin{align} p(c) &= \NPDF(c; c_0, V_0 \otimes I) \label{priorC} \\ p(y_k|c,x_k) &= \NPDF(y_k; (x_k \otimes I_m) c, \lambda \otimes I). \label{LikelihoodCform1} \end{align} where each $x_k$ is a static model parameter vector, and $C$ is the latent matrix. Note that $(x_k^\top \otimes I_m) c = C x_k$ (see the paper). We would like to solve the following problem to estimate $x_k$ for each $k$, \begin{align*} x_k^* = \argmax_{x_k} \log p(y_k | y_{1:k-1},x_k). \end{align*} So we would like to obtain the incremental marginal likelihood $p(y_k | y_{1:k-1},x_k)$. We define it as, \begin{align*} p(y_k | y_{1:k-1},x_k) = \int p(y_k | c, x_k) p(c | y_{1:k-1}) \mbox{d}c \end{align*} where $c = \vect(C)$. The important thing here is $p(c | y_{1:k-1})$ is not the exact distribution but it is an approximation to the true posterior as one has to use $x_{k-1}^*$ to obtain the distribution (If $X$ were known, it would be exact). So we know from the paper, \begin{align*} p(c | y_{1:k-1}) = \NPDF(c; c_{k-1}, V_{k-1} \otimes I_m), \end{align*} and, \begin{align*} p(y_k | c, x_k) = \NPDF(y_k; (x_k \otimes I_m) c, \lambda \otimes I_m). \end{align*} Using formulas in Bishop's PRML (2006, page 93), we can arrive the marginal $p(y_k|y_{1:k-1},x_k)$ as the following, \begin{align*} p(y_k | y_{1:k-1},x_k) = \NPDF(y_k; (x_k^\top \otimes I_m) c_k, (\lambda \otimes I_m) + (x_k^\top \otimes I_m) (V_{k-1} \otimes I_m) (x_k \otimes I_m)). \end{align*} Using properties given in the Section I.A of the paper, we compactly obtain, \begin{align}\label{MarginalLikelihoodOfMFRLF} p(y_k | y_{1:k-1},x_k) = \NPDF(y_k; C_{k-1} x_k, (\lambda + x_k^\top V_{k-1} x_k) \otimes I_m) \end{align} Although integration is analytic, maximising the logarithm of \eqref{MarginalLikelihoodOfMFRLF} is not possible analytically. However the likelihood is differentiable, so one can use nonlinear optimisation methods. But in terms of performance this does not bring any advantage: One gains a bit in the overall error part but loses too much in terms of computation time. I have tried quasi-Newton methods (BFGS), and conjugate gradients but no cure.


Online Matrix Factorization via Broyden Updates

I arXived a new preprint titled Online Matrix Factorization via Broyden Updates.

Around this April, I was reading quasi-Newton methods (from this very nice paper of Philipp Hennig) and when I saw the derivation of the Broyden update, I immediately realized that this idea may be used for computing factorizations. Furthermore, it will lead to an online scheme, more preferable!

The idea is to solve the following optimization problem at each iteration $k$:\begin{align*} \min_{x_k,C_k} \big\| y_k - C_k x_k \big\|_2^2 + \lambda \big\|C_k - C_{k-1}\big\|_F^2.\end{align*}
The motivation behind this cost is in the manuscript.

Although the basic idea was explicit, I set a few goals. First of all, I would like to develop a method that one can sample any column of the dataset and use it immediately. So I modified the notation a bit, as you can see from Eq. (2) in the manuscript. Secondly, I wanted that one must be able to use mini-batches as well, a group of columns at each time. Thirdly, it was obvious that a modern matrix factorization method must handle the missing data, so I had to extend the algorithm to handle the missing data. Consequently, I have sorted out all of this except a rule for missing data with mini-batches which turned out to be harder, so I left out that for this work.


Tinkering around logistic map

I was tinkering around logistic map today, I accidentally discovered a property of it. That is, if you take a statistical view over time and take a look at the histogram of the sequence generated by it, it will look same independent of the input. This immediately hit me that there may be an underlying probability density of this map. In fact, I found out that Ulam and von Neumann studied this! So here we go.


Monte Carlo as Intuition

I am doing teaching assistantship for Monte Carlo methods course. I explained the following exercise as an application of the Monte Carlo principle. It is nice because it coincides with a very intuitive guess.


Fisher's Identity

Fisher's identity is useful to use in maximum-likelihood parameter estimation problems. In this post, I give its proof. The main reference is Douc, Moulines, Stoffer; Nonlinear time series theory, methods and applications.


Batch MLE for the GARCH(1,1) model


In this post, we derive the batch MLE procedure for the GARCH model in a more principled way than the last GARCH post. The derivation presented here is simple and concise.


Convergence of gradient descent algorithms


In this post, I review the convergence proofs of gradient algorithms. Our main reference is: Leon Bottou, Online learning and stochastic approximations. I rewrite the proofs described in Bottou's paper but with more details about the points which are subtle to me. I tried to write the proofs as clear as possible so as to make them accessible to everyone.