Angelo Maria Sabatini

Is the sample mean a Kalman filter in disguise?

Why recursive averaging is deeper than it looks
stochastic modeling
signal processing
measurement
Starting from one of the simplest problems in elementary statistics — estimating a constant from noisy measurements — this post gradually reconstructs the core ideas behind Kalman filtering: recursive estimation, uncertainty propagation, process noise and its covariance \(Q\), and the emergence of a steady-state gain in linear models. Along the way, the sample mean reappears as a special case of a Kalman filter with zero process noise, while the role of \(Q\) reveals the trade-off between promptness of response and noise sensitivity.
Jun 3, 2026
13 min
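
The special case named in the abstract above can be checked numerically. The following is a minimal sketch, not code from the post: it assumes a scalar random-walk state with process noise variance `Q`, measurement noise variance `R`, and initialization from the first measurement. The function name `kalman_constant` is hypothetical.

```python
import math
import random

def kalman_constant(zs, R=1.0, Q=0.0):
    """Scalar Kalman filter for a (nearly) constant state observed in noise.

    Assumed model (illustrative): x_k = x_{k-1} + w_k, z_k = x_k + v_k,
    with Var(w) = Q and Var(v) = R. Initialized from the first measurement.
    """
    x, P = zs[0], R
    gains = [1.0]                # the first measurement is taken at face value
    for z in zs[1:]:
        P = P + Q                # predict: uncertainty grows by Q
        K = P / (P + R)          # Kalman gain
        x = x + K * (z - x)      # correct the estimate with the innovation
        P = (1.0 - K) * P        # update the error variance
        gains.append(K)
    return x, P, gains

random.seed(0)
zs = [5.0 + random.gauss(0.0, 1.0) for _ in range(50)]

# Q = 0: the gain at step k is 1/k and the estimate is the sample mean.
x_hat, P, gains = kalman_constant(zs, R=1.0, Q=0.0)
sample_mean = sum(zs) / len(zs)

# Q > 0: the gain settles to a steady-state value instead of decaying to zero.
_, _, gains_q = kalman_constant(zs, R=1.0, Q=0.1)
P_pred = (0.1 + math.sqrt(0.1**2 + 4 * 0.1 * 1.0)) / 2   # steady-state predicted variance
K_steady = P_pred / (P_pred + 1.0)
```

With `Q = 0` the error variance after n measurements is `R / n`, the familiar variance of the sample mean. A larger `Q` raises the steady-state gain, which makes the estimate respond faster but also pass more measurement noise.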

What does a discrete-time noise generator really simulate?

Why thermal-noise simulation is not about sampling white noise, but about understanding where physics places bandwidth.
stochastic modeling
signal processing
measurement
A resistor at finite temperature generates thermal noise, and even a simple RC low-pass filter produces a stochastic output voltage. The classical Nyquist–Johnson model describes the excitation as continuous-time Gaussian white noise, leading to a Gauss–Markov output process. Everything looks elegant on paper. But what does it actually mean to simulate it? Can white noise really be sampled? Or does every discrete-time noise generator hide an implicit bandwidth assumption, and if so, which one? Starting from this apparently simple question, this post explores one of the most common conceptual traps in engineering simulation: confusing ideal white noise with what numerical models can actually generate. The result is a practical path, from theory to working code (MATLAB, for example), for reconstructing the correct Gauss–Markov process and its exponential autocorrelation without cheating physics.
May 15, 2026
19 min

Discretizing motion under stochastic acceleration

stochastic modeling
Using a minimal kinematic model, this post examines how assumptions on stochastic acceleration propagate from continuous time to an exact discrete-time model of position and velocity. The focus is on modeling choices, discretization, and their often overlooked consequences for uncertainty representation.
Feb 2, 2026
14 min

PCA, MANOVA, and the geometry of multivariate comparison

Understanding variability beyond univariate thinking
multivariate statistics
Multivariate analysis is often introduced as a technical upgrade to univariate testing: more variables, more sophisticated statistics, more powerful conclusions. In practice, however, its real value lies elsewhere. Multivariate methods force us to think geometrically, shifting attention from individual variables to configurations, from isolated effects to structured variability. In this post, Principal Component Analysis (PCA) and Multivariate Analysis of Variance (MANOVA) are discussed in their concerted action on the popular iris dataset.
Dec 23, 2025
23 min

Partial Least-Squares Discriminant Analysis for text classification: A linear model that works

multivariate statistics
machine learning
text mining
In this post, I present a short, hands-on exploration of sparse Partial Least-Squares Discriminant Analysis (sPLS-DA) applied to stylometric classification using term frequency representations. How far can a linear model go — and what can we learn from its internal structure? Starting from raw text, I construct a document-term matrix, tune a sparse linear classifier, and interpret the results through the most discriminative lexical features selected by each latent component. Along the way, I highlight the method’s interpretability, efficiency, and surprisingly strong performance in a simple but expressive text classification setting.
Jul 7, 2025
28 min

Why memory matters: A tale of two Markov chains

probability
statistics
This post explores the computation of the stationary distribution and the autocorrelation function (ACF) in discrete-time Markov chains, focusing on two fundamental cases: a simple two-state chain and a more structured four-state chain that encodes second-order binary dependencies. Avoiding simulation-based methods, I’ll show how both analytical and numerical approaches can yield exact results for the ACF. Particular attention is given to the concept of Variance Inflation Factor (VIF) and its role in estimating the standard error of a sample proportion when autocorrelation is present.
Jun 9, 2025
14 min
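
For the two-state case, the quantities named in the abstract above have simple closed forms. The sketch below is illustrative and not taken from the post: the parameter names `a` and `b` are assumptions, and the inflation factor uses the usual large-sample definition, 1 + 2 Σ ρ(k), which may differ in detail from the post's own.

```python
def two_state_chain(a, b):
    """Stationary distribution and ACF decay rate of a two-state chain.

    a = P(0 -> 1), b = P(1 -> 0), assuming 0 < a + b < 2.
    """
    pi = (b / (a + b), a / (a + b))
    lam = 1.0 - a - b            # second eigenvalue of the transition matrix
    return pi, lam

def acf(lam, k):
    """Autocorrelation of the stationary chain at lag k: lam ** k."""
    return lam ** k

def variance_inflation(lam):
    """Large-sample factor 1 + 2 * sum_k acf(k) = (1 + lam) / (1 - lam)."""
    return (1.0 + lam) / (1.0 - lam)

a, b = 0.1, 0.3
pi, lam = two_state_chain(a, b)

# Check stationarity: one more step of the chain leaves pi unchanged.
P = [[1 - a, a], [b, 1 - b]]
pi_next = tuple(pi[0] * P[0][j] + pi[1] * P[1][j] for j in range(2))

vif = variance_inflation(lam)
```

Under these assumptions, the standard error of a sample proportion from n correlated draws is roughly the independent-sampling value multiplied by the square root of `vif`.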

Backwards through the model

measurement
statistics
Calibration is not just about building a model — it’s also about understanding how that model will be used in practice. This post revisits a common measurement scenario: a transducer is calibrated using polynomial regression, and then the challenge of inverse prediction is faced — estimating the input that produced an observed output. Along the way, I examine the conceptual pitfalls of model inversion, and reflect on the practical consequences of noise, nonlinearity, and extrapolation in real deployment settings via a Monte Carlo simulation.
May 29, 2025
24 min

Statistics and Machine Learning: A shared landscape

statistics
machine learning
What does it mean to shift from statistical inference to prediction? This post uses a minimal but complete example to explore the shared terrain between statistics and machine learning — from how models are trained, to how decisions are made and evaluated in real-world contexts.
May 5, 2025
25 min

Supercharge Deep Learning in R with a hybrid R–Colab workflow

machine learning
Training deep learning models in R is powerful—but it can be painfully slow on a single-CPU machine. This post shows how to blend the flexibility of R for data prep, visualization, and evaluation with the raw GPU power of Google Colab for fast model training. I show how to create a seamless workflow allowing continued use of familiar R tools, while letting Python handle the heavy lifting when it counts.
Apr 23, 2025
17 min

Unleashing the power of Apple Silicon for R: Parallel processing on M1/M2

machine learning
Parallel computation is a big deal in machine learning (ML), especially when working with large datasets, complex models (such as deep neural networks), or computationally intensive tasks like hyperparameter tuning. At its core, parallel computation means executing multiple calculations or processes simultaneously. In ML, this can significantly speed up training and inference by leveraging multiple processing units, such as CPU cores, GPUs, TPUs, or computing clusters. In this post, I’ll review some of the most popular methods available in R for running computations in parallel.
Apr 22, 2025
9 min

Using Long Short Term Memory (LSTM) in R for time series forecasting

time series
In mid-2017, Keras became available in R through a comprehensive package that runs on top of powerful numerical platforms, such as TensorFlow and Theano. This package helped R keep pace with Python in managing Deep Learning (DL) frameworks and libraries. Models supported by the R Keras package include the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Multilayer Perceptron (MLP), among others. In this post, I discuss an example of using LSTM for time series forecasting using code written in R.
Mar 11, 2025
10 min

Frequency spectrum of a sine-wave tone burst

signal processing
In this post, I discuss the spectral properties of a well-known test signal, namely the sine-wave tone burst. The underlying theory is reviewed to explain the pattern of spectral lines that are produced when a sinusoidal signal at a given frequency is turned ‘on and off’ at a slower pace. Examples are presented to show the agreement between theoretical predictions and results of the FFT-based spectrum analysis.
Jan 16, 2025
5 min

Spell checking using hunspell

text mining
The hunspell package is a high-performance stemmer, tokenizer, and spell checker for R. LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, Opera, RStudio and many others use this spell checker library, which supports several languages, including Italian. Hunspell uses a special dictionary format that defines which characters, words and conjugations are valid in the specified language. In this post I will illustrate how to use the spell checker.
Jan 2, 2025
4 min

Correspondence analysis: Part II

multivariate statistics
Correspondence Analysis (CA) is a type of multidimensional scaling, one of several methods that are available for developing spatial models that reveal associations between two or more categorical variables. Conceptually, CA is similar to principal component analysis, but applies to categorical rather than continuous data. In this post, I will briefly illustrate how simple CA (the method used when data on only two categorical variables are analyzed) can be computed in R. A brief overview of the underlying theory was sketched in a previous post.
Dec 10, 2024
13 min

Correspondence analysis: Part I

multivariate statistics
Correspondence Analysis (CA) is a type of multidimensional scaling, one of several methods that are available for developing spatial models that reveal associations between two or more categorical variables. Conceptually, CA is similar to principal component analysis, but applies to categorical rather than continuous data. In this post, I will briefly present the theory behind simple CA (the method used when data on only two categorical variables are analyzed), leaving details of how to carry out the analysis in R to a future post.
Dec 9, 2024
16 min

Audio features for free

data mining
Spotify is the leader in the audio streaming market, with several million subscribers, myself included, and many more listeners who use the app for free. Although ordinary users may not know it, each track in the Spotify library comes with an extensive list of numerical scores, the outcome of the audio analysis that each track undergoes using advanced algorithms developed by Spotify. Whatever their intended use, these data, which are usually hidden from the user, can be retrieved using the Web APIs offered by Spotify. In this post I explain how this can be done.
Nov 12, 2024
10 min

Numerical simulation for stochastic differential equations

stochastic calculus
A down-to-the-bone post, in which a few issues concerning the numerical simulation of stochastic differential equations are discussed with only a limited amount of technical detail.
Jun 6, 2024
10 min

Significant figures

measurement
The use of calculators and computers leads to lab reports with far too many digits in every number produced. Assessing the correct number of significant figures is essential in reporting either experimental or computed results together with their stated uncertainties.
May 14, 2024
11 min

Augmented Dickey-Fuller test

time series
In statistics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis of a unit root in a time series sample. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity. This post explains how to use the ADF test in R, with an attempt to make the different test statistics clear and easily interpretable.
May 13, 2024
10 min

Gambler’s ruin

probability
The gambler’s ruin problem is often applied to gamblers with finite wealth playing against a bookie or casino assumed to have a much larger amount of wealth available, in principle infinite. It can then be proven that the probability of the gambler’s eventual ruin tends to 1 even in the scenario where the game is fair.
May 5, 2024
7 min
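
The limiting statement in the abstract above follows from the classical closed-form ruin probability. The sketch below is illustrative and not taken from the post: the function name `ruin_probability` and its parameters are hypothetical, with `i` the gambler's starting capital, `N` the combined capital of gambler and opponent, and `p` the probability of winning one unit per round.

```python
def ruin_probability(i, N, p=0.5):
    """Probability that a gambler starting with i units reaches 0 before N.

    Each round the gambler wins one unit with probability p and loses one
    unit otherwise. Intended for moderate N when p != 0.5.
    """
    if not 0 <= i <= N:
        raise ValueError("starting capital must lie between 0 and N")
    if p == 0.5:
        return 1.0 - i / N                      # fair game
    r = (1.0 - p) / p
    return 1.0 - (1.0 - r**i) / (1.0 - r**N)    # biased game

# Fair game, fixed stake of 10 units, increasingly wealthy opponent.
for N in (20, 100, 10_000, 1_000_000):
    print(N, ruin_probability(10, N))
```

In the fair case the ruin probability is `1 - i / N`, which approaches 1 as the opponent's wealth, and hence `N`, grows without bound.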

Multivariate probability regions

statistics
Computing probability regions in the space where sample data are assumed to live means determining the regions that the underlying population is expected to occupy with a prescribed probability. After reviewing the theory for multivariate normal distributions, I present an application example from the field of posturographic research.
May 2, 2024
9 min

Confidence intervals for proportions

statistics
Usually confidence intervals for the estimation of proportions are based on methods that exploit the normal approximation to the binomial distribution. By simulation, two of these methods (Wilson and Wald) are tested for their ability to provide the stated coverage for small-to-large samples.
Apr 27, 2024
10 min
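
The two intervals compared in the abstract above can be written in a few lines. This is a minimal sketch using the textbook formulas, not the post's own simulation code. The function names are hypothetical, and `x` and `n` denote the number of successes and the sample size.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(x, n, level=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(x, n, level=0.95):
    """Wilson score interval, obtained by inverting the score test."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = x / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# With no successes in 10 trials the Wald interval collapses to a point,
# while the Wilson interval keeps a positive width.
wald = wald_interval(0, 10)
wilson = wilson_interval(0, 10)
```

A coverage study would repeat these calculations over many simulated binomial samples and count how often each interval contains the true proportion.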

Pills of combinatorics

probability
Although experimentalists are familiar with the topic, I believe a brief review of the basic formulae of combinatorial analysis can still be helpful.
Apr 24, 2024
7 min

Random incidence

probability
In this post, I briefly discuss the random incidence phenomenon, using the classical example of a person arriving at a bus stop at a random time, and waiting for the arrival of the next bus.
Apr 22, 2024
7 min
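
The bus-stop example in the abstract above rests on a standard result: a passenger arriving at a random time waits on average E[X²] / (2 E[X]), where X is the time between consecutive buses. The sketch below is illustrative and not taken from the post. The function name `mean_wait` is hypothetical.

```python
def mean_wait(mean_headway, var_headway):
    """Expected wait for the next bus when arriving at a random time.

    Uses E[W] = E[X^2] / (2 * E[X]), with X the time between buses,
    so that E[X^2] = Var(X) + E[X]^2.
    """
    second_moment = var_headway + mean_headway**2
    return second_moment / (2.0 * mean_headway)

regular = mean_wait(10.0, 0.0)       # buses exactly every 10 minutes
poisson = mean_wait(10.0, 100.0)     # exponential headways, mean 10 minutes
```

For the same average headway, more variable headways lengthen the expected wait, because a random arrival is more likely to fall inside a long gap than a short one.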

Frequency resolution of spectral analysis

signal processing
Spectral leakage and length of data record hamper our ability to resolve spectral lines by DFT/FFT analysis. In this post, I briefly discuss this problem, with examples using sinusoidal mixtures.
Apr 13, 2024
8 min