IBM has been building toward the same idea for five years, and it is finally becoming a product.
At the KDD conference in 2021, IBM researchers introduced what they describe as the first transformer-based framework for unsupervised representation learning of multivariate time series — a model Jayant Kalagnanam at IBM calls the opening move in what became a crowded field. Informer, Autoformer, and FEDformer arrived in subsequent years. IBM's own follow-ups, PatchTST and PatchTSMixer, unveiled at ICLR and KDD in 2023, outperformed other transformers on weather, traffic, and electrical load forecasting tasks by up to 60 percent in experiments. The code is on Hugging Face.
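PatchTST's central trick, as the name suggests, is to slice a series into patches and treat each patch as one token, which is what lets machinery built for discrete text operate on a continuous signal. A minimal sketch of that patching step (the patch length and stride here are illustrative, not the models' actual defaults):

```python
import numpy as np

def patchify(series, patch_len=16, stride=8):
    """Split a 1-D series into overlapping patches, PatchTST-style:
    each patch becomes one 'token' fed to the transformer."""
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

series = np.sin(np.linspace(0, 8 * np.pi, 96))  # toy signal, 96 steps
patches = patchify(series)
print(patches.shape)  # (11, 16): 11 tokens, each a window of 16 values
```

Compared with feeding the model one time step per token, patching shortens the sequence the transformer attends over and gives each token local shape to work with.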
The harder problem is that time series data is nothing like text. Words have fixed vocabularies and consistent structure. Time series data is continuous — think of video streaming from a self-driving car or heart rate data from a smartwatch — and staggeringly varied across domains, from stock prices to brain waves to light curves from distant stars. There is no common dictionary of patterns across domains, said Vijay Ekambaram at IBM. That is what makes the problem so challenging — and potentially rewarding.
The other constraint is data. LLMs learned language by training on internet-scale text; time series has no comparable common corpus, no equivalent of the internet's text commons. The Monash Time Series Forecasting Archive, one of the more comprehensive benchmarking efforts, brings together around 30 datasets across domains, a fraction of what a Wikipedia-scale resource would require.
IBM's response runs on two tracks. The first is to keep building better time series transformers and open-sourcing them, using community feedback to improve the models. The second is to retrofit existing LLMs for time series tasks, taking a text classifier and repurposing it with two additional trained layers. This reprogramming approach is the subject of upcoming papers at AAAI and ICLR from Pin-Yu Chen and colleagues.
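The two-layer retrofit can be pictured as a frozen pretrained backbone sandwiched between a trained input projection and a trained output head: only the sandwich layers learn the new domain. The sketch below is a toy numpy stand-in under that assumption; the dimensions and the single frozen dense layer are hypothetical simplifications, not the architecture from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": stands in for the pretrained text model's layers.
# These weights are never updated during time series training.
D_MODEL = 64
W_frozen = rng.normal(size=(D_MODEL, D_MODEL))

# The two additional trained layers from the reprogramming recipe:
# 1) an input projection mapping a raw time series window into the
#    backbone's embedding space, and
# 2) an output head mapping embeddings to the forecast horizon.
CONTEXT_LEN, HORIZON = 96, 24
W_in = rng.normal(size=(CONTEXT_LEN, D_MODEL)) * 0.01   # trainable
W_out = rng.normal(size=(D_MODEL, HORIZON)) * 0.01      # trainable

def forecast(window):
    """window: (CONTEXT_LEN,) -> forecast: (HORIZON,)"""
    h = np.tanh(window @ W_in)   # project the series into model space
    h = np.tanh(h @ W_frozen)    # reuse the frozen pretrained computation
    return h @ W_out             # read out a forecast

y = forecast(rng.normal(size=CONTEXT_LEN))
print(y.shape)  # (24,)
```

The appeal of the design is cost: gradient updates touch only `W_in` and `W_out`, so adapting a large pretrained model to a new time series domain trains a tiny fraction of its parameters.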
The fastest model is Tiny Time Mixers: under 1 million parameters, up to 65 times faster to tune and run than billion-parameter alternatives, and trainable in eight hours on limited data. At that size and speed, the economics of deploying forecasting models change: you do not need a data center to run one.
IBM is using the current models with clients on cement and granulated sugar production optimization and quality simulations for physics research. The eventual goal is a transformer that can extract the structure of dynamic events in nature or the built environment and transfer that knowledge to never-before-seen domains. Every organization has a sales forecast, a demand plan, a capacity model. The accuracy of those models determines outcomes that cost real money. Better forecasting at enterprise scale is not a glamorous story. It is a billion-dollar problem.