I’ve been experimenting with adding DSP-inspired multirate + LFO modules around GPT blocks and learning the DSP hyperparameters jointly during training. I’ve only tested this on small character-level GPTs (enwik8/text8), so it’s not a SOTA claim, just an exploratory result. In this setting I see ~12-19% lower best val loss and roughly 55-70% fewer FLOPs to reach fixed loss targets versus the same GPT baselines (5 seeds, FLOPs-to-target analysis).
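To give a rough feel for the idea, here is a minimal PyTorch sketch of one way to wrap a transformer block with a learnable LFO gate plus a strided "multirate" path. This is illustrative only, not the repo's actual module: the class name, the linear down/up projections, and the specific rate/depth parameterization are all assumptions made for the example.

```python
# Illustrative sketch only -- not the repo's actual module. It shows the flavor of
# the idea: a learnable low-frequency oscillator (LFO) gating the residual branch
# of a block, plus a strided "multirate" path. Names and shapes are assumptions.
import torch
import torch.nn as nn


class LFOWrappedBlock(nn.Module):
    def __init__(self, block: nn.Module, d_model: int, stride: int = 2):
        super().__init__()
        self.block = block            # any (B, T, d_model) -> (B, T, d_model) module
        self.stride = stride          # fixed decimation factor for the multirate path
        # Learnable DSP hyperparameters: LFO rate (cycles per token) and depth.
        self.log_rate = nn.Parameter(torch.tensor(-3.0))
        self.depth = nn.Parameter(torch.tensor(0.5))
        self.down = nn.Linear(d_model * stride, d_model)   # crude decimator
        self.up = nn.Linear(d_model, d_model * stride)     # crude interpolator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        # Run the inner block at the lower rate (T // stride positions).
        T_low = T // self.stride
        x_low = x[:, : T_low * self.stride].reshape(B, T_low, D * self.stride)
        y_low = self.block(self.down(x_low))
        y = self.up(y_low).reshape(B, T_low * self.stride, D)
        y = nn.functional.pad(y, (0, 0, 0, T - y.shape[1]))  # restore sequence length
        # LFO gate: slow sinusoid over positions, rate/depth learned with the model.
        pos = torch.arange(T, device=x.device, dtype=x.dtype)
        gate = 1.0 + self.depth * torch.sin(2 * torch.pi * torch.exp(self.log_rate) * pos)
        return x + gate.view(1, T, 1) * y
```

In a nanoGPT-style model you could wrap each block as `LFOWrappedBlock(block, d_model)` and train end to end; the actual modules, ablations, and math are in the repo README.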
The repo has a detailed README (with math and ablations) plus scripts to reproduce the experiments.