This paper presents Bayesian-inspired Space-Time Superpixels (BIST): a fast, state-of-the-art method to compute space-time superpixels. BIST is a novel extension of a single-image Bayesian method named BASS, and it is inspired by hill-climbing to a local mode of a Dirichlet-Process Gaussian Mixture Model (DP-GMM). The method is only Bayesian-inspired, rather than actually Bayesian, because it includes heuristic modifications to the theoretically correct sampler. Similar to existing methods, BIST can adapt the number of superpixels to an individual frame using split-merge steps. A key novelty is a new temporal coherence term in the split step, which reduces the chance of splitting propagated superpixels. This term enforces temporal coherence in propagated regions while allowing unconstrained adaptation in disoccluded regions. A hyperparameter controls the strength of this new term and does not require per-video tuning to return consistent results across multiple videos. In wall-clock time, BIST runs over twice as fast as BASS and over 30 times faster than the next fastest space-time superpixel method with open-source code.
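The temporal coherence term can be illustrated with a small sketch: a split is accepted with a probability that depends on the model-fit gain, and propagated superpixels pay an extra penalty before splitting. The sigmoid form, the function names, and the penalty shape here are illustrative assumptions, not BIST's exact sampler.

```python
import math

def split_probability(log_likelihood_gain, is_propagated, lam=1.0):
    """Probability of accepting a superpixel split.

    log_likelihood_gain: model-fit improvement from splitting (hypothetical quantity).
    is_propagated: True if the superpixel was propagated from the previous frame.
    lam: strength of the temporal-coherence penalty (standing in for the paper's
         hyperparameter; this functional form is an assumption for illustration).
    """
    # disoccluded (non-propagated) regions adapt freely: no penalty
    penalty = lam if is_propagated else 0.0
    logit = log_likelihood_gain - penalty
    return 1.0 / (1.0 + math.exp(-logit))

# A propagated superpixel needs a larger likelihood gain before it splits:
p_new_region = split_probability(0.5, is_propagated=False)
p_propagated = split_probability(0.5, is_propagated=True, lam=2.0)
assert p_propagated < p_new_region
```

The one-sided penalty biases the sampler toward keeping propagated regions intact while leaving newly revealed regions free to re-segment.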
Soft Superpixel Neighborhood Attention
Kent Gauen and Stanley Chan
Advances in Neural Information Processing Systems, 2024
Images contain objects with deformable boundaries, such as the contours of a human face, yet attention operators act on square windows. This mixes features from perceptually unrelated regions, which can degrade the quality of a denoiser. This paper proposes using superpixel probabilities to re-weight the local attention map. If images are modeled with latent superpixel probabilities, we show that our re-weighted attention module matches the theoretically optimal denoiser. The left image shows that NA mixes information from the unrelated blue region, Hard-SNA improperly rejects pixels from the adjacent orange regions, and SNA correctly selects all the orange pixels and rejects the blue pixels.
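The re-weighting can be sketched as follows: each attention weight is multiplied by the probability that the corresponding pixel shares the query's superpixel, then the map is renormalized. The function name and the stand-in inputs are assumptions for illustration, not the paper's implementation.

```python
import math

def soft_sna_weights(scores, same_superpixel_prob):
    """Re-weight a local attention map by superpixel membership probabilities.

    scores: raw attention logits for one query over its local window.
    same_superpixel_prob: p_j = probability that pixel j lies in the same
    superpixel as the query (both lists are illustrative stand-ins).
    """
    # soft re-weighting: scale exp(score_j) by p_j, then renormalize;
    # a hard variant would instead threshold p_j to {0, 1}
    weighted = [math.exp(s) * p for s, p in zip(scores, same_superpixel_prob)]
    z = sum(weighted)
    return [w / z for w in weighted]

# pixels with near-zero membership probability get near-zero attention
w = soft_sna_weights([1.0, 1.0, 1.0], [0.9, 0.9, 0.01])
```

Because the probabilities enter multiplicatively before normalization, unrelated pixels are softly suppressed rather than discarded outright, unlike the hard-thresholded variant.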
Computing attention maps for videos is challenging due to the motion of objects between frames. Small spatial inaccuracies significantly degrade the attention module's quality. Recent works propose using a deep network to correct these small inaccuracies. In this project, we efficiently implement a space-time grid search that outperforms existing deep neural network alternatives. The image on the left shows a no-shift search, a search using a deep network from related works, and our proposed shifted non-local search.
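The idea behind a shifted search can be sketched in a few lines: center the search window at an optical-flow-shifted location in the next frame, then grid-search a small spatial neighborhood around it for the lowest matching cost. The function names, the 1x1 L2 cost, and the toy frames below are assumptions for illustration, not the project's implementation.

```python
def shifted_nonlocal_search(frame0, frame1, y, x, flow, radius=1):
    """Grid search for the best match of pixel (y, x) of frame0 in frame1.

    The search window is centered at the flow-shifted location, so the
    grid search only needs to correct small inaccuracies in the flow.
    """
    cy, cx = y + flow[0], x + flow[1]  # shift the window center by the flow
    best, best_cost = (cy, cx), float("inf")
    h, w = len(frame1), len(frame1[0])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny, nx = cy + dy, cx + dx
            if not (0 <= ny < h and 0 <= nx < w):
                continue  # skip candidates outside the frame
            cost = (frame0[y][x] - frame1[ny][nx]) ** 2  # 1x1 patch L2 cost
            if cost < best_cost:
                best, best_cost = (ny, nx), cost
    return best

# toy frames: the bright pixel moves one step right; even with a zero
# flow guess, a radius-1 search recovers the true location (1, 2)
f0 = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
f1 = [[0, 0, 0], [0, 0, 9], [0, 0, 0]]
```

A no-shift search corresponds to `flow = (0, 0)` with a window large enough to cover the motion; shifting the center lets a much smaller, cheaper window succeed.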