This paper presents an efficient method for computing space-time superpixels and an application of superpixels called superpixel convolution. The space-time superpixel method extends a single-image Bayesian method named BASS. Our approach, named Bayesian-inspired Space-Time Superpixels (BIST), is inspired by hill-climbing to a local mode of a Dirichlet-Process Gaussian Mixture Model conditioned on the previous frame's superpixel information. The method is only Bayesian-inspired, rather than actually Bayesian, because the split/merge steps are treated as a classification problem rather than derived from a Gibbs sampling update step. However, this heuristic reduces the number of split/merge steps from several hundred per frame to only a few. BIST is more than twice as fast as BASS and more than ten times faster than other space-time superpixel methods, with favorable (and sometimes superior) quality. Additionally, to garner interest in superpixels, this paper demonstrates their use within deep neural networks. We present a superpixel-weighted convolution layer for single-image denoising that outperforms standard convolution by 1.89 dB PSNR.
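The paper itself does not include code here, but the idea of weighting a convolution by superpixel membership can be illustrated with a minimal numpy sketch. This is our own illustration, not the paper's implementation: it uses hard superpixel labels (the paper's layer operates inside a deep network), and the function name and signature are ours. Each neighbor contributes only if it shares the center pixel's superpixel, and the surviving kernel weights are renormalized.

```python
import numpy as np

def superpixel_weighted_conv(image, kernel, sp_labels):
    """Illustrative superpixel-weighted convolution (hypothetical API).

    A neighbor's kernel weight is dropped when that neighbor lies in a
    different superpixel than the window's center pixel; the remaining
    weights are renormalized so the effective kernel still sums to one.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    rh, rw = kh // 2, kw // 2
    out = np.zeros((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            acc, wsum = 0.0, 0.0
            for di in range(-rh, rh + 1):
                for dj in range(-rw, rw + 1):
                    ni, nj = i + di, j + dj
                    # only mix pixels from the center pixel's superpixel
                    if (0 <= ni < H and 0 <= nj < W
                            and sp_labels[ni, nj] == sp_labels[i, j]):
                        w = kernel[di + rh, dj + rw]
                        acc += w * image[ni, nj]
                        wsum += w
            out[i, j] = acc / wsum if wsum > 0 else image[i, j]
    return out
```

On a piecewise-constant image whose superpixels align with the constant regions, this layer is edge-preserving: averaging never crosses a superpixel boundary, so the image is reproduced exactly.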
Soft Superpixel Neighborhood Attention
Kent Gauen and Stanley Chan
Advances in Neural Information Processing Systems, 2024
Images contain objects with deformable boundaries, such as the contours of a human face, yet attention operators act on square windows. This mixes features from perceptually unrelated regions, which can degrade the quality of a denoiser. This paper proposes using superpixel probabilities to re-weight the local attention map. If images are modeled with latent superpixel probabilities, we show our re-weighted attention module matches the theoretically optimal denoiser. The left image shows that neighborhood attention (NA) mixes information from the unrelated blue region, Hard-SNA improperly rejects pixels from the adjacent orange regions, and SNA correctly selects all of the orange pixels while rejecting the blue ones.
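The re-weighting step can be sketched in a few lines of numpy. This is a simplified illustration of the idea rather than the paper's module: the function name is ours, and we assume the superpixel probabilities are given as per-pixel membership vectors. Each key's attention weight is scaled by the probability that the query and key pixels fall in the same superpixel, and the map is renormalized.

```python
import numpy as np

def reweight_attention(attn, q_probs, k_probs):
    """Illustrative soft superpixel re-weighting (names are ours).

    attn:    (N,) softmaxed attention over the N keys in a window
    q_probs: (S,) query pixel's superpixel membership probabilities
    k_probs: (N, S) each key pixel's membership probabilities

    The re-weight for key k is the probability that the query and key
    share a superpixel: sum_s q_probs[s] * k_probs[k, s].
    """
    same_sp = k_probs @ q_probs   # (N,) P(query and key in same superpixel)
    w = attn * same_sp
    return w / w.sum()            # renormalize to a distribution
```

With confident (one-hot) probabilities this reduces to the hard variant, which zeroes out keys from other superpixels; with uncertain probabilities it softly down-weights them instead.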
Computing attention maps for videos is challenging due to the motion of objects between frames. Small spatial inaccuracies significantly degrade the attention module's quality. Recent works propose using a deep network to correct these small inaccuracies. In this project, we efficiently implement a space-time grid search that outperforms existing deep neural network alternatives. The image on the left compares a no-shift search, a search using a deep network from related works, and our proposed shifted non-local search.
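The core of a shifted space-time grid search can be sketched as follows. This is a minimal single-patch illustration with our own function name and parameters, not the project's efficient implementation: the query patch in one frame is compared against a grid of candidate patches in the next frame, centered at a shifted location, and the offset with the smallest L2 patch distance wins.

```python
import numpy as np

def shifted_nonlocal_search(frame0, frame1, center, shift, radius, psize):
    """Illustrative space-time grid search (parameter names are ours).

    Compares the patch at `center` in frame0 against patches in frame1 on
    a grid of offsets of half-width `radius` around `center + shift`, and
    returns the offset with the smallest L2 patch distance.
    """
    def patch(img, y, x):
        r = psize // 2
        return img[y - r:y + r + 1, x - r:x + r + 1]

    cy, cx = center
    sy, sx = cy + shift[0], cx + shift[1]
    q = patch(frame0, cy, cx)
    best, best_off = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            k = patch(frame1, sy + dy, sx + dx)
            if k.shape != q.shape:
                continue  # skip grid points whose patch falls out of bounds
            d = np.sum((q - k) ** 2)
            if d < best:
                best, best_off = d, (dy, dx)
    return best_off
```

A "no-shift" search corresponds to `shift=(0, 0)`; supplying a coarse motion estimate as `shift` recenters the grid so a small `radius` suffices.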