Dec 6, 2022
Speaker · 0 followers
Speaker · 0 followers
Speaker · 0 followers
Speaker · 0 followers
Recent studies show that transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency information to transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel splitting mechanism to adopt parallel convolution/max-pooling path and self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play more roles in capturing high-frequency details while top layers more in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those to the low-frequency mixer, which can effectively trade-off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation. For example, our iFormer-S hits the top-1 accuracy of 83.4Recent studies show that transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution an…
Account · 952 followers
Professional recording and live streaming, delivered globally.
Presentations on similar topic, category or speaker
Total of 0 viewers voted for saving the presentation to eternal vault which is 0.0%
Total of 0 viewers voted for saving the presentation to eternal vault which is 0.0%
Puyu Wang, …
Total of 0 viewers voted for saving the presentation to eternal vault which is 0.0%
Total of 0 viewers voted for saving the presentation to eternal vault which is 0.0%
Marcel Binz, …
Total of 0 viewers voted for saving the presentation to eternal vault which is 0.0%
Hualin Zhang, …
Total of 0 viewers voted for saving the presentation to eternal vault which is 0.0%