Comment by stereoplegic:

@librarian-bot recommend
Reply by librarian-bot:

This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Pseudo Labelling for Enhanced Masked Autoencoders](https://huggingface.co/papers/2406.17450) (2024)
* [How Lightweight Can A Vision Transformer Be](https://huggingface.co/papers/2407.17783) (2024)
* [SIGMA: Sinkhorn-Guided Masked Video Modeling](https://huggingface.co/papers/2407.15447) (2024)
* [Unified Auto-Encoding with Masked Diffusion](https://huggingface.co/papers/2406.17688) (2024)
* [Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning](https://huggingface.co/papers/2407.05862) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2024-08-15T22:03:31.149Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":263}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7434622049331665},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"66be7b2d9284c8209f766821"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2207.07611","authors":[{"_id":"6560069b4a5a63bc00afaf28","user":{"_id":"675bf04240726aac79ae8e7b","avatarUrl":"/avatars/8ae95255dfce91a6f5e76f14eae9e9e5.svg","isPro":false,"fullname":"Shuangfei Zhai","user":"shuangfei","type":"user"},"name":"Shuangfei Zhai","status":"claimed_verified","statusLastChangedAt":"2024-12-13T15:24:55.172Z","hidden":false},{"_id":"6560069b4a5a63bc00afaf29","name":"Navdeep Jaitly","hidden":false},{"_id":"6560069b4a5a63bc00afaf2a","name":"Jason Ramapuram","hidden":false},{"_id":"6560069b4a5a63bc00afaf2b","user":{"_id":"64c3726f2a5eaefd000cdedd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64c3726f2a5eaefd000cdedd/iwFifH1sWQy7agW3eTmNQ.png","isPro":false,"fullname":"Dan Busbridge","user":"dbusbridge","type":"user"},"name":"Dan Busbridge","status":"claimed_verified","statusLastChangedAt":"2025-02-13T20:36:35.850Z","hidden":false},{"_id":"6560069b4a5a63bc00afaf2c","name":"Tatiana Likhomanenko","hidden":false},{"_id":"6560069b4a5a63bc00afaf2d","name":"Joseph Yitan Cheng","hidden":false},{"_id":"6560069b4a5a63bc00afaf2e","name":"Walter Talbott","hidden":false},{"_id":"6560069b4a5a63bc00afaf2f","name":"Chen Huang","hidden":false},{"_id":"6560069b4a5a63bc00afaf30","name":"Hanlin Goh","hidden":false},{"_id":"6560069b4a5a63bc00afaf31","user":{"_id":"6470d2247fd7ecdbd0ec3cc9","avatarUrl":"/avatars/52c5eca12499a1aa9bd49c43d4f20685.svg","isPro":false,"fullname":"Joshua M. Susskind","user":"jsusskind","type":"user"},"name":"Joshua Susskind","status":"claimed_verified","statusLastChangedAt":"2024-01-17T16:20:15.219Z","hidden":false}],"publishedAt":"2022-07-15T17:10:48.000Z","title":"Position Prediction as an Effective Pretraining Strategy","summary":"Transformers have gained increasing popularity in a wide range of\napplications, including Natural Language Processing (NLP), Computer Vision and\nSpeech Recognition, because of their powerful representational capacity.\nHowever, harnessing this representational capacity effectively requires a large\namount of data, strong regularization, or both, to mitigate overfitting.\nRecently, the power of the Transformer has been unlocked by self-supervised\npretraining strategies based on masked autoencoders which rely on\nreconstructing masked inputs, directly, or contrastively from unmasked content.\nThis pretraining strategy which has been used in BERT models in NLP, Wav2Vec\nmodels in Speech and, recently, in MAE models in Vision, forces the model to\nlearn about relationships between the content in different parts of the input\nusing autoencoding related objectives. In this paper, we propose a novel, but\nsurprisingly simple alternative to content reconstruction~-- that of predicting\nlocations from content, without providing positional information for it. 
Doing\nso requires the Transformer to understand the positional relationships between\ndifferent parts of the input, from their content alone. This amounts to an\nefficient implementation where the pretext task is a classification problem\namong all possible positions for each input token. We experiment on both Vision\nand Speech benchmarks, where our approach brings improvements over strong\nsupervised training baselines and is comparable to modern\nunsupervised/self-supervised pretraining methods. Our method also enables\nTransformers trained without position embeddings to outperform ones trained\nwith full position information.","upvotes":1,"discussionId":"6560069e4a5a63bc00afaf7c","ai_summary":"The paper proposes a novel self-supervised pretraining method for Transformers by predicting token locations from content, improving performance on vision and speech benchmarks without using positional embeddings.","ai_keywords":["Transformers","self-supervised pretraining","masked autoencoders","content reconstruction","positional relationships","pretext task","vision benchmarks","speech benchmarks","position embeddings"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"}],"acceptLanguages":["*"]}">
AI-generated summary

The paper proposes a novel self-supervised pretraining method for Transformers that predicts token locations from content alone, improving performance on vision and speech benchmarks without using positional embeddings.
Abstract

Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision, and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Transformer has been unlocked by self-supervised pretraining strategies based on masked autoencoders, which rely on reconstructing masked inputs, directly or contrastively, from unmasked content. This pretraining strategy, which has been used in BERT models in NLP, Wav2Vec models in Speech, and, recently, in MAE models in Vision, forces the model to learn about relationships between the content in different parts of the input using autoencoding-related objectives. In this paper, we propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it. Doing so requires the Transformer to understand the positional relationships between different parts of the input from their content alone. This amounts to an efficient implementation where the pretext task is a classification problem among all possible positions for each input token. We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. Our method also enables Transformers trained without position embeddings to outperform ones trained with full position information.
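To make the pretext task concrete, below is a minimal PyTorch sketch of position prediction framed as an N-way classification over all possible token positions, assuming a content-only Transformer encoder with no positional embeddings. This is not the authors' implementation; the class, function, and parameter names are illustrative.

```python
# Minimal sketch (assumption, not the paper's code) of the position-prediction
# pretext task: token content embeddings are encoded WITHOUT positional
# embeddings, and a linear head classifies each token's position among all
# num_positions possibilities.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionPredictionPretrainer(nn.Module):
    def __init__(self, num_positions: int, dim: int = 256, depth: int = 4, heads: int = 4):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Note: no positional embedding is added anywhere in this model.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Pretext head: an N-way classifier over all possible token positions.
        self.position_head = nn.Linear(dim, num_positions)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (batch, num_tokens, dim) content embeddings.
        Returns per-token logits over the num_positions classes."""
        features = self.encoder(tokens)       # content-only features
        return self.position_head(features)   # (batch, num_tokens, num_positions)


def position_prediction_loss(model: nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted and true positions for every token."""
    batch, num_tokens, _ = tokens.shape
    # The ground-truth target for each token is simply its index in the sequence.
    targets = torch.arange(num_tokens).expand(batch, num_tokens)
    logits = model(tokens)                    # (batch, num_tokens, num_positions)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


if __name__ == "__main__":
    # Toy example: 196 "patch" tokens (a 14x14 grid) with 256-dim content embeddings.
    model = PositionPredictionPretrainer(num_positions=196)
    dummy_tokens = torch.randn(2, 196, 256)
    loss = position_prediction_loss(model, dummy_tokens)
    loss.backward()
    print(f"pretext loss: {loss.item():.3f}")
```

Because the encoder sees no positional information, its output is permutation-equivariant, so the cross-entropy over position logits can only be minimized if the content of each token reveals where it belongs, which is exactly the pressure the pretext task is meant to apply.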