Comment by stereoplegic:

@librarian-bot recommend
Reply by librarian-bot:

This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Pseudo Labelling for Enhanced Masked Autoencoders](https://huggingface.co/papers/2406.17450) (2024)
* [How Lightweight Can A Vision Transformer Be](https://huggingface.co/papers/2407.17783) (2024)
* [SIGMA: Sinkhorn-Guided Masked Video Modeling](https://huggingface.co/papers/2407.15447) (2024)
* [Unified Auto-Encoding with Masked Diffusion](https://huggingface.co/papers/2406.17688) (2024)
* [Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning](https://huggingface.co/papers/2407.05862) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2024-08-15T22:03:31.149Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":263}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7434622049331665},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"66be7b2d9284c8209f766821"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2207.07611","authors":[{"_id":"6560069b4a5a63bc00afaf28","user":{"_id":"675bf04240726aac79ae8e7b","avatarUrl":"/avatars/8ae95255dfce91a6f5e76f14eae9e9e5.svg","isPro":false,"fullname":"Shuangfei Zhai","user":"shuangfei","type":"user"},"name":"Shuangfei Zhai","status":"claimed_verified","statusLastChangedAt":"2024-12-13T15:24:55.172Z","hidden":false},{"_id":"6560069b4a5a63bc00afaf29","name":"Navdeep Jaitly","hidden":false},{"_id":"6560069b4a5a63bc00afaf2a","name":"Jason Ramapuram","hidden":false},{"_id":"6560069b4a5a63bc00afaf2b","user":{"_id":"64c3726f2a5eaefd000cdedd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64c3726f2a5eaefd000cdedd/iwFifH1sWQy7agW3eTmNQ.png","isPro":false,"fullname":"Dan Busbridge","user":"dbusbridge","type":"user"},"name":"Dan Busbridge","status":"claimed_verified","statusLastChangedAt":"2025-02-13T20:36:35.850Z","hidden":false},{"_id":"6560069b4a5a63bc00afaf2c","name":"Tatiana Likhomanenko","hidden":false},{"_id":"6560069b4a5a63bc00afaf2d","name":"Joseph Yitan Cheng","hidden":false},{"_id":"6560069b4a5a63bc00afaf2e","name":"Walter Talbott","hidden":false},{"_id":"6560069b4a5a63bc00afaf2f","name":"Chen Huang","hidden":false},{"_id":"6560069b4a5a63bc00afaf30","name":"Hanlin Goh","hidden":false},{"_id":"6560069b4a5a63bc00afaf31","user":{"_id":"6470d2247fd7ecdbd0ec3cc9","avatarUrl":"/avatars/52c5eca12499a1aa9bd49c43d4f20685.svg","isPro":false,"fullname":"Joshua M. Susskind","user":"jsusskind","type":"user"},"name":"Joshua Susskind","status":"claimed_verified","statusLastChangedAt":"2024-01-17T16:20:15.219Z","hidden":false}],"publishedAt":"2022-07-15T17:10:48.000Z","title":"Position Prediction as an Effective Pretraining Strategy","summary":"Transformers have gained increasing popularity in a wide range of\napplications, including Natural Language Processing (NLP), Computer Vision and\nSpeech Recognition, because of their powerful representational capacity.\nHowever, harnessing this representational capacity effectively requires a large\namount of data, strong regularization, or both, to mitigate overfitting.\nRecently, the power of the Transformer has been unlocked by self-supervised\npretraining strategies based on masked autoencoders which rely on\nreconstructing masked inputs, directly, or contrastively from unmasked content.\nThis pretraining strategy which has been used in BERT models in NLP, Wav2Vec\nmodels in Speech and, recently, in MAE models in Vision, forces the model to\nlearn about relationships between the content in different parts of the input\nusing autoencoding related objectives. In this paper, we propose a novel, but\nsurprisingly simple alternative to content reconstruction~-- that of predicting\nlocations from content, without providing positional information for it. 
Doing\nso requires the Transformer to understand the positional relationships between\ndifferent parts of the input, from their content alone. This amounts to an\nefficient implementation where the pretext task is a classification problem\namong all possible positions for each input token. We experiment on both Vision\nand Speech benchmarks, where our approach brings improvements over strong\nsupervised training baselines and is comparable to modern\nunsupervised/self-supervised pretraining methods. Our method also enables\nTransformers trained without position embeddings to outperform ones trained\nwith full position information.","upvotes":1,"discussionId":"6560069e4a5a63bc00afaf7c","ai_summary":"The paper proposes a novel self-supervised pretraining method for Transformers by predicting token locations from content, improving performance on vision and speech benchmarks without using positional embeddings.","ai_keywords":["Transformers","self-supervised pretraining","masked autoencoders","content reconstruction","positional relationships","pretext task","vision benchmarks","speech benchmarks","position embeddings"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"}],"acceptLanguages":["*"]}">
AI-generated summary

The paper proposes a novel self-supervised pretraining method for Transformers that predicts token locations from content alone, improving performance on vision and speech benchmarks without using positional embeddings.
Abstract

Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision, and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Transformer has been unlocked by self-supervised pretraining strategies based on masked autoencoders, which rely on reconstructing masked inputs, directly or contrastively, from unmasked content. This pretraining strategy, which has been used in BERT models in NLP, Wav2Vec models in Speech, and, recently, in MAE models in Vision, forces the model to learn about relationships between the content in different parts of the input using autoencoding-related objectives. In this paper, we propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it. Doing so requires the Transformer to understand the positional relationships between different parts of the input from their content alone. This amounts to an efficient implementation where the pretext task is a classification problem among all possible positions for each input token. We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. Our method also enables Transformers trained without position embeddings to outperform ones trained with full position information.
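To make the pretext task concrete, below is a minimal PyTorch sketch of position prediction framed as an N-way classification over all possible token positions, assuming a content-only Transformer encoder with no positional embeddings. This is not the authors' implementation; the class, function, and parameter names are illustrative.

```python
# Minimal sketch (assumption, not the paper's code) of the position-prediction
# pretext task: token content embeddings are encoded WITHOUT positional
# embeddings, and a linear head classifies each token's position among all
# num_positions possibilities.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionPredictionPretrainer(nn.Module):
    def __init__(self, num_positions: int, dim: int = 256, depth: int = 4, heads: int = 4):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Note: no positional embedding is added anywhere in this model.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Pretext head: an N-way classifier over all possible token positions.
        self.position_head = nn.Linear(dim, num_positions)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (batch, num_tokens, dim) content embeddings.
        Returns per-token logits over the num_positions classes."""
        features = self.encoder(tokens)       # content-only features
        return self.position_head(features)   # (batch, num_tokens, num_positions)


def position_prediction_loss(model: nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted and true positions for every token."""
    batch, num_tokens, _ = tokens.shape
    # The ground-truth target for each token is simply its index in the sequence.
    targets = torch.arange(num_tokens).expand(batch, num_tokens)
    logits = model(tokens)                    # (batch, num_tokens, num_positions)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


if __name__ == "__main__":
    # Toy example: 196 "patch" tokens (a 14x14 grid) with 256-dim content embeddings.
    model = PositionPredictionPretrainer(num_positions=196)
    dummy_tokens = torch.randn(2, 196, 256)
    loss = position_prediction_loss(model, dummy_tokens)
    loss.backward()
    print(f"pretext loss: {loss.item():.3f}")
```

Because the encoder sees no positional information, its output is permutation-equivariant, so the cross-entropy over position logits can only be minimized if the content of each token reveals where it belongs, which is exactly the pressure the pretext task is meant to apply.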