
arxiv:2507.23268

PixNerd: Pixel Neural Field Diffusion

Published on Jul 31
· Submitted by wangshuai on Aug 4
#3 Paper of the day
Authors: Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang
Abstract

Pixel Neural Field Diffusion (PixNerd) achieves high-quality image generation in a single-scale, single-stage process without VAEs or complex pipelines, and extends to text-to-image applications with competitive performance.

AI-generated summary

The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space, at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
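To make the patch-wise neural-field decoding concrete, here is a minimal, hypothetical sketch (not the authors' code): a tiny coordinate MLP, conditioned on a patch token from the transformer, maps normalized pixel coordinates inside a 16×16 patch to RGB. All sizes, the random weights, and the concatenation-based conditioning are illustrative assumptions; PixNerd's actual conditioning mechanism may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 16    # patch size (PixNerd-*/16 uses 16x16-pixel patches)
DIM = 64      # patch-token width (illustrative; the real model is larger)
HIDDEN = 32   # width of the per-patch coordinate MLP

def decode_patch(token, w1, b1, w2, b2):
    """Evaluate a coordinate MLP at every pixel of one patch.

    token: (DIM,) patch embedding from the transformer
    returns: (PATCH, PATCH, 3) RGB values in [-1, 1]
    """
    # Normalized pixel-center coordinates in [-1, 1].
    xs = (np.arange(PATCH) + 0.5) / PATCH * 2 - 1
    yy, xx = np.meshgrid(xs, xs, indexing="ij")
    coords = np.stack([xx, yy], axis=-1).reshape(-1, 2)        # (256, 2)

    # Condition on the token by concatenating it to every coordinate
    # (an assumption; other conditioning schemes are possible).
    feats = np.concatenate(
        [coords, np.broadcast_to(token, (coords.shape[0], DIM))], axis=-1
    )                                                          # (256, 2 + DIM)
    h = np.maximum(feats @ w1 + b1, 0.0)                       # ReLU hidden layer
    rgb = np.tanh(h @ w2 + b2)                                 # squash to [-1, 1]
    return rgb.reshape(PATCH, PATCH, 3)

# Randomly initialized weights stand in for learned ones.
w1 = rng.normal(0, 0.1, (2 + DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
w2 = rng.normal(0, 0.1, (HIDDEN, 3));       b2 = np.zeros(3)

patch_rgb = decode_patch(rng.normal(size=DIM), w1, b1, w2, b2)
print(patch_rgb.shape)  # (16, 16, 3)
```

Each patch is decoded independently from its own token, which is what makes the decoding a per-patch neural field rather than a global convolutional decoder.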

Community

Paper author Paper submitter

A new speedy pixel diffusion transformer with neural field!!

TL;DR: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space, at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.

Paper author Paper submitter

Online Space for text-to-image: https://huggingface.co/spaces/MCG-NJU/PixNerd

Paper author Paper submitter

Revision of the inference time statistics


| Model | Inference: 1 image | Inference: 1 step | Inference: Mem (GB) | Training: Speed (s/it) | Training: Mem (GB) |
|---|---|---|---|---|---|
| SiT-L/2 (VAE-f8) | 0.51 s | **0.0097 s** | 2.9 | 0.30 | 18.4 |
| Baseline-L/16 | 0.48 s | **0.0097 s** | 2.1 | 0.18 | 18 |
| PixNerd-L/16 | 0.51 s | 0.010 s | 2.1 | 0.19 | 22 |

We are deeply sorry for this mistake: the single-step inference times of SiT-L/2 and Baseline-L were each missing a zero (0.097 s should be 0.0097 s). The single-step inference times of PixNerd and the baseline are close.
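As a side note on how per-step latency figures like these are usually obtained, here is a hedged sketch (not the authors' benchmarking code): run warm-up iterations first, then average wall-clock time over many steps. `time_per_step` and `fake_step` are hypothetical names, and on a GPU you would also need to synchronize around the timed region (e.g. `torch.cuda.synchronize()`) so that asynchronous kernels are counted.

```python
import time

def time_per_step(model_step, n_warmup=10, n_steps=100):
    """Average wall-clock seconds per call of model_step."""
    for _ in range(n_warmup):      # warm-up: caches, JIT, clock ramp-up
        model_step()
    t0 = time.perf_counter()
    for _ in range(n_steps):
        model_step()
    return (time.perf_counter() - t0) / n_steps

# Toy stand-in workload; replace with one real denoising step.
fake_step = lambda: sum(i * i for i in range(10_000))
print(f"{time_per_step(fake_step):.6f} s/step")
```

Averaging over many steps after a warm-up is what makes a ~0.01 s/step figure distinguishable from timer noise.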

Paper author

Since the arXiv paper has been updated, I have closed this issue. Feel free to reopen!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think](https://huggingface.co/papers/2507.01467) (2025)
* [DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning](https://huggingface.co/papers/2506.09644) (2025)
* [Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation](https://huggingface.co/papers/2506.21022) (2025)
* [Taming Diffusion Transformer for Real-Time Mobile Video Generation](https://huggingface.co/papers/2507.13343) (2025)
* [Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation](https://huggingface.co/papers/2506.18999) (2025)
* [Pyramidal Patchification Flow for Visual Generation](https://huggingface.co/papers/2506.23543) (2025)
* [DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer](https://huggingface.co/papers/2507.04947) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/pixnerd-pixel-neural-field-diffusion


Models citing this paper 2

Datasets citing this paper 0


Spaces citing this paper 2

Collections including this paper 9
