\n","updatedAt":"2023-12-01T22:33:16.496Z","author":{"_id":"62fa41d0363251ee40a2915d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62fa41d0363251ee40a2915d/AWbQCvPkxujxR5BCfCniz.jpeg","fullname":"Viktor Toth","name":"vtoth","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.9129307270050049},"editors":["vtoth"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/62fa41d0363251ee40a2915d/AWbQCvPkxujxR5BCfCniz.jpeg"],"reactions":[],"isReport":false}}],"pinned":false,"locked":false,"collection":"discussions","isPullRequest":false,"isReport":false},"repo":{"name":"stabilityai/stable-diffusion-2-1","type":"model"},"activeTab":"discussion","discussionRole":0,"watched":false,"muted":false,"repoDiscussionsLocked":false}">Question in the Text encoder setting
\n","updatedAt":"2023-12-01T22:33:16.496Z","author":{"_id":"62fa41d0363251ee40a2915d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62fa41d0363251ee40a2915d/AWbQCvPkxujxR5BCfCniz.jpeg","fullname":"Viktor Toth","name":"vtoth","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.9129307270050049},"editors":["vtoth"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/62fa41d0363251ee40a2915d/AWbQCvPkxujxR5BCfCniz.jpeg"],"reactions":[],"isReport":false}}],"pinned":false,"locked":false,"collection":"discussions","isPullRequest":false,"isReport":false},"primaryEmailConfirmed":false,"repo":{"name":"stabilityai/stable-diffusion-2-1","type":"model"},"discussionRole":0,"acceptLanguages":["*"],"hideComments":true,"repoDiscussionsLocked":false,"isDiscussionAuthor":false}">Hi,
I think there may be a problem in how the text encoder is set up, though I'm not sure why this occurs.
In particular, the number of hidden layers in the text encoder is set to 23 (https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/text_encoder/config.json#L19). However, in the official OpenCLIP ViT-H-14 config, the number of hidden layers is 24 (https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/ViT-H-14.json#L15). This can also be confirmed from the number of layers in the LAION CLIP ViT-H-14 repo: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/blob/main/config.json#L54
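For reference, here is a minimal sketch (assuming the `transformers` library is installed and both repos are publicly accessible) that reproduces the discrepancy directly from the two published configs:

```python
from transformers import CLIPTextConfig

# SD 2.1 ships its own text-encoder config in the text_encoder/ subfolder...
sd_cfg = CLIPTextConfig.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="text_encoder"
)
# ...while the LAION OpenCLIP ViT-H-14 checkpoint keeps the full stack.
laion_cfg = CLIPTextConfig.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
)

print(sd_cfg.num_hidden_layers)     # 23
print(laion_cfg.num_hidden_layers)  # 24
```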
Does anyone know why the Hugging Face repo sets the number of hidden layers to 23? Is this a bug, or a small trick to improve sampling performance?
Thanks
Could this possibly be about the last projection layer being removed from / not used in SD, since SD takes the 77x1024 text embedding as input rather than the final CLIP projection of dim 1024?
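To illustrate that distinction, here is a hedged sketch (assuming the `transformers` library and the public LAION checkpoint; not an authoritative account of the SD pipeline) showing that the per-token hidden states are a 77x1024 sequence, while the final CLIP projection is a single 1024-dim vector:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

repo = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
tokenizer = CLIPTokenizer.from_pretrained(repo)
model = CLIPTextModelWithProjection.from_pretrained(repo)

tokens = tokenizer("a photo of a cat", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    out = model(**tokens, output_hidden_states=True)

# Per-token hidden states form a sequence; the penultimate layer's output
# is hidden_states[-2] (hidden_states[0] is the embedding layer output).
print(out.hidden_states[-2].shape)  # torch.Size([1, 77, 1024])
# The pooled, projected CLIP text embedding is a single vector per prompt.
print(out.text_embeds.shape)        # torch.Size([1, 1024])
```

Since SD consumes the 77x1024 sequence and never the pooled projection, dropping the final block from the config would be consistent with conditioning on an earlier (penultimate) layer rather than the full 24-layer output, though I can't say for certain that this was the intent.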