Tokenizer inconsistencies in GemmaTokenizerFast
The Hugging Face tokenizer gives different results from the SentencePiece tokenizer, probably due to a regex preprocessor.
Some notable affected tokens include HTML tags, which seem to have been added manually to the vocabulary.
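from transformers import AutoTokenizer

# AutoTokenizer loads the fast tokenizer (GemmaTokenizerFast) by default
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
token_ids = tokenizer.encode('What is <tbody>? "<tbody>" is an html tag')
# Pair each token id with its decoded piece
[(i, tokenizer.decode([i])) for i in token_ids]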
gives
[(2, '<bos>'),
(1841, 'What'),
(603, ' is'),
(968, ' <'),
(80309, 'tbody'),
(93540, '>?'),
(15114, ' "<'),
(80309, 'tbody'),
(28760, '>"'),
(603, ' is'),
(671, ' an'),
(11060, ' html'),
(5886, ' tag')]
Whereas using
import sentencepiece as spm

# Load the original SentencePiece model file directly
vocab = spm.SentencePieceProcessor()
vocab.Load("gemma_tokenizer.model")
input_ids = vocab.EncodeAsIds('What is <tbody>? "<tbody>" is an html tag')
[(i, vocab.DecodeIds([i])) for i in input_ids]
gives
[(1841, 'What'),
(603, ' is'),
(235248, ' '),
(172, '<tbody>'),
(235336, '?'),
(664, ' "'),
(172, '<tbody>'),
(235281, '"'),
(603, ' is'),
(671, ' an'),
(11060, ' html'),
(5886, ' tag')]
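To make the mismatch concrete, here is a minimal sketch that works only on the two outputs listed above, so it needs no model download. The helper names `detokenize` and `first_divergence` are made up for illustration; the `(id, piece)` pairs are copied from the post, and the leading `<bos>` is dropped from the Hugging Face output so both sequences start at the same place.

```python
# (id, piece) pairs copied from the two outputs above.
hf_fast = [(2, '<bos>'), (1841, 'What'), (603, ' is'), (968, ' <'),
           (80309, 'tbody'), (93540, '>?'), (15114, ' "<'), (80309, 'tbody'),
           (28760, '>"'), (603, ' is'), (671, ' an'), (11060, ' html'),
           (5886, ' tag')]
spm_ref = [(1841, 'What'), (603, ' is'), (235248, ' '), (172, '<tbody>'),
           (235336, '?'), (664, ' "'), (172, '<tbody>'), (235281, '"'),
           (603, ' is'), (671, ' an'), (11060, ' html'), (5886, ' tag')]


def detokenize(pairs):
    """Concatenate the decoded pieces back into one string."""
    return ''.join(piece for _, piece in pairs)


def first_divergence(a, b):
    """Return the first index where two id sequences differ, or None if equal."""
    for idx, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return idx
    # One sequence is a strict prefix of the other, or they are identical.
    return None if len(a) == len(b) else min(len(a), len(b))


# Drop <bos>, which only the Hugging Face tokenizer prepends here.
hf_no_bos = hf_fast[1:]
hf_ids = [i for i, _ in hf_no_bos]
spm_ids = [i for i, _ in spm_ref]

same_text = detokenize(hf_no_bos) == detokenize(spm_ref)
split_at = first_divergence(hf_ids, spm_ids)

print("same decoded text:", same_text)
print("first differing position:", split_at)
print("uses of token 172 (hf, spm):", hf_ids.count(172), spm_ids.count(172))
```

Both sequences decode to the same string, so the difference is purely in segmentation: the SentencePiece output uses the dedicated `<tbody>` token (id 172) twice, while the fast tokenizer output never uses it and splits the tag into smaller pieces instead.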
I think the problem arises when we use the `AutoTokenizer` class, which instantiates `GemmaTokenizerFast` by default. I think `GemmaTokenizer` (the slow tokenizer) tokenizes text the same way as the original spm tokenizer.
Thanks @PedramR, indeed
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", use_fast=False)
does give token 172. I've updated the title.
We're looking into this now, thanks for raising! Should have an update soon.
If you pull the latest `transformers` code, the two tokenizers should produce the same sequence. FYI, the fix is https://github.com/huggingface/transformers/pull/29473.
Yes, @minyiccp @sanderland that should be the fix here. Let us know if it doesn't work!
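One way to check whether an installed `transformers` version includes the fix is to run the same probe strings through both tokenizers and list any disagreements. The sketch below is an illustration under stated assumptions, not something from the thread: `find_mismatches` and `load_gemma_encoders` are hypothetical helper names, and the probe list is just the example sentence from the original post.

```python
from typing import Callable, List, Sequence, Tuple

Encoder = Callable[[str], List[int]]


def find_mismatches(
    encode_a: Encoder, encode_b: Encoder, texts: Sequence[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, ids_a, ids_b) for every text the two encoders disagree on."""
    mismatches = []
    for text in texts:
        ids_a, ids_b = list(encode_a(text)), list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


def load_gemma_encoders(model_id: str = "google/gemma-7b") -> Tuple[Encoder, Encoder]:
    """Hypothetical helper: return (fast_encode, slow_encode) for the model.

    transformers is imported lazily so find_mismatches can be used without it.
    """
    from transformers import AutoTokenizer

    fast = AutoTokenizer.from_pretrained(model_id)                  # GemmaTokenizerFast
    slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)  # GemmaTokenizer
    return fast.encode, slow.encode


# Probe string taken from the original post; extend with other HTML tags as needed.
PROBES = ['What is <tbody>? "<tbody>" is an html tag']
```

With access to the model repository, `find_mismatches(*load_gemma_encoders(), PROBES)` would be expected to return an empty list once the fix is installed, and a non-empty list on an affected version.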