Tokenizer inconsistencies in GemmaTokenizerFast

#76

by sanderland - opened Mar 19, 2024

Discussion

sanderland

Mar 19, 2024

The Hugging Face tokenizer gives different results from the SentencePiece tokenizer, probably due to a regex preprocessor.
Notable tokens affected include HTML tags, which seem to have been added manually to the vocabulary.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
token_ids = tokenizer.encode('What is <tbody>? "<tbody>" is an html tag')
[(i, tokenizer.decode([i])) for i in token_ids]

gives

[(2, '<bos>'),
 (1841, 'What'),
 (603, ' is'),
 (968, ' <'),
 (80309, 'tbody'),
 (93540, '>?'),
 (15114, ' "<'),
 (80309, 'tbody'),
 (28760, '>"'),
 (603, ' is'),
 (671, ' an'),
 (11060, ' html'),
 (5886, ' tag')]

Whereas using

import sentencepiece as spm
vocab = spm.SentencePieceProcessor()
vocab.Load("gemma_tokenizer.model")
input_ids = vocab.EncodeAsIds('What is <tbody>? "<tbody>" is an html tag')
[(i, vocab.DecodeIds([i])) for i in input_ids]

gives

[(1841, 'What'),
 (603, ' is'),
 (235248, ' '),
 (172, '<tbody>'),
 (235336, '?'),
 (664, ' "'),
 (172, '<tbody>'),
 (235281, '"'),
 (603, ' is'),
 (671, ' an'),
 (11060, ' html'),
 (5886, ' tag')]
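To make the difference easier to see, here is a minimal sketch that diffs the two ID sequences listed above. The IDs are copied from the two outputs; the leading `<bos>` (id 2) is dropped from the Hugging Face list so both start at 'What'. The helper name `differing_spans` is only illustrative.

```python
from difflib import SequenceMatcher

# Token IDs copied from the two outputs above (without the leading <bos>).
hf_ids = [1841, 603, 968, 80309, 93540, 15114, 80309, 28760, 603, 671, 11060, 5886]
spm_ids = [1841, 603, 235248, 172, 235336, 664, 172, 235281, 603, 671, 11060, 5886]


def differing_spans(a, b):
    """Return (a_part, b_part) pairs for every region where the sequences disagree."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


spans = differing_spans(hf_ids, spm_ids)
for hf_part, spm_part in spans:
    print("HF: ", hf_part)
    print("SPM:", spm_part)
```

The only differing region is the one around the two `<tbody>` occurrences: the SentencePiece side contains id 172 twice, while the Hugging Face side never produces it.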

PedramR

Mar 20, 2024

I think the problem arises when we use the AutoTokenizer class, which instantiates GemmaTokenizerFast by default. I think the slow GemmaTokenizer tokenizes text the same way as the original spm tokenizer.

sanderland

Mar 21, 2024

Thanks @PedramR, indeed

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", use_fast=False)

does give token 172. I've updated the title.

sanderland changed discussion title from Tokenizer inconsistencies to Tokenizer inconsistencies in GemmaFastTokenizer Mar 21, 2024

sanderland changed discussion title from Tokenizer inconsistencies in GemmaFastTokenizer to Tokenizer inconsistencies in GemmaTokenizerFast Mar 21, 2024

minyichen

Mar 22, 2024

@suryabhupa Can you fix this? Thanks!

suryabhupa

Google org Mar 25, 2024

We're looking into this now, thanks for raising! Should have an update soon.

postmasters

Google org Mar 25, 2024

If you pull latest transformers code, the two tokenizers should produce the same sequence. FYI, the fix is https://github.com/huggingface/transformers/pull/29473.
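One way to check this after upgrading is sketched below. It assumes `transformers` and `sentencepiece` are installed and that you have access to the `google/gemma-7b` repo; the function names `find_mismatches` and `load_gemma_pair` are made up for this example and are not part of any library.

```python
def find_mismatches(fast, slow, texts):
    """Return (text, fast_ids, slow_ids) for every text the two tokenizers encode differently."""
    mismatches = []
    for text in texts:
        fast_ids = list(fast.encode(text))
        slow_ids = list(slow.encode(text))
        if fast_ids != slow_ids:
            mismatches.append((text, fast_ids, slow_ids))
    return mismatches


def load_gemma_pair(name="google/gemma-7b"):
    """Load the fast and slow tokenizers for the given model repo.

    Assumption: transformers and sentencepiece are installed and the
    (gated) model repo is accessible from this machine.
    """
    # Imported lazily so find_mismatches itself has no dependencies.
    from transformers import AutoTokenizer

    fast = AutoTokenizer.from_pretrained(name, use_fast=True)
    slow = AutoTokenizer.from_pretrained(name, use_fast=False)
    return fast, slow
```

Calling `find_mismatches(*load_gemma_pair(), ['What is <tbody>? "<tbody>" is an html tag'])` should return an empty list if the installed `transformers` version includes the fix.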

suryabhupa

Google org Mar 26, 2024

Yes, @minyiccp @sanderland that should be the fix here -- let us know if it doesn't work!
