A new idea to improve training and inference performance
Hello Google Team, I have an idea to significantly improve LLM performance: https://www.kaggle.com/code/vasilypodorov/fast-language-modelling-with-un-formers
There I trained a 0.7B-parameter LLM with a new architecture at a throughput of approximately 0.7B tokens per hour on a TPU v3-8. The details are in the article referenced above.
Could you read it and tell me whether it makes sense? I would also like you to try training a small LLM based on this technique to decide whether it is useful. This would take just a few TPU days.
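For a rough sense of the compute involved, here is a back-of-the-envelope sketch based on the throughput reported above; the 20-tokens-per-parameter ratio is a common rule of thumb I am assuming, not a figure from the article:

```python
# Rough estimate of wall-clock training time on a single TPU v3-8,
# using the throughput reported above (~0.7B tokens/hour).
throughput_tokens_per_hour = 0.7e9   # reported throughput on TPU v3-8
params = 0.7e9                       # model size from the article
tokens_per_param = 20                # Chinchilla-style rule of thumb (assumption)

target_tokens = params * tokens_per_param           # ~14B tokens
hours = target_tokens / throughput_tokens_per_hour   # ~20 hours
print(f"~{target_tokens / 1e9:.0f}B tokens, ~{hours / 24:.1f} TPU v3-8 days")
```

At this rate even a compute-optimal token budget for a model of this size fits comfortably within a few TPU days.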