Transformers documentation

GPTQ

To try GPTQ quantization with PEFT, see this notebook, and read this blog post for more details!

The AutoGPTQ library implements the GPTQ algorithm, a post-training quantization technique that quantizes each row of the weight matrix independently to find a version of the weights that minimizes the error. The weights are quantized to int4 but restored to fp16 on the fly during inference. Because the int4 weights are dequantized in a fused kernel rather than in the GPU's global memory, this can cut memory usage by 4x. You can also expect faster inference, since a lower bit width takes less time to communicate.
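To make the int4 storage and on-the-fly restoration concrete, here is a minimal sketch in plain Python. It is not the GPTQ algorithm: it uses simple round-to-nearest with one scale per row, whereas GPTQ picks the quantized values so as to minimize the resulting error. The function names (`quantize_row_int4`, `dequantize_row`) and the sample weights are made up for illustration.

```python
# Illustrative only: round-to-nearest int4 quantization with one scale per row.
# GPTQ itself selects quantized values to minimize the error; this sketch only
# shows the storage format and the dequantize-at-inference step.

def quantize_row_int4(row):
    """Map a row of floats to integers in [-8, 7] plus a single float scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    """Restore approximate float weights from int4 values and the row scale."""
    return [v * scale for v in q]

# Hypothetical 2x4 weight matrix; each row is quantized independently.
weights = [
    [0.12, -0.50, 0.33, 0.07],
    [1.40, -0.20, 0.00, -0.70],
]

quantized = [quantize_row_int4(row) for row in weights]
restored = [dequantize_row(q, scale) for q, scale in quantized]

max_error = max(
    abs(w - r)
    for row, restored_row in zip(weights, restored)
    for w, r in zip(row, restored_row)
)

# 16-bit weights stored in 4 bits each (ignoring the small per-row scale).
storage_ratio = 16 / 4

print("int4 values:", [q for q, _ in quantized])
print("max abs error:", max_error)
print("storage ratio:", storage_ratio)
```

The storage ratio ignores the per-row scale, which is why real savings are slightly below 4x.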

Before you begin, make sure the following libraries are installed:

pip install auto-gptq
pip install --upgrade accelerate optimum transformers

To quantize a model (currently only text models are supported), create a GPTQConfig and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

์ž์‹ ์˜ ๋ฐ์ดํ„ฐ์…‹์„ ๋ฌธ์ž์—ด ๋ฆฌ์ŠคํŠธ ํ˜•ํƒœ๋กœ ์ „๋‹ฌํ•  ์ˆ˜๋„ ์žˆ์ง€๋งŒ, GPTQ ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ•œ ๋™์ผํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์„ ๊ฐ•๋ ฅํžˆ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค.

dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
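If you do pass your own data, the input is just a Python list of strings. The sketch below shows one possible way to tidy raw text into such a list. The helper name `prepare_calibration_texts` and the `min_chars` threshold are assumptions made for illustration and are not part of Transformers or AutoGPTQ.

```python
def prepare_calibration_texts(raw_texts, min_chars=20):
    """Collapse whitespace, drop very short samples, and remove duplicates.

    Hypothetical helper: the threshold is arbitrary and only illustrative.
    """
    seen = set()
    cleaned = []
    for text in raw_texts:
        normalized = " ".join(text.split())
        if len(normalized) < min_chars or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned

raw_texts = [
    "  auto-gptq is an   easy-to-use model quantization library.  ",
    "too short",
    "auto-gptq is an easy-to-use model quantization library.",
]
calibration_texts = prepare_calibration_texts(raw_texts)
print(calibration_texts)
```

The resulting list can then be passed as the `dataset` argument of GPTQConfig, in the same way as the `dataset` variable above.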

์–‘์žํ™”ํ•  ๋ชจ๋ธ์„ ๋กœ๋“œํ•˜๊ณ  gptq_config์„ from_pretrained() ๋ฉ”์†Œ๋“œ์— ์ „๋‹ฌํ•˜์„ธ์š”. ๋ชจ๋ธ์„ ๋ฉ”๋ชจ๋ฆฌ์— ๋งž์ถ”๊ธฐ ์œ„ํ•ด device_map="auto"๋ฅผ ์„ค์ •ํ•˜์—ฌ ๋ชจ๋ธ์„ ์ž๋™์œผ๋กœ CPU๋กœ ์˜คํ”„๋กœ๋“œํ•˜๊ณ , ์–‘์žํ™”๋ฅผ ์œ„ํ•ด ๋ชจ๋ธ ๋ชจ๋“ˆ์ด CPU์™€ GPU ๊ฐ„์— ์ด๋™ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)

๋ฐ์ดํ„ฐ์…‹์ด ๋„ˆ๋ฌด ์ปค์„œ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•œ ๊ฒฝ์šฐ๋ฅผ ๋Œ€๋น„ํ•œ ๋””์Šคํฌ ์˜คํ”„๋กœ๋“œ๋Š” ํ˜„์žฌ ์ง€์›ํ•˜์ง€ ์•Š๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿด ๋•Œ๋Š” max_memory ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋””๋ฐ”์ด์Šค(GPU ๋ฐ CPU)์—์„œ ์‚ฌ์šฉํ•  ๋ฉ”๋ชจ๋ฆฌ ์–‘์„ ํ• ๋‹นํ•ด ๋ณด์„ธ์š”:

quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config)
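The max_memory mapping in the call above uses integer GPU indices and a "cpu" key, each with a size string. As a small convenience sketch, the dictionary can be built from per-device budgets. The helper `build_max_memory` is hypothetical, and the budgets are the same example figures as above.

```python
def build_max_memory(gpu_gib, cpu_gib):
    """Build a max_memory mapping: GPU index -> budget string, plus a "cpu" entry.

    Hypothetical helper for illustration; budgets are given in GiB.
    """
    max_memory = {index: f"{gib}GiB" for index, gib in enumerate(gpu_gib)}
    max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory

# Two GPUs with 30 GiB and 46 GiB budgets, plus 30 GiB of CPU memory.
max_memory = build_max_memory(gpu_gib=[30, 46], cpu_gib=30)
print(max_memory)  # {0: '30GiB', 1: '46GiB', 'cpu': '30GiB'}
```

The result can be passed as `max_memory=max_memory` to from_pretrained() in place of the literal dictionary.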

How long it takes to quantize a model from scratch depends on your hardware and the number of model parameters. For example, quantizing the relatively lightweight facebook/opt-350m model takes about 5 minutes on a free-tier Google Colab GPU, while quantizing a 175B-parameter model on an NVIDIA A100 can take about 4 hours. Before you quantize a model, it is a good idea to check the Hub to see whether a GPTQ-quantized version already exists.

๋ชจ๋ธ์ด ์–‘์žํ™”๋˜๋ฉด, ๋ชจ๋ธ๊ณผ ํ† ํฌ๋‚˜์ด์ €๋ฅผ Hub์— ํ‘ธ์‹œํ•˜์—ฌ ์‰ฝ๊ฒŒ ๊ณต์œ ํ•˜๊ณ  ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. GPTQConfig๋ฅผ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•ด push_to_hub() ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์„ธ์š”:

quantized_model.push_to_hub("opt-125m-gptq")
tokenizer.push_to_hub("opt-125m-gptq")

์–‘์žํ™”๋œ ๋ชจ๋ธ์„ ๋กœ์ปฌ์— ์ €์žฅํ•˜๋ ค๋ฉด save_pretrained() ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋ธ์ด device_map ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ์–‘์žํ™”๋˜์—ˆ์„ ๊ฒฝ์šฐ, ์ €์žฅํ•˜๊ธฐ ์ „์— ์ „์ฒด ๋ชจ๋ธ์„ GPU๋‚˜ CPU๋กœ ์ด๋™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋ชจ๋ธ์„ CPU์— ์ €์žฅํ•˜๋ ค๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ•ฉ๋‹ˆ๋‹ค:

quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")

# if the model was quantized with device_map set
quantized_model.to("cpu")
quantized_model.save_pretrained("opt-125m-gptq")

์–‘์žํ™”๋œ ๋ชจ๋ธ์„ ๋‹ค์‹œ ๋กœ๋“œํ•˜๋ ค๋ฉด from_pretrained() ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , device_map="auto"๋ฅผ ์„ค์ •ํ•˜์—ฌ ๋ชจ๋“  ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ GPU์— ๋ชจ๋ธ์„ ์ž๋™์œผ๋กœ ๋ถ„์‚ฐ์‹œ์ผœ ๋” ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์œผ๋ฉด์„œ ๋ชจ๋ธ์„ ๋” ๋น ๋ฅด๊ฒŒ ๋กœ๋“œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")

ExLlama

ExLlama is a Python/C++/CUDA implementation of the Llama model designed for faster inference with 4-bit GPTQ weights (see these benchmarks). The ExLlama kernel is enabled by default when you create a GPTQConfig object. To speed up inference further, configure the exllama_config parameter to use the ExLlamaV2 kernels:

import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, exllama_config={"version":2})
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)

Only 4-bit models are supported, and we recommend disabling the ExLlama kernels if you are fine-tuning a quantized model with PEFT.

The ExLlama kernels are only supported when the entire model is on the GPU. If you run inference on a CPU with AutoGPTQ (version 0.4.2 or later), you need to disable the ExLlama kernel. To do so, overwrite the ExLlama-related attributes in the quantization config of the config.json file:

import torch
from transformers import AutoModelForCausalLM, GPTQConfig
gptq_config = GPTQConfig(bits=4, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config)