
bitsandbytes

bitsandbytes is the easiest way to quantize a model to 8-bit and 4-bit. 8-bit quantization multiplies the outliers in fp16 with the non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect that outlier values have on model performance. 4-bit quantization compresses a model even further and is commonly used with QLoRA to fine-tune quantized large language models.
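The decomposition behind 8-bit quantization can be sketched in plain PyTorch. This is a conceptual illustration only, not the bitsandbytes kernel; the function name and the per-vector absmax scaling are illustrative assumptions:

import torch

def llm_int8_matmul_sketch(X, W, threshold=6.0):
    # Split the input features into outlier and regular columns.
    outlier_cols = (X.abs() > threshold).any(dim=0)

    # Outlier columns are multiplied in fp16.
    out_fp16 = X[:, outlier_cols] @ W[outlier_cols, :]

    # Regular columns are absmax-quantized to int8
    # (per input row and per weight column), multiplied, and dequantized.
    X_rest, W_rest = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = (X_rest.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    sw = (W_rest.abs().amax(dim=0, keepdim=True) / 127.0).clamp(min=1e-8)
    X_q = (X_rest / sx).round().clamp(-127, 127).to(torch.int8)
    W_q = (W_rest / sw).round().clamp(-127, 127).to(torch.int8)
    out_int8 = (X_q.float() @ W_q.float()) * (sx * sw)  # int8 matmul emulated in float

    # Both partial results are merged back into fp16 weights/activations.
    return (out_fp16.float() + out_int8).half()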

To use bitsandbytes, make sure you have the following libraries installed:

pip install transformers accelerate "bitsandbytes>0.37.0"

You can now quantize a model by passing a BitsAndBytesConfig to the from_pretrained() method. This works with any model that supports loading with Accelerate and contains torch.nn.Linear layers.


๋ชจ๋ธ์„ 8๋น„ํŠธ๋กœ ์–‘์žํ™”ํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์–ด๋“ค๋ฉฐ, ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ GPU๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•˜๋ ค๋ฉด device_map="auto"๋ฅผ ์„ค์ •ํ•˜์„ธ์š”.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7", 
    quantization_config=quantization_config
)
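If you want to verify that the linear layers were actually replaced, you can inspect one of them. The module path below is specific to the BLOOM architecture and is an illustrative assumption; it should print a bitsandbytes Linear8bitLt module:

print(model_8bit.transformer.h[0].self_attention.query_key_value)  # Linear8bitLt(...)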

๊ธฐ๋ณธ์ ์œผ๋กœ torch.nn.LayerNorm๊ณผ ๊ฐ™์€ ๋‹ค๋ฅธ ๋ชจ๋“ˆ์€ torch.float16์œผ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค. ์›ํ•œ๋‹ค๋ฉด dtype ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ์ด๋“ค ๋ชจ๋“ˆ์˜ ๋ฐ์ดํ„ฐ ์œ ํ˜•์„ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", 
    quantization_config=quantization_config, 
    dtype=torch.float32
)
model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype  # torch.float32

๋ชจ๋ธ์ด 8๋น„ํŠธ๋กœ ์–‘์žํ™”๋˜๋ฉด ์ตœ์‹  ๋ฒ„์ „์˜ Transformers์™€ bitsandbytes๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š” ํ•œ ์–‘์žํ™”๋œ ๊ฐ€์ค‘์น˜๋ฅผ Hub์— ํ‘ธ์‹œํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์ตœ์‹  ๋ฒ„์ „์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ, push_to_hub() ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ 8๋น„ํŠธ ๋ชจ๋ธ์„ Hub์— ํ‘ธ์‹œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–‘์žํ™” config.json ํŒŒ์ผ์ด ๋จผ์ € ํ‘ธ์‹œ๋˜๊ณ , ๊ทธ ๋‹ค์Œ ์–‘์žํ™”๋œ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜๊ฐ€ ํ‘ธ์‹œ๋ฉ๋‹ˆ๋‹ค.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", 
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

model.push_to_hub("bloom-560m-8bit")

Training with 8-bit and 4-bit weights is only supported for training extra parameters (for example, adapter layers), not the quantized weights themselves.

Check your memory footprint with the get_memory_footprint method:

print(model.get_memory_footprint())
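As a rough sanity check, you can compare the footprint of the same checkpoint loaded in half precision and in 8-bit. The side-by-side load below is just a sketch, and the exact numbers depend on the model:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_fp16 = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", dtype=torch.float16)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

print(model_fp16.get_memory_footprint())  # fp16 baseline
print(model_8bit.get_memory_footprint())  # roughly half of the fp16 footprint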

์–‘์žํ™”๋œ ๋ชจ๋ธ์€ from_pretrained() ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ load_in_8bit ๋˜๋Š” load_in_4bit ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ง€์ •ํ•˜์ง€ ์•Š๊ณ ๋„ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto")

8-bit (LLM.int8() algorithm)

8๋น„ํŠธ ์–‘์žํ™”์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์„ ์•Œ๊ณ  ์‹ถ๋‹ค๋ฉด ์ด ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”!

์ด ์„น์…˜์—์„œ๋Š” ์˜คํ”„๋กœ๋”ฉ, ์ด์ƒ์น˜ ์ž„๊ณ—๊ฐ’, ๋ชจ๋“ˆ ๋ณ€ํ™˜ ๊ฑด๋„ˆ๋›ฐ๊ธฐ ๋ฐ ๋ฏธ์„ธ ์กฐ์ •๊ณผ ๊ฐ™์€ 8๋น„ํŠธ ๋ชจ๋ธ์˜ ํŠน์ • ๊ธฐ๋Šฅ์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค.

Offloading

8๋น„ํŠธ ๋ชจ๋ธ์€ CPU์™€ GPU ๊ฐ„์— ๊ฐ€์ค‘์น˜๋ฅผ ์˜คํ”„๋กœ๋“œํ•˜์—ฌ ๋งค์šฐ ํฐ ๋ชจ๋ธ์„ ๋ฉ”๋ชจ๋ฆฌ์— ์žฅ์ฐฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. CPU๋กœ ์ „์†ก๋œ ๊ฐ€์ค‘์น˜๋Š” ์‹ค์ œ๋กœ float32๋กœ ์ €์žฅ๋˜๋ฉฐ 8๋น„ํŠธ๋กœ ๋ณ€ํ™˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, bigscience/bloom-1b7 ๋ชจ๋ธ์˜ ์˜คํ”„๋กœ๋“œ๋ฅผ ํ™œ์„ฑํ™”ํ•˜๋ ค๋ฉด BitsAndBytesConfig๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์„ธ์š”:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

Design a custom device map to fit everything on your GPU except for the lm_head, which is dispatched to the CPU:

device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

Now load the model with the custom device_map and quantization_config:

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map,
    quantization_config=quantization_config,
)
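The offloaded model can then be used like any other model. The following usage sketch places the inputs on GPU 0 because the word embeddings were mapped there above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)  # embeddings live on GPU 0
outputs = model_8bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))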

์ด์ƒ์น˜ ์ž„๊ณ—๊ฐ’

โ€œ์ด์ƒ์น˜โ€๋Š” ํŠน์ • ์ž„๊ณ—๊ฐ’์„ ์ดˆ๊ณผํ•˜๋Š” ์€๋‹‰ ์ƒํƒœ ๊ฐ’์„ ์˜๋ฏธํ•˜๋ฉฐ, ์ด๋Ÿฌํ•œ ๊ฐ’์€ fp16์œผ๋กœ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค. ๊ฐ’์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์ •๊ทœ ๋ถ„ํฌ ([-3.5, 3.5])๋ฅผ ๋”ฐ๋ฅด์ง€๋งŒ, ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ์ด ๋ถ„ํฌ๋Š” ๋งค์šฐ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค ([-60, 6] ๋˜๋Š” [6, 60]). 8๋น„ํŠธ ์–‘์žํ™”๋Š” ~5 ์ •๋„์˜ ๊ฐ’์—์„œ ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ, ๊ทธ ์ด์ƒ์—์„œ๋Š” ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ข‹์€ ๊ธฐ๋ณธ ์ž„๊ณ—๊ฐ’ ๊ฐ’์€ 6์ด์ง€๋งŒ, ๋” ๋ถˆ์•ˆ์ •ํ•œ ๋ชจ๋ธ (์†Œํ˜• ๋ชจ๋ธ ๋˜๋Š” ๋ฏธ์„ธ ์กฐ์ •)์—๋Š” ๋” ๋‚ฎ์€ ์ž„๊ณ—๊ฐ’์ด ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ชจ๋ธ์— ๊ฐ€์žฅ ์ ํ•ฉํ•œ ์ž„๊ณ—๊ฐ’์„ ์ฐพ์œผ๋ ค๋ฉด BitsAndBytesConfig์—์„œ llm_int8_threshold ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‹คํ—˜ํ•ด๋ณด๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    llm_int8_threshold=10,
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    quantization_config=quantization_config,
)

Skipping module conversion

For some models, like Jukebox, you don't need to quantize every module to 8-bit, and doing so can actually cause instability. For Jukebox, there are several lm_head modules that should be skipped using the llm_int8_skip_modules parameter in BitsAndBytesConfig:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    llm_int8_skip_modules=["lm_head"],
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
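A quick way to confirm that a skipped module was left alone is to print it. Assuming the default half-precision cast for non-quantized modules, lm_head should remain a regular torch.nn.Linear rather than a bitsandbytes 8-bit layer:

print(model_8bit.lm_head)               # plain torch.nn.Linear, not Linear8bitLt
print(model_8bit.lm_head.weight.dtype)  # torch.float16 under the default cast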

Fine-tuning

With the PEFT library, you can fine-tune large models like flan-t5-large and facebook/opt-6.7b with 8-bit quantization. You don't need to pass the device_map parameter for training because it automatically loads your model on a GPU. However, you can still customize the device map with the device_map parameter if you'd like to (device_map="auto" should only be used for inference). A minimal adapter setup is sketched below.
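The following LoRA sketch assumes the peft library is installed; the hyperparameters and the target_modules names (the attention projections in OPT) are illustrative choices, not prescriptions:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
model = prepare_model_for_kbit_training(model)  # casts layer norms, enables input grads

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # OPT attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable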

4-bit (QLoRA algorithm)

Try 4-bit quantization in this notebook and learn more about its details in this blog post.

์ด ์„น์…˜์—์„œ๋Š” ๊ณ„์‚ฐ ๋ฐ์ดํ„ฐ ์œ ํ˜• ๋ณ€๊ฒฝ, Normal Float 4 (NF4) ๋ฐ์ดํ„ฐ ์œ ํ˜• ์‚ฌ์šฉ, ์ค‘์ฒฉ ์–‘์žํ™” ์‚ฌ์šฉ๊ณผ ๊ฐ™์€ 4๋น„ํŠธ ๋ชจ๋ธ์˜ ํŠน์ • ๊ธฐ๋Šฅ ์ผ๋ถ€๋ฅผ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์œ ํ˜• ๊ณ„์‚ฐ

๊ณ„์‚ฐ ์†๋„๋ฅผ ๋†’์ด๊ธฐ ์œ„ํ•ด BitsAndBytesConfig์—์„œ bnb_4bit_compute_dtype ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์œ ํ˜•์„ float32(๊ธฐ๋ณธ๊ฐ’)์—์„œ bf16์œผ๋กœ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

Normal Float 4 (NF4)

NF4 is a 4-bit data type from the QLoRA paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the bnb_4bit_quant_type parameter in BitsAndBytesConfig:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

์ถ”๋ก ์˜ ๊ฒฝ์šฐ, bnb_4bit_quant_type์€ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜์™€ ์ผ๊ด€์„ฑ์„ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด bnb_4bit_compute_dtype ๋ฐ dtype ๊ฐ’์„ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ค‘์ฒฉ ์–‘์žํ™”

์ค‘์ฒฉ ์–‘์žํ™”๋Š” ์ถ”๊ฐ€์ ์ธ ์„ฑ๋Šฅ ์†์‹ค ์—†์ด ์ถ”๊ฐ€์ ์ธ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ์ˆ ์ž…๋‹ˆ๋‹ค. ์ด ๊ธฐ๋Šฅ์€ ์ด๋ฏธ ์–‘์žํ™”๋œ ๊ฐ€์ค‘์น˜์˜ 2์ฐจ ์–‘์žํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋งค๊ฐœ๋ณ€์ˆ˜๋‹น ์ถ”๊ฐ€๋กœ 0.4๋น„ํŠธ๋ฅผ ์ ˆ์•ฝํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์ค‘์ฒฉ ์–‘์žํ™”๋ฅผ ํ†ตํ•ด 16GB NVIDIA T4 GPU์—์„œ ์‹œํ€€์Šค ๊ธธ์ด 1024, ๋ฐฐ์น˜ ํฌ๊ธฐ 1, ๊ทธ๋ ˆ์ด๋””์–ธํŠธ ๋ˆ„์  4๋‹จ๊ณ„๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Llama-13b ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
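These 4-bit options can be combined in a single config; the following is a common QLoRA-style setup assembled from the parameters shown above:

import torch
from transformers import BitsAndBytesConfig

qlora_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)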

bitsandbytes ๋ชจ๋ธ์˜ ๋น„์–‘์žํ™”

์–‘์žํ™”๋œ ํ›„์—๋Š” ๋ชจ๋ธ์„ ์›๋ž˜์˜ ์ •๋ฐ€๋„๋กœ ๋น„์–‘์žํ™”ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์ด๋Š” ๋ชจ๋ธ์˜ ํ’ˆ์งˆ์ด ์•ฝ๊ฐ„ ์ €ํ•˜๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋น„์–‘์žํ™”๋œ ๋ชจ๋ธ์— ๋งž์ถœ ์ˆ˜ ์žˆ๋Š” ์ถฉ๋ถ„ํ•œ GPU RAM์ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "facebook/opt-125m"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.dequantize()

text = tokenizer("Hello my name is", return_tensors="pt").to(0)

out = model.generate(**text)
print(tokenizer.decode(out[0]))