Using official training example, model was neither saved nor pushed to repo

#12
by nakajimayoshi - opened

Hello, I am working on training a model based on the official training example. My repository can be found here: https://huggingface.co/nakajimayoshi/ddpm-iris-256/tree/main/

I was able to train the model successfully, and the training logs/samples were uploaded, but the model was neither saved in the runtime as a .bin or .pth file nor pushed to my repository. I have made no modifications to the training loop, only to the training config and the dataset loading pipeline. You can see the modified training config below:

```py
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    image_size = 256  # the generated image resolution
    train_batch_size = 16
    eval_batch_size = 16  # how many images to sample during evaluation
    num_epochs = 50
    gradient_accumulation_steps = 1
    learning_rate = 1e-4
    lr_warmup_steps = 500
    save_image_epochs = 10
    dataset_name = 'imagefolder'
    save_model_epochs = 30
    mixed_precision = 'fp16'  # `no` for float32, `fp16` for automatic mixed precision
    output_dir = 'ddpm-iris-256'  # the model name locally and on the HF Hub

    push_to_hub = True  # whether to upload the saved model to the HF Hub
    hub_private_repo = False
    overwrite_output_dir = True  # overwrite the old model when re-running the notebook
    seed = 0

config = TrainingConfig()
```
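The dataset loading change itself isn't shown in the post; for context, loading a local image folder with `datasets` and `dataset_name = 'imagefolder'` typically looks something like the sketch below (the `data_dir` path and transform are placeholders, not the author's actual pipeline):

```py
from datasets import load_dataset
from torchvision import transforms

# Sketch only: "path/to/iris_images" is a placeholder directory of training images.
dataset = load_dataset(config.dataset_name, data_dir="path/to/iris_images", split="train")

preprocess = transforms.Compose([
    transforms.Resize((config.image_size, config.image_size)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

def transform(examples):
    # Convert each PIL image to a normalized tensor of shape (3, image_size, image_size).
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}

dataset.set_transform(transform)
```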

On my repository, you can see the logs and samples were uploaded, but none of the model checkpoints were uploaded, nor can I find them in my Google Colab notebook. Any help is appreciated. Thanks.
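As a quick sanity check in Colab, listing the output directory shows whether anything was written locally at all (a minimal check; the directory name is just `config.output_dir` from the config above):

```py
import os

# If the pipeline was never saved, this directory may only contain logs/samples (or not exist yet).
if os.path.isdir(config.output_dir):
    print(os.listdir(config.output_dir))
else:
    print(f"{config.output_dir} does not exist")
```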

I have found a workaround for this issue.
The problem is in the training loop:

```py
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
        else:
            pipeline.save_pretrained(config.output_dir)  # this never gets called
```

The `else` branch is never reached because `config.push_to_hub` is set to `True`, so the `if` branch runs instead; as a result, `pipeline.save_pretrained(config.output_dir)` is never called and no model weights are ever written to the output directory for the push to pick up. I solved this by moving the method call out of the `else` statement and saving on every epoch:

```py
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)
    pipeline.save_pretrained(config.output_dir)  # moved to here: save on every epoch

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
        else:
            print('saving..')  # replaced with a print to see if this branch gets called
```

Note: I could have simply removed the entire nested if statement and always pushed to the Hub, but to avoid any unexpected behavior I left the structure as-is and only moved the method call.
Saving every epoch slows down training, but at the very least the model doesn't get lost.
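For comparison, a less costly variant of the same idea would keep the `save_model_epochs` cadence and write the pipeline to `output_dir` right before the push, so the local repository always contains the weights being uploaded (a sketch built only from the calls in the loop above, not code from the original notebook):

```py
if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
    # Write the pipeline to the local output directory first, so a model
    # checkpoint actually exists on disk...
    pipeline.save_pretrained(config.output_dir)
    if config.push_to_hub:
        # ...then push the directory contents (weights included) to the Hub.
        repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
```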
