Using official training example, model was neither saved nor pushed to repo
Hello, I am working on training a model based on the official training example, which can be found here: https://huggingface.co/nakajimayoshi/ddpm-iris-256/tree/main/
I was able to train the model successfully, and the training logs/samples were uploaded, but the model was neither saved in the runtime as a .bin or .pth file nor pushed to my repository. I have made no modifications to the training loop, only to the training config and the dataset loading pipeline. You can see the modified training config below:
```py
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    image_size = 256  # the generated image resolution
    train_batch_size = 16
    eval_batch_size = 16  # how many images to sample during evaluation
    num_epochs = 50
    gradient_accumulation_steps = 1
    learning_rate = 1e-4
    lr_warmup_steps = 500
    save_image_epochs = 10
    dataset_name = 'imagefolder'
    save_model_epochs = 30
    mixed_precision = 'fp16'  # `no` for float32, `fp16` for automatic mixed precision
    output_dir = 'ddpm-iris-256'  # the model name locally and on the HF Hub

    push_to_hub = True  # whether to upload the saved model to the HF Hub
    hub_private_repo = False
    overwrite_output_dir = True  # overwrite the old model when re-running the notebook
    seed = 0


config = TrainingConfig()
```
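For reference, the dataset loading pipeline follows the standard `imagefolder` pattern from the example; roughly something like the sketch below, where the `data_dir` path is a placeholder rather than the real location:

```py
from datasets import load_dataset
from torchvision import transforms

# Load the training images from a local folder ("path/to/iris-images" is a placeholder).
dataset = load_dataset(config.dataset_name, data_dir="path/to/iris-images", split="train")

# Resize, augment, and normalize to [-1, 1], as in the example notebook.
preprocess = transforms.Compose(
    [
        transforms.Resize((config.image_size, config.image_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)

def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}

dataset.set_transform(transform)
```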
On my repository you can see that the logs and samples were uploaded, but none of the model checkpoints were, nor can I find them in my Google Colab notebook. Any help is appreciated. Thanks!
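In case it helps with debugging, the files that actually reached the Hub can be listed with `huggingface_hub` (this is just a diagnostic; the repo id here is mine):

```py
from huggingface_hub import list_repo_files

# In my case this shows the logs and sample images, but no model weights.
print(list_repo_files("nakajimayoshi/ddpm-iris-256"))
```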
I have found a workaround for this issue. The problem is in the training loop:
```py
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
        else:
            pipeline.save_pretrained(config.output_dir)  # this never gets called
```
Since `config.push_to_hub` is `True`, the `if` branch always runs and the `else` branch is never reached, so `pipeline.save_pretrained(config.output_dir)` never gets called. I solved this by simply moving the call out of the `else` statement and saving on every epoch:
```py
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)
    pipeline.save_pretrained(config.output_dir)  # moved to here: save on every epoch

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
        else:
            print('saving..')  # replaced with a print to check whether this branch runs
```
Note that I could have simply removed the entire nested if statement and pushed to the Hub unconditionally, but to avoid any unexpected behavior I left it as is and only moved the method call.
This slows down training, but at the very least the model doesn't get lost.
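If the per-epoch save turns out to be too slow, an alternative I have not tested would be to keep the save behind the checkpoint condition but run it before the push, so the weights land in `config.output_dir` before `repo.push_to_hub` uploads that folder:

```py
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        # Write the weights locally first so there is something for the push to upload.
        pipeline.save_pretrained(config.output_dir)
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
```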