
Question about scale factor #97

Open
lunaryle opened this issue Oct 19, 2023 · 0 comments

Comments

lunaryle commented Oct 19, 2023

Hi @ShoufaChen, thank you for the great work.

I am a beginner with diffusion models, and I have a question about the scale factor applied in prepare_diffusion_concat() and ddim_sample().
I understand that the signal-to-noise ratio is essential, as you mention in Section 4.4.

In the implementation of noise sampling,
you shift and scale x_start before q_sample, and then shift/scale it back to produce the model's diffused input.

  # map cxcywh boxes from [0, 1] to the signal range [-scale, scale]
  x_start = (x_start * 2. - 1.) * self.scale

  # noise sample
  x = self.q_sample(x_start=x_start, t=t, noise=noise)

  # clamp into the valid signal range, then map back to [0, 1]
  x = torch.clamp(x, min=-1 * self.scale, max=self.scale)
  x = ((x / self.scale) + 1) / 2.

  diff_boxes = box_cxcywh_to_xyxy(x)
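To make sure I am reading this correctly, here is the round trip as I understand it (a minimal plain-Python sketch, not the repository code; `to_signal`/`from_signal` are just names for this illustration):

```python
def to_signal(x, scale):
    # map a normalized coordinate from [0, 1] to the signal range [-scale, scale]
    return (x * 2.0 - 1.0) * scale

def from_signal(x, scale):
    # clamp into the valid signal range, then map back to [0, 1]
    x = max(-scale, min(scale, x))
    return ((x / scale) + 1.0) / 2.0

# for inputs already in [0, 1], the two transforms are exact inverses
print(from_signal(to_signal(0.25, 2.0), 2.0))  # 0.25
```

So during training, the model only ever sees diffused boxes that have been mapped back into [0, 1].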

However, in inference you divide x_boxes by the scale first and scale back later by multiplying.
I thought the division by the scale was unnecessary, because the model learns from diffused boxes that have already been scaled back, as in prepare_diffusion_concat(). And even if the scaling is needed for predicting noise, I thought the order of operations should mirror the noising step, so that the conditions are identical.

  # map the sampled boxes from the signal range back to [0, 1]
  x_boxes = ((x_boxes / self.scale) + 1) / 2
  x_boxes = box_cxcywh_to_xyxy(x_boxes)
  x_boxes = x_boxes * images_whwh[:, None, :]
  outputs_class, outputs_coord = self.head(backbone_feats, x_boxes, t, None)

  # re-apply the signal scaling to the predicted boxes
  x_start = outputs_coord[-1]  # (batch, num_proposals, 4) predicted boxes: absolute coordinates (x1, y1, x2, y2)
  x_start = x_start / images_whwh[:, None, :]
  x_start = box_xyxy_to_cxcywh(x_start)
  x_start = (x_start * 2 - 1.) * self.scale

Could you explain why the scale is applied at the inference stage, or why the input is divided by the scale?

And one more question: why is self.ddim_sampling_eta set to 1 at initialization? Shouldn't eta be zero for DDIM?
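For context on why I ask: as I understand the DDIM formulation, eta controls the per-step noise standard deviation, with eta = 0 giving the deterministic DDIM update and eta = 1 recovering DDPM-like stochastic sampling. A rough sketch of that relationship (illustrative only; the alpha-cumprod values below are arbitrary placeholders, not values from this repository):

```python
import math

def ddim_sigma(alpha_cumprod_t, alpha_cumprod_prev, eta):
    # per-step noise std in DDIM sampling:
    # sigma = eta * sqrt((1 - a_prev) / (1 - a_t)) * sqrt(1 - a_t / a_prev)
    return eta * math.sqrt(
        (1 - alpha_cumprod_prev) / (1 - alpha_cumprod_t)
        * (1 - alpha_cumprod_t / alpha_cumprod_prev)
    )

# eta = 0 -> sigma = 0: fully deterministic update
print(ddim_sigma(0.5, 0.9, 0.0))  # 0.0
# eta = 1 -> sigma matches the DDPM posterior std: stochastic sampling
print(ddim_sigma(0.5, 0.9, 1.0) > 0.0)  # True
```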

I would appreciate any feedback.
