
question about concurrently stacking the activation function #23

Open
pupupubb opened this issue Jun 12, 2023 · 9 comments

Comments

@pupupubb

Hi, thanks for the great work.

Can I understand the class activation(nn.ReLU) as a combination of ReLU -> depthwise conv -> BN?

I don't seem to see the concurrent stacking of activation functions in the code. Does 'concurrently' mean multi-branch activation functions?

Thank you very much!

@HantingChen
Collaborator

Thanks for your interest. We use the depthwise conv as an efficient implementation of our activation function, which is the same as Eq. (6) in our paper. Each element of the output of this activation depends on several non-linear inputs, which can be regarded as concurrent stacking.
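A minimal sketch of that equivalence (illustrative only, not the repository's exact code): the element-wise non-linearity is applied first, and a per-channel depthwise conv then mixes the (2n+1) x (2n+1) activated neighbours with learnable weights.

import torch
import torch.nn.functional as F

# Illustrative sketch: "concurrently stacked" activation as ReLU followed by a
# depthwise conv; each output element mixes (2n+1)x(2n+1) activated neighbours.
n, C = 3, 16                                  # n = 3 -> kernel size 2n + 1 = 7
x = torch.randn(1, C, 32, 32)
a = torch.randn(C, 1, 2 * n + 1, 2 * n + 1)   # one kernel per channel (the a_{i,j,c})
y = F.conv2d(F.relu(x), a, padding=n, groups=C)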

@pupupubb
Author

Thank you for your reply!

Are the learnable parameters gamma and beta of the BN in the activation function the a and b in Eq. (6)?

@HantingChen
Collaborator

Not really. In fact, the BN can be merged into the conv in the activation. Then, the weight and bias of the merged conv are a and b in Eq.(6).
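A small sketch of the BN-into-conv fusion being described here (the same algebra as the _fuse_bn_tensor function quoted later in this thread, written as a standalone helper for illustration):

import torch.nn as nn

# Fold a BatchNorm into the preceding conv:
#   gamma * (conv(x) - mean) / sqrt(var + eps) + beta
# becomes conv'(x) + b' with conv' = conv * gamma/std and b' = beta - mean * gamma/std.
def fuse_bn(conv_weight, bn: nn.BatchNorm2d):
    std = (bn.running_var + bn.eps).sqrt()
    scale = (bn.weight / std).reshape(-1, 1, 1, 1)
    fused_weight = conv_weight * scale
    fused_bias = bn.bias - bn.running_mean * bn.weight / std
    return fused_weight, fused_bias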

@pupupubb
Author

Excellent idea!
Thanks!

@cheezi

cheezi commented Jul 12, 2023

Not really. In fact, the BN can be merged into the conv in the activation. Then, the weight and bias of the merged conv are a and b in Eq.(6).

Is this actually true?
The equation states a·A(x+b), but with the conv you basically implement w·A(x)+b. Mathematically you cannot just pull the b out through a non-linear activation. E.g., assuming x=-2, b=2, w=1 and ReLU: with the equation you get 1·ReLU(-2+2)=0, while with the implementation you get 1·ReLU(-2)+2=2...
Also, isn't this just a simple depthwise-separable conv once you stack multiple blocks:
Conv1x1 -> ReLU -> ConvDW3x3 -> Conv1x1 -> ReLU -> ...
And last but not least, why is the kernel size actually "doubled" if you state in the paper that you use n=3 for the activation series?
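The numerical counter-example can be checked in a couple of lines (throwaway snippet, not from the repository):

import torch
import torch.nn.functional as F

# With x = -2, b = 2, w = a = 1: a*ReLU(x + b) = 0 but w*ReLU(x) + b = 2,
# so the bias cannot simply be moved outside a non-linear activation.
x = torch.tensor(-2.0)
b, w = 2.0, 1.0
print(w * F.relu(x + b))   # tensor(0.)
print(w * F.relu(x) + b)   # tensor(2.)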

@mikljohansson

mikljohansson commented Jul 13, 2023

Not really. In fact, the BN can be merged into the conv in the activation. Then, the weight and bias of the merged conv are a and b in Eq.(6).

Is this actually true? The equation states a·A(x+b), but with the conv you basically implement w·A(x)+b. Mathematically you cannot just pull the b out through a non-linear activation. E.g., assuming x=-2, b=2, w=1 and ReLU: with the equation you get 1·ReLU(-2+2)=0, while with the implementation you get 1·ReLU(-2)+2=2... Also, isn't this just a simple depthwise-separable conv once you stack multiple blocks: Conv1x1 -> ReLU -> ConvDW3x3 -> Conv1x1 -> ReLU -> ... And last but not least, why is the kernel size actually "doubled" if you state in the paper that you use n=3 for the activation series?

I wonder the same thing. As far as I can tell from reading the code, the sequence of layers is as below.

At training time:

  • Conv1x1
  • BatchNorm
  • LeakyReLU
  • Conv1x1
  • BatchNorm
  • Optional Max/AvgPool (downsampling)
  • ReLU
  • DepthwiseConv7x7 (kernel_size=7, groups=in_channels). The "activation sequence number" n determines the kernel size (kernel_size=n*2+1)
  • BatchNorm

At inference time (the LeakyReLU is gradually removed during training, the two 1x1 convs are fused, and the BatchNorms are fused into the preceding convs):

  • Conv1x1
  • Optional Max/AvgPool (downsampling)
  • ReLU
  • DepthwiseConv7x7 (kernel_size=7, groups=in_channels)

Considering that several of these blocks are stacked after each other, this seems kind of like an inverted MobileNet block (7x7 depthwise -> 1x1 conv -> downsample -> ReLU), but I don't understand the activation sequence. As far as I can tell it's just one activation and a regular depthwise convolution; perhaps I'm missing something?

There's an ablation test in the paper (Table 1) showing a performance increase from 60% to 74% when comparing plain ReLU with your "activation" function. Is this then comparing "1x1 convs -> ReLU" with "1x1 convs -> ReLU -> 3x3 depthwise (n=1)"? If so, I can imagine you'd get a big jump in performance just from comparing 1x1 convs against 1x1+3x3 convs, since with only 1x1 convs and no depthwise convs the receptive field of the network would be much worse.
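For reference, a rough sketch of the inference-time sequence described above (hypothetical helper, not the repository's actual block class):

import torch.nn as nn

# Rough sketch of the deploy-time sequence listed above:
# fused 1x1 conv -> optional pooling -> ReLU -> depthwise conv acting as the "activation".
def deploy_block(dim, dim_out, act_num=3, downsample=True):
    layers = [nn.Conv2d(dim, dim_out, kernel_size=1)]
    if downsample:
        layers.append(nn.MaxPool2d(2))
    layers += [
        nn.ReLU(),
        nn.Conv2d(dim_out, dim_out, kernel_size=2 * act_num + 1,
                  padding=act_num, groups=dim_out),
    ]
    return nn.Sequential(*layers)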

@HantingChen
Collaborator

Not really. In fact, the BN can be merged into the conv in the activation. Then, the weight and bias of the merged conv are a and b in Eq.(6).

Is this actually true? The equation states a·A(x+b), but with the conv you basically implement w·A(x)+b. Mathematically you cannot just pull the b out through a non-linear activation. E.g., assuming x=-2, b=2, w=1 and ReLU: with the equation you get 1·ReLU(-2+2)=0, while with the implementation you get 1·ReLU(-2)+2=2... Also, isn't this just a simple depthwise-separable conv once you stack multiple blocks: Conv1x1 -> ReLU -> ConvDW3x3 -> Conv1x1 -> ReLU -> ... And last but not least, why is the kernel size actually "doubled" if you state in the paper that you use n=3 for the activation series?

  1. Sorry for the mistake; the b in A(x+b) is derived from the preceding conv.
  2. The activation function is indeed implemented by a depthwise-separable conv.
  3. In Eq. (6), i and j are counted from -n to n. Therefore, n=3 corresponds to a conv with kernel size 7.
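In code, this is the act_num*2 + 1 kernel size that appears in the activation class quoted later in this thread:

# Indices i, j run from -n to n, i.e. 2n + 1 taps per axis,
# which is the act_num*2 + 1 kernel size used in the code.
n = 3
kernel_size = 2 * n + 1   # = 7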

@HantingChen
Collaborator

Not really. In fact, the BN can be merged into the conv in the activation. Then, the weight and bias of the merged conv are a and b in Eq.(6).

Is this actually true? The equation states a·A(x+b), but with the conv you basically implement w·A(x)+b. Mathematically you cannot just pull the b out through a non-linear activation. E.g., assuming x=-2, b=2, w=1 and ReLU: with the equation you get 1·ReLU(-2+2)=0, while with the implementation you get 1·ReLU(-2)+2=2... Also, isn't this just a simple depthwise-separable conv once you stack multiple blocks: Conv1x1 -> ReLU -> ConvDW3x3 -> Conv1x1 -> ReLU -> ... And last but not least, why is the kernel size actually "doubled" if you state in the paper that you use n=3 for the activation series?

I wonder the same thing. As far as I can tell from reading the code, the sequence of layers is as below.

At training time:

  • Conv1x1
  • BatchNorm
  • LeakyReLU
  • Conv1x1
  • BatchNorm
  • Optional Max/AvgPool (downsampling)
  • ReLU
  • DepthwiseConv7x7 (kernel_size=7, groups=in_channels). The "activation sequence number" n determines the kernel size (kernel_size=n*2+1)
  • BatchNorm

At inference time (the LeakyReLU is gradually removed during training, the two 1x1 convs are fused, and the BatchNorms are fused into the preceding convs):

  • Conv1x1
  • Optional Max/AvgPool (downsampling)
  • ReLU
  • DepthwiseConv7x7 (kernel_size=7, groups=in_channels)

Considering that several of these blocks are stacked after each other, this seems kind of like an inverted MobileNet block (7x7 depthwise -> 1x1 conv -> downsample -> ReLU), but I don't understand the activation sequence. As far as I can tell it's just one activation and a regular depthwise convolution; perhaps I'm missing something?

There's an ablation test in the paper (Table 1) showing a performance increase from 60% to 74% when comparing plain ReLU with your "activation" function. Is this then comparing "1x1 convs -> ReLU" with "1x1 convs -> ReLU -> 3x3 depthwise (n=1)"? If so, I can imagine you'd get a big jump in performance just from comparing 1x1 convs against 1x1+3x3 convs, since with only 1x1 convs and no depthwise convs the receptive field of the network would be much worse.

Your understanding is correct: we currently use a depthwise conv to implement the proposed activation function, for latency friendliness.

@Haus226

Haus226 commented Sep 16, 2024

Hi, thank you for the innovative work. Would you mind explaining why the fused bias (b in Eq. 6) is initialized to zeros? Why not let the bias be a learnable parameter like in a normal convolution, since it is actually a learnable parameter in BNET? Also, I think for both values of the deploy variable, self.bias can be initialized as None for cleaner code, since None and a zero bias are actually the same in a convolution.

Code before changes

class activation(nn.ReLU):
    def __init__(self, dim, act_num=3, deploy=False):
        super(activation, self).__init__()
        self.act_num = act_num
        self.deploy = deploy
        self.dim = dim
        self.weight = torch.nn.Parameter(torch.randn(dim, 1, act_num*2 + 1, act_num*2 + 1))
        if deploy:
            # deploy mode: explicit bias parameter for the fused conv, initialized to zeros
            self.bias = torch.nn.Parameter(torch.zeros(dim))
        else:
            # training mode: no conv bias; the BatchNorm supplies the affine shift
            self.bias = None
            self.bn = nn.BatchNorm2d(dim, eps=1e-6)
        weight_init.trunc_normal_(self.weight, std=.02)

    def forward(self, x):
        if self.deploy:
            return torch.nn.functional.conv2d(
                super(activation, self).forward(x), 
                self.weight, self.bias, padding=self.act_num, groups=self.dim)
        else:
            return self.bn(torch.nn.functional.conv2d(
                super(activation, self).forward(x),
                self.weight, padding=self.act_num, groups=self.dim))

    def _fuse_bn_tensor(self, weight, bn):
        kernel = weight
        running_mean = bn.running_mean
        running_var = bn.running_var
        gamma = bn.weight
        beta = bn.bias
        eps = bn.eps
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta + (0 - running_mean) * gamma / std

Code after changes

class activation(nn.ReLU):
    def __init__(self, dim, act_num=3, deploy=False):
        super(activation, self).__init__()
        self.act_num = act_num
        self.deploy = deploy
        self.dim = dim
        self.weight = torch.nn.Parameter(torch.randn(dim, 1, act_num*2 + 1, act_num*2 + 1))
        self.bias = None  # kept as None in both modes; a None bias behaves like a zero bias in conv2d
        if not deploy:
            self.bn = nn.BatchNorm2d(dim, eps=1e-6)
        weight_init.trunc_normal_(self.weight, std=.02)

    def forward(self, x):
        if self.deploy:
            return torch.nn.functional.conv2d(
                super(activation, self).forward(x), 
                self.weight, self.bias, padding=self.act_num, groups=self.dim)
        else:
            return self.bn(torch.nn.functional.conv2d(
                super(activation, self).forward(x),
                self.weight, padding=self.act_num, groups=self.dim))

    def _fuse_bn_tensor(self, weight, bn):
        kernel = weight
        running_mean = bn.running_mean
        running_var = bn.running_var
        gamma = bn.weight
        beta = bn.bias
        eps = bn.eps
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta + (0 - running_mean) * gamma / std
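
A quick check of the premise that a None bias and an all-zeros bias behave identically in the convolution (throwaway snippet with arbitrary shapes):

import torch
import torch.nn.functional as F

# bias=None and an all-zeros bias give identical depthwise-conv outputs,
# which is the premise of the simplification proposed above.
dim, act_num = 8, 3
x = torch.randn(1, dim, 16, 16)
w = torch.randn(dim, 1, act_num * 2 + 1, act_num * 2 + 1)
y_none = F.conv2d(x, w, None, padding=act_num, groups=dim)
y_zero = F.conv2d(x, w, torch.zeros(dim), padding=act_num, groups=dim)
print(torch.allclose(y_none, y_zero))   # True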
