sren.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="description" content="">
    <meta name="author" content="">
    <!--link rel="shortcut icon" href="ico/favicon.ico" -->

    <title>SReN: Shape Regression Network for Comic Storyboard Extraction -- Zheqi He</title>

    <!-- Bootstrap core CSS -->
    <link href="css/bootstrap.min.css" rel="stylesheet">

    <!-- Just for debugging purposes. Don't actually copy this line! -->
    <!--[if lt IE 9]>
    <script src="js/ie8-responsive-file-warning.js"></script><![endif]-->

    <!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media queries -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
    <script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
    <![endif]-->

    <!-- Custom styles for this template -->
    <!-- Custom styles for this template -->
    <link href="css/custom.css" rel="stylesheet">
</head>
<!-- NAVBAR
================================================== -->
<body>
<header class="main">
    <div class="container">
        <div class="row">
            <div class="col-sm-12">
                <h1>SReN</h1>
                <h2>Shape Regression Network for Comic Storyboard Extraction</h2>
            </div>
        </div>
    </div>
</header>

<div class="container">
    <div class="row">
        <h3> ABOUT ME </h3>
        <hr/>
        <div class="col-sm-6">
            <p><b>Zheqi He</b> - <b>何哲琪</b></p>
            <p>Master Student</p>
            <p>Institute of Computer Science and Technology</p>
            <p>Peking University</p>
        </div>
        <div class="col-sm-6">
            <p> . </p>
            <p> Tel: +86 18810543960 </p>
            <p>Email: hezheqi@pku.edu.cn</p>
            <p>GitHub: <a href="https://github.com/philokey"> philokey </a></p>
        </div>
    </div>
    <div class="row">
        <h3> BACKGROUND </h3>
        <hr/>
        <div class="col-sm-12">
            <p>
                Comics, defined as “juxtaposed pictorial and other images in deliberat esequence, intended to convey
                information and/or to produce anaesthetic response in theviewer”, have gained increasing popularity
                since it sappearance in the 19th century. In this era, Comic is a kind of entertainment publication
                popular
                among people of different ages around the world.
                The storyboard is the basic semantic unit of a comic, as shown in Fig.1,
                hence, decomposing the comic image into several storyboards is the fundamental step to understand
                content of comic.
            </p>
            <div class="row">
                <div class="col-xs-6 ">
                    <img src="img/page1.png" height=500px>
                    <div align="center">
                        (a)Source:Kazuma Kamachi, A Certain Scientific Railgun,vol.1, p.34
                    </div>
                </div>
                <div class="col-xs-6">
                    <img src="img/page2.png" height=500px>
                    <div align="center">
                        (b)Source:Hiromu Arakawa,Fullmetal Alchemist,vol.26, p.6
                    </div>
                </div>
            </div>
            <div align="center">
                Fig.1 Typical comic storyboards
            </div>
            <br/>
            <br/>
            <p>
                In addition, decomposing the comic image into several storyboards is the key technique
                to produce digital comic documents suitable for reading on mobile devices with small screen.
                Fig.2 gives an example.
            </p>
            <div class="row">
                <div align="center">
                    <img src="img/mobile_read.png" width=900px>
                    <div class="caption">
                    </div>
                </div>
            </div>
            <div align="center">
                Fig. 2 Read comic on mobile device
            </div>
        </div>
    </div>
    <div class="row">
        <h3> MOTIVATION </h3>
        <hr/>
        <div class="col-sm-12">
            <p>
                Most of previous storyboard extraction methods [<b>Wang, Zhou, and Tang 2015; Li et al. 2015</b>] use
                only hand
                crafted low-level visual patterns, such as edge segments, line segments or connected component.
                These methods analyze the relationship between low-level visual patterns and combine them into
                storyboard. These methods work effectively under certain assumptions, but they may fail to handle
                the comic image with complex layout. For example, when storyboards missing borderlines, these methods
                cannot handle them well; or when there are complex overlaps between storyboards, these methods are
                tend to fail. The most important reason is low-level visual patterns can not represent image content
                well.
            </p>
            <p>
                Recently, deep learning methods[<b>Girshick 2015; Liu et al. 2015</b>] have been applied to object
                detection
                and gain the state-of-the-art performances. The effective feature learning capability of deep neural
                network
                make great contribution to high-level vision tasks. However, these methods can only obtain rectangle
                bounding
                box of objects, which are not precise enough for many application tasks. For example, for the tasks of
                comic
                storyboard detection or traffic sign detection. It is better to use parameterized shape like triangle,
                quadrangle or ellipse to express detected results.
            </p>
        </div>
    </div>
    <div class="row">
        <h3> METHOD </h3>
        <hr/>
        <div class="col-sm-12">
            <p>
                In this paper, we propose a novel architecture based on deep convolutional neural network namely
                SReN to detect storyboards within comic images. Fig.3 illustrate the architecture of SReN, which
                consists of two main steps: generate storyboard proposals and train shape regression network.
            </p>
        </div>
        <div class="row">
            <div align="center">
                <img src="img/net_new.png" width=600px>
                <div class="caption">
                </div>
            </div>
        </div>
        <div align="center">
            Fig.3 The architecture of SReN
        </div>
        <div>
            <h4>Generating storyboard proposals</h4>
            <p>
                We use comic images to train a Fast R-CNN model to detect storyboard rectangle bounding boxes <i>r</i>.
                The reason we use Fast R-CNN rather than Faster R-CNN [<b>Ren et al. 2015</b>], which performs better than Fast R-CNN
                in many challenges like Pascal VOC or COCO, is that Faster R-CNN has bad performance when we require high localization accuracy.
                Since Fast R-CNN can only generate rectangle bounding boxes, we use corresponding exterior rectangles as ground
                truth for storyboards. But these bounding boxes often miss some parts of a storyboard, in order to obtain the
                complete storyboard, we enlarge <i>r</i> by a factor of 1.1 to generate storyboard proposal <i>p</i> as the input of our SReN.
                Another problem is that storyboards are often various in sizes, to reduce the interference of this, we
                normalize
                the vertexes of storyboard proposals into <i>[-1, 1]</i>, that is
            </p>
            <div align="center">
                <img src="http://latex.codecogs.com/svg.latex?x'= \frac{x-c_x}{w}, \quad y'=\frac{y-c_y}{h}"/>
            </div>
            <p>
                where <i>(x', y')</i> is the vertex of the regression target,
                <i>(x, y)</i> is the vertex of the original storyboard, <i>(c_x, c_y)</i> is the center of the
                storyboard, <i>w</i> and <i>h</i> is the width and the height of the storyboard proposal.
            </p>
        </div>
        <div>
            <h4>Training shape regression network</h4>
            <p>
                Firstly, we sorted the regression target
                <img src="http://latex.codecogs.com/svg.latex?x'= \{\vec{t_1}, \vec{t_2},..., \vec{t_n}\}"/>,
                where <img src="http://latex.codecogs.com/svg.latex?\vec{t_i}=(t_{ix}, t_{iy})"/>, of the storyboard by it
                is polar angle.
                Then we use $p$ and their regression targets as the input of VGG16 network [<b>Simonyan and Zisserman
                2015</b>]
                to get feature <i>f</i> with 4096 dimensions. Finally we add a fully connected layer to regress the vertexes
                of the storyboard <img src="http://latex.codecogs.com/svg.latex?\{\vec{s_1}, \vec{s_2},..., \vec{s_n}\}"/>,
                where
                <img src="http://latex.codecogs.com/svg.latex?\vec{s_i}=(s_{ix}, s_{iy})"/>, as storyboards are quadrangle,
                we set <i>n = 4</i>. Like Fast R-CNN, we use the loss function:
            </p>
            <div align="center">
                <img src="http://latex.codecogs.com/svg.latex?L(\vec{t_i}, \vec{s_i}) = \sum_{i=1}^n smooth_{L_1}(t_{ix} - s_{ix})+ smooth_{L_1}(t_{iy} - s_{iy})"
                     border="0"/>,
            </div>
            <p>where</p>
            <div align="center">
                <img src="http://latex.codecogs.com/svg.latex?smooth_{L_{1}}(a) = \begin{cases} 0.5a^2,	& \mbox{if }\left|a\right| < 1  \\
                               \left|a\right| - 0.5, & \mbox{otherwise}. \\ \end{cases}"/>
            </div>
        </div>
        <div>
            <h4>Implementation details</h4>
            <p>
                For the implementation code of our paper, we make use of the Caffe framework[<b>Jia et al. 2014</b>] and tran SReN
                with a Titan X.
            </p>
            <p>
                When training Fast R-CNN, We treat all region proposals with > 0.8 IoU overlap with a ground-truth box as
                positives, the rest as negative. We start SGD at a learning rate of 0.001, in each SGD iteration, the size of mini-batch
                is 128, which consist of 32 positive and 96 negative. When training Regression Network, we start SGD at a learning rate of
                0.001 and reduce ten times after every 20000 iterations, we set the batch size equal to 64.
            </p>
        </div>
    </div>
    <div class="row">
        <h3> EXPERIMENT </h3>
        <hr/>
        <div class="col-sm-12">
            <div id="dataset">
                <h4>Dataset</h4>
                <p>
                    We construct a dataset with 29845 labeled comic pages (contain 169421 storyboards) from 103
                    different comic books, which come from different Japanese and Hong Kong comics. We randomly
                    select 15087 of the labeled comic pages to train SReN, we use another 7375 comic pages to
                    validate the training result and conduct experiments on the remaining 7382 comic pages.
                    Therefore, the proportion of train, evaluate and test dateste is about <b>5 : 2.5 : 2.5</b>.
                </p>
            </div>
            <div id="metrics">
                <h4>Evaluation criteria</h4>
                <p>
                    We evaluate results on two levels: storyboard level and page level.
                    On the storyboard level, we use precision, recall and F1 score as evaluation metrics. On the page
                    level, we use page correction rate(PCR) as evaluation criterion, i.e., the ratio of comic pages in
                    which
                    all storyboards are correctly detected. To be more specific, each detected storyboard is represented
                    by
                    a quadrangle, if the intersection-over-union(IoU) between the detected storyboard and the
                    corresponding
                    ground truth is more than 90\%, we regard it as a correct detected storyboard for the comic page.
                    IoU
                    for
                    each ground truth and detected storyboards is defined as following,
                </p>
                <div align="center">
                    <img src="http://latex.codecogs.com/svg.latex?IoU=\max_i \frac{p\cap D_i}{p\cup D_i}"/>
                </div>
                <p>
                    where
                    <img src="http://latex.codecogs.com/svg.latex?D_i"/>
                    is a set of detected storyboards,
                    <img src="http://latex.codecogs.com/svg.latex?p"/>
                    is the manually label for each storyboard within the page. The intersection and the union
                    operation are calculated in terms of area.
                </p>
            </div>
            <div id="results">
                <h4>Results</h4>
                <p>
                    We compare our method with Fast R-CNN without shape regression and two low-level visual patterns
                    based methods: TCRF[<b>Li et al. 2015</b>] and ESA[<b>Wang, Zhou, and Tang 2015</b>]. Experimental
                    results are listed in Table 1, which indicate that:
                <ul>
                    <li>The deep learning based methods can achieve much better results than hand crafted low-level visual
                        patterns based methods.
                    </li>
                    <li>Fast R-CNN with SReN is better than vanilla Fast R-CNN by the effective shape regression.</li>
                </ul>
                </p>
                <div align="center">
                    <table class="table table-striped">
                        <thead>
                        <tr>
                            <th style="text-align: center">Method</th>
                            <th style="text-align: center">Precision</th>
                            <th style="text-align: center">Recall</th>
                            <th style="text-align: center">F1 score</th>
                            <th style="text-align: center">PCR</th>
                        </tr>
                        </thead>

                        <tbody>
                        <tr>
                            <td style="text-align: center">ESA</td>
                            <td style="text-align: center">0.835</td>
                            <td style="text-align: center">0.700</td>
                            <td style="text-align: center">0.762</td>
                            <td style="text-align: center">0.418</td>
                        </tr>
                        <tr>
                            <td style="text-align: center">TCRF</td>
                            <td style="text-align: center">0.699</td>
                            <td style="text-align: center">0.64</td>
                            <td style="text-align: center">0.668</td>
                            <td style="text-align: center">0.399</td>
                        </tr>
                        <tr>
                            <td style="text-align: center">Fast R-CNN</td>
                            <td style="text-align: center">0.807</td>
                            <td style="text-align: center">0.799</td>
                            <td style="text-align: center">0.803</td>
                            <td style="text-align: center">0.518</td>
                        </tr>
                        <tr>
                            <td style="text-align: center">SReN</td>
                            <td style="text-align: center"><strong>0.888</strong></td>
                            <td style="text-align: center"><strong>0.879</strong></td>
                            <td style="text-align: center"><strong>0.883</strong></td>
                            <td style="text-align: center"><strong>0.640</strong></td>
                        </tr>
                        </tbody>
                    </table>
                    <div align="center">
                        Table 1: Results on test dataset
                    </div>
                </div>
                <p>
                    Samples of extraction results of comic pages are as following:
                </p>
                <div class="row">
                    <div class="col-xs-4 ">
                        <img src="img/res1.jpg" height=400px>
                    </div>
                    <div class="col-xs-4">
                        <img src="img/res2.jpg" height=400px>
                    </div>
                    <div class="col-xs-4">
                        <img src="img/res3.jpg" height=400px>
                    </div>
                </div>

            </div>
        </div>
    </div>
    <div class="row">
        <h3> DISCUSSION </h3>
        <hr/>
        <div class="col-sm-12">
            <p>
                In this paper, we propose a novel deep architecture to detect storyboards within comic images,
                namely SReN. The main contribution is to use a shape regression network to regress the vertexes
                of quadrilateral storyboards in comic pages. Experimental results demonstrate that SReN performs
                better than two state-of-the-art storyboards extraction methods and object detection methods.
                In the future, we will test our idea on other shapes, like ellipse and triangle, which can be used in
                traffic sign detection and other applications. We also want to investigate how to design an end-to-end
                model to automatically detect and regression the target shape.
            </p>
        </div>
    </div>
    <div class="row">
        <h4> REFERENCE </h4>
        <hr/>
        <div class="col-sm-12">
            <dl class="dl-horizontal">
                <dt>wang2015comic</dt>
                <dd>Wang Y, Zhou Y, Tang Z. Comic frame extraction via line segments combination[C]//Document Analysis
                    and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015: 856-860.
                </dd>
            </dl>
            <dl class="dl-horizontal">
                <dt>li2015tree</dt>
                <dd>Li L, Wang Y, Suen C Y, et al. A tree conditional random field model for panel detection in comic
                    images[J]. Pattern Recognition, 2015, 48(7): 2129-2140.
                </dd>
            </dl>
            <dl class="dl-horizontal">
                <dt>liu2015ssd</dt>
                <dd>Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint
                    arXiv:1512.02325, 2015.
                </dd>
            </dl>
            <dl class="dl-horizontal">
                <dt>girshick2015fast</dt>
                <dd>Girshick R. Fast r-cnn[C]//Proceedings of the IEEE International Conference on Computer Vision.
                    2015: 1440-1448.
                </dd>
            </dl>
            <dl class="dl-horizontal">
                <dt>simonyan2015very</dt>
                <dd>Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J].
                    arXiv preprint arXiv:1409.1556, 2014.
                </dd>
            </dl>
            <dl class="dl-horizontal">
                <dt>jia2014caffe</dt>
                <dd>Jia Y, Shelhamer E, Donahue J, et al. Caffe: Convolutional architecture for fast feature embedding[C]
                    //Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014: 675-678.
                </dd>
            </dl>
             <dl class="dl-horizontal">
                <dt>ren2015faster</dt>
                <dd>Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015: 91-99.</dd>
            </dl>
            <dl class="dl-horizontal">
                <dt></dt>
                <dd></dd>
            </dl>
        </div>
    </div>
</div>

<!-- FOOTER -->
<footer class='main'>
    <p>Edit by Zheqi He 2016</p>
</footer>


<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
</body>
</html>