<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<!--link rel="shortcut icon" href="ico/favicon.ico" -->
<title>SReN: Shape Regression Network for Comic Storyboard Extraction -- Zheqi He</title>
<!-- Bootstrap core CSS -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- Just for debugging purposes. Don't actually copy this line! -->
<!--[if lt IE 9]>
<script src="js/ie8-responsive-file-warning.js"></script><![endif]-->
<!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
<!-- Custom styles for this template -->
<link href="css/custom.css" rel="stylesheet">
</head>
<!-- NAVBAR
================================================== -->
<body>
<header class="main">
<div class="container">
<div class="row">
<div class="col-sm-12">
<h1>SReN</h1>
<h2>Shape Regression Network for Comic Storyboard Extraction</h2>
</div>
</div>
</div>
</header>
<div class="container">
<div class="row">
<h3> ABOUT ME </h3>
<hr/>
<div class="col-sm-6">
<p><b>Zheqi He</b> - <b>何哲琪</b></p>
<p>Master Student</p>
<p>Institute of Computer Science and Technology</p>
<p>Peking University</p>
</div>
<div class="col-sm-6">
<p> . </p>
<p> Tel: +86 18810543960 </p>
<p>Email: [email protected]</p>
<p>GitHub: <a href="https://github.com/philokey"> philokey </a></p>
</div>
</div>
<div class="row">
<h3> BACKGROUND </h3>
<hr/>
<div class="col-sm-12">
<p>
Comics, defined as “juxtaposed pictorial and other images in deliberate sequence, intended to convey
information and/or to produce an aesthetic response in the viewer”, have gained increasing popularity
since their appearance in the 19th century. Today, comics are an entertainment publication popular
among people of all ages around the world.
The storyboard is the basic semantic unit of a comic, as shown in Fig.1.
Hence, decomposing a comic image into storyboards is the fundamental step toward understanding
the content of a comic.
</p>
<div class="row">
<div class="col-xs-6 ">
<img src="img/page1.png" height=500px>
<div align="center">
(a) Source: Kazuma Kamachi, A Certain Scientific Railgun, vol.1, p.34
</div>
</div>
<div class="col-xs-6">
<img src="img/page2.png" height=500px>
<div align="center">
(b) Source: Hiromu Arakawa, Fullmetal Alchemist, vol.26, p.6
</div>
</div>
</div>
<div align="center">
Fig.1 Typical comic storyboards
</div>
<br/>
<br/>
<p>
In addition, decomposing a comic image into storyboards is the key technique
for producing digital comic documents suitable for reading on mobile devices with small screens.
Fig.2 gives an example.
</p>
<div class="row">
<div align="center">
<img src="img/mobile_read.png" width=900px>
<div class="caption">
</div>
</div>
</div>
<div align="center">
Fig.2 Reading comics on a mobile device
</div>
</div>
</div>
<div class="row">
<h3> MOTIVATION </h3>
<hr/>
<div class="col-sm-12">
<p>
Most previous storyboard extraction methods [<b>Wang, Zhou, and Tang 2015; Li et al. 2015</b>] rely
only on hand-crafted low-level visual patterns, such as edge segments, line segments or connected components.
They analyze the relationships between these patterns and combine them into
storyboards. These methods work effectively under certain assumptions, but they may fail on
comic images with complex layouts. For example, they cannot handle storyboards with missing
borderlines well, and they tend to fail when storyboards overlap in complex ways.
The main reason is that low-level visual patterns cannot represent image content well.
</p>
<p>
Recently, deep learning methods [<b>Girshick 2015; Liu et al. 2015</b>] have been applied to object
detection and achieve state-of-the-art performance. The effective feature learning capability of deep
neural networks contributes greatly to high-level vision tasks. However, these methods only produce
rectangular bounding boxes, which are not precise enough for many applications. For tasks such as
comic storyboard detection or traffic sign detection, it is better to express the detected results
with a parameterized shape such as a triangle, quadrangle or ellipse.
</p>
</div>
</div>
<div class="row">
<h3> METHOD </h3>
<hr/>
<div class="col-sm-12">
<p>
In this paper, we propose SReN, a novel architecture based on a deep convolutional neural network,
to detect storyboards within comic images. Fig.3 illustrates the architecture of SReN, which
consists of two main steps: generating storyboard proposals and training the shape regression network.
</p>
</div>
<div class="row">
<div align="center">
<img src="img/net_new.png" width=600px>
<div class="caption">
</div>
</div>
</div>
<div align="center">
Fig.3 The architecture of SReN
</div>
<div>
<h4>Generating storyboard proposals</h4>
<p>
We use comic images to train a Fast R-CNN model to detect rectangular storyboard bounding boxes <i>r</i>.
We use Fast R-CNN rather than Faster R-CNN [<b>Ren et al. 2015</b>], which outperforms Fast R-CNN
in many challenges such as Pascal VOC or COCO, because Faster R-CNN performs poorly when high localization accuracy is required.
Since Fast R-CNN can only generate rectangular bounding boxes, we use the exterior rectangles of the storyboards as ground
truth. These bounding boxes often miss some parts of a storyboard, so to cover the
complete storyboard we enlarge <i>r</i> by a factor of 1.1 to generate the storyboard proposal <i>p</i>, which is the input of our SReN.
Another problem is that storyboards vary widely in size. To reduce this interference, we
normalize the vertexes of storyboard proposals into <i>[-1, 1]</i>, that is
</p>
<div align="center">
<img src="http://latex.codecogs.com/svg.latex?x'= \frac{x-c_x}{w}, \quad y'=\frac{y-c_y}{h}"/>
</div>
<p>
where <i>(x', y')</i> is a vertex of the regression target,
<i>(x, y)</i> is the corresponding vertex of the original storyboard, <i>(c_x, c_y)</i> is the center of the
storyboard, and <i>w</i> and <i>h</i> are the width and the height of the storyboard proposal.
</p>
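<p>
The enlargement and normalization steps above can be sketched in a few lines of Python. This is only an
illustrative sketch, not the authors' code: the function names, the <code>(x_min, y_min, x_max, y_max)</code>
box layout and the sample coordinates are hypothetical, and the center <i>(c_x, c_y)</i> is assumed to be
taken from the proposal box.
</p>
<pre>
```python
def enlarge_box(box, factor=1.1):
    """Scale an axis-aligned box (x_min, y_min, x_max, y_max) about its center."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half_w = (x_max - x_min) * factor / 2.0
    half_h = (y_max - y_min) * factor / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def normalize_vertices(vertices, proposal):
    """Apply x' = (x - c_x) / w and y' = (y - c_y) / h to each vertex.

    Assumption: (c_x, c_y), w and h are all taken from the proposal box.
    """
    x_min, y_min, x_max, y_max = proposal
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    w, h = x_max - x_min, y_max - y_min
    return [((x - cx) / w, (y - cy) / h) for x, y in vertices]


def denormalize_vertices(normalized, proposal):
    """Invert normalize_vertices to map regressed vertexes back to pixels."""
    x_min, y_min, x_max, y_max = proposal
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    w, h = x_max - x_min, y_max - y_min
    return [(xn * w + cx, yn * h + cy) for xn, yn in normalized]


# Hypothetical example: a detected rectangle r and a slightly skewed storyboard.
r = (100.0, 200.0, 300.0, 600.0)
p = enlarge_box(r)
quad = [(105.0, 210.0), (298.0, 205.0), (300.0, 598.0), (102.0, 600.0)]
targets = normalize_vertices(quad, p)
```
</pre>
<p>
Under this assumption, every vertex that lies inside the proposal maps to a value between -0.5 and 0.5,
which is within the <i>[-1, 1]</i> range mentioned above.
</p>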
</div>
<div>
<h4>Training shape regression network</h4>
<p>
First, we sort the regression targets
<img src="http://latex.codecogs.com/svg.latex?x'= \{\vec{t_1}, \vec{t_2},..., \vec{t_n}\}"/>,
where <img src="http://latex.codecogs.com/svg.latex?\vec{t_i}=(t_{ix}, t_{iy})"/>, of each storyboard by
polar angle.
Then we feed the proposals <i>p</i> and their regression targets to the VGG16 network [<b>Simonyan and Zisserman
2015</b>]
to obtain a 4096-dimensional feature <i>f</i>. Finally, we add a fully connected layer to regress the vertexes
of the storyboard <img src="http://latex.codecogs.com/svg.latex?\{\vec{s_1}, \vec{s_2},..., \vec{s_n}\}"/>,
where
<img src="http://latex.codecogs.com/svg.latex?\vec{s_i}=(s_{ix}, s_{iy})"/>. Since storyboards are quadrangles,
we set <i>n = 4</i>. Like Fast R-CNN, we use the loss function:
</p>
<div align="center">
<img src="http://latex.codecogs.com/svg.latex?L(\{\vec{t_i}\}, \{\vec{s_i}\}) = \sum_{i=1}^n \left[ smooth_{L_1}(t_{ix} - s_{ix}) + smooth_{L_1}(t_{iy} - s_{iy}) \right]"
border="0"/>,
</div>
<p>where</p>
<div align="center">
<img src="http://latex.codecogs.com/svg.latex?smooth_{L_{1}}(a) = \begin{cases} 0.5a^2, & \mbox{if }\left|a\right| < 1 \\
\left|a\right| - 0.5, & \mbox{otherwise}. \\ \end{cases}"/>
</div>
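<p>
The target ordering and the loss above can be illustrated with a short Python sketch. It is not the
authors' Caffe implementation: the function names and sample vertexes are hypothetical, and the polar angle
is assumed to be measured about the origin of the normalized coordinates.
</p>
<pre>
```python
import math


def sort_by_polar_angle(vertices):
    """Order normalized vertexes by atan2(y, x), measured about the origin.

    Assumption: the origin is the proposal center, as after normalization.
    """
    return sorted(vertices, key=lambda v: math.atan2(v[1], v[0]))


def smooth_l1(a):
    """0.5 * a^2 when |a| is below 1, and |a| - 0.5 otherwise."""
    abs_a = abs(a)
    return abs_a - 0.5 if abs_a >= 1.0 else 0.5 * a * a


def shape_loss(targets, predictions):
    """Sum of smooth L1 terms over the x and y coordinates of all n vertexes."""
    return sum(
        smooth_l1(tx - sx) + smooth_l1(ty - sy)
        for (tx, ty), (sx, sy) in zip(targets, predictions)
    )


# Hypothetical normalized quadrangle (n = 4) and a slightly-off prediction.
t = sort_by_polar_angle([(0.4, 0.45), (-0.45, 0.4), (0.45, -0.4), (-0.4, -0.45)])
s = [(x + 0.1, y - 0.1) for x, y in t]
loss = shape_loss(t, s)
```
</pre>
<p>
Sorting gives every storyboard the same vertex order, so that the <i>i</i>-th target is always compared
with the <i>i</i>-th regressed vertex in the loss.
</p>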
</div>
<div>
<h4>Implementation details</h4>
<p>
Our implementation uses the Caffe framework [<b>Jia et al. 2014</b>], and we train SReN
on a Titan X GPU.
</p>
<p>
When training Fast R-CNN, we treat all region proposals whose IoU overlap with a ground-truth box exceeds 0.8 as
positives and the rest as negatives. We start SGD at a learning rate of 0.001. Each SGD mini-batch
contains 128 proposals, consisting of 32 positives and 96 negatives. When training the regression network, we start SGD at a learning rate of
0.001 and divide it by ten after every 20000 iterations, with a batch size of 64.
</p>
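<p>
The learning rate schedule of the regression network amounts to a step decay. A minimal Python sketch is
given below. The function and parameter names (<code>step_learning_rate</code>, <code>gamma</code>,
<code>step_size</code>) are illustrative assumptions and are not taken from the training configuration.
</p>
<pre>
```python
def step_learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=20000):
    """Learning rate after dividing base_lr by ten every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


# Batch composition stated for Fast R-CNN training: 32 positives + 96 negatives.
FAST_RCNN_BATCH = {"positives": 32, "negatives": 96}
assert sum(FAST_RCNN_BATCH.values()) == 128

# Print the rate just before and after each decay boundary.
for it in (0, 19999, 20000, 40000):
    print(it, step_learning_rate(it))
```
</pre>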
</div>
</div>
<div class="row">
<h3> EXPERIMENT </h3>
<hr/>
<div class="col-sm-12">
<div id="dataset">
<h4>Dataset</h4>
<p>
We construct a dataset of 29845 labeled comic pages (containing 169421 storyboards) from 103
different comic books, drawn from Japanese and Hong Kong comics. We randomly
select 15087 of the labeled pages to train SReN, use another 7375 pages to
validate the training result, and conduct experiments on the remaining 7382 pages.
The ratio of training, validation and test data is therefore about <b>5 : 2.5 : 2.5</b>.
</p>
</div>
<div id="metrics">
<h4>Evaluation criteria</h4>
<p>
We evaluate results on two levels: the storyboard level and the page level.
On the storyboard level, we use precision, recall and F1 score as evaluation metrics. On the page
level, we use the page correction rate (PCR) as the evaluation criterion, i.e., the ratio of comic pages in
which all storyboards are correctly detected. More specifically, each detected storyboard is represented
by a quadrangle. If the intersection-over-union (IoU) between a detected storyboard and the
corresponding ground truth is more than 90%, we regard it as a correctly detected storyboard for the comic page.
The IoU between each ground truth and the detected storyboards is defined as follows:
</p>
<div align="center">
<img src="http://latex.codecogs.com/svg.latex?IoU=\max_i \frac{p\cap D_i}{p\cup D_i}"/>
</div>
<p>
where
<img src="http://latex.codecogs.com/svg.latex?D_i"/>
ranges over the detected storyboards in the page and
<img src="http://latex.codecogs.com/svg.latex?p"/>
is the manually labeled storyboard. The intersection and the union
are calculated in terms of area.
</p>
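<p>
The criteria above can be sketched in Python as follows. This is a sketch under stated assumptions rather
than the evaluation code used for Table 1: quadrangles are assumed to be convex so that Sutherland-Hodgman
clipping can compute the intersection, a page is assumed to count as correct only when every ground truth is
matched and the number of detections equals the number of ground truths, and all function names are hypothetical.
</p>
<pre>
```python
def signed_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    total = 0.0
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def cross(a, b, p):
    """Positive when p lies to the left of the directed line a to b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip_convex(subject, clip):
    """Sutherland-Hodgman: clip polygon subject against convex polygon clip."""
    clip = list(clip)
    if 0 > signed_area(clip):
        clip.reverse()  # the inside test below expects counter-clockwise order
    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        points, output = output, []
        for j in range(len(points)):
            p, q = points[j - 1], points[j]
            dp, dq = cross(a, b, p), cross(a, b, q)
            if (dp >= 0) != (dq >= 0):  # edge p-q crosses the clip line
                t = dp / (dp - dq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if dq >= 0:
                output.append(q)
    return output


def quad_iou(p, d):
    """Area of intersection over area of union for two convex quadrangles."""
    inter = abs(signed_area(clip_convex(p, d)))
    union = abs(signed_area(p)) + abs(signed_area(d)) - inter
    return inter / union if union > 0 else 0.0


def evaluate(pages, threshold=0.9):
    """pages: list of (ground_truths, detections) pairs, each a list of quads."""
    correct = n_gt = n_det = pages_ok = 0
    for gts, dets in pages:
        # A ground truth is matched when its best IoU over detections passes.
        hits = sum(
            1 for p in gts
            if max((quad_iou(p, d) for d in dets), default=0.0) > threshold
        )
        correct += hits
        n_gt += len(gts)
        n_det += len(dets)
        if hits == len(gts) and len(dets) == len(gts):
            pages_ok += 1
    precision = correct / n_det if n_det else 0.0
    recall = correct / n_gt if n_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    pcr = pages_ok / len(pages) if pages else 0.0
    return precision, recall, f1, pcr


# Hypothetical two-page example: one exact detection and one shifted detection.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
shifted = [(1, 0), (3, 0), (3, 2), (1, 2)]
precision, recall, f1, pcr = evaluate([([square], [square]), ([square], [shifted])])
```
</pre>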
</div>
<div id="results">
<h4>Results</h4>
<p>
We compare our method with Fast R-CNN without shape regression and with two methods based on low-level
visual patterns: TCRF [<b>Li et al. 2015</b>] and ESA [<b>Wang, Zhou, and Tang 2015</b>]. Experimental
results are listed in Table 1, which indicates that:
</p>
<ul>
<li>The deep learning based methods achieve higher recall, F1 score and PCR than the methods based on
hand-crafted low-level visual patterns.
</li>
<li>Fast R-CNN with SReN outperforms vanilla Fast R-CNN on all four metrics, owing to the shape regression.</li>
</ul>
<div align="center">
<table class="table table-striped">
<thead>
<tr>
<th style="text-align: center">Method</th>
<th style="text-align: center">Precision</th>
<th style="text-align: center">Recall</th>
<th style="text-align: center">F1 score</th>
<th style="text-align: center">PCR</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">ESA</td>
<td style="text-align: center">0.835</td>
<td style="text-align: center">0.700</td>
<td style="text-align: center">0.762</td>
<td style="text-align: center">0.418</td>
</tr>
<tr>
<td style="text-align: center">TCRF</td>
<td style="text-align: center">0.699</td>
<td style="text-align: center">0.640</td>
<td style="text-align: center">0.668</td>
<td style="text-align: center">0.399</td>
</tr>
<tr>
<td style="text-align: center">Fast R-CNN</td>
<td style="text-align: center">0.807</td>
<td style="text-align: center">0.799</td>
<td style="text-align: center">0.803</td>
<td style="text-align: center">0.518</td>
</tr>
<tr>
<td style="text-align: center">SReN</td>
<td style="text-align: center"><strong>0.888</strong></td>
<td style="text-align: center"><strong>0.879</strong></td>
<td style="text-align: center"><strong>0.883</strong></td>
<td style="text-align: center"><strong>0.640</strong></td>
</tr>
</tbody>
</table>
<div align="center">
Table 1: Results on test dataset
</div>
</div>
<p>
Sample extraction results on comic pages are shown below:
</p>
<div class="row">
<div class="col-xs-4 ">
<img src="img/res1.jpg" height=400px>
</div>
<div class="col-xs-4">
<img src="img/res2.jpg" height=400px>
</div>
<div class="col-xs-4">
<img src="img/res3.jpg" height=400px>
</div>
</div>
</div>
</div>
</div>
<div class="row">
<h3> DISCUSSION </h3>
<hr/>
<div class="col-sm-12">
<p>
In this paper, we propose SReN, a novel deep architecture to detect storyboards within comic images.
The main contribution is a shape regression network that regresses the vertexes
of quadrilateral storyboards in comic pages. Experimental results demonstrate that SReN performs
better than two state-of-the-art storyboard extraction methods and an object detection method.
In the future, we will test our idea on other shapes, such as ellipses and triangles, which can be used in
traffic sign detection and other applications. We also want to investigate how to design an end-to-end
model that automatically detects and regresses the target shape.
</p>
</div>
</div>
<div class="row">
<h4> REFERENCES </h4>
<hr/>
<div class="col-sm-12">
<dl class="dl-horizontal">
<dt>wang2015comic</dt>
<dd>Wang Y, Zhou Y, Tang Z. Comic frame extraction via line segments combination[C]//Document Analysis
and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015: 856-860.
</dd>
</dl>
<dl class="dl-horizontal">
<dt>li2015tree</dt>
<dd>Li L, Wang Y, Suen C Y, et al. A tree conditional random field model for panel detection in comic
images[J]. Pattern Recognition, 2015, 48(7): 2129-2140.
</dd>
</dl>
<dl class="dl-horizontal">
<dt>liu2015ssd</dt>
<dd>Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint
arXiv:1512.02325, 2015.
</dd>
</dl>
<dl class="dl-horizontal">
<dt>girshick2015fast</dt>
<dd>Girshick R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision.
2015: 1440-1448.
</dd>
</dl>
<dl class="dl-horizontal">
<dt>simonyan2015very</dt>
<dd>Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J].
arXiv preprint arXiv:1409.1556, 2014.
</dd>
</dl>
<dl class="dl-horizontal">
<dt>jia2014caffe</dt>
<dd>Jia Y, Shelhamer E, Donahue J, et al. Caffe: Convolutional architecture for fast feature embedding[C]
//Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014: 675-678.
</dd>
</dl>
<dl class="dl-horizontal">
<dt>ren2015faster</dt>
<dd>Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015: 91-99.</dd>
</dl>
</div>
</div>
</div>
<!-- FOOTER -->
<footer class='main'>
<p>Edited by Zheqi He, 2016</p>
</footer>
<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
</body>
</html>