# Getting started {#sec-getting-started}
```{r}
#| echo: false
library(knitr)
knit_print.gt <- function(x, ...) {
  # Two steps to avoid most Quarto changes of my table styles:
  # 1. as_raw_html() to use table styles *inline*
  # 2. wrap output in a div that resets all Quarto styles
  stringr::str_c(
    "<div style='all:initial;'>\n",
    gt::as_raw_html(x),
    "\n</div>"
  ) |>
    knitr::asis_output()
}
registerS3method(
  "knit_print", 'gt_tbl', knit_print.gt,
  envir = asNamespace("gt")
  # important to overwrite {gt}s knit_print
)
```
In this chapter, we're going to do two things:
1. Learn simple guidelines for better tables
2. Implement them with `{gt}`
And of course we will need data for that.
I like penguins, so we're going to use the fabulous `penguins` data set from `{palmerpenguins}`.
```{r}
#| message: false
#| warning: false
library(tidyverse)
penguins <- palmerpenguins::penguins |>
  filter(!is.na(sex))
penguins
```
Using this data, let us count the penguins.
These counts will serve as a simple data set to practice table building.
```{r}
penguin_counts <- penguins |>
  mutate(year = as.character(year)) |>
  group_by(species, island, sex, year) |>
  summarise(n = n(), .groups = 'drop')
penguin_counts
```
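
As an aside, the `group_by()` + `summarise(n = n())` pattern above has a shortcut in `{dplyr}`, namely `count()`. Here is a minimal sketch on a small made-up tibble (the `toy` data is hypothetical and only there to keep the example self-contained).

```{r}
library(dplyr)

# Hypothetical toy data, not part of the penguins data set
toy <- tibble(
  species = c('Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'),
  island  = c('Dream', 'Dream', 'Biscoe', 'Biscoe', 'Biscoe')
)

# Explicit version: group, then count the rows per group
counts_long <- toy |>
  group_by(species, island) |>
  summarise(n = n(), .groups = 'drop')

# Shortcut: count() groups, counts and ungroups in one step
counts_short <- toy |>
  count(species, island)

counts_short
```

Both versions should give the same counts, so which one you use is mostly a matter of taste.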
In an actual table, the data would probably be rearranged a bit.
There's nothing wrong with this long (i.e. many rows) data format.
In fact, this format is great for data analysis.
But in a table that is meant to be read by humans, not machines, you'll probably go with a wider format.
```{r}
penguin_counts_wider <- penguin_counts |>
  pivot_wider(
    names_from = c(species, sex),
    values_from = n
  ) |>
  # Make missing numbers (NAs) into zero
  mutate(across(.cols = -(1:2), .fns = ~replace_na(., replace = 0))) |>
  arrange(island, year)
penguin_counts_wider
```
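
If you know up front that missing combinations should become zero, `pivot_wider()` can probably save you the extra `mutate()` step via its `values_fill` argument. The following is only a sketch on hypothetical toy counts (`toy_counts` is made up for illustration).

```{r}
library(tidyr)

# Hypothetical toy counts; species 'y' was never seen on island 'B'
toy_counts <- tibble::tibble(
  island  = c('A', 'A', 'B'),
  species = c('x', 'y', 'x'),
  n       = c(1L, 2L, 3L)
)

toy_wide <- toy_counts |>
  pivot_wider(
    names_from = species,
    values_from = n,
    values_fill = 0L # fill missing combinations with zero right away
  )

toy_wide
```

In this chapter I stick with the explicit `replace_na()` step because we will want the `NA`s back later on.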
Now, let's put this into a table.
Not too long ago I would have probably visualized the data with a table like this:
![Ugh. A not so sexy table of our `penguin_counts_wider` data set created by yours truly with LibreOffice Calc (spreadsheet software - double ugh. Though, compared to Excel it's open-source. So maybe 1.5 ugh?)](img/stupid_table_screenshot.png){#fig-terrible-tbl fig-align="center" width="90%"}
Ugh.
This is not a sexy table.
I get bored just looking at that.
So let's improve this table.
To do so, here are the 6 guidelines that will, well, guide us.
1. Avoid vertical lines
2. Use better column names
3. Align columns
4. Use groups instead of repetitive columns
5. Remove missing numbers
6. Add summaries
## Avoid vertical lines
This is the guideline that gives you the biggest bang for your buck.
The above table uses waaaay too many grid lines.
Without vertical lines, the table will look less cramped.
Thankfully, `{gt}` seems to live by this rule as it is implemented by default.
Thus, we only need to pass our data set `penguin_counts_wider` to `gt()`.
You can think of this function as the `ggplot()` analogue:
It's the starting point of any table in the `{gt}` universe.
```{r}
library(gt)
penguin_counts_wider |>
  gt()
```
This isn't a great table yet but it's a start.
In any case, it feels more open due to fewer grid lines.
Of course, the column labels could be better which brings us to our next point.
## Use better column names
To change the column names, use the "layer" called `cols_label()`.
Much like `{ggplot2}`, `{gt}` works with layers.
To change anything about the table, we just pass the table from one layer to the next.
This works with piping.
Armed with that knowledge, we could label the columns like we did in [@fig-terrible-tbl].
```{r}
penguin_counts_wider |>
  gt() |>
  cols_label(
    island = 'Island',
    year = 'Year',
    Adelie_female = 'Adelie (female)',
    Adelie_male = 'Adelie (male)',
    Chinstrap_female = 'Chinstrap (female)',
    Chinstrap_male = 'Chinstrap (male)',
    Gentoo_female = 'Gentoo (female)',
    Gentoo_male = 'Gentoo (male)',
  )
```
But this isn't a great way to label the columns.
So let's do something else instead.
First, let us create so-called **spanners**.
These are joined columns and can be created with `tab_spanner()` layers.
You'll need one layer for each spanner.
```{r}
penguin_counts_wider |>
  gt() |>
  cols_label(
    island = 'Island',
    year = 'Year',
    Adelie_female = 'Adelie (female)',
    Adelie_male = 'Adelie (male)',
    Chinstrap_female = 'Chinstrap (female)',
    Chinstrap_male = 'Chinstrap (male)',
    Gentoo_female = 'Gentoo (female)',
    Gentoo_male = 'Gentoo (male)',
  ) |>
  tab_spanner(
    label = md('**Adelie**'),
    columns = 3:4
  ) |>
  tab_spanner(
    label = md('**Chinstrap**'),
    columns = c('Chinstrap_female', 'Chinstrap_male')
  ) |>
  tab_spanner(
    label = md('**Gentoo**'),
    columns = contains('Gentoo')
  )
```
As you can see, `tab_spanner()` always requires two arguments `label` and `columns`.
For the `columns` argument I have shown you three ways to get the job done:
1. Vector of column numbers
2. Vector of column names
3. [tidyselect helpers](https://tidyselect.r-lib.org/reference/language.html)
For the `label` argument you can either just state a `character` vector or you can wrap one in `md()` to enable Markdown syntax (like `**bold text**`).
Okay, now we don't really need the Species labels in the actual column names anymore.
The spanners already state that for us.
So, let us modify our previous code to rename the columns.
To do so, let me show you a cool trick that may save you some tedious typing.
First, we create a **named** vector that contains the actual and the desired column names.
```{r}
actual_colnames <- colnames(penguin_counts_wider)
actual_colnames
desired_colnames <- actual_colnames |>
  str_remove('(Adelie|Gentoo|Chinstrap)_') |>
  str_to_title()
names(desired_colnames) <- actual_colnames
desired_colnames
```
Then, we can use this named vector as the `.list` argument in `cols_label()`.
```{r}
penguin_counts_wider |>
  gt() |>
  cols_label(.list = desired_colnames) |>
  tab_spanner(
    label = md('**Adelie**'),
    columns = 3:4
  ) |>
  tab_spanner(
    label = md('**Chinstrap**'),
    columns = c('Chinstrap_female', 'Chinstrap_male')
  ) |>
  tab_spanner(
    label = md('**Gentoo**'),
    columns = contains('Gentoo')
  )
```
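
As an aside, when column names already follow a pattern like `species_sex`, `{gt}` also offers `tab_spanner_delim()`. If I read its documentation correctly, it splits each column name at a delimiter and builds the spanners and column labels from the pieces. Here is a rough sketch on a hypothetical toy table (the data is made up, and you lose the fine control over bold labels and capitalization that we have with the manual approach).

```{r}
library(gt)

# Hypothetical toy table whose column names follow the species_sex pattern
toy_tbl <- data.frame(
  year = c('2007', '2008'),
  Adelie_female = c(5L, 7L),
  Adelie_male = c(6L, 8L),
  Gentoo_female = c(3L, 4L),
  Gentoo_male = c(2L, 5L)
) |>
  gt() |>
  # Text before '_' becomes the spanner, text after it the column label
  tab_spanner_delim(delim = '_')

toy_tbl
```

For the rest of this chapter, I'll keep the explicit `tab_spanner()` layers because they make each step visible.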
Finally, while we're currently changing labels, let us add one important label - the title.
The `tab_header()` layer does the trick.
```{r}
penguin_counts_wider |>
  gt() |>
  cols_label(.list = desired_colnames) |>
  tab_spanner(
    label = md('**Adelie**'),
    columns = 3:4
  ) |>
  tab_spanner(
    label = md('**Chinstrap**'),
    columns = c('Chinstrap_female', 'Chinstrap_male')
  ) |>
  tab_spanner(
    label = md('**Gentoo**'),
    columns = contains('Gentoo')
  ) |>
  tab_header(
    title = 'Penguins in the Palmer Archipelago',
    subtitle = 'Data is courtesy of the {palmerpenguins} R package'
  )
```
By the same trick we could also add a caption for a Quarto document (`tab_caption()`), a footnote (`tab_footnote()`) or another source note (`tab_source_note()`).
In this case it's a bit much, though.
So I won't add them.
Just know that these functions exist in case you need them.
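
In case you want to see two of these layers in action, here is a minimal sketch on a hypothetical toy table (the data and the note texts are made up for illustration).

```{r}
library(gt)

# Hypothetical toy data just to demonstrate the two layers
notes_tbl <- data.frame(
  island = c('Biscoe', 'Dream'),
  n = c(10L, 20L)
) |>
  gt() |>
  # Attach a footnote mark to the column label of `n`
  tab_footnote(
    footnote = 'Made-up counts for illustration.',
    locations = cells_column_labels(columns = n)
  ) |>
  # Add a source note below the table
  tab_source_note(source_note = 'Source: toy data, not the real penguins.')

notes_tbl
```

The `locations` argument decides where the footnote mark goes. Without it, the footnote should simply appear below the table without a mark.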
For now, let us talk about our next guideline.
Before we can do that, let me mention one small thing: Our spanners and headers will not change as we move along this tutorial.
To avoid repeating them all the time, let me wrap them in a function.
```{r}
spanners_and_header <- function(gt_tbl) {
  gt_tbl |>
    tab_spanner(
      label = md('**Adelie**'),
      columns = 3:4
    ) |>
    tab_spanner(
      label = md('**Chinstrap**'),
      columns = c('Chinstrap_female', 'Chinstrap_male')
    ) |>
    tab_spanner(
      label = md('**Gentoo**'),
      columns = contains('Gentoo')
    ) |>
    tab_header(
      title = 'Penguins in the Palmer Archipelago',
      subtitle = 'Data is courtesy of the {palmerpenguins} R package'
    )
}
# This produces the same output
penguin_counts_wider |>
  gt() |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header()
```
## Align columns
Did you notice that `gt()` aligned the columns differently?
That's because the columns of the corresponding `data.frame`/`tibble` contained different data types.
Specifically:
- the counts are `integers` and aligned to the right
- the `year` column is a `character` vector and uses alignment to the left (though it's not totally visible because the column is narrow)
- the `island` column is a `factor` and uses center alignment (even though its entries are `character`s)
It's a good default to align numbers to the right and texts to the left.
Why?
Because it's more readable.
Need an example?
Here's one.
Most (western) people will probably say that the left column is the easiest to read because we read from left to right.
```{r}
#| echo: false
#| message: false
size <- 5
read_csv2('data/ratios.csv') |>
  mutate(location = str_remove(location, ' Location')) |>
  ggplot() +
  geom_text(
    aes(x = 0, y = seq_along(location), label = location),
    hjust = 0,
    size = size,
    color = 'grey20'
  ) +
  geom_text(
    aes(x = 3, y = seq_along(location), label = location),
    hjust = 0.5,
    size = size,
    color = 'grey20'
  ) +
  geom_text(
    aes(x = 6, y = seq_along(location), label = location),
    hjust = 1,
    size = size,
    color = 'grey20'
  ) +
  coord_cartesian(xlim = c(0, 6)) +
  theme_void()
```
For numbers it's the other way around.
That's because right-aligned numbers make it easy to see how many digits a number has compared to other numbers.
This assumes that your numbers use a font that assigns equal width to all digits (e.g. a monospace font).
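
You can try this out in the R console, which usually uses a monospace font. Here is a small sketch with made-up numbers where `formatC()` pads each number to the same width, once on the left (right-aligned) and once on the right (left-aligned).

```{r}
# Hypothetical counts with different numbers of digits
counts <- c(5L, 42L, 1234L)

# Right-aligned: digits of the same magnitude line up
right_aligned <- formatC(counts, width = 6, format = 'd')

# Left-aligned: the flag '-' pads on the right instead
left_aligned <- formatC(counts, width = 6, format = 'd', flag = '-')

writeLines(right_aligned)
writeLines(left_aligned)
```

In the right-aligned version, the ones, tens and hundreds sit below each other, so the largest number is easy to spot.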
So, let us align the `island` and `year` columns.
We can do this either by transforming the data types before even calling `gt()`
or by using the `cols_align()` layer.
Once again, this layer understands text locations and tidyselection helpers.
::: panel-tabset
## Conversion
```{r}
penguin_counts_wider |>
  mutate(
    island = as.character(island),
    year = as.numeric(year)
  ) |>
  gt() |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header()
```
## Align
```{r}
penguin_counts_wider |>
  gt() |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header() |>
  cols_align(align = 'right', columns = 'year') |>
  cols_align(
    align = 'left',
    columns = where(is.factor)
  )
```
:::
## Use groups instead of repetitive columns
The `island` column is somewhat repetitive.
In cases like these, I'd rather remove the column.
Instead, I would group the table using additional rows.
I like to think that this comes with better readability.
With `{gt}`, this grouping is easy.
We only need to specify the `groupname_col` argument in `gt()`.
If we want, we can also set the `rowname_col` argument to `year`.
This will format the "Year" column a bit differently.
::: panel-tabset
## Groups
```{r}
penguin_counts_wider |>
  mutate(
    island = as.character(island),
    year = as.numeric(year)
  ) |>
  gt(groupname_col = 'island') |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header()
```
## Groups + row names
```{r}
penguin_counts_wider |>
  mutate(
    island = as.character(island),
    year = as.numeric(year)
  ) |>
  gt(groupname_col = 'island', rowname_col = 'year') |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header()
```
:::
In this case, I prefer the latter style because we don't really need a "Year" label to identify 2007, 2008 and 2009 as years.
But an island label could be nice (I'm really bad with geography).
The easiest way to add that to the group names is via string manipulation before `gt()` is called.
```{r}
penguin_counts_wider |>
  mutate(
    island = as.character(island),
    year = as.numeric(year),
    island = paste0('Island: ', island)
  ) |>
  gt(groupname_col = 'island', rowname_col = 'year') |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header()
```
## Remove missing numbers
Notice that our table has a lot of zeroes in it.
For better readability, let us replace the zeroes with something more lightweight.
We accomplish this with the `sub_zero()` layer.
```{r}
penguin_counts_wider |>
  mutate(
    island = as.character(island),
    year = as.numeric(year),
    island = paste0('Island: ', island)
  ) |>
  gt(groupname_col = 'island', rowname_col = 'year') |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header() |>
  sub_zero(zero_text = '-')
```
There are more `sub_*()` functions in `{gt}`.
We will learn about them in [@sec-sub-functions].
## Add summaries
Now, this table already looks cleaner than what we started with.
In this format, we could even add **more information** at little cost.
For example, we could add a summary for each group.
In this case, a summary could be as simple as a total or maximum over all years (we'll just assume that this makes sense for our penguin data).
Here, the key layer is `summary_rows()`.
Let's have a look at what it can produce and then I'll explain.
```{r}
penguin_counts_wider |>
  mutate(
    island = as.character(island),
    year = as.numeric(year),
    island = paste0('Island: ', island)
  ) |>
  gt(groupname_col = 'island', rowname_col = 'year') |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header() |>
  sub_zero(zero_text = '-') |>
  summary_rows(
    groups = TRUE,
    fns = list(
      'Maximum' = ~max(.),
      'Total' = ~sum(.)
    ),
    formatter = fmt_number,
    decimals = 0
  )
```
The `summary_rows()` function works with a named list of functions (one function for each summary).
As you've seen, you can create one using `list('Name' = ~fct(.))`.
In this case, `.` represents the column data.
All other arguments can be named as usual.
For example, you could do something like `~mean(., na.rm = TRUE)`.[^getting_started-1]
[^getting_started-1]: Currently, this is the only possible way to define functions in `summary_rows()`.
The documentation of this function says something different but this is a [known issue](https://github.com/rstudio/gt/issues/921).
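
If the `~fct(.)` notation is new to you, it may help to see such a formula turned into an ordinary function. The sketch below uses `rlang::as_function()` for that. Whether `{gt}` converts the formulas in exactly this way internally is an assumption on my part, but the idea is the same: `.` stands for the data that gets passed in.

```{r}
# Turn a formula into a function; `.` is a placeholder for the input
mean_without_na <- rlang::as_function(~ mean(., na.rm = TRUE))

# Apply it to a made-up vector with one missing value
mean_without_na(c(4, NA, 8))
```

Here, the `NA` is dropped before the mean of 4 and 8 is computed.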
Notice that I had to set `groups = TRUE`.
Otherwise, we would get summary rows at the end of the table (using all data).
This is also known as a "grand summary".
Further, the output of the summary function had to be formatted.
By default, the output would contain two decimals.
So, we'd get numbers like `9.00`.
Here, `fmt_number()` is the formatter that corrected that.
But we had to tell it to use `decimals = 0`.
We'll learn more about the `fmt_*()` family in [@sec-fmt-functions].
Now that we've added more information to the table, it has become quite long.
We can amend that by reducing the row heights.
Frankly, they have been too large for my taste for some time now.
To do so, we could set the so-called `data_row.padding` to 2 pixels.
This is done with `tab_options()`, the premier layer to style the table[^getting_started-2].
Similarly, there are padding options for `summary_row` and `row_group`[^getting_started-3]. And while we're at it, why not apply a pre-defined theme to our table with `opt_stylize()`?
[^getting_started-2]: Basically, this is the analogue of `theme()` in `{ggplot2}`.
[^getting_started-3]: I'm not sure why it's not `group_row` but we'll just go with it.
![](img/be-the-leaf-dance.gif)
```{r}
penguin_counts_wider |>
  mutate(
    island = as.character(island),
    year = as.numeric(year),
    island = paste0('Island: ', island)
  ) |>
  gt(groupname_col = 'island', rowname_col = 'year') |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header() |>
  sub_zero(zero_text = '-') |>
  summary_rows(
    groups = TRUE,
    fns = list(
      'Maximum' = ~max(.),
      'Total' = ~sum(.)
    ),
    formatter = fmt_number,
    decimals = 0
  ) |>
  tab_options(
    data_row.padding = px(2),
    summary_row.padding = px(3), # A bit more padding for summaries
    row_group.padding = px(4)    # And even more for our groups
  ) |>
  opt_stylize(style = 6, color = 'gray')
```
This has been a little foretaste of styling a table.
We'll learn more about changing a table's theme in [@sec-theming].
Finally, let me address the big inconsistency in the room.
We have replaced the zeroes with `-` earlier.
However, the summary rows still display `0`.
Unfortunately, there is no `sub_zero()` function that targets the summary rows.
So, we'll do something else instead.
In our data set we have replaced all `NA`s with zero.
But we didn't have to do that.
We could just let them be `NA`s and use `sub_missing()` to replace them.
In `summary_rows()`, we could then use `missing_text = "-"`.
I think you get the idea, so I'm just going to fold the code (so you can focus on the result).
```{r}
#| code-fold: true
penguin_counts_wider |>
  mutate(across(.cols = -(1:2), ~if_else(. == 0, NA_integer_, .))) |>
  mutate(
    island = as.character(island),
    year = as.numeric(year),
    island = paste0('Island: ', island)
  ) |>
  gt(groupname_col = 'island', rowname_col = 'year') |>
  cols_label(.list = desired_colnames) |>
  spanners_and_header() |>
  sub_missing(missing_text = '-') |>
  summary_rows(
    groups = TRUE,
    fns = list(
      'Maximum' = ~max(.),
      'Total' = ~sum(.)
    ),
    formatter = fmt_number,
    decimals = 0,
    missing_text = '-'
  ) |>
  tab_options(
    data_row.padding = px(2),
    summary_row.padding = px(3), # A bit more padding for summaries
    row_group.padding = px(4)    # And even more for our groups
  ) |>
  opt_stylize(style = 6, color = 'gray')
```
## Summary
We've started this chapter with a terrible table that needed improvement.
Over the course of this chapter, we learned and applied six guidelines with `{gt}`.
These guidelines were
1. Avoid vertical lines
2. Use better column names
3. Align columns
4. Use groups instead of repetitive columns
5. Remove missing numbers
6. Add summaries
In the table business, these guidelines are pretty basic.
I don't mean basic in a bad or boring way.
It's just that these are solid recommendations that improve tables without any fancy stuff.
No icons, no pictures, no other eye-catching elements.
Just plain data formatted carefully.
So now we've learned the basics.
No need to stop there.
Let's learn the fancy stuff too.
That's what we'll do in the next chapter.