Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review: Ch 3 (wrangling) #103

Closed
leem44 opened this issue Mar 28, 2021 · 8 comments · Fixed by #235
Closed

Review: Ch 3 (wrangling) #103

leem44 opened this issue Mar 28, 2021 · 8 comments · Fixed by #235

Comments

@leem44
Copy link
Contributor

leem44 commented Mar 28, 2021

Reviewer E:

  • Chapter 3 could benefit by making sure it is encouraging students to use dplyr in the most effective ways.
    • In Section 3.6, I would have liked to see the scoped verbs (e.g. select_at, summarize_if)
    • select helpers (e.g. matches, starts_with) covered since this can make coding much more efficient.
    • In fact, summarize_all(), now meets the need that is used to motivate the introduction of the purrr package. More broadly, Section 3.7 should be evaluated to make sure it is using current best practices following the release of dplyr 1.0.0. dplyr was enhanced with the goal of making purrr less necessary when working with tabular data. In addition to the summarize_all() option that I mentioned above, mutate() now vectorizes many functions so one can write mutate(data, x = f(w)) instead of mutate(data, x = map(w, f)). In my experience, purrr seems to confuse newer users, so it may be worth striving to cut this section and using the new dplyr approaches instead.
  • Ch titles/subtitles: Section 3.6-3.7: Both of these use “iterate” in the title to describe the group_by+summarize workflow as well as the purrr workflow. This may lead to confusion on how the two methods are different. I would tend to call group_by+summarize something like “aggregation” or “subgroup analysis/summarization” to distinguish from mapping. I think iteration is a better description for purrr since it works more like traditional for-loop style iteration (same number of elements returned as inputted.)
  • In 3.4:
    • Very nice spiraling of new concepts in this chapter and repeatedly building off of previous concepts! (For example, how the tidyr::seperate() example reuses tidyr::pivot_longer() )
      • ML: no changes needed here
    • More weird spacing of figures and text
    • Explicitly try to define “wide” versus “long” before using. It should be intuitive but might trip students up.
    • Explain why each situation might arise. Students may wonder why anyone would produce the “wrong” form of data, but discussion could explain, for example, that human-readable tables are often in “wide” format, but for visualization and analysis we want them “long”.
    • For the table that is “too long”, the important column (containing the future column names) is truncated. This makes it harder to understand the concept and why this is a bad format
    • I love the repeatedly checking of the tidy criteria after each example, but I wonder if it could be more effective or helpful to explain before the transformation the shape we are trying to get it into so students know what the goal is from the outset.
  • In 3.5:
    • Consider showing “nested” alternative to pipe [f(g(h(x)))] to demonstrate how pipe adds readability and order of operations
    • Consider providing advise when to break a pipe and assign to an intermediate variable. For example, it can be overwhelming to have 10 functions piped to one another. Additionally, one might want to save the data they have prepared before feeding it into a plotting or model function so that they can alter the plotting or model function without having to continually remake all of their data transformations.
  • In 3.6: See 3 for content suggestions motivated by this part in particular ML: this pertains to first comment in this issue
  • In 3.7:
    • See 3 for content suggestions motivated by this part in particular
    • The NA issue with sum() could be explained better. The text reads like this is a purrr issue, so it could be clarified that it is a sum() issue that can be addressed while using purrr
@leem44 leem44 changed the title Review: Ch 3 (viz) Review: Ch 3 (wrangling) Mar 28, 2021
@leem44 leem44 mentioned this issue Mar 28, 2021
4 tasks
@leem44
Copy link
Contributor Author

leem44 commented Mar 28, 2021

Reviewer B:

  • appreciate how Timbers breaks down barriers between packages and wholistically covers cleaning and wrangling data. For example, my understanding is that the dplyr and tidyr packages are separate only for legacy reasons (plyr and reshape were their predecessors). However, with the advent of the tidyverse umbrella package, there is no need to make such delineanations. Furthermore, new students are confused by and not interested in the historical legacies of R package development; they’re only interested in getting stuff done.
  • I LOVE how purrr as a means for iteration is covered (as opposed to for loops). I’m only now starting to incorporate it into my coding and it’s simplifying my code greatly. The only reason I don’t introduce it early in my courses is I learned that for-loops are the gold standard for iteration. Having to unlearn this is hard; this is not a reason for students not to learn purrr.

ML: No changes needed here

@trevorcampbell
Copy link
Contributor

trevorcampbell commented Mar 28, 2021

Reviewer D

  • This chapter can be broken down into smaller chapters. The initial part of the chapter has to do with Data Objects in R, so perhaps this initial part can be presented as its own chapter. It is important for students/readers to understand how R stores data as a whole (e.g., data frame, tibble) and also component by component

  • The part that starts with pivoting the data has to do with reshaping data, so perhaps a smaller chapter on Reshaping Data in R is in order.

  • The part that talks about separating information or cleaning the data could go into its own chapter called Tidying/Cleaning Data.

  • Once the discussion on the pipe operator %>% begins, that would be a signal to start a separate, smaller chapter called Wrangling Data. (I feel that the chapter as it originally stands spans too many topics and that can make it harder on the students/readers to understand the importance and nuances of each topic.

  • The wrangling chapter also needs its own table stating clearly what wrangling tasks will be covered and what functions and packages from R will be used for implementing these tasks.

    • ML: I don't think listing the packages is super helpful and I don't think putting this at the beginning is very helpful since we already have the learning objectives, but I added a table at the end that summarizes the key functions
  • The portion on iteration with group_by() and purrr could go into its own chapter titled Computing Summary Statistics.

    • ML: For all other comments that are unaddressed above, I don't agree with breaking the chapter into smaller chunks as the above comments suggest, I think it breaks the flow of the book as a whole. Happy to discuss if others disagree!

@trevorcampbell
Copy link
Contributor

trevorcampbell commented Mar 28, 2021

Reviewer A

  • p 58 typo: "multiple two rows"
  • p 61: in the separate function: Will you mention the convert argument anywhere? It makes sense to use mutate for pedagogical reasons first, but its good to know about.
  • p 63: "ggplot": unclear typography
  • p63: in the code block: You need to use scale_fill_manual here for this to work as expected.
  • p64: the figure: Change after code (on p63) is fixed
  • p68, in the code block: Why not also discuss piping in the data frame to fully embrace the pipe operator
  • p71: purrr is very useful, but you could also use summarize + across to achieve the type of iteration you're discussing. I'm not sure which is easier, but I thought I would mention it.

@ttimbers
Copy link
Contributor

ttimbers commented Mar 29, 2021

Reviewer C

  • Section 3.1: “is” not “will be” - again, reads like a proposal rather than the delivered book
  • Section 3.3.2: This terminology threw me a bit when I first started learning R: coming from computer science, I think “type of vector” means “how vector is implemented” rather than “data type stored in vector”. May be a case where learners with less baggage are less confused…
  • Fig 3.4: but the caption says “integer”, not “number”.
  • Section 3.3.3: Do you absolutely need to introduce lists here? I know they are how dataframes are implemented, but I fear learners will be lost in a sea of terms whose value and relationships they don’t get.
    • ML: I believed we discussed this previously, but happy to change if others want
  • Note in section 3.3.3: Example please (particularly for the result for a tibble - most people aren’t used to functions returning three values).
  • Section 3.4: You’ve already introduced the term “tabular data”, which is cleaner than “spreadsheet-like data set”. Here and elsewhere, it feels like chapters could use one last consolidatio pass.
  • Section 3.4.2: I think this is a bit misleading - tidy data isn’t important because ggplot2 and stats functions expect it - they expect tidy data because it’s easiest/least ambiguous to work with.
  • Section 3.4.3: Shown as Montréal (with the accent) in the data
  • Section 3.4.4: Again, I would like an example to show why it’s clumsy - novices won’t be able to construct one in their heads on the fly.
  • Section 3.4.5: This is confusing: it’s not just one letter or digit, and it’s definitely digit rather than number (the latter will confuse).
  • Section 3.4.5: “What is a sub-type” asked the reader, not having encountered the term before.
  • Section 3.4.6: “it’s” instead of “. It’s” (sentence continues).
  • Section 3.5.1: I think this is a lot easier to understand (and more compelling) if you introduce “data %>% func” instead of using “func(data, …)”) as the start of the pipe.
  • Section 3.5.2: I think this deserves more than an italicized inline note…
    • addressed in earlier chapter
  • Section 3.6: 1. Will readers know what “iterating over” means? 2. I don’t think of group_by as “iterating”, but if it does, then doesn’t “filter” as well?
  • Section 3.6.3: spelling: “breifly” <- “briefly”
  • Section 3.6.3: “meta R package” or “R meta package”? Might feel like a minor point, but little variations in terminology confuse novices.
  • Section 3.6.3: what’s “the Stat 545 book” when it’s at home?
  • Section 3.7: As before, will you readers understand the use of “*” to mean “all” here?
    • ML: this won't apply when we remove map
  • Section 3.7: Does this have to be introduced now? ‘map’ is hard, and readers might have a better time with it if they’ve had more practice with R before tackling it.
  • Section 3.7:
  • 1. Non-native English speakers have told me they find the use of “vanilla” to mean “common” confusing.
    1. Lists have been mentioned before, but haven’t been seen (i.e., no examples) - is it fair to introduce them at the same time as map?
      -ML: This won't apply when we remove map

@leem44
Copy link
Contributor Author

leem44 commented Jun 27, 2021

Need to fix: numbers in table 3.9 don't match the numbers in the data table above

@leem44
Copy link
Contributor Author

leem44 commented Aug 7, 2021

Adding Tiffany's comment from #92
We should introduce slice in the data wrangling chapter, so we can just use it here. Again, this would make this book more standalone, we currently only do this in the worksheet/tutorial).

  • ML: slice is included in chapter 1

@leem44
Copy link
Contributor Author

leem44 commented Aug 7, 2021

Adding this comment from #106
> Also, I see we introduce pull here... is this the best place, or should we introduce this in the wrangling chapter? What do you think @leem44 ?

@leem44 leem44 linked a pull request Sep 21, 2021 that will close this issue
@leem44
Copy link
Contributor Author

leem44 commented Sep 28, 2021

these are all addressed or new issues

@leem44 leem44 closed this as completed Sep 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants