Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review: Ch 6 (classification) #106

Closed
leem44 opened this issue Mar 28, 2021 · 7 comments · Fixed by #234
Closed

Review: Ch 6 (classification) #106

leem44 opened this issue Mar 28, 2021 · 7 comments · Fixed by #234

Comments

@leem44
Copy link
Contributor

leem44 commented Mar 28, 2021

Reviewer E:

  • Would it be possible to include at least one other classifier besides KNN? I don’t see this commonly used in industry, so even a brief counterpoint of logistic regression or random forest (even without deep explanation) could illuminate that there are many different algorithms with different performance, pros/cons
  • p. 133: “learn more about x here” - phrasing makes sense for the web but consider rephrasing or removing for book
  • Factors are a fairly dense and R-specific concept. It might be worth introducing them earlier in data wrangling or explaining them a bit more deeply (or providing readers an external reference)
  • I love the EDA before modeling, but I would appreciate more narrative how the author chose what views to make and what they were looking for. Other scatter plots might not show such a strong relationship, and students should see the iterative process of success and failure in EDA
    • ML: I think this is addressed with the new edits Trevor is making
  • I absolutely love all of the plots in this chapter and the manual solving of the algorithm with ggplot and dplyr!
    • ML: no changes needed here
  • Great explanation of centering and scaling, but I worry a bit about the rule of thumb to “always do this” instead of “always think about what makes sense in the problem domain”. In some cases, you don’t want to scale (e.g. some cases with multiple measures in same units)
  • Chapter 7 is extremely dense and hits on so many foundational modeling concepts. I think some of this could be helpful to pull up before Chapter 6 and describe a holistic modeling workflow
    • ML: I think this is addressed with the new edits Trevor is making
  • Distinguish which steps are specific to KNN versus a general part of modeling
    • ML: I think this is addressed with the new edits Trevor is making
  • I love the discussion of the strengths and limitations of KNN, but it makes less sense without any presented alternatives
@leem44
Copy link
Contributor Author

leem44 commented Mar 28, 2021

Reviewer B:

  • I think classification serves as an excellent gateway to the world of modeling, especially since fairly complex models can be represented visually. Ex: Figure 6.2 has the outcome variable represented with colors, and the two predictor variables in the x-y coordinates.
    • ML: no changes needed here

@trevorcampbell
Copy link
Contributor

trevorcampbell commented Mar 28, 2021

Reviewer D

  • Can the authors clarify whether the categorical response variable used in the classification chapters is intended to have only 2 categories or could it have more? otherwise perhaps name binary classification
  • For the “labelled data set” used in this chapter, can the authors only focus on the response variable and predictor variables that will be used in the classification?
    • ML: I'm not really sure if this makes a big difference either way, also I don't know if it's necessary to hide other columns in the data from students.
  • When talking about “predicting a new observations” can the authors be more specific with the language being used? For example, can they talk instead about predicting the class (i.e., malignant or bening) for a new patient/tissue based on the perimeter and concavity values for that patient/tissue AND on the information from the patients/tissues included in the classification model/technique?
  • The section on More than two explanatory variables is not sufficiently fleshed out – perhaps this could be made into an exercise or elaborated upon more?
  • It would be nice to see a workflow for this chapter presented upfront, so that the students/readers know what to aim for:
    Pre-Process Data, Separate Pre-Processed Data into Overall Training and Test Sets, Apply Classification Model/Technique to Overall Training Set, Verify Accuracy of Classification on Test Set, Use ENTIRE Data Set (i.e., Overall Training + Test) to Perform Classifications for New Entities (e.g., subjects/tissues). This workflow is fairly complex so stating it upfront will help students/readers keep track of its moving components. (By the time students/readers arrive at the section titled Putting it together in a workflow, they will be primed for it.)
    • ML: I don't know if this will make sense to readers before we explain what each of these mean

@trevorcampbell
Copy link
Contributor

trevorcampbell commented Mar 28, 2021

Reviewer A

  • p136 first code box: Why not use cancer %>% pull(Class) %>% levels
  • p136 second code box: You haven't used n() before, does it need a little explanation?
  • p137 What is the value of defining so many colors now when only two are needed for this problem? I think this should be explained to the reader.
  • p139 figure: I suggest using a different glyph (for the red point) in addition to size and color.
  • p145: (re: rotating the 3d image) This caption would apply to the online version, but not the print version.
  • p155: I got this warning:
Warning message:
“`step_upsample()` is deprecated as of recipes 0.1.13.
Please use `themis::step_upsample()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.”
  • This might be resolved by an update to recipes soon, but a novice might panic when they see this message!

@trevorcampbell
Copy link
Contributor

trevorcampbell commented Jul 16, 2021

From #146 regarding the first figure in this chapter

Probably for the chapter review (add this to the issue for that?) - it would be good to improve this figure to show how the key measurements we focus on in the chapter (e.g., concavity and perimeter) are extracted from this image.

From that same PR, regarding the text around the loading the data and variable descriptions area

> I think we now introduce and discuss factors in wrangling and then use them more heavily in the viz chapter (is this correct @leem44 ?) and so we can lighten up the discussion of them here now?

Also, I see we introduce pull here... is this the best place, or should we introduce this in the wrangling chapter? What do you think @leem44 ?

@trevorcampbell
Copy link
Contributor

From #146 comment by @ttimbers : need more informative axis labels in figures. The variables we have are the mean values across cells in a tissue sample.

However, I worry a bit that changing the axis labels will make the examples more confusing (because the new axis labels should be something like Mean Concavity (for example)).

I will make this same comment in the chapter-specific edits thread for classification 2.

@leem44
Copy link
Contributor Author

leem44 commented Aug 7, 2021

Also Tiffany's comments on classification here : #92

@leem44
Copy link
Contributor Author

leem44 commented Sep 23, 2021

All of these are done or made into new issues

@leem44 leem44 closed this as completed Sep 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants