
Chapter 6 (classification I) further improvements #92

Closed
ttimbers opened this issue Feb 20, 2021 · 5 comments · Fixed by #234

Comments

@ttimbers
Contributor

ttimbers commented Feb 20, 2021

Tiffany's suggested improvements from reading it in the 2021 spring:

  • We can remove the discussion of forcats, since it is loaded with the tidyverse. However, we should keep the sentence that explains what factors are and why they are useful; we could move it to the first time we manipulate factors in this chapter.
  • When we introduce the variables, we define all of them in the numbered list (even the relatively obvious ones) except for ID number, symmetry, and fractal dimension. To be consistent, we should define all of them.
  • In the following sentence we should explicitly state that B is benign and M is malignant: "Given that we only have 2 different values in our Class column (B and M), we only expect to get two names back."
  • We should really introduce colour palettes in chapter 4 of the textbook (to make this book more standalone; we currently only do this in the worksheet/tutorial), and then simply use them here without explanation (since readers will already know them).
  • We don't use units in this chapter, but we should... even the scaled data could be labelled as "μm scaled" (or whatever is appropriate).
  • This line of code is too long and should be broken across several lines to improve readability: mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2)) %>% (alternatively, we could shorten the variable names so that it is readable on one line, given that it is a mathematical equation).
  • We should introduce slice in the data wrangling chapter, so we can just use it here. Again, this would make the book more standalone (we currently only do this in the worksheet/tutorial).
  • We call the X's in this chapter features, predictors and explanatory variables. We should just pick one and stick with it.
  • In the section titled "More than two explanatory variables" I think we should explicitly show the formula for 3 features, and then after that we can show the general formula for m features. I wonder if we even want to repeat a calculation for a single nearest neighbour, or 3 even, for 3 features and annotate our 3D plot like we do for our 2D plot. ML: making new issue, will do if time: Annotate 3D plot in classification chapter #228
  • I think we need a better explanation of what set_engine is doing, and why we need it. It seems it is not always required: only if you want to pass arguments to the model that exist in non-base packages? I am not 100% certain of that definition, but what we currently have written is not clear, even to me. Also, after we fit the model we write: "It confirms that the computational engine used to train the model was kknn::train.kknn", which is also a bit unclear: in set_engine we only set the package name, so how did it know to use the train.kknn function from the kknn package, and not some other function?
  • When we first use fit, we use some fairly abstract syntax right away, fit(Class ~ ., data = cancer_train), and do not really explain it. Here I think we should first write out the formula in full, fit(Class ~ Perimeter + Concavity, data = cancer_train), and then re-write it with Class ~ . and explain what ~ . means.
  • We should not use the word "tune" in this chapter, since readers don't know what it means yet (one sentence where we use it: "we will actually let R tune the model for us."; there might be other places, we should check).
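
For the long mutate line flagged above, one possible multi-line layout looks like this. This is only a sketch: it assumes a cancer data frame and that new_obs_Perimeter and new_obs_Concavity hold the new observation's predictor values, as elsewhere in the chapter.

```r
library(dplyr)

# break the distance calculation after the pipe and inside sqrt(),
# aligning the two squared terms for readability
cancer %>%
  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
                              (Concavity - new_obs_Concavity)^2))
```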
@ttimbers
Contributor Author

ttimbers commented Feb 21, 2021

  • When we do the first prediction, we end by stating "As above when we ran the K-nearest neighbours classification algorithm manually, the knn_fit object classifies the new observation as malignant (“M”)." I think we should add a statement along the lines of: "Is that the true label? We don't know, because we do not have the label - this is what we were trying to predict. In the next chapter we will learn some ways to start quantifying how accurate we think our predictions are."
  • The break between the paragraph on centring and the definition of standardization is very abrupt. There should be a better transition here.
  • It is not clear that the recipe, ending in prep, is simply making a transformer object based on the data passed to the recipe. Readers will probably expect the object to be the scaled and centred data. This will leave them wondering why we later need to bake (and bake may also seem redundant if they make this confusion). We might also want to explicitly explain why this framework exists (so that you can centre and scale variables for new observations that you want to predict, which were not part of the training set).
  • cancer %>% filter(Class == "M") %>% sample_n(3) should be split across two lines, after the pipe.
  • When we show how to use step_upsample, we switch back to using the scaled and centred data, but don't really tell the reader about this. We should either continue to use the unstandardized data and include the scaling and centring in the recipe, or explicitly say we are now using the standardized data for simplicity's sake (I prefer the former, I think).
  • I think the text and the figure captions of the decision boundary plots need to communicate more explicitly and clearly what is the labelled data and what are the predictions. It's obvious to us, but I don't think it's clear enough for the learners.
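
To illustrate the prep/bake distinction raised above, here is a minimal sketch. It assumes a cancer_train data frame and a new_observation tibble; the column names follow the chapter, and the recipe steps are assumptions about how the chapter's recipe is built.

```r
library(tidymodels)

# prep() does not return scaled data; it returns a *trained* recipe
# (a transformer object) whose centring/scaling statistics were
# estimated from the training data
cancer_recipe <- recipe(Class ~ Perimeter + Concavity,
                        data = cancer_train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  prep()

# bake() applies the stored transformation; crucially, it can also be
# applied to *new* observations using the training set's means and
# standard deviations
scaled_train <- bake(cancer_recipe, cancer_train)
scaled_new   <- bake(cancer_recipe, new_observation)
```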

@ttimbers
Contributor Author

ttimbers commented Feb 21, 2021

Important: I think we need to add another part to the workflow section, in particular how you use a workflow to predict a class for a single new observation. I have yet to read chapter 7, but I'll bet it's just predict; still, as evidenced by my question, the learners do not know that...
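
A minimal sketch of what that addition could show (knn_workflow and cancer_train are assumed to be the workflow and training data built earlier in the chapter; the predictor values here are made up):

```r
library(tidymodels)

# a single new observation with the two predictors used in the chapter
new_obs <- tibble(Perimeter = 0.2, Concavity = 3.3)

# fit the workflow on the training data, then predict the new label
knn_fit <- knn_workflow %>%
  fit(data = cancer_train)

predict(knn_fit, new_obs)
```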

@ttimbers
Contributor Author

ttimbers commented Feb 21, 2021

  • In this chapter (and following) anytime we plot standardized data we should label the data on the plots as such.

@trevorcampbell
Contributor

> We call the X's in this chapter features, predictors and explanatory variables. We should just pick one and stick with it.

Agreed, but when we first introduce the term we should put in a comment box the synonyms that people commonly use

> It is not clear that the recipe, ending in prep, is simply making a transformer object ...

This must already be explained somewhere, no? Chapter 7 maybe? Check there before editing. If not, it definitely should be. I remember talking about this at least a few times in class, and students having some trouble with it.

> Important: I think we need to add another part to the workflow section, in particular how do you use a workflow to predict a class for a single new observation.

Only if it makes sense to introduce it in this section.

All the other comments I agree with!

Big comment: before we do any of these things, we should read Chapter 7 carefully, because certain organizational choices may become clearer.

leem44 pushed a commit that referenced this issue Aug 7, 2021
…ome paragraphs to clarify functions, removing unneeded explanations of functions which we will explain in earlier chapters, updating scaling/centering explanation, specifying data in balancing section
@leem44 leem44 linked a pull request Sep 21, 2021 that will close this issue
@leem44
Contributor

leem44 commented Sep 23, 2021

All done or made into new issues

@leem44 leem44 closed this as completed Sep 23, 2021