
Chapter 6 (classification I) further improvements #92

Closed
ttimbers opened this issue Feb 20, 2021 · 5 comments · Fixed by #234

Comments

@ttimbers
Contributor

ttimbers commented Feb 20, 2021

Tiffany's suggested improvements from reading it in the 2021 spring:

  • We can remove the discussion of forcats, since it is loaded with the tidyverse. However, we should keep the sentence that explains what factors are and why they are useful; we could move it to the first time we manipulate factors in this chapter.
  • When we introduce the variables, we define all of them in the numbered list (even the relatively obvious ones) except for ID number, symmetry, and fractal dimension. To be consistent, we should define all of them.
  • In the following sentence we should explicitly state that B is benign and M is malignant: "Given that we only have 2 different values in our Class column (B and M), we only expect to get two names back."
  • We should really introduce colour palettes in chapter 4 of the textbook (to make this book more standalone; we currently only do this in the worksheet/tutorial), and then simply use them here without explanation (since readers will already know them).
  • We don't use units in this chapter, but we should... even the scaled data could be labelled as "μm scaled" (or whatever is appropriate).
  • This line of code is too long and should be broken across several lines to improve readability: mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2)) %>% (alternatively, we could shorten the variable names so that it is readable on one line, given that it is a mathematical equation).
  • We should introduce slice in the data wrangling chapter, so we can just use it here. Again, this would make the book more standalone (we currently only do this in the worksheet/tutorial).
  • We call the X's in this chapter features, predictors and explanatory variables. We should just pick one and stick with it.
  • In the section titled "More than two explanatory variables" I think we should explicitly show the formula for 3 features, and then after that we can show the general formula for m features. I wonder if we even want to repeat a calculation for a single nearest neighbour, or 3 even, for 3 features and annotate our 3D plot like we do for our 2D plot. ML: making new issue, will do if time: Annotate 3D plot in classification chapter #228
  • I think we need a better explanation of what set_engine is doing, and why we need it. It seems it is not always required: only if you want to pass arguments to the model that exist in non-base packages? I am not 100% certain of that definition, but what we currently have written is not clear, even to me. Also, after we fit the model we write: "It confirms that the computational engine used to train the model was kknn::train.kknn", which is also a bit unclear: in set_engine we only set the package name, so how did it know to use the train.kknn function from the kknn package, and not some other function?
  • When we first use fit, we use some fairly abstract syntax right away, fit(Class ~ ., data = cancer_train), and do not really explain it. Here I think we should first write out the formula in full, fit(Class ~ Perimeter + Concavity, data = cancer_train), and then re-write it with Class ~ . and explain what ~ . means.
  • We should not use the word "tune" in this chapter, since readers don't know what it means yet (one sentence where we use it: "we will actually let R tune the model for us."; there might be other places, we should check).
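
For the long mutate line flagged above, one possible multi-line layout looks like this. This is only a sketch: it assumes a cancer data frame and that new_obs_Perimeter and new_obs_Concavity hold the new observation's predictor values, as elsewhere in the chapter.

```r
library(dplyr)

# break the distance calculation after the pipe and inside sqrt(),
# aligning the two squared terms for readability
cancer %>%
  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
                              (Concavity - new_obs_Concavity)^2))
```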
@ttimbers
Contributor Author

ttimbers commented Feb 21, 2021

  • When we do the first prediction, we end by stating "As above when we ran the K-nearest neighbours classification algorithm manually, the knn_fit object classifies the new observation as malignant (“M”)." I think we should add a statement along the lines of: "Is that the true label? We don't know, because we do not have the label - this is what we were trying to predict. In the next chapter we will learn some ways to start quantifying how accurate we think our predictions are."
  • The break between the paragraph on centring and the definition of standardization is very abrupt. There should be a better transition here.
  • It is not clear that the recipe, ending in prep, is simply making a transformer object based on the data passed to the recipe. Readers will probably expect the object to be the scaled and centred data. This will leave them wondering why we later need to bake (and bake may also seem redundant if they make this confusion). We might also want to explicitly explain why this framework exists (so that you can centre and scale variables for new observations that you want to predict, which were not part of the training set).
  • cancer %>% filter(Class == "M") %>% sample_n(3) should be split across two lines, after the pipe.
  • When we show how to use step_upsample, we switch back to using the scaled and centred data, but don't really tell the reader about this. We should either continue to use the unstandardized data and include the scaling and centring in the recipe, or explicitly say we are now using the standardized data for simplicity's sake (I prefer the former, I think).
  • I think the text and the figure captions of the decision boundary plots need to communicate more explicitly and clearly what is the labelled data and what are the predictions. It's obvious to us, but I don't think it's clear enough for the learners.
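
To illustrate the prep/bake distinction raised above, here is a minimal sketch. It assumes a cancer_train data frame and a new_observation tibble; the column names follow the chapter, and the recipe steps are assumptions about how the chapter's recipe is built.

```r
library(tidymodels)

# prep() does not return scaled data; it returns a *trained* recipe
# (a transformer object) whose centring/scaling statistics were
# estimated from the training data
cancer_recipe <- recipe(Class ~ Perimeter + Concavity,
                        data = cancer_train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  prep()

# bake() applies the stored transformation; crucially, it can also be
# applied to *new* observations using the training set's means and
# standard deviations
scaled_train <- bake(cancer_recipe, cancer_train)
scaled_new   <- bake(cancer_recipe, new_observation)
```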

@ttimbers
Contributor Author

ttimbers commented Feb 21, 2021

Important: I think we need to add another part to the workflow section, in particular how you use a workflow to predict a class for a single new observation. I have yet to read chapter 7, but I'll bet it's just predict; still, as evidenced by my question, the learners do not know that...
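
A minimal sketch of what that addition could show (knn_workflow and cancer_train are assumed to be the workflow and training data built earlier in the chapter; the predictor values here are made up):

```r
library(tidymodels)

# a single new observation with the two predictors used in the chapter
new_obs <- tibble(Perimeter = 0.2, Concavity = 3.3)

# fit the workflow on the training data, then predict the new label
knn_fit <- knn_workflow %>%
  fit(data = cancer_train)

predict(knn_fit, new_obs)
```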

@ttimbers
Contributor Author

ttimbers commented Feb 21, 2021

  • In this chapter (and following) anytime we plot standardized data we should label the data on the plots as such.

@trevorcampbell
Contributor

> We call the X's in this chapter features, predictors and explanatory variables. We should just pick one and stick with it.

Agreed, but when we first introduce the term we should put in a comment box the synonyms that people commonly use

> It is not clear that the recipe, ending in prep, is simply making a transformer object ...

This must already be explained somewhere, no? Chapter 7 maybe? Check there before editing. If not, it definitely should be. I remember talking about this at least a few times in class, and students having some trouble with it.

> Important: I think we need to add another part to the workflow section, in particular how do you use a workflow to predict a class for a single new observation.

Only if it makes sense to introduce it in this section.

All the other comments I agree with!

Big comment: before we do any of these things, we should read Chapter 7 carefully, because certain organizational choices may become clearer.

leem44 pushed a commit that referenced this issue Aug 7, 2021
…ome paragraphs to clarify functions, removing unneeded explanations of functions which we will explain in earlier chapters, updating scaling/centering explanation, specifying data in balancing section
@leem44 leem44 linked a pull request Sep 21, 2021 that will close this issue
@leem44
Contributor

leem44 commented Sep 23, 2021

All done or made into new issues

@leem44 leem44 closed this as completed Sep 23, 2021