From dea1e5dfecf85b24e9cb6c7a1cc6bebedc659e44 Mon Sep 17 00:00:00 2001
From: Angel Esteban Feliz <84166329+AngelFelizR@users.noreply.github.com>
Date: Mon, 7 Oct 2024 16:55:34 -0400
Subject: [PATCH] Join vignette (#6478)

Co-authored-by: rikivillalba <32423469+rikivillalba@users.noreply.github.com>
Co-authored-by: Toby Dylan Hocking <tdhock5@gmail.com>
---
 DESCRIPTION                                   |   1 +
 vignettes/datatable-intro.Rmd                 |   6 +-
 vignettes/datatable-joins.Rmd                 | 696 ++++++++++++++++++
 vignettes/datatable-keys-fast-subset.Rmd      |  14 +-
 vignettes/datatable-reference-semantics.Rmd   |  14 +-
 vignettes/datatable-sd-usage.Rmd              |   3 +-
 ...le-secondary-indices-and-auto-indexing.Rmd |  10 +-
 7 files changed, 721 insertions(+), 23 deletions(-)
 create mode 100644 vignettes/datatable-joins.Rmd

diff --git a/DESCRIPTION b/DESCRIPTION
index 5d03d5823..d17f2ea75 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -90,6 +90,7 @@ Authors@R: c(
   person("Anirban", "Chetia",      role="ctb"),
   person("Doris", "Amoakohene",    role="ctb"),
   person("Ivan", "Krylov",         role="ctb"),
+  person("Angel", "Feliz",         role="ctb"),
   person("Michael","Young",        role="ctb"),
   person("Mark", "Seeto",          role="ctb")
   )
diff --git a/vignettes/datatable-intro.Rmd b/vignettes/datatable-intro.Rmd
index e08280c5c..72eabbd8e 100644
--- a/vignettes/datatable-intro.Rmd
+++ b/vignettes/datatable-intro.Rmd
@@ -475,7 +475,7 @@ ans
 
 **Keys:** Actually `keyby` does a little more than *just ordering*. It also *sets a key* after ordering by setting an `attribute` called `sorted`. 
 
-We'll learn more about `keys` in the *Keys and fast binary search based subset* vignette; for now, all you have to know is that you can use `keyby` to automatically order the result by the columns specified in `by`.
+We'll learn more about `keys` in the `vignette("datatable-keys-fast-subset", package="data.table")`; for now, all you have to know is that you can use `keyby` to automatically order the result by the columns specified in `by`.
 
 ### c) Chaining
 
@@ -655,7 +655,7 @@ We have seen so far that,
 
 * We can also sort a `data.table` using `order()`, which internally uses data.table's fast order for better performance.
 
-We can do much more in `i` by keying a `data.table`, which allows for blazing fast subsets and joins. We will see this in the *"Keys and fast binary search based subsets"* and *"Joins and rolling joins"* vignette.
+We can do much more in `i` by keying a `data.table`, which allows for blazing fast subsets and joins. We will see this in the `vignette("datatable-keys-fast-subset", package="data.table")` and the `vignette("datatable-joins", package="data.table")`.
 
 #### Using `j`:
 
@@ -689,7 +689,7 @@ We can do much more in `i` by keying a `data.table`, which allows for blazing fa
 
 As long as `j` returns a `list`, each element of the list will become a column in the resulting `data.table`.
 
-We will see how to *add/update/delete* columns *by reference* and how to combine them with `i` and `by` in the next vignette.
+We will see how to *add/update/delete* columns *by reference* and how to combine them with `i` and `by` in the next vignette (`vignette("datatable-reference-semantics", package="data.table")`).
 
 ***
 
diff --git a/vignettes/datatable-joins.Rmd b/vignettes/datatable-joins.Rmd
new file mode 100644
index 000000000..b3b30598d
--- /dev/null
+++ b/vignettes/datatable-joins.Rmd
@@ -0,0 +1,696 @@
+---
+title: "Joins in data.table"
+date: "`r Sys.Date()`"
+output:
+  markdown::html_format
+vignette: >
+  %\VignetteIndexEntry{Joins in data.table}
+  %\VignetteEngine{knitr::knitr}
+  \usepackage[utf8]{inputenc}
+editor_options: 
+  chunk_output_type: console
+---
+
+```{r, echo = FALSE, message = FALSE}
+require(data.table)
+knitr::opts_chunk$set(
+  comment = "#",
+    error = FALSE,
+     tidy = FALSE,
+    cache = FALSE,
+ collapse = TRUE
+)
+```
+
+In this vignette you will learn how to perform any join operation using resources available in the `data.table` syntax.
+
+It assumes familiarity with the `data.table` syntax. If that is not the case, please read the following vignettes:
+
+- `vignette("datatable-intro", package="data.table")`
+- `vignette("datatable-reference-semantics", package="data.table")`
+- `vignette("datatable-keys-fast-subset", package="data.table")`
+
+***
+
+## 1. Defining example data
+
+To illustrate the available methods with real-life examples, let's simulate a **normalized database** for a small supermarket by performing the following steps:
+
+1. Defining a `data.table` where each product is represented by a row with some qualities, but leaving one product without `id` to show how the framework deals with ***missing values***.
+
+```{r}
+Products = data.table(
+  id = c(1:4,
+         NA_integer_),
+  name = c("banana",
+           "carrots",
+           "popcorn",
+           "soda",
+           "toothpaste"),
+  price = c(0.63,
+            0.89,
+            2.99,
+            1.49,
+            2.99),
+  unit = c("unit",
+           "lb",
+           "unit",
+           "ounce",
+           "unit"),
+  type = c(rep("natural", 2L),
+           rep("processed", 3L))
+)
+
+Products
+```
+
+2. Defining a `data.table` showing the proportion of taxes to be applied for processed products based on their units.
+
+```{r}
+NewTax = data.table(
+  unit = c("unit","ounce"),
+  type = "processed",
+  tax_prop = c(0.65, 0.20)
+)
+
+NewTax
+```
+
+
+3. Defining a `data.table` simulating the products received every Monday with a `product_id` that is not present in the `Products` table.
+
+```{r}
+set.seed(2156)
+
+ProductReceived = data.table(
+  id = 1:10,
+  date = seq(from = as.IDate("2024-01-08"), length.out = 10L, by = "week"),
+  product_id = sample(c(NA_integer_, 1:3, 6L), size = 10L, replace = TRUE),
+  count = sample(c(50L, 100L, 150L), size = 10L, replace = TRUE)
+)
+
+ProductReceived
+```
+
+4. Defining a `data.table` to show some sales that can take place on weekdays with another `product_id` that is not present in the `Products` table.
+
+```{r}
+sample_date = function(from, to, size, ...){
+  all_days = seq(from = from, to = to, by = "day")
+  weekdays = all_days[wday(all_days) %in% 2:6]
+  days_sample = sample(weekdays, size, ...)
+  days_sample_sorted = sort(days_sample)  # ascending order
+  days_sample_sorted
+}
+
+set.seed(5415)
+
+ProductSales = data.table(
+  id = 1:10,
+  date = ProductReceived[, sample_date(min(date), max(date), 10L)],
+  product_id = sample(c(1:3, 7L), size = 10L, replace = TRUE),
+  count = sample(c(50L, 100L, 150L), size = 10L, replace = TRUE)
+)
+
+
+ProductSales
+```
+
+## 2. `data.table` joining syntax 
+
+Before taking advantage of the `data.table` syntax to perform join operations we need to know which arguments can help us to perform successful joins.
+
+The next diagram shows a description for each basic argument. In the following sections we will show how to use each of them and add more complexity little by little.
+
+```
+x[i, on, nomatch]
+| |  |   |
+| |  |   \__ If NULL only returns rows linked in x and i tables
+| |  \____ a character vector or list defining match logic
+| \_____ primary data.table, list or data.frame
+\____ secondary data.table
+```
+
+> Please keep in mind that the standard argument order in data.table is `dt[i, j, by]`. For join operations, it is recommended to pass the `on` and `nomatch` arguments by name to avoid using `j` and `by` when they are not needed.
+
+## 3. Equi joins
+
+This is the most common and simplest case, as we can find common elements between tables to combine.
+
+The relationship between tables can be:
+
+- **One to one**: When each matching value is unique on each table.
+- **One to many**: When some matching values are repeated in one of the tables but unique in the other one.
+- **Many to many**: When the matching values are repeated several times on each table.
+
+In most of the following examples we will perform *one to many* matches, but we are also going to take the time to explain the resources available to perform *many to many* matches.
+
+
+### 3.1. Right join
+
+Use this method if you need to combine columns from 2 tables based on one or more references but ***keeping all rows present in the table located on the right (inside the square brackets)***.
+
+In our supermarket context, we can perform a right join to see more details about the products received, as this is a *one to many* relationship, by passing a vector to the `on` argument.
+
+```{r}
+Products[ProductReceived,
+         on = c(id = "product_id")]
+```
+
+As many things have changed, let's explain the new characteristics in the following groups:
+
+- **Column level**
+  - The *first group* of columns in the new data.table comes from the `x` table.
+  - The *second group* of columns in the new data.table comes from the `i` table.
+  - If the join operation presents any **name conflict** (both tables have the same column name), the ***prefix*** `i.` is added to column names from the **right-hand table** (table in the `i` position).
+  
+- **Row level**
+  - The missing `product_id` present in the `ProductReceived` table in row 1 was successfully matched with the missing `id` of the `Products` table, so `NA` ***values are treated as any other value***.
+  - All rows from the `i` table were kept, including:
+    - Non-matching rows like the one with `product_id = 6`.
+    - Rows that repeat the same `product_id` several times.
+    
+#### 3.1.1. Joining by a list argument
+
+As you may have noticed, we used a vector to define the relationship between the tables in the `on` argument, which is really useful if you are **creating your own functions**. Another alternative is to use a **list** to define the columns to match.
+
+To use this capability, we have 2 equivalent alternatives:
+
+- Wrapping the related columns in the base R `list` function.
+
+```{r, eval=FALSE}
+Products[ProductReceived,
+         on = list(id = product_id)]
+```
+
+- Wrapping the related columns in the data.table `list` alias `.`.
+
+```{r, eval=FALSE}
+Products[ProductReceived,
+         on = .(id = product_id)]
+```
+
+#### 3.1.2. Alternatives to define the `on` argument
+
+In all the prior examples we have passed the column names we want to match to the `on` argument, but `data.table` also has alternatives to that syntax.
+
+- **Natural join**: Selects the columns to perform the match based on common column names. To illustrate this method, let's rename the `id` column of the `Products` table to `product_id` and use the keyword `.NATURAL`.
+
+```{r}
+ProductsChangedName = setnames(copy(Products), "id", "product_id")
+ProductsChangedName
+
+ProductsChangedName[ProductReceived, on = .NATURAL]
+```
+
+- **Keyed join**: Selects the columns to perform the match based on keyed columns regardless of their names. To illustrate this method, we need to define keys in the same order for both tables.
+
+```{r}
+ProductsKeyed = setkey(copy(Products), id)
+key(ProductsKeyed)
+
+ProductReceivedKeyed = setkey(copy(ProductReceived), product_id)
+key(ProductReceivedKeyed)
+
+ProductsKeyed[ProductReceivedKeyed]
+```
+
+#### 3.1.3. Operations after joining
+
+Most of the time, after a join is complete we need to make some additional transformations. To do so, we have the following alternatives:
+
+- Chaining a new instruction by adding a pair of brackets `[]`.
+- Passing a list with the columns that we want to keep or create to the `j` argument.
+
+Our recommendation is to use the second alternative if possible, as it is **faster** and uses **less memory** than the first one.
+
+
+##### Managing shared column names with the `j` argument
+
+The `j` argument offers useful ways to manage joins between tables **sharing the same names for several columns**. By default, columns are taken from the `x` table, but we can also use the `x.` prefix to make the source explicit and the `i.` prefix to refer to any column from the table passed in the `i` argument.
+
+Going back to the little supermarket, after updating the `ProductReceived` table with the `Products` table, it seems convenient to apply the following changes:
+
+- Changing the column names from `id` to `product_id` and from `i.id` to `received_id`.
+- Adding the `total_value`.
+
+```{r}
+Products[
+  ProductReceived,
+  on = c("id" = "product_id"),
+  j = .(product_id = x.id,
+        name = x.name,
+        price,
+        received_id = i.id,
+        date = i.date,
+        count,
+        total_value = price * count)
+]
+```
+
+
+##### Summarizing with `on` in data.table
+
+We can also use this alternative to return aggregated results based on columns present in the `x` table.
+
+For example, we might be interested in how much money we spent buying each product, regardless of the date.
+
+```{r}
+dt1 = ProductReceived[
+  Products,
+  on = c("product_id" = "id"),
+  by = .EACHI,
+  j = .(total_value_received  = sum(price * count))
+]
+
+
+dt2 = ProductReceived[
+  Products,
+  on = c("product_id" = "id"),
+][, .(total_value_received  = sum(price * count)),
+  by = "product_id"
+]
+
+identical(dt1, dt2)
+```
+
+#### 3.1.4. Joining based on several columns
+
+So far we have only joined `data.table`s based on 1 column, but it's important to know that the package can join tables matching several columns.
+
+To illustrate this, let's assume that we want to add the `tax_prop` from `NewTax` to **update** the `Products` table.
+
+```{r}
+NewTax[Products, on = c("unit", "type")]
+```
+
+### 3.2. Inner join
+
+Use this method if you need to combine columns from 2 tables based on one or more references but ***keeping only rows matched in both tables***.
+
+To perform this operation we just need to add `nomatch = NULL` or `nomatch = 0` to any of the prior join operations; both values return the same results.
+
+```{r}
+# First Table
+Products[ProductReceived,
+         on = c("id" = "product_id"),
+         nomatch = NULL]
+
+# Second Table
+ProductReceived[Products,
+                on = .(product_id = id),
+                nomatch = NULL]
+```
+
+Although both tables contain the same information, they present some relevant differences:
+
+- Their columns appear in a different order.
+- Some of their column names differ:
+  - The `id` column of first table has the same information as the `product_id` in the second table.
+  - The `i.id` column of first table has the same information as the `id` in the second table.
+
+### 3.3. Not join
+
+This method **keeps only the rows that don't match with any row of a second table**.
+
+To apply this technique we just need to negate (`!`) the table located on the `i` argument.
+
+```{r}
+Products[!ProductReceived,
+         on = c("id" = "product_id")]
+```
+
+As you can see, the result only has 'soda', as it was the only product that is not present in the `ProductReceived` table.
+
+```{r}
+ProductReceived[!Products,
+                on = c("product_id" = "id")]
+```
+
+In this case, the operation returns the row with `product_id = 6`, as it is not present in the `Products` table.
+
+### 3.4. Semi join
+
+This method **keeps only the rows that match any row in a second table** without combining the columns of the tables.
+
+It's very similar to a subset as join, but since this time we are passing a complete table to the `i` argument, we need to ensure that:
+
+- No row in the `x` table is duplicated because of duplicated rows in the table passed to the `i` argument.
+
+- All the remaining rows from `x` keep their original row order.
+
+
+To do this, you can apply the following steps:
+
+1. Perform an **inner join** with `which = TRUE` to save the row numbers related to each matching row of the `x` table.
+
+```{r}
+SubSetRows = Products[
+  ProductReceived,
+  on = .(id = product_id),
+  nomatch = NULL,
+  which = TRUE
+]
+
+SubSetRows
+```
+
+2. Select and sort the unique row ids.
+
+```{r}
+SubSetRowsSorted = sort(unique(SubSetRows))
+
+SubSetRowsSorted
+```
+
+
+3. Select the `x` rows to keep.
+
+```{r}
+Products[SubSetRowsSorted]
+```
+  
+
+### 3.5. Left join
+
+Use this method if you need to combine columns from 2 tables based on one or more references but ***keeping all rows present in the table located on the left***.
+
+To perform this operation, we just need to **exchange the order between both tables** and the columns names in the `on` argument.
+
+```{r}
+ProductReceived[Products,
+                on = list(product_id = id)]
+```
+
+Here are some important considerations:
+
+- **Column level**
+  - The *first group* of columns now comes from the `ProductReceived` table as it is the `x` table.
+  - The *second group* of columns now comes from the `Products` table as it is the `i` table.
+  - It didn't add the prefix `i.` to any column.
+  
+- **Row level**
+  - All rows from the `i` table were kept: we never received any soda, but its row is still part of the results.
+  - The row related to `product_id = 6` is no longer part of the results, as it is not present in the `Products` table.
+
+
+#### 3.5.1. Joining after chain operations
+
+One of the key features of `data.table` is that we can apply several operations before saving our final results by chaining brackets.
+
+```r
+DT[
+  ...
+][
+  ...
+][
+  ...
+]
+```
+
+So far, if after applying all those operations **we want to join new columns without removing any row**, we would need to stop the chaining process, save a temporary table and later apply the join operation.
+
+To avoid that situation, we can use the special symbol `.SD` to apply a **right join based on the changed table**.
+
+```{r}
+NewTax[Products,
+       on = c("unit", "type")
+][, ProductReceived[.SD,
+                    on = list(product_id = id)],
+  .SDcols = !c("unit", "type")]
+```
+
+### 3.6. Many to many join
+
+Sometimes we want to join tables based on columns with **duplicated `id` values** to perform some transformations later.
+
+To illustrate this situation let's take as an example `product_id == 1L`, which has 4 rows in our `ProductReceived` table.
+
+```{r}
+ProductReceived[product_id == 1L]
+```
+
+And 4 rows in our `ProductSales` table.
+
+```{r}
+ProductSales[product_id == 1L]
+```
+
+To perform this join we just need to filter `product_id == 1L` in the `i` table to limit the join just to that product and set the argument `allow.cartesian = TRUE` to allow combining each row from one table with every row from the other table.
+
+```{r}
+ProductReceived[ProductSales[list(1L),
+                             on = "product_id",
+                             nomatch = NULL],
+                on = "product_id",
+                allow.cartesian = TRUE]
+```
+
+Once we understand the result, we can apply the same process for **all products**.
+
+```{r}
+ProductReceived[ProductSales,
+                on = "product_id",
+                allow.cartesian = TRUE]
+```
+
+> `allow.cartesian` defaults to `FALSE` as this is seldom what the user wants, and such a cross join can lead to a very large number of rows in the result. For example, if Table A has 100 rows and Table B has 50 rows, their Cartesian product would result in 5000 rows (100 * 50). This can quickly become memory-intensive for large datasets.
+
+
+#### 3.6.1. Selecting one match
+
+After joining the tables we might find out that we just need to return a single match to extract the information we need. In this case we have 2 alternatives:
+
+- We can select the **first match**, represented in the next example by `id = 2`.
+
+```{r}
+ProductReceived[ProductSales[product_id == 1L],
+                on = .(product_id),
+                allow.cartesian = TRUE,
+                mult = "first"]
+```
+
+- We can select the **last match**, represented in the next example by `id = 9`.
+
+```{r}
+ProductReceived[ProductSales[product_id == 1L],
+                on = .(product_id),
+                allow.cartesian = TRUE,
+                mult = "last"]
+```
+
+#### 3.6.2. Cross join
+
+If you want to get **all possible row combinations** regardless of any particular id column, you can follow the next process:
+
+1. Create a new column in both tables with a constant.
+
+```{r}
+ProductsTempId = copy(Products)[, temp_id := 1L]
+```
+
+2. Join both tables based on the new column and remove it once the process is done, as there is no reason to keep it after joining.
+
+```{r}
+AllProductsMix =
+  ProductsTempId[ProductsTempId,
+                 on = "temp_id",
+                 allow.cartesian = TRUE]
+
+AllProductsMix[, temp_id := NULL]
+
+# Removing type to make it easier to see the result when printing the table
+AllProductsMix[, !c("type", "i.type")]
+```
+
+
+### 3.7. Full join
+
+Use this method if you need to combine columns from 2 tables based on one or more references ***without removing any row***.
+
+As we saw in the previous sections, none of the prior operations can keep both the unmatched `product_id = 6` and the **soda** (`product_id = 4`) as part of the results.
+
+To solve this problem, we can use the `merge` function, even though it is slower than using the native `data.table` joining syntax.
+
+```{r}
+merge(x = Products,
+      y = ProductReceived,
+      by.x = "id",
+      by.y = "product_id",
+      all = TRUE,
+      sort = FALSE)
+```
+
+
+## 4. Non-equi join
+
+A non-equi join is a type of join where the condition for matching rows is not based on equality, but on other comparison operators like `<`, `>`, `<=`, or `>=`. This allows for **more flexible joining criteria**. In `data.table`, non-equi joins are particularly useful for operations like:
+
+- Finding the nearest match
+- Comparing ranges of values between tables
+
+It's a great alternative if, after applying a right or inner join:
+
+- You want to decrease the number of returned rows based on comparing numeric columns from different tables.
+- You don't need to keep the columns from table `x` *(secondary data.table)* in the final table.
+
+To illustrate how this works, let's focus our attention on the sales and receipts for product 2.
+  
+```{r}
+ProductSalesProd2 = ProductSales[product_id == 2L]
+ProductReceivedProd2 = ProductReceived[product_id == 2L]
+```
+
+If we want to know, for example, whether we can find any receipt that took place before a sales date, we can apply the following code.
+
+```{r}
+ProductReceivedProd2[ProductSalesProd2,
+                     on = "product_id",
+                     allow.cartesian = TRUE
+][date < i.date]
+```
+
+What happens if we apply the same logic in the list passed to `on`?
+
+- As this operation is still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.
+
+- The `date` column from `ProductReceivedProd2` was omitted from this new table.
+
+```{r}
+ProductReceivedProd2[ProductSalesProd2,
+                     on = list(product_id, date < date)]
+```
+
+Now, by adding `nomatch = NULL` to the join, we can limit the results to show only the cases that meet all joining criteria.
+
+```{r}
+ProductReceivedProd2[ProductSalesProd2,
+                     on = list(product_id, date < date),
+                     nomatch = NULL]
+```
+
+
+## 5. Rolling join
+
+Rolling joins are particularly useful in time-series data analysis. They allow you to **match rows based on the nearest value** in a sorted column, typically a date or time column. 
+
+This is valuable when you need to align data from different sources **that may not have exactly matching timestamps**, or when you want to carry forward the most recent value. 
+
+For example, in financial data, you might use a rolling join to assign the most recent stock price to each transaction, even if the price updates and transactions don't occur at the exact same times.
+
+
+In our supermarket example, we can use a rolling join to match sales with the most recent product information.
+
+Let's assume that the price for bananas and carrots changes on the first day of each month.
+
+```{r}
+ProductPriceHistory = data.table(
+  product_id = rep(1:2, each = 3),
+  date = rep(as.IDate(c("2024-01-01", "2024-02-01", "2024-03-01")), 2),
+  price = c(0.59, 0.63, 0.65,  # Banana prices
+            0.79, 0.89, 0.99)  # Carrot prices
+)
+
+ProductPriceHistory
+```
+
+Now, we can perform a right join assigning a different price to each product based on the sale date.
+
+```{r}
+ProductPriceHistory[ProductSales,
+                    on = .(product_id, date),
+                    roll = TRUE,
+                    j = .(product_id, date, count, price)]
+```
+
+If we just want to see the matching cases we just need to add the argument `nomatch = NULL` to perform an inner rolling join.
+
+```{r}
+ProductPriceHistory[ProductSales,
+                    on = .(product_id, date),
+                    roll = TRUE,
+                    nomatch = NULL,
+                    j = .(product_id, date, count, price)]
+```
+
+## 7. Taking advantage of joining speed
+
+### 7.1. Subsets as joins
+
+As we just saw in the prior section, the `x` table gets filtered by the values available in the `i` table. Actually, that process is faster than passing a Boolean expression to the `i` argument.
+
+To filter the `x` table at speed we don't need to pass a complete `data.table`; we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.
+
+For example, to filter dates where the market received 100 units of bananas (`product_id = 1`) or popcorn (`product_id = 3`) we can use the following:
+
+```{r}
+ProductReceived[list(c(1L, 3L), 100L),
+                on = c("product_id", "count")]
+```
+
+Because we are ultimately filtering through a join operation, the code returned a **row that was not present in the original table**. To avoid that behavior, it is recommended to always add the argument `nomatch = NULL`.
+
+```{r}
+ProductReceived[list(c(1L, 3L), 100L),
+                on = c("product_id", "count"),
+                nomatch = NULL]
+```
+
+
+We can also use this technique to filter out any combination of values by prefixing them with `!` to negate the expression in the `i` argument and keeping the `nomatch` with its default value. For example, we can filter out the 2 rows we filtered before.
+
+```{r}
+ProductReceived[!list(c(1L, 3L), 100L),
+                on = c("product_id", "count")]
+```
+
+If you just want to filter by values in a single **character column**, you can omit the `list()` call and pass the values to be filtered directly in the `i` argument.
+
+```{r}
+Products[c("banana","popcorn"),
+         on = "name",
+         nomatch = NULL]
+
+Products[!"popcorn",
+         on = "name"]
+
+```
+
+
+
+### 7.2. Updating by reference
+
+The `:=` operator in data.table is used for updating or adding columns by reference. This means it modifies the original data.table without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a data.table, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.
+
+Let's update our `Products` table with the latest price from `ProductPriceHistory`:
+
+```{r}
+copy(Products)[ProductPriceHistory,
+               on = .(id = product_id),
+               j = `:=`(price = tail(i.price, 1),
+                        last_updated = tail(i.date, 1)),
+               by = .EACHI][]
+```
+
+In this operation:
+
+- The `copy` function prevents `:=` from changing the `Products` table by reference.
+- We join `Products` with `ProductPriceHistory` based on `id` and `product_id`.
+- We update the `price` column with the latest price from `ProductPriceHistory`.
+- We add a new `last_updated` column to track when the price was last changed.
+- The `by = .EACHI` ensures that the `tail` function is applied for each product in `ProductPriceHistory`.
+
+***
+
+## Reference
+
+- *Understanding data.table Rolling Joins*: https://www.r-bloggers.com/2016/06/understanding-data-table-rolling-joins/
+
+- *Semi-join with data.table*: https://stackoverflow.com/questions/18969420/perform-a-semi-join-with-data-table
+
+- *Cross join with data.table*: https://stackoverflow.com/questions/10600060/how-to-do-cross-join-in-r
+
+- *How does one do a full join using data.table?*: https://stackoverflow.com/questions/15170741/how-does-one-do-a-full-join-using-data-table
+
+- *Enhanced data.frame*: https://rdatatable.gitlab.io/data.table/reference/data.table.html
+
diff --git a/vignettes/datatable-keys-fast-subset.Rmd b/vignettes/datatable-keys-fast-subset.Rmd
index d85f69ad8..5602c3205 100644
--- a/vignettes/datatable-keys-fast-subset.Rmd
+++ b/vignettes/datatable-keys-fast-subset.Rmd
@@ -20,13 +20,13 @@ knitr::opts_chunk$set(
 .old.th = setDTthreads(1)
 ```
 
-This vignette is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, add/modify/delete columns *by reference* in `j` and group by using `by`. If you're not familiar with these concepts, please read the *"Introduction to data.table"* and *"Reference semantics"* vignettes first.
+This vignette is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, add/modify/delete columns *by reference* in `j` and group by using `by`. If you're not familiar with these concepts, please read the `vignette("datatable-intro", package="data.table")` and the `vignette("datatable-reference-semantics", package="data.table")` first.
 
 ***
 
 ## Data {#data}
 
-We will use the same `flights` data as in the *"Introduction to data.table"* vignette.
+We will use the same `flights` data as in the `vignette("datatable-intro", package="data.table")`.
 
 ```{r echo = FALSE}
 options(width = 100L)
@@ -54,7 +54,7 @@ In this vignette, we will
 
 ### a) What is a *key*?
 
-In the *"Introduction to data.table"* vignette, we saw how to subset rows in `i` using logical expressions, row numbers and using `order()`. In this section, we will look at another way of subsetting incredibly fast - using *keys*.
+In the `vignette("datatable-intro", package="data.table")`, we saw how to subset rows in `i` using logical expressions, row numbers and using `order()`. In this section, we will look at another way of subsetting incredibly fast - using *keys*.
 
 But first, let's start by looking at *data.frames*. All *data.frames* have a row names attribute. Consider the *data.frame* `DF` below.
 
@@ -139,7 +139,7 @@ head(flights)
 
 * Alternatively you can pass a character vector of column names to the function `setkeyv()`. This is particularly useful while designing functions to pass columns to set key on as function arguments.
 
-* Note that we did not have to assign the result back to a variable. This is because like the `:=` function we saw in the *"Reference semantics"* vignette, `setkey()` and `setkeyv()` modify the input *data.table* *by reference*. They return the result invisibly.
+* Note that we did not have to assign the result back to a variable. This is because like the `:=` function we saw in the `vignette("datatable-reference-semantics", package="data.table")`, `setkey()` and `setkeyv()` modify the input *data.table* *by reference*. They return the result invisibly.
 
 * The *data.table* is now reordered (or sorted) by the column we provided - `origin`. Since we reorder by reference, we only require additional memory of one column of length equal to the number of rows in the *data.table*, and is therefore very memory efficient.
 
@@ -258,7 +258,7 @@ flights[.("LGA", "TPA"), .(arr_delay)]
 
 * The *row indices* corresponding to `origin == "LGA"` and `dest == "TPA"` are obtained using *key based subset*.
 
-* Once we have the row indices, we look at `j` which requires only the `arr_delay` column. So we simply select the column `arr_delay` for those *row indices* in the exact same way as we have seen in *Introduction to data.table* vignette.
+* Once we have the row indices, we look at `j` which requires only the `arr_delay` column. So we simply select the column `arr_delay` for those *row indices* in the exact same way as we have seen in `vignette("datatable-intro", package="data.table")`.
 
 * We could have returned the result by using `with = FALSE` as well.
 
@@ -286,7 +286,7 @@ flights[.("LGA", "TPA"), max(arr_delay)]
 
 ### d) *sub-assign* by reference using `:=` in `j`
 
-We have seen this example already in the *Reference semantics* vignette. Let's take a look at all the `hours` available in the `flights` *data.table*:
+We have seen this example already in the `vignette("datatable-reference-semantics", package="data.table")`. Let's take a look at all the `hours` available in the `flights` *data.table*:
 
 ```{r}
 # get all 'hours' in flights
@@ -494,7 +494,7 @@ In this vignette, we have learnt another method to subset rows in `i` by keying
 
 * combine key based subsets with `j` and `by`. Note that the `j` and `by` operations are exactly the same as before.
 
-Key based subsets are **incredibly fast** and are particularly useful when the task involves *repeated subsetting*. But it may not be always desirable to set key and physically reorder the *data.table*. In the next vignette, we will address this using a *new* feature -- *secondary indexes*.
+Key based subsets are **incredibly fast** and are particularly useful when the task involves *repeated subsetting*. But it may not always be desirable to set a key and physically reorder the *data.table*. In the next `vignette("datatable-secondary-indices-and-auto-indexing", package="data.table")`, we will address this using a *new* feature -- *secondary indexes*.
 
 
 ```{r, echo=FALSE}
diff --git a/vignettes/datatable-reference-semantics.Rmd b/vignettes/datatable-reference-semantics.Rmd
index 170783165..3f9fd7328 100644
--- a/vignettes/datatable-reference-semantics.Rmd
+++ b/vignettes/datatable-reference-semantics.Rmd
@@ -19,13 +19,13 @@ knitr::opts_chunk$set(
  collapse = TRUE)
 .old.th = setDTthreads(1)
 ```
-This vignette discusses *data.table*'s reference semantics which allows to *add/update/delete* columns of a *data.table by reference*, and also combine them with `i` and `by`. It is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the *"Introduction to data.table"* vignette first.
+This vignette discusses *data.table*'s reference semantics, which allow you to *add/update/delete* columns of a *data.table by reference*, and also to combine these operations with `i` and `by`. It is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the `vignette("datatable-intro", package="data.table")` first.
 
 ***
 
 ## Data {#data}
 
-We will use the same `flights` data as in the *"Introduction to data.table"* vignette.
+We will use the same `flights` data as in the `vignette("datatable-intro", package="data.table")`.
 
 ```{r echo = FALSE}
 options(width = 100L)
@@ -165,7 +165,7 @@ We see that there are totally `25` unique values in the data. Both *0* and *24*
 flights[hour == 24L, hour := 0L]
 ```
 
-* We can use `i` along with `:=` in `j` the very same way as we have already seen in the *"Introduction to data.table"* vignette.
+* We can use `i` along with `:=` in `j` the very same way as we have already seen in the `vignette("datatable-intro", package="data.table")`.
 
 * Column `hour` is replaced with `0` only on those *row indices* where the condition `hour == 24L` specified in `i` evaluates to `TRUE`.
 
@@ -230,7 +230,7 @@ head(flights)
 
 * We provide the columns to group by the same way as shown in the *Introduction to data.table* vignette. For each group, `max(speed)` is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. `flights` *data.table* is modified *in-place*.
 
-* We could have also provided `by` with a *character vector* as we saw in the *Introduction to data.table* vignette, e.g., `by = c("origin", "dest")`.
+* We could have also provided `by` with a *character vector* as we saw in the `vignette("datatable-intro", package="data.table")`, e.g., `by = c("origin", "dest")`.
 
 #
 
@@ -249,7 +249,7 @@ head(flights)
 
 * Note that since we allow assignment by reference without quoting column names when there is only one column as explained in [Section 2c](#delete-convenience), we can not do `out_cols := lapply(.SD, max)`. That would result in adding one new column named `out_cols`. Instead we should do either `c(out_cols)` or simply `(out_cols)`. Wrapping the variable name with `(` is enough to differentiate between the two cases.
 
-* The `LHS := RHS` form allows us to operate on multiple columns. In the RHS, to compute the `max` on columns specified in `.SDcols`, we make use of the base function `lapply()` along with `.SD` in the same way as we have seen before in the *"Introduction to data.table"* vignette. It returns a list of two elements, containing the maximum value corresponding to `dep_delay` and `arr_delay` for each group.
+* The `LHS := RHS` form allows us to operate on multiple columns. In the RHS, to compute the `max` on columns specified in `.SDcols`, we make use of the base function `lapply()` along with `.SD` in the same way as we have seen before in the `vignette("datatable-intro", package="data.table")`. It returns a list of two elements, containing the maximum value corresponding to `dep_delay` and `arr_delay` for each group.
 
 #
 Before moving on to the next section, let's clean up the newly created columns `speed`, `max_speed`, `max_dep_delay` and `max_arr_delay`.
@@ -365,7 +365,7 @@ However we could improve this functionality further by *shallow* copying instead
 
 * It is used to *add/update/delete* columns by reference.
 
-* We have also seen how to use `:=` along with `i` and `by` the same way as we have seen in the *Introduction to data.table* vignette. We can in the same way use `keyby`, chain operations together, and pass expressions to `by` as well all in the same way. The syntax is *consistent*.
+* We have also seen how to use `:=` along with `i` and `by` the same way as we have seen in the `vignette("datatable-intro", package="data.table")`. In the same way, we can use `keyby`, chain operations together, and pass expressions to `by`. The syntax is *consistent*.
 
 * We can use `:=` for its side effect or use `copy()` to not modify the original object while updating by reference.
 
@@ -375,6 +375,6 @@ setDTthreads(.old.th)
 
 #
 
-So far we have seen a whole lot in `j`, and how to combine it with `by` and little of `i`. Let's turn our attention back to `i` in the next vignette *"Keys and fast binary search based subset"* to perform *blazing fast subsets* by *keying data.tables*.
+So far we have seen a whole lot in `j`, and how to combine it with `by` and little of `i`. Let's turn our attention back to `i` in the next vignette `vignette("datatable-keys-fast-subset", package="data.table")` to perform *blazing fast subsets* by *keying data.tables*.
 
 ***
diff --git a/vignettes/datatable-sd-usage.Rmd b/vignettes/datatable-sd-usage.Rmd
index 8d7b6bd04..a1c76713d 100644
--- a/vignettes/datatable-sd-usage.Rmd
+++ b/vignettes/datatable-sd-usage.Rmd
@@ -112,7 +112,8 @@ head(unique(Teams[[fkt[1L]]]))
 
 Note: 
 
-1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See [reference semantics](https://cran.r-project.org/package=data.table/vignettes/datatable-reference-semantics.html) for more. 
+
+1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See `vignette("datatable-reference-semantics", package="data.table")` for more.
 2. The LHS, `names(.SD)`, indicates which columns we are updating - in this case we update the entire `.SD`.
 3. The RHS, `lapply()`, loops through each column of the `.SD` and converts the column to a factor.
 4. We use the `.SDcols` to only select columns that have pattern of `teamID`.
diff --git a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
index 7d631fe93..ba3ec267e 100644
--- a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
+++ b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
@@ -26,7 +26,7 @@ This vignette assumes that the reader is familiar with data.table's `[i, j, by]`
 
 ## Data {#data}
 
-We will use the same `flights` data as in the *"Introduction to data.table"* vignette.
+We will use the same `flights` data as in the `vignette("datatable-intro", package="data.table")`.
 
 ```{r echo = FALSE}
 options(width = 100L)
@@ -189,7 +189,7 @@ flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
 
 ### b) Select in `j`
 
-All the operations we will discuss below are no different to the ones we already saw in the *Keys and fast binary search based subset* vignette. Except we'll be using the `on` argument instead of setting keys.
+All the operations we will discuss below are no different to the ones we already saw in the `vignette("datatable-keys-fast-subset", package="data.table")`. Except we'll be using the `on` argument instead of setting keys.
 
 #### -- Return `arr_delay` column alone as a data.table corresponding to `origin = "LGA"` and `dest = "TPA"`
 
@@ -215,7 +215,7 @@ flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
 
 ### e) *sub-assign* by reference using `:=` in `j`
 
-We have seen this example already in the *Reference semantics* and *Keys and fast binary search based subset* vignette. Let's take a look at all the `hours` available in the `flights` *data.table*:
+We have seen this example already in the `vignette("datatable-reference-semantics", package="data.table")` and the `vignette("datatable-keys-fast-subset", package="data.table")`. Let's take a look at all the `hours` available in the `flights` *data.table*:
 
 ```{r}
 # get all 'hours' in flights
@@ -249,7 +249,7 @@ head(ans)
 
 ### g) The *mult* argument
 
-The other arguments including `mult` work exactly the same way as we saw in the *Keys and fast binary search based subset* vignette. The default value for `mult` is "all". We can choose, instead only the "first" or "last" matching rows should be returned.
+The other arguments including `mult` work exactly the same way as we saw in the `vignette("datatable-keys-fast-subset", package="data.table")`. The default value for `mult` is "all". We can instead choose to return only the "first" or "last" matching row.
 
 #### -- Subset only the first matching row where `dest` matches *"BOS"* and *"DAY"*
 
@@ -323,7 +323,7 @@ system.time(dt[x %in% 1989:2012])
 
 In recent version we extended auto indexing to expressions involving more than one column (combined with `&` operator). In the future, we plan to extend binary search to work with more binary operators like `<`, `<=`, `>` and `>=`.
 
-We will discuss fast *subsets* using keys and secondary indices to *joins* in the next vignette, *"Joins and rolling joins"*.
+We will extend fast *subsets* using keys and secondary indices to *joins* in the next vignette, `vignette("datatable-joins", package="data.table")`.
 
 ***