Add new package infrastructure (`epichains` classes and methods) #33

jamesmbaazam · 2023-06-12T22:36:07Z

This PR addresses several issues:

Refactored bpmodels::chain_ll() to likelihood() to close refactor chain_ll() #5 and fix Refactor chain_sim() #13. Tests have been delayed to accommodate any changes in name and structure resulting from code review.
Refactored bpmodels::chain_sim_susc() by splitting it into simulate_tree() and simulate_summary() to close refactor chain_sim_susc() #14. Tests have been delayed to accommodate any changes in name and structure resulting from code review.
Refactored and renamed chain_sim_susc() to simulate_tree_from_pop(). This closes refactor chain_sim_susc() #14.
Refactored and renamed chain_ll() to likelihood(). This closes Rename current functions #26.
Implemented an epichains class to partly fix Consider S3 output class for chain_ll() and chain_sim() #4. likelihood(), which is a refactoring of chain_ll() currently does not implement S3 because its use case is currently unknown.

Essentially, merging this PR will introduce:

Four simulation functions:
- simulate_tree() - simulates branching processes from a given number of chains.
- simulate_tree_from_pop() - simulates branching processes from a given population size and pre-existing immunity.
- simulate_summary() - simulates a vector of cluster sizes or lengths from branching processes
- likelihood() - calculates the likelihood/loglihood of observing transmission chains of given sizes or lengths.
An epichains class that stores data.frame-like output from simulate_tree() and simulate_tree_from_pop() with a chains_tree attribute and vector-like model output from simulate_summary() with a chains_summary attribute.
print(), summary(), and aggregate() methods. Note that the aggregate() method only works for epichains objects with the chains_tree attribute.

sbfnk

Great stuff! A few comments:

I'm not a fan of the mixture of vec/vect and think we could do with something more explicit here. Perhaps _summaries?
estimate_likelihood doesn't always produce an estimate. If an analytical solution is available then the results is exact. Perhaps this should be just likelihood? Would have to rename the file anme too then and distinguish from likelihoods.R (which should perhaps dist_likelihoods.R or something like that?).
this is probably going too far but since you're redisgning already anyway you could consider whether you want to implement this in the line of the more generic modelling functions/packages in R, e.g. have epichains (or similar) objects that contain model parameters so one could call logLik(epichain) or perhaps even predict(epichain) (instead of simulate). Again, I'm not at all convinced this is a good idea but I think it might be worth at least thinking about.

.gitignore

CITATION.cff

sbfnk · 2023-06-21T09:35:55Z

CITATION.cff

-keywords:
- branching-process
- epidemic-dynamics
- epidemic-modelling
- epidemic-simulations
- outbreak-simulator
- r
- r-package
- transmission-chain
- transmission-chain-reconstruction


any reason to remove these?

I think the tags were probably automatically removed when the package was forked. I'll re-add them.

I just added the tags back on this repo and have also set up the CITATION.cff syncing workflow here (54a29ff), so that should resolve this.

sbfnk · 2023-06-21T09:46:15Z

R/epichains.R

+      )
+
+    # print head of the simulation output
+    print(head(x[!is.na(x$ancestor), ]))


Do we need to sort by ancestor first? Or do we assume users don't mess with the order?

That's a good point. I actually wonder if it won't be best to sort by sim_id then ancestor, so that it's easier to see the chains within each simulation. I have implemented that here 6972ed7.

sbfnk · 2023-06-21T09:49:07Z

R/likelihood_estimation.R

+#' @param chains_observed Vector of sizes/lengths of transmission chains.
+#' @param chain_statistic Statistic given as \code{chains_observed}
+#' ("size" or "length" of chains).
+#' @param offspring_sampler Offspring distribution: a character string


we're not providing the sampler here but the distribution (from which the sampler is obtained internally), so would it make sense to call this offspring_dist?

Agreed. I've made the change.

sbfnk · 2023-06-21T11:02:23Z

R/simulate.r

+#' simulate_tree_from_pop(pop = 100, offspring_sampler = "nbinom",
+#' mean_offspring = 0.5, disp_offspring = 1.1, serial_sampler = function(x) 3)
+#' @export
+simulate_tree_from_pop <- function(pop,


Is the plan to eventially merge this into simulate_tree with a pop option? I think that might make sense if it doesn't make that function too complex.

sbfnk · 2023-06-21T11:05:37Z

R/epichains.R

+  if (attributes(x)$chain_type == "chains_tree") {
+    stopifnot(
+      "object does not contain the correct columns" =
+        c("sim_id", "ancestor", "generation", "time") %in%


"time" is not strictly necessary - only if used with a serial interval

Fixed in 5d3e108.

sbfnk · 2023-06-21T11:08:53Z

R/simulate.r

+#' @param chain_stat_max A cut off for the chain statistic (size/length) being
+#' computed. Results above the specified value, are set to this value.
+#' Defaults to `Inf`.
+#' @param serials_sampler The serial interval generator function; the name of a


perhaps rename it something that contains "interval", e.g. si_sampler (though see also my other comment on supporting character strings vs. anonymous functions here - I think there might be an argument for just supporting character strings in which case this would be si_dist)

I think the result of the discussion (including #25 (comment)) is that we want to do this via character strings everywhere instead of functions (though happy to be told otherwise if you read this differently). Perhaps convert my previous comment to an issue for future processing.

sbfnk · 2023-06-21T11:12:40Z

R/epichains.R

+    )
+
+    # Offer more information to view the full dataset
+    writeLines(sprintf("Use View(<object_name>) to view the full output."))


View doesn't work well in remote sessions - Perhaps instead have

Suggested change

writeLines(sprintf("Use View(<object_name>) to view the full output."))

writeLines(sprintf("Use as.data.frame(<object_name>) to view the full output."))

I've made the change here c0c6197.

R/epichains.R

pearsonca · 2023-06-21T12:16:19Z

this is probably going too far but since you're redisgning already anyway you could consider whether you want to implement this in the line of the more generic modelling functions/packages in R, e.g. have epichains (or similar) objects that contain model parameters so one could call logLik(epichain) or perhaps even predict(epichain) (instead of simulate). Again, I'm not at all convinced this is a good idea but I think it might be worth at least thinking about.

Definitely worth thinking about, imo. Gut intuition is that doing some light work would get the basic interface requirements met. Would take a fair bit to get everything properly aligned so that it "just works" with the tools written against the generic interface, but that would be potentially worthwhile.

Re tagging for #25 - happy to contribute code to making the-lambda-licious version actually happen, if that's the direction people want.

sbfnk · 2023-06-21T12:26:19Z

happy to contribute code to making the-lambda-licious version actually happen, if that's the direction people want.

I'm not convinced either way but currently leaning towards using character strings to identify r/d functions because

It means people don't have to provide both sampling and density functions but only one distribution
It means we can work with the truncdist package (which would otherwise have to be reimplemented - not impossible as it's fairly lightweight - (1) is the stronger argument I think)

Of course the real issue is that the R distribution interface is not particularly clever or sensible - we could consider using e.g. the more sensible distr6 which would remove the issue altogether but at the cost of some overhead.

pearsonca · 2023-06-21T14:56:37Z

Why would truncdist have to be reimplemented?

sbfnk · 2023-06-21T15:00:24Z

because it works with character strings specifying distributions?

pearsonca · 2023-06-21T15:04:22Z

But itself is a function that takes arguments, yes? So has the same contours as passing any other distribution function + arguments, right?

As in:

library(truncdist)
something <- function(FUN, ...) { FUN(...) }
something(runif, min = 0, max = 1)
# vs
something(rtrunc, spec = "unif", a = 0.25, b = 0.75)

sbfnk · 2023-06-21T15:07:43Z

It's a package that provides distribution-based functions (i.e. q.../r.../d....) for truncating arbitrary distributions. Where the arbitrary distributions are specified as character strings and relevant functions pulled from the environment by name.

pearsonca · 2023-06-21T15:09:38Z

See update with example; there's no need to reimplement truncdist imo. The appropriate interface for interacting with an arbitrary distribution function already handles using something from truncdist + whatever-find-the-underlying-function approach.

I don't think you'll want to use the ... args (since there are multiple function arg chunks to pass), but passing two arg lists - same logic applies.

sbfnk · 2023-06-21T15:14:03Z

Ah but the point is that the truncation could come in separately if the user wants control measures or depletion of susecptibles. Of course the same could be implemented by the user in just providing a truncated distribution but having these separate would provide convenience arguments.

pearsonca · 2023-06-21T15:36:07Z

Hmm - where does truncdist do its look up? if it along the normal environment path, couldn't you always just locally (as in, internal to your function) define some generic name for the distribution function, and then hard code that name as the lookup for truncdist?

So looking at the apparent source (https://github.com/cran/truncdist/blob/master/R/qtrunc.R), it is using get(...), which should abide by the normal environment search rules.

Something like this should be able to be made to work:

library(truncdist)
something <- function(FUN, ...) {
  rFUN <- FUN
  trundist:rtrunc(1, spec = "FUN", a = ..., b = ..., ...) 
}

sbfnk · 2023-06-21T15:52:53Z

Yes, I think that would work, and be a workaround to address point (2).

pearsonca · 2023-06-22T08:40:28Z

R/borel.r

Convention for documenting the d/r/q/p functions is to use a single rdname.

This is also waaaaaaaaay thin in terms of detail for the distribution, parameters, connection to typical measures (mean, sd, ...), etc.

Probably warrants a devoted issue, rather than tacking on to this already ... dense PR.

R/checks.R

R/epichains.R

R/helpers.R

jamesmbaazam · 2023-06-23T09:56:36Z

Conversation from Slack:

@pearsonca
hopefully, the rambling commentary on here isn't too scattershot.

I think the upshot from the discussion is that there is likely a way to support the distributions arguments as functions (and probably also backwards-compatible as strings - sniff w/ is.character(), if yes, promote to function via get()) making implicit explicit: I think those should be functor arguments.

I'm happy to put my keystrokes where my mouth is and help with sorting out how to make that happen

@sbfnk
Thanks, Carl - not at all too rambling, but I’m still not convinced that (sampler = rnorm, density = dnorm) is preferable to (dist = "norm"). I find the second one conceptually more satisfying and potentially (?) less prone to be erroneously used (as we can’t check that sampler and density are consistent with each other).

@pearsonca
mmmm, seems the (dist = "mycustom") suffers the same consistency problem?

@sbfnk
yes if using custom then a similar problem exists
though perhaps requiring them to be called r/d with the same name serves as a memo that they have to be the same?

@pearsonca
agree with preserving the convenience & safety of (dist = "X"), but could likely support just (sampler = rX) => do the NSE to fish out dX in the environment.

probably even (sampler = pkg::rX) and fish out from the right environment.

@sbfnk
Reflecting on this I’m not sure we ever need sampler and density (or, to be precise, log-likelihood as it wouldn’t be e.g. dnorm but the log-likelihood of observing chain size x if the offspring distribution is normal) as we use the sampler only to approximate when there is no analytical density. So having these as separate arguments would make that choice explicit which might be a good thing. It would require the user to find out which log-likelihoods are implemented analytically before making the choice but, again, perhaps exposing that is not a bad thing. (edited)

@pearsonca
perhaps; I will say that I don't have the overall vocabulary of the package top of mind.

@sbfnk
It might be good to write down some of the examples / construct a few more with both notations to make sure we’re not missing any obvious design issues

@pearsonca
but yeah, was about to ask under what circumstances both r/d versions are used at the same time

@sbfnk
it would require the user to specify the size distribution likelihood instead of the offspring distribution which is possibly less intuitive if more explicit. I’ll have a play with some examples (or you feel free to if you fancy) to share and then we can decide.

Perhaps; I will say that I don’t have the overall vocabulary of the package top of mind.

I do!

codecov-commenter · 2023-07-26T13:55:22Z

Codecov Report

❗ No coverage uploaded for pull request base (main@82f7d7a). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 653ca02 differs from pull request most recent head 7e9665a. Consider uploading reports for the commit 7e9665a to get more accurate results

@@           Coverage Diff           @@
##             main      #33   +/-   ##
=======================================
  Coverage        ?   43.18%           
=======================================
  Files           ?        8           
  Lines           ?      396           
  Branches        ?        0           
=======================================
  Hits            ?      171           
  Misses          ?      225           
  Partials        ?        0

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

… to likelihood()

…ummary

Co-authored-by: Sebastian Funk <[email protected]>

jamesmbaazam marked this pull request as draft June 13, 2023 11:01

jamesmbaazam changed the title ~~New package architecture~~ New package infrastructure Jun 13, 2023

jamesmbaazam added the enhancement New feature or request label Jun 14, 2023

jamesmbaazam self-assigned this Jun 14, 2023

jamesmbaazam force-pushed the develop branch from 671addb to 998341a Compare June 16, 2023 17:27

jamesmbaazam marked this pull request as ready for review June 16, 2023 17:27

jamesmbaazam changed the title ~~New package infrastructure~~ Add new package infrastructure (classes and methods) Jun 20, 2023

jamesmbaazam requested a review from sbfnk June 20, 2023 15:27

sbfnk requested changes Jun 21, 2023

View reviewed changes