Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new package infrastructure (epichains classes and methods) #33

Merged
merged 300 commits into from
Sep 2, 2023

Conversation

jamesmbaazam
Copy link
Member

@jamesmbaazam jamesmbaazam commented Jun 12, 2023

This PR addresses several issues:

Essentially, merging this PR will introduce:

  • Four simulation functions:
    • simulate_tree() - simulates branching processes from a given number of chains.
    • simulate_tree_from_pop() - simulates branching processes from a given population size and pre-existing immunity.
    • simulate_summary() - simulates a vector of cluster sizes or lengths from branching processes
    • likelihood() - calculates the likelihood/loglihood of observing transmission chains of given sizes or lengths.
  • An epichains class that stores data.frame-like output from simulate_tree() and simulate_tree_from_pop() with a chains_tree attribute and vector-like model output from simulate_summary() with a chains_summary attribute.
  • print(), summary(), and aggregate() methods. Note that the aggregate() method only works for epichains objects with the chains_tree attribute.

@jamesmbaazam jamesmbaazam marked this pull request as draft June 13, 2023 11:01
@jamesmbaazam jamesmbaazam changed the title New package architecture New package infrastructure Jun 13, 2023
@jamesmbaazam jamesmbaazam added the enhancement New feature or request label Jun 14, 2023
@jamesmbaazam jamesmbaazam self-assigned this Jun 14, 2023
@jamesmbaazam jamesmbaazam marked this pull request as ready for review June 16, 2023 17:27
@jamesmbaazam jamesmbaazam changed the title New package infrastructure Add new package infrastructure (classes and methods) Jun 20, 2023
@jamesmbaazam jamesmbaazam requested a review from sbfnk June 20, 2023 15:27
Copy link
Contributor

@sbfnk sbfnk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! A few comments:

  • I'm not a fan of the mixture of vec/vect and think we could do with something more explicit here. Perhaps _summaries?
  • estimate_likelihood doesn't always produce an estimate. If an analytical solution is available then the results is exact. Perhaps this should be just likelihood? Would have to rename the file anme too then and distinguish from likelihoods.R (which should perhaps dist_likelihoods.R or something like that?).
  • this is probably going too far but since you're redisgning already anyway you could consider whether you want to implement this in the line of the more generic modelling functions/packages in R, e.g. have epichains (or similar) objects that contain model parameters so one could call logLik(epichain) or perhaps even predict(epichain) (instead of simulate). Again, I'm not at all convinced this is a good idea but I think it might be worth at least thinking about.

.gitignore Outdated Show resolved Hide resolved
CITATION.cff Show resolved Hide resolved
CITATION.cff Outdated
Comment on lines 37 to 46
keywords:
- branching-process
- epidemic-dynamics
- epidemic-modelling
- epidemic-simulations
- outbreak-simulator
- r
- r-package
- transmission-chain
- transmission-chain-reconstruction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason to remove these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tags were probably automatically removed when the package was forked. I'll re-add them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added the tags back on this repo and have also set up the CITATION.cff syncing workflow here (54a29ff), so that should resolve this.

R/epichains.R Outdated
)

# print head of the simulation output
print(head(x[!is.na(x$ancestor), ]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to sort by ancestor first? Or do we assume users don't mess with the order?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I actually wonder if it won't be best to sort by sim_id then ancestor, so that it's easier to see the chains within each simulation. I have implemented that here 6972ed7.

#' @param chains_observed Vector of sizes/lengths of transmission chains.
#' @param chain_statistic Statistic given as \code{chains_observed}
#' ("size" or "length" of chains).
#' @param offspring_sampler Offspring distribution: a character string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're not providing the sampler here but the distribution (from which the sampler is obtained internally), so would it make sense to call this offspring_dist?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I've made the change.

#' simulate_tree_from_pop(pop = 100, offspring_sampler = "nbinom",
#' mean_offspring = 0.5, disp_offspring = 1.1, serial_sampler = function(x) 3)
#' @export
simulate_tree_from_pop <- function(pop,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the plan to eventially merge this into simulate_tree with a pop option? I think that might make sense if it doesn't make that function too complex.

R/epichains.R Outdated
if (attributes(x)$chain_type == "chains_tree") {
stopifnot(
"object does not contain the correct columns" =
c("sim_id", "ancestor", "generation", "time") %in%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"time" is not strictly necessary - only if used with a serial interval

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5d3e108.

#' @param chain_stat_max A cut off for the chain statistic (size/length) being
#' computed. Results above the specified value, are set to this value.
#' Defaults to `Inf`.
#' @param serials_sampler The serial interval generator function; the name of a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps rename it something that contains "interval", e.g. si_sampler (though see also my other comment on supporting character strings vs. anonymous functions here - I think there might be an argument for just supporting character strings in which case this would be si_dist)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the result of the discussion (including #25 (comment)) is that we want to do this via character strings everywhere instead of functions (though happy to be told otherwise if you read this differently). Perhaps convert my previous comment to an issue for future processing.

R/epichains.R Outdated
)

# Offer more information to view the full dataset
writeLines(sprintf("Use View(<object_name>) to view the full output."))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

View doesn't work well in remote sessions - Perhaps instead have

Suggested change
writeLines(sprintf("Use View(<object_name>) to view the full output."))
writeLines(sprintf("Use as.data.frame(<object_name>) to view the full output."))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the change here c0c6197.

R/epichains.R Show resolved Hide resolved
@pearsonca
Copy link

  • this is probably going too far but since you're redisgning already anyway you could consider whether you want to implement this in the line of the more generic modelling functions/packages in R, e.g. have epichains (or similar) objects that contain model parameters so one could call logLik(epichain) or perhaps even predict(epichain) (instead of simulate). Again, I'm not at all convinced this is a good idea but I think it might be worth at least thinking about.

Definitely worth thinking about, imo. Gut intuition is that doing some light work would get the basic interface requirements met. Would take a fair bit to get everything properly aligned so that it "just works" with the tools written against the generic interface, but that would be potentially worthwhile.

Re tagging for #25 - happy to contribute code to making the-lambda-licious version actually happen, if that's the direction people want.

@sbfnk
Copy link
Contributor

sbfnk commented Jun 21, 2023

happy to contribute code to making the-lambda-licious version actually happen, if that's the direction people want.

I'm not convinced either way but currently leaning towards using character strings to identify r/d functions because

  1. It means people don't have to provide both sampling and density functions but only one distribution
  2. It means we can work with the truncdist package (which would otherwise have to be reimplemented - not impossible as it's fairly lightweight - (1) is the stronger argument I think)

Of course the real issue is that the R distribution interface is not particularly clever or sensible - we could consider using e.g. the more sensible distr6 which would remove the issue altogether but at the cost of some overhead.

@pearsonca
Copy link

Why would truncdist have to be reimplemented?

@sbfnk
Copy link
Contributor

sbfnk commented Jun 21, 2023

because it works with character strings specifying distributions?

@pearsonca
Copy link

pearsonca commented Jun 21, 2023

But itself is a function that takes arguments, yes? So has the same contours as passing any other distribution function + arguments, right?

As in:

library(truncdist)
something <- function(FUN, ...) { FUN(...) }
something(runif, min = 0, max = 1)
# vs
something(rtrunc, spec = "unif", a = 0.25, b = 0.75)

@sbfnk
Copy link
Contributor

sbfnk commented Jun 21, 2023

It's a package that provides distribution-based functions (i.e. q.../r.../d....) for truncating arbitrary distributions. Where the arbitrary distributions are specified as character strings and relevant functions pulled from the environment by name.

@pearsonca
Copy link

pearsonca commented Jun 21, 2023

See update with example; there's no need to reimplement truncdist imo. The appropriate interface for interacting with an arbitrary distribution function already handles using something from truncdist + whatever-find-the-underlying-function approach.

I don't think you'll want to use the ... args (since there are multiple function arg chunks to pass), but passing two arg lists - same logic applies.

@sbfnk
Copy link
Contributor

sbfnk commented Jun 21, 2023

Ah but the point is that the truncation could come in separately if the user wants control measures or depletion of susecptibles. Of course the same could be implemented by the user in just providing a truncated distribution but having these separate would provide convenience arguments.

@pearsonca
Copy link

pearsonca commented Jun 21, 2023

Hmm - where does truncdist do its look up? if it along the normal environment path, couldn't you always just locally (as in, internal to your function) define some generic name for the distribution function, and then hard code that name as the lookup for truncdist?

So looking at the apparent source (https://github.com/cran/truncdist/blob/master/R/qtrunc.R), it is using get(...), which should abide by the normal environment search rules.

Something like this should be able to be made to work:

library(truncdist)
something <- function(FUN, ...) {
  rFUN <- FUN
  trundist:rtrunc(1, spec = "FUN", a = ..., b = ..., ...) 
}

@sbfnk
Copy link
Contributor

sbfnk commented Jun 21, 2023

Yes, I think that would work, and be a workaround to address point (2).

R/borel.r Outdated

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convention for documenting the d/r/q/p functions is to use a single rdname.

This is also waaaaaaaaay thin in terms of detail for the distribution, parameters, connection to typical measures (mean, sd, ...), etc.

Probably warrants a devoted issue, rather than tacking on to this already ... dense PR.

R/checks.R Show resolved Hide resolved
R/checks.R Outdated Show resolved Hide resolved
R/epichains.R Outdated Show resolved Hide resolved
R/epichains.R Outdated Show resolved Hide resolved
R/helpers.R Outdated Show resolved Hide resolved
@jamesmbaazam
Copy link
Member Author

Conversation from Slack:

@pearsonca
hopefully, the rambling commentary on here isn't too scattershot.

I think the upshot from the discussion is that there is likely a way to support the distributions arguments as functions (and probably also backwards-compatible as strings - sniff w/ is.character(), if yes, promote to function via get()) making implicit explicit: I think those should be functor arguments.

I'm happy to put my keystrokes where my mouth is and help with sorting out how to make that happen

@sbfnk
Thanks, Carl - not at all too rambling, but I’m still not convinced that (sampler = rnorm, density = dnorm) is preferable to (dist = "norm"). I find the second one conceptually more satisfying and potentially (?) less prone to be erroneously used (as we can’t check that sampler and density are consistent with each other).

@pearsonca
mmmm, seems the (dist = "mycustom") suffers the same consistency problem?

@sbfnk
yes if using custom then a similar problem exists
though perhaps requiring them to be called r/d with the same name serves as a memo that they have to be the same?

@pearsonca
agree with preserving the convenience & safety of (dist = "X"), but could likely support just (sampler = rX) => do the NSE to fish out dX in the environment.

probably even (sampler = pkg::rX) and fish out from the right environment.

@sbfnk
Reflecting on this I’m not sure we ever need sampler and density (or, to be precise, log-likelihood as it wouldn’t be e.g. dnorm but the log-likelihood of observing chain size x if the offspring distribution is normal) as we use the sampler only to approximate when there is no analytical density. So having these as separate arguments would make that choice explicit which might be a good thing. It would require the user to find out which log-likelihoods are implemented analytically before making the choice but, again, perhaps exposing that is not a bad thing. (edited)

@pearsonca
perhaps; I will say that I don't have the overall vocabulary of the package top of mind.

@sbfnk
It might be good to write down some of the examples / construct a few more with both notations to make sure we’re not missing any obvious design issues

@pearsonca
but yeah, was about to ask under what circumstances both r/d versions are used at the same time

@sbfnk
it would require the user to specify the size distribution likelihood instead of the offspring distribution which is possibly less intuitive if more explicit. I’ll have a play with some examples (or you feel free to if you fancy) to share and then we can decide.

Perhaps; I will say that I don’t have the overall vocabulary of the package top of mind.

I do!

@codecov-commenter
Copy link

codecov-commenter commented Jul 26, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@82f7d7a). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 653ca02 differs from pull request most recent head 7e9665a. Consider uploading reports for the commit 7e9665a to get more accurate results

@@           Coverage Diff           @@
##             main      #33   +/-   ##
=======================================
  Coverage        ?   43.18%           
=======================================
  Files           ?        8           
  Lines           ?      396           
  Branches        ?        0           
=======================================
  Hits            ?      171           
  Misses          ?      225           
  Partials        ?        0           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jamesmbaazam jamesmbaazam merged commit 8188b63 into main Sep 2, 2023
10 checks passed
@jamesmbaazam jamesmbaazam deleted the develop branch September 2, 2023 17:32
@jamesmbaazam jamesmbaazam mentioned this pull request Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
5 participants