Refactor accumulators #575
Conversation
This will allow us to store the id in the struct itself.
There is no need for IndexMap here, we don't iterate or care about indices.
We used to have the concept of "resetting" a tracked struct, which removed all instances. We don't have that concept anymore. Therefore we don't need to record a read on the tracked struct table itself.
This is a step towards refactoring `fetch` to return the memo itself and not the value.
Change the core operation to be `refresh_memo`. This is a nice refactoring on its own but it will also support the refactoring of how we manage accumulator values.
We used to store values in a central map, but now each memo has an `AccumulatorMap` holding its accumulated values (if any). The primary goals of this change are:

* forward compatibility with speculative execution, because it puts more data into tables;
* a step towards refactoring output tracking so that outputs are no longer stored in the same array as inputs, which will simplify how we do versioning. We will no longer need to walk the function's outputs and refresh their versions and so forth, because they are stored in the function memo and so they get refreshed automatically when the memo is refreshed.
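The per-memo storage can be pictured with a small sketch. These are illustrative stand-in types only; the real `Memo` and `AccumulatorMap` in salsa carry revisions, ingredient indices, and generic machinery that is omitted here:

```rust
use std::collections::HashMap;

// Hypothetical simplified types for illustration.
type AccumulatorId = u32;

#[derive(Default, Debug)]
struct AccumulatorMap {
    // Sparse: an entry exists only if the function actually
    // accumulated values for that accumulator.
    values: HashMap<AccumulatorId, Vec<String>>,
}

impl AccumulatorMap {
    fn accumulate(&mut self, id: AccumulatorId, value: String) {
        self.values.entry(id).or_default().push(value);
    }
}

// The accumulated values live on the memo itself, next to the
// function's return value, instead of in a central map.
struct Memo<V> {
    value: V,
    accumulated: AccumulatorMap,
}

fn main() {
    let mut memo = Memo { value: 42, accumulated: AccumulatorMap::default() };
    memo.accumulated.accumulate(0, "diagnostic: unused variable".to_string());
    // Refreshing a memo replaces the whole struct, so the accumulated
    // values are refreshed together with the function's return value.
    assert_eq!(memo.accumulated.values[&0].len(), 1);
    println!("value = {}, accumulated = {:?}", memo.value, memo.accumulated);
}
```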
✅ Deploy Preview for salsa-rs canceled.
CodSpeed Performance Report: merging #575 will not alter performance.
The code changes look good to me.
What I remember from the table refactor is that it avoids a hash map lookup by storing the result of `query(a) -> b` at the index `a`. This means the design is based on the assumption that queries are called for most of their arguments (a densely populated map).
I understand that we now do the same for accumulated values: the accumulated values for `query(a)` are stored at the index `a`. I think this could be problematic because the main use case for accumulators is diagnostics, and diagnostics are rare. Only a few queries will have accumulated values. The worst case is that `a` is stored at a very high index, requiring salsa to allocate many pages to store a single accumulated value/diagnostic.
Do you foresee changes to the table representation that can better support sparsely populated tables?
Related: it would also be nice if we could avoid the `Arc` inside of `Memo`, but I don't think that's possible with today's table design because we relocate the `Memo` when storing a new value.
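The page-allocation concern can be made concrete with a toy comparison. These helper functions are hypothetical, not salsa's table code: a dense, index-addressed table must reserve slots up to the highest index used, while a sparse map pays only for occupied entries.

```rust
use std::collections::HashSet;

// Dense table: every index up to the highest one used must be
// backed by storage, even if only a single slot is occupied.
fn dense_slots_needed(highest_index: usize) -> usize {
    highest_index + 1
}

// Sparse map: storage is proportional to the number of occupied
// entries, regardless of how large their indices are.
fn sparse_slots_needed(occupied: &[usize]) -> usize {
    occupied.iter().collect::<HashSet<_>>().len()
}

fn main() {
    // The worst case from the review: a single diagnostic for the
    // file stored at index 100_000.
    assert_eq!(dense_slots_needed(100_000), 100_001);
    assert_eq!(sparse_slots_needed(&[100_000]), 1);
}
```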
// NOTE: We don't have a precise way to track accumulated values at present,
// so we report any read of them as an untracked read.
//
// Like tracked struct fields, accumulated values are essentially a "side channel output"
// from a tracked function, hence we can't report this as a read of the tracked function(s)
// whose accumulated values we are probing, since the accumulated values may have changed
// even when the main return value of the function has not changed.
//
// Unlike tracked struct fields, we don't have a distinct id or ingredient to represent
// "the values of type A accumulated by tracked function X". Typically accumulated values
// are read from outside of salsa anyway so this is not a big deal.
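The behavior this comment describes can be modeled in a few lines. `ActiveQuery` here is a hypothetical stand-in, not salsa's actual runtime type:

```rust
// Toy model: reading accumulated values from inside a query has no
// precise dependency to record, so it falls back to an untracked
// read, which forces the query to re-execute in every revision.
#[derive(Default)]
struct ActiveQuery {
    untracked_read: bool,
}

impl ActiveQuery {
    fn report_untracked_read(&mut self) {
        self.untracked_read = true;
    }

    // Probing accumulated values marks the current query as having
    // an untracked dependency before returning them.
    fn read_accumulated(&mut self) -> Vec<String> {
        self.report_untracked_read();
        vec!["diagnostic".to_string()]
    }
}

fn main() {
    let mut query = ActiveQuery::default();
    let diags = query.read_accumulated();
    assert!(query.untracked_read);
    assert_eq!(diags.len(), 1);
}
```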
I believe this is an existing comment and unrelated to the changes in this PR. At least I remember coming across it when evaluating whether Ruff should use accumulators.
We decided not to use accumulators and instead return the results as part of the function's output, because of the added `untracked_read` dependency when accumulators are read inside a query. Maybe we're using them incorrectly, but our use case is:

- a `check_file` query that calls `lint` and `infer`
- the `lint` query creates `LintDiagnostic`s
- the `infer` query creates `TypeDiagnostic`s
I should have said that the code changes are great! I find it overall easier to follow what's happening. But I am interested in your perspective on whether the "densely populated" tables assumption is an issue for accumulators. It just feels wasteful to me to allocate a bunch of pages to store a single diagnostic for the file with index 100_000.
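A rough sketch of the alternative Ruff chose, with plain functions standing in for the salsa queries. The names (`check_file`, `lint`, `infer`) and diagnostic kinds mirror the list above; the salsa attributes and database plumbing are omitted, and the bodies are invented for illustration:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Diagnostic {
    Lint(String),
    Type(String),
}

// Stand-in for the `lint` query: produces LintDiagnostic-like values.
fn lint(source: &str) -> Vec<Diagnostic> {
    if source.contains("unused") {
        vec![Diagnostic::Lint("unused variable".into())]
    } else {
        vec![]
    }
}

// Stand-in for the `infer` query: produces TypeDiagnostic-like values.
fn infer(source: &str) -> Vec<Diagnostic> {
    if source.contains("mismatch") {
        vec![Diagnostic::Type("type mismatch".into())]
    } else {
        vec![]
    }
}

// Diagnostics are part of the ordinary return value, so reading them
// from `check_file` is a normal tracked read, not an untracked one.
fn check_file(source: &str) -> Vec<Diagnostic> {
    let mut diags = lint(source);
    diags.extend(infer(source));
    diags
}

fn main() {
    let diags = check_file("let unused = mismatch;");
    assert_eq!(diags.len(), 2);
}
```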
The dense table discussion is something that we can resolve separately from this PR.
One more data point for that discussion: Ruff has a `check_file(File)` query that runs for first-party files only. I used countme to get some rough numbers, and `check_file` is called for 1538 out of the 3750 files (first-party, third-party, and standard library files).
Actually, that's not quite right. The accumulators are not stored "densely" in the same way as tracked function memos are stored. Every tracked function memo has a hashmap, yes, but the hashmap is empty most of the time. We only add an entry when you actually accumulate something. So it's sparse in the sense that we don't assume most tracked functions will accumulate something from most accumulators.

I feel like what is needed here is probably some set of diagrams, as using English paragraphs feels like a fairly imprecise way to describe what is going on; I am afraid we may be saying similar things in different ways. That said, let me give you an example. Suppose there are four tracked functions (F1, F2, F3, and F4) and 3 accumulators (A1, A2, A3), and suppose that
In that case, the storage would look like
In contrast, with tracked structs, at least the way it's set up now, we use a vector of memos. So if we have a tracked struct
then the structs' memo listings would look like:
I strongly suspect we should use some other system for memo storage. That said, the above case doesn't look too bad to me; what's really painful is that we assign globally unique indices to functions, even though every function can be used with at most one kind of tracked struct. We should let the indices overlap. In other words, if we had two kinds of tracked structs (S and T) and a tracked function FT that gets invoked on T instances, then EVERY instance of
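The sparse per-memo layout described above can be modeled with a toy example. The names F1..F4 and A1..A3 come from the example in the comment; everything else is illustrative, not salsa's real storage:

```rust
use std::collections::HashMap;

type FunctionId = &'static str;
type AccumulatorId = &'static str;

// Each tracked-function memo carries its own (usually empty) map of
// accumulated values.
#[derive(Default)]
struct Memo {
    accumulated: HashMap<AccumulatorId, Vec<&'static str>>,
}

fn main() {
    let mut memos: HashMap<FunctionId, Memo> = ["F1", "F2", "F3", "F4"]
        .into_iter()
        .map(|f| (f, Memo::default()))
        .collect();

    // Suppose only F2 accumulates anything, into A1.
    memos
        .get_mut("F2")
        .unwrap()
        .accumulated
        .entry("A1")
        .or_default()
        .push("diagnostic");

    // Sparse in practice: three of the four per-memo maps stay empty,
    // and no slot is ever reserved for the (F, A) pairs never used.
    let non_empty = memos.values().filter(|m| !m.accumulated.is_empty()).count();
    assert_eq!(non_empty, 1);
}
```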
Correct.
Hmm. So, with the old representation, it was very hard to determine when the set of values accumulated by a function had changed, which is why I adopted this "unversioned" approach. In the new representation, however, it's actually fairly easy. We have the old memo, and we have its

That said, in the code I've written, I've only used accumulators for errors, and I always read errors from outside of tracked functions (e.g., in the "main IDE driver"). This also means that I structure these values differently. For example, in my dada compiler, all spans are stored relative to the enclosing item, so that they don't change when new lines are inserted, but spans for errors are

Perhaps we want to make it configurable?
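The "fairly easy" comparison could look roughly like this sketch, with simplified stand-in types rather than salsa's real memo machinery: diff the old memo's accumulated values against the new ones, and only report a change when they differ.

```rust
use std::collections::HashMap;

// Hypothetical simplified representation of a memo's accumulated values.
type AccumulatorId = u32;
type AccumulatorMap = HashMap<AccumulatorId, Vec<String>>;

// When refreshing a memo, compare old and new accumulated values; a
// "versioned accumulators" scheme would only bump the changed-at
// revision when this returns true.
fn accumulated_changed(old: &AccumulatorMap, new: &AccumulatorMap) -> bool {
    old != new
}

fn main() {
    let mut old = AccumulatorMap::new();
    old.insert(0, vec!["error: missing semicolon".into()]);

    let same = old.clone();
    let mut different = old.clone();
    different.get_mut(&0).unwrap().push("error: unknown name".into());

    assert!(!accumulated_changed(&old, &same));
    assert!(accumulated_changed(&old, &different));
}
```

Note the caveat from the comment above: if accumulated values embed absolute spans, a plain equality check would report a change after every edit that shifts line numbers, which is one reason making the comparison configurable might be attractive.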
My hunch is that the use case you are describing is somewhat different, but worth figuring out. I am imagining something like an "accumulation field" on a tracked struct, versus a global accumulator. Maybe we should dive into your particular case a bit deeper. This is also suggesting to me that it'd be good to start writing out the "Canonical Salsa 3.0 design patterns".
Thanks for the detailed explanation. I read through the code changes again and realized that I missed that the accumulated values are stored on the
Oh interesting. My understanding was that the table design was different in that each function would have its own table, rather than it being stored on the ingredient. But storing it on the ingredient itself
This would give us a worse overall yield but a better yield when introducing a new tracked structure,
Yes, this isn't the case for us because the main output of
Refactor how accumulators are implemented. Instead of storing the values accumulated by a particular function X in a central map in the ingredient, attach them to the memo for X. This moves us a big step closer towards having all relevant data stored in tables and will help with speculative execution, lightweight dependencies, and a number of other forward-looking goals.
Review requested from @MichaReiser, if you have time, and others too =)