[FEATURE] Ingest Maintainer last engaged date into Metrics cluster #75

bshien · 2024-09-18T21:05:04Z

Is your feature request related to a problem?

Coming from #57

As a prerequisite for #73 and opensearch-project/automation-app#8, the Metrics cluster needs data about each maintainer's repo, name, affiliation, the date they were last engaged, and their inactivity status.

What solution would you like?

An index created in the Metrics OpenSearch cluster called maintainer_engagement, which will have documents with this structure:

{
    "id": "8baa664c-dec0-4201-b4b9-9747c2e7ee45",
    "repository": "opensearch-metrics",
    "name": "Brandon Shien",
    "github_login": "bshien",
    "affiliation": "Amazon",
    "event_type": "issues",
    "event_action": "opened",
    "time_last_engaged": "2024-08-27T00:31:56Z",
    "inactive": false
}

To create these documents, there should be a lambda that uses the github-activity-events index (from #76) to collect or calculate the required fields for each document and index them into the maintainer_engagement index.

This lambda should:

Scrape the MAINTAINERS.md for each repository in the OpenSearch project. This will yield the repo, name, github_login, and affiliation fields.
Make a top hit query for the latest document in the github-activity-events index for each repo, maintainer, and event type.
Use the created_at field of each GitHub Event document to get the time_last_engaged.
For each event type, calculate whether the maintainer should be considered active or inactive based on time_last_engaged and how active the repo is (see below).
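A rough sketch of what the top hits query and the time_last_engaged extraction could look like. The field names repository, sender, type, and action are assumptions about the github-activity-events mapping (only created_at is named above), and the aggregation name latest_event is made up for illustration.

```python
from typing import Optional


def latest_event_query(repo: str, github_login: str, event_type: str) -> dict:
    """Search body asking for the newest event of one type by one maintainer in one repo."""
    return {
        "size": 0,  # only the aggregation result is needed, not the raw hits
        "query": {
            "bool": {
                "filter": [
                    # Assumed field names; adjust to the real index mapping.
                    {"term": {"repository": repo}},
                    {"term": {"sender": github_login}},
                    {"term": {"type": event_type}},
                ]
            }
        },
        "aggs": {
            "latest_event": {
                "top_hits": {
                    "size": 1,
                    "sort": [{"created_at": {"order": "desc"}}],
                    "_source": {"includes": ["created_at", "action"]},
                }
            }
        },
    }


def time_last_engaged(response: dict) -> Optional[str]:
    """Pull created_at out of a top_hits response, or None if there was no event."""
    hits = response["aggregations"]["latest_event"]["hits"]["hits"]
    return hits[0]["_source"]["created_at"] if hits else None
```

The body would be passed to a search call against github-activity-events once per repo, maintainer, and event type. A single request that buckets by maintainer and event type with a terms aggregation might be cheaper, but that is not sketched here.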

To address the problem of waiting longer to flag maintainers of less active repos:

For the inactivity calculation, we can use a linear equation, y = m*x + b, where:
x = the total number of events in a repo
y = the amount of time a maintainer is inactive before we flag them as inactive

And we can calculate the slope (m) and the y-intercept (b) with two points:
(# of events in the repo with the least events, higher bound time to wait (365 days))
(# of events in the repo with the most events, lower bound time to wait (90 days))

This way we have an equation to calculate how long to wait for each repo: we wait longer on repos that are less active and shorter on repos that are more active.

Example: Imagine there is a repo with 600 events. You can use the two starting points to calculate the slope and y-intercept of the linear equation. You can then use the linear equation to calculate how long to wait until maintainers should be flagged.
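As a minimal sketch of that calculation: the function below derives m and b from the two points and evaluates y = m*x + b. The event counts used for the least and most active repos (100 and 1100) are made-up numbers for illustration, not real data, and returning the higher bound when all repos have the same event count is an assumption of this sketch.

```python
def days_until_flagged(
    events: int,
    least_events: int,
    most_events: int,
    higher_bound_days: float = 365.0,
    lower_bound_days: float = 90.0,
) -> float:
    """Days of inactivity to tolerate in a repo with `events` total events."""
    if most_events == least_events:
        # Degenerate case (assumption): no spread between repos, so be lenient.
        return higher_bound_days
    # Slope through (least_events, higher_bound) and (most_events, lower_bound).
    m = (lower_bound_days - higher_bound_days) / (most_events - least_events)
    # Intercept from the first point: higher_bound = m * least_events + b.
    b = higher_bound_days - m * least_events
    return m * events + b


# Hypothetical numbers: least active repo has 100 events, most active has 1100.
threshold = days_until_flagged(600, least_events=100, most_events=1100)
print(round(threshold, 2))  # about 227.5 days for these made-up endpoints
```

The slope is negative, so the wait shrinks from 365 days toward 90 days as the repo's event count grows.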

Now that we know how long to wait, we then calculate inactivity for each event. Let's say that a maintainer has been the actor for the issues, pull_request, and label events within the last 201.01 days.

(The dots in the graph represent when the maintainer last triggered each event)

We would consider this maintainer active and we wouldn't flag them as inactive. Now let's say some time has passed and they have not triggered any new events.

Though some events have passed the threshold, because there is still a single event within the threshold, we still consider the maintainer as active and do not flag.

Now let's say even more time passes without activity:

Now that all events are past the threshold, we flag the maintainer as inactive.

Let's say the maintainer raises an issue in the repo:

Now that an event is within the threshold, we now consider the maintainer as active.

Aggregate all event types to a single document which will definitively say whether a maintainer is inactive.
For each event type and the aggregate event, index these documents to the maintainer_engagement index.
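The per-event and aggregate steps could be sketched as below. The rule follows the walkthrough above: a maintainer is flagged only when every event type is past the threshold. The "All" value for the aggregate document's event_type and the choice to treat a maintainer with no events at all as inactive are assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def parse_ts(ts: str) -> datetime:
    """Parse a GitHub-style UTC timestamp such as 2024-08-27T00:31:56Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def engagement_docs(
    base: dict,
    last_engaged_by_event: Dict[str, str],
    threshold_days: float,
    now: datetime,
) -> List[dict]:
    """One document per event type, plus one aggregate document."""
    limit = timedelta(days=threshold_days)
    docs = [
        {
            **base,
            "event_type": event_type,
            "time_last_engaged": ts,
            "inactive": now - parse_ts(ts) > limit,
        }
        for event_type, ts in last_engaged_by_event.items()
    ]
    # Aggregate: inactive only if every event type is past the threshold.
    # With no events at all, all() is True, i.e. inactive (assumption).
    all_inactive = all(d["inactive"] for d in docs)
    latest = max(last_engaged_by_event.values(), key=parse_ts, default=None)
    docs.append(
        {**base, "event_type": "All", "time_last_engaged": latest, "inactive": all_inactive}
    )
    return docs
```

Here base would carry the fields scraped from MAINTAINERS.md (repository, name, github_login, affiliation), and threshold_days would come from the linear equation for that repo.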

Note:

Because the source of the raw event data is the data lake, which we have only just started populating in real time, the above design would yield actionable data only after a period of months (however long we decide the HIGHER_BOUND of time to wait before flagging to be).

Do you have any additional context?

#57


prudhvigodithi · 2024-09-19T17:38:01Z

Thanks @bshien. I would even go with splitting the documents at the event level, by adding event_name (coming from https://docs.github.com/en/rest/using-the-rest-api/github-event-types?apiVersion=2022-11-28) and setting inactive to true or false for a specific event.

The raw event data collected in #76 already has the event name.

By segregating the documents by user (maintainer), repository, and event name, we can obtain more granular metrics for maintainers, allowing us to infer whether they are active or inactive.

bshien · 2024-10-21T19:04:32Z

Thanks, I've added this to the issue.

dblock · 2024-10-21T20:50:02Z

Scrape the MAINTAINERS.md for each repository in the OpenSearch project. This will yield the repo, name, github_login, and affiliation fields.

You will want to store information about maintainers in the metrics store, so this should be the result of a cron process that runs regularly. Note that maintainers can be added and removed at any time, and sometimes they can be re-added and re-removed. When viewing dashboards that pertain to maintainers we'll want to see the state at the time of the graphs being shown, not the latest state.

Make a top hit query for the latest document in the github-activity-events index for each repo, maintainer, and event type.
Use the created_at field for each GitHub Event document to get the time_last_engaged

IMO maintainer is just a subset of any user. I suggest representing all users in the metrics store, then adding a relationship to them that says maintainer from date X to date Y (or still a maintainer) for a given repo.

For each event type, calculate if the Maintainer should be considered active or inactive based on time_last_engaged and how active the repo is(see below).

Note that you'll want to see active or inactive at the time being displayed.

prudhvigodithi · 2024-10-21T21:29:24Z

Hey dB, the dashboard will show the current repo maintainer stats (for a single repo when the repo filter is applied, or for all maintainers otherwise).

Coming from #76, the metrics cluster will already have the raw information (inferred from the S3 data lake) for all users and for all the targeted maintainer events.

The additional slope logic defined above will only target the maintainers from the MAINTAINERS.md and add a flag inactive to true/false. This will be stored in a different index to just infer if a current maintainer is active or not and the dashboard will show the same. The dashboard can be used to display for the current point in time if the existing maintainer is active or not.

Later we can have another dashboard for user engagement (with the data already part of the metrics cluster from #76), which can target all users and can be used to nominate users as maintainers.

dblock · 2024-10-22T13:12:37Z

The additional slope logic defined above will only target the maintainers from the MAINTAINERS.md and add a flag inactive to true/false. This will be stored in a different index to just infer if a current maintainer is active or not and the dashboard will show the same. The dashboard can be used to display for the current point in time if the existing maintainer is active or not.

This is problematic because the maintainer state is "now" and is incorrect when you travel in time.

I think you need to store maintainer status every time you do a sweep of MAINTAINERS.md with a date, for example username (e.g. dblock), repo (e.g. OpenSearch), emeritus=true/false, date=2024-10-22, updated_at=timestamp. When doing queries you want the maintainer status closest to the date when you're looking for status on, and for current dashboard to use the set with max(updated_at).

This way you can build dashboards of maintainer growth, have data about maintainers at any given day, etc.
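A small sketch of that lookup, using the example fields from the comment above (username, repo, emeritus, date). Reading "closest to the date" as "the most recent sweep on or before the date" is an interpretation, and the function name status_as_of is made up for illustration.

```python
from datetime import date
from typing import List


def status_as_of(snapshots: List[dict], as_of: date) -> List[dict]:
    """Return the maintainer records from the latest sweep on or before `as_of`."""
    eligible = [s for s in snapshots if date.fromisoformat(s["date"]) <= as_of]
    if not eligible:
        return []  # no sweep had happened yet on that date
    latest = max(date.fromisoformat(s["date"]) for s in eligible)
    return [s for s in eligible if date.fromisoformat(s["date"]) == latest]


snapshots = [
    {"username": "dblock", "repo": "OpenSearch", "emeritus": False, "date": "2024-10-21"},
    {"username": "dblock", "repo": "OpenSearch", "emeritus": False, "date": "2024-10-22"},
    {"username": "bshien", "repo": "opensearch-metrics", "emeritus": False, "date": "2024-10-22"},
]
print(len(status_as_of(snapshots, date(2024, 10, 21))))  # 1 record from the first sweep
```

A current dashboard would call the same lookup with today's date, which matches using the set with max(updated_at).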

prudhvigodithi · 2024-10-22T16:02:43Z

Yes, we can index MAINTAINERS.md, which has some good advantages. Also, on my earlier point:

The additional slope logic defined above will only target the maintainers from the MAINTAINERS.md and add a flag inactive to true/false. This will be stored in a different index to just infer if a current maintainer is active or not and the dashboard will show the same. The dashboard can be used to display for the current point in time if the existing maintainer is active or not.

This index, which records whether each maintainer is active or not (the maintainer stats), can still be used to go back in time and filter the documents. The output is then the set of documents holding the maintainer stats for the set of maintainers at that point in time.

Thank you

bshien · 2024-10-22T16:23:36Z

I think you need to store maintainer status every time you do a sweep of MAINTAINERS.md with a date, for example username (e.g. dblock), repo (e.g. OpenSearch), emeritus=true/false, date=2024-10-22, updated_at=timestamp. When doing queries you want the maintainer status closest to the date when you're looking for status on, and for current dashboard to use the set with max(updated_at).

The lambda that populates the maintainer_engagement index (a better name might be inactive_maintainers) will scrape MAINTAINERS.md and index a new set of documents representing the current state of maintainers every time it runs (every day). This yields daily snapshots of maintainer statuses. The visualization can be made to show only the newest set of documents, so that it stays up to date with maintainer changes.

prudhvigodithi · 2024-10-22T16:26:09Z

@bshien (from dB's point) I assume we should also be able to build dashboards of maintainer growth, have data about maintainers on any given day, etc.?

bshien · 2024-10-22T16:29:47Z

Because the above design indexes snapshots, you can build dashboards of the number of maintainers over time (by counting the docs with a unique username each day), and also go back and see the set of documents representing maintainer statuses on any day in the past.
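For illustration, counting unique maintainers per daily snapshot could look like the sketch below. It uses github_login from the document structure in the issue as the username; snapshot_date is a hypothetical field, since the issue does not name the field that records which day a snapshot belongs to.

```python
from collections import defaultdict
from typing import Dict, List


def maintainers_per_day(docs: List[dict]) -> Dict[str, int]:
    """Count distinct github_login values per snapshot day."""
    seen = defaultdict(set)
    for doc in docs:
        # A maintainer appears once per event type and repo, so dedupe with a set.
        seen[doc["snapshot_date"]].add(doc["github_login"])
    return {day: len(logins) for day, logins in sorted(seen.items())}


docs = [
    {"snapshot_date": "2024-10-21", "github_login": "bshien", "event_type": "issues"},
    {"snapshot_date": "2024-10-21", "github_login": "bshien", "event_type": "pull_request"},
    {"snapshot_date": "2024-10-22", "github_login": "bshien", "event_type": "issues"},
    {"snapshot_date": "2024-10-22", "github_login": "dblock", "event_type": "issues"},
]
print(maintainers_per_day(docs))  # {'2024-10-21': 1, '2024-10-22': 2}
```

In the cluster itself, the same count could likely be done with a date histogram plus a cardinality aggregation rather than in Python.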

prudhvigodithi · 2024-10-22T16:38:12Z

Thanks @bshien, then we can start with this and infer more data points once the data is flowing and we are able to see the visualizations. If still required, we can change the logic to index the MAINTAINERS.md. @dblock, please let us know if you are OK with this.

bshien added the enhancement New feature or request label Sep 18, 2024

bshien self-assigned this Sep 18, 2024

github-actions bot added the untriaged Issues that have not yet been triaged label Sep 18, 2024

bshien removed the untriaged Issues that have not yet been triaged label Sep 18, 2024

This was referenced Sep 18, 2024

[FEATURE] Index GitHub Events to the Metrics cluster #76

Open

[META][FEATURE] Maintainer dashboard #57

Open
