[FEATURE] Ingest Maintainer last engaged date into Metrics cluster #75

Open
Tracked by #57
bshien opened this issue Sep 18, 2024 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@bshien
Collaborator

bshien commented Sep 18, 2024

Is your feature request related to a problem?

Coming from #57

As a prerequisite for #73 and opensearch-project/automation-app#8, the Metrics cluster needs data about each maintainer's repo, name, affiliation, the date they were last engaged, and their inactivity status.

What solution would you like?

An index created in the Metrics OpenSearch cluster called maintainer_engagement, which will have documents with this structure:

{
    "id": "8baa664c-dec0-4201-b4b9-9747c2e7ee45",
    "repository": "opensearch-metrics",
    "name": "Brandon Shien",
    "github_login": "bshien",
    "affiliation": "Amazon",
    "event_type": "issues",
    "event_action": "opened",
    "time_last_engaged": "2024-08-27T00:31:56Z",
    "inactive": false
}

To create these documents, there should be a lambda that uses the github-activity-events index (from #76) to collect or calculate the required fields for each document and index them into the maintainer_engagement index.

This lambda should:

  1. Scrape the MAINTAINERS.md for each repository in the OpenSearch project. This will yield the repo, name, github_login, and affiliation fields.
  2. Make a top hit query for the latest document in the github-activity-events index for each repo, maintainer, and event type.
  3. Use the created_at field for each GitHub Event document to get the time_last_engaged
  4. For each event type, calculate whether the maintainer should be considered active or inactive based on time_last_engaged and how active the repo is (see below).
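Step 2 could be sketched as the search body below: a terms aggregation over event type with a top_hits sub-aggregation that keeps only the newest document. This is a minimal sketch, not the actual lambda code. The field names `repository`, `sender`, `type`, and `action` are assumptions about the github-activity-events mapping; only `created_at` is named in this issue.

```python
import json


def latest_event_query(repository: str, github_login: str) -> dict:
    """Build a search body that returns the newest event per event type.

    Field names other than created_at are assumed, not taken from #76.
    """
    return {
        "size": 0,  # only the aggregation results are needed
        "query": {
            "bool": {
                "filter": [
                    {"term": {"repository.keyword": repository}},
                    {"term": {"sender.keyword": github_login}},
                ]
            }
        },
        "aggs": {
            "by_event_type": {
                "terms": {"field": "type.keyword", "size": 100},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,  # keep only the most recent document
                            "sort": [{"created_at": {"order": "desc"}}],
                            "_source": ["type", "action", "created_at"],
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(latest_event_query("opensearch-metrics", "bshien"), indent=2))
```

The `created_at` value of each bucket's single hit would then become time_last_engaged for that event type (step 3).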

To address the problem of waiting longer to flag maintainers of less active repos:

For the inactivity calculation, we can use a linear equation, y = m*x + b, where:
x = the total number of events in a repo
y = the amount of time a maintainer is inactive before we flag them as inactive

And we can calculate the slope (m) and the y-intercept (b) from two points:
(# of events in the repo with the fewest events, upper bound of time to wait (365 days))
(# of events in the repo with the most events, lower bound of time to wait (90 days))
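
Written out, the two points fix the slope and intercept as follows. The symbols x_min and x_max (the event counts of the least and most active repos) are introduced here for illustration only:

```latex
m = \frac{90 - 365}{x_{\max} - x_{\min}}, \qquad
b = 365 - m \, x_{\min}, \qquad
y = m x + b
```

Since x_max > x_min, the slope m is negative, so the wait time y shrinks as the repo's event count x grows.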

[Figure: reindexmaint7 drawio]

This gives us an equation for how long to wait for each repo: we wait longer on repos that are less active and shorter on repos that are more active.

Example: Imagine there is a repo with 600 events. You can use the two starting points to calculate the slope and y-intercept of the linear equation. You can then use the linear equation to calculate how long to wait until maintainers should be flagged.
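
A minimal Python sketch of this calculation follows. The function name `wait_days` and the event counts for the least and most active repos (100 and 1100) are made up for illustration. Only the 365-day and 90-day bounds and the 600-event repo come from this issue.

```python
def wait_days(repo_events: int, min_events: int, max_events: int,
              upper_days: float = 365.0, lower_days: float = 90.0) -> float:
    """Days of inactivity to tolerate before flagging, using y = m*x + b."""
    if max_events == min_events:
        # Degenerate case (assumption): the line is undefined, so fall back
        # to the most lenient bound.
        return upper_days
    m = (lower_days - upper_days) / (max_events - min_events)  # slope, negative
    b = upper_days - m * min_events                            # y-intercept
    return m * repo_events + b


if __name__ == "__main__":
    # Hypothetical counts: least active repo has 100 events, most active 1100.
    print(wait_days(600, min_events=100, max_events=1100))  # about 227.5 days
```

With these hypothetical counts, the least active repo gets the full 365 days and the most active repo gets 90 days, as intended.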

[Figure: reindexmaint2 drawio]

Now that we know how long to wait, we then calculate inactivity for each event. Let's say that a maintainer has been the actor for the issues, pull_request, and label events within the last 201.01 days.

(The dots in the graph represent when the maintainer last triggered each event)

[Figure: reindexmaint3 drawio]

We would consider this maintainer active and we wouldn't flag them as inactive. Now let's say some time has passed and they have not triggered any new events.

[Figure: reindexmaint4 drawio]

Although some events have passed the threshold, at least one event is still within it, so we still consider the maintainer active and do not flag them.

Now let's say even more time passes without activity:

[Figure: reindexmaint5 drawio]

Now that all events are past the threshold, we flag the maintainer as inactive.

Let's say the maintainer raises an issue in the repo:

[Figure: reindexmaint6 drawio]

Because an event is within the threshold again, we once more consider the maintainer active.
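
The walkthrough above boils down to one rule: a maintainer is inactive only when every event type is past the threshold. A minimal sketch, with hypothetical function names and made-up timestamps, could look like this. Treating a maintainer with no recorded events as inactive is an assumption of the sketch, not something this issue specifies.

```python
from datetime import datetime, timedelta, timezone


def event_inactive(last_engaged: datetime, now: datetime,
                   threshold_days: float) -> bool:
    """True when the last occurrence of one event type is past the threshold."""
    return now - last_engaged > timedelta(days=threshold_days)


def maintainer_inactive(last_engaged_by_event: dict, now: datetime,
                        threshold_days: float) -> bool:
    """True only when every event type is past the threshold.

    all() over an empty dict is True, so a maintainer with no recorded
    events counts as inactive here (an assumption).
    """
    return all(
        event_inactive(ts, now, threshold_days)
        for ts in last_engaged_by_event.values()
    )


if __name__ == "__main__":
    last_seen = {
        "issues": datetime(2024, 8, 27, 0, 31, 56, tzinfo=timezone.utc),
        "pull_request": datetime(2024, 6, 1, tzinfo=timezone.utc),
    }
    now = datetime(2025, 3, 1, tzinfo=timezone.utc)
    # pull_request is past the threshold, issues is not: still active.
    print(maintainer_inactive(last_seen, now, threshold_days=201.01))
```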

  5. Aggregate all event types to a single document which will definitively say whether a maintainer is inactive.

  6. For each event type and the aggregate event, index these documents to the maintainer_engagement index.
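
As a rough sketch of these two steps, the helper below builds one document per event type plus one aggregate document, following the document structure shown earlier in this issue. The function name `build_documents` and the `"All"` marker for the aggregate document's event_type are hypothetical. The actual indexing call is left out.

```python
import uuid


def build_documents(maintainer: dict, events: list[dict]) -> list[dict]:
    """Return one document per event type plus one aggregate document.

    maintainer holds repository, name, github_login, and affiliation.
    Each event holds event_type, event_action, time_last_engaged, inactive.
    """
    docs = [{"id": str(uuid.uuid4()), **maintainer, **event} for event in events]
    if events:
        # ISO 8601 UTC timestamps in one format sort correctly as strings.
        latest = max(events, key=lambda e: e["time_last_engaged"])
        docs.append({
            "id": str(uuid.uuid4()),
            **maintainer,
            "event_type": "All",  # hypothetical marker for the aggregate
            "event_action": latest["event_action"],
            "time_last_engaged": latest["time_last_engaged"],
            # Inactive overall only if inactive for every event type.
            "inactive": all(e["inactive"] for e in events),
        })
    return docs
```

Each returned dict could then be indexed into maintainer_engagement as-is.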

Note:

Because the source of the raw event data is the data lake, which we have only just started populating in real time, the above design would yield actionable data only after a period of months (however long we decide the HIGHER_BOUND of time to wait before flagging to be).

Do you have any additional context?

#57

@bshien bshien added the enhancement New feature or request label Sep 18, 2024
@bshien bshien self-assigned this Sep 18, 2024
@github-actions github-actions bot added the untriaged Issues that have not yet been triaged label Sep 18, 2024
@bshien bshien removed the untriaged Issues that have not yet been triaged label Sep 18, 2024
@prudhvigodithi
Collaborator

Thanks @bshien. I would even go with splitting the documents at the event level, by adding event_name (coming from https://docs.github.com/en/rest/using-the-rest-api/github-event-types?apiVersion=2022-11-28) and setting inactive to true or false for each specific event.

The raw event data collected in #76 already has the event name.

By segregating the documents by user (maintainer), repository, and event name, we can obtain more granular metrics for maintainers, allowing us to infer whether they are active or inactive.

@bshien
Collaborator Author

bshien commented Oct 21, 2024

Thanks, I've added this to the issue.

@dblock
Member

dblock commented Oct 21, 2024

> Scrape the MAINTAINERS.md for each repository in the OpenSearch project. This will yield the repo, name, github_login, and affiliation fields.

You will want to store information about maintainers in the metrics store, so this should be the result of a cron process that runs regularly. Note that maintainers can be added and removed at any time, and sometimes they can be re-added and re-removed. When viewing dashboards that pertain to maintainers we'll want to see the state at the time of the graphs being shown, not the latest state.
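
The scrape step itself could be sketched as below. This assumes MAINTAINERS.md contains markdown tables with rows of the form `| Name | [login](profile URL) | Affiliation |`, and that a heading containing "Emeritus" starts the emeritus section. Both are assumptions about the file layout, and `parse_maintainers` is a hypothetical helper name.

```python
import re

# One table row: | Name | [login](url) | Affiliation |
ROW = re.compile(
    r"^\|\s*(?P<name>[^|]+?)\s*"
    r"\|\s*\[(?P<login>[^\]]+)\]\([^)]*\)\s*"
    r"\|\s*(?P<affiliation>[^|]*?)\s*\|\s*$"
)


def parse_maintainers(markdown: str, repository: str) -> list[dict]:
    """Extract maintainer rows from the text of a MAINTAINERS.md file."""
    rows = []
    emeritus = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Track whether we are inside an emeritus section (assumed layout).
            emeritus = "emeritus" in stripped.lower()
            continue
        match = ROW.match(stripped)
        if match:  # header and separator rows have no [login](url) cell
            rows.append({
                "repository": repository,
                "name": match.group("name"),
                "github_login": match.group("login"),
                "affiliation": match.group("affiliation"),
                "emeritus": emeritus,
            })
    return rows
```

Running this on a schedule and storing each result with the sweep date would capture additions and removals over time.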

> Make a top hit query for the latest document in the github-activity-events index for each repo, maintainer, and event type.
> Use the created_at field for each GitHub Event document to get the time_last_engaged

IMO maintainer is just a subset of any user. I suggest representing all users in the metrics store, then adding a relationship to them that says maintainer from date X to date Y (or still a maintainer) for a given repo.

> For each event type, calculate if the Maintainer should be considered active or inactive based on time_last_engaged and how active the repo is (see below).

Note that you'll want to see active or inactive at the time being displayed.

@prudhvigodithi
Collaborator

Hey dB, the dashboard will have the current repo maintainer stats (filtered to one repo when the repo filter is applied, otherwise showing all maintainers).

Coming from #76, the metrics cluster will already have the raw information (sourced from the S3 data lake) for all users and for all the targeted maintainer events.

The additional slope logic defined above will only target the maintainers from the MAINTAINERS.md and add a flag inactive to true/false. This will be stored in a different index to just infer if a current maintainer is active or not and the dashboard will show the same. The dashboard can be used to display for the current point in time if the existing maintainer is active or not.

Later we can have another dashboard for user engagement (with the data already in the metrics cluster from #76), which can target all users and can be used to nominate users as maintainers.

@dblock
Member

dblock commented Oct 22, 2024

> The additional slope logic defined above will only target the maintainers from the MAINTAINERS.md and add a flag inactive to true/false. This will be stored in a different index to just infer if a current maintainer is active or not and the dashboard will show the same. The dashboard can be used to display for the current point in time if the existing maintainer is active or not.

This is problematic because the maintainer state is "now" and is incorrect when you travel in time.

I think you need to store maintainer status every time you do a sweep of MAINTAINERS.md with a date, for example username (e.g. dblock), repo (e.g. OpenSearch), emeritus=true/false, date=2024-10-22, updated_at=timestamp. When doing queries you want the maintainer status closest to the date when you're looking for status on, and for current dashboard to use the set with max(updated_at).
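
A minimal in-memory sketch of those two lookups, using the example fields above (username, repo, emeritus, date, updated_at). The function names are hypothetical, and "closest to the date" is interpreted here as the latest snapshot on or before that date, which is one possible reading.

```python
from datetime import date, datetime
from typing import Optional


def status_as_of(snapshots: list[dict], username: str, repo: str,
                 as_of: date) -> Optional[dict]:
    """Latest snapshot for (username, repo) taken on or before as_of."""
    candidates = [
        s for s in snapshots
        if s["username"] == username and s["repo"] == repo and s["date"] <= as_of
    ]
    return max(candidates, key=lambda s: s["date"], default=None)


def current_set(snapshots: list[dict]) -> list[dict]:
    """All snapshots from the most recent sweep, i.e. max(updated_at)."""
    if not snapshots:
        return []
    newest = max(s["updated_at"] for s in snapshots)
    return [s for s in snapshots if s["updated_at"] == newest]


if __name__ == "__main__":
    sweeps = [
        {"username": "dblock", "repo": "OpenSearch", "emeritus": False,
         "date": date(2024, 10, 21), "updated_at": datetime(2024, 10, 21, 6, 0)},
        {"username": "dblock", "repo": "OpenSearch", "emeritus": False,
         "date": date(2024, 10, 22), "updated_at": datetime(2024, 10, 22, 6, 0)},
    ]
    print(status_as_of(sweeps, "dblock", "OpenSearch", date(2024, 10, 21)))
```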

This way you can build dashboards of maintainer growth, have data about maintainers at any given day, etc.

@prudhvigodithi
Collaborator

Yes, we can index MAINTAINERS.md, which has some good advantages. Also, regarding my point:

> The additional slope logic defined above will only target the maintainers from the MAINTAINERS.md and add a flag inactive to true/false. This will be stored in a different index to just infer if a current maintainer is active or not and the dashboard will show the same. The dashboard can be used to display for the current point in time if the existing maintainer is active or not.

This index, which records whether each maintainer is active or not (the maintainer stats), can still be used to go back in time by filtering the documents. The output is then the set of documents holding the maintainer stats for the set of maintainers at that point in time.

Thank you

@bshien
Collaborator Author

bshien commented Oct 22, 2024

> I think you need to store maintainer status every time you do a sweep of MAINTAINERS.md with a date, for example username (e.g. dblock), repo (e.g. OpenSearch), emeritus=true/false, date=2024-10-22, updated_at=timestamp. When doing queries you want the maintainer status closest to the date when you're looking for status on, and for current dashboard to use the set with max(updated_at).

The lambda behind this index, maintainer_engagement (a better name might be inactive_maintainers), will scrape MAINTAINERS.md and index a new set of documents representing the current state of maintainers every time it runs (every day). This yields daily snapshots of maintainer statuses. The visualization can be made to show only the newest set of documents, so that it stays up to date with maintainer changes.

@prudhvigodithi
Collaborator

@bshien (from dB's point) I assume we should also be able to build dashboards of maintainer growth, have data about maintainers on any given day, etc.?

@bshien
Collaborator Author

bshien commented Oct 22, 2024

Because the above design indexes snapshots, you can build dashboards of the number of maintainers over time (by counting the docs with unique usernames each day), and also go back and see the set of documents representing maintainer statuses on any day in the past.
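
As an illustration of the counting idea, here is a small in-memory sketch. The `snapshot_date` field is an assumed name for whatever field records the day a snapshot was indexed, and `maintainers_per_day` is a hypothetical helper. A set is used because each maintainer can have several documents per day (one per event type).

```python
from collections import defaultdict


def maintainers_per_day(docs: list[dict]) -> dict:
    """Count unique github_login values per snapshot day."""
    logins_by_day = defaultdict(set)
    for doc in docs:
        logins_by_day[doc["snapshot_date"]].add(doc["github_login"])
    # Sort by day so the result reads as a time series.
    return {day: len(logins) for day, logins in sorted(logins_by_day.items())}
```

Note that this counts a person who maintains several repos once per day. Grouping by repository as well would give per-repo growth instead.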

@prudhvigodithi
Collaborator

Thanks @bshien, then we can start with this and infer more datapoints once the data is flowing and we are able to see the visualizations. If still required, we can change the logic to index MAINTAINERS.md. @dblock, please let us know if you are OK with this.

Projects
Status: 🏗 In progress
Development

No branches or pull requests

3 participants