Pipeline standards #17

Open. Wants to merge 1 commit into base: main

Conversation

RGilliard-Arch
Contributor

No description provided.

Collaborator

@janejuenyang janejuenyang left a comment

Thanks for putting this together, Reggie! I like what you've noted already and have made some requests to add some sections.

- Establish clear metrics for a successful pipeline.

## Choose the right tools and technologies
Depending on the data type, volume, and velocity, choose appropriate tools and technologies. For example:
Collaborator

It might be helpful to distinguish tools that are possible when building GFE-based pipelines vs. cloud-based pipelines.

- Data orchestration: Apache Airflow

## Scalability and flexibility
Where possible, design the pipeline to be easily scaled up or down and to adapt to changes in data types and data formats.
Collaborator

I agree, and there needs to be a balance between adaptability/scalability and how long it takes to deliver what's needed. I suggest adding a few example questions to guide people in this consideration. For instance:

  • How will implementing a certain scaling capability or data handling flexibility affect the project timeline and code complexity / maintainability?
  • What cost constraints are there?

In addition, are there any minimum guidelines on flexibility and scalability? For instance, is there anything about the use of regex, minimizing hard-coding, etc.?
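
As one illustration of what a "minimize hard-coding" guideline could look like in practice, here is a minimal Python sketch in which file-name patterns live in a config mapping and are matched with regex, rather than being written into the processing code. Everything named here (`SOURCES`, `match_source`, the file-name patterns) is hypothetical and not taken from the PR or any existing code base.

```python
import re

# Hypothetical source definitions. In a real pipeline this mapping could be
# loaded from a YAML or JSON config file instead of living in the module.
SOURCES = {
    "enrollment": {"pattern": r"^enrollment_(?P<year>\d{4})\.csv$"},
    "staffing": {"pattern": r"^staffing_(?P<year>\d{4})\.(csv|xlsx)$"},
}


def match_source(filename, sources=SOURCES):
    """Return (source_name, year) for the first matching pattern, else None."""
    for name, spec in sources.items():
        match = re.match(spec["pattern"], filename)
        if match:
            return name, int(match.group("year"))
    return None
```

With this layout, supporting a new file type or naming scheme means editing the config entry, not the matching logic.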

## Monitoring and optimizing
Continuously monitor the performance of the pipeline and seek opportunities to optimize data processing times, reduce costs, and improve data quality. Implement monitoring and logging to track the performance and health of the pipeline. Alerts should be set up for failures or significant performance degradations. Logs can include assessments of data quality and any major errors or inconsistencies caught during data quality checks.
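
A minimal sketch of the monitoring idea in the section above, using only the Python standard library: each step is timed, failures are reported through an alert callable, and a step that exceeds its time threshold is flagged as a performance degradation. The names `run_step`, `max_seconds`, and `alert` are assumptions for illustration; a real pipeline would likely route alerts to email, chat, or an orchestration tool's own alerting.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def run_step(name, func, max_seconds, alert=logger.error):
    """Run one pipeline step, logging its duration and alerting on problems.

    `alert` is any callable with a logging-style signature (msg, *args).
    """
    start = time.perf_counter()
    try:
        result = func()
    except Exception as exc:
        # Failure: raise an alert, then let the error propagate.
        alert("step %s failed: %s", name, exc)
        raise
    elapsed = time.perf_counter() - start
    if elapsed > max_seconds:
        # The step succeeded but ran slower than the agreed threshold.
        alert("step %s took %.2fs (threshold %.2fs)", name, elapsed, max_seconds)
    else:
        logger.info("step %s finished in %.2fs", name, elapsed)
    return result
```

A call might look like `run_step("load", load_data, max_seconds=60)`, where `load_data` stands in for whatever function performs the step.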

## Ensure security and compliance
Collaborator

Let's include a link to the HHS approved software list (noting that the link is only accessible within HHS).

## Scalability and flexibility
Where possible, design the pipeline to be easily scaled up or down and to adapt to changes in data types and data formats.

## Implement data quality checks
Collaborator

In this (and all of the subsequent sections), can you add a sub-section for Examples and start by linking to relevant parts of the PIR code base? For future projects, we can similarly add links -- though some will be to private repos, which is okay.
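
For a sense of scale, a generic example in such a sub-section could be as small as the sketch below. It is not taken from the PIR code base: the function `check_rows` and its parameters are hypothetical, and it assumes rows are available as plain dictionaries. It reports missing required values and duplicate keys as a list of messages that could then be written to the pipeline log.

```python
def check_rows(rows, required, unique_key):
    """Return a list of human-readable data quality issues found in `rows`."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # Required columns must be present and non-empty.
        for col in required:
            if row.get(col) in (None, ""):
                issues.append(f"row {i}: missing {col}")
        # The key column must not repeat across rows.
        key = row.get(unique_key)
        if key in seen:
            issues.append(f"row {i}: duplicate {unique_key}={key!r}")
        seen.add(key)
    return issues
```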

@@ -0,0 +1,42 @@
# Pipeline Best Practices
Collaborator

Please also add to the README

- How will success be measured?
- Establish clear metrics for a successful pipeline.

## Choose the right tools and technologies
Collaborator

Please add a section on building iteratively, with regular demos and syncs with the client. You can adapt it from the lessons learned doc.

Contributor Author

RGilliard-Arch commented Mar 29, 2024

Thank you @janejuenyang! @skalaga-arch is spearheading the work here, so I'll let him take the lead on these changes, but I'll review and contribute--especially the items from the lessons learned.

4 participants