
Project Infrastructure


Here are my infrastructure recommendations for supporting our goal of writing distributable, maintainable software that other clients can consume and enjoy. These recommendations facilitate this goal from end to end: from dependency management to deployment. I'm making them based on personal experience, research, and discussions with others, but my conclusions are by no means authoritative. Please edit/discuss if you have other ideas :)


Package management

Unless there are strong objections, I think we should use Nix to handle package management. Nix is a purely functional package manager that enables atomic, contained, deterministic dependency management on macOS and Linux machines. This is advantageous over solutions like pip, whose environments are very machine-dependent.

Nix supports builds with non-Python dependencies, lets multiple versions of libraries coexist conflict-free, and does all of this without overwriting anything on the user's system.
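
For example, a single shell.nix at the repository root could pin our Python version along with any native dependencies, and running nix-shell would then drop every developer into an identical environment. (shell.nix/nix-shell are the standard Nix conventions; the exact contents would be ours to define.)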

The drawback is that Nix's syntax is verbose and sometimes esoteric, with less-than-stellar documentation (in my opinion). If this turns out to be a problem, we can switch to something simpler (e.g. conda).

Version Control

Git Hooks

I recommend we use a client-side pre-push hook for linting and testing. This will help us catch small mistakes before they can interfere with the build process.
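
As a sketch of what this could look like (the hook can be any executable, so I've written it in Python; the flake8 and pytest commands assume the linting and testing tools recommended below):

#!/usr/bin/env python3
"""Sketch of a pre-push hook: lint and test before allowing a push.

Install by copying to .git/hooks/pre-push and marking it executable.
"""
import subprocess
import sys

# Run each check in order; a non-zero exit code aborts the push.
for check in (["flake8", "."], ["pytest"]):
    if subprocess.run(check).returncode != 0:
        sys.exit("pre-push: {} failed; aborting push.".format(" ".join(check)))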

Code hosting

I think it's fine to continue hosting in MISL's GitHub repository. GitLab is also free for University of Washington staff and has some advantages in terms of issue tracking, but I think cohesion with the rest of the MISL ecosystem is more important.

Code

Project structure

I recommend using the Hitchhiker's Guide to Python: Structuring Your Project format, originally described by Kenneth Reitz, author of the Requests library (one of the most elegant and pythonic libraries I've ever used, to the extent that I read its source code for fun 😅).

For example:

README.md
LICENSE.txt
setup.py
setup.cfg
nanoporeters/__init__.py
nanoporeters/core.py
nanoporeters/helpers.py
nanoporeters/.../
docs/conf.py
docs/index.rst
tests/test_basic.py
tests/test_advanced.py
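
For reference, a minimal setup.py for this layout might look like the following sketch (the package name comes from the example above; the version and other metadata are placeholders, not decisions):

from setuptools import setup, find_packages

setup(
    name="nanoporeters",  # package directory from the layout above
    version="0.1.0",  # placeholder
    packages=find_packages(exclude=["tests", "docs"]),
    python_requires=">=3.5",  # matches the PEP 484 / pydocstyle constraint below
)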

Style guide and Formatting

Consistent coding styles make code easier to maintain and easier for newcomers to contribute to.

I recommend using the Black library for style enforcement and code formatting. It's a very opinionated code formatter that enforces a subset of the PEP 8 guidelines. I think using Black will be easier for us than checking against and maintaining a separate MISL style guide.

  • Bonus: Black is recommended by Kenneth Reitz, the author of the Requests library.
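
To illustrate on a contrived snippet, Black takes inconsistent spacing like this:

def pcr( dna,enzyme,cycles = 10 ):
    return [ dna.copy() for _ in range(2*cycles) ]

and rewrites it as:

def pcr(dna, enzyme, cycles=10):
    return [dna.copy() for _ in range(2 * cycles)]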

Documentation

Convention

I recommend abiding by Python's docstring conventions, PEP 257, and specifically following the numpy documentation convention. I prefer the numpy convention over the others because it's easier to read (in my opinion), and it can also be converted to the Google or Sphinx styles.

Here's an example of how the numpy convention looks:

from typing import List


def polymerase_chain_reaction(dna: DNA, enzyme: Enzyme, cycles: int = 10) -> List[DNA]:
    """Perform a polymerase chain reaction on some DNA, copying it a bunch of times.
    
    Parameters
    ----------
    dna : DNA
        The DNA to be sequenced and copied.
    enzyme : Enzyme
        The specific polymerase to use in the PCR reaction.
    cycles : int, optional
        The number of times to run the PCR, which will result in 2**cycles copies of DNA, by default 10
    
    Returns
    -------
    List[DNA]
        A list of DNA copies synthesized from the original strand.
    """
    pass

Tooling

The Pydocstyle tool is a static analyzer that automatically checks our docstrings for compliance with this standard.

Choosing Pydocstyle binds us to Python >= 3.5, but this should be acceptable based on my current understanding of the project.
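
In practice this is a one-liner we can run locally or in the pipeline, e.g. pydocstyle --convention=numpy nanoporeters/ (the package path comes from the structure example above).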

If you're using VSCode, you can configure it to generate docstring stubs automatically by installing autoDocstring and configuring it for numpy :D

Linting

Having consistent style rules often catches mysterious bugs early (and also makes developer onboarding easier!), so I suggest using Flake8. I (@jdunstan) will do my best to obey current conventions as they exist.

If you're using VSCode, you can configure it to run linting automatically :D
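
Like pydocstyle, Flake8 is easy to run by hand or in automation, e.g. flake8 nanoporeters/ tests/ (paths from the structure example above); the pre-push hook sketched earlier runs it over the whole repository.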

Typing

In my experience, static typing leads to fewer bugs in production, and this is especially important for dynamically typed languages. Static typing also makes code easier to reason about and discuss, which is critical for collaborative projects.

Type hints were introduced to Python in PEP 484, and they let us write code like this:

def incrementer(i: int) -> int:
    return i + 1

Instead of this:

def incrementer(i):
    return i + 1

I recommend PyType because of its type-inference capabilities and leniency. MyPy, in my opinion, tends to be a little too strict, which restricts some of the language's expressiveness. There's a nice Lightning Talk on MyPy and PyType from PyCon 2019.

MyPy vs. PyType

PyType only throws errors for problems that would actually occur at run time, which is quite convenient. I'm usually suspicious of Google dev tools, as Google is notorious for dropping support for projects, but I think PyType has widespread enough adoption that this risk is minimal.

Choosing PEP 484 and PyType forces us to build for Python >= 3.5, but this should be acceptable based on my current understanding of the project.

I don't know how well this plays with numpy/scipy out of the box, but there is a data-science-types type-hints package to help, which covers a subset of matplotlib, pandas, and numpy.
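
Running it is similarly simple, e.g. pytype nanoporeters/ (package path from the structure example above), and we could fold it into the same pre-push hook and CI pipeline as the other checks.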

Testing

Unit tests are a critical part of any sustainable software project.

Integration tests are important as well. I need to understand the NanoporeTERs project more deeply before I make specific recommendations around integration testing.

For unit testing, I recommend using Pytest.

Unit tests should be run upon pushing to the repository (see Git Hooks). Pipeline builds should only succeed if unit tests pass (see Continuous Integration and Pipelines).

I've heard good things about the Hypothesis test framework (it auto-generates edge cases), but it might be overkill right now.

I recommend we start with Pytest's out-of-the-box functionality, then bring in heavier test frameworks as needed.
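
As a sketch, a first test file could look like this (it reuses the hypothetical polymerase_chain_reaction example from the Documentation section and assumes it lives in nanoporeters/core.py per the layout above):

# tests/test_basic.py
from nanoporeters.core import polymerase_chain_reaction


def test_pcr_copy_count():
    # The docstring above promises 2**cycles copies of the input DNA.
    copies = polymerase_chain_reaction(dna="GATTACA", enzyme="Taq", cycles=3)
    assert len(copies) == 2 ** 3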

Issue Tracking

It'll be useful to know what each developer is working on, to avoid toe-stepping, merge conflicts, and regressions. This is where issue tracking comes in handy. If we take a few moments to create a Feature/Bug description of what we're working on, it'll likely reduce such collisions.

In my experience, JIRA seems to be the industry standard for issue tracking and project management. However, with so few developers and a straightforward release path, I think it's likely overkill for our project.

GitHub issues should suffice:

  • Features: Issues that represent general feature work.
  • Bugs: Issues that need to be fixed in an upcoming release.
  • Security Vulnerabilities: Issues that need to be fixed for security reasons. These often result in immediate patch releases.

Deployment

In agreement with Katie's GE slides, I think Docker would be a wise choice for distribution: it's widely used, so it's easier for our users to integrate into their existing workflows, and it's easy to set up. By using Nix, we can pipe the nix-build output straight into a Docker image.
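
Conveniently, nixpkgs ships a dockerTools.buildImage helper for exactly this purpose, so the image could come out of the same nix-build invocation that builds the package (the exact expression is TBD).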

Continuous Integration and Pipelines

Continuous integration is important for incremental development and modular releases.

I recommend hooking our repository up to Travis CI to handle continuous integration and pipelining. The pipeline will kick off jobs that check formatting, linting, and unit tests before building the final distributable package and uploading it for our consumers.

  • This means we should monitor the pipeline after pushing code.

  • Pipeline build should only succeed if:

    • All unit tests pass
    • All formatting checks pass
    • All linter checks pass
    • Code coverage remains above CODE_COVERAGE_THRESHOLD (I think we should start at 20% and raise it if comfortable)
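
Concretely, a minimal .travis.yml would declare language: python and run the commands above (black --check ., flake8, pytest) as its script steps, and the coverage gate can be enforced with the pytest-cov plugin, e.g. pytest --cov=nanoporeters --cov-fail-under=20 (package path from the structure example; 20 matching the proposed starting threshold).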