Keeping yourself organized, efficient, and sane before, during, and after data analysis

This is a draft outline; not everything is here yet.

Why be so precise about how we organize our data and analysis scripts?

Why go to such extremes about how we organize data and scripts? It turns out that most people (including me), left to their own devices, end up with a morass of analysis scripts and various copies of the data, each edited in different ways. Often all of these scripts, data copies, and figures end up in a single folder. Anyone who has gone back to their own data and analysis after six months knows that it can be incredibly difficult, or even impossible, to reconstruct their own work. So how on earth would anyone else (including your PI) be able to do so?

So instead of asking yourself "Will I be able to make sense of what I have done when I come back to the analysis tomorrow?", you should be asking "Will I understand this in a year? Will my labmates and PI understand this? How about other people who want to reproduce my analysis once it is published?"

So you want to:

  • Make it easy, clear and efficient for current use.
  • Make it easy, clear and efficient for future use.

Make your research reproducible from day one

Data backup

What you do not want

File formats

  • Why we use flat text files
  • Excel is OK for data entry, but not for long-term storage
  • Why we use .csv for "spreadsheets"
  • When to use relational databases
  • Why you should do data transformations in your scripts and not in the data file
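
As an illustration of the last point, here is a minimal Python sketch of doing a transformation in a script rather than by editing the data file. The function name `add_log_column`, the file paths, and the column names are hypothetical and chosen only for this example. The raw .csv is read but never modified, and the transformed copy is written to a separate file.

```python
import csv
import math


def add_log_column(raw_path, out_path, column, new_column):
    """Read raw_path, add log10 of `column` as `new_column`, write to out_path.

    The raw file is only opened for reading, so it is never modified.
    All names here are hypothetical and exist only for this example.
    """
    with open(raw_path, newline="") as src:
        reader = csv.DictReader(src)
        rows = list(reader)
        fieldnames = list(reader.fieldnames) + [new_column]

    # The transformation lives in the script, where it is documented and repeatable.
    for row in rows:
        row[new_column] = math.log10(float(row[column]))

    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A call such as `add_log_column("data/raw.csv", "outputs/with_log.csv", "mass_mg", "log_mass")` (paths assumed) can be rerun at any time, and anyone reading the script can see exactly how the derived column was produced.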

Folder structures

  • Why we have a recommended folder structure
  • The recommended folder structure
    /projectName
           /data
           /scripts
           /outputs
           /figures
           /misc
           /manuscript
  • Example
  • See the blog postings that explain why to use this structure. See here for some ideas on how to organize your folders (here as well, and another one). While this article is written for computational biology projects, its advice holds in general.
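
The recommended folder structure above can also be created from a script, so that every new project starts out the same way. This is a minimal sketch in Python: the function name `create_project` is hypothetical, and only the subfolder names come from the list above.

```python
from pathlib import Path

# Subfolder names taken from the recommended structure above.
SUBFOLDERS = ["data", "scripts", "outputs", "figures", "misc", "manuscript"]


def create_project(root, project_name):
    """Create root/project_name with the recommended subfolders.

    `create_project` is a hypothetical helper for this example.
    """
    project = Path(root) / project_name
    for name in SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (project / name).mkdir(parents=True, exist_ok=True)
    return project
```

For example, `create_project(".", "projectName")` would create the tree in the current working directory.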

Scripting

  • Philosophy for organizing scripts.
  • Avoid doing this.
  • Source scripts (for functions) and analysis scripts.
  • Syntax style guide for R
  • Syntax style guide for Python

Some brief notes on sanity checks during analysis

  • Philosophy: Assume there are mistakes in the data and analysis until you convince yourself otherwise.
  • Sanity checks on the computational process (Unit testing)
  • Sanity checks on data (labeling, units, extra zeroes...)
  • Sanity checks on the analysis
  • Some important readings.
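
A sanity check on the data (labeling, units, extra zeroes) might look like the following Python sketch. The function name `check_measurements`, the column names, and the plausible range are all assumptions made for this example. The idea is to list every suspicious row rather than stop at the first one.

```python
def check_measurements(rows, column, low, high, label_column, allowed_labels):
    """Return a list of problems found; an empty list means nothing was flagged.

    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    All names are hypothetical and exist only for this example.
    """
    problems = []
    for i, row in enumerate(rows, start=1):
        # Labeling check: catches typos and inconsistent capitalization.
        label = row.get(label_column)
        if label not in allowed_labels:
            problems.append(f"row {i}: unexpected label {label!r}")

        # Value check: catches non-numeric entries.
        try:
            value = float(row.get(column))
        except (TypeError, ValueError):
            problems.append(f"row {i}: {column} is not a number")
            continue

        # Range check: catches unit mix-ups and extra zeroes.
        if not low <= value <= high:
            problems.append(f"row {i}: {column}={value} outside [{low}, {high}]")
    return problems
```

Running a check like this at the top of an analysis script, and refusing to continue while the returned list is non-empty, fits the philosophy above of assuming there are mistakes until you convince yourself otherwise.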

Resources for programming and analysis in R

There are many, many resources that you can find all over the web. In addition, the lab folders (the shared Dropbox folder) contain PDFs of several useful books. Here are just a few additional links: Nice R Code.

Getting started using git and GitHub

  • What you don't want
  • Version control for scripts (and small data)

Reproducible research