Data Organization


Directory, Folder, and File Naming

One of the first things to decide is how you want to organize your data. Questions to ask include:

  • Are there file naming conventions for your discipline?
  • What directory structure and file naming conventions will you use?
  • How will you handle version control? Record every change to a file, no matter how small.
    • Consider version control software, if applicable
    • Discard obsolete versions, but never the raw copy

Keep the following best practices in mind:

  • Be consistent with how you name directories, folders, and files
    • Always include the same information
    • Retain the order of information
  • Be descriptive so others can understand what file names mean
  • Keep track of versions (and be consistent!)
  • Use standard application-specific file extensions, such as .mov, .tif, .wrl

It will be important to track changes in your data files, especially if more than one person is involved in the research.
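As a sketch, the conventions above (same information, same order, explicit versions) can be captured in a small helper. The project name, date format, and version scheme here are illustrative assumptions, not a standard:

```python
from datetime import date

def data_filename(project, description, version, ext, when=None):
    """Build a file name that always includes the same pieces,
    in the same order: project_description_date_version.ext."""
    when = when or date(2024, 1, 15)   # hypothetical example date
    stamp = when.strftime("%Y%m%d")    # date first-to-last, so names sort by date
    return f"{project}_{description}_{stamp}_v{version:02d}.{ext}"

print(data_filename("soils", "ph-readings", 3, "csv"))
# soils_ph-readings_20240115_v03.csv
```

Zero-padding the version number (v03, not v3) keeps files listing in order once a project passes ten versions.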

Free file renaming applications are available if you need to revise your naming system (endorsement not implied).
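If you prefer to script the renaming yourself, a dry-run approach is safer than renaming in place: list the planned renames first, review them, and only then touch the disk. This hypothetical helper builds just the plan; applying it would be a separate call to `Path.rename()`:

```python
from pathlib import Path

def plan_renames(folder, old_prefix, new_prefix):
    """List (current name, proposed name) pairs for every file in
    `folder` whose name starts with old_prefix, without renaming anything."""
    plan = []
    for path in sorted(Path(folder).glob(old_prefix + "*")):
        if path.is_file():
            plan.append((path.name, new_prefix + path.name[len(old_prefix):]))
    return plan
```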

Suggested Folder Structures 

It is generally recommended to keep data, code, and outputs separate to avoid confusion. Although the exact structure can vary depending on the project, here are two approaches to organizing files.

Basic Projects
One or two data processing scripts, simple pipeline with few steps

├── InputData          <- Folder containing data that will be processed
├── OutputData         <- Folder containing data that has been processed
├── Figures            <- Folder containing figures or tables summarizing the results
├── Code               <- Folder containing the scripts or programs to do the analysis
├── LICENSE            <- File explaining the terms under which data/code is being made available
├── README.txt         <- File documenting the analysis, and (ideally) the purpose of each file.
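
The basic layout above can be scaffolded with a few lines of Python; the folder and file names simply mirror the tree and can be adapted to your project:

```python
from pathlib import Path

FOLDERS = ["InputData", "OutputData", "Figures", "Code"]
FILES = ["LICENSE", "README.txt"]

def make_basic_project(root):
    """Create the basic project skeleton under `root`."""
    root = Path(root)
    for name in FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch()   # empty placeholder files to fill in
    return sorted(p.name for p in root.iterdir())
```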

Advanced Projects
Many kinds of input data, documentation, and code.

This structure can be generated and auto-populated with the Reproducible Science template for Cookiecutter.

├──         <- File: List of people that contributed to the project (Markdown format)
├── LICENSE            <- File: Plain text file explaining the usage terms/license of the data/code file (CC-By, MIT, GNU, etc.)
├──          <- File: Readme file (Markdown format)
├── bin                <- Folder: Your compiled model code can be stored here (not tracked by git)
├── config             <- Folder: Configuration files, e.g., for doxygen or for your model if needed
├── data               <- Folder: Data for this project
│   ├── external       <- Folder: Data from third party sources.
│   ├── interim        <- Folder: Intermediate data that has been transformed.
│   ├── processed      <- Folder: The final, canonical data sets for modeling.
│   └── raw            <- Folder: The original, immutable data dump.
├── docs               <- Folder: Documentation, e.g., doxygen or scientific papers (not tracked by git)
├── notebooks          <- Folder: Jupyter/IPython or R notebooks
├── reports            <- Folder: Manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures        <- Folder: Figures for the manuscript or reports
└── src                <- Folder: Source code for this project
    ├── data           <- Folder: scripts and programs to process data
    ├── external       <- Folder: Any external source code, e.g., other projects, or external libraries
    ├── models         <- Folder: Source code for your own model
    ├── tools          <- Folder: Any helper scripts go here
    └── visualization  <- Folder: Visualization scripts, e.g., matplotlib, ggplot2 related

File Formats 

The file format is a principal factor in whether others will be able to use your data in the future. Plan for software and hardware obsolescence, since technology continually changes: how will others use your data if the software used to produce it is no longer available? Consider migrating your files to a format with the characteristics listed below, while keeping a copy in the original format.

Formats most likely to be accessible in the future include:

  • Non-proprietary, not tied to a specific software product
  • Unencrypted
  • Uncompressed
  • Common, used by the research community
  • Standard representation, such as ASCII, Unicode
  • Open, documented standard

Examples of preferred formats:

  • PDF/A, not Word
  • Plain-text CSV, not Excel
  • MPEG-4, not QuickTime
  • XML, CSV, or RDF, not MS Access database
  • HDF, not MATLAB binary arrays/matrices
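For tabular data, writing plain-text CSV requires nothing beyond the standard library. A minimal sketch (the column names and values are made up for illustration):

```python
import csv

def save_as_csv(rows, path):
    """Write rows of tabular data as plain-text CSV: a non-proprietary,
    unencrypted, uncompressed, well-documented format."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

# Hypothetical readings; any list of rows works.
readings = [["site", "ph", "date"],
            ["A1", "6.8", "2024-01-15"],
            ["A2", "7.1", "2024-01-15"]]
```

A file written this way can be opened decades from now in any text editor, spreadsheet, or programming language.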

Project Organization & Management 

In addition to applying file and folder organization best practices, an overall project strategy should consider other aspects to ensure successful projects, publications, and hand-offs. A solid strategy also helps avoid errors due to mix-ups and enhances research reproducibility. The tools mentioned below are for informational purposes; endorsement is not implied.

Overall organization & collaboration
  • OSF
  • Confluence
  • Slack, Hipchat
  • GitHub, Google Drive, Box, Office 365, etc.

OSF can connect many of these services together and can itself be used to organize a lab.

Traditional project management
  • Asana
  • Trello
  • Basecamp, Freedcamp
Calendars, to-do lists, Kanban boards, Gantt charts, etc. Open-source offerings exist; however, many of the commonly used tools are paid or freemium.
Tracking bugs, issues, time
  • Jira
  • Redmine
Some, like Taiga, are geared toward Agile development but can be used for any kind of project.
Standard Operating Procedures (SOPs)
  • Standard word processors or spreadsheets
  • OSF

Establishing SOPs for projects or research groups is an important step toward maintaining organization and ensuring clean hand-offs, research reproducibility, and project archiving. Aspects include:

  • Establishing file naming conventions
  • Styles and practices for software development (e.g., all functions must be documented, files must be checked into version control, etc.)
  • Standardizing workflows for data collection and processing
  • Establishing and enforcing backups
  • Establishing roles (e.g., who will be responsible for what)

Ideally, SOPs form part of the implementation of a Data Management Plan. See an example of SOPs from a UA researcher on the OSF.

Experiment tracking, organization

Electronic Lab Notebooks (ELNs)

  • RSpace
  • LabArchives
  • OSF
ELNs can be a useful tool for managing projects and labs. There are many ELNs on the market, from open-source to cloud-hosted. The Harvard Medical School maintains a comprehensive list comparing more than 50 features across 27 ELNs. The OSF can also be used as an ELN; see this template from Johns Hopkins University.