Data Sharing & Archiving
Why Share Data and Code?
- Required by some journal publishers (such as Nature) and funding agencies (e.g., the National Science Foundation and the National Institutes of Health). Other funding agencies are expected to follow suit and require researchers to share data produced during their research projects (see the OMB Policy Memorandum on sharing data).
- Data can be reused to answer new questions and can open up new interpretations and discoveries.
- Sharing data may lead to sharing the research process, workflows, and tools, which enhances the potential for replicating research results.
- Makes your articles and papers more useful and citable by others, increasing their value!
How and Where to Share Your Data and Code
Posting data on a web page is useful for increasing visibility and for presentation, but it is not recommended as the main strategy for data sharing. Instead, deposit the data into a trusted repository and refer to, showcase, or cite the deposit in any web pages or other media.
- Recommended: Deposit into a recognized data repository, such as the University of Arizona's Research Data Repository (ReDATA) or another appropriate disciplinary data repository.
- Submit data/code, along with your article as supplementary material, if the journal allows for it.
Posting code on GitHub is an accepted way of sharing code. However, to enhance code citability and to ensure that the exact version of the code is preserved alongside the data it is associated with, depositing code into a data repository is recommended. Many data repositories, including ReDATA, support GitHub integration.
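Depositing an exact version of your code alongside your data can be as simple as zipping a snapshot of the project and uploading it with the dataset. A minimal sketch using only Python's standard library is below; the function name, paths, and version tag are illustrative, not ReDATA requirements.

```python
# Hypothetical helper: bundle a local code checkout into a versioned zip
# archive suitable for uploading to a data repository alongside the data.
import shutil
from pathlib import Path

def bundle_code(src_dir: str, version: str, out_dir: str = ".") -> str:
    """Create <out_dir>/<dirname>-<version>.zip from src_dir; return its path."""
    src = Path(src_dir)
    archive_base = Path(out_dir) / f"{src.name}-{version}"
    # make_archive zips the *contents* of root_dir and returns the final path
    return shutil.make_archive(str(archive_base), "zip", root_dir=src)
```

Using a tag or release number as `version` keeps the archived snapshot traceable back to the corresponding point in the GitHub history.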
When sharing your data and code
- Bundle your data together in a systematic way by following best practices for organizing files and structuring code so others can easily understand and use the data and/or code. See Data Organization and Software/Code Best Practices.
- Include enough information so that others can understand and reuse the dataset. See Data Documentation, Readme, & Metadata
- Follow best practices to include enough information in readme files or elsewhere to make it possible to cite the dataset. See Citing Data & Code.
- Follow best practices to protect the confidentiality of any human participants.
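The bundling and documentation practices above can be scaffolded up front so nothing is forgotten at deposit time. The sketch below is one possible layout, not a prescribed ReDATA structure; the folder names and README sections are assumptions for illustration.

```python
# Hypothetical scaffold for a deposit bundle: data/code/docs folders plus
# a README stub to be filled in with documentation and citation details.
from pathlib import Path

README_TEMPLATE = """\
# {title}

## Contents
- data/  raw and processed data files
- code/  analysis scripts
- docs/  codebook, methods notes

## Reuse and citation
Describe licensing and how to cite this dataset here.
"""

def scaffold_bundle(root: str, title: str) -> Path:
    """Create data/code/docs folders plus a README stub under root."""
    base = Path(root)
    for sub in ("data", "code", "docs"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    readme = base / "README.md"
    readme.write_text(README_TEMPLATE.format(title=title))
    return readme
```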
Archiving is an activity that ensures that data are properly selected, stored, and accessible, and that their logical and physical integrity are maintained over time, including security and authenticity.
Data (and code) archiving involves ensuring the data are clean, documented, organized, and as self-contained as possible. Data can be archived in various ways:
- On local hard disk drives
- On long-term tape drives
- In the cloud
- In dedicated data repositories
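Whichever of these you choose, maintaining integrity over time is easier if you record checksums when the archive is created. The sketch below is one simple way to do this with Python's standard library, assuming a bundle stored as an ordinary directory tree.

```python
# Minimal integrity sketch: record a SHA-256 digest for every file in an
# archive bundle, so a later copy can be verified against the manifest.
import hashlib
from pathlib import Path

def build_manifest(bundle_dir: str) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 hex digest."""
    base = Path(bundle_dir)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*")) if p.is_file()
    }

def verify(bundle_dir: str, manifest: dict[str, str]) -> list[str]:
    """Return relative paths whose contents no longer match the manifest."""
    current = build_manifest(bundle_dir)
    return [path for path, digest in manifest.items()
            if current.get(path) != digest]
```

Storing the manifest next to the bundle (and with any offline copy) lets you confirm years later that nothing was corrupted or silently modified.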
A rule of thumb when archiving a project: document and organize the materials such that, if you were to hand the archive bundle to a colleague, they could understand it without asking you any questions.
Placing data in a dedicated repository is the preferred method of archiving publicly releasable datasets and code (e.g., data associated with published articles) since it allows for data reuse and citation and enables research reproducibility. See the Data Repositories page for how to get started finding one.
Although the risk of data loss is low in dedicated data repositories (namely, those that make explicit commitments regarding data storage and how long data will remain available), it is always advisable, where possible, to retain an offline copy of the data. This is especially true for general-purpose cloud storage, which, unlike a dedicated data repository, allows for unintended modification and deletion of files.
Various guidelines govern data retention requirements, depending on the kind of data and its use (e.g., financial information, data used for patent applications, etc.). Unless otherwise specified, the following guidance applies to research data in general:
Each investigator should treat data properly to ensure authenticity, reproducibility, and validity, and to meet the requirements of relevant grants and other agreements concerning the retention of data. If there is no requirement in the award document, primary data should be retained for ten (10) years to ensure that any questions raised by published results can be answered.
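The ten-year guidance translates into a simple retain-until date. The snippet below is purely illustrative (actual retention periods depend on the award document and applicable policy):

```python
# Illustrative only: compute a retain-until date for the ten-year default.
from datetime import date

def retain_until(publication: date, years: int = 10) -> date:
    """Date through which primary data should be kept (simple year offset)."""
    try:
        return publication.replace(year=publication.year + years)
    except ValueError:  # Feb 29 with a non-leap target year
        return publication.replace(year=publication.year + years,
                                   month=2, day=28)
```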
See the UA Research, Innovation, & Impact page for more information on research data retention. See Records & Archives for help with official UA records retention and destruction policies and procedures.
Refer to the options listed in Storage, Backups, and Security. For larger datasets, there are three recommended options: OneDrive, Tier 2 storage, and ReDATA. Depending on the reason for archiving, one may be a better fit than another.
- OneDrive is a good general-purpose option for amounts under 1 TB. It does not meet data sharing requirements from funders and journals.
- For private storage of larger amounts of data, Tier 2 storage is a good solution. It does not meet data sharing requirements from funders and journals.
- For data publication and sharing (e.g., for dissertations or journal articles), ReDATA is best.
|  | ReDATA | Tier 2 storage |
| --- | --- | --- |
| Purpose | Data publication to meet funder/journal data sharing requirements | Large storage space for data not undergoing analysis |
| Typical uses | Public archiving, getting a DOI, preserving curated and final data for journal articles, dissertations, etc. | Private archiving, project backups, transferring data to HPC for analysis |
| Who can use it | All individuals with a valid NetID and library privileges | Only faculty may request an account |
| Cost | Free | Free up to 1 TB, standard S3 pricing thereafter |
| Stewardship | Libraries assume full responsibility for ensuring long-term data availability according to the FAIR principles | Data is automatically moved to Amazon Glacier after a time (for free), but users must otherwise manage their own data |
| Storage quotas & egress | Up to 1 TB by request (additional storage available for special projects) | None (usage beyond 1 TB in standard S3 storage will be billed) |
| Data curation services & structured metadata | Data is curated by ReDATA staff to prepare it for sharing; documentation and structured metadata are added | Users must manage their own data |