Data Sharing & Archiving

Why Share Data and Code?

Sharing is required by some journal publishers (such as Nature) and funding agencies (National Science Foundation, National Institutes of Health, etc.). Other funding agencies are expected to require researchers to share data produced during their research projects as well (see OMB Policy Memorandum on sharing data).
Data can be reused to answer new questions, opening up new interpretations and discoveries.
Sharing data may lead to sharing research processes, workflows, and tools, which enhances the potential for replicating research results.
Sharing makes your articles and papers more useful and citable by others, increasing their value.

How and Where to Share Your Data and Code

Posting data on a web page is useful for increasing visibility and for presentation but is not recommended as the main strategy for data sharing. Instead, deposit the data into a trusted repository and refer to, showcase, or cite the deposit in any web pages or other media.

Recommended: Deposit into a recognized data repository, such as the University of Arizona's Research Data Repository (ReDATA) or any other appropriate disciplinary data repository.
Submit data and code along with your article as supplementary material, if the journal allows it.

Posting code on GitHub is an accepted way of sharing code. To enhance code citability and to ensure that the exact version of the code is included alongside the data it is associated with, depositing code into a data repository is also recommended. Many data repositories, including ReDATA, support GitHub integration.

Preparing Your Data for Sharing

When sharing your data and code:

Bundle your data together in a systematic way by following best practices for organizing files and structuring code so others can easily understand and use the data and/or code. See Data Organization and Software/Code Best Practices.
Include enough information so that others can understand and reuse the dataset. See Data Documentation, Readme, & Metadata.
Follow best practices to include enough information in readme files or elsewhere to make it possible to cite the dataset. See Citing Data & Code.
Follow best practices to ensure the confidentiality of any human participants.

Because data preparation practices vary with the characteristics of the data, the Data Curation Network has prepared primers with recommended practices for preparing a wide variety of file formats for sharing.
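As a rough illustration of the documentation step above, the following Python sketch drafts a readme skeleton with a file inventory for a dataset bundle. The section headings, the `build_readme` function, and the `my_dataset` directory name are assumptions made for this example, not a format required by ReDATA or any other repository; follow your repository's own readme guidance for the actual content.

```python
from pathlib import Path

# Hypothetical section headings; adapt them to your repository's readme guidance.
README_SECTIONS = [
    "Title",
    "Creators and contact",
    "Description and methods",
    "How to cite this dataset",
    "License",
]


def build_readme(bundle_dir: Path) -> str:
    """Return readme text with placeholder sections and a file inventory."""
    lines = [f"{section}: TODO" for section in README_SECTIONS]
    lines.append("")
    lines.append("Files:")
    # List every file in the bundle, relative to the bundle root.
    files = sorted(p for p in bundle_dir.rglob("*") if p.is_file())
    for path in files:
        relative = path.relative_to(bundle_dir).as_posix()
        lines.append(f"  {relative} ({path.stat().st_size} bytes): TODO describe")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    bundle = Path("my_dataset")  # hypothetical bundle directory
    if bundle.is_dir():
        (bundle / "README.txt").write_text(build_readme(bundle), encoding="utf-8")
```

The generated text is only a starting point: each TODO placeholder still needs a human-written description before the bundle is deposited.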

Archiving

Archiving is an "activity that ensures that data are properly selected, stored, and can be accessed, and for which logical and physical integrity are maintained over time, including security and authenticity." (CODATA RDM Glossary)

Data (and code) archiving involves ensuring data is clean, documented, organized, and as self-contained as possible. Data can be archived in various ways:

On local hard disk drives
On long-term tape drives
In the cloud
In dedicated data repositories

A rule of thumb when archiving a project is to document and organize materials so that, if you gave the archive bundle to a colleague, they could understand it without asking you.

Placing data in a dedicated repository is the preferred method of archiving publicly releasable datasets and code (e.g., data associated with published articles) since it allows for data reuse and citation and enables research reproducibility. See the Data Repositories page for how to get started finding one.

Although the risk of data loss is low in dedicated data repositories (namely, those that have explicit acknowledgements regarding data storage and how long data will remain available), it is always advisable, where possible, to retain an offline copy of the data. This is especially true for general purpose cloud storage that, unlike a dedicated data repository, allows for unintended modifications and deletion of files.
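One way to check that an offline copy still matches the original is to compare file checksums. The sketch below uses SHA-256 digests from Python's standard library; the function names are chosen for this example, and the approach is an assumption about how you might verify a copy, not a procedure prescribed by any repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(original: Path, copy: Path) -> list[str]:
    """Return relative paths that are missing, extra, or different in the copy."""
    expected = build_manifest(original)
    actual = build_manifest(copy)
    # Files missing from the copy, or present with different content.
    problems = [name for name in expected if actual.get(name) != expected[name]]
    # Files present only in the copy.
    problems += [name for name in actual if name not in expected]
    return sorted(problems)
```

An empty list from `compare_copies` means every file in both locations has the same content. Saving the manifest alongside the archive also lets you re-check the copy later without access to the original.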

Data Retention

Various guidelines govern data retention requirements, depending on the kind of data and its use (e.g., financial information, data used for patent applications, etc.). Unless otherwise specified, the following guidance applies to research data in general:

Each investigator should treat data properly to ensure authenticity, reproducibility, and validity and to meet the requirements of relevant grants and other agreements concerning the retention of data. Primary data should be reserved for ten (10) years, if there is no requirement in the Award Document, to ensure that any questions raised in published results can be answered.

See the Office of Research & Partnership's page on retention and ownership for more information. See Records & Archives for help with official UA records retention and destruction policies and procedures.

Where to Archive your Data at UA

Refer to the options listed in Storage, Backups, and Security. For larger datasets, there are three recommended options: OneDrive, Tier 2 storage, and ReDATA. Depending on the reason for archiving, one may be a better fit than another.

OneDrive is a good general purpose option for less than 1 TB of data. It does not meet data sharing requirements from funders and journals.
For private storage of larger amounts of data, Tier 2 storage is a good solution. It does not meet data sharing requirements from funders and journals.
For data publication and sharing (e.g., for dissertations or journal articles), ReDATA is best.
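The three recommendations above can be read as a simple decision rule. The Python sketch below encodes it for illustration only: the function name and parameters are hypothetical, the 1 TB threshold comes from the OneDrive guidance above, and repository quotas and eligibility rules are not modeled.

```python
def suggest_archive_option(size_tb: float, needs_public_sharing: bool) -> str:
    """Suggest an archiving option based on the guidance above (illustrative only)."""
    if needs_public_sharing:
        # Only ReDATA is described as meeting funder and journal sharing requirements.
        return "ReDATA"
    if size_tb < 1:
        # General purpose private storage for smaller amounts.
        return "OneDrive"
    # Private storage for larger amounts.
    return "Tier 2 storage"


if __name__ == "__main__":
    print(suggest_archive_option(0.2, needs_public_sharing=False))
    print(suggest_archive_option(5, needs_public_sharing=False))
    print(suggest_archive_option(0.5, needs_public_sharing=True))
```

Treat the result as a starting point for a conversation with the relevant service, since the right choice also depends on the reason for archiving.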

To help you decide, the table below compares ReDATA and the various storage options provided by Research Computing (Tier 1 is excluded from the table as it is not intended for long-term storage).

	ReDATA	HPC Tier 2, rented storage, R-DAS
Main purpose	Data publication to meet funder/journal data sharing requirements	General purpose storage. Performance varies. Not designed for public sharing
Usage examples	Public archiving, getting a DOI, preserving curated and final data for journal articles, dissertations, etc.	Private archiving, project backups, transferring data to HPC for analysis
Eligibility	All individuals with a valid NetID and library privileges	Only faculty may request an account
Cost	Free	Most tiers are free, quotas apply. Additional storage available for a fee
Stewardship	Libraries assume full responsibility for ensuring long-term data availability according to the FAIR principles. Data is preserved using multiple distributed copies.	Users are responsible for maintaining the integrity of their own data. Not backed up (except Tier 2)
Storage quotas & egress	Up to 1 TB by request (additional storage available for special projects). No egress charges to the public internet	Varies. Egress subsidized for Tier 2.
Data curation services & structured metadata	Data is curated by ReDATA staff to prepare it for sharing. Documentation and structured metadata are added.	None. Users must manage their own data and ensure it remains usable

Resources & Best Practices