Data Sharing & Archiving
Why Share Data and Code?
- Required by some journal publishers (such as Nature) and funding agencies (National Science Foundation, National Institutes of Health, etc.). Other funding agencies are expected to require researchers to share data produced during their research projects as well (see OMB Policy Memorandum on sharing data).
- Data can be reused to answer new questions, opening up new interpretations and discoveries.
- Sharing data may lead to sharing research process, workflows, and tools (enhances potential for replicating research results).
- Makes your articles and papers more useful to and citable by others, which increases their value!
How and Where to Share Your Data and Code
- Recommended: Deposit into a recognized data repository.
- Submit data/code, along with your article, to a journal publisher (such as Nature)
Posting data on a web page is useful for increasing visibility and for presentation but is not recommended as the main strategy for data sharing. Instead, deposit into a trusted repository and refer to, showcase, or cite the deposit in any web pages or other media.
Posting code on GitHub is an accepted way of sharing code. To enhance code citability and to ensure that the exact version of the code is included alongside the data it is associated with, depositing code into a data repository is recommended. Many data repositories, including ReDATA, support GitHub integration.
When sharing your data and code
- Follow best practices to include enough information in readme files or elsewhere to make it possible to cite the dataset. See Citing Data & Code.
- Include enough information so that others can understand and reuse the dataset. See Data Documentation, Readme, & Metadata
- Follow best practices for organizing files and structuring code so others can easily understand and use the data and/or code. See Data Organization and Software/Code Best Practices.
- Consider using tools to enable Research Reproducibility.
Archiving is an activity that ensures that data are properly selected, stored, and can be accessed, and for which logical and physical integrity are maintained over time, including security and authenticity.
Data (and code) archiving involves ensuring data is clean, documented, organized, and as self-contained as possible. Data can be archived in various ways:
- On local hard disk drives
- On long-term tape drives
- In the cloud
- In dedicated data repositories
A rule of thumb when archiving a project is to document and organize materials so that a colleague who was handed the archive bundle could understand it without asking you.
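One way to make an archive bundle's integrity checkable over time is to store a checksum manifest alongside it. The following is a minimal sketch rather than a prescribed workflow: the function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical, and SHA-256 is an assumed choice of hash, not something a particular repository requires.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir: Path) -> dict[str, str]:
    """Map each file's relative path (POSIX style) to its checksum."""
    return {
        path.relative_to(bundle_dir).as_posix(): sha256_of(path)
        for path in sorted(bundle_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(bundle_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or have different contents."""
    current = build_manifest(bundle_dir)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

A manifest built when the bundle is archived can be saved outside the bundle (for example, as a JSON file) and passed to `changed_files` later, or run against an offline copy, to spot files that were modified or deleted. Files added after the manifest was built are not reported by this sketch.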
Placing data in a dedicated repository is the preferred method of archiving publicly releasable datasets and code (e.g., data associated with published articles) since it allows for data reuse and citation and enables research reproducibility. See the page on Data Repositories for how to get started finding one.
Although the risk of data loss is low in dedicated data repositories (namely, those that have explicit acknowledgements regarding data storage and how long data will remain available), it is always advisable, where possible, to retain an offline copy of the data. This is especially true for general-purpose cloud storage that, unlike a dedicated data repository, allows for unintended modification and deletion of files.
Various guidelines govern data retention requirements, depending on the kind of data and its use (e.g., financial information, data used for patent applications, etc.). Unless otherwise specified, the following guidance applies to research data in general:
Each investigator should treat data properly to ensure authenticity, reproducibility, and validity and to meet the requirements of relevant grants and other agreements concerning the retention of data. Primary data should be reserved for ten (10) years, if there is no requirement in the Award Document, to ensure that any questions raised in published results can be answered.
See the UA Research Gateway for more information.
It is vital to maintain the confidentiality of research subjects for reasons of ethics and to ensure continued participation in research. Sometimes, research data resulting from funded research cannot be shared. There are policies that address this, such as the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA).
Researchers who want to ethically share sensitive and confidential data may want to consider the following:
- Include a provision for data sharing when obtaining informed consent of research participants. The U.K. Data Archive Guide provides an example of a consent form with a provision for data sharing.
- Protect privacy through anonymizing data
- Evaluate the sensitivity of your data -- researchers should consider if their data contains either direct or indirect identifiers that could be combined with other public information to identify research participants
- Obtain a confidentiality review -- some data archives, such as Inter-University Consortium for Political and Social Research (ICPSR), will review your data for the presence of confidential information
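As a rough illustration of removing direct identifiers before sharing, the sketch below drops identifying columns and replaces a participant ID with a keyed hash. Everything named here is an assumption made for the example: the column names (`name`, `email`, `participant_id`), the `pseudonymize` function, and the choice of HMAC-SHA-256. It does not address indirect identifiers and is not a substitute for a confidentiality review or a formal de-identification service.

```python
import hashlib
import hmac

# Hypothetical column names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "email"}
PARTICIPANT_KEY = "participant_id"


def pseudonymize(records, secret: bytes):
    """Drop direct identifiers and replace the participant ID with a keyed hash.

    The same ID and secret always yield the same token, so rows belonging to
    one participant stay linked. The secret must be kept out of the shared data.
    """
    cleaned = []
    for record in records:
        # Keep every column that is not listed as a direct identifier.
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        token = hmac.new(
            secret, str(record[PARTICIPANT_KEY]).encode("utf-8"), hashlib.sha256
        )
        row[PARTICIPANT_KEY] = token.hexdigest()[:16]
        cleaned.append(row)
    return cleaned
```

Columns such as age, postal code, or dates can still identify participants when combined with other public information, so the output of a step like this still needs the sensitivity evaluation described above.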
- Responsible Conduct of Research
- Research Integrity
- Health Insurance Portability and Accountability Act (HIPAA)
- Human Subjects Protection Program
- De-identification services from the UA Center for Biomedical Informatics & Biostatistics
- Records & Archives - For help with official UA records retention and destruction policies and procedures