Open Data

This guide is designed to serve as an introduction to the concepts and tools supporting the Open Data Movement.

Workflow Management

Workflow-based systems provide an explicit representation of the structure of experiments, automate repetitive tasks and computations, and transparently capture provenance information.

Provenance in workflow systems includes:

Prospective provenance: a description of the experiment's workflow structure, such as modules, connections, and inputs.

Retrospective provenance: information on the execution of the workflow and what happened when it was run.

Workflow evolution: the history and versions of the workflow (especially when data is iteratively refined).
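As an illustration of retrospective provenance, the sketch below builds a small record of one executed step. It is a minimal example, not the format of any particular workflow system; the function name `provenance_record`, the step name, and the field names are all assumptions made for this guide.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def provenance_record(step_name, parameters, output_bytes):
    """Build a retrospective provenance record for one executed step.

    The field names here are illustrative, not a standard schema.
    """
    return {
        "step": step_name,
        "parameters": parameters,
        # When the step ran, in UTC, as an ISO 8601 string.
        "executed_at": datetime.now(timezone.utc).isoformat(),
        # Which interpreter produced the output.
        "python_version": platform.python_version(),
        # A checksum lets others confirm they reproduced the same output.
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
    }


record = provenance_record("normalize", {"scale": 10}, b"1.0,2.0,3.0\n")
print(json.dumps(record, indent=2))
```

Saving a record like this next to each output file gives a simple trail of what was run, with which settings, and what it produced.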

To capture research provenance, three classes of tools need to be installed: a shell or terminal program for access to the command line, a plain text editor or a development environment, and software that allows the user to write and execute code in a chosen programming language.

Examples of Workflow Management tools include:


Registration and Registered Reports

When you preregister your research, you're simply specifying your research plan in advance of your study and submitting it to a registry. In clinical research in particular, preregistration of a study is often mandatory.

Preregistration separates hypothesis-generating (exploratory) from hypothesis-testing (confirmatory) research. Both are important. But the same data should not be used both to generate and to test a hypothesis; this can happen unintentionally, and it reduces the credibility of your results. Addressing this problem through planning improves the quality and transparency of your research. This helps you clearly report your study and helps others who may wish to build on it.

For more information on registration, please check the following links:

Data Processing and Analysis Best Practices

Data cleanup and processing are key components of replicability and reproducibility.

Here are some best practices:

  • Document all operations fully and automate as much as possible.
  • Record each step in the process in enough detail that the cleaning strategy can be replicated.
  • Encode the instructions for data processing as computer code that reads the raw data. If the processing is done manually, accompany the file with a very detailed human-readable description saved in a separate text file.
  • Design a workflow as a sequence of small steps glued together, with the intermediate output of one step feeding into the next step as input.
  • Comment your code
  • Version control your code
  • Use free and open tools.
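The practices above can be sketched as a two-step workflow in which the raw file is never modified and each step writes or reads an intermediate file. This is a minimal illustration using only the Python standard library; the file names, the column names `sample` and `value`, and the cleaning rule (drop rows with a missing value) are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def clean(raw_path, cleaned_path):
    """Step 1: copy the raw data, dropping rows with a missing value.

    The raw file is only read, never changed.
    """
    with open(raw_path, newline="") as src, open(cleaned_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["value"].strip():
                writer.writerow(row)


def summarize(cleaned_path):
    """Step 2: read the intermediate file and compute the mean value."""
    with open(cleaned_path, newline="") as src:
        values = [float(row["value"]) for row in csv.DictReader(src)]
    return sum(values) / len(values)


# Hypothetical raw data written to a temporary working directory.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
cleaned = workdir / "cleaned.csv"
raw.write_text("sample,value\na,1.0\nb,\nc,3.0\n")

clean(raw, cleaned)              # raw.csv -> cleaned.csv
mean_value = summarize(cleaned)  # cleaned.csv -> summary number
print(mean_value)
```

Because each step is a small function with a file as input and output, any step can be rerun or inspected on its own, and the whole script can be kept under version control.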


Acquiring Data and Electronic Lab Notebooks

Best practices for Acquisition of data include:

  • Create a spreadsheet and save your work to a plain text file (for example, CSV).
  • Clearly name your working files.
  • Create and save a metadata file, in a simple text format, that documents the source of the data and any other information about it (for example, a data dictionary or README.txt).
  • Use an appropriate directory and file structure.
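One possible way to follow these practices from a script is sketched below. The project name, directory layout, file name, and column names are all hypothetical, and the README text is a placeholder to be replaced with a real description of the data source.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical project layout: <project>/data/raw/
project = Path(tempfile.mkdtemp()) / "my_project"
data_dir = project / "data" / "raw"
data_dir.mkdir(parents=True)

# A clearly named data file: date, site, and measurement in the name.
data_file = data_dir / "2024-01-15_site-a_temperature.csv"
with open(data_file, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample_id", "temp_c"])
    writer.writerow(["s001", 21.4])

# A plain-text metadata file with the source and a data dictionary.
readme = data_dir / "README.txt"
readme.write_text(
    "Source: (describe where the data came from)\n"
    "Collected: 2024-01-15\n"
    "\n"
    "Data dictionary\n"
    "sample_id: unique sample identifier (text)\n"
    "temp_c: temperature in degrees Celsius (number)\n"
)

print(sorted(p.name for p in data_dir.iterdir()))
```

Keeping the README in the same directory as the data means the documentation travels with the files when they are copied or shared.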

Electronic Lab Notebooks are frequently used to encourage reproducible data acquisition practices.

File versioning

File versioning is a great way to maintain order in your research computations and to support better collaboration. There are many collaborative tools that support file versioning.

Check our guide for information on principles of good file naming, versioning and maintenance.

Data Validation and Code Checking

Data Sharing Best Practices

  • Host code on a collaborative platform
  • Obtain a DOI for your data and code.
  • Avoid spreadsheets and proprietary file formats when possible; plain-text data is preferable.
  • Clearly separate, label, and document all data, files, and operations that occur on data and files
  • Share using open licensing
  • Upload preprints; try ScholarWorks.
  • Release code near time of paper submission
  • Add a reproducibility statement
  • Keep an up-to-date web presence
  • Describe software properly with versions and software dependencies
  • Describe fully the environment of your computations
  • Include data-cleaning scripts with your research materials, along with commentary explaining key decisions made about missing and discarded data
  • Include a README file
  • Whenever possible, use computational software with a license permissive enough to allow users to use the software, reproduce the results, and extend them.
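Describing software versions and the computing environment can be partly automated. The sketch below uses Python's standard `platform` and `importlib.metadata` modules to list the interpreter, the operating system, and the installed version of each named package; the function name `describe_environment` and the package names passed to it are assumptions made for the example.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return lines describing the interpreter, platform, and package versions."""
    lines = [
        f"python {platform.python_version()}",
        f"platform {platform.platform()}",
    ]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly rather than skipping them.
            lines.append(f"{name} (not installed)")
    return lines


# Hypothetical package list; replace with the packages your analysis imports.
report = describe_environment(["pip", "a-package-that-does-not-exist"])
print("\n".join(report))
```

The resulting lines can be pasted into a README file or a reproducibility statement so that others know which versions produced the results.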

Code documentation tools

Code documentation provides human-readable elements embedded in your code so that other users can follow it.

Many of these documentation tools will also allow you to create runnable code within a text. You can also make complete documents and even books!
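As a small illustration of documentation that is both human-readable and runnable, the sketch below uses Python's standard `doctest` module, which executes the examples written inside a docstring and checks their output. The function `celsius_to_fahrenheit` is a made-up example, and `doctest` is only one of many tools of this kind.

```python
import doctest


def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit.

    The examples below are documentation and tests at the same time:

    >>> celsius_to_fahrenheit(100)
    212.0
    >>> celsius_to_fahrenheit(0)
    32.0
    """
    return celsius * 9 / 5 + 32


# Collect the examples from the docstring and run them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
tests = finder.find(
    celsius_to_fahrenheit,
    name="celsius_to_fahrenheit",
    module=False,
    globs={"celsius_to_fahrenheit": celsius_to_fahrenheit},
)
for test in tests:
    runner.run(test)

results = runner.summarize(verbose=False)
print(results.failed, results.attempted)
```

In an ordinary script, calling `doctest.testmod()` is the usual shortcut for running every docstring example in the file.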

Packaging and sharing

Packaging tools allow you to collect multiple files (data, text, etc.) into...well...packages for easier portability. They are very helpful for moving files between a computer and a cloud environment.
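The simplest form of packaging is an archive file. The sketch below bundles a data file and its README into a single zip archive with Python's standard `zipfile` module; the file names and contents are hypothetical, and dedicated research packaging tools do considerably more than this.

```python
import tempfile
import zipfile
from pathlib import Path

# Hypothetical project files in a temporary working directory.
workdir = Path(tempfile.mkdtemp())
(workdir / "data.csv").write_text("sample,value\na,1.0\n")
(workdir / "README.txt").write_text("Describes data.csv\n")

# Collect the files into one compressed package.
package = workdir / "project_package.zip"
with zipfile.ZipFile(package, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name in ["data.csv", "README.txt"]:
        # arcname stores a short relative name instead of the full path.
        archive.write(workdir / name, arcname=name)

# Reopen the package to confirm what it contains.
with zipfile.ZipFile(package) as archive:
    packaged_names = sorted(archive.namelist())

print(packaged_names)
```

A single archive like this is easier to upload, download, and cite than a loose collection of files.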

Some packaging systems are also being developed to provide the software needed to run the analyses within the code packages. CodeOcean is one of these systems.

Sharing and Preserving