A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility.

Nicholas Tierney, Monash University
&
Karthik Ram, Berkeley Institute of Data Science, UC Berkeley

NUMBAT

Friday 17th January, 2020

talk link

nj_tierney


"Data! data! data!" he cried impatiently. "I can't make bricks without clay."

--Sherlock Holmes (The Adventure of the Copper Beeches by Sir Arthur Conan Doyle)


This talk represents work that is currently in progress with Karthik Ram:

"A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility"

Any feedback and questions would be much appreciated!


njt-numbat-data.netlify.com/ • @nj_tierney

We need data

We need it to do our job

(though not strictly true for theory, but you get what I mean)

Benefits of sharing data

🔎 Makes work transparent

✅ Increases trust

🔈 Increases visibility

Independent validation

♻️ Reproducibility

Research isn't often shared

It's often only really available to the authors

Rowhani-Farid & Barnett, 2016

Out of 160 randomly sampled BMJ papers:

3 included data in the paper
7/157 research articles shared their data sets
For 21 clinical trials bound by the BMJ data sharing policy, 24% shared data


Stodden, Seiler and Ma, 2018

Science introduced a clause requiring authors to provide data with their papers

The study compared 204 papers before and after the clause for reproducibility

Some of the replies received when authors were asked for their data and code:

😞

When you approach a PI for the source codes and raw data, you better explain who you are, whom you work for, why you need the data and what you are going to do with it.


😿

I have to say that this is a very unusual request without any explanation! Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation.


😭

We do not typically share our internal data or code with people outside our collaboration.


😿

The code we wrote is the accumulated product of years of effort by [redacted] and myself. Also, the data we processed was collected painstakingly over a long period by collaborators, and so we will need to ask permission from them too.


😿

Normally we do not provide this kind of information to people we do not know. It might be that you want to check the data analysis, and that might be of some use to us, but only if you publish your findings while properly referring to us.


😭

Thank you for your interest in our paper. For the [redacted] calculations I used my own code, and there is no public version of this code, which could be downloaded. Since this code is not very user-friendly and is under constant development I prefer not to share this code


🎉

Our program [redacted] is available here [URL redacted] (documentation and tutorials were included)


🎉

If you go to [URL redacted], under the publications, I have a link to the gitHub repository. I don’t know if I have all of the raw simulated data, but I certainly have the processed data used to make the plots. What do you need? All of the simulated data could of course be regenerated from the code.


🎉

Please find attached a .zip file called [redacted].zip that has the custom MATLAB [redacted] analysis code. If you run Masterrunfigureone.m this will generate several panels from the paper.


🎉

In the next email I will enclose the custom image analysis software. This can also be accessed from [URL redacted] where there is a manual and tutorial.

Sharing data?

Plenty of research tells you it is important to share data

And that data sharing should be FAIR (Findable, Accessible, Interoperable, and Reusable)

But it doesn't actually tell you how to share data

Why not share data?

There are indeed good reasons to not share data:

Privacy concerns (e.g., human subjects, locations of critically endangered species)
Sharing may put the authors at a competitive disadvantage (but data can be embargoed for reasonable periods of time)

"If you can't do something right, don't do it"

This is wrong: you can provide something, even if it is just simulated data.
Sharing data (in most cases) has a net positive benefit
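As a rough illustration of the "even if it is just simulated data" point, here is a minimal Python sketch that produces a made-up dataset with the same shape as the real one. The column names, category levels, and value ranges are all assumptions invented for this example; they do not come from any real crime dataset.

```python
import csv
import io
import random


def simulate_crime_data(n_rows, seed=2020):
    """Return n_rows of made-up records shaped like a hypothetical crime.csv."""
    rng = random.Random(seed)  # fixed seed so the simulated data is reproducible
    suburbs = ["A", "B", "C"]  # placeholder category levels, not real places
    rows = []
    for i in range(n_rows):
        rows.append({
            "id": i + 1,
            "suburb": rng.choice(suburbs),
            "year": rng.randint(2010, 2019),
            "n_offences": rng.randint(0, 50),
        })
    return rows


def to_csv_text(rows):
    """Serialise the rows to CSV text, a plain-text format anyone can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


simulated = simulate_crime_data(5)
print(to_csv_text(simulated))
```

Sharing a file like this alongside the analysis code lets others run the code end to end even when the real records cannot be released.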

How to think about sharing data

Sharing data can feel like a wall or a mountain we need to climb.

Walls and mountains require special tools and knowledge.

It should instead be an "on-ramp"

This talk should hopefully get you started

The on ramp to sharing data

Analysis ready data: Final data used in analysis
README: A human readable description of the data
Data dictionary: Human readable dictionary of data contents
Raw data: The original/first data provided
Scripts: To clean raw data ready for analysis
License: How to use and share the data
Citation: How you want your data to be cited
Machine readable metadata: Make your data searchable

Analysis ready data: Final data used in analysis

project
└── data
    └── crime.csv

Dataset(s) in the form used in analysis, e.g., data used in a linear regression.
Ideally in "tidy data" form.
Plain-text formats: .csv, .tsv, .txt
Binary / proprietary formats are discouraged, since they require special software (although these can go into the data-raw folder, discussed below)
e.g., avoid .rda, .rds, .sav, .dta, .sas7bdat.
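The format advice above can be turned into a quick check. The following is only a sketch in Python: the two extension sets are copied from the slide, while the function name and the "unknown" category are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Plain-text extensions the slide recommends for analysis-ready data
PLAIN_TEXT = {".csv", ".tsv", ".txt"}
# Binary / proprietary extensions the slide discourages
DISCOURAGED = {".rda", ".rds", ".sav", ".dta", ".sas7bdat"}


def classify_format(path):
    """Label a file as plain-text, discouraged, or unknown by its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in PLAIN_TEXT:
        return "plain-text"
    if suffix in DISCOURAGED:
        return "discouraged"
    return "unknown"  # anything the slide does not mention


for candidate in ["data/crime.csv", "data/crime.rds", "data/crime.parquet"]:
    print(candidate, "->", classify_format(candidate))
```

A check like this only looks at file names; it says nothing about whether the contents are tidy.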

README: A human readable description of the data

project
├── data
│   └── crime.csv
└── README.md

Guides the reader/user on how to understand this directory.
Handy when there are no reliable standards
Sits at the top level of the data repository (optionally, one for each dataset)
Contains the who, what, when, where, and why of your data.
.md allows you to take advantage of markdown, making it easy to insert lists, tables, and links.
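One way to get started on a README is to generate a skeleton with a section for each of the five questions. This Python sketch is an assumption about layout, not a standard: the section headings, the "TODO" placeholder, and the example answers are all invented for illustration.

```python
QUESTIONS = ["Who", "What", "When", "Where", "Why"]


def readme_skeleton(title, answers):
    """Build README.md text with one markdown section per question."""
    lines = ["# " + title, ""]
    for question in QUESTIONS:
        lines.append("## " + question)
        # Unanswered questions get a visible placeholder to fill in later
        lines.append(answers.get(question, "TODO"))
        lines.append("")
    return "\n".join(lines)


# Hypothetical answers; replace these with the details of your own data
text = readme_skeleton(
    "Crime data",
    {
        "Who": "Collected by (name of the people or organisation).",
        "What": "One row per suburb per year, see the data dictionary.",
    },
)
print(text)
```

Leaving the unanswered sections marked makes it obvious what is still missing, which fits the idea of adding detail as you go.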

Data dictionary: Human readable dictionary of data contents

project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
└── README.md

Human readable description, context, and structure of the data
Helps familiarise the user with the data
It should contain:
variable names
variable labels
variable codes, and
special values for missing data
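A data dictionary with those four pieces of information can itself be a small CSV file. The Python sketch below assumes a column layout (variable, label, codes, missing_value) and example variables that are hypothetical; the slide only says which kinds of information to include.

```python
import csv
import io

# Assumed column layout for the dictionary file; not a fixed standard
DICTIONARY_FIELDS = ["variable", "label", "codes", "missing_value"]

# Hypothetical entries describing a made-up crime.csv
entries = [
    {"variable": "suburb", "label": "Suburb name", "codes": "", "missing_value": "NA"},
    {"variable": "year", "label": "Calendar year", "codes": "", "missing_value": "NA"},
    {"variable": "type", "label": "Offence type",
     "codes": "1 = property; 2 = person", "missing_value": "-99"},
]


def dictionary_csv(rows):
    """Serialise dictionary entries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DICTIONARY_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def undocumented(data_columns, rows):
    """Return the data columns that have no entry in the dictionary."""
    documented = {row["variable"] for row in rows}
    return [name for name in data_columns if name not in documented]


print(dictionary_csv(entries))
print(undocumented(["suburb", "year", "type", "n_offences"], entries))
```

The second function is a small consistency check: any column listed in its output still needs a dictionary entry.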


Raw data: The original/first data provided

project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   └── crime-raw.dat
└── README.md

Usually the first format of the data provided, before any tidying or cleaning.
If the raw data is a practical size to share, it should be shared in a folder called data-raw.
It should be in the form in which it was first received, even if that is binary or some proprietary format.
Data dictionaries for the raw data can optionally be provided in data-raw.

Scripts: To clean raw data ready for analysis

project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
└── README.md

Code used to clean and tidy the raw data, e.g., clean-crime.R
Ideally involves only scripted languages
If other practical steps were taken to clean up the data, these should be recorded in a plain text or markdown file, e.g., other-steps.md
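To give a feel for what a scripted cleaning step looks like, here is a minimal sketch in Python (the talk's own example is an R script, clean-crime.R, whose contents are not shown). The pipe-delimited raw format, the column names, and the missing-value codes are all assumptions made up for this example.

```python
import csv
import io

# Assumed codes that mean "missing" in the hypothetical raw file
MISSING_CODES = {"", "NA", "-99"}


def clean_rows(raw_text, delimiter="|"):
    """Tidy raw delimited text: snake_case names, trimmed values, blank for missing."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    cleaned = []
    for row in reader:
        tidy = {}
        for key, value in row.items():
            name = key.strip().lower().replace(" ", "_")
            value = (value or "").strip()
            tidy[name] = "" if value in MISSING_CODES else value
        cleaned.append(tidy)
    return cleaned


# Made-up stand-in for the contents of a raw file
raw = "Suburb | Year | N Offences\n A | 2019 | 12 \n B | 2019 | -99 \n"
for record in clean_rows(raw):
    print(record)
```

Because every step is in code, anyone can rerun it on the raw file and confirm they get the same analysis-ready data.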

License: How to use and share the data

project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md
└── LICENSE

A license clearly establishes how everyone may modify, use, and share the data.
Two licenses well suited for data sharing:

CC BY: attribution and credit required, no warranty.
CC0: public domain. No ownership or warranty.

Provide a LICENSE file containing the entire license in the top level of the directory.
In R, the usethis package provides helpers for this:
use_cc0_license()
use_ccby_license()

Citation: How you want your data to be cited

project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md (reference DOI)
├── CITATION
└── LICENSE

A Digital Object Identifier (DOI) uniquely and permanently identifies a digital object (paper, poster, or software)
DOIs are minted by repositories like Dryad or Zenodo for free.
Put the DOI in a reference format like BibTeX (Zenodo does this for you)

Citation: example

@software{housing-data,
  author       = {Nicholas Tierney},
  title        = {njtierney/melb-housing-data: Added LICENSE.md file},
  month        = feb,
  year         = 2019,
  publisher    = {Zenodo},
  version      = {1.0.1},
  doi          = {10.5281/zenodo.2575545},
  url          = {https://doi.org/10.5281/zenodo.2575545}
}
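
If you need to assemble such an entry yourself rather than copy it from Zenodo, a small helper can format it. This Python sketch is an illustration only: the function name is made up, the field values are copied from the example above, and unlike that example it wraps every value in braces.

```python
def bibtex_entry(key, fields, entry_type="software"):
    """Format a BibTeX entry from a dict of field names and values."""
    body = ",\n".join(
        "  %s = {%s}" % (name, value) for name, value in fields.items()
    )
    return "@%s{%s,\n%s\n}" % (entry_type, key, body)


entry = bibtex_entry(
    "housing-data",
    {
        "author": "Nicholas Tierney",
        "title": "njtierney/melb-housing-data: Added LICENSE.md file",
        "year": "2019",
        "publisher": "Zenodo",
        "version": "1.0.1",
        "doi": "10.5281/zenodo.2575545",
        "url": "https://doi.org/10.5281/zenodo.2575545",
    },
)
print(entry)
```

The resulting text can be saved as the CITATION file at the top level of the project.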


Machine readable metadata: Make your data searchable

project
├── data
│   ├── crime.csv
│   ├── crime-dictionary.csv
│   └── metadata
│       ├── access.csv
│       ├── attributes.csv
│       ├── biblio.csv
│       ├── creators.csv
│       └── dataspice.json
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md (reference DOI here)
├── CITATION
└── LICENSE

Helps ensure data types are preserved.
Provides structure allowing data to be indexed and searched online using JSON-LD, through services such as Google Dataset Search.
To create appropriate metadata, we recommend metadata generators such as dataspice or codebook
Metadata should be provided in a folder called "metadata", with one such folder for every dataset.
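To show roughly what JSON-LD metadata looks like, here is a hand-written Python sketch using the schema.org Dataset vocabulary. It is not the output of dataspice or codebook, and the dataset name, description, creator, and variable names are placeholders invented for the example.

```python
import json


def dataset_jsonld(name, description, creator, license_url, variables):
    """Build a minimal schema.org Dataset description as a Python dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Person", "name": creator},
        "license": license_url,
        # One PropertyValue per column so the variables are searchable too
        "variableMeasured": [
            {"@type": "PropertyValue", "name": variable} for variable in variables
        ],
    }


metadata = dataset_jsonld(
    name="Crime data (hypothetical example)",
    description="Made-up counts of offences by suburb and year.",
    creator="A. Researcher",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    variables=["suburb", "year", "n_offences"],
)
print(json.dumps(metadata, indent=2))
```

In practice a metadata generator would fill in many more fields; the point here is only that the result is plain JSON that search services can index.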

[Figure: example machine readable metadata for the housing data]

Actually sharing the data

Now that you've created your data folder, you need to get it somewhere online

Two options I would like to discuss:

Putting the data online
Sharing as an R package

Online repositories: Zenodo & Dryad

Zenodo

Launched in 2013 as a joint collaboration between OpenAIRE and CERN
Free, archival location to deposit datasets
File size limit is 50 GB for individual files
Able to accommodate larger file sizes upon request

Dryad

The Dryad Digital Repository takes data from any field of research, performs human quality control, and provides assistance with the data
Can link data with a journal publication, in exchange for a data publishing fee.

Linking Zenodo and GitHub

You can also link Zenodo with GitHub

Zenodo mints a new DOI at every "release" (helps avoid managing many moving pieces)

See the GitHub guide on making your code citable (thanks to Arfon Smith)

Sharing data as an R package

"Data only" R packages (e.g., nycflights13, eechidna, other Australian datasets)

Pros

installable
documentation
share data cleaning
great for R users

Cons

Size: must not be 5 MB or larger (CRAN)
Doesn't help others outside R

Most (but not all!) data shared as, or with, an R package is for teaching purposes
Sharing the data together in this way allows you to create a "research compendium", where code + paper + computing environment + data are in one place.
See more about this in Karthik Ram's talk: "How To Make Your Data Analysis Notebooks More Reproducible"
Note also that the suggested directory structure for sharing data is based on that of an R package

Online "data" journals

Provide a familiar mechanism for citation
But journals don't yet have a good way to outline how to share data
A data paper often looks like a "mini paper" with the methods, and isn't always about the data so much as the collection methods.
Can link to a Zenodo or Dryad repository.

Some example data journals:

Nature: Scientific Data
Data in Brief
Data

Take homes

You don't have to do every single thing to publish your data
Take small steps: get the data somewhere first, add more detail as you go
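One way to take small steps is to check which parts of the on-ramp a project already has. The Python sketch below is a hypothetical helper, not an existing tool: the paths come from the example directory tree used in this talk, and a real project would use its own file names.

```python
import tempfile
from pathlib import Path

# On-ramp steps mapped to paths from the talk's example directory tree
ON_RAMP = {
    "Analysis ready data": "data",
    "README": "README.md",
    "Data dictionary": "data/crime-dictionary.csv",
    "Raw data": "data-raw",
    "Scripts": "data-raw/clean-crime.R",
    "License": "LICENSE",
    "Citation": "CITATION",
    "Machine readable metadata": "data/metadata",
}


def on_ramp_status(project_dir):
    """Report which on-ramp components exist under project_dir."""
    root = Path(project_dir)
    return {step: (root / rel).exists() for step, rel in ON_RAMP.items()}


# Demonstrate on a throwaway project that only has data/ and a README
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "data").mkdir()
    (project / "README.md").write_text("# Crime data\n")
    for step, done in on_ramp_status(project).items():
        print("[x]" if done else "[ ]", step)
```

The unchecked items are simply the next things to add; none of them needs to block getting the data online.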

Future Directions

Currently working on a proposal for "datadevtools", a set of developer tools to facilitate sharing data
These tools could then be used to assess the "shareability" of data

Thanks

Karthik Ram
Miles McBain
Anna Krystalli
Daniella Lowenberg
ACEMS International Mobility Programme
Helmsley Charitable Trust
Gordon and Betty Moore Foundation
Sloan Foundation


Colophon

Slides made using xaringan
Extended with xaringanthemer
Colours taken + modified from lorikeet theme from ochRe
Header font is Josefin Sans
Body text font is Montserrat
Code font is Fira Mono
template available: njtierney/njt-talks


Learning more

paper (released soon)

talk

nj_tierney

njtierney

nicholas.tierney@gmail.com


End.

