"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
--Sherlock Holmes (The Adventure of the Copper Beeches by Sir Arthur Conan Doyle)
This talk represents work that is currently in progress with Karthik Ram:
"A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility"
Any feedback and questions would be much appreciated!
We need it to do our job
(not strictly true for theory, but you get what I mean)
Makes work transparent
Increases trust
Increases visibility
Independent validation
Reproducibility
It's often really available to the authors
Out of 160 randomly sampled BMJ papers:
Science introduced a clause requiring authors to provide data with their papers
Authors compared 204 papers before and after the clause for reproducibility
When you approach a PI for the source codes and raw data, you better explain who you are, whom you work for, why you need the data and what you are going to do with it.
I have to say that this is a very unusual request without any explanation! Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation.
We do not typically share our internal data or code with people outside our collaboration.
The code we wrote is the accumulated product of years of effort by [redacted] and myself. Also, the data we processed was collected painstakingly over a long period by collaborators, and so we will need to ask permission from them too.
Normally we do not provide this kind of information to people we do not know. It might be that you want to check the data analysis, and that might be of some use to us, but only if you publish your findings while properly referring to us.
Thank you for your interest in our paper. For the [redacted] calculations I used my own code, and there is no public version of this code, which could be downloaded. Since this code is not very user-friendly and is under constant development I prefer not to share this code
Our program [redacted] is available here [URL redacted] (documentation and tutorials were included)
If you go to [URL redacted], under the publications, I have a link to the GitHub repository. I don't know if I have all of the raw simulated data, but I certainly have the processed data used to make the plots. What do you need? All of the simulated data could of course be regenerated from the code.
Please find attached a .zip file called [redacted].zip that has the custom MATLAB [redacted] analysis code. If you run Masterrunfigureone.m this will generate several panels from the paper.
In the next email I will enclose the custom image analysis software. This can also be accessed from [URL redacted] where there is a manual and tutorial.
Plenty of research tells you it is important to share data
And that data sharing should be FAIR (Findable, Accessible, Interoperable, and Reusable)
But these don't actually tell you how to share data
There are indeed good reasons to not share data:
Privacy concerns (e.g., human subjects, locations of critically endangered species)
May put the authors at a competitive disadvantage (but data can be embargoed for reasonable periods of time)
"If you can't do something right, don't do it"
This ^^ is wrong - you can provide something, even if it is just simulated data.
Sharing data (in most cases) has a net positive benefit
It can feel like a wall or a mountain we need to climb.
These require special tools and knowledge.
It should instead be an "on-ramp"
This talk should hopefully get you started
project
└── data
    └── crime.csv
.csv, .tsv, .txt
data-raw
file - to be discussed)
.rda, .rds, .sav, .dta, .sas7bdat
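A minimal sketch of the first step, assuming the cleaned data lives in an R data frame: write it as plain `.csv` into a `data` folder and read it back to check it round-trips. The data frame contents and the temporary project path are placeholders for illustration, not the talk's dataset.

```r
# Hypothetical example data; column names and values are placeholders.
crime <- data.frame(
  suburb = c("A", "B", "C"),
  year = c(2017L, 2018L, 2019L),
  offences = c(120L, 98L, 143L),
  stringsAsFactors = FALSE
)

# Build project/data inside a temporary directory so nothing is overwritten.
project_dir <- file.path(tempdir(), "project")
data_dir <- file.path(project_dir, "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Write a plain-text CSV without R's row names, which are not part of the data.
csv_path <- file.path(data_dir, "crime.csv")
write.csv(crime, csv_path, row.names = FALSE)

# Read it back to confirm the file can be opened without the original session.
crime_roundtrip <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Plain text is used here because it can be opened without R at all, which is the point of sharing it.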
project
├── data
│   └── crime.csv
└── README.md

.md allows you to take advantage of markdown, making it easy to insert lists, tables, and links.

project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
└── README.md
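One way `crime-dictionary.csv` could be produced is sketched below, assuming a data dictionary means one row per variable with its name, storage class, and a human-written description. That layout is an assumption for illustration, and the example data and descriptions are placeholders.

```r
# Hypothetical example data; column names and values are placeholders.
crime <- data.frame(
  suburb = c("A", "B", "C"),
  year = c(2017L, 2018L, 2019L),
  offences = c(120L, 98L, 143L),
  stringsAsFactors = FALSE
)

# One row per variable: name and class come from the data,
# descriptions are written by hand.
dictionary <- data.frame(
  variable = names(crime),
  class = unname(vapply(crime, function(x) class(x)[1], character(1))),
  description = c(
    "Suburb name (placeholder description)",
    "Calendar year of the record (placeholder description)",
    "Number of recorded offences (placeholder description)"
  ),
  stringsAsFactors = FALSE
)

# Save next to the data file; a temporary directory is used here for safety.
dictionary_path <- file.path(tempdir(), "crime-dictionary.csv")
write.csv(dictionary, dictionary_path, row.names = FALSE)
```

Generating the names and classes from the data keeps the dictionary in step with the file, while the descriptions still need a person.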
project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   └── crime-raw.dat
└── README.md
data-raw

project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
└── README.md
clean-crime.R
other-steps.md
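A rough sketch of what a script like `clean-crime.R` might contain, assuming the raw file is whitespace-delimited text with a header row. The raw format, the column names, and the cleaning steps are all assumptions for illustration; the real `crime-raw.dat` may look quite different.

```r
# Set up a throwaway project layout in a temporary directory.
project_dir <- file.path(tempdir(), "project-clean")
dir.create(file.path(project_dir, "data-raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "data"), showWarnings = FALSE)

# Stand-in for the raw file (hypothetical contents).
raw_path <- file.path(project_dir, "data-raw", "crime-raw.dat")
writeLines(
  c("Suburb Year Offences", "A 2017 120", "B 2018 NA", "C 2019 143"),
  raw_path
)

# Read the raw data exactly as delivered; never edit the raw file by hand.
raw <- read.table(raw_path, header = TRUE, stringsAsFactors = FALSE)

# Illustrative cleaning steps: consistent lower-case names,
# and drop records with a missing offence count.
names(raw) <- tolower(names(raw))
clean <- raw[!is.na(raw$offences), ]

# Write the cleaned data to the data folder as plain CSV.
clean_path <- file.path(project_dir, "data", "crime.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Keeping the steps in a script means anyone can regenerate `data/crime.csv` from `data-raw`. Steps that cannot be scripted can be written up in `other-steps.md`.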
project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md
└── LICENSE
use_cc0_license()
use_ccby_license()
project
├── data
│   ├── crime.csv
│   └── crime-dictionary.csv
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md (reference DOI)
├── CITATION
└── LICENSE
@software{housing-data,
  author    = {Nicholas Tierney},
  title     = {njtierney/melb-housing-data: Added LICENSE.md file},
  month     = feb,
  year      = 2019,
  publisher = {Zenodo},
  version   = {1.0.1},
  doi       = {10.5281/zenodo.2575545},
  url       = {https://doi.org/10.5281/zenodo.2575545}
}
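If you would rather generate the `CITATION` file than type it, here is a sketch using base R's `bibentry()` and `toBibtex()` from the utils package. One swap to note: `bibentry()` only accepts standard BibTeX entry types, so this sketch uses `Misc` in place of the `@software` type shown in the entry. The output path is a temporary file chosen for illustration.

```r
library(utils)

# Same fields as the Zenodo entry, with the entry type swapped to "Misc"
# because bibentry() does not accept "software".
cite <- bibentry(
  bibtype = "Misc",
  key = "housing-data",
  author = person("Nicholas", "Tierney"),
  title = "njtierney/melb-housing-data: Added LICENSE.md file",
  month = "feb",
  year = "2019",
  publisher = "Zenodo",
  version = "1.0.1",
  doi = "10.5281/zenodo.2575545",
  url = "https://doi.org/10.5281/zenodo.2575545"
)

# Convert to BibTeX text and write it out as a plain CITATION file.
citation_lines <- toBibtex(cite)
citation_path <- file.path(tempdir(), "CITATION")
writeLines(citation_lines, citation_path)
```

Copying the entry Zenodo provides works just as well. The aim is only that people can cite the data without hunting for the details.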
project
├── data
│   ├── crime.csv
│   ├── crime-dictionary.csv
│   └── metadata
│       ├── access.csv
│       ├── attributes.csv
│       ├── biblio.csv
│       ├── creators.csv
│       └── dataspice.json
├── data-raw
│   ├── crime-raw.dat
│   ├── clean-crime.R
│   └── other-steps.md
├── README.md (reference DOI here)
├── CITATION
└── LICENSE
dataspice or codebook
Now that you've created your data folder, you need to get it somewhere online
Two options I would like to discuss:
You can also link Zenodo with GitHub
Zenodo issues a new DOI at every "release" (helps avoid managing many moving pieces)
See the GitHub article on making your code citable (thanks to Arfon Smith)
"Data only" R packages (e.g., nycflights13, eechidna, other Australian datasets)
Pros
Cons
Some example data journals:
End.