Data are good. Data we can share are better.
It’s easy to hold onto data we’re using for research. Despite the “security breaches” which fill news outlets with repetitive stories and warnings, the data we tend to use in academics can typically be kept, quite securely, on a password-protected desktop in our locked offices, with little fear they’ll make it into the hands of ne’er-do-wells. Certainly this safety is still important—so much so that most institutional review boards require a specific statement regarding how we’ll handle our data to ensure it remains safe. “Protected health information” needs to be kept confidential, with hefty fines and public shaming for those who fail to adequately protect it.
There is, however, more than logistics and patient confidentiality that makes it easy to hold our data in near-secrecy. Academic surgeons are expected to be “academically productive”, that is, to publish papers and perform original research. While an ever-increasing number of authors are choosing to release their raw data as supplements to their articles, we still enjoy re-using some of those hard-earned collections for further research, updated or expanded studies, finding new (unpublished) correlations and drawing new conclusions. We release data only to support our claims, to enhance the reproducibility and (presumably) validity of our studies. Given that so much of our data can still be tied to individual patients, with the necessary concerns about their privacy, it is reasonable to avoid releasing a great deal of it.
There’s been plenty of discussion about “big data”, about how that concept is changing the medical research landscape, about how we can use data “instead” of experimentation to get more rapid results. While I’ve got my own concerns about claims that this is “changing science” (see a brief Twitter conversation sparked by my friend Niraj Gusani), it’s an understandably enticing idea, one that likely can lead us to more inclusive and rigorous retrospective studies and big-picture analyses.
On the other hand, I think there’s a space that’s currently poorly filled. I’m calling it “average data”. (No, that wasn’t my first choice, but “small data”, “little data”, and “medium data” all already seem to be taken.) Average data are those we work with normally. Those we use to do our typical simple retrospective evaluations. Those we use to come up with our typical articles. Not those we use for groundbreaking Framingham-type research, not those we analyze for years or those to which institutions subscribe, just average data. Sure, average data are there, and we’re frequently using them, dusting off a simple Excel spreadsheet we made last year that probably just needs one more column added to it in order to give us what we need for a new abstract submission. We may be able to get them from our partners, our colleagues, our mentors.
Why can’t I get them from you? No, I can’t get the individual patient-level data; there are too many concerns about privacy. Summary data don’t do me much good in finding new correlations, testing new hypotheses, forming new conclusions. But for those (admittedly rare) times when information is already publicly available, and I’ve done some massaging of it to fit a statistical format, or I’ve combined it with some other data set (also publicly available) to build something that might be worth studying, shouldn’t that be available to you?
Even better, shouldn’t it be available not as a result of a study I’ve published, something you can use to re-run my numbers and ensure I didn’t make mistakes, but rather as something you can use to write a paper at the same time I’m writing my own paper?
Like most academic surgeons, I’m working on several different projects at the moment. Most of those projects have databases or spreadsheets of patient information that I can’t legally share. But for once, I have a project that I’m working on which uses only publicly-available data. I’ve still had to do the massaging I mentioned, still had to figure out how to process those data, how to combine them into (relatively) simple files, and how to clean up the formatting of those files. I’ll be trying to do a few things with those individual data sets over the next few weeks, things which probably aren’t really journal-worthy on their own, and I’m not hurting for articles in low-impact publications, so I don’t feel the need to grab every little reference I can. Therefore, I’ll just talk about some of those simpler combinations on this blog while I’m working on the bigger (and presumably more interesting) project as well. And I’ll be posting the actual data sets themselves, updated as needed, with the ability to easily track changes to them and with written descriptions of how I obtained the data and exactly what “massaging” I’ve done.
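To give a flavor of what that “massaging” looks like in practice, here’s a minimal sketch in Python. The file contents, column names, and join key below are entirely hypothetical, invented for illustration; they are not the actual data sets from my project. The idea is just this: strip stray whitespace out of each table, then join two public tables on a shared identifier so they land in one clean file.

```python
import csv
import io

# Hypothetical example data, inlined as strings for illustration.
# In practice these would be downloaded public files, each with its
# own inconsistent formatting (note the stray spaces below).
hospitals_csv = """hospital_id,name,state
H001,General Hospital , NY
H002,County Medical,CA
"""

volumes_csv = """hospital_id,year,procedures
H001,2014,120
H002,2014,95
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, stripping stray whitespace."""
    return [
        {key.strip(): value.strip() for key, value in row.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]

# Index one table by its key, then join the other table against it.
hospitals = {row["hospital_id"]: row for row in load_rows(hospitals_csv)}

combined = []
for row in load_rows(volumes_csv):
    info = hospitals.get(row["hospital_id"], {})
    combined.append({**info, **row})  # one merged record per volume row

for record in combined:
    print(record)
```

Nothing here is clever; the point is that even this trivial cleaning and joining takes time to get right, and there’s no reason everyone interested in the same public data should have to repeat it.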
One great place people are posting everything from huge projects to tiny data files is GitHub, and there is a GitHub repository where I’ll be hosting these files. I’ve saved you the trouble of figuring out the copyright issues associated with the data sets I’ve used, and I’ve written the scripts and described the steps I’ve taken to obtain the data in the first place and then get them into the form they’re in now. The data presentation is raw—don’t expect a nicely presented “Results” section from the repository—but it’s there, it’s free, and you can do with it what you wish. Hopefully this is a modest start to some interesting collaborations and conclusions.
Yes, I’ll write more about what exactly we’re able to find with these data and what (if anything) they mean, but for now just consider sharing what you can. I’ve got plenty of average data, and I’m guessing you do as well. It’s possible a bunch of average data won’t sum to above-average data, but why not try?