The ‘how’ of sharing research data is codified by the FAIR principles, but deciding ‘what’ should be shared is much less settled. I recently wrote a post for the Scholarly Kitchen about the ‘what’ question, arguing that research articles should be our focal unit for sharing research data. A consensus on this would allow researchers, funders, publishers, and other policy-makers to work in concert to ensure that all of the data underlying each article are shared.
Here I want to talk about how DataSeer fits into and enables this approach. Deciding to focus on getting all the data associated with articles into the public sphere is one step, but we also need to make this process both easy and enforceable. Authors need to know which datasets from their articles they are expected to share, what format those data should be in, and which repository is most appropriate. Policy-makers need to know which data-sharing steps should be taken for each article, so that they can hold authors accountable for their data-sharing actions.
It’s impossible to formulate a generally worded policy that achieves all of the above. Instead, we have to generate this information separately for each individual article. Some journals employ specialist data curators to ensure that all of the data from an article are shared, but this process is slow and expensive. Human curators also can’t scale to the more than 2 million articles published each year, yet that is the scale we need to reach if we want to make all research more reproducible.
This is why we made DataSeer: determining which datasets the authors have collected is a task well suited to AI-powered Natural Language Processing. From there we guide the authors through the process of sharing each dataset, then report back to the relevant stakeholder (journal, funder, or institution). DataSeer is quick, cost-effective, and (like most AI tools) highly scalable.
Other reproducibility tools in this area determine whether the authors have shared any data, but they do not work out how many datasets the article contains in total. Without that total, it’s impossible to tell whether the authors have shared all of their data, only a subset, or had none to share in the first place.
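To make that distinction concrete, here is a minimal sketch of the logic (this is not DataSeer’s actual implementation; the function, category names, and counts are invented for illustration):

```python
# Hypothetical illustration: why the total dataset count matters.
# 'detected' is the number of datasets an article describes (e.g. found
# by text mining); 'shared' is how many of those are publicly available.

def sharing_status(detected: int, shared: int) -> str:
    """Classify an article's data-sharing completeness."""
    if detected == 0:
        return "no data to share"
    if shared == 0:
        return "none shared"
    if shared < detected:
        return "partially shared"
    return "fully shared"

# A tool that only counts shared datasets cannot distinguish an article
# with no data from one that withheld everything; knowing 'detected' can:
print(sharing_status(detected=0, shared=0))  # no data to share
print(sharing_status(detected=3, shared=0))  # none shared
print(sharing_status(detected=3, shared=2))  # partially shared
```

The key point is simply that completeness is a ratio, and the denominator (total datasets per article) is the piece most tools are missing.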
The other big advantage of focusing on the total number of datasets is that DataSeer can act at two crucial stages of the publication process. First, we can guide authors through the data sharing process for a first draft of a manuscript, a preprint, or an article accepted for publication at a journal.
Second, we can help journals, funders, or institutions assess the proportion of datasets that have been shared for a corpus of articles, and (if necessary) prompt further action from the authors. We’re also developing DataSeer’s capabilities with Data Management Plans, which would bring the step of identifying what data should be shared to the start of the research cycle.
In summary, the most effective way we can improve the sharing of research data is to agree on a fundamental unit, and the natural choice is the article. Stakeholders can then focus their effort on ensuring that all the data associated with that unit are shared.
DataSeer is the most advanced and complete tool for making this happen: our algorithm works out the full list of datasets associated with an article and guides authors through sharing each of them, so that everyone can then assess the proportion of datasets that have been made public. Please contact us if you’d like to know more.