3 Publication phase
Although we typically think of a journal or conference paper as the key output of a research project, most research in fact generates a multitude of information, including data, code, study protocols, insights not included in the paper, and so on. Given the huge amount of work that goes into research, there is great value in making sure that as many of these outputs as possible are available to as many people as possible. We won’t make the argument here for why this is worth doing (though there are many good arguments for doing so; see Links below). Rather, this guide will describe best practices, and encourage you to adopt as many of them as possible.
The five key areas that will be discussed in this section are:
- Open Data: making the raw data available for further research and replication
- Open Source Code: making the analysis pipeline transparent and available for others to borrow or verify
- Reproducible Environments: going beyond making the data and code available, so that others can easily re-run the analysis and obtain the same results
- Open Publication Models: publishing such that anyone can see the scholarly output associated with the work
- Documenting Processes and Decisions: making it clear to interested parties not only what was done and how, but also why, by mechanisms such as open lab notebooks
All of these are important, but each can be adopted independently of the others. Fortunately, there is now a huge range of tools and resources available for facilitating this work. At the end, we also provide some examples, and links to additional resources.
3.1 Making Data Available
As one would expect for data science, the data itself is of central importance for transparent and reproducible research. Depending on whether you spend most of your time in a wet lab or a computer lab, you might have very different notions of what counts as data; here we will emphasize general principles and strategies that are widely applicable.
3.1.1 Vision for making data available
The vision for sharing data is that, to the extent possible (while respecting concerns such as privacy and legality), researchers should make all data relevant to a project publicly available online. It should be stored in a repository that is accessible to all, that will endure over a long period of time, and in a format that is readable without commercial software; in many cases, the ideal is a human-readable format such as CSV or JSON. The data should be licensed and paired with documentation that explains what it is and how to use it. In most cases, the raw data should be shared (i.e., prior to any preprocessing, such as smoothing or outlier rejection), although it may also be helpful to share processed versions, along with details of what was done. Also make sure to provide something that people can cite to give you credit, either a DOI for the dataset, or a reference for the corresponding paper.
3.1.2 Essential considerations for sharing data
Location online: Since the whole point of sharing data is to allow others to access it, it is obviously important that you put the data somewhere that is accessible to everyone and stable over time. You could, for example, choose to host the data on your own website, but this has severe disadvantages, including the possibility that your website will go down, and the inability of others to easily submit feedback. In practice, you are much better off putting it in a central repository, which offers numerous advantages, such as persistence and findability. Some good places to consider include:
- GitHub is very popular for sharing code and version control, with excellent archival practices. It can also be used for sharing data, but the maximum file size is on the small side (100 MB). https://github.com.
- Harvard Dataverse is free to all and allows 1 TB of storage per researcher, with a maximum file size of 2.5 GB: https://dataverse.harvard.edu/
- Dryad is not free to all (typically used via institutional access) but allows for up to 300 GB per dataset: https://datadryad.org
- The Open Science Framework is another comprehensive platform for open science, as well as a good place to make data publicly available: https://osf.io
For a chart comparing several popular options, please see the Generalist Repository Comparison Chart.
Pro tip: For very large datasets (e.g., over 500 GB), or those that will be downloaded very frequently, you might want to consider paid alternatives, such as Amazon Web Services, which allow for a “requester pays” arrangement (e.g., https://arxiv.org/help/bulk_data_s3). Another option that could be considered would be to use torrents or some other distributed format, to help distribute the storage and transfer costs.
Amount of preprocessing: Different people might want to use your data in different ways. For some, they will just want to incorporate your final numbers into their own work. Others might want to re-analyze your data. As such, different audiences will prefer different amounts of preprocessing. Generally speaking, it is recommended to include the “raw” data (that is, the data prior to any preprocessing, such as smoothing, compression, or deletion of outliers for tabular data, or tokenization for text data); however, because others may want to make use of the preprocessing you have done, it may also be valuable to include multiple copies at different stages of preprocessing. This is largely a matter of judgement, but connects to the sharing of code and reproducible environments (see below). Ideally, others should be able to easily reproduce all the preprocessing steps such that they can obtain any stage of data starting from the raw files.
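As a concrete illustration, a small, re-runnable preprocessing script lets others regenerate every stage of the data from the raw file. The sketch below is a minimal Python example using pandas; the file names, the "value" column, and the outlier rule are hypothetical stand-ins for whatever your project actually uses.

```python
import pandas as pd

# Minimal sketch of a re-runnable preprocessing script (hypothetical names).
raw = pd.read_csv("data/raw/measurements.csv")

# Example preprocessing: drop incomplete rows, then remove extreme outliers.
cleaned = raw.dropna()
mean, sd = cleaned["value"].mean(), cleaned["value"].std()
cleaned = cleaned[(cleaned["value"] - mean).abs() <= 3 * sd]

# Save the processed version alongside (never overwriting) the raw file, so that
# any stage can be regenerated from the raw data by re-running this script.
cleaned.to_csv("data/processed/measurements_cleaned.csv", index=False)
```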
Documentation: Trying to make sense of someone else’s data can be extremely challenging. You can help them greatly by including a README file that explains what all the files in the repo are, how to load the data, and what the various fields correspond to; this may also connect with making code available (see below).
Datasheets for Datasets: One specific type of documentation that we suggest including is described in Datasheets for Datasets (https://arxiv.org/abs/1803.09010). This paper describes a series of questions that dataset developers should think about, as well as examples of datasheets developed for specific datasets. These questions cover aspects such as: Who created this dataset and for what purpose? Are there known errors or problems with the data? Does it contain information that might be sensitive? How has it been used so far, and are there any potentially harmful uses, or applications it should not be used for? We refer the reader to the original paper for further details, and encourage researchers to adapt the datasheet format to their specific needs.
Format: It is important to make sure that your data is available in a format that will be useful to others.
- Good choices are formats that don’t require any specialized (paid) software, and that are robust across operating systems. For relatively simple datasets, such as tabular data, good choices include .csv and .json formats. For example, don’t share a Microsoft Excel file (.xlsx); export it as a .csv file, such that it can be opened by many programs.
- If your data requires a specialized data format (such as for neuroimaging data), make sure to document what the format is, and how others can load the data.
Pro tip: Comma-separated value (CSV) files can cause difficulties when working with text data across platforms. JSON is a cross-platform format that is flexible and human readable. Although it is slightly less compact, it can be worth it for ease of use. For a nice way to store tabular data, convert each row to a JSON object and write it as a string to a text file, one row per line. Also, be sure to specify a widely-used encoding, preferably Unicode (UTF-8). – Dallas
Pro tip: Be especially careful when working with dates and times, as different programs may interpret this information differently. (In fact, the HUGO Gene Nomenclature Committee recently officially renamed several genes to avoid having them interpreted as dates by Microsoft Excel, e.g., MARCH1 became MARCHF1). One option is to explicitly break dates down into the smallest meaningful unit (e.g., storing them with separate fields for year, month, and day). – Dallas
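To make the two tips above concrete, here is a minimal Python sketch that writes the same (hypothetical) records both as one JSON object per line with an explicit UTF-8 encoding and dates split into separate fields, and as a plain CSV that virtually any program can open.

```python
import csv
import json

# Hypothetical records with dates broken into separate year/month/day fields.
records = [
    {"participant_id": 1, "response": "agree", "year": 2020, "month": 3, "day": 1},
    {"participant_id": 2, "response": "café", "year": 2020, "month": 3, "day": 2},
]

# One JSON object per line ("JSON lines"), with an explicit UTF-8 encoding.
with open("responses.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# The same records as a plain CSV, readable by virtually any program.
with open("responses.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
```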
License: Although we normally think of licenses as applying to code or other written materials, it is good practice to include a license with your dataset that explicitly permits use by others. Note that different licenses may be relevant to data as opposed to code. For datasets, the standard licenses differ primarily by whether they require attribution, whether they allow reuse for commercial purposes, and whether they require that derivative works keep the same license. The number of choices can seem overwhelming, but there are many useful guides out there. One example is the Creative Commons Zero license (CC0-1.0), which is highly permissive; the broader Creative Commons family provides variations that add particular restrictions (e.g., CC-BY requires attribution, CC-NC prohibits commercial use). Harvard Dataverse, for example, applies a CC0 license by default. To read further about licenses, refer to articles on data.world, the Open Knowledge Foundation, and UK Discovery.
Updates: There may be a point at which you want to make a change to the data that you are sharing, such as when an error is found. As such, it is very useful to make use of version control, such that it is possible to look at the data from different versions. A big advantage of storing your data on a site like GitHub is that it makes such updates transparent, and allows users to navigate back to any earlier version of the dataset. It also allows people to open “issues” if they run into problems, or want to suggest improvements or request updates. Whenever you do update the data, be sure to make absolutely clear what was changed and why. Also be sure to make it clear which version of the data was used in the official publication.
3.1.3 Challenges in making data available
Legality: If you have collected data via a third party, it is essential to check their requirements for redistributing such data. For example, the Twitter terms of service state that researchers are not allowed to share any data about individual tweets; rather, they are only allowed to share the tweet IDs. This means that anyone could in principle recollect the same dataset using these tweet IDs, while still permitting individual users to delete their tweets and thereby prevent future collection. Note that this means the dataset will not be truly stable over time; however, any alternative would violate the Twitter terms of service. It goes without saying that one should not make copyrighted material available without permission. Such considerations are also good to keep in mind when designing a study: plan data collection such that it will be possible to share the data at the end of the project.
Pro tip: When deciding whether or not data can be ethically shared, pay attention not just to the legal requirements, but also to community norms. For example, research by Casey Fiesler and Nicholas Proferes found that Twitter users were largely unaware that Twitter’s terms of service allowed their data to be used by researchers. (See also this Medium post). – Dallas
Privacy: In many cases, data may be sensitive and contain information that others would not want revealed. Typically this arises when one is working with data about people (demographic data, text data, etc.), although other reasons may arise (such as when working with information that could be dangerous if shared publicly). Although some people recommend de-identifying data by removing particularly sensitive information, such as addresses and social security numbers, researchers have now convincingly shown that almost any information could be sensitive in the proper context. In particular, it turns out that almost any information might be identifying when combined with other external data. The classic example of this is the Netflix prize, in which many people’s movie viewing choices were made identifiable by combining this public data with other public data from IMDb.
Pro tip: There is some exciting work being done on how to share private data without impinging on anyone’s privacy, including work in differential privacy. This space is still evolving however, so unless you are an expert or collaborating with an expert, it is better to err on the side of caution. You definitely don’t want to contribute to accidentally doxing someone online! – Dallas
Backwards compatibility: Especially for data stored in obscure formats, there is a risk that researchers in the future may not be able to access the data because they are not able to run the software required to open it. While this may not seem like a pressing concern, software actually exists in a complex ecosystem. If you assume access to a particular version of a piece of software, it may not be available for future versions of the operating system. Although there are solutions here, such as the emulation techniques used to run old video games, this is another reason to favor formats that are simple and ideally “human readable” (meaning that one can simply look at the data in a text editor and it will make sense), such as .csv and .json.
A note on sharing models: In many cases, it may also be useful or important to share trained models, either instead of or in addition to the raw data. Nearly all of the same considerations described above apply to sharing models as well. Indeed, it is important to remember that models can be thought of as a compressed representation of the data (along with prior assumptions and/or non-deliberate contributions, such as a random seed). This means that in some cases it may be possible to extract some or all of the original data from the model, and this should be kept in mind if the original data is sensitive or private. For additional recommendations regarding documentation to provide when sharing models, we recommend consulting Model Cards for Model Reporting (https://arxiv.org/abs/1810.03993).
3.1.4 Additional resources on data sharing
- For additional considerations, see the Primer on Data Management
- One popular framework for thinking about sharing data is FAIR (Findable, Accessible, Interoperable, Reusable). For more details on FAIR, please see resources on OpenAIRE.
- You can also search for more options on the Registry of Research Data Repositories.
3.2 Making Code Available
Closely tied to sharing data is making the code to analyze that data available as well. Even in cases where the data itself cannot be shared (such as for reasons of legality or privacy), there is great value in sharing whatever code you used to analyze that data. This not only allows others to see precisely how you analyzed it, but also makes your hard work available in a way that may be useful to others.
Indeed, for complicated data analyses, it is extremely difficult to precisely describe every single step that was taken, even with extensive supplementary material to a publication. As such, the actual code used to do the analysis serves as a kind of self-documenting description of what was done (though please don’t skimp on the actual documentation!). While sharing data can be as simple (in the most minimal case) as putting a single file online, sharing code generally requires some additional care in order to deal with documentation, licenses, dependencies, computational environments, etc.
Because making code reproducible is so important, we are dedicating an entire chapter to it, complete with examples in Python [FORTHCOMING].
3.2.1 Vision for making code available
The vision for sharing code depends to some extent on the nature of the project. For methodological work, this may entail releasing a package that can be used to perform a particular type of data analysis. For more substantively focused scientific research, on the other hand, the code shared might simply replicate the particular analysis presented in a paper, going from the raw data to the result obtained, without trying to generalize to other cases. In either case, the ideal outcome is that anyone can get access to the code, run it locally, and obtain the desired result, without excessive difficulty or cost.
3.2.2 Essential considerations for sharing code
Begin with the end in mind: Sharing code ideally begins with adopting the mindset, as early as possible, that you will make your code available to others. Many people fear embarrassment if their code is not up to some standard, but in practice everyone very much appreciates when code is available, even when it is messy (though clean, well-documented code is much preferred!).
Methods and tools: A plan to make your code public will also inform your choice of methods. While it is technically possible to do statistical analysis in Excel, it is not really set up for reproducibility as a core element the way that something like R is. Python also lends itself nicely to reproducibility, when used with care. Other software systems, such as Stata and Matlab, may also provide reproducibility options through the sharing of scripts, but are less preferred because they may not be freely available to all. Much may depend here on the complexity of the analysis involved.
Pro tip: Both Python and R provide extensive support (via libraries) for statistical analysis, visualization, interactive notebooks, optimization, and text processing. R still maintains a slight advantage when it comes to advanced but off-the-shelf statistical analyses (e.g., generalized hierarchical mixed effects models). Python, on the other hand, has better integration with deep learning frameworks, such as torch. Although many attempts have been made to make these two frameworks interoperable, these attempts tend to be somewhat brittle, and may go out of date quickly. – Dallas
Consider the user: When deciding what to share, and how, think about who might want to use this code. If you are sharing a new method, it is likely someone who wants to apply this method to their own data. They will want to know how to install your software (if required), how to run it, what the options are, and will likely be happiest if you provide an example illustrating how it works. On the other hand, if you are sharing analysis code for replicating a result, people will likely want to re-run the exact same analysis you used, and perhaps try variations on it. They will want to know what is required to obtain the exact same results (e.g. precise versions of dependencies, access to the right data, etc.), and will also likely want to know the reasoning behind all choices that were made (making documentation all the more essential). In either case, it is good to assume that the user is less familiar with the project than you are, and will likely be confused unless you provide landmarks and explanations to help guide them.
Location online: As with data, you need to decide where to store your code online, and again there are multiple good options:
- GitHub, which is owned by Microsoft, is by far the most popular site for sharing code, though many alternatives exist, such as BitBucket. Different alternatives often have different free or academic plans, which may be worth investigating. A major advantage of sites like GitHub is the integration of code, data, versioning, and “issues”, such that others can post questions, suggest improvements, or submit updates to be incorporated into the main branch. Using the full potential of git can be quite complicated, but the basic usage of creating and uploading code to a repo is actually quite straightforward. For an introduction to sharing code via GitHub, we recommend the relevant chapter of The Turing Way.
- CodaLab is a Stanford initiative that allows sharing code and data in the form of “worksheets”, which are like “executable papers”.
- Many additional alternatives exist (see the sites listed under data sharing in Section 3.1.2 for more)
License: It is important to include a license which covers the legal usage of your code. Several open-software licenses have become more or less standard, and are all quite similar, though differ in a few key ways.
- The MIT license is one of the most permissive, and allows repurposing your code without attribution, even for commercial purposes.
- The Apache 2.0 license is similar, but places more explicit restrictions on trademarking or patenting work derived from your code.
- The GNU GPLv3 license is more restrictive, and requires that new distributions of your code must document changes, make the source code public, and be released under the same license.
GitHub provides standard template licenses for these and other types. For more advice on choosing a license, please refer to the Open Source Initiative, choosealicense.com, or the relevant chapter of The Turing Way. Obviously you should also pay attention to the licenses attached to any code you incorporate into your own work, which may dictate which license you have to use!
Documentation: As with sharing data, documentation is essential. In addition to an overall README file that explains what the project is, provides a link to the paper, etc., it is good to provide comments in each file to explain as much of the code as possible. Typical documentation should include:
- A brief description of what this repository is;
- Installation instructions that anyone can follow;
- A list of required dependencies (see below) with version numbers (e.g., in a requirements.txt for python);
- Description of how the software can be used;
- A brief (working) illustrative example, complete with an example dataset;
- A reference to any accompanying publications or related resources.
3.2.3 Challenges in sharing code
Software engineering: There is a reason that some people specialize in software engineering – there is an enormous amount that is known about how to develop software. Although much of it could be relevant to writing and sharing research code, it is also possible to focus on the parts that matter most. For those who are interested, The Turing Way again provides some good starting points.
Data that cannot be shared: Even if the data itself cannot be shared, there is still great value in sharing the code, as this will help others see exactly what was done; in this case, it may be very helpful to provide an artificial dataset that is similar in structure to the true data, so that others can easily run your code. Creating a simulated dataset that mimics many of the features of your real data, but with known parameters, can also be a useful tool for debugging your analysis. For starting points on this approach, see this blog post from Andrew Gelman.
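As a minimal illustration of the simulation idea, the Python sketch below generates data from a simple linear model with known parameters and checks that the analysis recovers them; the parameter values and the linear model are purely hypothetical stand-ins for your real analysis.

```python
import numpy as np

# Simulate data from a simple linear model with known ("ground truth") parameters.
rng = np.random.default_rng(0)
n, true_intercept, true_slope, noise_sd = 200, 1.5, -2.0, 0.5
x = rng.normal(size=n)
y = true_intercept + true_slope * x + rng.normal(scale=noise_sd, size=n)

# Running the analysis on the simulated data should approximately recover the
# known parameters; a large discrepancy suggests a bug in the pipeline.
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
print(f"estimated slope={slope_hat:.2f}, intercept={intercept_hat:.2f}")
```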
Hyperparameters: As advanced methods from machine learning are more widely adopted, issues related to hyperparameter selection have taken on a new urgency. For some methods, such as ridge regression, the main hyperparameter might be something as simple as regularization strength. For modern deep learning architectures, on the other hand, the number of hyperparameters is virtually limitless. If you just want people to be able to replicate an analysis, it may be sufficient to provide the exact hyperparameters used (including a random seed which guarantees reproducible results, though see below). Much better, however, is to provide some rationale for why these values were chosen, or even code to reproduce the selection (e.g., by running random search). This is a rich topic which we explore in the analysis section.
Pro tip: What is a hyperparameter? There is no single definition that covers all situations. Conceptually, it is some part of the specification of a model that could reasonably take on multiple different values (such as regularization strength). In practice, we generally think of it as something we might tune in order to try to improve the model (including the learning rate or other parameters of an optimization algorithm, or even the optimization algorithm itself!). In theory, as Maclaurin, Duvenaud, and Adams have pointed out, even the dataset itself can be thought of as a kind of hyperparameter, though, as they note, this risks a kind of “philosophical vertigo”. – Dallas
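At a minimum, the exact values used for a reported run can be saved alongside the results. A minimal Python sketch follows; the particular hyperparameters and the output file name are hypothetical.

```python
import json

# Hypothetical hyperparameters; record whatever your model actually uses,
# including the random seed, so a single reported run can be reproduced exactly.
hyperparameters = {
    "learning_rate": 1e-3,
    "l2_penalty": 0.01,
    "hidden_units": 128,
    "num_epochs": 20,
    "random_seed": 42,
}

with open("hyperparameters.json", "w", encoding="utf-8") as f:
    json.dump(hyperparameters, f, indent=2)
```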
Dependencies: Almost all code will require some sort of dependencies in order to run. For example, an R script requires that the user have R installed. Although environments such as R are relatively stable over time, there are occasionally updates to environments and packages that break backwards compatibility, especially for modules under active development, such as Torch and TensorFlow. It is highly recommended to provide a list of all dependencies used (e.g., packages in Python or libraries in R), and the exact versions you used in running your analysis. For Python, this could take the form of a requirements file, which lists the version of each package in the environment (e.g., tensorflow==0.12).
Pro tip: Anaconda is a great option for python which provides both a package manager, named “conda”, and a basic slate of packages for scientific computing. Just like pip and virtualenv, conda allows you to create a new environment for your project, install relevant packages to that environment, and then export that environment in a way that others can easily recreate. For an introduction to conda see this introduction or the conda documentation. – Dallas
Computing environment: In some cases, even providing the exact dependencies will not be sufficient, as results can depend even on hardware considerations, such as which GPU was used. At a minimum, it is good to document such choices if they might be relevant, and we provide more detail on this in the Reproducible Environments section below.
Randomness: Although many simple analyses may be purely deterministic, for a lot of work related to machine learning, there are many ways in which randomness can creep into the code. It can come in through the initialization of weights in neural networks, in partitioning the data into train and validation data or in shuffling the data, as well as other considerations. Although there is great value in exploring how your results might vary across this randomness, it is also good to make sure that it is possible to reproduce one set of results exactly. This is typically done by setting a random seed and sharing it. Note however, that this can sometimes be tricky. For example, in python, you might need to set a seed for both the “random” package and “numpy” (if using both) and/or make sure these get propagated to other frameworks such as torch. It’s always good to check to see if others are able to reproduce your results exactly using different environments.
Pro tip: Recent work has shown that both the random seed and the ordering of the data can make a massive difference to the performance of complicated NLP models such as BERT. When working with deep learning, it is worth thinking about whether you are making a claim about a particular instantiation of a trained model, or about the expected performance of a family of models. The type of claim you want to make should dictate the type of experiments you choose to run. Above all, make sure to show your work! – Dallas
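A minimal sketch of fixing the seeds mentioned above, assuming an analysis that uses Python's built-in random module, NumPy, and PyTorch (drop whichever libraries you do not use):

```python
import random

import numpy as np
import torch  # only needed if your analysis uses PyTorch


def set_seed(seed: int = 42) -> None:
    """Fix the sources of randomness we are aware of; others may still remain."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy's global RNG
    torch.manual_seed(seed)           # PyTorch CPU (and default CUDA) seeding
    torch.cuda.manual_seed_all(seed)  # explicit seeding for all GPUs, if present
    # Note: fully deterministic GPU results may require additional settings.


set_seed(42)
```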
3.3 Reproducible Environments
Even with the most beautifully written code, users will not necessarily be able to successfully run it and obtain an identical result unless they can recreate the same computing environment that was used. In some cases, the version of a software package can make a big difference to the outcome. In other cases, even the computer hardware that was used can matter. While recording the versions of all packages used is a step in the right direction, an even more comprehensive solution is to package up the entire environment using something like Docker, or to use online computation, such as Google’s colaboratory notebooks.
3.3.1 Vision for reproducible environments
For any code that is shared, authors should make it as easy as possible for others to run it and obtain the same result. This includes availability of software, path dependencies, software dependencies, and general computing environment. Ideally, this process should involve testing to make sure that others are able to recreate one’s results and/or a system for users to report errors that they encounter.
3.3.2 Essential considerations for reproducible environments
Software availability: The most obvious consideration for reproducible environments is the software that is used. If programs are written in commercial packages, like SPSS, Matlab, or Tableau, this means that only researchers with access to those packages will be able to use the code that is shared. Much better is to use freely available tools, like python, R, or other mainstream programming languages.
Path dependencies: During software development, it is often convenient to write explicit file locations into the code. However, this will not work for others who have files in different locations. For example, if you tell your script to load a file from an absolute location under /Users/ that exists only on your machine, the code will break for anyone whose files live elsewhere. Instead, prefer paths specified relative to the project directory.
Pro tip: Note that the way of specifying paths differs between operating systems, such as Mac/Linux and Windows. One option to avoid this problem is to use system-agnostic tools to specify paths, such as Python’s os.path.join() function. – Dallas
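For example (a minimal sketch; the directory and file names are hypothetical):

```python
import os
from pathlib import Path

# Build paths relative to the project, rather than hard-coding absolute paths,
# so the same script runs on Mac, Linux, and Windows.
PROJECT_ROOT = Path(__file__).resolve().parent
data_file = PROJECT_ROOT / "data" / "raw" / "survey.csv"

# The os.path.join() approach mentioned above works equally well:
data_file_alt = os.path.join("data", "raw", "survey.csv")
```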
Package dependencies: Although we often think of it as relatively static, software is constantly changing and being updated. Even packages of central importance to researchers, such as python’s numpy are constantly being upgraded with new features and bug fixes. Many of these can be benign, but some can be “breaking” changes, such that people using older versions will no longer obtain the same result. Fortunately, most older versions of packages remain available. So, a minimal solution is to record the packages that you used when running your experiments. Others can then use this information to create the same environment. With python, you can do this by developing and running your code in a project environment, using either pip or conda, and exporting that environment as a requirements file.
Pro tip: Python 2 has now reached the end of its life. If you are creating a new project in Python, make sure to use Python 3. – Dallas
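Beyond exporting a requirements file with pip or conda, it can also help to record the interpreter, operating system, and key package versions alongside your results. The following is a minimal Python sketch; the output file name and the particular packages listed are hypothetical and should match whatever your analysis actually imports.

```python
import json
import platform
import sys

import numpy as np
import pandas as pd  # list whichever packages your analysis actually imports

# Record the interpreter, operating system, and key package versions alongside
# your results, so others can recreate a comparable environment.
environment = {
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "pandas": pd.__version__,
}

with open("environment.json", "w", encoding="utf-8") as f:
    json.dump(environment, f, indent=2)
```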
Containers: An even more comprehensive solution is to use container systems, such as Docker. These solutions package up an entire computing environment in one file, such that anyone can easily run things in the exact same environment you used. Although not as widely used as they could be, it is likely we will see this sort of solution become the norm over time. For a good introduction to containers, see The Turing Way.
Online environments: Another alternative is to not have others use their local resources at all. This can be done, for example, by hosting your code in the cloud, either using something like Amazon Web Services, or a simpler solution, such as an interactive notebook shared through Google Colaboratory. Although there are open questions about how stable these will be over time, they do have the advantage that anyone can easily run your code without doing anything more than clicking in a web browser.
Pro tip: At the time of writing, Google Colaboratory even provides limited free access to GPUs! – Dallas
3.3.3 Challenges in reproducible environments
Deprecated software: Software has a tendency to go stale. In the worst cases, the tools you used may no longer be available after some period of time. For example, many machine learning papers were published based on code written using theano or early versions of tensorflow, and reproducing these results is now much more difficult as a result. The risk of this can generally be minimized by using mainstream, well-supported packages with active communities, such as python’s numpy or scikit-learn. The risk is generally highest when working at the cutting edge.
Maintenance: In some ways, the largest cost associated with sharing code or releasing software is that people will want you to help solve their problems! Many potential problems can be headed off by following the advice above (e.g. providing requirements files, etc.). However, if your code is popular, people may have requests for features they would like you to incorporate. Someone may even discover a bug. Fortunately, there are also good options here for migrating your code to a real open-source project. In particular, GitHub provides a number of features beyond just archiving and making your code available. It allows others to “fork” your repository, making their own copy of it which they can modify, while leaving your original intact. Anyone can “open an issue”, if they discover a problem or want to request an extension, and the GitHub system organizes these. Users can even submit “pull requests”, in which they are asking for changes they have made to be incorporated into your main codebase. While these advanced features are unlikely to apply to most projects, as usual, it is worth it to begin with the end in mind. Hosting your code on GitHub doesn’t require that you have any knowledge of these more advanced features, and easily allows your project to expand as necessary going forward.
Custom vs generic solutions: There are now multiple systems designed to facilitate the sharing of research materials, including runnable code, such as Binder, and the eLife Reproducible Document Stack. At this point, there is no one solution that will be best for all researchers, and some work may be required to determine which, if any, will work for you. However, we would also warn that there is some risk in attaching one’s research to a particular platform, which may or may not persist over the long term. Where possible, we advocate for keeping things simple – transparent, well documented code, raw data in a human-readable format, careful specification of dependencies, and storage in one or more well-supported locations (e.g. GitHub). For a review of the current landscape of infrastructure to support open and reproducible research, please see this paper.
3.4 Open Publication Models
Academic publishing is a complex ecosystem with a lot of moving parts. While the key role is still the dissemination of ideas, published papers accomplish many things, from providing focal points for discussion, to helping people get promotions. While no one system can adequately serve all these purposes equally well, it is especially important to keep the first of these in mind, and try to make your research available to as many people as possible. Regrettably, there are serious hurdles to this, but as in all areas, new innovations are shaking up this system and providing new and better options.
3.4.1 Vision for open publication models
The vision for open publication is simple: any paper you publish should be freely available to anyone on Earth. Unfortunately, this is not the norm; most papers end up in journals that can only be accessed by people with extremely expensive subscriptions (or a pay-per-article model that basically no one participates in). There are many practical and philosophical arguments behind this vision, but in line with our emphasis on the practical, we simply maintain that this is the right thing to do. In practice, there is a spectrum of options, which we describe below.
3.4.2 Publishing models
Traditional publishing: The traditional publishing model is the one we want to avoid. In this model, journals select which articles they will publish, and then only make those articles available to people who pay for access (typically through a university library). This may serve the authors, who can list the paper on their resume, and it may to some extent serve the community, as most researchers working in the area will likely have access through their universities. But it contributes to a broken system, and is no longer strictly necessary. Many good alternatives now exist, and the only hurdle to moving away from this system is the collective action problem of needing entire communities to make the move together.
“Open-access” publishing: The term “open-access” can have many meanings. When used by traditional publishers, however, it typically means a model by which authors pay additional fees, and publishers make the published paper freely available online. This is also the default model used by certain modern journals, such as Frontiers. Where possible (i.e., for researchers who have the funds available), this is preferable to the traditional model, though it is by no means the only one that should be considered. The Directory of Open Access Journals can be found at: https://doaj.org/
Preprints: Many journals will force people to pay for access to their journals, but still allow authors to share a “pre-print” or other copy of the paper on their website. If you are publishing in a paid journal, you should check to see what the rules are. If it is allowed, you can make a copy of your article freely available on your website (along with a link to the data and code!)
Preprint servers: One of the most exciting revolutions in science in recent decades has been the rise of preprint servers, such as arXiv.org. These systems allow anyone to upload papers, adding them to the permanent archive. arXiv.org also emails those who subscribe with daily updates about new papers that have been uploaded in each area. These systems receive support from various sources, such as universities and foundations, and allow anyone to upload a paper or read any paper they host, all for free. The downside is that preprint servers do not provide a peer review process or any space for comments from the research community (and so papers that are only published on arxiv may be seen as having less worth or credibility).
Pro tip: People sometimes speak of peer review as the “gold standard” for scientific publications, but it is important to remember that peer review is not a guarantee of quality, and entails its own biases. Peer review can help, in some cases, to provide useful feedback to authors or to filter out very low-quality work, but we should also not overrate its value or underestimate its downsides. – Dallas
Conference models: In some fields, such as computer science, conference publications exist as an important pillar alongside traditional journals. Details may differ, but the standard setup is that costs are covered by membership dues (paid by those who attend the conferences), and reviews are carried out by the community. As such, these conference publications are treated as interchangeable with journal publications for the purposes of promotion. The key advantage is that there is no cost to publish, and the papers are freely available to all. This is the model used by conferences such as NeurIPS and ACL. This is in many ways the optimal model, though it is unclear how easy it is to create such a system and make it sustainable.
Experimental models: Yet more systems are constantly being created and tried. New online journals have been started specifically to focus on null results and replications. Various comment sites exist, which allow people to have online dialogues about published work. Sites such as Distill.pub have embraced interactive papers, which embed animations or interactive widgets. Others have experimented with embedding data directly into papers in various ways. None is as robustly developed as any of the above systems yet, but they serve as an excellent reminder that we are not restricted to traditional models. The sky is the limit and anything is possible here!
Pro tip: For an excellent review of interactive publication models, have a look at this recent Distill.pub article. – Dallas
3.5 Documenting Processes and Decisions
Every data analysis involves decisions. Reporting the choices that were made is a big step towards reproducibility. However, this alone does not necessarily tell readers why those choices were made. For example, reporting the full set of hyperparameter values used will better enable others to reproduce your results, but it is also important to know how you arrived at those particular values (even if it was as simple as prior intuition). As a last section of best practice in reporting, we encourage readers to be transparent and open about the processes and decisions made in carrying out their research.
3.5.1 Vision for documenting processes and decisions
It is likely impossible to entirely document and/or recreate all of your decision making processes. However, to the extent possible, it would be ideal to create a supplementary document that describes the process by which key decisions were made. This might describe the process of model selection, or hyperparameter tuning. It might describe preliminary data analyses or pilot experiments. It might explain how you arrived at a particular criterion for removal of outliers, and so on. In the most fully-formed case, you can think of this as a kind of “lab notebook” for data science work. Having a record of how you arrived at various decisions might be important in the future, both for yourself and others!
3.5.2 Examples of documenting processes and decisions
Pilot studies: It is common in many fields to run a small pilot experiment before running the full experiment. This might be to verify that a protocol works, or to estimate power. It is rarely worth describing every aspect of this in the main paper, but it is important to document somewhere what the results were and how they informed the main study.
Study protocol or codebook: In annotation projects, it is common for procedures to evolve slightly as data collection proceeds. A codebook describes the guidelines given to annotators, including the overall goals as well as important clarifications for ambiguities that may arise. It is important to make such codebooks public, so that others can attempt to follow the same protocol, and to document the key moments at which the codebook was updated.
Model selection: An infinite range of models is possible for any particular dataset, and the reasons for using a particular model are often obscure. Sometimes there is an “obvious” or “default” model to use (e.g., linear or logistic regression). In other cases, we might try many models and choose one based on how well it fits the data, or on its performance on a validation set. This kind of choice is particularly important to report, as it may drastically influence the results. Readers should be aware if you tried many models and only reported what seemed to be the best. Alternatively, if there is some good reason for using a particular model, that is also useful for interested readers to know.
Hyperparameter tuning: Especially for machine learning models, there are often many hyperparameters that can be tuned for better performance (e.g. regularization strength, learning rate in optimization, etc.). Even the model itself can be thought of as a kind of hyperparameter, as can the random seed used to initialize the model and/or sort the data. There are also many processes for selecting hyperparameters, from random search, to grid search, to more sophisticated iterative methods. The more detail provided here the better. When extensive hyperparameter tuning is done, it may also be valuable to report expected validation performance (see arxiv.org/pdf/1909.03004.pdf).
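One lightweight way to document the selection process is to log every configuration that was tried, not just the winner. The sketch below is a hypothetical random search over regularization strength on toy data, writing the full trial log to a JSON file; the model, search space, and file name are illustrative only.

```python
import json
import random

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data standing in for your real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Random search over regularization strength, logging every trial.
rng = random.Random(0)
trials = []
for _ in range(10):
    c = 10 ** rng.uniform(-3, 2)  # sample C on a log scale
    score = cross_val_score(LogisticRegression(C=c, max_iter=1000), X, y, cv=5).mean()
    trials.append({"C": c, "mean_cv_accuracy": float(score)})

# Persist the full trial log, not just the winning configuration, so readers
# can see how the final hyperparameters were chosen.
with open("hyperparameter_search_log.json", "w", encoding="utf-8") as f:
    json.dump(trials, f, indent=2)

print("Selected configuration:", max(trials, key=lambda t: t["mean_cv_accuracy"]))
```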
Retrospectives: The idea of doing a retrospective analysis of published work is gaining popularity in machine learning. This might include revisiting assumptions, discussing limitations, or looking at things from a new perspective. For more examples, see ML Retrospectives.
3.6 Additional Resources
- The Turing Way: An excellent community-driven guide to reproducible data science.
- The Center for Open Science: A central organization devoted to open and reproducible research.
- The Open Source Initiative: Working to raise awareness and adoption of open source software.