Storage Meets DNA: How to Fit a Data Center into a Shoebox

By Eitan Yaakobi

When I tell people that I work on storage, it’s easy to explain why it’s important. Just think about all the things you want to store in your own personal life – your photos, movies, music, and e-mail. Then consider what enterprises store – their data, systems, and backups. Much of this data is stored in the cloud at data centers run by companies such as Amazon Web Services, Google and Microsoft.

Reportedly, more than 2.7 zettabytes of data exist in the digital universe, and 90% of it was generated in the past two years. Moreover, the global datasphere is predicted to reach 175 zettabytes by 2025!

Yet capacity limitations in existing storage solutions mean that the amount of storage available will not scale at nearly the same rate.

It’s not only the amount of storage that’s important. The way we store our data matters a lot. Pictures and videos archived today will last only about 10 years due to the limited lifetime of solid state drives, hard disk drives, and tapes – not the 50-100 years that much of our data needs to last.

It turns out that the solution to our storage problem may lie in the DNA that’s inside our own bodies. We each have DNA in every cell, and that DNA stores 3.5 billion characters. In fact, we each have more information stored in our bodies than exists in storage throughout the world today. With this sort of storage density, a data center’s worth of information could fit into a shoebox – and we could even read it in 100 years.

My research focuses on the information coding, storage and retrieval techniques, specific to the make-up of DNA, that will make this happen.

DNA As a Potential Solution

The potential for using macromolecules for ultra-dense storage was recognized as early as the 1960s [5]. DNA molecules, which may be abstracted as strings over the symbols {A,C,G,T}, stand out due to a number of unique properties:

  • Self-assembly potential – DNA has been successfully used as a building block of a number of small scale self-assembly based computers [14].
  • Stability – DNA can be recovered from 30,000 year old Neanderthals and 700,000 year old horse bones.
  • Capacity – a single human cell, with a mass of roughly 3 picograms, hosts DNA strands encoding 6.4 GB of information.

Building upon the rapid growth of DNA synthesis and sequencing technologies, several laboratories recently outlined architectures for archival DNA-based storage [1, 2, 3, 4, 9, 10, 20, 21], all of which had to use coding solutions to correct the errors introduced during synthesis and sequencing. The work on DNA-based storage has received significant attention both in the synthetic biology and storage communities and in the public media. Furthermore, current synthesis technologies do not generate a single copy of a requested strand, but thousands to millions of copies of each strand. All of these copies are stored in the DNA pool and a portion of them is sequenced when retrieving the information, so every DNA strand is read many times.

However, the most critical drawback in these systems is that the coding techniques used are off-the-shelf solutions that do not take into account the special nature of synthesis and sequencing errors. Furthermore, knowledge of the error behavior in DNA and of its channel model is very limited. As a result, all existing DNA storage systems could only show a proof of concept for storing data in DNA.

The goals of my work at the Technion are to provide, for the first time, a complete and rigorous characterization of the errors in DNA, and to design coding solutions that are specifically targeted to this error behavior in order to build a storage system with high capacity and durability, yet low cost.

How Would It Work?

DNA consists of four types of nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). A single DNA strand, also called an oligonucleotide (oligo), is an ordered sequence of some combination of these nucleotides. Single DNA strands can be synthesized chemically, and modern DNA synthesizers can concatenate the four DNA nucleotides to form almost any possible sequence. This process enables storage of digital data in the strands. The data can be read back with common DNA sequencers; the most popular ones use DNA polymerase enzymes, an approach referred to as sequencing by synthesis.
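To make the idea of storing digital data in strands concrete, here is a minimal sketch in Python. It assumes the simplest possible mapping of two bits per nucleotide (00 to A, 01 to C, 10 to G, 11 to T). The mapping and the function names are illustrative assumptions, not the encoding used by any of the systems cited in this article, all of which add error correction on top.

```python
# Assumed, illustrative mapping: two bits per nucleotide.
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}


def bytes_to_strand(data: bytes) -> str:
    """Encode bytes as a string over {A, C, G, T}, four nucleotides per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def strand_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_strand; assumes an error-free strand of length 4k."""
    bits = "".join(NT_TO_BITS[nt] for nt in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = bytes_to_strand(b"DNA")
    print(strand)  # CACACATGCAAC
    assert strand_to_bytes(strand) == b"DNA"
```

A single substituted, inserted, or deleted nucleotide would corrupt the decoded bytes in this naive scheme, which is exactly why dedicated coding is needed.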

Progress in synthesis and sequencing technologies has paved the way for the development of a non-volatile data storage technology based upon DNA molecules.

A DNA storage system consists of three important entities; see Figure 1. The first is a DNA synthesizer that produces the strands that encode the data to be stored in DNA. In order to produce strands with an acceptable error rate, the length of the strands is typically limited to no more than 250 nucleotides.

The second part is a storage container with compartments that store the DNA strands in no particular order.

Lastly, a DNA sequencer reads back the strands and transfers them back to digital data. The encoding and decoding stages are two processes external to the storage system. They convert the binary user data into strands of DNA in such a way that even in the presence of errors (the nucleotides in red in Figure 1), it is possible to recover the original binary data of the user.

DNA as a storage system has several attributes that distinguish it from any other storage system. The most outstanding one is that the strands are not ordered in the memory, so it is not possible to know the order in which they were stored. Usually, this constraint is overcome by using block addresses, also called indices, that are stored as part of the strand. Note that this overhead already forces the capacity of DNA storage to be strictly less than 2 bits per nucleotide. The unordered structure also prevents random access to the stored data: a given strand cannot be read from the pool on its own, so the proposed systems have to read the entire pool in order to retrieve even a single strand [1, 3, 4, 9, 10].
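The role of the index, and its cost, can be illustrated with a short Python sketch. It assumes a fixed-width base-4 index written at the start of every strand and an error-free pool; the helper names and the strand sizes are hypothetical and chosen only for illustration.

```python
import random

NUCLEOTIDES = "ACGT"  # alphabetical, so index strings sort in numeric order


def index_length(num_strands: int) -> int:
    """Smallest number of nucleotides giving every strand a distinct index."""
    length = 1
    while 4 ** length < num_strands:
        length += 1
    return length


def encode_index(value: int, length: int) -> str:
    """Write value as a fixed-width base-4 number over A, C, G, T."""
    digits = []
    for _ in range(length):
        value, digit = divmod(value, 4)
        digits.append(NUCLEOTIDES[digit])
    return "".join(reversed(digits))


def to_indexed_strands(payload: str, payload_len: int) -> list[str]:
    """Cut the payload into chunks and prefix each chunk with its index."""
    chunks = [payload[i:i + payload_len]
              for i in range(0, len(payload), payload_len)]
    width = index_length(len(chunks))
    return [encode_index(i, width) + chunk for i, chunk in enumerate(chunks)]


def from_unordered_pool(pool: list[str], index_len: int) -> str:
    """Restore the original order by sorting on the index prefix."""
    ordered = sorted(pool, key=lambda strand: strand[:index_len])
    return "".join(strand[index_len:] for strand in ordered)


def bits_per_nucleotide(index_len: int, payload_len: int) -> float:
    """Information rate when only the payload part carries 2 bits per symbol."""
    return 2 * payload_len / (index_len + payload_len)


if __name__ == "__main__":
    payload = "ACGTTGCA" * 5              # 40 nucleotides of user data
    pool = to_indexed_strands(payload, 8)  # 5 strands, 2-nucleotide index each
    random.shuffle(pool)                   # the container keeps no order
    assert from_unordered_pool(pool, 2) == payload
    print(bits_per_nucleotide(2, 8))       # 1.6
```

In this toy setting 2 of every 10 nucleotides are spent on the index, so the rate drops from 2 to 1.6 bits per nucleotide before any error correction is added.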

The Current State of DNA Storage

Storing data in DNA is not a new idea. The first large-scale experiments that demonstrated the potential of in vitro DNA storage were reported by Church et al., who recovered 643 KB of data [3], and Goldman et al., who accomplished the same task for a 739 KB message [9]. However, neither of these groups recovered the entire message successfully due to the lack of appropriate coding solutions to correct errors.

Later, in [10], Grass et al. successfully stored and recovered an 81 KB message, and Bornholt et al. similarly succeeded in storing a 42 KB message [2]. More progress in the amount of stored data was reported in [1] by Blawat et al., who successfully stored 22 MB of data. Recently, Erlich and Zielinski managed to store 2.11 MB of data with a high storage rate [4]. Lastly, in [16] the authors succeeded in storing 200 MB of data, thereby storing an order of magnitude more data than the previous experiment in [1]. Yazdi et al. [21] developed a method that offers both random access and rewritable storage, and in [20] the authors built a portable DNA-based storage system.

The microscopic world in which the DNA molecules reside induces error patterns that are fundamentally different from their digital counterparts. This distinction results from the specific error behavior in DNA and the method in which DNA strands are stored together.

Hence, to maintain reliability in reading and writing, new coding schemes must be developed, and first attempts at such solutions were already implemented in proof-of-concept storage systems, e.g. [1, 4, 10, 21]. In [8], we studied codes in the Damerau distance for error correction in DNA storage, and a similar work was carried out in [6] for codes in the asymmetric Lee distance. The fundamental limits of DNA storage systems were recently studied from an information-theoretic perspective in [11], and a related model has been investigated in [13, 18]. In our work [12], we set the coding theory foundations to correct errors inside DNA-based storage systems.

However, none of these methods have been experimentally tested to correct errors in DNA strands.

While the research on coding for DNA storage has received attention in the past five years, all existing solutions are either conventional and do not fit the specific error behavior in DNA, or focus on only one coding aspect. There is no research that combines a deep understanding of the errors in DNA with a matching combination of coding schemes designed to combat these errors and thereby increase the reliability of DNA as a storage solution.

How to Approach the Problem

In order to solve this problem, my research focuses on designing new coding schemes that are specifically targeted to the special structure of DNA storage and its error behavior. This challenging task requires a rigorous understanding of how information is stored in DNA-based storage systems and the limitations of this storage. Furthermore, it is also important to understand the types of dominant errors and why they occur.

I approach this research in the following way:

  • Storing information on DNA strands in a way that will allow it to be reconstructed back into the right order is an important task. During synthesis, every DNA strand is synthesized thousands to millions of times and a portion of the copies will be sequenced when retrieving information. Thus, the first task is to partition the read strands into clusters such that all strands in each cluster are copies of the same synthesized strand.
  • I will then focus on the reconstruction step needed to output the original strand. Each cluster consists of read strands, which are noisy copies of a synthesized strand, and these are used to recover that synthesized strand.
  • The output strands after clustering and reconstruction may still suffer errors – especially those caused during synthesis. Hence, every synthesized strand is either reconstructed without errors, reconstructed with errors, or not reconstructed at all. The latter case corresponds to a failure in synthesizing the strand, or to none of its copies being read during sequencing. These errors should be corrected by an error-correcting code designed from the DNA error characterization.
  • Since the errors in DNA are data dependent, the design of efficient constrained codes is another crucial component in the coding solutions and the final portion of this planned research.
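The clustering and reconstruction steps in the list above can be sketched in a few lines of Python. This is a deliberately simplified illustration under strong assumptions that are not part of the planned research: every read is assumed to carry an error-free index prefix, all reads in a cluster are assumed to have the same length, and only substitution errors are modeled, so a per-position majority vote suffices. Real sequencing data also contains insertions and deletions, which this sketch does not handle. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Optional


def cluster_by_index(reads: list[str], index_len: int) -> dict[str, list[str]]:
    """Group reads that share an index prefix (assumed to be error-free)."""
    clusters = defaultdict(list)
    for read in reads:
        clusters[read[:index_len]].append(read)
    return dict(clusters)


def majority_reconstruct(cluster: list[str]) -> str:
    """Per-position majority vote over equal-length reads.

    Ties are broken in favor of the symbol seen first in the cluster.
    """
    consensus = []
    for column in zip(*cluster):
        symbol, _count = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)


def recover(reads: list[str], index_len: int,
            expected: list[str]) -> dict[str, Optional[str]]:
    """Return one estimate per expected index, or None if no copy was read."""
    clusters = cluster_by_index(reads, index_len)
    result = {}
    for index in expected:
        if index in clusters:
            result[index] = majority_reconstruct(clusters[index])
        else:
            result[index] = None  # strand lost: an erasure for the outer code
    return result


if __name__ == "__main__":
    reads = [
        "AACGTACGT", "AACGTACGA", "AACGTACGT",  # copies of strand AA
        "ACTTGGCCA", "ACTAGGCCA", "ACTTGGCCA",  # copies of strand AC
    ]                                           # no copy of strand AG was read
    print(recover(reads, 2, ["AA", "AC", "AG"]))
```

The `None` entry corresponds to a strand that was not reconstructed, and a wrong majority would correspond to a strand reconstructed with errors; both cases are left to the error-correcting code in the third step.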

This upcoming work addresses fundamental new practical and theoretical questions in the fields of coding theory and synthetic biology, with impact on the research areas of information theory, storage, bioinformatics, and bioengineering.

I look forward to sharing the results of my upcoming research over time as we use DNA to disrupt the current storage paradigm.

References

[1] M. Blawat et al., “Forward error correction for DNA data storage,” Int. Conf. on Computational Science, vol. 80, pp. 1011–1022, 2016.

[2] J. Bornholt et al., “A DNA-based archival storage system,” Proc. of the Twenty-First Int. Conf. on Arch. Support for Prog. Languages and Operating Systems (ASPLOS), pp. 637–649, Atlanta, GA, Apr. 2016.

[3] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, Sep. 2012.

[4] Y. Erlich and D. Zielinski, “DNA fountain enables a robust and efficient storage architecture,” Science, vol. 355, no. 6328, pp. 950–954, 2017.

[5] R. Feynman, “There’s plenty of room at the bottom,” Engineering and Science, California Institute of Technology, vol. 23, pp. 22–36, 1960.

[6] R. Gabrys, H. M. Kiah, and O. Milenkovic, “Asymmetric Lee distance codes for DNA-based storage,” IEEE Trans. on Inform. Theory, vol. 63, no. 7, pp. 1–14, 2017.

[7] R. Gabrys and F. Sala, “Codes correcting two deletions,” submitted to IEEE Trans. on Inform. Theory, 2017.

[8] R. Gabrys, E. Yaakobi, and O. Milenkovic, “Codes in the Damerau distance for DNA storage,” IEEE Trans. on Inform. Theory, vol. 64, no. 4, pp. 2550–2570, Apr. 2018.

[9] N. Goldman et al., “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, vol. 494, no. 7435, pp. 77–80, 2013.

[10] R.N. Grass et al., “Robust chemical preservation of digital information on DNA in silica with error-correcting codes,” Angewandte Chemie International Edition, vol. 54, no. 8, pp. 2552–2555, Feb. 2015.

[11] R. Heckel, I. Shomorony, K. Ramchandran, and D. Tse, “Fundamental limits of DNA storage systems,” Proc. IEEE Int. Symp. on Inform. Theory, pp. 3140–3144, Aachen, Germany, Jun. 2017.

[12] A. Lenz, P. H. Siegel, A. Wachter-Zeh, and E. Yaakobi, “Coding over sets for DNA storage,” submitted to IEEE Trans. on Inform. Theory, available online: https://arxiv.org/abs/1812.02936, 2018.

[13] D. J. C. MacKay, J. Sayir, and N. Goldman, “Near-capacity codes for fountain channels with insertions, deletions, and substitutions, with applications to DNA archives,” unpublished manuscript, 2015.

[14] N. C. Seeman, “An overview of structural DNA nanotechnology,” Molecular Biotechnology, vol. 37, no. 3, pp. 246–257, 2007.

[15] Nanoporetech.com. The MinION device: Portable, real-time biological analyses, Available from: https://nanoporetech.com/products/minion.

[16] L. Organick et al., “Scaling up DNA data storage and random access retrieval,” bioRxiv, Mar. 2017.

[17] M. Qin, E. Yaakobi, and P. H. Siegel, “Time-space constrained codes for phase-change memories,” IEEE Trans. on Inform. Theory, vol. 59, no. 8, pp. 5102–5114, 2013.

[18] J. Sayir, “Codes for efficient data storage on DNA molecules,” Talk at Inform. Inference, and Energy Symp., Cambridge, UK. Mar. 2016.

[19] J. Wolf, “On codes derivable from the tensor product of check matrices,” IEEE Trans. on Inform. Theory, vol. 11, no. 2, pp. 281–284, 1965.

[20] S. H. T. Yazdi, R. Gabrys, and O. Milenkovic, “Portable and error-free DNA-based data storage,” Cold Spring Harbor Labs Journals, 2016.

[21] S. M. H. T. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, “A rewritable, random-access DNA-based storage system,” Nature Scientific Reports, vol. 5, no. 14138, Aug. 2015.