
The Nucleotide Walk Graph: a visualization technique that creates a nucleotide sequence portrait

on Sun, 10/09/2016 - 16:37

Introduction

Deoxyribonucleic acid, or DNA, is the hereditary material in humans and almost all other organisms. It contains the biological instructions that make each species unique. The information in DNA is stored as a code made up of four chemical bases, or nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T). Human DNA consists of about 3 billion bases, and more than 99 percent of those bases are the same in all people. The order, or sequence, of these bases determines the information available for building and maintaining an organism. See more details at the DNA factsheet, National Human Genome Research Institute, National Institutes of Health (NIH).

Techniques of scientific visualization and large-scale data analytics play a key role in understanding and making sense of millions of DNA sequences. Scientists are working on more effective ways to visualize DNA sequences that help develop a better understanding of how our DNA influences our lives.

This post describes the nucleotide walk graph, a simple visualization technique that creates a portrait of a nucleotide sequence. The portrait is useful for visually analyzing DNA sequences, facilitating comparison, and identifying changes over time or between strains of the same gene from different places. For illustration purposes, the nucleotide walk graph is applied to four strains of Zika virus from Aedes aegypti collected in Florida, USA.

The visualization technique

The nucleotide walk graph is a 2-dimensional (2D) graphical representation in which the nucleotide sequence is plotted base by base on a Cartesian plane, starting from the origin (x=0 and y=0). At each step the path moves one unit east (positive x-axis direction) for adenine (A), north (positive y-axis direction) for thymine (T), west (negative x-axis direction) for guanine (G), and south (negative y-axis direction) for cytosine (C). This is known as a base-4 walk, as described by Aragón Artacho in Walking on Real Numbers (Aragón Artacho FJ, et al., 2013). Starting from the origin, plotting the bases as connected points according to this algorithm generates a mesh in 2D space that is characterized by the distribution of nucleotide bases along the sequence. Color coding the connected points by nucleotide lets the path followed by the walk convey additional information about the content of the sequence, which is of value for sequence analysis.
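The walk described above can be sketched in a few lines of Python. This is only a minimal illustration of the stepping rule, not the code behind the visualization in this post; the names `STEPS` and `nucleotide_walk` are made up for the example, and skipping characters other than A, T, G, and C is an assumption.

```python
# Unit step for each base, following the rule in the text:
# A = east, T = north, G = west, C = south.
STEPS = {"A": (1, 0), "T": (0, 1), "G": (-1, 0), "C": (0, -1)}


def nucleotide_walk(sequence):
    """Return the (x, y) points visited by the walk, starting at the origin."""
    x, y = 0, 0
    points = [(x, y)]
    for base in sequence.upper():
        if base not in STEPS:
            # Assumption: ambiguous symbols such as N are skipped.
            continue
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


print(nucleotide_walk("ATGC"))
# [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
```

The short sequence ATGC walks around a unit square and returns to the origin, which already hints at how a path can revisit points it has reached before.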

Examination of the nucleotide walk graph of many sequences of the same gene can give visual clues of conserved and changed regions, and specific characteristics of the sequences.

Two main numerical measures can be derived from the graph: 1) the center of mass, which is determined by the averages of the x- and y-coordinates of the points representing all N bases, µx = Σxi/N and µy = Σyi/N, where i = 1, 2, …, N and N is the total number of bases in the sequence; and 2) the graph radius gR, defined as the distance from the origin to the center of mass, gR = √(µx² + µy²), which serves as a sequence descriptor. The gR is quite sensitive to changes in the bases of a sequence, so an identical gR in two sequences generally implies that they have the same distribution pattern of bases. The origin and the center of mass are marked in the graph by dotted reference lines, color-coded in black for the origin and light red for the center of mass.
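As a rough sketch of how these two measures could be computed, the following Python function averages the coordinates of the N points reached by the walk (the origin itself is not counted, matching i = 1, 2, …, N) and takes the distance of that average from the origin. The function name `graph_descriptors` is hypothetical, and skipping symbols other than A, T, G, and C is an assumption.

```python
import math

# Unit step for each base: A = east, T = north, G = west, C = south.
STEPS = {"A": (1, 0), "T": (0, 1), "G": (-1, 0), "C": (0, -1)}


def graph_descriptors(sequence):
    """Return (mu_x, mu_y, g_r) for the nucleotide walk of a sequence."""
    x = y = 0
    sum_x = sum_y = 0
    n = 0
    for base in sequence.upper():
        if base not in STEPS:
            continue  # assumption: ambiguous symbols are skipped
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        sum_x += x
        sum_y += y
        n += 1
    if n == 0:
        raise ValueError("sequence contains no A, T, G or C bases")
    mu_x, mu_y = sum_x / n, sum_y / n
    # Graph radius: distance from the origin to the center of mass.
    return mu_x, mu_y, math.hypot(mu_x, mu_y)


mu_x, mu_y, g_r = graph_descriptors("AATT")
print(mu_x, mu_y)  # 1.75 0.75
print(round(g_r, 4))
```

For AATT the walk reaches (1,0), (2,0), (2,1), and (2,2), so the center of mass is (7/4, 3/4). Changing the last base to A moves the final point to (3,1) and shifts the center of mass to (2, 0.5), which illustrates how a single substitution changes gR.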

Detailed information on a specific nucleotide in the sequence, along with the two numerical measures of the graph, is displayed in a tooltip when hovering the mouse over the graph.

The nucleotide walk graph is illustrated in the following interactive visualization, created in Tableau Public, a free visual analytics platform. It uses a series of four Zika virus genome sequences from Aedes aegypti collected in Florida, USA, which were sequenced by the Andersen Lab. Each dataset of DNA sequences, originally in FASTA format, was parsed and transformed into a data table where each base of the sequence has its own row, and the four Zika virus sequences were then integrated into a single tidy dataset suitable for data analysis and visualization.
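The transformation from FASTA to a one-row-per-base table might look something like the Python sketch below. It is not the script used for this post: the column names (`strain`, `position`, `base`, `x`, `y`), the function names, and the tiny FASTA text are all invented for illustration, and symbols other than A, T, G, and C are assumed to be left out of the table.

```python
import io

# Unit step for each base: A = east, T = north, G = west, C = south.
STEPS = {"A": (1, 0), "T": (0, 1), "G": (-1, 0), "C": (0, -1)}


def parse_fasta(handle):
    """Yield (record_id, sequence) pairs from FASTA-formatted text lines."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def tidy_rows(handle):
    """Yield one row per base, with its position and walk coordinates."""
    for record_id, sequence in parse_fasta(handle):
        x = y = 0
        for position, base in enumerate(sequence.upper(), start=1):
            if base not in STEPS:
                continue  # assumption: ambiguous symbols get no row
            dx, dy = STEPS[base]
            x, y = x + dx, y + dy
            yield {"strain": record_id, "position": position,
                   "base": base, "x": x, "y": y}


# Made-up FASTA text, only to exercise the functions above.
EXAMPLE_FASTA = """>strain_1 made-up example
ATGC
AT
>strain_2 made-up example
ANT
"""

rows = list(tidy_rows(io.StringIO(EXAMPLE_FASTA)))
for row in rows[:3]:
    print(row)
```

Because every row carries the strain identifier, the rows from all sequences can be stacked into one table, which is the shape a tool such as Tableau Public expects for drawing one path per strain.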

Limitation of this representation

The nucleotide walk graph produces overlapping points and paths in some segments of a sequence, because the plot may retrace its path or reach the same point on the graph several times, creating circuits in the graphical representation. This means that the sequence cannot be uniquely determined from its representation in the graph. In other words, the overlapping segments hide some parts of the sequence from view. However, this overlapping effect does not affect the computation of the center of mass and the graph radius, the two main quantitative measures derived from the graph. Alternative orientations of the nucleotides in the 2D plane have been proposed to overcome this limitation (Guo X, et al., 2001).
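A small Python experiment can make this limitation concrete. In the sketch below (function names are hypothetical, and color is deliberately ignored), the sequences A and AGAGA draw exactly the same segment on the plane, yet their centers of mass differ because every visited point still counts in the average.

```python
# Unit step for each base: A = east, T = north, G = west, C = south.
STEPS = {"A": (1, 0), "T": (0, 1), "G": (-1, 0), "C": (0, -1)}


def walk_points(sequence):
    """Points reached after each base (the origin itself is not included)."""
    x = y = 0
    points = []
    for base in sequence.upper():
        if base not in STEPS:
            continue  # assumption: ambiguous symbols are skipped
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def drawn_segments(sequence):
    """Set of undirected unit segments the walk draws, ignoring color."""
    previous = (0, 0)
    segments = set()
    for point in walk_points(sequence):
        segments.add(frozenset((previous, point)))
        previous = point
    return segments


def center_of_mass(sequence):
    """Average x and y over all visited points, repeats included."""
    points = walk_points(sequence)
    if not points:
        raise ValueError("sequence contains no A, T, G or C bases")
    n = len(points)
    return (sum(px for px, _ in points) / n, sum(py for _, py in points) / n)


print(drawn_segments("A") == drawn_segments("AGAGA"))  # True
print(center_of_mass("A"))      # (1.0, 0.0)
print(center_of_mass("AGAGA"))  # (0.6, 0.0)
```

The drawing alone cannot distinguish the two sequences, while the numerical descriptors can, which is why the overlap is a limitation of the picture rather than of the measures.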

Acknowledgement

This blog post was inspired by the blog post Walking on Nucleotides and by the work of Karthik Gangavarapu, who visualized a series of DNA sequences of Zika virus from humans and mosquitoes, as well as human chromosomes, using the visualization technique applied by Aragón Artacho (Aragón Artacho, 2013) for visualizing large mathematical datasets and for determining whether a real number is normal. Thank you, Karthik Gangavarapu, for the inspiration.

References: 

Aragón Artacho, FJ., et al. "Walking on real numbers". The Mathematical Intelligencer. Vol. 35, Issue 1 (March 2013). ISSN 0343-6993, pp. 42-60.

Karthik Gangavarapu. Walking on Nucleotides. Sept 16, 2016, Karthik's web site. (accessed 4 October 2016).

X. Guo, M. Randić, S. Basak, A novel 2-D graphical representation of DNA sequences of low degeneracy, Chem. Phys. Lett. 350 (2001) 106–112.
