Friday, September 13, 2013

The non-nuclear genome: including a mitochondrial genome in GRCh38

The GRC maintains and improves the reference nuclear assembly only. The MITOMAP (http://www.mitomap.org/MITOMAP) group has served a similar function, curating genetic variation of the mitochondrial genome. When the GRC released GRCh37 we released only the nuclear genome, as we didn't think we could distribute the sequence representing the mitochondrial (MT) reference (which we did not produce) as part of the GRCh37 reference. Not distributing an MT sequence with the reference nuclear assembly led to some confusion in the research community, as different groups adopted different versions of the MT sequence records to use as their reference.
The current MT reference sequence is the Revised Cambridge Reference Sequence (rCRS) represented by GenBank accession number J01415.2 and RefSeq accession number NC_012920.1. The MITOMAP group has graciously allowed the GRC to distribute their annotated version of the rCRS MT reference sequence with the nuclear assembly and this sequence was added to GRCh37 with the second patch release. This same MT reference sequence will be included in the GRCh38 assembly release. 

Tuesday, September 10, 2013

Tech Tip: The Pseudo-autosomal region of the reference assembly

As part of normal cell division, maternal and paternal chromosome copies pair (that is, maternal chromosome 1 pairs with paternal chromosome 1) and exchange genetic material. This is an important step in cell division and poses a unique problem for males, who normally have only a single copy each of the X and Y chromosomes. The X and Y chromosomes have regions at each end referred to as the pseudo-autosomal regions, or PARs. The PAR regions of the X and Y are homologous, which allows the X and Y to pair and exchange genetic material within these regions. Within the PAR regions, males carry two copies of each gene (one on the X and one on the Y), whereas on the rest of the X and Y chromosomes they carry only one copy. There are two PAR regions, one at either end of the X and Y chromosomes.
Human chromosomes X and Y with the PAR locations highlighted in orange and denoted by gray triangles. The PAR 1 region (on Xp and Yp, to the left) is larger than the PAR 2 region (on Xq and Yq). The orange highlight for the PAR 2 region is not visible because the region is very small relative to the rest of the chromosome.


During the Human Genome Project (HGP) a decision was made not to sequence the PAR regions of a Y chromosome separately. At that point, the reference assembly was meant to be a haploid genome representation, so only one copy of the PAR regions was necessary. To build the Y chromosome, the HGP made a copy of the X chromosome PAR sequence and inserted it into the chromosome Y sequence assembly. This was done to make a complete model of each chromosome; without the PAR representation, the Y sequence assembly would be incomplete.

Since the GRC has taken responsibility for the reference assembly, we have updated the assembly model so that the reference is no longer a single, haploid representation [PubMed]. However, when we introduce allelic duplication, we put the duplicated sequence into a separate assembly unit (refresher on the assembly model). The PAR is the only case where we represent allelic duplication within the same assembly unit. This was true for GRCh37 and will again be the case when we submit GRCh38 to GenBank.

This duplication needs to be taken into account when performing sequence analysis so that you can distinguish allelic duplication from other types of duplication, such as repeats and segmental duplications. When you sequence a female sample, reads from the PAR regions will align to both the X and Y PAR sequences. This may lower the mapping quality of reads in these regions and can affect variant calling. One approach many groups take to this problem is to 'hard-mask' the PAR regions on the Y chromosome, which means replacing the actual sequence with Ns. This preserves the sequence coordinate space of the Y chromosome, but eliminates the duplication at this locus.
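
As a rough illustration of hard-masking, here is a minimal Python sketch. The function name `hard_mask`, the toy sequence and the toy region coordinates are all made up for this example; they are not real chromosome Y data. For a real assembly you would take the PAR coordinates from the assembly's own region definitions and would probably use an established tool instead of hand-written code.

```python
def hard_mask(sequence, regions):
    """Return a copy of `sequence` with each region replaced by Ns.

    `regions` is a list of (start, end) pairs, 1-based and inclusive,
    which is the convention usually used for assembly coordinates.
    """
    masked = bytearray(sequence, "ascii")
    for start, end in regions:
        if start < 1 or end > len(masked) or start > end:
            raise ValueError(f"region {start}-{end} is outside the sequence")
        # Overwrite the region with the same number of Ns, so every
        # downstream coordinate stays where it was.
        masked[start - 1:end] = b"N" * (end - start + 1)
    return masked.decode("ascii")


# Hypothetical toy "chromosome" with a made-up PAR at each end.
toy_y = "ACGTACGTACGTACGTACGT"
toy_par_regions = [(1, 4), (17, 20)]

masked_y = hard_mask(toy_y, toy_par_regions)
print(masked_y)
```

Because each masked base is replaced one-for-one with an N, the masked sequence has the same length as the original and positions outside the masked regions keep their coordinates.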