Wednesday, September 17, 2014

GRCz10 - The GRC's first zebrafish genome reference assembly


When the Zv9 assembly was released in July 2010, the zebrafish genome sequence was given into the care of the GRC for future improvement and maintenance. After 4 years of hard work (and zebrafish IS hard work), we have now produced a new reference assembly, GRCz10.

The previous assembly was already of high value to the scientific community and served well both for investigating isolated gene loci and for addressing overall bioinformatics questions (Howe et al. 2013), but it still featured many gaps and suffered from sub-optimal long-range continuity. To address this, we sequenced more than 1500 additional BAC and fosmid clones and added them to the assembly. We reviewed clone overlaps and clone placements with a variety of techniques. In collaboration with the Stemple lab, using the MGH panel, we generated a new meiotic map to fill remaining gaps in the high density meiotic map SATMAP. This new map, GAPMAP, helped with placing previously unlocalised contigs onto chromosomes, and allowed us to assess and improve the order of existing chromosome placements. The creation of an optical map further improved the clone assignments, with a notable impact on the structure of the repeat-rich chromosome 4. Thanks to a collaboration with Mark Hills from the Lansdorp lab, we gained additional insight into the orientation of assembly components, leading to more than 250 orientation changes and re-placements. In total, more than 4000 genome issues were reviewed and resolved. The remaining gaps in the clone path were filled with sequence from the WGS31 whole genome assembly, as done before with Zv9.

The most notable changes in the chromosome landscape since Zv9 can be found on chromosome 4, which has gained about 15 Mb in length. In addition, 94 of the 112 previously unplaced clone-contigs found a home on a chromosome. Whilst 85% of all publicly available cDNAs could be assigned a place on Zv9 with at least 97% identity and 90% coverage, we now find 87% in GRCz10. If we classify cDNAs with less than 97% identity and less than 40% coverage as not found, then Zv9 was missing 7% of the cDNAs, whilst GRCz10 is now missing only 3%. Now that the assembly has been released, the Havana team at the Sanger Institute is busy manually (re-)annotating genes, and the Ensembl team is working on generating an automated gene build and integrating it with these manually produced models. The NCBI eukaryotic genome annotation pipeline (gpipe) will also annotate the GRCz10 RefSeq assembly.
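The cDNA thresholds above can be expressed as a small classifier. This is a minimal sketch of one possible reading of the criteria, not the GRC's actual analysis code: it assumes each cDNA has a single best alignment summarised by percent identity and percent coverage, and it treats an alignment as "not found" unless it reaches 97% identity over at least 40% of the cDNA. The function names and the intermediate "partial" label are illustrative assumptions.

```python
from collections import Counter

# Thresholds taken from the text; the three-way labelling is an assumption.
MIN_IDENTITY = 97.0
PLACED_COVERAGE = 90.0
FOUND_COVERAGE = 40.0


def classify_cdna(identity: float, coverage: float) -> str:
    """Classify the best alignment of one cDNA (values in percent)."""
    if identity >= MIN_IDENTITY and coverage >= PLACED_COVERAGE:
        return "placed"
    if identity >= MIN_IDENTITY and coverage >= FOUND_COVERAGE:
        return "partial"
    return "not_found"


def summarise(alignments: list[tuple[float, float]]) -> dict[str, float]:
    """Return the fraction of cDNAs in each class."""
    if not alignments:
        raise ValueError("no alignments given")
    counts = Counter(classify_cdna(i, c) for i, c in alignments)
    return {label: counts[label] / len(alignments)
            for label in ("placed", "partial", "not_found")}


if __name__ == "__main__":
    # Toy (identity, coverage) pairs, invented for illustration only.
    toy = [(98.5, 95.0), (99.0, 60.0), (90.0, 30.0), (97.0, 90.0)]
    print(summarise(toy))
```

Running the same summary over alignments to two assemblies gives directly comparable "placed" and "not found" fractions, which is the kind of comparison reported above for Zv9 and GRCz10.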

If you are working with the zebrafish genome assembly, we'd be very happy to get some feedback from you. You can either fill in the form at the GRC home page, or send an email to zfish-help@sanger.ac.uk.





Friday, April 4, 2014

Chromosome 9 peri-centromeric assembly improvement


With the release of the GRCh38 reference assembly, we are highlighting areas where improvements to the genome have been made.

The chromosome 9 peri-centromeric region has undergone significant change for GRCh38. Assembly-assembly alignments between GRCh37 and GRCh38 reveal some of the differences in the peri-centromeric region of chr. 9. As shown below, some sequences that were on the q-arm in GRCh37 are now on the p-arm in GRCh38. Why were these and other changes made?

Peri-centromeric regions of Chr. 9 in GRCh37 (top) and GRCh38 (bottom).
Blue horizontal bar: chromosome sequence. Blue/green fragments: individual clone and WGS components in the assembly tiling path. Purple bars: assembly-assembly alignments. The p- and q- arms, as well as the location of the centromere and adjacent heterochromatin gaps are marked. Note: in GRCh38, the centromere gap was replaced with sequence. The vertical bars through the alignments highlight sequence from the q-arm of GRCh37 chr. 9 that is now found on the p-arm of GRCh38.


In the GRCh37 release the region was highly fragmented, with little evidence for the order and orientation of the contigs placed within it. The optical map information was consistent with a path problem here: the map data suggested that several contigs were misplaced and did not represent a valid chromosome structure.

Optical map alignments to GRCh37, highlighting the fragmented and discordant pathway.
Track legend:
Pink: clone path; Green: gap; Blue: in silico SwaI fragments.
Aligned optical map track legend:
Gold: Concordant fragment; Red: Missing fragment (seen where OM consensus span gap); Grey: Unaligned fragment
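
The "in silico SwaI fragments" in the track legend can be thought of as the output of a computational restriction digest of the assembly sequence, which is then compared with the fragment sizes observed in the optical map. The following is a minimal sketch, assuming the SwaI recognition site ATTTAAAT with a blunt cut in the middle of the site. The function name and the toy sequence are illustrative and are not taken from the GRC's pipeline.

```python
SWAI_SITE = "ATTTAAAT"   # palindromic, so scanning one strand is sufficient
SWAI_CUT_OFFSET = 4      # blunt cut between ATTT and AAAT


def in_silico_digest(sequence: str,
                     site: str = SWAI_SITE,
                     cut_offset: int = SWAI_CUT_OFFSET) -> list[int]:
    """Return fragment lengths, in order, after cutting at every site."""
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        # Step by one base so that overlapping sites are also detected.
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    toy = "GGGG" + "ATTTAAAT" + "CCCCCC" + "ATTTAAAT" + "GG"
    print(in_silico_digest(toy))
```

Real optical map comparisons also have to allow for sizing error and for small fragments that are not observed, which this sketch does not attempt.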





Utilizing analyses from optical mapping, strand sequencing and admixture mapping we have made advancements in the representation of the region.

These data sets have allowed us to alter the tile path with a degree of confidence and the GRCh38 release now provides near complete representation of the chromosome 9 short arm.

Admixture mapping data provided by GRC collaborator Giulio Genovese confirmed localisation of clones to chromosome 9 and, in several instances, their positioning on the long or short arm. Strand sequencing data from GRC collaborators Mark Hills and Peter Lansdorp identified contigs on the GRCh37 reference assembly that sat in incorrect orientations.
Aligning these sequences to the optical map data from 3 cell lines, we were able to confirm results from the other data analyses and place clone contigs in the correct order, creating longer contigs.

Optical map alignments to a pre-release GRCh38 pathway containing unfinished clones.



Although the heterochromatic region on chr. 9 is still underrepresented in GRCh38, improvements have also been made to the long arm. Several contigs localizing to the peri-centromeric region are now ordered, thus providing a better representation of the chromosome.






Tuesday, January 14, 2014

GRCh38: Incorporating Modeled Centromere Sequence

Centromeres are specialized chromatin structures that are required for cell division. The composition of these regions is complex, as they are made up of a series of tandem repeats that are arranged into nearly identical multi-megabase arrays. The size and repetitive nature of these regions mean they are typically not represented in reference assemblies. The Human Genome Project (HGP) employed a clone based strategy (largely BAC clones) to produce the reference assembly, but cloning centromere sequences generally requires special effort, and isn't readily applicable to all human centromeres (see Kouprina et al., 2003 for one such effort). With the recent widespread adoption of whole genome sequencing (WGS), there are clearly alpha-satellite sequences in the reads produced, but assembling these sequences into faithful representations of centromeres using standard techniques is impossible due to the repetitive nature of these sequences. In all previous versions of the human reference assembly, the centromere regions have been represented by a 3 Mb gap (that is a stretch of 3 million Ns). Recent efforts by Karen Miga and her colleagues are helping us improve centromere representation in the reference assembly. The GRCh38 reference assembly incorporates centromere models created by Miga and colleagues, along with their modeled region of one of the heterochromatic regions on the long arm of chromosome 7. These models replace the multi-megabase gaps that are in GRCh37.
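
Because assembly gaps are written as runs of Ns, they can be located directly in a chromosome sequence. The sketch below is one simple way to do this and is not a GRC tool: the function name, the `min_length` parameter, and the toy sequence are assumptions made for illustration. Coordinates are 0-based and half-open.

```python
import re


def find_gaps(sequence: str, min_length: int = 1) -> list[tuple[int, int]]:
    """Return (start, end) for each run of Ns of at least min_length bases."""
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            gaps.append((match.start(), match.end()))
    return gaps


if __name__ == "__main__":
    toy = "ACGTNNNNNACGNNT"
    # Only the run of five Ns passes the length filter here.
    print(find_gaps(toy, min_length=3))
```

With a real chromosome, setting `min_length` to a few million would pick out only the largest gaps, such as the centromere placeholders described above.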

As described in Miga et al., 2013,  Karen and her colleagues used the whole genome shotgun (WGS) reads that were generated as part of the Venter sequencing project (Levy, et al., 2007) to build centromere models (Fig.1). They started by identifying sequence reads containing alpha-satellite centromere sequences. They then used these reads to construct models representing the approximate repeat number and order for each of the centromeric alpha-satellite higher order arrays in the genome. Because there are two copies of each centromere for each autosome, these centromere models represent an average of the two centromere copies. On the acrocentric chromosomes, where there is extreme inter-chromosomal array sequence homogeneity, the array models found in GRCh38 include data from all four acrocentric regions. The team was also able to use read pair information to link the modeled scaffold arrays to the adjacent euchromatic sequence present in the Venter assembly.
Fig. 1
Schematic of modeled centromere sequence. Centromeres are comprised of higher order array sequences, which consist of alpha-satellite interrupted by various repeat elements (such as SINE or LINE elements), and inter-array (euchromatic) sequences.
The model centromere sequences are not exact representations of the centromeres found in the Venter genome. The sequence diversity and complexity of these regions make constructing the exact copies of each centromere with current sequencing technologies impossible. Each model represents variants and monomer ordering in a proportional manner to that observed in the initial read database, but the long-range ordering of the repeats and ordering of the linked euchromatic contigs represents only an inferred sequence. However, inclusion of these models in the reference assembly will be beneficial for the research community. Even for those not interested in centromere biology, it is likely that inclusion of these models will improve overall read alignments in individual re-sequencing efforts. Reads containing centromeric sequences are generated in whole genome sequencing experiments and providing an alignment target for these reads will reduce the number of off target alignments and unaligned reads. For those interested in centromere biology, Karen and her colleagues provide evidence in their manuscript that these models can be used to study sequence diversity in these regions.

Tuesday, December 24, 2013

Announcing GRCh38

The GRC announces the public release of GRCh38, the latest version of the human reference genome assembly. This represents the first major assembly update since 2009, and introduces changes to chromosome coordinates. The GRC would like to thank the many individuals and groups that have provided helpful feedback and shared data, often ahead of publication, in efforts to improve the reference assembly. Such interactions help ensure the reference assembly is truly a community resource.

Users can download the latest version of the assembly from the GenBank FTP site: ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/

The GRC does not provide annotation for the assembly. The assembly will be picked up from this FTP site for annotation by the major browsers (UCSC, Ensembl and NCBI), who will make it available on their websites in the upcoming weeks and months.

GRCh38 highlights

Mitochondrial genome

MITOMAP, the organization responsible for managing human mitochondrial sequences, has kindly allowed the GRC to include the mitochondrial reference sequence with GRCh38. As in GRCh37, the current MT reference sequence is the Revised Cambridge Reference Sequence (rCRS), represented by GenBank accession number J01415.2 and RefSeq accession number NC_012920.1.

Sequence representation for centromeres

In previous reference assembly versions, the centromeres were represented by large, megabase-sized, gaps (N's in the assembly sequence). In GRCh38, these gaps are replaced by sequences derived from the reads generated during the sequencing of the HuRef genome. These sequences were used to create centromere models, as described in Miga et al., 2013, that  provide the approximate repeat number and order for each centromere in the genome. These model centromere sequences are anticipated to be useful for read mapping and variation studies. Be on the lookout for upcoming GRC blogs with more information about these centromeres.

General assembly updates

Large scale studies of human variation, such as the 1000 Genomes Project, identified a number of bases and indels in GRCh37 that were never seen in any individuals, suggesting they may represent errors in the assembly. Several thousand individual bases were updated in GRCh38, many of which corrected errors in coding sequence. In addition, a number of assembly regions that were misassembled in GRCh37, such as 1q21, 10q11 and the chr. 9 peri-centromeric regions, have been retiled. Several highly variant genomic regions, such as the IGH locus, have been retiled with components derived from a single haplotype resource in order to ensure the reference assembly provides a valid haplotypic representation. More than 100 assembly gaps have also been updated; these are either closed or reduced, in many cases with publicly available WGS sequences from other genome sequencing projects.

Variation

Like GRCh37, the updated reference assembly provides alternate sequence representation for variant regions in the form of alternate loci (alt loci) scaffolds. The alt loci are stand-alone, accessioned sequences for which chromosomal context is provided via alignment to the reference chromosomes. All alternate loci include at least one anchor sequence, a component also found on the reference chromosomes, to ensure these alignments are of high quality. Alt loci belong to alternate loci assembly units: the assembly unit ALT_REF_LOCI_1 contains the first alternate sequence representation for any genomic locus, ALT_REF_LOCI_2 contains the second alternate sequence representation and so forth. GRCh38 contains 261 alt loci scaffolds, in 35 alternate assembly units. 72 of these alternate loci were previously available as NOVEL patches to GRCh37. The LRC/KIR complex on chr. 19 has the largest number of alternate sequence representations (35), followed by the MHC on chr. 6 (7).
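
The assembly unit rule above means that the number of alternate assembly units equals the largest number of alternate representations at any single locus, which is why the 35 LRC/KIR alternates correspond to 35 units. A minimal sketch of that bookkeeping follows. The function name and the scaffold labels are hypothetical placeholders, not real accessions.

```python
def assign_assembly_units(
    alts_by_locus: dict[str, list[str]],
) -> dict[str, list[tuple[str, str]]]:
    """Place the n-th alternate scaffold of each locus in ALT_REF_LOCI_n."""
    units: dict[str, list[tuple[str, str]]] = {}
    for locus, scaffolds in alts_by_locus.items():
        for index, scaffold in enumerate(scaffolds, start=1):
            units.setdefault(f"ALT_REF_LOCI_{index}", []).append((locus, scaffold))
    return units


if __name__ == "__main__":
    # Toy input: placeholder scaffold names, far fewer than in GRCh38.
    toy = {
        "LRC/KIR": ["kir_alt_a", "kir_alt_b", "kir_alt_c"],
        "MHC": ["mhc_alt_a", "mhc_alt_b"],
    }
    for unit, members in assign_assembly_units(toy).items():
        print(unit, members)
```

Each unit therefore holds at most one alternate representation of any given locus.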


Friday, September 13, 2013

The non-nuclear genome: including a mitochondrial genome in GRCh38

The GRC maintains and improves the reference nuclear assembly only. The MITOMAP (http://www.mitomap.org/MITOMAP) group has served a similar function curating genetic variation of  the mitochondrial genome. When the GRC released GRCh37 we only released the nuclear genome, as we didn't think we could distribute the sequence representing the mitochondrial (MT) reference (which we did not produce) as part of the GRCh37 reference. Not distributing an MT sequence with reference nuclear assembly led to some confusion in the research community as different groups adopted different versions of the MT sequence records to use as their reference.
The current MT reference sequence is the Revised Cambridge Reference Sequence (rCRS) represented by GenBank accession number J01415.2 and RefSeq accession number NC_012920.1. The MITOMAP group has graciously allowed the GRC to distribute their annotated version of the rCRS MT reference sequence with the nuclear assembly and this sequence was added to GRCh37 with the second patch release. This same MT reference sequence will be included in the GRCh38 assembly release. 

Tuesday, September 10, 2013

Tech Tip: The Pseudo-autosomal region of the reference assembly

As part of normal cell division, maternal and paternal chromosome copies pair (that is, maternal chromosome 1 pairs with paternal chromosome 1) and exchange genetic material. This is an important step in cell division and poses a unique problem for males, who normally only have a single copy of the X and Y chromosomes. The X and Y chromosomes have regions at each end referred to as the pseudo-autosomal region, or the PAR. The PAR regions of the X and Y are homologous, which allows the X and Y to pair and exchange genetic material within these regions. Within the PAR regions, males contain two copies of genes (one on the X and one on the Y), whereas on the rest of the X and Y chromosome they only contain one copy. There are two PAR regions, one at either end of the X and Y chromosomes.
Human chromosomes X and Y with the PAR locations highlighted in orange and denoted by gray triangles. The PAR 1 region (on Xp and Yp, to the left) is larger than the PAR 2 region (on Xq and Yq).  The PAR 2 region orange highlight is not visible as it is very small with respect to the rest of the chromosome. 


During the Human Genome Project (HGP) a decision was made to not separately sequence the PAR region of a Y chromosome. At that point, the reference assembly was meant to be a haploid genome representation, so only one copy of the PAR regions was necessary. In order to build the Y chromosome, the HGP made a copy of the X chromosome PAR sequence and inserted it into the chromosome Y sequence assembly. This was done in order to make a complete model of each chromosome: without the PAR representation, the Y sequence assembly would be incomplete.

Since the GRC has taken responsibility for the reference assembly, we have updated the assembly model so that the reference is no longer a single, haploid representation [PubMed]. However, when we introduce allelic duplication, we put the duplicated sequence into a separate assembly unit (refresher on the assembly model). The PAR is the only case where we represent allelic duplication within the same assembly unit. This was true for GRCh37 and will again be the case when we submit GRCh38 to GenBank.

This duplication needs to be taken into account when performing sequence analysis so that you can distinguish allelic duplication from other types of duplications, such as repeats and segmental duplication. When you sequence a female sample, reads from the PAR regions will align to both the X and Y PAR sequences. This may affect the mapping quality of reads in this region and can affect variant calling. One approach many groups take to solving this problem is to 'hard-mask' the PAR regions on the Y chromosome, which means replacing the actual sequence with Ns. This preserves the sequence coordinate space of the Y chromosome, but eliminates the duplication at this locus.
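
Hard-masking is straightforward to sketch: every base inside the chosen intervals is replaced with N, so the sequence length and all downstream coordinates stay unchanged. The example below is a minimal illustration, assuming 0-based, half-open intervals. The function name, the toy sequence, and the toy intervals are invented for the example; real PAR coordinates should be taken from the assembly's own region definitions.

```python
def hard_mask(sequence: str, regions: list[tuple[int, int]]) -> str:
    """Replace each 0-based, half-open [start, end) interval with Ns."""
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


if __name__ == "__main__":
    toy_y = "ACGTACGTACGTACGTACGT"
    # Toy stand-ins for a PAR at each end of the chromosome.
    toy_pars = [(0, 4), (17, 20)]
    masked = hard_mask(toy_y, toy_pars)
    print(masked)
    print(len(masked) == len(toy_y))
```

Because the masked sequence has the same length as the original, annotation and variant coordinates on the rest of the Y chromosome remain valid.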

Wednesday, January 9, 2013

Genome Update: Highly variant immune regions retiled as single haplotype paths


Genes encoding proteins of the immune system are constantly evolving in response to selective pressures from pathogens. This rapid host-pathogen co-evolution has led to large families of genes that are highly polymorphic and are often a result of gene duplication and diversification. In GRCh37, the current reference assembly, some chromosome regions encompassing such genes are comprised of components from several different genomic libraries. The lack of a single haplotype and excess allelic variation at such regions hinders haplotype inference using traditional linkage disequilibrium based methodology. In addition, given the polymorphic nature of these genes, paralogs may be missing from the reference assembly. The CHORI-17 BAC library, derived from a hydatidiform mole, is an excellent resource for resolving loci such as these, as it is composed of germline material without any allelic variation. We sequenced clones from CHORI-17 to create a single haplotype across two of these loci: the leukocyte receptor complex (LRC) and the immunoglobulin heavy chain locus (IGH). These new paths have now been released as fix patches in GRCh37.p11.

The LRC on chromosome 19q13.4 is approximately 1 Mbp and contains many genes related to immune response, including the LILR (Leukocyte Immunoglobulin-like Receptor) and KIR (Killer Immunoglobulin-like Receptor) gene families (Fig. 1). The products of these genes interact with HLA molecules, making them important components of the innate immune response. The GRC previously released 8 novel patches providing partial representation of the LRC region for eight different haplotypes. We have now released a fix patch (KB021647.1) for this region that provides full representation for the CHORI-17 haplotype. In GRCh38, this patch will be incorporated into the reference chromosome, replacing the GRCh37 mixed haplotype. The CHORI-17 haplotype harbors the common 6.8 kbp LILRA3 deletion, which has been associated with multiple autoimmune disorders such as psoriasis and multiple sclerosis. In addition, the KIR haplotype is the A01 haplotype, which contains the 22 bp frameshift deletion variant of the 2DS4 gene that inactivates the protein.


Fig. 1 LRC CHORI-17 patch
Fig. 1 Top: Alignment of GRCh37 chr. 19 to the LRC region fix patch. Bottom: Alignment of the fix patch and 8 LRC region novel patches to GRCh37 chr. 19. The blue bars represent the tiling paths of chr. 19 (NC_000019.9) and the fix patch (KB021647.1). The region of the fix patch comprised of CHORI-17 clones is highlighted in orange. Genes annotated on the chromosome are shown in green. The gray tracks below represent the alignments: the thin horizontal lines indicate gaps, while the small vertical red bars indicate mismatches.  The red arrows show the location of the LILRA3 deletion in the CHORI-17 haplotype.

The 1 Mbp IGH locus on chromosome 14q32.33 contains genes that encode the heavy chain of immunoglobulin molecules that interact with antigen epitopes (Fig. 2). This locus is even more complicated than the LRC given that the IGH genes are subject to somatic rearrangements, and attempts to reconcile the organization of the locus using B-lymphocyte derived material have been difficult. The GRC has now released a fix patch (KB021645.1) that provides a single haplotype representation for the majority of this locus, covering the IG variable domain encoding gene segments. The CHORI-17 haplotype adds 101 kbp of previously uncharacterized sequence, including functional IGH variable genes and four large germline copy number variants (Watson and Steinberg, in review).


Fig. 2. Top: Alignment of GRCh37 chr. 14 to the IGH region fix patch. Bottom: Alignment of the fix patch to GRCh37 chr. 14. The blue and gray bars represent the tiling paths of chr. 14 (NC_000014.8) and the fix patch (KB021645.1). The region of the fix patch comprised of CHORI-17 clones is highlighted in orange. Genes annotated on the chromosome are shown in green. The purple bars below represent the alignments: the thin regions indicate gaps, while the small vertical ticks indicate mismatches.

These two updates highlight the utility of hydatidiform mole BAC libraries for resolving complex, highly duplicated loci of the human genome. By releasing these updates as fix patches to the reference sequence, we allow researchers to make use of these high quality sequences to better characterize sequence variation from their own disease association studies ahead of the GRCh38 genome update.