Wednesday, January 28, 2015

GRCh38: Patching the ABO gene

GRCh38 has started receiving patch updates, and this blog post describes a FIX patch to the ABO gene, located on chr. 9. You may be aware that the GRC released a FIX patch to ABO for GRCh37. So why is there an ABO FIX patch for GRCh38 as well?

In GRCh37, the ABO gene was annotated on sequence derived from two RP11 library clones, AL732364.9 (RP11-244N20) and AL158826.23 (RP11-430N14). However, the RP11 library is derived from a diploid genome and analysis demonstrated that the two sequenced clones represented two different Type O ABO alleles. As a result, the GRCh37 chr. 9 representation of ABO was an invalid haplotype for the gene (Fig. 1, top panel).
Fig. 1 Top: ABO region in GRCh37. The gene is derived from 2 components, resulting in an invalid haplotype not seen in any individuals. Bottom: ABO fix patch. The gene is derived from a single component and represents a known Type O haplotype.


To address this issue, we identified a clone from the CalTech human BAC library D that captures the complete ABO gene (CTD-2612A24). The sequence for this clone was finished (AL772161.10) and inserted into the chr. 9 tiling path, replacing RP11 component AL158826.23. By setting the switch points between AL732364.9 (RP11-244N20) and AL772161.10 (CTD-2612A24) so that the full insert sequence of the new component contributed to the scaffold, we were able to provide a complete and valid ABO Type A1.02 representation for the gene. This update was provided as a FIX patch scaffold (GL339450.1) for GRCh37 (Fig. 1, bottom panel).

Unfortunately, this update is not reflected in GRCh38. Between the final GRCh37 patch release (GRCh37.p13) and the release of GRCh38, the sequence for RP11-244N20 was updated (AL732364.10) and inserted into the chr. 9 tiling path. The switch points between the updated sequence AL732364.10 and AL772161.10 were set incorrectly (Fig. 2). This resulted in an invalid haplotypic representation for ABO. Whereas the GRCh37 representation was a Type O/O mix, in GRCh38 it is a Type A/O mix.
Fig. 2 Top: ABO fix patch. Gene is derived from a single component. Bottom: ABO region in GRCh38. The gene is derived from 2 components, creating an invalid haplotype. This is fixed by the GRCh38 FIX patch.

The GRCh38 FIX patch scaffold KN196479.1 corrects this switch point and provides the same single haplotype representation for ABO that was present in the GRCh37 FIX patch scaffold. This re-patching of the ABO gene again restores the functionality of the gene with the valid Type A1.02 haplotype.
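
The effect of a switch point on a gene can be illustrated with a small sketch. It models a tiling path as a list of components, each contributing a coordinate range to the scaffold, and asks which components a gene overlaps. The accessions are those named above, but every coordinate (including the gene interval) is a made-up value for illustration only, and the `Component` and `components_spanning` names are hypothetical helpers rather than GRC tooling.

```python
from typing import NamedTuple


class Component(NamedTuple):
    accession: str
    start: int  # first scaffold base contributed by this component (1-based, inclusive)
    end: int    # last scaffold base contributed by this component (1-based, inclusive)


def components_spanning(path, gene_start, gene_end):
    """Return the accessions of all components that overlap the gene interval."""
    return [c.accession for c in path if c.start <= gene_end and c.end >= gene_start]


# Hypothetical coordinates: the switch point falls inside the gene.
mixed_path = [
    Component("AL732364.10", 1, 150_000),
    Component("AL772161.10", 150_001, 320_000),
]

# Hypothetical coordinates: the switch point is moved upstream of the gene.
fixed_path = [
    Component("AL732364.10", 1, 120_000),
    Component("AL772161.10", 120_001, 320_000),
]

gene = (140_000, 165_000)  # hypothetical gene interval on the scaffold

print(components_spanning(mixed_path, *gene))  # gene drawn from two clones
print(components_spanning(fixed_path, *gene))  # gene drawn from a single clone
```

When the gene overlaps more than one component, its sequence is stitched together from different clones and may mix alleles. When it overlaps a single component, it represents the haplotype captured by that one clone.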


Thursday, November 27, 2014

Optical Mapping data in the GRC track hub

The GRC track hub now includes Optical Mapping analysis information.

What is Optical Mapping?
Optical Mapping (OM) is a method to produce ordered restriction maps from single DNA molecules (rMaps). These rMaps are assembled into consensus maps, which can be aligned against the reference assembly by taking into account the positions of restriction sites and the lengths of fragments. OM aids the scaffolding of genomic sequence and the identification of errors in genome assemblies, and it is also very helpful in confirming assembled contigs and sizing gaps.

What OM data is available?
OM data is currently available for human (Teague et al., 2010) and mouse (Church et al., 2009). We would like to thank Steve Goldstein and David Schwartz for providing the alignments to the respective reference assemblies.

What is displayed in the GRC track hub?
The OM data is divided into several tracks in the GRC track hub. These tracks are of three types:
  • OM alignment tracks show the alignments of consensus maps to the reference genome, based on the comparison of restriction patterns. Each track of this type corresponds to an analysis of OM data from a single cell line.
  • OM deletion tracks present the locations of additional restriction fragments that have no corresponding fragment in the reference assembly. Their position is defined by the remaining alignment of the respective consensus map. Again, each track of this type corresponds to an analysis of OM data from a single cell line.
  • Each assembly also has a single OM reference track, which presents the set of OM fragments that would be expected based on the reference sequence, produced via an in silico restriction digest.
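
As a rough illustration of what an in silico restriction digest involves, the sketch below cuts a sequence at every occurrence of a recognition site and returns the resulting fragment lengths. The enzyme used for the GRC tracks is not stated here; SwaI (site ATTTAAAT, cut in the middle) is used purely as an example, the toy sequence is invented, and `in_silico_digest` is a hypothetical helper name. Because the example site is palindromic, scanning one strand is sufficient; a non-palindromic site would also need the reverse strand.

```python
def in_silico_digest(sequence, site, cut_offset):
    """Return fragment lengths after cutting `sequence` at each occurrence of `site`.

    `cut_offset` is the number of bases of the site that stay on the
    upstream fragment (4 for a cut in the middle of an 8-base site).
    """
    seq = sequence.upper()
    site = site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)  # step by one so overlapping sites are found
    boundaries = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Toy sequence containing two copies of the example site.
toy = "GGGG" + "ATTTAAAT" + "CCCCCC" + "ATTTAAAT" + "TT"
fragments = in_silico_digest(toy, "ATTTAAAT", cut_offset=4)
print(fragments)
```

The ordered list of fragment lengths is the kind of pattern that consensus maps are compared against when they are aligned to the reference.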
How is this information visualised?
The way that OM analysis data is displayed is slightly different for each of the types of track mentioned above (alignments, deletions, and predicted fragments based on the reference).
  • The OM alignments tracks present each contig as a horizontal line, with restriction cut-sites dividing fragments being displayed as vertical lines along that contig. Where there is a space between the placement of successive restriction fragments according to this analysis, this is represented as a thicker vertical bar spanning the gap between the fragments.
  • The OM deletions tracks use a single vertical bar to show the location of each fragment or group of fragments with no corresponding fragment in the reference assembly. The size of the fragment is not represented by the glyph in the browser, but is shown as one of its data fields.
  • The OM reference track displays the cut-sites between expected restriction digest fragments as vertical lines.
Display examples
In the Ensembl genome browser (from version 78 onwards), the OM reference track is at the top, with the OM deletions track "OM gap 15510" below it, followed by three OM alignments tracks based on different cell lines. (Note that OM deletions tracks exist for all those cell lines which have alignments tracks, but only one OM deletion track has data at this location.)

Here is how it appears in the UCSC genome browser. The tracks are in the same order as for the Ensembl example above: the OM reference track is at the top, with the OM deletions track "OM gap 15510" below it, followed by three OM alignments tracks based on different cell lines.


    Tuesday, October 21, 2014

    GRC track hub arrives!

    The GRC are now providing assembly-related tracks for reference genomes via a track hub, which will allow you to view that data in a range of genome browsers, including Ensembl and UCSC. The location of this hub is: http://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt

    What tracks are available?
    The GRC generates a range of annotations on the reference genomes it curates. These tracks describe the assembly of the genome, as well as quality issues with the genome, and provide information relevant to their resolution or planned improvement. Much of this annotation is already available via the gEVAL browser. However, the track hub allows you to view this annotation in the genome browser of your choice.

    The individual tracks currently available are:
    • Genome issues under review by the GRC
    • Genomic regions defined by the GRC
    • Alignments between the primary assembly and alternate loci or patches
    • Clone sequence anomalies
    • Human regions with clones from the CHORI-17 library (CHM1tert)
    These tracks are updated on a weekly basis. We will be adding to this range of information as time goes on.

    How does this look?
    Here's how this data looks in Ensembl:
    Here's how it looks in UCSC:

    What is a track hub?
    A track hub is a means of attaching multiple annotation tracks to a genome browser via a single URL. Full documentation on track hubs is available here.

    How do I attach the track hub in my favourite genome browser?
    You need to provide the following URL to the genome browser:
    http://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt
    • In Ensembl: Go to "Add/Manage your data", select "Add your data" (if necessary), then select the data format "TrackHub", and add the hub URL.
    • In UCSC: select the "Track Hubs" button just beneath the main browser area, then select the "My Hubs" tab, and add the hub URL.
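
For UCSC, the hub can also be attached through a link rather than the menus. The sketch below builds such a link, assuming the UCSC browser's `db` and `hubUrl` URL parameters and the UCSC database name `hg38` for GRCh38; the `ucsc_hub_link` helper is hypothetical, and whether the hub serves data for a given assembly is not checked.

```python
from urllib.parse import urlencode

# Hub location as given in this post.
HUB_URL = "http://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt"


def ucsc_hub_link(db, hub_url=HUB_URL):
    """Build a UCSC browser link that loads the given track hub.

    Assumes the UCSC `hgTracks` CGI accepts `db` and `hubUrl` query parameters.
    """
    query = urlencode({"db": db, "hubUrl": hub_url})
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


link = ucsc_hub_link("hg38")
print(link)
```

The hub URL is percent-encoded by `urlencode`, so it can be passed safely as a single query parameter.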

    Tuesday, October 14, 2014

    GRCh38.p1 has arrived!

    The first patch release for the GRCh38 reference assembly is now available. The GRCh38.p1 release includes 16 scaffolds: 13 FIX patches and 3 NOVEL patches. The FIX patch scaffolds correct existing assembly sequences, while the NOVEL patch scaffolds provide new alternate sequence representations. You can download the GRCh38.p1 assembly, including the alignments of the patches to GRCh38, from the GenBank FTP site: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.16_GRCh38.p1/.

    Stay tuned for upcoming blog posts on individual patches!


    Wednesday, September 17, 2014

    GRCz10 - The GRC's first zebrafish genome reference assembly

    When the Zv9 assembly was released in July 2010, the zebrafish genome sequence was given into the care of the GRC for future improvement and maintenance. After 4 years of hard work (and zebrafish IS hard work), we have now produced a new reference assembly, GRCz10.

    The previous assembly was already of high value for the scientific community, and served well for both the investigation of isolated gene loci and to address overall bioinformatics questions (Howe et al. 2013), but still featured many gaps and suffered from sub-optimal long-range continuity. To address this, we have sequenced more than 1500 additional BAC and fosmid clones and added them to the assembly. We reviewed clone overlaps and clone placements with a variety of techniques. In collaboration with the Stemple lab, using the MGH panel, we generated a new meiotic map to fill remaining gaps in the high density meiotic map SATMAP. This new map, GAPMAP, helped with placing previously unlocalised contigs onto chromosomes, and allowed us to assess and improve the order of existing chromosome placements. The creation of an optical map further improved the clone assignments, with a notable impact on the structure of the repeat-rich chromosome 4. Thanks to a collaboration with Mark Hills from the Lansdorp lab, we gained additional insight into the orientation of assembly components, leading to more than 250 orientation changes and re-placements.  In total, more than 4000 genome issues were reviewed and resolved. The remaining gaps in the clone path were filled with sequence from the WGS31 whole genome assembly, as done before with Zv9.

    The most notable changes in the chromosome landscape since Zv9 can be found on chromosome 4, which has gained about 15 Mb in length, and 94 of the 112 previously unplaced clone-contigs found a home on a chromosome. Whilst 85% of all publicly available cDNAs could be assigned a place on Zv9 with at least 97% identity and 90% coverage, we now find 87% in GRCz10. If we classify cDNAs with less than 97% identity and less than 40% coverage as not found, then Zv9 was missing 7% of the cDNAs, whilst GRCz10 now is only missing 3%. Now that the assembly has been released, the Havana team at the Sanger Institute is busy manually (re-)annotating genes, and the Ensembl team is working on generating an automated gene build and integrating it with these manually produced models. The NCBI eukaryotic genome annotation pipeline (gpipe) will also annotate the GRCz10 RefSeq assembly.
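
The cDNA thresholds quoted above can be written down as a small classification sketch. It takes the text literally: a cDNA counts as placed at 97% identity or more with 90% coverage or more, and as not found when it has both less than 97% identity and less than 40% coverage. The "partial" label for everything in between, the `classify_cdna` helper, and the example values are assumptions made for illustration; they are not the pipeline used for the figures above.

```python
from collections import Counter


def classify_cdna(identity, coverage):
    """Classify a cDNA by the identity and coverage (in percent) of its best alignment."""
    if identity >= 97.0 and coverage >= 90.0:
        return "placed"
    if identity < 97.0 and coverage < 40.0:
        return "not found"
    return "partial"  # assumed label for alignments between the two thresholds


# Invented (identity, coverage) pairs, for illustration only.
best_alignments = [(99.1, 98.0), (97.0, 90.0), (98.5, 62.0), (91.0, 35.0), (96.0, 80.0)]
summary = Counter(classify_cdna(identity, coverage) for identity, coverage in best_alignments)
print(summary)
```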

    If you are working with the zebrafish genome assembly, we'd be very happy to get some feedback from you. You can either fill in the form at the GRC home page, or email us at zfish-help@sanger.ac.uk.





    Friday, April 4, 2014

    Chromosome 9 peri-centromeric assembly improvement


    With the release of the GRCh38 reference assembly, we are highlighting areas where improvements to the genome have been made.

    The chromosome 9 peri-centromeric region has undergone significant change for GRCh38. Assembly-assembly alignments between GRCh37 and GRCh38 reveal some of the differences in the peri-centromeric region of chr. 9. As shown below, some sequences that were on the q-arm in GRCh37 are now on the p-arm in GRCh38. Why were these and other changes made?

    Peri-centromeric regions of Chr. 9 in GRCh37 (top) and GRCh38 (bottom).
    Blue horizontal bar: chromosome sequence. Blue/green fragments: individual clone and WGS components in the assembly tiling path. Purple bars: assembly-assembly alignments. The p- and q- arms, as well as the location of the centromere and adjacent heterochromatin gaps are marked. Note: in GRCh38, the centromere gap was replaced with sequence. The vertical bars through the alignments highlight sequence from the q-arm of GRCh37 chr. 9 that is now found on the p-arm of GRCh38.


    In the GRCh37 release, the region was highly fragmented, with little evidence for the order and orientation of the contigs placed within it. The optical map information was consistent with a path problem here: the map data suggested that several contigs were misplaced and that the region did not represent a valid chromosome structure.

    Optical map alignments to GRCh37, highlighting the fragmented and discordant pathway.
    Track legend:
    Pink: clone path; Green: gap; Blue: in silico SwaI fragments.
    Aligned optical map track legend:
    Gold: Concordant fragment; Red: Missing fragment (seen where the OM consensus spans a gap); Grey: Unaligned fragment





    Utilizing analyses from optical mapping, strand sequencing and admixture mapping we have made advancements in the representation of the region.

    These data sets have allowed us to alter the tile path with a degree of confidence and the GRCh38 release now provides near complete representation of the chromosome 9 short arm.

    Admixture mapping data provided by GRC collaborator Giulio Genovese confirmed localisation of clones to chromosome 9 and, in several instances, their positioning on the long or short arm. Strand sequencing data from GRC collaborators Mark Hills and Peter Lansdorp identified contigs on the GRCh37 reference assembly that sat in incorrect orientations.
    By aligning these sequences to the optical map data from 3 cell lines, we were able to confirm results from the other analyses and place clone contigs in the correct order, creating longer contigs.

    Optical map alignments to a pre-release GRCh38 pathway containing unfinished clones.



    Although the heterochromatic region on chr. 9 is still underrepresented in GRCh38, improvements have also been made to the long arm. Several contigs localizing to the peri-centromeric region are now ordered, thus providing a better representation of the chromosome.






    Tuesday, January 14, 2014

    GRCh38: Incorporating Modeled Centromere Sequence

    Centromeres are specialized chromatin structures that are required for cell division. The composition of these regions is complex, as they are made up of a series of tandem repeats that are arranged into nearly identical multi-megabase arrays. The size and repetitive nature of these regions mean they are typically not represented in reference assemblies. The Human Genome Project (HGP) employed a clone-based strategy (largely BAC clones) to produce the reference assembly, but cloning centromere sequences generally requires special effort, and isn't readily applicable to all human centromeres (see Kouprina et al., 2003 for one such effort). With the recent widespread adoption of whole genome sequencing (WGS), alpha-satellite sequences are clearly present in the reads produced, but assembling these sequences into faithful representations of centromeres using standard techniques is impossible due to their repetitive nature. In all previous versions of the human reference assembly, the centromere regions have been represented by a 3 Mb gap (that is, a stretch of 3 million Ns). Recent efforts by Karen Miga and her colleagues are helping us improve centromere representation in the reference assembly. The GRCh38 reference assembly incorporates centromere models created by Miga and colleagues, along with their modeled region of one of the heterochromatic regions on the long arm of chromosome 7. These models replace the multi-megabase gaps that are in GRCh37.
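
Gap placeholders of this kind are easy to locate in a sequence, since they are simply runs of N. The sketch below is a minimal illustration: it scans a string for N runs and reports their coordinates. The toy sequence and the `find_gaps` helper are assumptions for illustration, not part of any GRC tooling, and a real chromosome would normally be read from a FASTA file rather than typed in.

```python
import re


def find_gaps(sequence, min_length=1):
    """Return (start, end) pairs for runs of N in `sequence`.

    Coordinates are 0-based and half-open, so end - start is the run length.
    """
    return [
        (match.start(), match.end())
        for match in re.finditer(r"[Nn]+", sequence)
        if match.end() - match.start() >= min_length
    ]


# Toy sequence with a 10-base gap and a 2-base gap.
toy = "ACGT" + "N" * 10 + "GG" + "NN" + "T"
print(find_gaps(toy))                # every run of N
print(find_gaps(toy, min_length=5))  # only the longer run
```

Applied to an assembly in which a centromere is represented as a gap, a large `min_length` would pick out the multi-megabase placeholder and skip short gaps.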

    As described in Miga et al., 2013, Karen and her colleagues used the whole genome shotgun (WGS) reads that were generated as part of the Venter sequencing project (Levy et al., 2007) to build centromere models (Fig. 1). They started by identifying sequence reads containing alpha-satellite centromere sequences. They then used these reads to construct models representing the approximate repeat number and order for each of the centromeric alpha-satellite higher order arrays in the genome. Because there are two copies of each centromere for each autosome, these centromere models represent an average of the two centromere copies. On the acrocentric chromosomes, where there is extreme inter-chromosomal array sequence homogeneity, the array models found in GRCh38 include data from all four acrocentric regions. The team was also able to use read pair information to link the modeled scaffold arrays to the adjacent euchromatic sequence present in the Venter assembly.
    Fig. 1
    Schematic of modeled centromere sequence. Centromeres are composed of higher order array sequences, which consist of alpha-satellite interrupted by various repeat elements (such as SINE or LINE elements), and inter-array (euchromatic) sequences.
    The model centromere sequences are not exact representations of the centromeres found in the Venter genome. The sequence diversity and complexity of these regions make constructing exact copies of each centromere impossible with current sequencing technologies. Each model represents variants and monomer ordering in proportion to that observed in the initial read database, but the long-range ordering of the repeats and the ordering of the linked euchromatic contigs represent only an inferred sequence. However, inclusion of these models in the reference assembly will be beneficial for the research community. Even for those not interested in centromere biology, it is likely that inclusion of these models will improve overall read alignments in individual re-sequencing efforts. Reads containing centromeric sequences are generated in whole genome sequencing experiments, and providing an alignment target for these reads will reduce the number of off-target alignments and unaligned reads. For those interested in centromere biology, Karen and her colleagues provide evidence in their manuscript that these models can be used to study sequence diversity in these regions.