Which Version of the Human Genome Should I Use?
Written by John SantaLucia, Jr.
Updated 5-5-2024
Sequencing the human genome took an international effort spanning roughly 35 years. It started in the late 1980s, and the first rough draft (<90% complete) was published in 2001 [1,2]. Over the next 12 years, improved assemblies were released in 2009 (GRCh37) and 2013 (GRCh38) [3,4]. GRCh38 was 92% complete and was widely used in studies of human disease and variation. Finally, in 2022, the Telomere-to-Telomere (T2T) consortium finished the first truly complete human genome sequence [5-12]. Because of this history, a vast body of literature uses coordinates from the older sequences. For example, many exome sequencing databases and SNP databases (such as dbSNP and COSMIC) use the older GRCh37 and GRCh38 versions. Thus, a common question that we get is “Which version of the human genome should I use?”
Table 1 gives the RefSeq accessions for the three most widely used versions of the human genome. RefSeq accessions are preferred over GenBank accessions because they also contain the full annotation information for genes, ncRNAs, and other features (though the T2T version of the mitochondrial genome has not yet been deposited in RefSeq). The T2T version has no ambiguity codes, no gaps, and corrects previous scaffold mistakes. Table 2 gives the percentage of ambiguity codes for each chromosome in the three versions of the human genome (data compiled using ThermoSleuth from DNA Software, Inc.). Table 3 gives different metrics of completion for the three versions [3]. Table 4 gives a detailed comparison of the T2T and GRCh38 versions [5].
We strongly encourage our users to use the T2T version, which is the first truly complete human genome sequence. Both ThermoSleuth and PanelPlex support all three major versions of the human genome (Table 1). If you need to convert coordinates between genome versions, see the links in the “Additional resources” section.
Additional resources:
- Website with information about annotations in the T2T genome: https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Homo_sapiens/GCF_009914755.1-RS_2023_03/
- Useful website for converting between different genome versions (this is fine if you just have a few conversions to do): http://genome.ucsc.edu/cgi-bin/hgLiftOver
- Useful site that describes several different software packages for high-throughput conversion of many coordinates between different genome versions: https://www.biostars.org/p/65558/
- Website that describes the differences between hg19 and hg38: http://seqanswers.com/forums/showthread.php?t=75570
- Website for converting a chromosome location to transcript oriented position: https://mutalyzer.nl/position-converter
- Other tools: Bowtie2, Picard Tools, SAMtools
Table 1: RefSeq Accessions for Three Versions of the Human Genome
| Chromosome | GRCh37.p13 (hg19) | GRCh38.p7 (hg38) | T2T (CHM13v2.0) |
|---|---|---|---|
| Release date | 2-27-2009 | 12-17-2013 | 4-1-2022 |
| 1 | NC_000001.10 | NC_000001 or NC_000001.11 | NC_060925.1 |
| 2 | NC_000002.11 | NC_000002 or NC_000002.12 | NC_060926.1 |
| 3 | NC_000003.11 | NC_000003 or NC_000003.12 | NC_060927.1 |
| 4 | NC_000004.11 | NC_000004 or NC_000004.12 | NC_060928.1 |
| 5 | NC_000005.9 | NC_000005 or NC_000005.10 | NC_060929.1 |
| 6 | NC_000006.11 | NC_000006 or NC_000006.12 | NC_060930.1 |
| 7 | NC_000007.13 | NC_000007 or NC_000007.14 | NC_060931.1 |
| 8 | NC_000008.10 | NC_000008 or NC_000008.11 | NC_060932.1 |
| 9 | NC_000009.11 | NC_000009 or NC_000009.12 | NC_060933.1 |
| 10 | NC_000010.10 | NC_000010 or NC_000010.11 | NC_060934.1 |
| 11 | NC_000011.9 | NC_000011 or NC_000011.10 | NC_060935.1 |
| 12 | NC_000012.11 | NC_000012 or NC_000012.12 | NC_060936.1 |
| 13 | NC_000013.10 | NC_000013 or NC_000013.11 | NC_060937.1 |
| 14 | NC_000014.8 | NC_000014 or NC_000014.9 | NC_060938.1 |
| 15 | NC_000015.9 | NC_000015 or NC_000015.10 | NC_060939.1 |
| 16 | NC_000016.9 | NC_000016 or NC_000016.10 | NC_060940.1 |
| 17 | NC_000017.10 | NC_000017 or NC_000017.11 | NC_060941.1 |
| 18 | NC_000018.9 | NC_000018 or NC_000018.10 | NC_060942.1 |
| 19 | NC_000019.9 | NC_000019 or NC_000019.10 | NC_060943.1 |
| 20 | NC_000020.10 | NC_000020 or NC_000020.11 | NC_060944.1 |
| 21 | NC_000021.8 | NC_000021 or NC_000021.9 | NC_060945.1 |
| 22 | NC_000022.10 | NC_000022 or NC_000022.11 | NC_060946.1 |
| X | NC_000023.10 | NC_000023 or NC_000023.11 | NC_060947.1 |
| Y | NC_000024.9 | NC_000024 or NC_000024.10 | NC_060948.1 |
| mitochondria | NC_012920.1 | NC_012920.1 | CP068254.1 or NC_012920.1 |
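The accessions in Table 1 follow a regular pattern, so they can be generated rather than typed by hand. The sketch below is a minimal illustration, not part of ThermoSleuth or PanelPlex: the function name `refseq_accession` and the assembly labels `"GRCh37"`, `"GRCh38"`, and `"T2T"` are hypothetical. It relies only on what Table 1 shows: each GRCh38.p7 version number is one higher than the GRCh37.p13 version, and the T2T accessions run consecutively from NC_060925.1 (chromosome 1) to NC_060948.1 (chromosome Y). The mitochondrial genome is left out because it does not follow this pattern.

```python
# Chromosome order used in Table 1: 1-22, then X (23) and Y (24).
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]

# GRCh37.p13 accession versions copied from Table 1
# (e.g. chromosome 1 is NC_000001.10, chromosome 14 is NC_000014.8).
GRCH37_VERSIONS = dict(zip(CHROMOSOMES, [
    10, 11, 11, 11, 9, 11, 13, 10, 11, 10, 9, 11,
    10, 8, 9, 9, 10, 9, 9, 10, 8, 10, 10, 9,
]))


def refseq_accession(chromosome, assembly):
    """Return the Table 1 RefSeq accession for a nuclear chromosome.

    `assembly` is one of the hypothetical labels "GRCh37", "GRCh38", "T2T".
    """
    chrom = str(chromosome).upper()
    if chrom not in GRCH37_VERSIONS:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    index = CHROMOSOMES.index(chrom) + 1  # 1..24
    if assembly == "GRCh37":
        return f"NC_{index:06d}.{GRCH37_VERSIONS[chrom]}"
    if assembly == "GRCh38":
        # In Table 1 every GRCh38.p7 version is the GRCh37.p13 version plus one.
        return f"NC_{index:06d}.{GRCH37_VERSIONS[chrom] + 1}"
    if assembly == "T2T":
        # T2T accessions are consecutive: NC_060925.1 is chromosome 1.
        return f"NC_{60924 + index:06d}.1"
    raise ValueError(f"unknown assembly: {assembly!r}")


if __name__ == "__main__":
    for assembly in ("GRCh37", "GRCh38", "T2T"):
        print(assembly, refseq_accession("7", assembly))
```

Because the pattern is inferred from the table alone, it should be treated as a convenience for these three specific releases; later patch releases may carry different version suffixes.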
Table 2: Percent ambiguity in different versions of the human genome
(data from ThermoSleuth, DNA Software, Inc.)
| Chromosome | GRCh37.p13 (hg19) | GRCh38.p7 (hg38) | T2T (CHM13) |
|---|---|---|---|
| 1 | 9.62 | 7.42 | 0 |
| 2 | 2.05 | 0.68 | 0 |
| 3 | 1.63 | 0.10 | 0 |
| 4 | 1.83 | 0.24 | 0 |
| 5 | 1.78 | 0.15 | 0 |
| 6 | 2.17 | 0.43 | 0 |
| 7 | 2.38 | 0.24 | 0 |
| 8 | 2.37 | 0.26 | 0 |
| 9 | 14.92 | 12.00 | 0 |
| 10 | 3.11 | 0.40 | 0 |
| 11 | 2.87 | 0.41 | 0 |
| 12 | 2.52 | 0.10 | 0 |
| 13 | 17.00 | 14.32 | 0 |
| 14 | 17.76 | 15.39 | 0 |
| 15 | 20.32 | 17.01 | 0 |
| 16 | 12.69 | 9.45 | 0 |
| 17 | 4.19 | 0.41 | 0 |
| 18 | 4.38 | 0.35 | 0 |
| 19 | 5.62 | 0.30 | 0 |
| 20 | 5.69 | 0.78 | 0 |
| 21 | 27.06 | 14.18 | 0 |
| 22 | 31.99 | 22.94 | 0 |
| X | 2.69 | 0.74 | 0 |
| Y | 55.79 | 53.84 | 0 |
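A percent-ambiguity figure like those in Table 2 can be approximated for any FASTA sequence with a few lines of code. The sketch below is not the ThermoSleuth method. It assumes that "ambiguity" means any character other than A, C, G, or T (in reference assemblies this is almost always N), and the function names `percent_ambiguity` and `fasta_records` are hypothetical.

```python
def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def percent_ambiguity(sequence):
    """Return the percentage of characters that are not A, C, G, or T.

    Assumption: every non-ACGT character (N and other IUPAC codes)
    counts as an ambiguity. Case is ignored, so soft-masked bases count
    as unambiguous.
    """
    if not sequence:
        raise ValueError("sequence is empty")
    upper = sequence.upper()
    unambiguous = sum(upper.count(base) for base in "ACGT")
    return 100.0 * (len(upper) - unambiguous) / len(upper)


if __name__ == "__main__":
    # Tiny made-up example; a real run would pass an open FASTA file instead.
    demo = [">demo_chromosome", "ACGTNNNN", "acgtacgt"]
    for name, seq in fasta_records(demo):
        print(f"{name}: {percent_ambiguity(seq):.2f}% ambiguous")
```

For whole chromosomes, the same functions can be fed an open file handle, for example `fasta_records(open(path))`, since the parser only needs an iterable of lines. The sequence is held in memory, which is workable for a single human chromosome but may call for a streaming count on larger inputs.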
Table 3: Completion metrics for different versions of the human genome [3]
| Metric | GRCh37.p13 (hg19) | GRCh38.p7 (hg38) | T2T (CHM13) |
|---|---|---|---|
| Number of ambiguities | 234350281 | 150630719 | 0 |
| Unplaced scaffolds | 39 | 127 | 0 |
| Collapsed repeats | Yes | Yes | No |
| Percent Completion | 90% | 92% | 100% |
Table 4: Comparison of GRCh38 and T2T versions of the human genome [5]
References:
1. International Human Genome Sequencing Consortium, “Initial sequencing and analysis of the human genome” Nature 409, 860–921 (2001).
2. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., et al., “The sequence of the human genome” Science 291, 1304–1351 (2001).
3. Guo, Y., Dai, Y., Yu, H., Zhao, S., Samuels, D.C., Shyr, Y., “Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis” Genomics 109, 83–90 (2017).
4. Schneider, V.A., Graves-Lindsay, T., Howe, K., Bouk, N., Chen, H.-C., et al., “Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly” Genome Res. 27, 849–864 (2017).
5. Nurk, S. et al., “The complete sequence of a human genome” Science 376, 44–53 (2022).
6. Pennisi, E., “Most complete human genome yet is revealed” Science 376, 15–16 (2022).
7. Altemose et al., “Complete genomic and epigenetic maps of human centromeres” Science 376, 56–66 (2022).
8. Hoyt et al., “From telomere to telomere: The transcriptional and epigenetic state of human repeat elements” Science 376, eabk3112 (2022).
9. Gershman et al., “Epigenetic patterns in a complete human genome” Science 376, eabj5089 (2022).
10. Aganezov et al., “A complete reference genome improves analysis of human genetic variation” Science 376, eabl3533 (2022).
11. Vollger et al., “Segmental duplications and their variation in a complete human genome” Science 376, 55–66 (2022).
12. Church, D.M., “A next-generation human genome sequence” Science 376, 34–35 (2022).
