Welcome to the repository for APG phase 1 (APGp1).
In phase 1, we generated 320 de novo near-T2T assemblies from 160 East Asian (EAS) individuals. Detailed metadata for each individual, with all private identifiers removed, can be found in APGp1_metadata.csv.
This GitHub repository primarily contains the analytical scripts and pipelines used in the APGp1 flagship study (Wu et al., unpublished), including:
- Genome_assembly - Genome assembly, gap-filling, polishing
- QC - Basic stats, QV, GCI, flagger, etc.
- Annotation - Repeatome, centromere, rDNA, gene annotation
- SV - SV decomposition (PanSVMerger), merging, comparison
- Pangenome_graph - Graph construction, comparison, mapping
- Loss_of_function - pLoF annotation and phasing
- Inversions - Large inversion detection
- Complex_loci - MHC and SMN structural haplotyping
Each folder contains its own README.md with detailed input/output specifications.
| Data type | Accession |
|---|---|
| BioProject | PRJCA030428 |
| Genome Sequence Archive (Raw reads) | HRA010014 |
| Assemblies (FASTA) | PRJCA030428 |
Note: To protect participant confidentiality, assemblies and raw sequencing data are available for general scientific research through a controlled access process in accordance with relevant regulations. Applications can be submitted to the Data Access Committee of APG at NGDC (https://ngdc.cncb.ac.cn/).
Annotations are available for each APGp1 assembly.
| Type | Format | Description |
|---|---|---|
| Repeat elements | GFF | RepeatMasker annotation including SINE, LINE, LTR, etc. |
| Centromeric satellite arrays | BED | Pericentromeric and centromeric satellite annotation |
| Centromeric HORs | BED | Centromeric higher-order repeat (HOR) annotation |
| rDNA arrays | BED, FASTA | rDNA regions and individual rDNA copy sequences |
| Protein-coding genes | GFF.gz | Liftoff + Exonerate + Augustus merged annotation |
| HLA and C4 genes | GFF | Annotated using Immuannot |
| SMN structural haplotypes (sHap) | TXT | sHap assignments for 434 fully resolved SMN loci |
| Dataset | Method | Version | Reference | No. of haplotypes | Size (Gb) | Note |
|---|---|---|---|---|---|---|
| APGp1 | MC | 6.1.0 | T2T-CN1 | 320 | 3.418 | access |
| APGp1 | MC | 6.1.0 | T2T-CHM13 | 320 | 3.429 | access |
| APGp1 + HPRCy1 + HGSVC3 | MC | 6.1.0 | T2T-CN1 | 540 | 3.608 | access |
| APGp1 + HPRCy1 + HGSVC3 | MC | 6.1.0 | T2T-CHM13 | 540 | 3.621 | access |
| APGp1 | MG | 0.21-r606 | T2T-CN1 | 320 | 3.594 | access |
| APGp1 | MG | 0.21-r606 | T2T-CHM13 | 320 | 3.548 | access |
| APGp1 + HPRCy1 + HGSVC3 | MG | 0.21-r606 | T2T-CN1 | 540 | 3.904 | access |
| APGp1 + HPRCy1 + HGSVC3 | MG | 0.21-r606 | GRCh38 | 540 | 3.397 | access |
| HPRCy1 | MG | 0.21-r606 | T2T-CHM13 | 94 | 3.333 | access |
| HGSVC3 | MG | 0.21-r606 | T2T-CHM13 | 130 | 3.402 | access |
| HPRCy1eas-HGSVC3eas | MG | 0.21-r606 | T2T-CHM13 | 30 | 3.183 | access |
| HPRCy1eas-HGSVC3eas | MC | 2.1.1 | T2T-CN1 | 30 | 3.202 | access |
| CPC* | MC | 2.1.1 | T2T-CHM13 | 124 | 3.285 | Gao et al., 2023 |
| HPRCy1* | MC | NA | T2T-CHM13 | 95 | 3.338 | Liao et al., 2023 |
| CPC-HPRCy1* | MC | 2.1.1 | T2T-CHM13 | 212 | 3.510 | Gao et al., 2023 |
| HPRCy1* | MG | 0.14 | T2T-CHM13 | 95 | 3.366 | Liao et al., 2023 |
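
As a rough way to sanity-check a downloaded graph against the "Size (Gb)" column, the sketch below sums segment lengths in a GFA file. It assumes the graphs are distributed as GFA (the usual output of minigraph and Minigraph-Cactus) and that the reported size corresponds to total segment length; neither is stated here, so treat both as assumptions. The function name `gfa_total_segment_length` is hypothetical and not part of this repository.

```python
from typing import Iterable


def gfa_total_segment_length(lines: Iterable[str]) -> int:
    """Sum the lengths of all segment (S) records in GFA text lines.

    Assumption: GFA 1.x layout, i.e. 'S<TAB>name<TAB>sequence[<TAB>tags...]'.
    When the sequence is '*', the length is taken from an 'LN:i:' tag if present.
    """
    total = 0
    for line in lines:
        if not line.startswith("S\t"):
            continue  # skip headers, links, paths, walks, etc.
        fields = line.rstrip("\n").split("\t")
        sequence = fields[2]
        if sequence != "*":
            total += len(sequence)
            continue
        # Sequence omitted: fall back to the optional length tag.
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                total += int(tag[len("LN:i:"):])
                break
    return total


# Tiny in-memory example; for a real file, pass an open file handle instead.
example_gfa = [
    "H\tVN:Z:1.0",
    "S\ts1\tACGT",
    "S\ts2\t*\tLN:i:10",
    "L\ts1\t+\ts2\t+\t0M",
]
print(gfa_total_segment_length(example_gfa))
```

Dividing the result by 1e9 gives a value in Gb; small differences from the table would be expected if the reported sizes were computed differently.
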
- Assembly
| Assembly | Version |
|---|---|
| T2T-CN1 | v1.0 |
| T2T-CHM13 | v2.0 |
| GRCh38 | p14 |
| HG002 | Q100 |
| YAO | v1.1 |
| HPRCy1 | year 1 |
| HGSVC3 | phase 3 |
- Other Datasets
| Type | File | Note | Link |
|---|---|---|---|
| Chain | CN1v1.0_hap_To_CHM13v2.0.over.chain.gz | T2T-CN1 v1.0 → T2T-CHM13 v2.0 | Download |
| Chain | CN1v1.0_hap_To_GRCh38.p14.over.chain.gz | T2T-CN1 v1.0 → GRCh38.p14 | Download |
| Chain | CHM13v2.0_To_CN1v1.0_hap.over.chain.gz | T2T-CHM13 v2.0 → T2T-CN1 v1.0 | Download |
| Chain | GRCh38.p14_To_CN1v1.0_hap.over.chain.gz | GRCh38.p14 → T2T-CN1 v1.0 | Download |
| Region | CN1v1_Easy_region.bed | Easy region in T2T-CN1 | Download |
| Region | CN1v1_CMRG_region.bed | CMRG region in T2T-CN1 | Download |
| Region | CN1v1_SD_region.bed | SD region in T2T-CN1 | Download |
| Region | CN1v1_rDNA_region.bed | rDNA arrays in T2T-CN1 | Download |
| Region | CN1v1_Centromere_region.bed | Centromere regions in T2T-CN1 | Download |
| Region | CN1v1_CentromerePlus_region.bed | Centromere+5Mb regions in T2T-CN1 | Download |
| Region | CN1v1_MHC_region.bed | MHC in T2T-CN1 | Download |
| Region | CN1v1.0_haploid.RM.out.gff | TE annotation in T2T-CN1 by RepeatMasker | Download |
| Region | CN1v1_VNTR_STR.anno | VNTR/STR in T2T-CN1 | Download |
| GeneExpression | MAGE RNAseq | MAGE dataset for 1KGP | Download |
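
The region files listed above are BED files, so a quick check after download is to total the bases each one covers. The following is a minimal sketch that assumes standard BED conventions (whitespace-separated fields, 0-based half-open coordinates); the helper name `total_bed_length` is hypothetical, and the actual column layout of the CN1v1 files should be confirmed before relying on it.

```python
from typing import Dict, Iterable


def total_bed_length(lines: Iterable[str]) -> Dict[str, int]:
    """Sum interval lengths per chromosome from BED text lines.

    Uses only the first three fields (chrom, start, end). Overlapping
    intervals are NOT merged, so overlaps are counted more than once.
    """
    totals: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional comment/track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        totals[chrom] = totals.get(chrom, 0) + int(end) - int(start)
    return totals


# Tiny in-memory example; for a real file, pass an open file handle instead.
example_bed = [
    "chr6\t100\t250",
    "chr6\t300\t400",
    "chr5\t0\t10",
]
print(total_bed_length(example_bed))
```

To apply it to a downloaded file, something like `with open("CN1v1_MHC_region.bed") as fh: total_bed_length(fh)` should work, provided the file follows the conventions assumed above.
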
For specific analyses and methodologies developed during APGp1, please refer to the following companion studies:
- Centromere (Sun et al., unpublished)
- Archaic introgression (Suo et al., unpublished)
  - New method: ASMaid
- Y chromosome (Liu et al., unpublished)
- Complex regions (Han et al., unpublished)
- Tibetan pangenome (He et al., 2025, bioRxiv)
- Schizophrenia pangenome study (Yang et al., unpublished)
Please raise issues concerning this dataset on this GitHub repository. For more information, please contact Dongya Wu (Zhejiang University) at wudongya@zju.edu.cn.
