Asian-Pan-Genome/APGp1

Asian Pan-Genome project phase 1

Welcome to the repository for APG phase 1 (APGp1).

In phase 1, we generated 320 de novo near-T2T assemblies from 160 East Asian (EAS) individuals. Detailed meta-information for each individual, with all private identifiers removed, can be found in APGp1_metadata.csv.

Repository structure

This GitHub repository primarily contains the analytical scripts and pipelines used in the APGp1 flagship study (Wu et al., unpublished), including:

  1. Genome_assembly - Genome assembly, gap-filling, polishing
  2. QC - Basic stats, QV, GCI, flagger, etc.
  3. Annotation - Repeatome, centromere, rDNA, gene annotation
  4. SV - SV decomposition (PanSVMerger), merging, comparison
  5. Pangenome_graph - Graph construction, comparison, mapping
  6. Loss_of_function - pLoF annotation and phasing
  7. Inversions - Large inversion detection
  8. Complex_loci - MHC and SMN structural haplotyping

Each folder contains its own README.md with detailed input/output specifications.


Resources

Sequencing reads and assemblies

| Data type | Accession |
| --- | --- |
| BioProject | PRJCA030428 |
| Genome Sequence Archive (Raw reads) | HRA010014 |
| Assemblies (FASTA) | PRJCA030428 |

Note: To protect participant confidentiality, assemblies and raw sequencing data are available for general scientific research through a controlled access process in accordance with relevant regulations. Applications can be submitted to the Data Access Committee of APG at NGDC (https://ngdc.cncb.ac.cn/).


Annotation

Annotations are available for each APGp1 assembly.

| Type | Format | Description |
| --- | --- | --- |
| Repeat elements | GFF | RepeatMasker annotation including SINE, LINE, LTR, etc. |
| Centromeric satellite arrays | BED | Pericentromeric and centromeric satellite annotation |
| Centromeric HORs | BED | Centromeric higher-order repeat (HOR) annotation |
| rDNA arrays | BED, FASTA | rDNA regions and individual rDNA copy sequences |
| Protein-coding genes | GFF.gz | Liftoff + Exonerate + Augustus merged annotation |
| HLA and C4 genes | GFF | Annotated using Immuannot |
| SMN structural haplotypes (sHap) | TXT | sHap assignments for 434 fully resolved SMN loci |
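
As a quick sanity check on a downloaded annotation, the feature types in a GFF file can be tallied with a few lines of Python. This is a minimal sketch, not part of the APGp1 pipelines: the function names and the file name in the usage note are hypothetical, and it only assumes the standard nine tab-separated GFF columns, with the feature type in column 3.

```python
import gzip
from collections import Counter


def count_feature_types(lines):
    """Count column-3 feature types from an iterable of GFF lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines, comments, and directives such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Ignore anything that is not a 9-column feature line
        # (for example, sequence embedded after a "##FASTA" directive).
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


def count_feature_types_in_file(path):
    """Same as count_feature_types, reading a plain or gzip-compressed file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_feature_types(handle)


# Toy input for illustration only (not real APGp1 data).
toy_gff = [
    "##gff-version 3\n",
    "chr1\tLiftoff\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "chr1\tLiftoff\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1\n",
    "chr1\tLiftoff\texon\t100\t300\t.\t+\t.\tParent=tx1\n",
    "chr1\tLiftoff\texon\t600\t900\t.\t+\t.\tParent=tx1\n",
]
toy_counts = count_feature_types(toy_gff)
```

For a real file, a call such as `count_feature_types_in_file("sample.genes.gff.gz")` (a placeholder file name) returns a `Counter` keyed by feature type.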

Pangenome Graphs

| Dataset | Method | Version | Reference | Haplotypes | Size (Gb) | Note |
| --- | --- | --- | --- | --- | --- | --- |
| APGp1 | MC | 6.1.0 | T2T-CN1 | 320 | 3.418 | access |
| APGp1 | MC | 6.1.0 | T2T-CHM13 | 320 | 3.429 | access |
| APGp1 + HPRCy1 + HGSVC3 | MC | 6.1.0 | T2T-CN1 | 540 | 3.608 | access |
| APGp1 + HPRCy1 + HGSVC3 | MC | 6.1.0 | T2T-CHM13 | 540 | 3.621 | access |
| APGp1 | MG | 0.21-r606 | T2T-CN1 | 320 | 3.594 | access |
| APGp1 | MG | 0.21-r606 | T2T-CHM13 | 320 | 3.548 | access |
| APGp1 + HPRCy1 + HGSVC3 | MG | 0.21-r606 | T2T-CN1 | 540 | 3.904 | access |
| APGp1 + HPRCy1 + HGSVC3 | MG | 0.21-r606 | GRCh38 | 540 | 3.397 | access |
| HPRCy1 | MG | 0.21-r606 | T2T-CHM13 | 94 | 3.333 | access |
| HGSVC3 | MG | 0.21-r606 | T2T-CHM13 | 130 | 3.402 | access |
| HPRCy1eas-HGSVC3eas | MG | 0.21-r606 | T2T-CHM13 | 30 | 3.183 | access |
| HPRCy1eas-HGSVC3eas | MC | 2.1.1 | T2T-CN1 | 30 | 3.202 | access |
| CPC* | MC | 2.1.1 | T2T-CHM13 | 124 | 3.285 | Gao et al., 2023 |
| HPRCy1* | MC | NA | T2T-CHM13 | 95 | 3.338 | Liao et al., 2023 |
| CPC-HPRCy1* | MC | 2.1.1 | T2T-CHM13 | 212 | 3.510 | Gao et al., 2023 |
| HPRCy1* | MG | 0.14 | T2T-CHM13 | 95 | 3.366 | Liao et al., 2023 |
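
One common way to express a graph's size is the total length of its segments. The sketch below computes that from GFA text, assuming the graph is available in GFA or rGFA form; the README does not state the distribution format or how the sizes in the table were derived, so treat both as assumptions. The function name is hypothetical.

```python
def gfa_total_segment_length(lines):
    """Sum segment lengths over the S-lines of a GFA/rGFA graph.

    Uses the sequence field when present; if the sequence is "*",
    falls back to the optional LN:i:<length> tag.
    """
    total = 0
    for line in lines:
        if not line.startswith("S\t"):
            continue  # only segment records carry sequence
        fields = line.rstrip("\n").split("\t")
        sequence = fields[2]
        if sequence != "*":
            total += len(sequence)
            continue
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                total += int(tag[len("LN:i:"):])
                break
    return total


# Toy graph for illustration only (not real APGp1 data).
toy_gfa = [
    "H\tVN:Z:1.0\n",
    "S\ts1\tACGT\n",
    "S\ts2\t*\tLN:i:10\n",
    "L\ts1\t+\ts2\t+\t0M\n",
]
toy_graph_bp = gfa_total_segment_length(toy_gfa)
```

For a whole-genome graph, pass an open file handle instead of a list so the file is streamed line by line; dividing the result by 1e9 gives a value in Gb.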

External Datasets

  • Assembly
| Assembly | Version |
| --- | --- |
| T2T-CN1 | v1.0 |
| T2T-CHM13 | v2.0 |
| GRCh38 | p14 |
| HG002 | Q100 |
| YAO | v1.1 |
| HPRCy1 | year 1 |
| HGSVC3 | phase 3 |
  • Other Datasets
| Type | File | Note | Link |
| --- | --- | --- | --- |
| Chain | CN1v1.0_hap_To_CHM13v2.0.over.chain.gz | T2T-CN1 v1.0 → T2T-CHM13 v2.0 | Download |
| Chain | CN1v1.0_hap_To_GRCh38.p14.over.chain.gz | T2T-CN1 v1.0 → GRCh38.p14 | Download |
| Chain | CHM13v2.0_To_CN1v1.0_hap.over.chain.gz | T2T-CHM13 v2.0 → T2T-CN1 v1.0 | Download |
| Chain | GRCh38.p14_To_CN1v1.0_hap.over.chain.gz | GRCh38.p14 → T2T-CN1 v1.0 | Download |
| Region | CN1v1_Easy_region.bed | Easy region in T2T-CN1 | Download |
| Region | CN1v1_CMRG_region.bed | CMRG region in T2T-CN1 | Download |
| Region | CN1v1_SD_region.bed | SD region in T2T-CN1 | Download |
| Region | CN1v1_rDNA_region.bed | rDNA arrays in T2T-CN1 | Download |
| Region | CN1v1_Centromere_region.bed | Centromere regions in T2T-CN1 | Download |
| Region | CN1v1_CentromerePlus_region.bed | Centromere+5Mb regions in T2T-CN1 | Download |
| Region | CN1v1_MHC_region.bed | MHC in T2T-CN1 | Download |
| Region | CN1v1.0_haploid.RM.out.gff | TE annotation in T2T-CN1 by RepeatMasker | Download |
| Region | CN1v1_VNTR_STR.anno | VNTR/STR in T2T-CN1 | Download |
| GeneExpression | MAGE | RNAseq MAGE dataset for 1KGP | Download |
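
To see how much of T2T-CN1 a region file covers, the BED intervals can be merged and summed. The following is a minimal sketch under the usual BED conventions (0-based, half-open intervals, with chromosome, start, and end in the first three columns); the function names are hypothetical and the toy intervals are not taken from the APGp1 files.

```python
def read_bed_intervals(lines):
    """Parse (chrom, start, end) tuples from an iterable of BED lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def covered_bases(intervals):
    """Total bases covered, counting overlapping intervals only once."""
    total = 0
    cur_chrom, cur_start, cur_end = None, 0, 0
    for chrom, start, end in sorted(intervals):
        if chrom != cur_chrom or start > cur_end:
            # Close the previous merged block and open a new one.
            total += cur_end - cur_start
            cur_chrom, cur_start, cur_end = chrom, start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    total += cur_end - cur_start
    return total


# Toy intervals for illustration only.
toy_bed = [
    "chr1\t0\t100\n",
    "chr1\t50\t150\n",
    "chr2\t10\t20\n",
]
toy_covered = covered_bases(read_bed_intervals(toy_bed))
```

With a downloaded file, `covered_bases(read_bed_intervals(open("CN1v1_MHC_region.bed")))` would report the covered length in base pairs, assuming the file follows the three-column layout described above.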

Companion Papers & Repositories

For specific analyses and methodologies developed during APGp1, please refer to the following companion studies:


Contact

Please raise issues concerning this dataset on this GitHub repository. For more information, please contact Dongya Wu (Zhejiang University) at wudongya@zju.edu.cn.
