1 Key concepts
Allele: an allele is one of two or more alternative forms of a DNA sequence at the same locus on homologous chromosomes. This locus may lie within a gene or any other genetically defined region. Alleles may differ by a single nucleotide (SNP), or by larger insertions, deletions, or repeat copy number. In diploid organisms, each individual carries two alleles — one inherited maternally and one paternally — at any given autosomal locus; the combination constitutes the genotype. If both alleles are identical, the genotype is homozygous; if they differ, it is heterozygous.
SNP: a single nucleotide polymorphism, or SNP (pronounced “snip”), is a variation at a single position in a DNA sequence among individuals. By convention, the variation is classified as a SNP when more than 1% of a population carries a nucleotide other than the most common one at that position. If a SNP occurs within a gene, the gene is described as having more than one allele.
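A single nucleotide polymorphism (SNP) is defined at the population level. For a given chromosomal position, a population of N diploid human individuals contributes 2N homologous chromosome sequences. When allele frequencies are counted at that position, the position is formally classified as a SNP if two or more bases are observed and the minor allele frequency (MAF), that is, the frequency of the second most common allele, is at least 1%.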
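The MAF criterion above can be illustrated with a small calculation. This is only a sketch: the helper name `minor_allele_freq` and the input format (one unphased genotype such as `A/G` per individual, one per line) are assumptions made for the example, not part of any standard tool.

```shell
# Hypothetical helper: reads one unphased diploid genotype per line (e.g. "A/G")
# on stdin and prints the frequency of the second most common allele
# among the 2N observed alleles.
minor_allele_freq() {
  awk -F'/' '
    { count[$1]++; count[$2]++; total += 2 }
    END {
      if (total == 0) { print "NA"; exit }
      # Track the largest and second largest allele counts.
      first = 0; second = 0
      for (a in count) {
        if (count[a] > first) { second = first; first = count[a] }
        else if (count[a] > second) { second = count[a] }
      }
      printf "%.4f\n", second / total
    }'
}

# Example: N = 5 individuals give 2N = 10 alleles, 2 of which are G.
printf 'A/A\nA/G\nA/A\nG/A\nA/A\n' | minor_allele_freq
```

In this toy population the MAF is 2/10 = 0.2, well above the 1% threshold, so the position would count as a SNP.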
Genotype: the allelic constitution of an individual at a given locus, represented by the combination of the two alleles occupying the same homologous chromosomal site — one inherited from each parent. If both alleles are identical, the genotype is homozygous; if they differ, it is heterozygous.
Haplotype: a haplotype is the combination of alleles, SNPs, indels, or other polymorphisms that are physically linked on a single DNA molecule and tend to be co-inherited. A haplotype therefore describes the allelic profile along one continuous stretch of DNA, not the genotype contributed by both homologous chromosomes. At any given autosomal segment, a diploid human can carry at most two distinct haplotypes: one inherited maternally and one inherited paternally.
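To make the difference between genotype and haplotype concrete, the sketch below splits phased per-site genotypes into two haplotype strings. The helper name `phased_to_haplotypes` and the input format (one site per line, written `allele1|allele2` with `|` marking phase) are assumptions for illustration only.

```shell
# Hypothetical helper: reads phased genotypes, one site per line, as
# "allele_on_first_chromosome|allele_on_second_chromosome",
# and prints the two resulting haplotypes, one per line.
phased_to_haplotypes() {
  awk -F'|' '
    { hap1 = hap1 $1; hap2 = hap2 $2 }
    END { print hap1; print hap2 }'
}

# Three sites: heterozygous A|G, homozygous C|C, heterozygous T|A.
printf 'A|G\nC|C\nT|A\n' | phased_to_haplotypes
```

Each input line is a genotype at one locus, while each output line is a haplotype spanning all three loci on one chromosome.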
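2 Genome types provided by Ensembl
Genome types
toplevel: includes the primary assembly, alternative haplotypes, patches, and so on. In the primary assembly the GRC keeps a single "representative sequence" and adds other common haplotypes to the same genome file as extra contigs. A genome file that stores only the alternative haplotypes is also provided.
primary_assembly: excludes alternative haplotypes, patches, and so on.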
Repeat masking types
dna (unmasked): repeats are left unmasked.
dna_sm (soft-masked): repeats are masked with lowercase letters.
dna_rm (hard-masked): repeats are masked with Ns.
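The relationship between the three masking types can be sketched with two small filters. Both helper names (`soft_to_hard`, `soft_to_unmasked`) are hypothetical, and the sketch assumes FASTA text on stdin in which only repeat-masked bases are lowercase.

```shell
# Hypothetical helpers operating on FASTA text from stdin.
# Header lines (starting with ">") are passed through untouched.

# Soft-masked -> hard-masked: turn every lowercase base into N.
soft_to_hard() {
  sed '/^>/!s/[[:lower:]]/N/g'
}

# Soft-masked -> unmasked: uppercase every base.
soft_to_unmasked() {
  awk '/^>/ { print; next } { print toupper($0) }'
}

# Example on a toy soft-masked record.
printf '>1 demo\nACGTacgtNNAC\n' | soft_to_hard
printf '>1 demo\nACGTacgtNNAC\n' | soft_to_unmasked
```

Because the soft-masked file keeps the original bases, it can be turned into either of the other two forms, whereas a hard-masked file cannot be converted back.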
3 Chromosome names conversion among UCSC, Ensembl, GenBank, and RefSeq
# Download a chromosome name alias table from UCSC for a given genome version
chromToUcsc --get hg38
# This produces a file such as hg38.chromAlias.tsv for hg38
# You can use this table to convert chromosome names in your own files;
# for file formats that chromToUcsc supports,
# chromToUcsc can do the conversion directly
head hg38.chromAlias.tsv
# # ucsc assembly ensembl genbank refseq
# chr1 1 1 CM000663.2 NC_000001.11
# chr10 10 10 CM000672.2 NC_000010.11
# chr10_GL383545v1_alt HSCHR10_1_CTG1 GL383545.1 NW_003315934.1
# chr10_GL383546v1_alt HSCHR10_1_CTG2 GL383546.1 NW_003315935.1
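For files you convert yourself, the alias table can drive a simple lookup. The following is a minimal sketch, not the way chromToUcsc works internally: the helper name `convert_chrom_names` is hypothetical, and it assumes a tab-separated alias table, a tab-separated input with the chromosome name in column 1 (as in BED), and the column order shown in the header above (1 = ucsc, 3 = ensembl).

```shell
# Hypothetical helper: rewrite column 1 of a tab-separated file on stdin.
# $1 = alias table, $2 = alias column to map from, $3 = alias column to map to.
convert_chrom_names() {
  awk -F'\t' -v OFS='\t' -v from="$2" -v to="$3" '
    NR == FNR {
      # First file: build the lookup, skipping comments and empty cells.
      if ($0 !~ /^#/ && $from != "" && $to != "") map[$from] = $to
      next
    }
    /^#/ { print; next }             # keep comment lines as they are
    ($1 in map) { $1 = map[$1] }     # unknown names pass through unchanged
    { print }
  ' "$1" -
}

# Demo alias table built from the two chromosome rows shown above.
demo_alias=$(mktemp)
printf '# ucsc\tassembly\tensembl\tgenbank\trefseq\n' > "$demo_alias"
printf 'chr1\t1\t1\tCM000663.2\tNC_000001.11\n' >> "$demo_alias"
printf 'chr10\t10\t10\tCM000672.2\tNC_000010.11\n' >> "$demo_alias"

# Convert Ensembl-style names (column 3) to UCSC-style names (column 1).
printf '1\t100\t200\n10\t300\t400\n' | convert_chrom_names "$demo_alias" 3 1
```

Names missing from the table are left unchanged rather than dropped, so it is worth checking the output for unconverted names before using it downstream.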