1 OrthoFinder
For detailed info, see https://orthofinder.github.io/OrthoFinder/ and https://github.com/OrthoFinder/OrthoFinder.
- Extract the longest transcript of each gene.
singularity shell -B "${PWD}":/data /data/softwares/agat/agat_1.5.0--pl5321hdfd78af_0.sif
agat_sp_keep_longest_isoform.pl --gff Homo_sapiens.GRCh38.v49.gencode.basic.chr.annotation.gtf -o Homo_sapiens.GRCh38.v49.gencode.basic.chr.annotation.longest_isoforms.gff3
- Extract the protein sequence of each transcript.
For OrthoFinder, you should use *.asterisk.protein_sequences.fa.
#!/bin/bash -e
# uncompressed GFF3 only
annotations=(
/data/database/annotation/release/neovision_vision/Neovision_vision.v1.v2.KIZ_NCBI.basic.chr.annotation.longest_isoforms.gff3
/data/database/annotation/release/petaurus_breviceps/Petaurus_breviceps.v1.v3.KIZ_NCBI.basic.chr.annotation.longest_isoforms.gff3
)
# gzipped FASTA only
genomes=(
/data/database/genome/release/neovision_vision/Neovision_vision.v1.KIZ_ShengLab.dna_sm.primary_assembly.fa.gz
/data/database/genome/release/petaurus_breviceps/Petaurus_breviceps.v1.self.dna_sm.primary_assembly.fa.gz
)
num_threads=60
work_dir=/data/tmp/dell_tmp
mkdir -p "${work_dir}"
cd "${work_dir}"
(( ${#genomes[@]} == ${#annotations[@]} )) || {
echo "ERROR: array lengths are not equal" >&2
exit 1
}
for((i=0;i<${#annotations[@]};i++)); do
genome="${genomes[i]}"
annotation="${annotations[i]}"
period_fa_file="${annotation%.gff3}.period.protein_sequences.fa"
asterisk_fa_file="${annotation%.gff3}.asterisk.protein_sequences.fa"
tmp_genome=$(basename "${genome}" .gz).tmp
pigz -k -c -d "${genome}" > "${tmp_genome}"
samtools faidx -@ "${num_threads}" "${tmp_genome}"
gffread -y "${period_fa_file}" -g "${tmp_genome}" "${annotation}"
gffread -S -y "${asterisk_fa_file}" -g "${tmp_genome}" "${annotation}"
rm -f "${tmp_genome}"*
done
- You’d better prepare species three yourself.
See https://timetree.org/home.
- Run OrthoFinder.
conda activate orthofinder3
orthofinder -f input -s species.nwk