Identify orthologous genes

1 OrthoFinder

For detailed info, see https://orthofinder.github.io/OrthoFinder/ and https://github.com/OrthoFinder/OrthoFinder.

Extract the longest transcript of each gene.

singularity shell -B "${PWD}":/data /data/softwares/agat/agat_1.5.0--pl5321hdfd78af_0.sif
agat_sp_keep_longest_isoform.pl --gff Homo_sapiens.GRCh38.v49.gencode.basic.chr.annotation.gtf -o Homo_sapiens.GRCh38.v49.gencode.basic.chr.annotation.longest_isoforms.gff3

Extract the protein sequence of each transcript.

For OrthoFinder, you should use *.asterisk.protein_sequences.fa.

#!/bin/bash -e

# uncompressed GFF3 only
annotations=(
  /data/database/annotation/release/neovision_vision/Neovision_vision.v1.v2.KIZ_NCBI.basic.chr.annotation.longest_isoforms.gff3
  /data/database/annotation/release/petaurus_breviceps/Petaurus_breviceps.v1.v3.KIZ_NCBI.basic.chr.annotation.longest_isoforms.gff3
)
# gzipped FASTA only
genomes=(
  /data/database/genome/release/neovision_vision/Neovision_vision.v1.KIZ_ShengLab.dna_sm.primary_assembly.fa.gz
  /data/database/genome/release/petaurus_breviceps/Petaurus_breviceps.v1.self.dna_sm.primary_assembly.fa.gz
)
num_threads=60
work_dir=/data/tmp/dell_tmp

mkdir -p "${work_dir}"
cd "${work_dir}"

(( ${#genomes[@]} == ${#annotations[@]} )) || {
  echo "ERROR: array lengths are not equal" >&2
  exit 1
}

for((i=0;i<${#annotations[@]};i++)); do
  genome="${genomes[i]}"
  annotation="${annotations[i]}"

  period_fa_file="${annotation%.gff3}.period.protein_sequences.fa"
  asterisk_fa_file="${annotation%.gff3}.asterisk.protein_sequences.fa"

  tmp_genome=$(basename "${genome}" .gz).tmp
  pigz -k -c -d "${genome}" > "${tmp_genome}"

  samtools faidx -@ "${num_threads}" "${tmp_genome}"
  gffread -y "${period_fa_file}" -g "${tmp_genome}" "${annotation}"
  gffread -S -y "${asterisk_fa_file}" -g "${tmp_genome}" "${annotation}"

  rm -f "${tmp_genome}"*
done

You’d better prepare species three yourself.

See https://timetree.org/home.

Run OrthoFinder.

conda activate orthofinder3
orthofinder -f input -s species.nwk