pharokka creates a number of output files in different formats.
Main Output
The main output is a .gff file in GFF3 format, which is suitable for use in downstream pangenomic pipelines such as Roary or Panaroo to generate pangenomes.
- The 'phrog=' attribute shows the closest matching PHROG, or "No_PHROG" if there are no matching PHROGs below the E-value threshold. The 'top_hit=' attribute shows the closest matching protein in the PHROGs database.
- The 6th column of the .gff file is the PHANOTATE w_orf score (the more negative, the more likely the CDS is a gene) or the Prodigal coding score (the larger, the more likely the CDS is a gene). Please see the PHANOTATE and Prodigal papers for more details.
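As a rough illustration of how the fields described above could be read, the sketch below splits one GFF3 feature line and pulls out the score (column 6) and the 'phrog=' and 'top_hit=' attributes (column 9). The function name `parse_gff_line` and the example line are hypothetical and are not taken from real pharokka output; only the attribute keys and the column position come from the description above.

```python
def parse_gff_line(line):
    """Split one GFF3 feature line into a dict, expanding the column 9 attributes.

    Only handles feature lines (9 tab-separated columns), not '#' comment lines.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")

    # Column 9 holds semicolon-separated key=value pairs.
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()

    # Column 6 is the score; GFF3 uses "." when no score is given.
    score = None if cols[5] == "." else float(cols[5])

    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "score": score,
        "strand": cols[6],
        "attributes": attributes,
    }


# Hypothetical example line (all values are made up for illustration).
example = (
    "contig_1\tPHANOTATE\tCDS\t100\t400\t-1.5\t+\t0\t"
    "ID=ABC_CDS_0001;phrog=No_PHROG;top_hit=No_PHROG"
)
record = parse_gff_line(example)
print(record["score"], record["attributes"]["phrog"])
```

For a real file, the same function could be applied to every line that does not start with `#`; a dedicated GFF parser is likely a better choice for anything beyond a quick check.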
Other Files
- A .gbk GenBank-formatted file, which is converted from the .gff file.
- A .log file, which holds all logging output. It is time-stamped in the "%m%d%Y_%H%M%S" format.
- A .tbl file, which is a flat-file table suitable for upload to NCBI's BankIt.
- A _cds_functions.tsv file, which includes counts of CDSs, tRNAs, CRISPRs and tmRNAs, and the functions assigned to CDSs according to the PHROGs database.
- A _length_gc_cds_density.tsv file, which reports the phage's length, GC percentage, translation table and CDS coding density.
- phanotate.ffn or prodigal.ffn, which holds the nucleotide sequences of all predicted CDSs.
- phanotate.faa or prodigal.faa, which holds the amino acid sequences of all predicted CDSs.
- _aragorn.txt and _aragorn.gff files, which hold the raw and parsed output from Aragorn, respectively.
- _minced_spacers.txt and _minced.gff, which hold the output from MinCED.
- A _trnascan.gff file, which holds the output from tRNAscan-SE 2.
- A _cds_final_merged_output.tsv file, which gives the parsed output from MMseqs2 and PyHMMER. In general, using the default E-value threshold of 1E-05:
    - MMseqs2 or PyHMMER should identify a PHROG for most CDSs if the genome is a phage, while small (80-200 bp) hypothetical proteins will often have no matching PHROG. This may also be the case for phages from uncommon sources where few phages have been isolated (such as environmental samples). pharokka should be used as a rough guide only in these cases.
    - The PHROG for each gene will be found in column R, with the annotations and PHROG category in columns V and W.
    - As of v1.4.0, the MMseqs2 PHROG will be preferred to the PyHMMER PHROG in the rare case of disagreement between the two methods.
    - The 'score' column F contains the PHANOTATE score for each CDS. In general, the closer the score is to 0, the smaller the CDS and the more likely that a PHROG will not be identified by MMseqs2.
- A top_hits_card.tsv file, which contains any CARD database hits.
- A top_hits_vfdb.tsv file, which contains any VFDB database hits.
- A terL.ffn file, which contains the nucleotide sequences of all identified large terminase subunit (terL) CDSs.
- A terL.faa file, which contains the amino acid sequences of all identified large terminase subunit (terL) CDSs.
- A _top_hits_mash_inphared.tsv file, which from v1.2.0 holds the top hits of the INPHARED search.
- Optionally, if you reorient the input contig using --terminase, a _genome_terminase_reoriented.fasta file containing the reoriented genome in FASTA format.
- Optionally, if you specify -s or split mode, folders called single_gbks, single_gffs and single_fastas will be created, containing GenBank, GFF and FASTA files for each respective input contig, named by the contig header.
- Optionally, if you specify --dnaapler, a _dnaapler_reoriented.fasta file containing the reoriented genome in FASTA format, and a dnaapler directory containing the output from Dnaapler.
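The column letters given above for the _cds_final_merged_output.tsv file (F for the score, R for the PHROG, W for the PHROG category) can be turned into positions for a quick summary. The sketch below is one possible way to do this with the Python standard library. It assumes the file has a single header row and that CDSs without a match carry the value "No_PHROG" in column R, as in the .gff file; both are assumptions, and the helper names and sample rows are hypothetical.

```python
import csv
import io


def column_index(letter):
    """Convert a spreadsheet column letter (A, B, ..., Z, AA, ...) to a 0-based index."""
    index = 0
    for char in letter.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def summarise_phrogs(handle, phrog_col="R", category_col="W", missing="No_PHROG"):
    """Count CDSs per PHROG category and CDSs with no PHROG in a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    phrog_i = column_index(phrog_col)
    category_i = column_index(category_col)
    counts = {}
    unassigned = 0
    for row in reader:
        if row[phrog_i] == missing:
            unassigned += 1
        counts[row[category_i]] = counts.get(row[category_i], 0) + 1
    return counts, unassigned


def make_row(phrog, category):
    """Build a made-up 23-column row with only columns R and W filled in."""
    row = ["."] * 23
    row[column_index("R")] = phrog
    row[column_index("W")] = category
    return "\t".join(row)


# Hypothetical table: placeholder header names and two made-up CDS rows.
header = "\t".join(f"col_{i}" for i in range(23))
sample = "\n".join(
    [header, make_row("1234", "tail"), make_row("No_PHROG", "unknown function")]
) + "\n"

counts, unassigned = summarise_phrogs(io.StringIO(sample))
print(counts, unassigned)
```

With a real run, the in-memory sample would be replaced by an open file handle, for example `open(path, newline="")`. Indexing by position keeps the sketch independent of the exact header names, at the cost of breaking if the column order changes between versions.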
For more information about PHROGs please consult the website https://phrogs.lmge.uca.fr and paper https://doi.org/10.1093/nargab/lqab067.