File Types

*.cgi

*.cgi is the default extension GenBank uses when GBSeq XML files are downloaded. All GenBank files must be in GBSeq XML format to work with CDSParser.

*.fasta

FASTA files are plain-text sequence files. CDSParser can output them for use in ClustalW and other programs.

key (*.txt)

Key files are output alongside the FASTA files from CDSParser. They match each sequence label in a FASTA file to the name of the sequence in the corresponding GenBank record.

*.grp

*.grp is the extension given to group files output by CDSParser. They store the groups and the delete set as specified by the user. The first line is the list of gene names not to output (the delete set). Each succeeding line is one group of names. The names are delimited by a tilde (~) because I have not been able to find any gene names on GenBank that contain a tilde. Groups are useful when the same gene appears under several names, such as "cyt b", "cyt-b", and "cytochrome b": CDSParser treats all names in a group as the same name and outputs them in the same column (in tab-delimited files). See hantavirus.grp for an example.
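The layout described above can be read with a few lines of Python. This is only a sketch, not CDSParser's own code: the function name parse_grp and the sample contents are made up for illustration, and it assumes the first line (the delete set) is tilde-delimited in the same way as the group lines.

```python
def parse_grp(text):
    """Split the text of a .grp file into (delete_set, groups).

    Assumption: the first line holds the delete set and every later
    line holds one group, all delimited by tildes.
    """
    lines = text.splitlines()
    if not lines:
        return set(), []
    # First line: gene names that should not be output.
    delete_set = {name for name in lines[0].split("~") if name}
    # Remaining non-blank lines: one group of equivalent names per line.
    groups = [
        [name for name in line.split("~") if name]
        for line in lines[1:]
        if line.strip()
    ]
    return delete_set, groups


# Hypothetical file contents, not taken from hantavirus.grp.
sample = "hypothetical protein~unknown\ncyt b~cyt-b~cytochrome b\nCOI~cox1\n"
delete_set, groups = parse_grp(sample)
```

Here delete_set holds the two names on the first line, and groups holds one list of names per remaining line.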

*.def

clustal.def and defaults.def are CDSParser files used to locate ClustalW and to specify which columns should be output in the tab-delimited files. defaults.def is provided to give the user more control over the output of CDSParser.

*.aln

This is an alignment file, one of the file types ClustalW outputs when it calculates a neighbor-joining tree.

*.tre

ClustalW can output neighbor-joining trees as Nexus trees.

*.nj

This is the format that Clustal uses for neighbor-joining trees.

*.ph

ClustalW can output neighbor-joining trees as Phylip trees.

tab-delimited

Tab-delimited files are output for use in spreadsheet programs. CDSParser outputs two types of tab-delimited files.

  • Tab-delimited files organized by record list the records in the order they were read in from the GBSeq XML file. Each record's information is output on one row (except when two or more genes fall within the same group column). Each CDS (coding sequence) may be placed in a group column, in the miscellaneous column, or nowhere (if its name is in the delete set).
  • Tab-delimited files organized by taxa are output in taxonomic order. Records may be output individually in taxonomic order, or combined to represent a synthesis of the available sequences at a certain taxonomic level (sequences are chosen based on the presence of an amino acid sequence and the length of the nucleotide sequence).
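
The three possible destinations for a CDS in the by-record files (a group column, the miscellaneous column, or nowhere) can be illustrated with a short Python sketch. It is not CDSParser's implementation: the function place_cds, the exact-match comparison, the use of a group's first name as its column label, and the sample names are all assumptions made for the example.

```python
def place_cds(name, delete_set, groups):
    """Return the column for a CDS name, or None if it is not output.

    Assumptions: names are compared exactly, and a group column is
    labeled with the first name listed in that group.
    """
    if name in delete_set:
        return None  # in the delete set: placed nowhere
    for group in groups:
        if name in group:
            return group[0]  # every name in a group shares one column
    return "miscellaneous"  # not deleted and not in any group


# Hypothetical delete set and groups.
delete_set = {"unknown"}
groups = [["cyt b", "cyt-b", "cytochrome b"], ["COI", "cox1"]]

columns = [place_cds(n, delete_set, groups)
           for n in ["cytochrome b", "cox1", "ND2", "unknown"]]
```

With these inputs, "cytochrome b" and "cox1" land in their group columns, "ND2" goes to the miscellaneous column, and "unknown" is dropped.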