Sequence Collections and Alignments#
For loading collections of unaligned or aligned sequences see Loading sequence collections from a file or url.
Basic Collection objects#
Constructing a SequenceCollection or Alignment object from strings#
Converting a SequenceCollection to FASTA format#
Adding new sequences to an existing collection or alignment#
More than one sequence can be added to a collection simultaneously. Note that add_seqs() does not modify the existing collection/alignment, it creates a new one.
The elements of a collection or alignment#
Accessing individual sequences from a collection or alignment by name#
We can get a sequence by name by indexing the .seqs attribute.
For an alignment, the result is an Aligned instance.
For the alignment case, you can get the ungapped sequence by accessing the .seq attribute of the aligned instance.
Alternatively, if you want to extract the aligned (i.e., gapped) sequence from an alignment, you can use get_gapped_seq.
To see the names of the sequences in a sequence collection, use the names attribute.
Slice the sequences from a collection like a list#
Use the .seqs attribute. We can index a single sequence
but you cannot index a slice (Use .take_seqs() for that).
Getting a subset of sequences from the alignment#
Alternatively, you can extract only the sequences which are not specified by passing negate=True:
Note
The subset contains references to the original sequences, not copies.
Writing sequences to file#
Both collection and alignment objects have a write() method. The output format is inferred from the filename suffix,
or by the format argument.
Alignments#
Creating an Alignment object from a SequenceCollection#
Convert alignment to DNA, RNA or PROTEIN moltypes#
This is useful if you’ve loaded a sequence alignment without specifying the moltype and later need to convert it using the dedicated method
Or using the generic to_moltype() method
To RNA
To PROTEIN
Handling gaps#
Remove all gaps from an alignment#
This necessarily returns a SequenceCollection.
Removing all columns with gaps in a named sequence#
Converting an alignment to FASTA format#
Converting an alignment to Phylip format#
Converting an alignment to a list of strings#
Converting an alignment to a list of arrays#
Getting an alignment as an array#
The rows are sequences and their order corresponds to that of aln.names.
Slicing an alignment#
Alignments can be thought of as a matrix, with sequences along the rows and alignment positions as the columns. However, all slicing is only along positions.
Warning
A SequenceCollection cannot be sliced!
Getting a single column from an alignment#
Getting a region of contiguous columns#
Iterating over alignment positions#
Getting codon 3rd positions#
We can use conventional slice notation. Note, because Python counts from 0, the 3rd position starts at index 2.
Filtering positions#
Trim terminal stop codons#
For evolutionary analyses that use codon models we need to exclude terminating stop codons. For the case where the sequences are all of length divisible by 3.
To detect if the alignment contains sequences not divisible by 3, use the strict argument. This argument covers both allowing partial terminating codons / not divisible by 3.
Eliminating columns with non-nucleotide characters#
We sometimes want to eliminate ambiguous or gap data from our alignments. We demonstrate how to exclude alignment columns based on the characters they contain. In the first instance, we do this just for single nucleotide columns, then for trinucleotides (equivalent for handling codons). Both are done using the no_degenerates() method.
We apply to nucleotides,
Applying the same filter to trinucleotides (specified by setting motif_length=3).
Getting all variable positions from an alignment#
Getting all constant positions from an alignment#
Getting all variable codons from an alignment#
This is done using the filtered method using the motif_length argument.
Filtering sequences#
Extracting sequences using an arbitrary function into a new alignment object#
You can use take_seqs_if to extract sequences into a new alignment object based on whether an arbitrary function applied to the sequence evaluates to True. For example, to extract sequences which don’t contain any N bases you could do the following:
You can additionally get the sequences where the provided function evaluates to False:
Computing alignment statistics#
Getting motif counts#
Motif counts are obtained from non-overlapping k-mers. (This is distinct from k-mer counting, in which they do overlap.)
We state the motif length we want and whether to allow gap or ambiguous characters. The latter only has meaning for IPUAC character sets (the DNA, RNA or PROTEIN moltypes). We illustrate this for the DNA moltype with motif lengths of 1 and 3.
Note
Only the observed motifs are returned, rather than all defined by the alphabet.
Calculating GC% for a collection#
This is just a variant of the previous example but re-expressed for unaligned sequences. Note that the returned value from the counts() method acts like a dictionary, but also has an array attribute.
Getting k-mer counts#
This only applies to sequence collections. It returns the counts of k-mers per sequence as a numpy array with shape (number of sequences, number of k-mers).
The order of elements in the numpy array corresponds to the result of
Note
We support third-party plugins for k-mer counting. After installing one, they can be selected by specifying the package with the .count_kmers(k=2, use_hook="<package name>").
Getting motif counts per position#
Note
There are also .probs_per_pos() and .entropy_per_pos() methods.
Computing motif probabilities from an alignment#
The method get_motif_probs of Alignment objects returns the probabilities for all motifs of a given length. For individual nucleotides:
For dinucleotides or longer, we need to pass in a KmerAlphabet with the appropriate word length. Here is an example with trinucleotides:
Some calculations in cogent3 require all non-zero values in the motif probabilities, in which case we use a pseudo-count. We illustrate that here with a simple example where T is missing. Without the pseudo-count, the frequency of T is 0.0, with the pseudo-count defined as 1e-6 then the frequency of T will be slightly less than 1e-6.
Note
For alignments, motif probabilities are computed by treating sequences as non-overlapping tuples. To get all possible k-mers, use the iter_kmers() method on the sequence classes.
Working with alignment gaps#
Filtering extracted columns for the gap character#
Calculating the gaps per position#
To turn that into grap fraction
Filtering alignments based on gaps#
If we want to remove positions from the alignment which are gaps in more than a certain percentage of the sequences, use the omit_gap_pos().
Note
The default for filtered_aln.omit_gap_pos() is to remove columns with gaps in all the sequences. This can occur after sequences have been removed from the alignment.