6.3. Algorithmic thinking#
Interactive page
This is an interactive book page. Press launch button at the top right side.
So far we’ve seen how to work with variables of different types, and how to control the flow of execution of a Python code with conditional statements and loops. With this knowledge, we can start writing Python solutions to increasingly more complex problems. In general, computer programs can perform highly complicated data processing and can consist of thousands of lines of code (not in this course, though).
Usually, we talk about algorithms, i.e., “finite sequences of well-defined, computer-implementable instructions, typically to solve a class of problems or to perform a computation”. In order to develop computer code, we use what is called algorithmic thinking. In essence, algorithmic thinking is the process of solving a problem in a systematic way. Here we develop some algorithmic thinking and go through the basic steps that are required to make a computer program from scratch.
6.3.1. Designing algorithms#
Programming problems generally start with a very general description in written language of what the computer needs to do. To write code, you usually have to specify the problem better. Let’s take a look at the following example:
Given a DNA sequence, translate it into the corresponding amino acid sequence.
Although this may seem like a trivial task, algorithmic thinking can improve the rapid and correct implementation of this problem in computer code.
6.3.1.1. Step 1: Understanding the problem#
The first thing we do is to define more precisely what is required: we will write a Python code that translates DNA sequence into protein sequence. For instance, ATGTTTTTC
will translate to MFF
. In our example, we will include only five codons.
However, let’s consider the following:
What if the input sequence contains invalid characters (e.g., “X” or “Z”)?
Are we able to split the given sequence into valid codons?
What if the sequence follows a genetic code different from the standard one?
Therefore, we will need to carefully handle the input, check for valid codons, and account for differences in the genetic code across different organisms or organelles.
Let’s describe in words what we need to do:
Define the input sequence, its origin (nucleus or mitochondria), and genetic code tables (standard and mitochondrial) for the selected five codons. Based on sequence origin, select which genetic table we’ll use for translation. If sequence length is a multiple of 3 (i.e., we can split it into codons), start translating DNA sequence into a protein sequence. For valid codons, write the output amino acids into a list. If an invalid character or a stop codon is found, stop translation. Print the output for the user.
Note 1: The list of considerations given above is not exhaustive. We could, for instance, also account for possible frameshifts, cases when the start codon ATG
is not the first codon in the sequence, having lower-case letters in the sequence, usage of alternative start and stop codons, etc.
Note 2: If you were working with a data file, your first step would be opening the file and figuring out what’s contained in it.
6.3.1.2. Step 2: A sketch of the algorithm#
Before writing Python code, it’s good practice to sketch out what our algorithm will look like. Here we will make a flow chart. In the next section, we will explore how to use pseudocode for the same purpose. You can also opt for a graphical representation of the flowchart (see example of the algorithm).
[Start]
↓
[Input DNA Sequence & Organism Type]
↓
[Select Genetic Code] → (Is organism 'standard'?)
↓ Yes ↓ No
[Use Standard Code] [Use Mitochondrial Code]
↓
[Check Sequence Length]
↓
(Is Length Multiple of 3?)
↓ No ↓ Yes
[Print Error: Invalid Length] [Initialize Empty Protein Sequence]
↓ ↓
[End] [Start Loop Over Codons]
↓
[Extract Next Codon]
↓
(Contains Invalid Characters?)
↓ Yes ↓ No
[Print Error: Invalid Codon] [Translate Codon]
↓ ↓
[End] (Is Codon Stop Codon?)
↓ Yes ↓ No
[End Translation] [Add Amino Acid]
↓
(More Codons to Process?)
↓ No ↓ Yes
[Print Protein] [Loop]
↓
[End]
6.3.1.3. Step 3: Define program outline#
We make a step-by-step implementation of the algorithm in Python.
To that end, we first prepare a general description of the program using comments. As you know by now, this is good coding practice: if you look at the code after some time, you will very likely have forgotten everything about the code and be very happy to see a short description of what it does.
As we start writing the program, we should make sure that it keeps running free of errors in all steps we make (therefore, run parts of code regularly to check them). This way, if errors occur we know at which step they occur, which makes it more easy to solve them. Below is a first outline of our program. Note that we will define a test input to validate the correct operation of our program.
"""
This code translates DNA sequence into protein sequence for the
select 5 codons. We account for different origins of DNA (nucleus
and mitochondria) and check that the input sequence is valid.
"""
# Define genetic codes for two different organisms
# Define the DNA sequence and sequence origin
# Select the genetic code based on the origin
# Check if the sequence length is a multiple of 3
# Initialize an empty list to store the protein sequence
# For loop that translates DNA into protein
# If invalid character/codon or stop codon in encountered, stop translation
# Else add translated amino acid to protein sequence
# Print the final protein sequence
6.3.1.4. Step 4: Implementation#
Let’s start with implementing the simple case where no errors need to be raised (there are no issues with sequence or genetic code table), and make sure that our code works for it.
# Define genetic codes for two different organisms
standard_genetic_code = {
'ATG': 'M', # Methionine
'TTT': 'F', # Phenylalanine
'TTC': 'F', # Phenylalanine
'TAA': 'Stop',
'TAG': 'Stop'
}
mitochondrial_genetic_code = {
'ATG': 'M', # Methionine
'TTT': 'F', # Phenylalanine
'TTC': 'F', # Phenylalanine
'TAA': 'Stop',
'TAG': 'W' # Tryptophan (different from the standard code)
}
# Define the DNA sequence and sequence origin
dna_sequence = "ATGTTTTTC"
origin = 'nucleus'
# Select the genetic code based on the origin
if origin == 'nucleus':
genetic_code = standard_genetic_code
else:
genetic_code = mitochondrial_genetic_code
# Check if the sequence length is a multiple of 3
# Initialize an empty list to store the protein sequence
protein_sequence = []
# For loop that translates DNA into protein
for i in range(0, len(dna_sequence), 3):
codon = dna_sequence[i:i+3]
amino_acid = genetic_code[codon]
# If invalid character/codon or stop codon in encountered, stop translation
# Else add translated amino acid to protein sequence
protein_sequence.append(amino_acid)
# Print the final protein sequence
print("".join(protein_sequence))
As expected, the output is protein sequence MFF
.
Let’s start adding other pieces of code. Firstly, let’s assume that the sequence origin isn’t “nucleus” nor “mitochondria”. To account for this possibility, we can rewrite the if-else
statement in our code into if-elif-else
.
# Select the genetic code based on the origin
if origin == 'nucleus':
genetic_code = standard_genetic_code
elif origin == 'mitochondria':
genetic_code = mitochondrial_genetic_code
else:
raise ValueError("Unknown sequence origin type. Choose 'nucleus' or 'mitochondria'.")
If you were to plug this new if-elif-else
statement into our code, and define an origin other than “nucleus” or “mitochondria”, an error would be raised:
# Define genetic codes for two different organisms
standard_genetic_code = {
'ATG': 'M', # Methionine
'TTT': 'F', # Phenylalanine
'TTC': 'F', # Phenylalanine
'TAA': 'Stop',
'TAG': 'Stop'
}
mitochondrial_genetic_code = {
'ATG': 'M', # Methionine
'TTT': 'F', # Phenylalanine
'TTC': 'F', # Phenylalanine
'TAA': 'Stop',
'TAG': 'W' # Tryptophan (different from the standard code)
}
# Define the DNA sequence and sequence origin
dna_sequence = "ATGTTTTTC"
origin = 'endoplasmic reticulum'
# Select the genetic code based on the origin
if origin == 'nucleus':
genetic_code = standard_genetic_code
elif origin == 'mitochondria':
genetic_code = mitochondrial_genetic_code
else:
raise ValueError("Unknown sequence origin type. Choose 'nucleus' or 'mitochondria'.")
# Check if the sequence length is a multiple of 3
# Initialize an empty list to store the protein sequence
protein_sequence = []
# For loop that translates DNA into protein
for i in range(0, len(dna_sequence), 3):
codon = dna_sequence[i:i+3]
amino_acid = genetic_code[codon]
# If invalid character/codon or stop codon in encountered, stop translation
# Else add translated amino acid to protein sequence
protein_sequence.append(amino_acid)
# Print the final protein sequence
print("".join(protein_sequence))
Next, we want to account for the possibility that our sequence is not easily splittable into codons. In Python, we can code this as: “when we divide sequence length by 3, the remainder should be 0 to start translating; otherwise raise an error”.
if len(dna_sequence) % 3 != 0:
print("Error: DNA sequence length is not a multiple of 3.")
# else
# translate
Let’s plug this into our code. We will now define an input dna_sequence
of length 8 to see if the error will be properly raised. At the same time, because we previously verified that the selection of the genetic code table works, we can now redefine origin
as “nucleus”. Note that the final print
statement now becomes indented because we only need to print the output if the translation was allowed to proceed.
# Define genetic codes for two different organisms
standard_genetic_code = {
'ATG': 'M', # Methionine
'TTT': 'F', # Phenylalanine
'TTC': 'F', # Phenylalanine
'TAA': 'Stop',
'TAG': 'Stop'
}
mitochondrial_genetic_code = {
'ATG': 'M', # Methionine
'TTT': 'F', # Phenylalanine
'TTC': 'F', # Phenylalanine
'TAA': 'Stop',
'TAG': 'W' # Tryptophan (different from the standard code)
}
# Define the DNA sequence and sequence origin
dna_sequence = "ATGTTTTT"
origin = 'nucleus'
# Select the genetic code based on the origin
if origin == 'nucleus':
genetic_code = standard_genetic_code
elif origin == 'mitochondria':
genetic_code = mitochondrial_genetic_code
else:
raise ValueError("Unknown sequence origin type. Choose 'nucleus' or 'mitochondria'.")
# Check if the sequence length is a multiple of 3
if len(dna_sequence) % 3 != 0:
print("Error: DNA sequence length is not a multiple of 3.")
else:
# Initialize an empty list to store the protein sequence
protein_sequence = []
# For loop that translates DNA into protein
for i in range(0, len(dna_sequence), 3):
codon = dna_sequence[i:i+3]
amino_acid = genetic_code[codon]
# If invalid character/codon or stop codon in encountered, stop translation
# Else add translated amino acid to protein sequence
protein_sequence.append(amino_acid)
# Print the final protein sequence
print("".join(protein_sequence))
Finally, we add the last remaining checks in our code to account for possible invalid characters in the sequence and for stop codons. As before, with each change, we test the code with informative input parameters. E.g., we can try the code for invalid characters by using dna_sequence = "ATGXTTTC"
(the length has to remain divisible by 3), and the code for stop codons by using dna_sequence = "ATGTTTTAGTTC"
.
# Define genetic codes for two different organisms
standard_genetic_code = {
'ATG': 'M', # Methionine
'TTT': 'F', # Phenylalanine
'TTC': 'F', # Phenylalanine
'TAA': 'Stop',
'TAG': 'Stop'
}
mitochondrial_genetic_code = {
'ATG': 'M', # Methionine
'TTT': 'F', # Phenylalanine
'TTC': 'F', # Phenylalanine
'TAA': 'Stop',
'TAG': 'W' # Tryptophan (different from the standard code)
}
# Define the DNA sequence and sequence origin
dna_sequence = "ATGXTTTC"
origin = 'nucleus'
# Select the genetic code based on the origin
if origin == 'nucleus':
genetic_code = standard_genetic_code
elif origin == 'mitochondria':
genetic_code = mitochondrial_genetic_code
else:
raise ValueError("Unknown sequence origin type. Choose 'nucleus' or 'mitochondria'.")
# Check if the sequence length is a multiple of 3
if len(dna_sequence) % 3 != 0:
print("Error: DNA sequence length is not a multiple of 3.")
else:
# Initialize an empty list to store the protein sequence
protein_sequence = []
# For loop that translates DNA into protein
for i in range(0, len(dna_sequence), 3):
codon = dna_sequence[i:i+3]
# If invalid character/codon, stop translation
if codon[0] not in ['A', 'T', 'C', 'G'] \
or codon[1] not in ['A', 'T', 'C', 'G'] \
or codon[2] not in ['A', 'T', 'C', 'G']:
print("Invalid character identified in DNA sequence")
break
amino_acid = genetic_code[codon]
# If stop codon if found, stop translation
if amino_acid == 'Stop':
break
# Else add translated amino acid to protein sequence
protein_sequence.append(amino_acid)
# Print the final protein sequence
print("".join(protein_sequence))
Note that we only allow our code to translate a codon into an amino acid with amino_acid = genetic_code[codon]
if the codon is valid, and only allow for an amino acid to be added to a protein sequence if the amino acid is not Stop
.
Finally, we want to point out that multiple implementations are possible, and it’s likely that following the same logic you would end up with a code that looks different than the one shown here.