{ "cells": [ { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-636e6a2b17097f10", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Algorithmic thinking\n", "\n", "```{admonition} Interactive page\n", ":class: warning, dropdown\n", "This is an interactive book page. Press launch button at the top right side.\n", "```\n", "\n", "So far we've seen how to work with variables of different types, and how to control the flow of execution of a Python code with conditional statements and loops. \n", "With this knowledge, we can start writing Python solutions to increasingly more complex problems. \n", "In general, computer programs can perform highly complicated data processing and can consist of thousands of lines of code (not in this course, though).\n", "\n", "Usually, we talk about [algorithms](https://en.wikipedia.org/wiki/Algorithm), i.e., \"finite sequences of well-defined, computer-implementable instructions, typically to solve a class of problems or to perform a computation\". In order to develop computer code, we use what is called **algorithmic thinking**. In essence, algorithmic thinking is the **process of solving a problem in a systematic way**. Here we develop some algorithmic thinking and go through the basic steps that are required to make a computer program from scratch. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Designing algorithms\n", "\n", "Programming problems generally start with a very general description in written language of what the computer needs to do. To write code, you usually have to specify the problem better. Let's take a look at the following example:\n", "\n", "> Given a DNA sequence, translate it into the corresponding amino acid sequence.\n", "\n", "Although this may seem like a trivial task, algorithmic thinking can improve the rapid and correct implementation of this problem in computer code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Understanding the problem\n", "\n", "The first thing we do is to define more precisely what is required: we will write a Python code that translates DNA sequence into protein sequence. For instance, `ATGTTTTTC` will translate to `MFF`. In our example, we will include only five codons. \n", "\n", "However, let's consider the following:\n", "- What if the input sequence contains invalid characters (e.g., \"X\" or \"Z\")?\n", "- Are we able to split the given sequence into valid codons?\n", "- What if the sequence follows a [genetic code](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) different from the standard one?\n", "\n", "Therefore, we will need to carefully handle the input, check for valid codons, and account for differences in the genetic code across different organisms or organelles.\n", "\n", "Let's describe in words what we need to do:\n", "\n", "Define the input sequence, its origin (nucleus or mitochondria), and genetic code tables (standard and mitochondrial) for the selected five codons. Based on sequence origin, select which genetic table we'll use for translation. If sequence length is a multiple of 3 (i.e., we can split it into codons), start translating DNA sequence into a protein sequence. For valid codons, write the output amino acids into a list. If an invalid character or a stop codon is found, stop translation. Print the output for the user.\n", "\n", "*Note 1*: The list of considerations given above is not exhaustive. We could, for instance, also account for possible frameshifts, cases when the start codon `ATG` is not the first codon in the sequence, having lower-case letters in the sequence, usage of alternative start and stop codons, etc.\n", "\n", "*Note 2*: If you were working with a data file, your first step would be opening the file and figuring out what's contained in it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: A sketch of the algorithm\n", "\n", "Before writing Python code, it's good practice to sketch out what our algorithm will look like. Here we will make a flow chart. In the next section, we will explore how to use **pseudocode** for the same purpose. You can also opt for a graphical representation of the flowchart (see example of the [algorithm](../chapter1/computer-glossary.md)).\n", "\n", "```\n", "[Start] \n", " ↓\n", "[Input DNA Sequence & Organism Type]\n", " ↓\n", "[Select Genetic Code] → (Is organism 'standard'?) \n", " ↓ Yes ↓ No\n", "[Use Standard Code] [Use Mitochondrial Code]\n", " ↓\n", "[Check Sequence Length]\n", " ↓\n", "(Is Length Multiple of 3?)\n", " ↓ No ↓ Yes\n", "[Print Error: Invalid Length] [Initialize Empty Protein Sequence]\n", " ↓ ↓\n", " [End] [Start Loop Over Codons]\n", " ↓\n", " [Extract Next Codon]\n", " ↓\n", " (Contains Invalid Characters?)\n", " ↓ Yes ↓ No\n", " [Print Error: Invalid Codon] [Translate Codon]\n", " ↓ ↓\n", " [End] (Is Codon Stop Codon?)\n", " ↓ Yes ↓ No\n", " [End Translation] [Add Amino Acid]\n", " ↓\n", " (More Codons to Process?)\n", " ↓ No ↓ Yes\n", " [Print Protein] [Loop]\n", " ↓\n", " [End]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Define program outline \n", "\n", "We make a step-by-step implementation of the algorithm in Python. \n", "\n", "To that end, we first prepare a general description of the program using comments. As you know by now, this is good coding practice: if you look at the code after some time, you will very likely have forgotten everything about the code and be very happy to see a short description of what it does.\n", "\n", "As we start writing the program, we should make sure that it keeps running free of errors in all steps we make (therefore, **run parts of code regularly to check them**). This way, if errors occur we know at which step they occur, which makes it more easy to solve them. Below is a first outline of our program. Note that we will define a **test input to validate the correct operation of our program**. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "\"\"\"\n", "This code translates DNA sequence into protein sequence for the\n", "select 5 codons. We account for different origins of DNA (nucleus\n", "and mitochondria) and check that the input sequence is valid.\n", "\"\"\"\n", "\n", "# Define genetic codes for two different organisms\n", "\n", "# Define the DNA sequence and sequence origin\n", "\n", "# Select the genetic code based on the origin\n", "\n", "# Check if the sequence length is a multiple of 3\n", "\n", "# Initialize an empty list to store the protein sequence\n", "\n", "# For loop that translates DNA into protein \n", "\n", "# If invalid character/codon or stop codon in encountered, stop translation\n", "# Else add translated amino acid to protein sequence\n", "\n", "# Print the final protein sequence\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: Implementation\n", "\n", "Let's start with implementing the simple case where no errors need to be raised (there are no issues with sequence or genetic code table), and make sure that our code works for it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Define genetic codes for two different organisms\n", "standard_genetic_code = {\n", " 'ATG': 'M', # Methionine\n", " 'TTT': 'F', # Phenylalanine\n", " 'TTC': 'F', # Phenylalanine\n", " 'TAA': 'Stop',\n", " 'TAG': 'Stop'\n", "}\n", "\n", "mitochondrial_genetic_code = {\n", " 'ATG': 'M', # Methionine\n", " 'TTT': 'F', # Phenylalanine\n", " 'TTC': 'F', # Phenylalanine\n", " 'TAA': 'Stop',\n", " 'TAG': 'W' # Tryptophan (different from the standard code)\n", "}\n", "\n", "# Define the DNA sequence and sequence origin\n", "dna_sequence = \"ATGTTTTTC\"\n", "origin = 'nucleus' \n", "\n", "# Select the genetic code based on the origin\n", "if origin == 'nucleus':\n", " genetic_code = standard_genetic_code\n", "else:\n", " genetic_code = mitochondrial_genetic_code\n", "\n", "# Check if the sequence length is a multiple of 3\n", "\n", "# Initialize an empty list to store the protein sequence\n", "protein_sequence = []\n", "\n", "# For loop that translates DNA into protein \n", "for i in range(0, len(dna_sequence), 3):\n", " codon = dna_sequence[i:i+3]\n", " amino_acid = genetic_code[codon]\n", "\n", " # If invalid character/codon or stop codon in encountered, stop translation\n", " # Else add translated amino acid to protein sequence\n", " protein_sequence.append(amino_acid)\n", "\n", "# Print the final protein sequence\n", "print(\"\".join(protein_sequence)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the output is protein sequence `MFF`.\n", "\n", "Let's start adding other pieces of code. Firstly, let's assume that the sequence origin isn't \"nucleus\" nor \"mitochondria\". To account for this possibility, we can rewrite the `if-else` statement in our code into `if-elif-else`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Select the genetic code based on the origin\n", "if origin == 'nucleus':\n", " genetic_code = standard_genetic_code\n", "elif origin == 'mitochondria':\n", " genetic_code = mitochondrial_genetic_code\n", "else: \n", " raise ValueError(\"Unknown sequence origin type. Choose 'nucleus' or 'mitochondria'.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you were to plug this new `if-elif-else` statement into our code, and define an origin other than \"nucleus\" or \"mitochondria\", an error would be raised:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Define genetic codes for two different organisms\n", "standard_genetic_code = {\n", " 'ATG': 'M', # Methionine\n", " 'TTT': 'F', # Phenylalanine\n", " 'TTC': 'F', # Phenylalanine\n", " 'TAA': 'Stop',\n", " 'TAG': 'Stop'\n", "}\n", "\n", "mitochondrial_genetic_code = {\n", " 'ATG': 'M', # Methionine\n", " 'TTT': 'F', # Phenylalanine\n", " 'TTC': 'F', # Phenylalanine\n", " 'TAA': 'Stop',\n", " 'TAG': 'W' # Tryptophan (different from the standard code)\n", "}\n", "\n", "# Define the DNA sequence and sequence origin\n", "dna_sequence = \"ATGTTTTTC\"\n", "origin = 'endoplasmic reticulum' \n", "\n", "# Select the genetic code based on the origin\n", "if origin == 'nucleus':\n", " genetic_code = standard_genetic_code\n", "elif origin == 'mitochondria':\n", " genetic_code = mitochondrial_genetic_code\n", "else: \n", " raise ValueError(\"Unknown sequence origin type. Choose 'nucleus' or 'mitochondria'.\")\n", "\n", "# Check if the sequence length is a multiple of 3\n", "\n", "# Initialize an empty list to store the protein sequence\n", "protein_sequence = []\n", "\n", "# For loop that translates DNA into protein \n", "for i in range(0, len(dna_sequence), 3):\n", " codon = dna_sequence[i:i+3]\n", " amino_acid = genetic_code[codon]\n", "\n", " # If invalid character/codon or stop codon in encountered, stop translation\n", " # Else add translated amino acid to protein sequence\n", " protein_sequence.append(amino_acid)\n", "\n", "# Print the final protein sequence\n", "print(\"\".join(protein_sequence)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to account for the possibility that our sequence is not easily splittable into codons. In Python, we can code this as: \"when we divide sequence length by 3, the remainder should be 0 to start translating; otherwise raise an error\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "if len(dna_sequence) % 3 != 0:\n", " print(\"Error: DNA sequence length is not a multiple of 3.\")\n", "# else \n", " # translate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plug this into our code. We will now define an input `dna_sequence` of length 8 to see if the error will be properly raised. At the same time, because we previously verified that the selection of the genetic code table works, we can now redefine `origin` as \"nucleus\". Note that the final `print` statement now becomes indented because we only need to print the output if the translation was allowed to proceed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Define genetic codes for two different organisms\n", "standard_genetic_code = {\n", " 'ATG': 'M', # Methionine\n", " 'TTT': 'F', # Phenylalanine\n", " 'TTC': 'F', # Phenylalanine\n", " 'TAA': 'Stop',\n", " 'TAG': 'Stop'\n", "}\n", "\n", "mitochondrial_genetic_code = {\n", " 'ATG': 'M', # Methionine\n", " 'TTT': 'F', # Phenylalanine\n", " 'TTC': 'F', # Phenylalanine\n", " 'TAA': 'Stop',\n", " 'TAG': 'W' # Tryptophan (different from the standard code)\n", "}\n", "\n", "# Define the DNA sequence and sequence origin\n", "dna_sequence = \"ATGTTTTT\"\n", "origin = 'nucleus' \n", "\n", "# Select the genetic code based on the origin\n", "if origin == 'nucleus':\n", " genetic_code = standard_genetic_code\n", "elif origin == 'mitochondria':\n", " genetic_code = mitochondrial_genetic_code\n", "else: \n", " raise ValueError(\"Unknown sequence origin type. Choose 'nucleus' or 'mitochondria'.\")\n", "\n", "# Check if the sequence length is a multiple of 3\n", "if len(dna_sequence) % 3 != 0:\n", " print(\"Error: DNA sequence length is not a multiple of 3.\")\n", "else:\n", " # Initialize an empty list to store the protein sequence\n", " protein_sequence = []\n", "\n", " # For loop that translates DNA into protein \n", " for i in range(0, len(dna_sequence), 3):\n", " codon = dna_sequence[i:i+3]\n", " amino_acid = genetic_code[codon]\n", "\n", " # If invalid character/codon or stop codon in encountered, stop translation\n", " # Else add translated amino acid to protein sequence\n", " protein_sequence.append(amino_acid)\n", "\n", " # Print the final protein sequence\n", " print(\"\".join(protein_sequence)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we add the last remaining checks in our code to account for possible invalid characters in the sequence and for stop codons. As before, with each change, we test the code with informative input parameters. E.g., we can try the code for invalid characters by using `dna_sequence = \"ATGXTTTC\"` (the length has to remain divisible by 3), and the code for stop codons by using `dna_sequence = \"ATGTTTTAGTTC\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Define genetic codes for two different organisms\n", "standard_genetic_code = {\n", " 'ATG': 'M', # Methionine\n", " 'TTT': 'F', # Phenylalanine\n", " 'TTC': 'F', # Phenylalanine\n", " 'TAA': 'Stop',\n", " 'TAG': 'Stop'\n", "}\n", "\n", "mitochondrial_genetic_code = {\n", " 'ATG': 'M', # Methionine\n", " 'TTT': 'F', # Phenylalanine\n", " 'TTC': 'F', # Phenylalanine\n", " 'TAA': 'Stop',\n", " 'TAG': 'W' # Tryptophan (different from the standard code)\n", "}\n", "\n", "# Define the DNA sequence and sequence origin\n", "dna_sequence = \"ATGXTTTC\"\n", "origin = 'nucleus' \n", "\n", "# Select the genetic code based on the origin\n", "if origin == 'nucleus':\n", " genetic_code = standard_genetic_code\n", "elif origin == 'mitochondria':\n", " genetic_code = mitochondrial_genetic_code\n", "else: \n", " raise ValueError(\"Unknown sequence origin type. Choose 'nucleus' or 'mitochondria'.\")\n", "\n", "# Check if the sequence length is a multiple of 3\n", "if len(dna_sequence) % 3 != 0:\n", " print(\"Error: DNA sequence length is not a multiple of 3.\")\n", "else:\n", " # Initialize an empty list to store the protein sequence\n", " protein_sequence = []\n", "\n", " # For loop that translates DNA into protein \n", " for i in range(0, len(dna_sequence), 3):\n", "\n", " codon = dna_sequence[i:i+3]\n", "\n", " # If invalid character/codon, stop translation\n", " if codon[0] not in ['A', 'T', 'C', 'G'] \\\n", " or codon[1] not in ['A', 'T', 'C', 'G'] \\\n", " or codon[2] not in ['A', 'T', 'C', 'G']:\n", " print(\"Invalid character identified in DNA sequence\")\n", " break\n", "\n", " amino_acid = genetic_code[codon]\n", "\n", " # If stop codon if found, stop translation\n", " if amino_acid == 'Stop':\n", " break\n", " \n", " # Else add translated amino acid to protein sequence\n", " protein_sequence.append(amino_acid)\n", "\n", " # Print the final protein sequence\n", " print(\"\".join(protein_sequence)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we only allow our code to translate a codon into an amino acid with `amino_acid = genetic_code[codon]` if the codon is valid, and only allow for an amino acid to be added to a protein sequence if the amino acid is not `Stop`.\n", "\n", "Finally, we want to point out that **multiple implementations** are possible, and it's likely that following the same logic you would end up with a code that looks different than the one shown here." ] } ], "metadata": { "celltoolbar": "Create Assignment", "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "toc": { "base_numbering": "3", "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }