'''
Central dogma

In this exercise, you will work with genomic sequences using dictionaries, 
lists and strings.

Specifically, you will write a function that compares two DNA sequences and 
checks for certain types of mutations.
'''

'''
Transcription

TFIID is a protein complex essential for the initiation of transcription by RNA
Polymerase II. It contains multiple subunits, including the TATA-binding protein
(TBP). TBP binds specifically to the TATA box, a DNA sequence typically found in
the promoter region of many genes. This binding helps position TFIID at the
correct location on the DNA, which in turn helps recruit other transcription
factors and RNA Polymerase II to initiate transcription.
 
Usually, the TATA box sits about 25 nucleotides upstream of the transcription
start site. In this exercise, the transcription start site lies exactly 25
nucleotides after the first T of the TATA box. An unknown number of nucleotides
later, you will find the start codon.

RNA polymerase reads the template strand from 3' to 5' and builds the RNA from
5' to 3'. In bioinformatics, the convention is to store genomic data from 5' to
3'. You may assume that the given sequence is the coding (non-template) strand,
so the transcript has the same sequence with U in place of T.
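
As a small illustration of the strand convention (a sketch on a made-up
6-nucleotide sequence; the helper name `reverse_complement` is illustrative and
not part of the exercise): the template strand is the reverse complement of the
coding strand, and the transcript equals the coding strand with U instead of T.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    pairs = str.maketrans("ACGT", "TGCA")  # A<->T, C<->G
    return seq.translate(pairs)[::-1]

coding = "ATGGCA"                       # made-up coding strand, 5' to 3'
template = reverse_complement(coding)   # template strand, 5' to 3'
transcript = coding.replace("T", "U")   # RNA matches the coding strand

print(template)    # TGCCAT
print(transcript)  # AUGGCA
```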

Sequence as provided by us: 
'TTGTGATATAGGTACCAGTCACGTTGACGTAGTCTAGCTAGCATGTCAAGCACTTGAA'.

See also the attached image.
'''

DNA = 'TTGTGATATAGGTACCAGTCACGTTGACGTAGTCTAGCTAGCATGTCAAGCACTTGAA'


'''
Exercise A: Transcription

Write a function `transcribe` that takes a DNA sequence (as a string), locates
the TATA box and transcribes the gene from the transcription start site onwards,
until the end of the given sequence. The output should be a string.
The function should not contain any `for` loops.

* Example input: 'TTGTGATATAGGTACCAGTCACGTTGACGTAGTCTAGCTAGCATGTCAAGCACTTGAA'
* Example output: 'GUCUAGCUAGCAUGUCAAGCACUUGAA'

Hints:
- `str.find` may be useful for finding the TATA box.
- `str.replace` may be useful for replacing T with U.
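
A minimal sketch of both hints on a made-up sequence (the motif `GGCC` and the
variable names are illustrative only, not the solution):

```python
seq = "AAGGCCTTAC"             # made-up sequence
start = seq.find("GGCC")       # index of the first match, or -1 if absent
tail = seq[start + 4:]         # slice from just after the 4-letter motif
rna_like = tail.replace("T", "U")

print(start)     # 2
print(tail)      # TTAC
print(rna_like)  # UUAC
print(seq.find("TTTT"))  # -1, so check for this before slicing
```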
'''

def transcribe(DNA: str) -> str:

    RNA = '' # Initialise an empty transcript

    # Your code here

    return RNA


'''
Translation

Next, the mRNA is translated in the ribosome, where all triplets are paired to 
the anticodons of charged tRNA. Note that translation only starts after the 
start codon AUG, which also corresponds to the first amino acid of the 
polypeptide: methionine. Since there are 64 possible codons and only 20 amino 
acids, we know that several codons can map onto the same amino acid.
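
The count of 64 follows from 4 bases at each of 3 positions (4 ** 3). A short
check, assuming nothing beyond the standard library and the three stop codons
UAA, UAG and UGA:

```python
from itertools import product

codons = ["".join(bases) for bases in product("ACGU", repeat=3)]
stop_codons = {"UAA", "UAG", "UGA"}
sense_codons = [c for c in codons if c not in stop_codons]

print(len(codons))        # 64
print(len(sense_codons))  # 61
```

With 61 sense codons for 20 amino acids, several codons must share an amino acid.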

See attached image for an example.


Exercise B: Coding sequence

Write a function `codingRNA` that takes this RNA transcript, finds the first
start codon, reads on in triplets until the first in-frame stop codon and
returns all triplets as a list (including the stop codon). The stop codons are
`UAA`, `UAG` and `UGA`. If the sequence ends before a stop codon is reached, use
`'no stop'` as the final codon. If there is no start codon, return `[]`.

* Example input 1: `'GUCUAGCUAGCAUGUCAAGCACUUGAA'`
* Example output 1: `['AUG','UCA','AGC','ACU','UGA']`

* Example input 2: `'GUCUAGCUAGCAUGUCAAGCACU'`
* Example output 2: `['AUG','UCA','AGC','ACU','no stop']`
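
One way to cut a string into triplets is slicing with a step of 3. A sketch on
a made-up string (the function name `triplets` is hypothetical; handling of
start and stop codons is left to you):

```python
def triplets(seq: str) -> list:
    """Split seq into consecutive 3-letter chunks, dropping an incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

print(triplets("ABCDEFGH"))  # ['ABC', 'DEF']
```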
'''

def codingRNA(RNA: str) -> list:

    # Your code here
    pass


'''
Exercise C: Translation

Write a function `translate` that takes the list of codons returned by
`codingRNA` and translates it into a protein sequence. You may use the
dictionary given below. The output should be a string. If there is no start
codon (empty list), return ''. If there is no stop codon, add a dot at the end:
'MSST.'.

conversion = {'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M', 'ACA':'T', 'ACC':'T',
              'ACG':'T', 'ACU':'T', 'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
              'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R', 'CUA':'L', 'CUC':'L',
              'CUG':'L', 'CUU':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
              'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R',
              'CGG':'R', 'CGU':'R', 'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
              'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A', 'GAC':'D', 'GAU':'D',
              'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
              'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S', 'UUC':'F', 'UUU':'F',
              'UUA':'L', 'UUG':'L', 'UAC':'Y', 'UAU':'Y', 'UAA':'stop',
              'UAG':'stop', 'UGC':'C', 'UGU':'C', 'UGA':'stop', 'UGG':'W'}

triplet = 'AUG'
print(conversion[triplet])
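
Looking up every codon of a list and joining the results can be sketched as
follows (a two-entry subset of the dictionary above; `toy_table` and
`toy_codons` are illustrative names, and stop handling is deliberately left out):

```python
toy_table = {"AUG": "M", "UCA": "S"}   # subset of the conversion dictionary
toy_codons = ["AUG", "UCA", "UCA"]

letters = [toy_table[codon] for codon in toy_codons]
peptide = "".join(letters)

print(peptide)  # MSS
```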
'''

def translate(codingRNAlist: list) -> str:

    # conversion = (copy the dictionary given above)
    protein = '' # Initialise an empty protein

    # Your code here

    return protein


'''
Mutation

Mutations can throw a spanner in the works with regard to protein functionality. 
For the purpose of this exercise, we will distinguish the following types of 
mutations:
* Silent: mutation outside the coding RNA.
* Synonymous: mutation in the coding RNA that leaves the amino acid sequence
unchanged.
* Missense: mutation that substitutes a single amino acid in the sequence.
* Nonsense: a sense codon is mutated into a stop codon, leading to early
termination.
* Resense: a stop codon is mutated into a sense codon, leading to extension of
the protein.
* Indel: insertion or deletion of nucleotides (hence indel), often leading to a
frameshift.


Exercise D: Mutation checker

Write a function `mutationcheck` that takes two DNA sequences and checks what
type of mutation occurs (if any). Possible outputs are: `'none'`, `'silent'`,
`'synonymous'`, `'missense'`, `'nonsense'` and `'resense'`. You may assume that
at most one nucleotide is mutated and that there are no indels. Your code should
also handle mutations that leave the coding RNA without a stop codon.

* Example inputs: 'TTGTGATATAGGTACCAGTCACGTTGACGTAGTCTAGCTAGCATGTCAAGCACTTGAA', 
                  'TTGTGATATAGGTACCAGTCACGTTGACGTAGTCTAGCTAGCATGTCTAGCACTTGAA'
* Example output: `'synonymous'`
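
To locate a single substitution, two equal-length strings can be compared
position by position. A sketch on made-up sequences (the function name
`first_difference` is hypothetical, and equal lengths are assumed):

```python
def first_difference(a: str, b: str) -> int:
    """Return the index of the first differing character, or -1 if none."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    return -1

print(first_difference("ACGTAC", "ACGAAC"))  # 3
print(first_difference("ACGTAC", "ACGTAC"))  # -1
```

In the example inputs above, the two sequences differ at index 47, which turns
the codon UCA into UCU. Both encode S, hence `'synonymous'`.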
'''

def mutationcheck(DNA_wildtype: str, DNA_mutation: str) -> str:

    # Your code here
    pass


'''
Point mutations in real life: Hemoglobin

In this exercise, you will apply the previously written functions to real genomic
data.


Red blood cells are the body's oxygen transporters. They contain an oxygen-binding 
protein called hemoglobin (Hb), which consists of four subunits. In adults, the 
most common form of Hb is hemoglobin A (HbA). It consists of two alpha and two 
beta globin subunits. The alpha globin subunit comes in two flavours: type 1 
and type 2. You are given the DNA of healthy alpha 1, alpha 2, and beta globin.
'''

alpha1globin = 'TATACTGGCGCGCTCGCGGGCCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATG'\
   + 'GTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGC'\
   + 'GGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCT'\
   + 'CTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCC'\
   + 'AACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCA'\
   + 'CTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCC'\
   + 'TGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGG'\
   + 'GCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGCA'

alpha2globin = 'TATACTGGCGCGCTCGCGGCCCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATG'\
   + 'GTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGC'\
   + 'GGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCT'\
   + 'CTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCC'\
   + 'AACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCA'\
   + 'CTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCC'\
   + 'TGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTAGCCGTTCCTCCTGCCCGCTGG'\
   + 'GCCTCCCAACGGGCCCTCCTCCCCTCCTTGCACCGGCCCTTCCTGGTCTTTGAATAAAGTCTGAGTGGGCAGCA'

betaglobin = 'TATAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAA'\
   + 'CAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAA'\
   + 'GTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCT'\
   + 'GTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATG'\
   + 'GCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT'\
   + 'CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACC'\
   + 'AGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTC'\
   + 'TTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCT'\
   + 'TGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCAA'

'''
Note that we edited the sequences slightly to make the exercise easier. For 
instance, we removed the introns. You can download the original sequences from 
NCBI: https://www.ncbi.nlm.nih.gov/nuccore/NM_000558.5,
https://www.ncbi.nlm.nih.gov/nuccore/NM_000517.6, and
https://www.ncbi.nlm.nih.gov/nuccore/NM_000518.5.

Below, the genomic sequences are given for a person who suffers episodes of acute
generalised pain, fatigue and dactylitis. We suspect the patient has sickle cell
anaemia, a disease in which a genetic defect leads to malformed hemoglobin (called
HbS), which polymerises and produces the typical sickle-shaped red blood cells.
'''

alpha1globin_patient = 'TATACTGGCGCGCTCGCGGGCCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAAC'\
   + 'CCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAG'\
   + 'TATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAG'\
   + 'CCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACG'\
   + 'ACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTC'\
   + 'CTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGA'\
   + 'CAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTG'\
   + 'CCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGG'\
   + 'GCGGCA'

alpha2globin_patient = 'TATACTGGCGCGCTCGCGGCCCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAAC'\
   + 'CCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAG'\
   + 'TATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAG'\
   + 'CCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACG'\
   + 'ACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTC'\
   + 'CTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGA'\
   + 'CAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTAGCCGTTCCTCCTG'\
   + 'CCCGCTGGGCCTCCCAACGGGCCCTCCTCCCCTCCTTGCACCGGCCCTTCCTGGTCTTTGAATAAAGTCTGAGTGGG'\
   + 'CAGCA'

betaglobin_patient = 'TATAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCA'\
   + 'ACCTCAAACAGACACCATGGTGCATCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACG'\
   + 'TGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTT'\
   + 'GGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTT'\
   + 'TAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGC'\
   + 'ACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTC'\
   + 'ACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGC'\
   + 'TCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATG'\
   + 'AAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCAA'

'''
Exercise E: Hemoglobin

Use your function `mutationcheck()` to find out what is wrong. Try to understand
how the `for` loop works and replace `A`, `B` and `C` in the code below with something 
else. In which subunit of hemoglobin does the mutation occur?
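
Tuple unpacking in a `for` loop assigns each element of a tuple to its own
name. A sketch with made-up data (all names here are illustrative):

```python
samples = [("first", "ACGT", "ACGA"),
           ("second", "TTTT", "TTTT")]

for label, reference, observed in samples:
    status = "same" if reference == observed else "different"
    print(label + ": " + status)
# first: different
# second: same
```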
'''

# Complete the following code

paired_genes = [('alpha1', alpha1globin, alpha1globin_patient), 
                ('alpha2', alpha2globin, alpha2globin_patient), 
                ('beta', betaglobin, betaglobin_patient)]

# Replace A, B and C with the correct variables:
for A, B, C in paired_genes:
    outcome = mutationcheck(healthygene,patientgene)
    print("Mutation in " + genename + ": " + outcome)


''' 
Explain the outcome here.
'''