9.4. Exercises

Each exercise carries an indication of its difficulty level:

[*] Easy
[**] Moderate
[***] Advanced

The highest difficulty level is meant for students who wish to challenge themselves and/or have previous experience in programming.

Exercise 9.1 ([*] Mercury vapor pressure)

Download this exercise.

Dataset reference: R pressure dataset.

In this exercise, we will work with a file containing data on vapor pressure of mercury (in millimeters of mercury) as a function of temperature (in degrees Celsius). You received this file, called pressure.csv, with this exercise.

To start: Explore

Navigate to and open the data file in the terminal. What does it look like? Is it in line with what you’d expect from a CSV file (what kind of delimiter do you see)? Are there header rows?

Exercise A: Open

Open the file in Python using NumPy’s function np.loadtxt().

Hints:

What is the location of your file with respect to this Python script?
Do you need to set anything else in your function?
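
As a minimal sketch of how np.loadtxt() handles delimiters and header rows, the snippet below reads a small made-up table from an in-memory text buffer. The contents of fake_file are an assumption for illustration only and are not the exercise data; your own file may need different settings.

```python
import io

import numpy as np

# Made-up stand-in for a small CSV file with one header row (not the exercise data).
fake_file = io.StringIO("x,y\n1.0,10.0\n2.0,20.0\n3.0,30.0\n")

# delimiter tells NumPy how the columns are separated;
# skiprows skips leading lines (such as a header) that cannot be parsed as numbers.
values = np.loadtxt(fake_file, delimiter=",", skiprows=1)

print(values.shape)   # number of rows and columns
print(values[:, 1])   # the second column
```

For a file on disk, the first argument would be the path to that file instead of the buffer.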

# Your code here

Exercise B: Shape

What is the size of your data (number of rows and columns)? Check it with Python and print the output.

# Your code here

Exercise C: Modify

Degrees Celsius and millimeters of mercury are not SI units for temperature and pressure. Using Python, modify your data so that the values are expressed in SI units.

# Your code here

Exercise D: Save

Now that you have changed your data into SI units that we want to use for further analysis, save it! You can name the new file pressure_SI_units.csv.
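
One possible way to write a NumPy array to a CSV file is np.savetxt(). The sketch below uses a made-up array and a temporary folder; the names values and out_path, the file name example_output.csv, and the header text are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Made-up array standing in for your converted data.
values = np.array([[1.0, 10.0], [2.0, 20.0]])

# Write into a temporary folder so the example leaves your working directory untouched.
out_path = os.path.join(tempfile.mkdtemp(), "example_output.csv")

# header adds a first line with column names; comments="" prevents NumPy
# from prefixing that line with "# ".
np.savetxt(out_path, values, delimiter=",", header="x,y", comments="")

# Read the file back to check that the round trip preserved the numbers.
reloaded = np.loadtxt(out_path, delimiter=",", skiprows=1)
print(reloaded)
```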

# Your code here

Exercise 9.2 ([**] Cars)

Download this exercise.

In this exercise, we will work with a file containing data on speeds of cars and the distances required to stop. You received this file, called cars.csv, with this exercise.

Exercise A: File contents

Navigate to and open the data file in the terminal. What does it look like?

Open the file in Python using NumPy. How many rows are there? What do the numbers look like? Are the values in line with your expectation? If not, can you think of reasons why?

# Your code here

Exercise B: Making sense of data

Navigate to the data source and learn more about your dataset.

With this information, do the values in the file make more sense?

Transform the data into units that we commonly use and save it using NumPy.

Are the distances what you’d expect from modern cars?

# Your code here

Exercise C: pandas

Could you also import, manipulate, and save this data using pandas? Try it!

Is it easier or more difficult compared to NumPy?

# Your code here

Exercise 9.3 ([*] Proteins)

Download this exercise.

In this exercise we will start with importing data, either the whole or partial file, from different file formats. For this, we will use the pandas package.

To start: Navigate to and open the data file in the terminal. What does it look like? What kind of delimiter do you see? Are there header rows?

Exercise A: CSV file

To import a CSV file, we will use pd.read_csv(r"\path\to\file\filename.csv"). Note that you may need to define the delimiter.

You received the file “uniprotkb_organism_id_9606_AND_reviewed_2024_06_19.csv” with this exercise. Read the file using pandas and find out in which format the data is stored using the type command. Then, print the data and analyse the output of the print command.

What information on the data can you find? E.g., how many rows and columns does your data have?

# Your code here

Exercise B: Excel file

Instead of a CSV file, you can also read Excel files using pd.read_excel(r"\path\to\file\filename.xlsx"). Try it! We’ve provided you with an Excel file carrying the same name as the above CSV.

# Your code here

Exercise C: Importing with parameters

pandas stores data in a so-called DataFrame, which is a 2D table with indexed rows and columns. When printing the data, you will find the row index to the left of the respective row (starting at 0, 1, 2, …). The top-most row gives the column names. For the imported data, give the names of all the columns by looking only at the DataFrame information (not the original file).

# Solution: Entry, Reviewed, Entry name, Protein names, Gene Names, Organism, Length

There are a few important parameters to pd.read_csv() that you can play around with. A non-exhaustive list:

sep: The default separator between columns in CSV files is a comma, but if you have a different delimiter, specify it using sep =.
header: Gives the row that has the column names. The default is to use the 0-th row as column names. If your data has no headers, use header = None. Use header = 0 paired with names = [your, column, names] to manually override the header names.
usecols: Select the columns to be imported. You can use either the column numbers (e.g., usecols = [1,2]) or the headers (e.g., usecols=["Reviewed","Entry Name"]).
nrows: Select the number of rows you want to use.
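
To see how these parameters behave, here is a small sketch on made-up, semicolon-separated text read from an in-memory buffer. The column names id, name, and score are hypothetical and unrelated to the exercise file.

```python
import io

import pandas as pd

# Made-up text with a header row and ";" as the delimiter.
text = "id;name;score\n1;a;0.5\n2;b;0.7\n3;c;0.9\n"

# Read only two of the three columns and only the first two data rows.
df_part = pd.read_csv(io.StringIO(text), sep=";", usecols=["id", "name"], nrows=2)
print(df_part)

# The same kind of data without a header row: supply the column names yourself.
no_header = "1;a\n2;b\n"
df_named = pd.read_csv(io.StringIO(no_header), sep=";", header=None, names=["id", "name"])
print(df_named)
```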

For this exercise, import the first two columns and first 30 rows of the data. Use the same CSV file as in Exercise A.

# Your code here

Exercise D: Text files

Of course, Excel isn’t the only file type you might need to read. Another common file type is a text file. For this, we will use the “Phosphorylation_Y.txt” file you received with this exercise. It describes tyrosine phosphorylation sites, including the UniProt IDs, tyrosine site, phosphorylated motif, and more.

For importing, we will use pd.read_table(), where you put your file location between the parentheses as before.
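
As a rough illustration, pd.read_table() expects tab-separated columns by default. The sketch below reads made-up tab-separated text from an in-memory buffer; the column names site and motif are hypothetical and do not describe the real file.

```python
import io

import pandas as pd

# Made-up tab-separated text with one header row (not the exercise data).
text = "site\tmotif\nY15\tAAA\nY42\tBBB\n"

# The default separator of pd.read_table() is the tab character.
df_sites = pd.read_table(io.StringIO(text))
print(df_sites)
```

If a text file uses another separator, the sep parameter can be set in the same way as for pd.read_csv().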

# Your code here

Exercise E: Writing files

We can also write files directly from Python into a text, CSV, or Excel format. To do that, we use dataframe_name.to_*, where dataframe_name is the name of your DataFrame and * stands for the output format, for example to_csv() or to_excel().
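
A minimal sketch of writing a DataFrame with to_csv() is shown below. The DataFrame df_example, the temporary folder, and the file name example_table.csv are assumptions for illustration; index=False is one possible choice that leaves the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Made-up DataFrame standing in for your own data.
df_example = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write into a temporary folder; replace this with the file path you want to use.
out_path = os.path.join(tempfile.mkdtemp(), "example_table.csv")
df_example.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
df_back = pd.read_csv(out_path)
print(df_back)
```

Writing .xlsx files with to_excel() works along the same lines, but typically requires an additional package such as openpyxl to be installed.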

Use this to write the first two columns and 30 rows of the phosphorylation data into a CSV file. Don’t forget to specify the filepath when you save the file!

# Your code here

Exercise 9.4 ([**] Beavers)

Download this exercise.

Here we will work with a small part of a study of the long-term temperature dynamics of the beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by telemetry every 10 minutes for four females, but data from one period of less than a day for two animals is used here (dataset source).

Exercise A: Explore the data

What do the two provided files for beaver_1 and beaver_2 look like? Use the terminal to explore them.

Exercise B: Import

Based on what you observed in Exercise A, decide on a Python package and use it to import the data files (beaver1.csv and beaver2.csv) into Python.

Is the same number of measurements available for both animals? If not, make a print statement saying which beaver has more measurements and how many more.

# Your code here

Exercise C: Average temperature

Which beaver had a higher average body temperature? Print the mean values for both animals.

# Your code here

Exercise D: Combine

Combine the data for the two beavers into a single table and save it as an Excel file.

Which function is useful for concatenating the data? Will you be able to distinguish which measurements came from which beaver after you concatenate the data? If not, how can you go about this problem?

# Your code here

Exercise 9.5 ([**] Friends)

Download this exercise.

pandas stores data in a 2D table called a DataFrame. In this exercise we will learn how to work with such a DataFrame.

Exercise A: Creating a DataFrame

For the purposes of this exercise, we will create a simple DataFrame. To do that, you first create a dictionary in which each key is a column title and the corresponding value holds the data for that column.

For example, to get a column of ages and names, the command would be:

data_friends = {"Name": ["Alex","Alin", "Lucia", "Tessa"], "Age": [22,21,23,21]}

We can convert this into a dataframe using:

df_friends = pd.DataFrame(data_friends)
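
Put together as a runnable snippet, the two steps look like this. The sketch assumes pandas is imported as pd and that the dictionary data_friends defined above is the one passed to pd.DataFrame().

```python
import pandas as pd

# Each key becomes a column title; each list holds the values of that column.
data_friends = {"Name": ["Alex", "Alin", "Lucia", "Tessa"], "Age": [22, 21, 23, 21]}

# Convert the dictionary into a DataFrame; rows get the default index 0, 1, 2, 3.
df_friends = pd.DataFrame(data_friends)
print(df_friends)
```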

Similarly, create a DataFrame with the names, ages, gender and hair colour of your friends and/or family members.

# Your code here

Exercise B: Creating a DataFrame II

Another way to create a DataFrame is directly via the pd.DataFrame command. For this, you must specify the data, column names and index labels (if present). The index labels will replace 0,1,2,3 etc., as the labels for the rows.

For example, to create the friends DataFrame we can write:

df_friends = pd.DataFrame([[22,"male","brown"],
                           [21,"male","brown"],
                           [23,"other","brown"],
                           [21,"female","brown"]],
                           columns= ["Age","Gender","hair colour"],
                           index= ["Alex","Alin","Lucia","Tessa"]) 

Note that in this case we don’t need the column label “Name”, as the names are now not a column but the labels for the rows.

Create a DataFrame with the same data as above directly using the DataFrame command. For the exercises below, we will be using this DataFrame.

# Your code here

Exercise C: Retrieving specific data from DataFrames

In DataFrames, it’s possible to access only certain rows or columns of the data, just as with matrices.

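To get a specific column, use df_friends["Gender"] or any other column name. Column names are case-sensitive, so they must match the labels used when the DataFrame was created. To access a row, we use the command df_friends.loc["Lucia"] to access the row with the label “Lucia”.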
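
As an illustration of both access patterns, the sketch below uses a small made-up DataFrame. The name df_pets and its contents are hypothetical and only serve to show the syntax.

```python
import pandas as pd

# Made-up DataFrame with row labels, built in the same way as df_friends.
df_pets = pd.DataFrame([[3, "cat"], [7, "dog"]],
                       columns=["Age", "Species"],
                       index=["Milo", "Rex"])

# Selecting a column by its (case-sensitive) label returns a Series.
ages = df_pets["Age"]
print(ages)

# Selecting a row by its label with .loc also returns a Series.
rex = df_pets.loc["Rex"]
print(rex)
```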

From DataFrame df_friends:

Create and print a Series (a 1D array in pandas) of the ages of your friends.
Create and print a series that gives all the data of one of your friends.
We can also use the command df_friends.iloc[2] to get all the information about Lucia, where we use 2 to denote the row we want to access. Try it yourself!
We can also get only the rows that correspond to a specific condition, as seen in the book. Give all rows with male as gender.

# Your code here

Exercise D: Iterating over rows

We can iterate over rows using dataframe_name.iterrows(). Use the command:

for i,j in df_friends.iterrows(): 
    print(i,j) 

What do i and j represent in this case? What happens if you iterate only over one, not two variables, e.g., only i?

Print the types of i and j in the first case, and of just i in the second case.

# Your code here

Exercise E: Iterating over columns

To iterate over columns, we will not use a dedicated command. Rather, we first extract the column names by converting the DataFrame into a list. Check for yourself that doing so will give you only the names of the columns!

We can now iterate using column indexing (as seen earlier in this exercise) and a for loop over the column names. Create a for loop that prints the contents of each column i.

# Your code here

Exercise F: Mathematical operations

We can perform mathematical operations on DataFrames and Series similar to how we work with matrices and arrays in NumPy. Take the DataFrame:

df_abcd = pd.DataFrame([[2,6,3,6],[5,2,8,5],[3,1,7,5],[6,7,4,8]],
                        columns= ["A","B","C","D"]) 

Part I. Perform the following operations:

Multiply by 4
Subtract 7
Multiply the dataframe by 2*pi and take the sine
Find the mean value of the columns of the DataFrame, and of the rows of the DataFrame

Part II. We can also perform operations with two Series. Next, take the 0th and 1st rows of the DataFrame and multiply them. Then, subtract the 0th row from the DataFrame.
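
To give a feel for how such operations behave, here is a short sketch on a different, made-up DataFrame. The name df_xy and its values are hypothetical. It shows element-wise arithmetic, applying a NumPy function, multiplying two rows, and subtracting a Series from a DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up DataFrame, not the one used in the exercise.
df_xy = pd.DataFrame([[1.0, 4.0], [9.0, 16.0]], columns=["X", "Y"])

# Arithmetic with a number is applied to every element.
shifted = df_xy + 1

# NumPy functions also work element-wise and return a DataFrame.
roots = np.sqrt(df_xy)

# Two rows are two Series; multiplying them pairs up values with the same column label.
row_product = df_xy.iloc[0] * df_xy.iloc[1]

# Subtracting a row (a Series) from the DataFrame aligns on the column labels
# and subtracts that row from every row.
centered = df_xy - df_xy.iloc[0]

print(shifted, roots, row_product, centered, sep="\n\n")
```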

# Your code here