9.4. Exercises
Each exercise is marked with a difficulty level:
[*] Easy
[**] Moderate
[***] Advanced
The highest difficulty level is meant for students who wish to challenge themselves and/or have previous experience in programming.
([*] Mercury vapor pressure)
Dataset reference: R pressure dataset.
In this exercise, we will work with a file containing data on vapor pressure
of mercury (in millimeters of mercury) as a function of temperature (in degrees
Celsius). You received this file, called pressure.csv, with this exercise.
To start: Explore
Navigate to and open the data file in the terminal. What does it look like? Is it in line with what you’d expect from a CSV file (what kind of delimiter do you see)? Are there header rows?
Exercise A: Open
Open the file in Python using NumPy’s function np.loadtxt().
Hints:
What is the location of your file with respect to this Python script?
Do you need to set anything else in your function?
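As a hedged illustration of the hints above, the sketch below reads a small made-up CSV text (not the exercise file) with np.loadtxt(). The variable csv_text and its contents are assumptions for demonstration only; your file may need different settings.

```python
import io

import numpy as np

# Hypothetical stand-in for a small CSV file with one header row.
csv_text = "x,y\n1.0,2.0\n3.0,4.0\n"

# delimiter tells NumPy how the columns are separated;
# skiprows=1 skips the header line, which cannot be parsed as a number.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(data)
```

With a real file, you would pass the file path instead of the io.StringIO object.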
# Your code here
Exercise B: Shape
What is the size of your data (number of rows and columns)? Check it with Python and print the output.
# Your code here
Exercise C: Modify
Degrees Celsius and millimeters of mercury are not SI units for temperature and pressure. Using Python, modify your data so that the values are expressed in SI units.
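A minimal sketch of a column-wise unit conversion, using made-up numbers rather than the exercise data. The column order (temperature first, pressure second) is an assumption, so check your own file. The conversions used are 0 °C = 273.15 K and 1 mmHg ≈ 133.322 Pa.

```python
import numpy as np

# Made-up values: column 0 is temperature in °C, column 1 is pressure in mmHg.
values = np.array([[0.0, 760.0],
                   [100.0, 380.0]])

# Work on a copy so the original array stays unchanged.
converted = values.copy()
converted[:, 0] = values[:, 0] + 273.15         # °C -> K
converted[:, 1] = values[:, 1] * 133.322387415  # mmHg -> Pa
print(converted)
```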
# Your code here
Exercise D: Save
Now that you have changed your data into SI units that we want to use for further
analysis, save it!
You can name the new file pressure_SI_units.csv.
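One possible way to save a NumPy array as a CSV file is np.savetxt(). The sketch below uses made-up values, a hypothetical file name (example_SI_units.csv) and hypothetical column headers, and writes to the system's temporary folder so that it does not touch your exercise files.

```python
import os
import tempfile

import numpy as np

# Made-up values standing in for converted data.
data = np.array([[273.15, 0.027],
                 [293.15, 0.16]])

# Hypothetical output location in the temporary folder.
out_path = os.path.join(tempfile.gettempdir(), "example_SI_units.csv")

# comments="" prevents NumPy from prefixing the header line with "# ".
np.savetxt(out_path, data, delimiter=",",
           header="temperature_K,pressure_Pa", comments="")

# Read the file back to check that it was written as expected.
reloaded = np.loadtxt(out_path, delimiter=",", skiprows=1)
print(reloaded)
```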
# Your code here
([**] Cars)
In this exercise, we will work with a file containing data on speeds of cars
and the distances required to stop. You received this file, called cars.csv,
with this exercise.
Exercise A: File contents
Navigate to and open the data file in the terminal. What does it look like?
Open the file in Python using NumPy. How many rows are there? What do the numbers look like? Are the values in line with your expectation? If not, can you think of reasons why?
# Your code here
Exercise B: Making sense of data
Navigate to the data source and learn more about your dataset.
With this information, do the values in the file make more sense?
Transform the data into units that we commonly use and save it using NumPy.
Are the distances what you’d expect from modern cars?
# Your code here
Exercise C: pandas
Could you also import, manipulate, and save this data using pandas? Try it!
Is it easier or more difficult compared to NumPy?
# Your code here
([*] Proteins)
In this exercise we will start with importing data, either a whole file or part of it, from different file formats. For this, we will use the pandas package.
To start: Navigate to and open the data file in the terminal. What does it look like? What kind of delimiter do you see? Are there header rows?
Exercise A: CSV file
To import a CSV file, we will use pd.read_csv(r"\path\to\file\filename.csv").
You received the file “uniprotkb_organism_id_9606_AND_reviewed_2024_06_19.csv”
with this exercise. Read the file using pandas and find out in which format the
data is stored using the type() function. Then, print the data and analyse the
output of the print command.
What information on the data can you find? E.g., how many rows and columns does your data have?
# Your code here
Exercise B: Excel file
Instead of a CSV file, you can also read Excel files using
pd.read_excel(r"\path\to\file\filename.xlsx"). Try it!
We’ve provided you with an Excel file carrying the same name as the above CSV.
# Your code here
Exercise C: Importing with parameters
pandas stores data in a so-called DataFrame, which is a 2D table with indexed rows and columns. When printing the data, you will find the row index to the left of each row (starting at 0, 1, 2, …). The top-most row gives the column names. For the imported data, give the names of all the columns by looking only at the DataFrame information (not the original file).
# Solution: Entry, Reviewed, Entry name, Protein names, Gene Names, Organism, Length
There are a few important parameters to pd.read_csv() that you can play
around with. A (non-exhaustive) list is:
sep: The default separator between columns in CSV files is a comma, but if you have a different delimiter, specify it using sep =.
header: Gives the row that has the column names. The default is to use the 0-th row as column names. If your data has no headers, use header = None. Use header = 0 paired with names = (your, column, names) to manually override the header names.
usecols: Select the columns to be imported. You can use either the column numbers (e.g., usecols = [1,2]) or the headers (e.g., usecols=["Reviewed","Entry Name"]).
nrows: Select the number of rows you want to import.
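The parameters listed above can be combined in a single call. The following is a minimal sketch on made-up, semicolon-separated text; the column names id, name and length are hypothetical and do not come from the exercise file.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with one header row.
csv_text = "id;name;length\nA1;alpha;10\nB2;beta;20\nC3;gamma;30\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                   # columns are separated by semicolons here
    header=0,                  # the 0-th row holds the column names
    usecols=["id", "length"],  # import only these two columns
    nrows=2,                   # import only the first two data rows
)
print(df)
```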
For this exercise, import the first two columns and first 30 rows of the data. Use the same CSV file as in Exercise A.
# Your code here
Exercise D: Text files
Of course, Excel isn’t the only file type you might need to read. Another common file type is a text file. For this, we will use the “Phosphorylation_Y.txt” file you received with this exercise. It describes tyrosine phosphorylation sites, including the UniProt IDs, tyrosine site, phosphorylated motif, and more.
For importing, we will use pd.read_table(), where you put your file location
between the parentheses as before.
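As a rough sketch of how pd.read_table() behaves, the example below reads a small made-up, tab-separated text. The column names site and motif and the values are hypothetical. By default, pd.read_table() expects tabs between columns.

```python
import io

import pandas as pd

# Hypothetical tab-separated text with one header row.
txt = "site\tmotif\nY15\tAAYBB\nY42\tCCYDD\n"

# The default separator of read_table is a tab, so no sep is needed here.
df_txt = pd.read_table(io.StringIO(txt))
print(df_txt)
```

If a text file uses another separator, it can be given with sep, just as for pd.read_csv().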
# Your code here
Exercise E: Writing files
We can also write files directly from Python into a text, CSV, or Excel format.
To do that, we use dataframe_name.to_*, where dataframe_name is the name of
your DataFrame and * stands for either csv or excel (i.e., to_csv or to_excel),
which determines the file type we write to.
Use this to write the first two columns and 30 rows of the phosphorylation data into a CSV file. Don’t forget to specify the filepath when you save the file!
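A minimal sketch of selecting part of a DataFrame and writing it to CSV, using a made-up DataFrame and a hypothetical file name (example_subset.csv) in the system's temporary folder. Passing index=False is a choice made here to leave the row index out of the file; it is not required.

```python
import os
import tempfile

import pandas as pd

# Made-up DataFrame with three rows and three columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# First two rows and first two columns, selected by position.
subset = df.iloc[:2, :2]

# Hypothetical output location in the temporary folder.
out_path = os.path.join(tempfile.gettempdir(), "example_subset.csv")
subset.to_csv(out_path, index=False)

# Read the file back to check what was written.
check = pd.read_csv(out_path)
print(check)
```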
# Your code here
([**] Beavers)
Here we will work with a small part of a study of the long-term temperature dynamics of the beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by telemetry every 10 minutes for four females, but data from one period of less than a day for two animals is used here (dataset source).
Exercise A: Explore the data
What do the two provided files for beaver_1 and beaver_2 look like? Use the terminal to explore them.
Exercise B: Import
Based on what you observed in Exercise A, decide on a Python package and use
it to import the data files (beaver1.csv and beaver2.csv) into Python.
Is the same number of measurements available for both animals? If not, make a print statement saying which beaver has more measurements and how many more.
# Your code here
Exercise C: Average temperature
Which beaver had a higher average body temperature? Print the mean values for both animals.
# Your code here
Exercise D: Combine
Combine the data for the two beavers into a single table and save it as an Excel file.
Which function is useful for concatenating the data? Will you be able to distinguish which measurements came from which beaver after you concatenate the data? If not, how can you solve this problem?
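One possible approach, sketched on made-up tables rather than the beaver files: add a label column to each table before stacking them with pd.concat(). The column names time, temp and animal are hypothetical.

```python
import pandas as pd

# Two made-up tables with the same columns but different lengths.
df_a = pd.DataFrame({"time": [10, 20], "temp": [36.5, 36.6]})
df_b = pd.DataFrame({"time": [10, 20, 30], "temp": [37.0, 37.1, 37.2]})

# Label each table so the origin of every row stays known after combining.
df_a["animal"] = "a"
df_b["animal"] = "b"

# ignore_index=True renumbers the rows 0..n-1 in the combined table.
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```

Saving with combined.to_excel(...) additionally needs an Excel engine such as openpyxl to be installed.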
# Your code here
([**] Friends)
pandas stores data in a 2D table called a DataFrame. In this exercise we will learn how to work with such a DataFrame.
Exercise A: Creating a DataFrame
For our exercise purposes, we will create a simple DataFrame. To do that, you first create a dictionary that maps each column title to the data for that column.
For example, to get a column of ages and names, the command would be:
data_friends = {"Name": ["Alex","Alin", "Lucia", "Tessa"], "Age": [22,21,23,21]}
We can convert this into a dataframe using:
df_friends = pd.DataFrame(data_friends)
Similarly, create a DataFrame with the names, ages, gender and hair colour of your friends and/or family members.
# Your code here
Exercise B: Creating a DataFrame II
Another way to create a DataFrame is directly via the pd.DataFrame command.
For this, you must specify the data, column names and index labels (if present).
The index labels will replace 0,1,2,3 etc., as the labels for the rows.
For example, to create the friends DataFrame we can write:
df_friends = pd.DataFrame([[22,"male","brown"],
                           [21,"male","brown"],
                           [23,"other","brown"],
                           [21,"female","brown"]],
                          columns=["Age","Gender","hair colour"],
                          index=["Alex","Alin","Lucia","Tessa"])
Note that in this case we don’t need the column label “Name”, as the names are no longer a column but the labels for the rows.
Create a DataFrame with the same data as above directly using the DataFrame command. For the exercises below, we will be using this DataFrame.
# Your code here
Exercise C: Retrieving specific data from DataFrames
In DataFrames, it’s possible to access only certain rows or columns of the data, just as with matrices.
To get a specific column, use df_friends["Gender"] or any other column name.
To access a row, we use the command df_friends.loc["Lucia"] to access the row
with the label “Lucia”.
From DataFrame df_friends:
Create and print series (a 1D array in pandas) of the ages of your friends.
Create and print a series that gives all the data of one of your friends.
We can also use the command df_friends.iloc[2] to get all the information about Lucia, where 2 denotes the position of the row we want to access. Try it yourself!
We can also get only the rows that satisfy a specific condition, as seen in the book. Select all rows with male as gender.
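The access patterns described above can be sketched on a small made-up DataFrame. The name df_pets and its contents are hypothetical and only mirror the structure of df_friends.

```python
import pandas as pd

# Made-up DataFrame with row labels, similar in structure to df_friends.
df_pets = pd.DataFrame([[3, "cat"], [5, "dog"], [2, "cat"]],
                       columns=["Age", "Species"],
                       index=["Milo", "Rex", "Luna"])

ages = df_pets["Age"]               # one column, returned as a Series
rex_by_label = df_pets.loc["Rex"]   # one row, selected by its label
rex_by_position = df_pets.iloc[1]   # the same row, selected by its position

# Boolean condition: keep only the rows where Species equals "cat".
cats = df_pets[df_pets["Species"] == "cat"]

print(ages)
print(rex_by_label)
print(cats)
```

Note that column names are case-sensitive, so "Age" and "age" are treated as different labels.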
# Your code here
Exercise D: Iterating over rows
We can iterate over rows using dataframe_name.iterrows().
Use the command:
for i, j in df_friends.iterrows():
    print(i, j)
What do i and j represent in this case?
What happens if you iterate only over one, not two variables, e.g., only i?
Print the types for i and j in the first, and just i in the second case.
# Your code here
Exercise E: Iterating over columns
To iterate over columns, we don’t have a specific command. Rather, we first extract the column names by converting the DataFrame into a list. Check for yourself that doing so will give you only the names of the columns!
We can now iterate using column indexing (as seen earlier in this exercise)
and a for loop over the column names. Create a for loop that prints the
contents of each column i.
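A minimal sketch of this pattern on a made-up DataFrame (df_pets and its columns are hypothetical): list() yields the column names, and each name can then be used for column indexing inside a for loop.

```python
import pandas as pd

# Made-up DataFrame with two columns.
df_pets = pd.DataFrame({"Age": [3, 5], "Species": ["cat", "dog"]})

# Converting a DataFrame into a list gives its column names.
column_names = list(df_pets)
print(column_names)

# Loop over the names and use each one to select the matching column.
for name in column_names:
    print(name)
    print(df_pets[name])
```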
# Your code here
Exercise F: Mathematical operations
We can perform mathematical operations on DataFrames and Series similar to how we work with matrices and arrays in NumPy. Take the DataFrame:
df_abcd = pd.DataFrame([[2,6,3,6],[5,2,8,5],[3,1,7,5],[6,7,4,8]],
                       columns=["A","B","C","D"])
Part I. Perform the following operations:
Multiply by 4
Subtract 7
Multiply the dataframe by 2*pi and take the sine
Find the mean value of the columns of the DataFrame, and of the rows of the DataFrame
Part II. We can also perform operations with two Series. Next, take the 0th and 1st row of the DataFrame and multiply them. Then, subtract the 0th row from the DataFrame.
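The kinds of operations listed above can be sketched on a smaller made-up DataFrame (df_xy is hypothetical, and the numbers differ from the exercise on purpose). Arithmetic is applied element by element, and a Series taken from a row is matched to the DataFrame by column name.

```python
import numpy as np
import pandas as pd

# Made-up 2x2 DataFrame.
df_xy = pd.DataFrame([[1, 2], [3, 4]], columns=["X", "Y"])

doubled = df_xy * 2                  # multiply every element by a scalar
shifted = df_xy - 1                  # subtract a scalar from every element
sines = np.sin(df_xy * np.pi / 2)    # NumPy functions work element-wise

col_means = df_xy.mean(axis=0)       # one mean per column
row_means = df_xy.mean(axis=1)       # one mean per row

# Operations between two rows (Series) are element-wise as well.
row_product = df_xy.iloc[0] * df_xy.iloc[1]

# Subtracting a row from the DataFrame subtracts it from every row.
minus_row0 = df_xy - df_xy.iloc[0]

print(col_means)
print(row_means)
print(row_product)
print(minus_row0)
```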
# Your code here