9.1. Working with data files#
Interactive page
This is an interactive book page. Press the launch button at the top right of the page to run the code.
Until now, we have seen how to generate data inside Python, e.g., by assigning values to arrays or using functions like np.zeros() and np.linspace(). However, what if we want to use Python to analyze data from an experiment? How do we get the data into Python?
Sometimes, we might have just a small number of data points that we have measured by hand and written down on a piece of paper: say, a list of measured optical densities OD600 of our bacterial culture (OD600 correlates with the number of cells). In this case, you can just "load" the data by defining Python arrays that contain your measured values:
import numpy as np

# Time in hours
t = np.array([0, 1, 2, 3, 4, 5, 6])
# Measured OD600
od = np.array([0.05, 0.10, 0.25, 0.50, 0.90, 1.50, 2.00])
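As a quick illustration of what you can do once the data is in an array (this analysis is an added sketch, not part of the original example), you could estimate the culture's exponential growth rate: during exponential growth OD600 follows od = od0 * exp(rate * t), so a linear fit of log(od) against t gives the rate.

```python
import numpy as np

# The measured data from above
t = np.array([0, 1, 2, 3, 4, 5, 6])
od = np.array([0.05, 0.10, 0.25, 0.50, 0.90, 1.50, 2.00])

# Fit a straight line to log(od) vs. t; the slope is the growth rate
rate, log_od0 = np.polyfit(t, np.log(od), 1)
print(f"Estimated growth rate: {rate:.2f} per hour")
```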
In a machine-controlled experiment, though, you may have a data file with hundreds, thousands, or even millions of data points: for example, voltage measured as a function of time by an oscilloscope, or omics data (genomics, proteomics) in bioinformatics. Python is well suited to processing, analyzing, and plotting such large datasets, and it has multiple advantages over MS Excel:
Scalability: Python can handle large data much more efficiently and has no limit on the number of rows and columns.
Automation: You can write a script to automate a repetitive task, which is useful for data cleaning, transformation and repetitive calculations, as well as when you need to handle multiple data files in one go (imagine if you ran a transcriptomics experiment on 60 samples).
Data analysis and manipulation: Python has libraries that allow you to manipulate and visualize data in a more powerful way and with more options than you’d get in Excel.
In this chapter, we will explore how to load, manipulate, and save data files in Python. For this purpose, we will use specific Python packages, namely NumPy and pandas.
9.1.1. NumPy vs. pandas#
We’ve already become familiar with the NumPy library for scientific computing. Here we introduce another Python data analysis library: pandas.
Both NumPy and pandas are fundamental Python libraries that can be used for data manipulation and analysis. While they have some overlapping functionalities, they were designed for different purposes and are often used side by side.
NumPy is meant for (efficient) numerical computations, provides support for \(N\)-dimensional NumPy arrays, and has a collection of mathematical functions that can be applied to arrays (you can think of linear algebra and statistics). Operations on NumPy arrays are more efficient than operations on lists. Therefore, we use NumPy for:
Numerical computations, scientific computing
Handling multidimensional data
Mathematical and statistical operations on large datasets
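The bullet points above can be illustrated with a small sketch (the replicate data below is made up for this example): a homogeneous 2D array on which vectorized statistics are computed along one axis.

```python
import numpy as np

# Hypothetical example: 3 replicate OD measurements at 4 time points
data = np.array([[0.05, 0.11, 0.24, 0.52],
                 [0.04, 0.10, 0.26, 0.49],
                 [0.06, 0.09, 0.25, 0.51]])

# Vectorized statistics along the replicate axis (axis=0),
# with no explicit loops over rows or columns
means = data.mean(axis=0)  # mean per time point
stds = data.std(axis=0)    # standard deviation per time point
print(means)
```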
pandas is built on top of NumPy and is aimed specifically at data manipulation and analysis. Just like NumPy has its special data type called NumPy arrays, pandas offers high-level data structures:
Series: 1D labeled array that can contain any data type
DataFrame: 2D labeled data structure, where columns can have data of different types
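A minimal sketch of these two data structures (the sample names and values are made up for illustration):

```python
import pandas as pd

# A Series: a 1D labeled array
od = pd.Series([0.05, 0.10, 0.25], index=["0h", "1h", "2h"], name="OD600")

# A DataFrame: 2D and labeled, where each column can have its own type
df = pd.DataFrame({
    "sample": ["A", "B", "C"],      # strings
    "time_h": [0, 1, 2],            # integers
    "od600": [0.05, 0.10, 0.25],    # floats
})
print(df.dtypes)
```

Note that you can access Series elements by their labels, e.g., od["1h"], rather than only by integer position.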
pandas is great for data manipulation (e.g., merging and joining datasets) and analysis. With pandas, you can read and write files of various formats, such as CSV and Excel.
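For example, writing a DataFrame to a CSV file and reading it back takes one line each (the file name growth.csv here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"time_h": [0, 1, 2], "od600": [0.05, 0.10, 0.25]})

# Write to a CSV file (index=False omits the row labels)
df.to_csv("growth.csv", index=False)

# Read the file back into a new DataFrame
df2 = pd.read_csv("growth.csv")
print(df2.head())
```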
When should we use each of these libraries?
Speed:
NumPy is more efficient for large numerical computations due to its optimized array operations.
pandas is less efficient due to its additional functionalities but is optimized for high-level data manipulation.
Data structures:
NumPy is suitable for homogeneous data (all elements of the same type).
pandas is suitable for heterogeneous data (different types within the same data structure).
Functionality:
NumPy is focused on numerical computations, less on data manipulation.
pandas has extensive data manipulation capabilities, ideal for cleaning, transforming, and analyzing structured data.
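The homogeneous-vs-heterogeneous distinction above is easy to see in code (a small sketch with made-up values): NumPy coerces mixed input to a single common type, while pandas keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# NumPy coerces everything to one common type: mixing numbers and
# text turns the whole array into strings
arr = np.array([1, 2.5, "three"])
print(arr.dtype)

# pandas keeps one dtype per column, so numbers stay numeric
df = pd.DataFrame({"count": [1, 2],
                   "mass": [2.5, 3.1],
                   "label": ["a", "b"]})
print(df.dtypes)
```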
import micropip
await micropip.install("jupyterquiz")
from jupyterquiz import display_quiz
import json

with open("questions4.json", "r") as file:
    questions = json.load(file)

display_quiz(questions, border_radius=0)