9.1. Libraries, documentation, and NumPy
NumPy (Numerical Python) is the fundamental package for scientific computing in Python.
But what are Python packages, and why would we want to use NumPy in particular?
9.1.1. Python packages
Packages are building blocks of programming. Packages consist of reusable pieces of code so that you do not have to program everything from scratch (and, thereby, you don’t have to keep “reinventing the wheel”). Sometimes we refer to published packages as libraries. According to Python packages, there are over 350,000 Python packages (based on data from 2022), and the number keeps growing. This means that there is a lot of code out there for you to reuse!
Why use packages?
Imagine you are working with a vector of numbers and, using Python, you wish to calculate the average of the numbers in your vector. As you can imagine, many people before you have wanted to do the same thing in Python.
To prevent programmers from reinventing the same pieces of code time and again, there are Python packages available, each containing pre-made code (such as code for calculating an average). Another good reason to use packages is computational efficiency: the code from packages often runs much faster than the code you would write yourself, because packages such as NumPy carry out the heavy computations in optimized, compiled code rather than in plain Python loops. More on this in a later section on vectorization.
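As a minimal sketch of this idea, the snippet below computes the same average twice: once with a hand-written loop and once with NumPy's numpy.mean() function. The numbers and variable names are made up for illustration, and the import line is explained later in this section.

```python
import numpy

# Example data (made up for illustration)
numbers = [2.0, 4.0, 6.0, 8.0]

# Hand-written version: add everything up, then divide by the count
total = 0.0
for value in numbers:
    total += value
manual_average = total / len(numbers)

# Package version: a single call to a pre-made function
numpy_average = numpy.mean(numbers)

print(manual_average)  # 5.0
print(numpy_average)   # 5.0
```

Both approaches give the same result, but the package version is shorter and has already been tested by many other users.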
Useful packages
Python packages often focus on a specific problem or domain. Sometimes several libraries offer options that will result in the same outcome, e.g., you can make an xy scatter plot with more than one library.
Here we offer an overview of some widely used Python packages that will likely come in handy in your nanobiology studies and research. Packages that we will use in this book are shown in bold.
Plotting and visualization
Matplotlib
seaborn
Plotly
Scientific computing
NumPy
SciPy
Bioinformatics
Biopython
Data analysis
pandas
Machine learning
Scikit-learn
PyTorch
Utilities and tools
os
re
Naming conflicts
DO NOT use variable names that are the same as function names or package names. Examples are list, str, print, or numpy. This can override the original meaning and cause unexpected errors.
Similarly, AVOID naming your Python files (e.g., numpy.py) after libraries or third-party packages. Doing so can confuse the import system and prevent you from using the real package in your code.
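To illustrate what can go wrong, here is a small sketch in which a variable named list hides Python's built-in list() function. The variable names are chosen purely for this demonstration.

```python
# A variable called `list` now hides the built-in list() function
list = [1, 2, 3]

try:
    letters = list("abc")  # we meant to call the built-in here
    shadowing_error = None
except TypeError as err:
    shadowing_error = str(err)

print(shadowing_error)  # 'list' object is not callable

# Deleting our variable makes the built-in function visible again
del list
```

A different variable name, such as numbers, would have avoided the problem entirely.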
9.1.1.1. Importing a package
When you want to use a package in your code, you have to tell that to Python explicitly. We call this importing a package into the namespace of your kernel. A namespace is a container that holds a collection of function and variable names.
You typically do this at the very beginning of your Python script by using a line such as this:
import numpy
If you have multiple packages to import, you typically write several import lines, one beneath the other.
In order to use a package, you have to make sure it’s actually installed. If you use Anaconda, it installs a lot of the packages and tools for you. In our case, we installed a more lightweight version called Miniconda (remember the installation steps). We then created a conda environment from the file env.yaml containing all packages needed for this course.
In case you want to install a package on your own, two popular tools to do that are pip and conda. In practice, installing a package (such as NumPy) is as simple as opening your terminal and running one of these lines:
conda install numpy
or
pip install numpy
Installation into your current environment
Running conda install numpy installs NumPy into the currently active environment. If you want to install NumPy into a different environment, make sure to run conda deactivate, then conda activate <target-environment> before running conda install numpy.
Windows vs. macOS/Linux
When needed, run the package installation commands in your Bash terminal on macOS/Linux, and in the Anaconda Prompt terminal on Windows.
If you’re unsure whether a Python package is installed, you can also simply check it in your terminal using pip list or conda list. Check with conda - do you already have NumPy installed in pf-env?
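You can also check this from within Python itself. The sketch below uses the standard library modules importlib.util and importlib.metadata; the variable names are only illustrative, and the result depends on which environment your Python is running in.

```python
import importlib.util
import importlib.metadata

package_name = "numpy"

# find_spec returns None when Python cannot find the package
spec = importlib.util.find_spec(package_name)
is_installed = spec is not None

if is_installed:
    # Look up the installed version from the package metadata
    print(package_name, "version", importlib.metadata.version(package_name))
else:
    print(package_name, "is not installed in this environment")
```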
Pip vs. conda
Both pip and conda can be used to install packages, so what are the differences between them?
There are three differences:
conda is cross-language, meaning that it can also install non-Python libraries and tools, while pip is for Python only.
conda installs from its own channels, while pip installs from the Python Package Index (PyPI). The latter is the largest collection of packages, but most popular ones are also available with conda.
conda is an integrated tool for managing packages, dependencies, and environments, while with pip you may need other tools for dealing with environments (see below) or complex dependencies.
9.1.1.2. Packages contain functions
But what does a package actually look like, and what’s inside it? A package contains functions, with each function performing a specific task. A function typically also has a set of arguments, which allow you to pass information into the function to tailor what exactly it will do. Depending on the function, the arguments can be:
mandatory - you have to specify this for a function to work
optional - you can set this parameter yourself; if you don’t, there is a default value that the function uses
This is actually also true for Python’s built-in functions, such as print() and input(), which we’ve seen earlier.
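As a small illustration with built-in functions, the sketch below calls round() and print() with and without their optional arguments. The example values are arbitrary.

```python
# round(number, ndigits=None): `number` is mandatory, `ndigits` is optional
rounded_default = round(3.14159)   # ndigits left out, so the default is used
rounded_two = round(3.14159, 2)    # ndigits set to 2 decimal places

print(rounded_default)  # 3
print(rounded_two)      # 3.14

# print() has optional arguments too, such as `sep` (default: a single space)
print("nano", "biology")          # nano biology
print("nano", "biology", sep="")  # nanobiology
```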
9.1.1.2.1. Reading documentation
When you encounter a new function, you will wonder “how do I use it exactly?” This is where documentation becomes immensely useful, and it’s crucial to know how to read it.
You can access documentation online or directly in Python (such as in code cells here in the book and in your VS Code). The online Python documentation contains information about built-in functions and more. There is also dedicated documentation for individual packages, such as the one for NumPy, which has information about NumPy’s functions.
Let’s look at an example of a NumPy function called numpy.zeros().
To learn which arguments this function takes, you can either type ?numpy.zeros to ask Python here in the book or in your VS Code, or check this function’s online documentation.
If you do that, you will see
numpy.zeros(shape, dtype=float, order='C', *, like=None)
followed by explanations about each argument, as well as information on whether it’s mandatory or optional (in numpy.zeros(), only the first one is mandatory). Depending on the argument, you will also be able to read what its default value is. At the end, there are a few concrete examples of how to use this function.
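The ? syntax is a feature of interactive environments such as Jupyter notebooks. In a plain Python script, the same documentation text is available through the built-in help() function or the function's __doc__ attribute. A minimal sketch (the variable names are illustrative):

```python
import numpy

# The help text of a function is stored in its __doc__ attribute
doc = numpy.zeros.__doc__

# Print only the first few non-empty lines to keep the output short
first_lines = [line.strip() for line in doc.splitlines() if line.strip()][:3]
for line in first_lines:
    print(line)

# For the full text, call help(numpy.zeros) instead
```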
AI tip
Need a more detailed explanation of a function, or wish to see more examples of its usage? Refer to AI.
Let’s try using numpy.zeros() ourselves:
# First we have to import numpy
import numpy
# Creating an array of 10 zeros
numpy.zeros(10)
# To see help for numpy.zeros, type ?numpy.zeros
Asterisk * in function descriptions
The asterisk * in a Python function description indicates that all the following parameters are keyword-only. This means they must be specified using their names when calling the function.
For example, with our function numpy.zeros(shape, dtype=float, order='C', *, like=None), if we want to use the last argument, we have to write:
numpy.zeros(5, int, 'C', like=None)
rather than just:
numpy.zeros(5, int, 'C', None)
Therefore, for the parameter like which is listed after the asterisk, we have to explicitly write like=.
In general, when you wish to use a function from a specific library, such as numpy.zeros(), you need to specify the library by prefixing the function name with (in this case) numpy.
To save you from some typing, you can also tell Python how you want to refer to functions from a specific package. For NumPy, it’s very typical to use this import line:
import numpy as np
which then allows you to invoke its functions using np., e.g., np.zeros() rather than the longer numpy.zeros().
Try recreating the code from above, but using np for NumPy.
# Your code here - create an array of 10 zeros with np.zeros
9.1.1.3. Importing a single function
When we import NumPy with import numpy as np, we get access to all its functions by adding np. in front of the function name.
To see what is in the library, type dir(numpy), which will generate a list of all the names it provides (functions, but also constants, submodules, and more).
The list of functions can be quite long and exhaustive.
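Because the output of dir() is an ordinary Python list of strings, you can filter it to find what you are looking for. A small sketch, with illustrative variable names:

```python
import numpy

names = dir(numpy)
print(len(names))  # how many names the package exposes

# Keep only the names that contain "zeros"
zeros_names = [name for name in names if "zeros" in name]
print(zeros_names)

# Names starting with an underscore are meant for internal use
public_names = [name for name in names if not name.startswith("_")]
print(len(public_names))
```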
If you need only a single function from a library, there is a second commonly used way to import only that single function, e.g.:
from numpy import zeros
When you do this, the function zeros() will be available directly, without any prefix:
from numpy import zeros
a = zeros(10)
print(a)
If you look around on the internet, you will also find people that do the following:
from numpy import *
(Remember: * is a wildcard and here replaces a function with any name.)
This will import all the functions from NumPy directly into the namespace with no np. or numpy. prefix. You might think: what a great idea, this will save me loads of typing! Instead of typing np.array() I could just type array(), and so on for tens of NumPy functions.
While it’s true that it will save typing, it also comes with a high risk: sometimes different packages have functions that have the same name, but do different things. A concrete example is the function sqrt(), which is available in both the math and numpy libraries. Unfortunately, math.sqrt() only works on single numbers and will give an error when given a NumPy array with more than one element.
If you import both of these libraries with the command above (using *), the second import will overwrite the same-named functions from the first, and if you’re not careful, you will forget which one you are using. This could cause your code to “break”. It will also “crowd” your namespace: you suddenly have hundreds or even thousands of functions, instead of just a library.
For these reasons, it is generally advised not to use import *, and it is considered a poor coding practice in Python.
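The sqrt() example can be reproduced in a few lines. To keep the sketch short, it imports only the name sqrt from each library rather than using *, but the overwriting works the same way. The array values are made up.

```python
import math
import numpy as np

values = np.array([1.0, 4.0, 9.0])

# NumPy's version works element by element
print(np.sqrt(values))  # [1. 2. 3.]

# The math version expects a single number and fails on this array
try:
    math.sqrt(values)
    math_failed = False
except TypeError as err:
    math_failed = True
    print("math.sqrt failed:", err)

# Importing the same name twice: the second import silently replaces the first
from math import sqrt
from numpy import sqrt
print(sqrt is np.sqrt)  # True
```

With the np. and math. prefixes, it is always clear which version is being called.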
9.1.2. NumPy package
As a nanobiologist, you will regularly work with scientific data or build computational simulations and models. A lot of the data will come in the form of arrays (think of vectors as 1D and matrices as 2D arrays), on which you will want to perform mathematical operations. In fact, the number of problems in modern biological science that use array representations is huge: from DNA datasets to neuronal networks to biomolecular networks, data and models are conveniently and powerfully represented as arrays and linear transformations (i.e., multiplying by matrices). Python’s NumPy library was built precisely to aid in these kinds of programming tasks.
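As a first taste, the sketch below builds a 1D array (a vector) and a 2D array (a matrix) and applies the matrix to the vector. The values are arbitrary: the matrix happens to rotate a 2D vector by 90 degrees counterclockwise.

```python
import numpy as np

vector = np.array([1.0, 2.0])        # 1D array
matrix = np.array([[0.0, -1.0],
                   [1.0,  0.0]])     # 2D array: rotation by 90 degrees

print(vector.ndim)  # 1
print(matrix.ndim)  # 2

# The @ operator performs matrix multiplication
transformed = matrix @ vector
print(transformed)  # the vector (-2, 1)
```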
For a deeper dive into the significance of NumPy in the world of scientific programming and for a great visual summary on some of the fundamental NumPy array concepts, see this publication. You can also visit NumPy’s website if you’re curious to learn more.
In the next sections, we will familiarize ourselves with NumPy and some of its functions.