{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Libraries, documentation, and NumPy\n", "\n", "```{admonition} Interactive page\n", ":class: warning, dropdown\n", "This is an interactive book page. Press the launch button at the top right.\n", "```\n", "\n", "NumPy (**Num**erical **Py**thon) is the fundamental **package for scientific computing** in Python.\n", "\n", "But what are Python packages, and why would we want to use NumPy in particular?\n", "\n", "## Python packages\n", "\n", "**Packages are building blocks of programming**.\n", "Packages consist of **reusable pieces of code** so that you do not have to program everything from scratch (and, thereby, you don't have to keep \"reinventing the wheel\"). \n", "Sometimes we refer to published packages as **libraries**.\n", "According to [Python packages](https://py-pkgs.org/01-introduction.html), there are over 350,000 Python packages (based on data from 2022), and the number keeps growing. This means that there is **a lot of code out there for you to reuse**!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Why use packages?\n", ":class: note, dropdown\n", "Imagine you are working with a vector of numbers and, using Python, you wish to calculate the average of the numbers in your vector.\n", "As you can imagine, many people before you have wanted to do the same thing in Python.\n", "\n", "To prevent programmers from reinventing the same pieces of code time and again, there are Python packages available, each containing pre-made code (such as code for calculating an average). Another good reason to use packages is computational efficiency - code from packages often runs much faster than code you would write yourself (in NumPy's case, because the heavy computations run in optimized, compiled code). 
More on this in a [later section on vectorization](vectorization.ipynb).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "```{admonition} Useful packages\n", ":class: tip, dropdown\n", "Python packages often focus on a **specific problem or domain**. Sometimes several libraries offer options that will result in the same outcome, e.g., you can make an xy scatter plot with more than one library. \\\n", "Here we offer an overview of some widely used Python packages that will likely come in handy in your nanobiology studies and research. Packages that we will use **in this book** are shown in bold.\n", "* Plotting and visualization\n", "  * **Matplotlib**\n", "  * **seaborn** \n", "  * Plotly \n", "* Scientific computing\n", "  * **NumPy**\n", "  * SciPy\n", "* Bioinformatics\n", "  * Biopython\n", "* Data analysis\n", "  * **pandas**\n", "* Machine learning\n", "  * Scikit-learn\n", "  * PyTorch\n", "* Utilities and tools\n", "  * os\n", "  * re\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importing a package\n", "\n", "When you want to use a package in your code, you have to tell Python explicitly. We call this **importing a package** into the *namespace* of your kernel. A namespace is a container that holds the names of functions and variables.\n", "\n", "You typically do this at the very beginning of your Python script by using a line such as this:\n", "\n", "```\n", "import numpy\n", "```\n", "\n", "If you have multiple packages to import, you typically write several import lines, one beneath the other.\n", "\n", "In order to use a package, you have to make sure it's actually **installed**. If you use Anaconda, it installs a lot of the packages and tools for you. In our case, we installed a more lightweight version called Miniconda (remember the [installation steps](../chapter3/installation.md)), so we will need to install additional packages ourselves. 
\n", "\n", "If you want to install a package on your own, two popular tools for doing so are `pip` and `conda`. In practice, installing a package (such as NumPy) is as simple as **opening your terminal and running one of these lines**:\n", "\n", "```\n", "conda install numpy\n", "```\n", "\n", "or \n", "\n", "```\n", "pip install numpy\n", "```\n", "\n", "If you're unsure **whether a package is installed**, you can check in your terminal using `pip list` or `conda list`. Check with conda - do you already have NumPy installed?\n", "\n", "```{admonition} Pip vs. conda\n", ":class: note, dropdown\n", "Both `pip` and `conda` can be used to install packages, so what are the differences between them?\n", "There are [three differences](https://numpy.org/install/):\n", "1. `conda` is cross-language, meaning that it can also install non-Python libraries and tools, while `pip` is for Python only.\n", "2. `conda` installs from its own channels, while `pip` installs from the Python Package Index (PyPI). The latter is the largest collection of packages, but all popular ones are also available with `conda`.\n", "3. `conda` is an integrated tool for managing packages, dependencies, and environments, while with `pip` you may need other tools for dealing with environments (see below) or complex dependencies.\n", "```\n", "\n", "```{admonition} Hands-on: installing NumPy on your laptop\n", ":class: important\n", "\n", "If you type `conda list` in your terminal, you will notice that NumPy wasn't part of our Miniconda installation. \n", "Therefore, if you try to run the line `import numpy` in VS Code, it will result in an error.\n", "\n", "Let’s install NumPy from within the terminal by running:\n", "`conda install numpy`.\n", "When prompted, type in `y` for “yes” and press `Enter`.\n", "If you then rerun `conda list` after the installation is finished, you will find NumPy in the list.\n", "\n", "In VS Code, select View > Command Palette. 
Then type in and select the `Python: Select Interpreter` command from the Command Palette. From the offered options, select the Python interpreter labeled “base”. \n", "Now you can import and use NumPy in your scripts in VS Code!\n", "\n", "Note: we will use only a few basic packages throughout this book, which we'll install in our \"base\" environment. To learn more about creating Python environments, see the dropdown box on *Environments* below.\n", "```\n", "\n", "````{admonition} Environments\n", ":class: note, dropdown\n", "A Python environment is a setup in which a Python script is executed. \n", "For instance, if you use functions from the NumPy package in your code, (a specific version of) NumPy has to be present in the environment.\n", "Because packages can depend on each other, and because details of functions can change between different versions of a package, your code may stop working (\"break\") after an update. \n", "It is, therefore, good practice to work within environments that contain specific versions of the packages needed in your code.\n", "\n", "```{figure} ../images/chapter1/environments.png\n", "---\n", "height: 250px\n", "name: environment\n", "---\n", "Illustration of two separate Python environments, each containing a different version of Python and different packages.\n", "```\n", "\n", "In practice, if you want to make an environment, you can do that from your terminal in the following way:\n", "\n", "```\n", "# Best practice: use an environment rather than installing into the base environment\n", "# These two lines create and activate an environment called my-environment\n", "conda create -n my-environment\n", "conda activate my-environment\n", "\n", "# Installing a package (numpy) inside the activated my-environment\n", "conda install numpy\n", "```\n", "\n", "To then use this custom environment (my-environment) in VS Code, you need to select it as your Python interpreter by opening the Command Palette, typing `Python: Select Interpreter`, and hitting `Enter`. 
A list of available interpreters will appear; select the one that corresponds to this environment.\n", "\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Packages contain functions\n", "\n", "But what does a package actually look like? What's inside it?\n", "A package contains **functions**, with each function performing a specific **task**.\n", "A function typically also has a set of **arguments**, which allow you to pass information into the function to tailor what exactly it will do. \n", "Depending on the function, the arguments can be:\n", "- **mandatory** - you **have to** specify this for the function to work\n", "- **optional** - you **can** set this parameter yourself; if you don't, there is a **default** value that the function uses\n", "\n", "This is actually also true for Python's built-in functions, such as `print()` and `input()`, which we've seen earlier.\n", "\n", "\n", "#### Reading documentation\n", "\n", "When you encounter a new function, you will wonder \"how do I use it exactly?\" This is where **documentation** becomes immensely useful, and it's crucial to know **how to read it**.\n", "\n", "You can access documentation online or directly in Python (such as in code cells here in the book and in VS Code).\n", "The online Python [documentation](https://docs.python.org/3/library/index.html) contains information about built-in functions and more. \n", "Packages also have their own dedicated documentation, such as the one for NumPy, which has information about NumPy's functions.\n", "\n", "Let's look at an example of a NumPy function called `numpy.zeros()`. 
\n", "To learn which arguments this function takes, you can either type `?numpy.zeros` in a code cell here in the book or in VS Code, or check this function's [online documentation](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html).\n", "If you do that, you will see\n", "\n", "```\n", "numpy.zeros(shape, dtype=float, order='C', *, like=None)\n", "```\n", "\n", "followed by explanations about each argument, as well as information on whether it's mandatory or optional (in `numpy.zeros()`, only the first one is mandatory). Depending on the argument, you will also be able to read what its default value is. At the end, there are a few concrete examples of how to use this function.\n", "\n", "Let's try using `numpy.zeros()` ourselves:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# First we have to import numpy\n", "import numpy\n", "\n", "# Creating an array of 10 zeros\n", "numpy.zeros(10)\n", "\n", "# To see help for numpy.zeros, type ?numpy.zeros (or help(numpy.zeros))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````{admonition} Asterisk * in function descriptions\n", ":class: tip\n", "\n", "The asterisk `*` in a Python function description indicates that all the **following parameters are keyword-only**. 
This means they must be specified using their names when calling the function.\n", "\n", "For example, with our function `numpy.zeros(shape, dtype=float, order='C', *, like=None)`, if we want to use the last argument, we have to write:\n", "\n", "```\n", "numpy.zeros(5, int, 'C', like=None)\n", "```\n", "\n", "rather than just:\n", "\n", "```\n", "numpy.zeros(5, int, 'C', None)\n", "```\n", "\n", "Therefore, for the parameter `like`, which is listed after the asterisk, we have to explicitly write `like=`.\n", "\n", "````" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, when you wish to use a function from a specific library, such as `numpy.zeros()`, you need to specify the library by prefixing the function name with (in this case) `numpy.`.\n", "To save yourself some typing, you can also tell Python how you want to refer to a specific package. For NumPy, it's very typical to use this import line:\n", "\n", "```\n", "import numpy as np\n", "```\n", "\n", "which then allows you to invoke its functions using `np.`, e.g., `np.zeros()` rather than the longer `numpy.zeros()`.\n", "Try recreating the code from above, but using `np` for NumPy." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Your code here - create an array of 10 zeros with np.zeros" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importing a single function\n", "\n", "When we import NumPy with `import numpy as np`, we get access to all its functions by adding `np.` in front of the function name.\n", "To see what is in the library, type `dir(np)`, which will generate a list of the names (functions and more) that it provides.\n", "This list can be quite long.\n", "\n", "If you **need only a single function from a library**, there is a second commonly used way to import only that single function, e.g.:\n", "\n", "```\n", "from numpy import zeros\n", "```\n", "\n", "When you do this, the function `zeros()` will be available directly, without any prefix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "from numpy import zeros\n", "a = zeros(10)\n", "print(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look around on the internet, you will also find people who do the following:\n", "\n", "```\n", "from numpy import *\n", "```\n", "\n", "(Remember: `*` is a [wildcard](../chapter2/bash-commands.ipynb) and here replaces a function with *any* name.)\n", "\n", "This will import all the functions from NumPy directly into the namespace with no `np.` or `numpy.` prefix. You might think: what a great idea, this will save me loads of typing! Instead of typing `np.array()` I could just type `array()`, and so on for tens of NumPy functions. \n", "\n", "While it's true that it will save typing, it also comes with a high risk: sometimes **different packages have functions that have the same name**, but do different things. A concrete example is the function `sqrt()`, which is available in both the `math` and `numpy` libraries. 
Unfortunately, `math.sqrt()` raises an error when given a NumPy array with more than one element, whereas `numpy.sqrt()` works element-wise on arrays. \n", "\n", "If you import both of these libraries with the command above (using `*`), the second import will overwrite any same-named functions from the first, and if you're not careful, you will forget which one you are using. This could cause your code to \"break\". It will also \"crowd\" your namespace: you suddenly have hundreds or even thousands of function names, instead of just a single library name. \n", "\n", "For these reasons, it is generally **advised not to use `import *`**, and it is considered a **poor coding practice** in Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NumPy package\n", "\n", "As a nanobiologist, you will regularly work with scientific data or build computational simulations and models. \n", "A lot of the data will come in the form of arrays (think of vectors as 1D and matrices as 2D arrays), on which you will want to perform mathematical operations.\n", "In fact, the number of **problems in modern biological science** that use array representations is huge: from DNA datasets to neuronal networks to biomolecular networks, data and models are conveniently and powerfully represented as arrays and linear transformations (i.e., multiplying by matrices). Python's NumPy library was built precisely to aid in this kind of programming task.\n", "\n", "For a deeper dive into the significance of NumPy in the world of scientific programming and for a *great visual summary* of some of the fundamental NumPy array concepts, see [this publication](https://doi.org/10.1038/s41586-020-2649-2). 
You can also visit [NumPy's website](https://numpy.org/doc/stable/user/whatisnumpy.html) if you're curious to learn more.\n", "\n", "In the next sections, we will familiarize ourselves with NumPy and some of its functions.\n", "\n" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }