{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Files with pandas\n", "\n", "```{admonition} Interactive page\n", ":class: warning, dropdown\n", "This is an interactive book page. Press launch button at the top right side.\n", "```\n", "\n", "In the previous section, we used NumPy to work with a file containing numerical values. When our data contains **different data types**, which will often be the case, we resort to pandas package. \n", "\n", "What does such data look like? Let's use [Palmer Archipelago (Antarctica) penguin data](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data?resource=download) as an example dataset.\n", "\n", "```{figure} ../images/chapter9/penguin-data.png\n", "---\n", "height: 70px\n", "name: penguin-data\n", "---\n", "The first few rows of Palmer Archipelago (Antarctica) penguin data. Note that the column values are not aligned with column names in this preview.\n", "```\n", "\n", "The penguin dataset contains text (`studyName`, `Island`, etc.) and numeric data (`Sample Number`, `Body mass (g)`, etc.), as well as some missing values (you can spot them in CSV by consecutive commas `,,,,` like in the second to last row in the figure above).\n", "\n", "To use the pandas library for importing such a data file, we first need to import pandas (as any Python library). It is commonly imported in this way:\n", "\n", "```\n", "import pandas as pd\n", "```\n", "\n", "allowing us to call functions from pandas using the prefix `pd.`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading a file\n", "\n", "To load data into Python using pandas, you can use its `pd.read_*` function, where wildcard `*` stands for file type. E.g., you can use `pd.read_csv()` for CSV files, `pd.read_excel()` for Excel's xlsx files, etc.\n", "\n", "```{figure} ../images/chapter9/pandas-readwrite.svg\n", "---\n", "height: 200px\n", "name: pandas-readwrite\n", "---\n", "Reading and writing tabular data with pandas (image from [pandas website](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html), © 2024, pandas, image cannot be reused).\n", "```\n", "\n", "If passed to a variable, these functions will store data from your file into a pandas `DataFrame` type. As a default, printing it displays the first and last five rows. Let's try it with our CSV file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "penguins = pd.read_csv(\"penguins.csv\")\n", "print(penguins)\n", "print(type(penguins))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we had an Excel file with these data stored in a sheet called \"penguin_analysis\", we could use `pd.read_excel()` for import:\n", "```\n", "penguins = pd.read_excel(\"penguins.xlsx\", sheet_name=\"penguin_analysis\")\n", "```\n", "\n", "Very conveniently, `head()` and `tail()` methods from pandas allow you to print the selected number of rows at the beginning or end of your DataFrame, respectively. Note that the display of the rows also looks quite neat." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "penguins.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we're curious about the data types in the different columns of our `penguins` DataFrame, we can use the `dtypes` atttribute. In our DataFrame, we have integers (`int64`), floats (`float64`) and strings (`object`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "penguins.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Did something look strange to you in the `head` and `dtypes` code lines above?**\n", "\n", "Note that we didn't use `pd.head(penguins)` or `pd.dtypes(penguins)`, but instead `penguins.head()` and `penguins.dtypes`! Notice also that the difference between a method `head` and an attribute `dtypes` in the usage of **round brackets**. The details of the reason behind this go beyond the level of this book (we give you a glimpse into the reasons in the info box below, in case you're curious). For now, it's only important to remember that pandas works this way.\n", "\n", "```{admonition} Functional programming vs. object-oriented programming\n", ":class: note\n", "There are two programming paradigms: functional programming ([FP](https://en.wikipedia.org/wiki/Functional_programming)) and object-oriented programming ([OOP](https://en.wikipedia.org/wiki/Object-oriented_programming)). FP is based on **functions** that take inputs and produce outputs. OOP is based on **objects**, which can contain both data (**attributes**) and behaviour (**methods**) that describe characteristics and actions. The details of FP vs. OOP are beyond this book, and we don't delve into OOP.\n", "\n", "The reason for `penguins.head()` is that **pandas is designed using OOP principles**, where `head()` is a **method** (not a function), so we call `penguins.head()` a \"method call\". Similarly, `dtypes` is an **attribute** of a DataFrame, and it doesn't require the usage of `()`.\n", "\n", "pandas also has some **functional interfaces** where you use `pd.` (e.g., `pd.read_csv()` above), though it's less commonly used and not as extensive as in NumPy.\n", "\n", "Note that **NumPy** is also designed with OOP principles. E.g., for a NumPy array `a = np.array([1, 2, 3])`, you can get its mean either with `np.mean(a)` or with `a.mean()`.\n", "```\n" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-0793eb4c8f85801b", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "## Saving a file\n", "\n", "As we've seen in {numref}`pandas-readwrite`, we can save pandas DataFrames into files using `to_*` where wildcard `*` again stands for different file types, e.g., `to_csv()` or `to_excel()`. Explore: what does `index=False` in the code below do?\n", "\n", "```\n", "penguins.to_excel(\"penguins_data.xlsx\", sheet_name=\"penguins_analysis\", index=False)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manipulating data\n", "\n", "Being able to manipulate data is **essential** in science. Some estimates say that **up to 80% of work in data analysis** (e.g., in bioinformatics) is data manipulation, i.e., cleaning and preparing the data so that it's in a form needed for analysis and plotting.\n", "\n", "Here we offer a glimpse into the possibilities of data manipulation with pandas. To learn more, visit [pandas tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html), or simply use Google, ChatGPT, or check the cheat sheet when you come across a data manipulation action that you don't know how to perform." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subset\n", "\n", "#### Columns\n", "\n", "```{figure} ../images/chapter9/subset-column.svg\n", "---\n", "height: 200px\n", "name: subset-column\n", "---\n", "Subsetting columns from a pandas DataFrame (image from [pandas website](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html), © 2024, pandas, image cannot be reused).\n", "```\n", "\n", "To select a single column, use square brackets `[]` with the column name of the column of interest. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Select column \"Species\"\n", "penguin_region = penguins[\"Body Mass (g)\"]\n", "print(penguin_region.head(3))\n", "\n", "# Print output variable type\n", "print(type(penguin_region))\n", "\n", "# Print output variable shape\n", "print(penguin_region.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, a **column of a pandas DataFrame is a pandas Series** with `shape` (344,), i.e., 344 rows and no columns. That tells you that Series is one-dimensional. From the usage `penguin_region.shape` you can also see that `shape` is an attribute, so no brackets are used.\n", "\n", "If you're interested in extracting several columns, you can do that in one go like this: `dataframe[\"column1\", \"column2\"]`. Note that in this case your output is 2D, and therefore a pandas DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Rows\n", "\n", "```{figure} ../images/chapter9/subset-rows.svg\n", "---\n", "height: 200px\n", "name: subset-rows\n", "---\n", "Subsetting rows from a pandas DataFrame (image from [pandas website](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html), © 2024, pandas, image cannot be reused).\n", "```\n", "\n", "You can also extract rows from a pandas DataFrame **based on a condition**. You write the condition between `[]`. Let's say you're interested in penguins with body mass above 5 kg:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "above_5kg = penguins[penguins[\"Body Mass (g)\"] > 5000]\n", "above_5kg.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Selection based on condition\n", ":class: note, dropdown\n", "The output of the conditional expression (in this case `>`) is a pandas Series of Boolean values (`True` or `False`) with the same number of rows as the original DataFrame. Such a Series of Boolean values can be used to filter the DataFrame by putting it in between the selection brackets `[]`. Only rows for which the value is `True` will be selected.\n", "```\n", "\n", "Let's check how many rows we actually end up with compared to the initial DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "print(\"Initial shape =\", penguins.shape)\n", "print(\"Above 5 kg =\", above_5kg.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also **combine conditions** using special operators: `&` meaning `and` and `|` meaning `or`. For example, to filter the DataFrame rows of male penguins that weigh below 4 kg, we can use the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "male_below_4kg = penguins[(penguins[\"Body Mass (g)\"] < 4000) & (penguins[\"Sex\"] == \"MALE\")]\n", "\n", "print(\"Male below 4 kg =\", male_below_4kg.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Rows and columns\n", "\n", "```{figure} ../images/chapter9/subset_columns_rows.svg\n", "---\n", "height: 200px\n", "name: subset-rows-column\n", "---\n", "Subsetting rows and column from a pandas DataFrame (image from [pandas website](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html), © 2024, pandas, image cannot be reused).\n", "```\n", "\n", "To select both specific rows and columns in one go, `[]` are not sufficient. For this, we need to use operators in front of `[]`:\n", "- `loc` for selecting based on column names \n", "- `iloc` for selecting based on position (row/column position or range) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "species_above_5kg = penguins.loc[penguins[\"Body Mass (g)\"] > 5000, \"Species\"]\n", "print(species_above_5kg.shape)\n", "\n", "subset = penguins.iloc[9:25, 2:5]\n", "print(subset.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create columns\n", "\n", "Often in science we want to create new columns in our data for analysis or plotting purposes. For example, we may want to change the unit of data in a given column, or log-transform our data. \n", "\n", "Let's add a new column `Body Mass (kg)` to our DataFrame by dividing `Body Mass (g)` by 1000:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "penguins[\"Body Mass (kg)\"] = penguins[\"Body Mass (g)\"] / 1000\n", "penguins.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the operations is done \"element-wise\", i.e., on all values in the `Body Mass (g)` column at once. Therefore, **you don't need a loop to iterate over rows**.\n", "\n", "Using analogous syntax, you can also make new columns based on the existing ones. Say we have a column with mass and a column with volume, we can create a column with density like this:\n", "```\n", "dataframe[\"density\"] = dataframe[\"mass\"] / dataframe[\"volume\"]\n", "```\n", "\n", "Other mathematical operators (`+`, `-`, `*`,…) and logical operators (`<`, `>`, `==`, …) also work element-wise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combine data from multiple tables\n", "\n", "Very often we need to combine data from several tables. Let's explore two scenarios for combining tables.\n", "\n", "#### Concatenating\n", "\n", "```{figure} ../images/chapter9/concat-row.svg\n", "---\n", "height: 200px\n", "name: concat\n", "---\n", "Concatenating two pandas DataFrames (image from [pandas website](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html), © 2024, pandas, image cannot be reused).\n", "```\n", "\n", "Say you have performed two experiments and stored your data in two tables that contain the columns of the same names. To \"paste\" the rows of one table below the rows of the other, we perform **concatenation**.\n", "\n", "Explore: what does `axis=0` mean in `pd.concat()`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Load data frame of measurements for iris setosa flowers\n", "setosa = pd.read_csv(\"Iris-setosa.csv\", delimiter=\";\")\n", "print(setosa.head(2))\n", "\n", "# Load data frame of measurements for iris virgnica flowers\n", "virginica = pd.read_csv(\"Iris-virginica.csv\", delimiter=\";\")\n", "print(virginica.head(2)) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Concatenate two tables of iris flower measurements\n", "iris = pd.concat([setosa, virginica], axis=0)\n", "\n", "print(setosa.shape)\n", "print(virginica.shape)\n", "print(iris.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Join tables\n", "\n", "```{figure} ../images/chapter9/merge.svg\n", "---\n", "height: 200px\n", "name: concat\n", "---\n", "Merging two pandas DataFrames based on a key (image from [pandas website](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html), © 2024, pandas, image cannot be reused).\n", "```\n", "\n", "Let's say that you have a large table from a proteomics experiment, containing measurements of all proteins in a yeast cell. Imagine that you want to learn more about these proteins by combining this data with information from databases (such as protein length, their function, etc.). This would require you to combine (join, merge) two DataFrames **using a common identifier (key)**.\n", "\n", "Let's look again at the iris dataset as an example. This time, two data files contain information on the same plants, but have different measurements:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "# Load data frame of sepal measurements \n", "sepal = pd.read_csv(\"Iris-Sepal.csv\", delimiter=\";\")\n", "print(sepal.head(5))\n", "print(sepal.tail(5))\n", "\n", "# Load data frame of petal measurements \n", "petal = pd.read_csv(\"Iris-Petal.csv\", delimiter=\";\")\n", "print(petal.head(5)) \n", "print(petal.tail(5)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What could be a useful identifier (key) in these two DataFrames?\n", "\n", "If we were to use the iris `Species` column, that wouldn't be unique, e.g., there a multiple rows with \"Iris-setosa\". What does appear to be a unique identified is `Id`. Let's use that column for merging the two tables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "all_measurements = pd.merge(sepal, petal, how=\"left\", on=\"Id\")\n", "all_measurements.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we could do the same thing even if the ID column was named differently in the two DataFrames - in that case, we would need to use `left_on='key_column_name_first_df', right_on='key_column_name_second_df'`.\n", "\n", "Notice also the `how` argument. What does that mean, and what are the alternatives to \"left\"?\n", "Explore this in pandas documentation or with `?`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "thebe-remove-input-init" ] }, "outputs": [], "source": [ "import micropip\n", "await micropip.install(\"jupyterquiz\")\n", "from jupyterquiz import display_quiz\n", "import json\n", "\n", "with open(\"questions7.json\", \"r\") as file:\n", " questions=json.load(file)\n", " \n", "display_quiz(questions, border_radius=0)" ] } ], "metadata": { "celltoolbar": "Create Assignment", "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "toc": { "base_numbering": "5", "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "249.797px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }