Beginner’s Guide to Data Analysis with Pandas and Matplotlib

Your first practical guide to data analysis and visualization


Data is everywhere, from personal expenses to global climate records. However, data itself has little value until it is analyzed and visualized to uncover insights.

In this guide, we will walk through a beginner-friendly data analysis project using Python, Pandas, and Matplotlib. We will work with the well-known Iris dataset, which contains 150 samples: four measurements (sepal length, sepal width, petal length, and petal width) for each flower, across three Iris species.

By the end, you will understand how to:

  • Load a dataset and check it for missing values

  • Explore and analyze data with Pandas

  • Create visualizations with Matplotlib and Seaborn

  • Document results clearly


Prerequisites

Before you start, ensure you have:

  • A basic understanding of Python (variables, functions, lists)

  • A Python environment (Anaconda, Jupyter Notebook, or Google Colab)

  • Installed the following libraries:

pip install pandas matplotlib seaborn scikit-learn

If you prefer not to install anything locally, you can use Google Colab, which runs entirely in the browser.
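Before continuing, it can help to confirm that the libraries are actually available in your environment. The following is a minimal sketch that uses only the standard library's importlib.metadata; the names packages and versions are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used in the pip command above
packages = ["pandas", "matplotlib", "seaborn", "scikit-learn"]

versions = {}
for name in packages:
    try:
        versions[name] = version(name)
    except PackageNotFoundError:
        # The package is not installed in the current environment
        versions[name] = "not installed"

for name, ver in versions.items():
    print(f"{name}: {ver}")
```

If any line reports "not installed", rerun the pip command above in the same environment as the notebook.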


Step 1: Setting Up Your Notebook

  1. Open Google Colab.

  2. Create a new notebook.

  3. Rename it Data_Analysis_Assignment.

  4. Copy and paste the code snippets provided in the following sections.


Step 2: Loading the Dataset

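We will use the Iris dataset. The code below first tries to read a local iris.csv file; if that file does not exist, it falls back to the copy bundled with scikit-learn. The rest of this guide assumes the scikit-learn column names, such as "sepal length (cm)" and "target".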

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Try a local CSV first; fall back to the copy bundled with scikit-learn
try:
    df = pd.read_csv("iris.csv")
    print("Loaded iris.csv from local file.")
except FileNotFoundError:
    print("iris.csv not found. Loading from sklearn instead...")
    iris = load_iris(as_frame=True)
    df = iris.frame

# Preview the first five rows
df.head()

Expected result: A table showing the first five rows of the dataset.
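In the scikit-learn version of the dataset, the "target" column holds integer codes rather than species names. As a sketch, assuming the data came from load_iris, the codes can be mapped to readable names. The "species" column and the code_to_name dictionary are illustrative additions, not part of the original frame.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()

# iris.target_names lists the species in code order (0, 1, 2)
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}

# "species" is a new, illustrative column; the original "target" column is kept
df["species"] = df["target"].map(code_to_name)

print(code_to_name)
print(df[["target", "species"]].drop_duplicates())
```

The remaining steps keep using "target" so that they match the original snippets, but "species" could be substituted for more readable axis labels and legends.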


Step 3: Exploring the Dataset

Check the structure, data types, and missing values.

# Info about dataset
df.info()

# Check missing values
print(df.isnull().sum())

Expected result: A summary of the dataset structure and confirmation that there are no missing values.
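A few extra checks are often useful at this stage: the overall shape, the number of samples per class, and duplicate rows. The sketch below assumes the scikit-learn copy of the data; the variable names are illustrative.

```python
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

# Overall size of the table
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Number of samples per class code; a balanced dataset has equal counts
print(df["target"].value_counts().sort_index())

# Total missing values across all columns
total_missing = int(df.isnull().sum().sum())
print(f"Missing values: {total_missing}")

# Fully identical rows, which may or may not be worth dropping
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")
```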


Step 4: Basic Data Analysis

Generate statistics and group data by species.

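# Descriptive statistics (print() is needed because a notebook cell
# only displays its last expression automatically)
print(df.describe())

# Group by species and calculate the mean of each measurement.
# In the scikit-learn frame, "target" is an integer code:
# 0 = setosa, 1 = versicolor, 2 = virginica
grouped = df.groupby("target").mean()
print(grouped)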

Observation:

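  • Setosa has the smallest petals and the shortest sepals, but the widest sepals on average

  • Virginica has the largest petals and the longest sepals

  • Versicolor lies in between on petal size and sepal length, and has the narrowest sepals on average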
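To check the ordering for each measurement rather than reading it off the table by eye, one option is to ask pandas which species has the lowest and highest mean in every column. This is a sketch assuming the scikit-learn data; the "species" column and the ranking table are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(
    {code: str(name) for code, name in enumerate(iris.target_names)}
)

# Mean of each measurement per species ("target" is dropped so it is not averaged)
means = df.drop(columns="target").groupby("species").mean()
print(means.round(2))

# For each measurement, the species with the lowest and the highest mean
ranking = pd.DataFrame({"lowest": means.idxmin(), "highest": means.idxmax()})
print(ranking)
```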


Step 5: Data Visualization

We will now create different plots using Matplotlib and Seaborn.

Line Chart – Sepal Length Trend

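# The x-axis is the row index, not time; in the scikit-learn data
# the rows are grouped by species
plt.plot(df["sepal length (cm)"])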
plt.title("Sepal Length Trend")
plt.xlabel("Sample")
plt.ylabel("Sepal Length (cm)")
plt.show()

Fig. 1.0 Line chart

Bar Chart – Average Petal Length by Species

# errorbar=None hides the error bars (seaborn 0.12+; older versions use ci=None)
sns.barplot(x="target", y="petal length (cm)", data=df, errorbar=None)
plt.title("Average Petal Length by Species")
plt.xlabel("Species")
plt.ylabel("Petal Length (cm)")
plt.show()

Fig. 1.1 Bar chart

Histogram – Sepal Width Distribution

plt.hist(df["sepal width (cm)"], bins=20, color="skyblue", edgecolor="black")
plt.title("Distribution of Sepal Width")
plt.xlabel("Sepal Width (cm)")
plt.ylabel("Frequency")
plt.show()

Fig. 1.2 Histogram
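A single histogram pools all three species together. As a possible variation, the sketch below overlays one histogram per species using plain Matplotlib, so the per-species distributions can be compared. The bin count and transparency are arbitrary choices, and the scikit-learn data is assumed.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
column = "sepal width (cm)"

# Use the same bin edges for every species so the bars are comparable
value_range = (df[column].min(), df[column].max())

fig, ax = plt.subplots()
for code, name in enumerate(iris.target_names):
    widths = df.loc[df["target"] == code, column]
    # alpha < 1 keeps overlapping bars visible
    ax.hist(widths, bins=15, range=value_range, alpha=0.5, label=str(name))

ax.set_title("Distribution of Sepal Width by Species")
ax.set_xlabel("Sepal Width (cm)")
ax.set_ylabel("Frequency")
ax.legend(title="Species")
plt.show()
```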

Scatter Plot – Sepal vs Petal Length

sns.scatterplot(
    x="sepal length (cm)", 
    y="petal length (cm)", 
    hue="target", 
    data=df
)
plt.title("Sepal Length vs Petal Length")
plt.show()

Fig. 1.3 Scatter plot


Each visualization will appear below the corresponding cell.
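To reuse a plot in a report, it can also be written to an image file. The following is a minimal sketch with plain Matplotlib, assuming the scikit-learn data; the file name sepal_vs_petal.png is a hypothetical example.

```python
from pathlib import Path

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

fig, ax = plt.subplots()
# Colour each point by its integer class code
ax.scatter(df["sepal length (cm)"], df["petal length (cm)"], c=df["target"])
ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("Sepal Length (cm)")
ax.set_ylabel("Petal Length (cm)")

# Saving before plt.show() is a common precaution, since some setups
# clear the current figure once it has been displayed
output_path = Path("sepal_vs_petal.png")  # hypothetical file name
fig.savefig(output_path, dpi=150, bbox_inches="tight")
plt.show()

print(f"Saved figure to {output_path.resolve()}")
```

In Google Colab the file is written to the session's working directory and appears in the file browser on the left.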


Step 6: Error Handling

When loading datasets, files may be missing or malformed. The try-except block below, repeated from Step 2, handles the missing-file case by falling back to scikit-learn.

try:
    df = pd.read_csv("iris.csv")
except FileNotFoundError:
    iris = load_iris(as_frame=True)
    df = iris.frame

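This ensures that the notebook continues running even if the CSV file is unavailable. Note that it catches only FileNotFoundError; a file that exists but cannot be parsed would still raise an error.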
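A slightly more defensive variant could also cover files that exist but are empty, unparsable, or missing the columns used in this guide. The sketch below is one way to do that; load_iris_frame and EXPECTED_COLUMNS are hypothetical names introduced here, while pd.errors.EmptyDataError and pd.errors.ParserError are the exceptions pandas raises for empty and malformed CSV input.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Columns the rest of the guide relies on (scikit-learn naming)
EXPECTED_COLUMNS = {
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "target",
}


def load_iris_frame(path="iris.csv"):
    """Hypothetical helper: read a local CSV, else fall back to scikit-learn."""
    try:
        df = pd.read_csv(path)
        missing = EXPECTED_COLUMNS - set(df.columns)
        if missing:
            # Treat a CSV with unexpected column names as unusable
            raise ValueError(f"missing columns: {sorted(missing)}")
        print(f"Loaded {path} from local file.")
        return df
    except FileNotFoundError:
        print(f"{path} not found.")
    except (pd.errors.EmptyDataError, pd.errors.ParserError, ValueError) as exc:
        print(f"{path} could not be used ({exc}).")
    print("Loading from sklearn instead...")
    return load_iris(as_frame=True).frame


df = load_iris_frame()
print(df.shape)
```

The column check matters because Iris CSV files from other sources often use different header names, in which case the later plotting code would fail with a KeyError.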


Step 7: Observations and Findings

From the analysis and visualizations:

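  • Setosa is clearly separable from the other two species, especially by petal measurements; Versicolor and Virginica overlap somewhat

  • Sepal width is roughly bell-shaped across the whole dataset

  • The scatter plot shows Setosa as a distinct cluster, with Versicolor and Virginica forming adjacent groups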
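One simple way to put a number on how well the species separate is to compare the range of a single measurement per species. The sketch below does this for petal length, assuming the scikit-learn data; the "species" column and the ranges_overlap helper are illustrative names.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(
    {code: str(name) for code, name in enumerate(iris.target_names)}
)

# Smallest and largest petal length observed for each species
ranges = df.groupby("species")["petal length (cm)"].agg(["min", "max"])
print(ranges)


def ranges_overlap(a, b):
    """Return True if the petal-length ranges of species a and b intersect."""
    return bool(
        ranges.loc[a, "min"] <= ranges.loc[b, "max"]
        and ranges.loc[b, "min"] <= ranges.loc[a, "max"]
    )


for a, b in [("setosa", "versicolor"), ("versicolor", "virginica")]:
    print(f"{a} vs {b}: overlap = {ranges_overlap(a, b)}")
```

Two species whose ranges do not overlap can be told apart by that one measurement alone; overlapping ranges mean more than one feature is needed.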


Step 8: Conclusion

In this project, we:

  • Learned how to load and inspect datasets with Pandas

  • Performed descriptive statistics

  • Created multiple visualizations with Matplotlib and Seaborn

  • Implemented basic error handling

The Iris dataset clearly demonstrates how data analysis workflows can uncover meaningful patterns. This serves as a strong starting point for learning data analysis with Python.