pandas: Data Analysis in Python
pandas is the foundational library for data analysis in Python. It provides DataFrame (table) and Series (column) data structures with powerful operations for cleaning, transforming, and analyzing data.
# pip install pandas
import pandas as pd
import numpy as np
# Create a DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age": [30, 25, 35, 28],
    "salary": [75000, 60000, 90000, 70000],
    "department": ["Eng", "Sales", "Eng", "HR"]
})
print(df)
# name age salary department
# 0 Alice 30 75000 Eng
# 1 Bob 25 60000 Sales
# 2 Carol 35 90000 Eng
# 3 Dave 28 70000 HR
# Basic info
df.shape # (4, 4) → rows, columns
df.dtypes # Column data types
df.info() # Summary including null counts
df.describe() # Statistics: mean, std, min, max, quartiles

Selecting Data
# Select column
df["name"] # Series
df[["name", "salary"]] # DataFrame with multiple columns
# Select rows
df.iloc[0] # First row by position
df.iloc[1:3] # Rows 1 and 2 by position
df.loc[0] # Row by label/index
df.loc[0, "name"] # Specific cell
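The loc/iloc distinction only becomes visible when the index labels differ from row positions. A minimal sketch, using a hypothetical frame with a non-default index:

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30) differ from positions (0, 1, 2)
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "age": [30, 25, 35]},
    index=[10, 20, 30],
)

by_position = df.iloc[0]   # first row by position, regardless of labels
by_label = df.loc[10]      # row whose index label is 10
cell = df.loc[20, "name"]  # single cell by label

print(by_position["name"])  # Alice
print(by_label["name"])     # Alice
print(cell)                 # Bob
```

With the default RangeIndex (as in the example above), labels and positions coincide, which is why df.loc[0] and df.iloc[0] return the same row there.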
# Boolean filtering
df[df["age"] > 28] # Rows where age > 28
df[df["department"] == "Eng"] # Engineering department
df[(df["age"] > 25) & (df["salary"] > 70000)] # Multiple conditions
df[df["name"].isin(["Alice", "Carol"])] # Name in list

Transformations
# Add columns
df["annual_bonus"] = df["salary"] * 0.1
df["senior"] = df["age"] > 30
df["name_upper"] = df["name"].str.upper()
# Apply a function
df["grade"] = df["salary"].apply(lambda s: "A" if s > 80000 else "B")
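apply runs a Python function row by row, which is flexible but slow on large frames; for a simple condition like this one, np.where is a vectorized alternative. A sketch using the salary column from the example frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"salary": [75000, 60000, 90000, 70000]})

# Row-wise apply: a Python-level loop over the column
df["grade_apply"] = df["salary"].apply(lambda s: "A" if s > 80000 else "B")

# Vectorized equivalent: the comparison runs on the whole column at once
df["grade_vec"] = np.where(df["salary"] > 80000, "A", "B")

print(df["grade_vec"].tolist())  # ['B', 'B', 'A', 'B']
```

Both produce the same result; prefer the vectorized form when the logic fits, and fall back to apply for genuinely arbitrary per-row logic.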
# Aggregations
df["salary"].mean() # 73750.0
df["salary"].sum() # 295000
df["salary"].max() # 90000
df["age"].std() # Standard deviation
# Group by
dept_stats = df.groupby("department")["salary"].agg(["mean", "max", "count"])
print(dept_stats)
# mean max count
# department
# Eng 82500 90000 2
# HR 70000 70000 1
# Sales 60000 60000 1
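Passing a list to agg names the output columns after the functions ("mean", "max", "count"). Named aggregation lets you choose the output column names explicitly; a sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Sales", "Eng", "HR"],
    "salary": [75000, 60000, 90000, 70000],
})

# Named aggregation: output_name=(input_column, function)
stats = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    top_salary=("salary", "max"),
    headcount=("salary", "count"),
)
print(stats.loc["Eng", "avg_salary"])  # 82500.0
```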
# Sort
df.sort_values("salary", ascending=False)
df.sort_values(["department", "salary"])

Data Cleaning
# Handling missing values
df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Dave"],
    "score": [85.0, None, 92.0, 78.0]
})
df.isnull() # Boolean mask of nulls
df.isnull().sum() # Count nulls per column
df.dropna() # Drop rows with any null
df.fillna(0) # Fill nulls with 0
df["score"].fillna(df["score"].mean()) # Fill with column mean
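Note that dropna and fillna return new objects by default; the original frame is unchanged unless you reassign. A runnable sketch with the frame above:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Dave"],
    "score": [85.0, None, 92.0, 78.0],
})

# dropna returns a new frame; df itself still has the nulls
complete = df.dropna()      # only Alice's and Dave's rows have no nulls

# Fill missing scores with the column mean: (85 + 92 + 78) / 3 = 85.0
mean_score = df["score"].mean()
df["score"] = df["score"].fillna(mean_score)

print(len(complete))         # 2
print(df["score"].tolist())  # [85.0, 85.0, 92.0, 78.0]
```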
# String operations on columns
df["name"] = df["name"].str.strip()
df["name"] = df["name"].str.lower()
df["email"].str.contains("@") # Pattern check (assumes an "email" column exists)
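A useful property of the .str accessor is that missing values propagate as NaN instead of raising. A sketch with a hypothetical email column:

```python
import pandas as pd

df = pd.DataFrame({"email": ["  Alice@Example.com ", None, "bob@test.org"]})

# Chained .str operations; the None entry stays NaN throughout
cleaned = df["email"].str.strip().str.lower()
has_at = cleaned.str.contains("@")

print(cleaned.tolist())  # ['alice@example.com', nan, 'bob@test.org']
```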
# Type conversion
df["age"] = df["age"].astype(int)
df["date"] = pd.to_datetime(df["date_str"])
df["date"].dt.year # Extract year

Reading and Writing Files
# CSV
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", index_col=0, parse_dates=["date"])
df.to_csv("output.csv", index=False)
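A quick round-trip sketch, using a temporary file so it runs anywhere (the file name is arbitrary):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# Write and read back through a temporary CSV file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.csv")
    df.to_csv(path, index=False)  # index=False: don't write the row numbers
    loaded = pd.read_csv(path)

print(loaded.equals(df))  # True: the round trip preserves the data
```

Without index=False, the row index is written as an extra unnamed column, which is why read_csv is often paired with index_col=0 as shown above.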
# Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df.to_excel("output.xlsx", index=False)
# JSON
df = pd.read_json("data.json")
df.to_json("output.json", orient="records")
# From dict or list
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
df = pd.DataFrame(records)

Key Takeaways
- DataFrame is a table: rows and columns, like a spreadsheet or SQL table
- Boolean indexing is powerful: df[df["col"] > value]
- groupby + agg: SQL GROUP BY in Python
- Always check df.info() and df.describe(): understand your data first
- fillna vs dropna: handle missing data explicitly
Practice Exercises
- Load a CSV of your choosing, explore it with info() and describe(), and find and handle any missing values.
- Given a sales DataFrame with columns (date, product, quantity, price), calculate monthly revenue per product.
- Filter a DataFrame to rows where a numeric column is more than 2 standard deviations from the mean (outlier detection).
- Merge two DataFrames on a common key (like SQL JOIN) and compute a derived column from both.
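As a starting point for the last exercise, here is the general shape of a merge; the frames and column names are hypothetical:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "salary": [75000, 60000, 90000],
})
bonuses = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "bonus_rate": [0.10, 0.05, 0.15],
})

# Inner join on the shared key, like SQL: employees JOIN bonuses ON emp_id
merged = employees.merge(bonuses, on="emp_id", how="inner")

# Derived column using data from both sides
merged["total_comp"] = merged["salary"] * (1 + merged["bonus_rate"])

print(round(merged.loc[0, "total_comp"], 2))  # 82500.0
```

how= accepts "inner", "left", "right", and "outer", mirroring the corresponding SQL join types.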