pandas: Data Analysis in Python
pandas is the foundational library for data analysis in Python. It provides DataFrame (table) and Series (column) data structures with powerful operations for cleaning, transforming, and analyzing data.
# pip install pandas
import pandas as pd
import numpy as np
# Create a DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age": [30, 25, 35, 28],
    "salary": [75000, 60000, 90000, 70000],
    "department": ["Eng", "Sales", "Eng", "HR"]
})
print(df)
# name age salary department
# 0 Alice 30 75000 Eng
# 1 Bob 25 60000 Sales
# 2 Carol 35 90000 Eng
# 3 Dave 28 70000 HR
# Basic info
df.shape # (4, 4) → rows, columns
df.dtypes # Column data types
df.info() # Summary including null counts
df.describe() # Statistics: mean, std, min, max, quartiles

Selecting Data
# Select column
df["name"] # Series
df[["name", "salary"]] # DataFrame with multiple columns
# Select rows
df.iloc[0] # First row by position
df.iloc[1:3] # Rows 1 and 2 by position
df.loc[0] # Row by label/index
df.loc[0, "name"] # Specific cell
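The loc/iloc distinction only becomes visible when the index labels differ from row positions. A minimal sketch, using a hypothetical frame with a non-default index:

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30) differ from positions (0, 1, 2)
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "age": [30, 25, 35]},
    index=[10, 20, 30],
)

by_position = df.iloc[0]   # first row by position, regardless of labels
by_label = df.loc[10]      # row whose index label is 10
cell = df.loc[20, "name"]  # single cell by label

print(by_position["name"])  # Alice
print(by_label["name"])     # Alice
print(cell)                 # Bob
```

With the default RangeIndex (as in the example above), labels and positions coincide, which is why df.loc[0] and df.iloc[0] return the same row there.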
# Boolean filtering
df[df["age"] > 28] # Rows where age > 28
df[df["department"] == "Eng"] # Engineering department
df[(df["age"] > 25) & (df["salary"] > 70000)] # Multiple conditions
df[df["name"].isin(["Alice", "Carol"])] # Name in list

Transformations
# Add columns
df["annual_bonus"] = df["salary"] * 0.1
df["senior"] = df["age"] > 30
df["name_upper"] = df["name"].str.upper()
# Apply a function
df["grade"] = df["salary"].apply(lambda s: "A" if s > 80000 else "B")
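apply runs a Python function row by row, which is flexible but slow on large frames; for a simple condition like this one, np.where is a vectorized alternative. A sketch using the salary column from the example frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"salary": [75000, 60000, 90000, 70000]})

# Row-wise apply: a Python-level loop over the column
df["grade_apply"] = df["salary"].apply(lambda s: "A" if s > 80000 else "B")

# Vectorized equivalent: the comparison runs on the whole column at once
df["grade_vec"] = np.where(df["salary"] > 80000, "A", "B")

print(df["grade_vec"].tolist())  # ['B', 'B', 'A', 'B']
```

Both produce the same result; prefer the vectorized form when the logic fits, and fall back to apply for genuinely arbitrary per-row logic.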
# Aggregations
df["salary"].mean() # 73750.0
df["salary"].sum() # 295000
df["salary"].max() # 90000
df["age"].std() # Standard deviation
# Group by
dept_stats = df.groupby("department")["salary"].agg(["mean", "max", "count"])
print(dept_stats)
# mean max count
# department
# Eng 82500 90000 2
# HR 70000 70000 1
# Sales 60000 60000 1
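Passing a list to agg names the output columns after the functions ("mean", "max", "count"). Named aggregation lets you choose the output column names explicitly; a sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Sales", "Eng", "HR"],
    "salary": [75000, 60000, 90000, 70000],
})

# Named aggregation: output_name=(input_column, function)
stats = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    top_salary=("salary", "max"),
    headcount=("salary", "count"),
)
print(stats.loc["Eng", "avg_salary"])  # 82500.0
```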
# Sort
df.sort_values("salary", ascending=False)
df.sort_values(["department", "salary"])

Data Cleaning
# Handling missing values
df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Dave"],
    "score": [85.0, None, 92.0, 78.0]
})
df.isnull() # Boolean mask of nulls
df.isnull().sum() # Count nulls per column
df.dropna() # Drop rows with any null
df.fillna(0) # Fill nulls with 0
df["score"].fillna(df["score"].mean()) # Fill with column mean
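Note that dropna and fillna return new objects by default; the original frame is unchanged unless you reassign. A runnable sketch with the frame above:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Dave"],
    "score": [85.0, None, 92.0, 78.0],
})

# dropna returns a new frame; df itself still has the nulls
complete = df.dropna()      # only Alice's and Dave's rows have no nulls

# Fill missing scores with the column mean: (85 + 92 + 78) / 3 = 85.0
mean_score = df["score"].mean()
df["score"] = df["score"].fillna(mean_score)

print(len(complete))         # 2
print(df["score"].tolist())  # [85.0, 85.0, 92.0, 78.0]
```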
# String operations on columns
df["name"] = df["name"].str.strip()
df["name"] = df["name"].str.lower()
df["email"].str.contains("@") # Pattern check (assumes an "email" column exists)
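A useful property of the .str accessor is that missing values propagate as NaN instead of raising. A sketch with a hypothetical email column:

```python
import pandas as pd

df = pd.DataFrame({"email": ["  Alice@Example.com ", None, "bob@test.org"]})

# Chained .str operations; the None entry stays NaN throughout
cleaned = df["email"].str.strip().str.lower()
has_at = cleaned.str.contains("@")

print(cleaned.tolist())  # ['alice@example.com', nan, 'bob@test.org']
```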
# Type conversion
df["age"] = df["age"].astype(int)
df["date"] = pd.to_datetime(df["date_str"])
df["date"].dt.year # Extract year

Reading and Writing Files
# CSV
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", index_col=0, parse_dates=["date"])
df.to_csv("output.csv", index=False)
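A quick round-trip sketch, using a temporary file so it runs anywhere (the file name is arbitrary):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# Write and read back through a temporary CSV file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.csv")
    df.to_csv(path, index=False)  # index=False: don't write the row numbers
    loaded = pd.read_csv(path)

print(loaded.equals(df))  # True: the round trip preserves the data
```

Without index=False, the row index is written as an extra unnamed column, which is why read_csv is often paired with index_col=0 as shown above.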
# Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df.to_excel("output.xlsx", index=False)
# JSON
df = pd.read_json("data.json")
df.to_json("output.json", orient="records")
# From dict or list
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
df = pd.DataFrame(records)

Key Takeaways
- DataFrame is a table: rows and columns, like a spreadsheet or SQL table
- Boolean indexing is powerful: df[df["col"] > value]
- groupby + agg: SQL GROUP BY in Python
- Always check df.info() and df.describe(): understand your data first
- fillna vs dropna: handle missing data explicitly
Practice Exercises
- Load a CSV of your choosing, explore it with info() and describe(), and find and handle any missing values.
- Given a sales DataFrame with columns (date, product, quantity, price), calculate monthly revenue per product.
- Filter a DataFrame to rows where a numeric column is more than 2 standard deviations from the mean (outlier detection).
- Merge two DataFrames on a common key (like SQL JOIN) and compute a derived column from both.
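As a starting point for the last exercise, here is the general shape of a merge; the frames and column names are hypothetical:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "salary": [75000, 60000, 90000],
})
bonuses = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "bonus_rate": [0.10, 0.05, 0.15],
})

# Inner join on the shared key, like SQL: employees JOIN bonuses ON emp_id
merged = employees.merge(bonuses, on="emp_id", how="inner")

# Derived column using data from both sides
merged["total_comp"] = merged["salary"] * (1 + merged["bonus_rate"])

print(round(merged.loc[0, "total_comp"], 2))  # 82500.0
```

how= accepts "inner", "left", "right", and "outer", mirroring the corresponding SQL join types.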