Module 9 · Lesson 26

pandas Basics: Data Analysis

๐Ÿ Pythonโฑ 16 min read๐Ÿ“– Data Science

pandas: Data Analysis in Python

pandas is the foundational library for data analysis in Python. It provides DataFrame (table) and Series (column) data structures with powerful operations for cleaning, transforming, and analyzing data.

# pip install pandas
import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age": [30, 25, 35, 28],
    "salary": [75000, 60000, 90000, 70000],
    "department": ["Eng", "Sales", "Eng", "HR"]
})
print(df)
#     name  age  salary department
# 0  Alice   30   75000        Eng
# 1    Bob   25   60000      Sales
# 2  Carol   35   90000        Eng
# 3   Dave   28   70000         HR

# Basic info
df.shape       # (4, 4) - rows, columns
df.dtypes      # Column data types
df.info()      # Summary including null counts
df.describe()  # Statistics: mean, std, min, max, quartiles
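Beyond info() and describe(), head(), tail(), and the columns attribute are handy first looks at a new DataFrame. A short runnable sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age": [30, 25, 35, 28],
    "salary": [75000, 60000, 90000, 70000],
    "department": ["Eng", "Sales", "Eng", "HR"]
})

print(df.head(2))           # first 2 rows
print(df.tail(1))           # last row
print(list(df.columns))     # ['name', 'age', 'salary', 'department']
print(df["salary"].mean())  # 73750.0
```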

Selecting Data

# Select columns
df["name"]              # Series
df[["name", "salary"]]  # DataFrame with multiple columns

# Select rows
df.iloc[0]         # First row by position
df.iloc[1:3]       # Rows 1 and 2 by position
df.loc[0]          # Row by label/index
df.loc[0, "name"]  # Specific cell

# Boolean filtering
df[df["age"] > 28]                             # Rows where age > 28
df[df["department"] == "Eng"]                  # Engineering department
df[(df["age"] > 25) & (df["salary"] > 70000)]  # Multiple conditions
df[df["name"].isin(["Alice", "Carol"])]        # Name in list
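Boolean masks also combine with .loc to filter rows and pick columns in one step, and query() offers a readable string syntax for multi-condition filters. A sketch on the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age": [30, 25, 35, 28],
    "salary": [75000, 60000, 90000, 70000],
    "department": ["Eng", "Sales", "Eng", "HR"]
})

# Filter rows and select columns in one expression
senior_pay = df.loc[df["age"] > 28, ["name", "salary"]]
print(senior_pay)  # Alice and Carol with their salaries

# query() expresses the same kind of filter as a string
print(df.query("age > 25 and salary > 70000"))  # Alice and Carol
```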

Transformations

# Add columns
df["annual_bonus"] = df["salary"] * 0.1
df["senior"] = df["age"] > 30
df["name_upper"] = df["name"].str.upper()

# Apply a function
df["grade"] = df["salary"].apply(lambda s: "A" if s > 80000 else "B")

# Aggregations
df["salary"].mean()  # 73750.0
df["salary"].sum()   # 295000
df["salary"].max()   # 90000
df["age"].std()      # Standard deviation

# Group by
dept_stats = df.groupby("department")["salary"].agg(["mean", "max", "count"])
print(dept_stats)
#                mean    max  count
# department
# Eng         82500.0  90000      2
# HR          70000.0  70000      1
# Sales       60000.0  60000      1

# Sort
df.sort_values("salary", ascending=False)
df.sort_values(["department", "salary"])
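agg() also accepts named aggregations, which let you name the result columns directly instead of getting "mean"/"max" headers. A sketch on the same data (the names avg_salary and headcount are my own choices):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age": [30, 25, 35, 28],
    "salary": [75000, 60000, 90000, 70000],
    "department": ["Eng", "Sales", "Eng", "HR"]
})

# Named aggregation: output_column=(input_column, function)
stats = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    headcount=("name", "count"),
)
print(stats)
#             avg_salary  headcount
# department
# Eng            82500.0          2
# HR             70000.0          1
# Sales          60000.0          1
```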

Data Cleaning

# Handling missing values
df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Dave"],
    "score": [85.0, None, 92.0, 78.0]
})
df.isnull()        # Boolean mask of nulls
df.isnull().sum()  # Count nulls per column
df.dropna()        # Drop rows with any null
df.fillna(0)       # Fill nulls with 0
df["score"].fillna(df["score"].mean())  # Fill with column mean

# String operations on columns
df["name"] = df["name"].str.strip()
df["name"] = df["name"].str.lower()
df["email"].str.contains("@")  # e.g. validate an email column (if present)

# Type conversion (assumes the DataFrame has these columns)
df["age"] = df["age"].astype(int)
df["date"] = pd.to_datetime(df["date_str"])
df["date"].dt.year  # Extract year
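Putting the null-handling pieces together, a minimal runnable sketch on the same toy data: fill the numeric column with its mean, then drop rows that are still missing a name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Dave"],
    "score": [85.0, None, 92.0, 78.0]
})
print(df.isnull().sum())  # one null in each column

# Fill missing scores with the column mean (85.0 here)
df["score"] = df["score"].fillna(df["score"].mean())

# Drop rows that still lack a name
df = df.dropna(subset=["name"])
print(df)  # 3 rows, no nulls left
```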

Reading and Writing Files

# CSV
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", index_col=0, parse_dates=["date"])
df.to_csv("output.csv", index=False)

# Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df.to_excel("output.xlsx", index=False)

# JSON
df = pd.read_json("data.json")
df.to_json("output.json", orient="records")

# From dict or list
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
df = pd.DataFrame(records)
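read_csv and to_csv accept any file-like object, so you can test a round trip without touching disk. A sketch using an in-memory io.StringIO buffer:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# Write CSV to an in-memory buffer instead of a file
buf = io.StringIO()
df.to_csv(buf, index=False)

# Rewind and read it back
buf.seek(0)
df2 = pd.read_csv(buf)
print(df2.equals(df))  # True
```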

Key Takeaways

  1. DataFrames are tables, Series are single columns; shape, info(), and describe() give a quick overview.
  2. Select with [], .loc (labels), and .iloc (positions); filter with boolean masks combined by & and |.
  3. groupby() plus agg() summarizes data per group; sort_values() orders results.
  4. Handle missing data with isnull(), dropna(), and fillna() before analyzing.
  5. read_csv/to_csv (and the Excel and JSON equivalents) move data in and out of DataFrames.

Practice Exercises

  1. Load a CSV of your choosing, explore it with info() and describe(), find and handle any missing values.
  2. Given a sales DataFrame with columns (date, product, quantity, price), calculate monthly revenue per product.
  3. Filter a DataFrame to rows where a numeric column is more than 2 standard deviations from the mean (outlier detection).
  4. Merge two DataFrames on a common key (like SQL JOIN) and compute a derived column from both.
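As a hint for exercise 4: merge() works like a SQL JOIN on a shared key. A sketch with made-up employees/departments tables (the column names are illustrative):

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "dept_id": [1, 2],
    "salary": [75000, 60000],
})
departments = pd.DataFrame({
    "dept_id": [1, 2],
    "dept_name": ["Eng", "Sales"],
    "budget": [500000, 300000],
})

# Inner join on the common key (like SQL JOIN ... ON dept_id)
merged = employees.merge(departments, on="dept_id", how="inner")

# Derived column computed from both sides
merged["salary_share"] = merged["salary"] / merged["budget"]
print(merged)
```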
โ† The requests Library