Python Data Cleaning Tutorial

How to Clean and Validate CSV Files
with Python

Learn how to build a practical Python workflow that reads messy CSV files, normalizes column names, removes empty and duplicate rows, validates required fields, counts missing values, and exports clean data.

Why CSV files need cleaning

CSV files are common in business workflows. They are exported from CRMs, spreadsheets, e-commerce platforms, marketing tools, internal systems, and reporting software.

The problem is that exported CSV files are often messy. Column names may contain extra spaces, rows may be duplicated, required fields may be missing, and text values may be inconsistent. Used without cleaning, this data can break reports, dashboards, and imports.

Common CSV problems

  • Extra spaces in column names
  • Extra spaces inside text values
  • Fully empty rows
  • Duplicate rows
  • Missing values
  • Required columns missing from the file
  • Inconsistent column formatting

A good CSV cleaner should not only clean the file. It should also tell you what changed and what problems still exist.
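Most of the problems above can be detected with a few pandas calls before any cleaning happens. Here is a minimal sketch on a made-up DataFrame (the column names and values are illustrative, not from the sample file):

```python
import pandas as pd

# Hypothetical DataFrame exhibiting several of the problems listed above.
df = pd.DataFrame(
    {
        " Name ": ["  Ada ", "  Ada ", None],
        "Email": ["ada@example.com", "ada@example.com", None],
    }
)

# Column names with stray spaces.
padded_columns = [c for c in df.columns if c != c.strip()]

# Exact duplicate rows.
duplicate_rows = int(df.duplicated().sum())

# Fully empty rows.
empty_rows = int(df.isna().all(axis=1).sum())

print(padded_columns, duplicate_rows, empty_rows)  # [' Name '] 1 1
```

Detecting problems first is useful when you want to report on a file without modifying it.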

What we will build

Step 1. Read CSV: load the messy CSV file into a Pandas DataFrame.

Step 2. Clean Data: normalize column names, trim text values, and remove empty rows.

Step 3. Validate: check required columns and count missing values.

Step 4. Export: save the cleaned CSV file and return a validation summary.

Example messy CSV

This example includes extra spaces, duplicate records, missing values, and an empty row.

messy_customers.csv
 Customer ID , Name , Email , Country , Signup Date
1, John Smith , john@example.com , USA , 2026-01-05
2, Sara Miller , sara@example.com , Canada , 2026-01-07
4, Maria Garcia , maria@example.com , Spain , 2026-01-14
4, Maria Garcia , maria@example.com , Spain , 2026-01-14
5, David Chen ,  , China , 2026-01-20
6, Emma Wilson , emma@example.com ,  , 2026-01-22
,,,,
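Loading a file like this as-is shows the problems directly. A quick check, using io.StringIO to stand in for the file and a shortened copy of the data:

```python
import io

import pandas as pd

# A shortened version of the messy file above, inlined for illustration.
messy_csv = (
    " Customer ID , Name , Email , Country , Signup Date\n"
    "1, John Smith , john@example.com , USA , 2026-01-05\n"
    ",,,,\n"
)

df = pd.read_csv(io.StringIO(messy_csv))

# Header whitespace survives the read: the column names are padded.
print(list(df.columns))
# [' Customer ID ', ' Name ', ' Email ', ' Country ', ' Signup Date']

# The trailing empty row arrives as a row of NaN values.
print(df.iloc[-1].isna().all())  # True
```

Because pandas preserves the padded headers, code like `df["Customer ID"]` fails with a KeyError, which is exactly why the cleaner normalizes column names first.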

Python CSV cleaner example

The following code shows the core cleaning and validation workflow.

cleaner.py
from pathlib import Path
from typing import Iterable

import pandas as pd


def normalize_column_name(column_name: str) -> str:
    return (
        str(column_name)
        .strip()
        .lower()
        .replace(" ", "_")
        .replace("-", "_")
    )


def clean_and_validate_csv(
    input_file: str | Path,
    output_file: str | Path,
    required_columns: Iterable[str] | None = None,
) -> dict:
    input_path = Path(input_file)
    output_path = Path(output_file)

    if not input_path.exists():
        raise FileNotFoundError(f"Input file not found: {input_path}")

    output_path.parent.mkdir(parents=True, exist_ok=True)

    raw_df = pd.read_csv(input_path)
    original_rows = len(raw_df)

    df = raw_df.copy()

    # Normalize headers: strip whitespace, lowercase, snake_case.
    df.columns = [normalize_column_name(column) for column in df.columns]

    # Trim text values, then treat empty strings (and the "nan" strings
    # that astype(str) produces for missing values) as missing.
    for column in df.select_dtypes(include=["object"]).columns:
        df[column] = df[column].astype(str).str.strip()
        df[column] = df[column].replace({"nan": pd.NA, "": pd.NA})

    # Drop rows where every field is missing.
    df = df.dropna(how="all")

    # Count and remove exact duplicate rows.
    duplicate_count = int(df.duplicated().sum())
    df = df.drop_duplicates()

    if required_columns:
        # Normalize the required names too, so callers can pass the
        # original headers (e.g. "Customer ID") or the cleaned form.
        normalized_required = {
            normalize_column_name(column)
            for column in required_columns
        }

        missing_columns = sorted(normalized_required - set(df.columns))

        if missing_columns:
            raise ValueError(
                "Missing required columns: " + ", ".join(missing_columns)
            )

    # Report remaining missing values, keeping only columns with gaps.
    missing_counts = df.isna().sum()
    missing_values = {
        column: int(count)
        for column, count in missing_counts.items()
        if count > 0
    }

    df.to_csv(output_path, index=False)

    return {
        "input_file": str(input_path),
        "output_file": str(output_path),
        "original_rows": original_rows,
        "duplicate_rows_removed": duplicate_count,
        "final_rows": len(df),
        "columns": list(df.columns),
        "missing_values": missing_values,
    }
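To see the core transformations in isolation, here is the same pipeline run step by step on in-memory data (the inline CSV is a shortened, made-up sample):

```python
import io

import pandas as pd


def normalize_column_name(column_name: str) -> str:
    return str(column_name).strip().lower().replace(" ", "_").replace("-", "_")


# Shortened messy input: padded headers, a duplicate row, a missing
# email, and a fully empty row.
messy_csv = (
    " Customer ID , Name , Email \n"
    "1, John Smith , john@example.com \n"
    "1, John Smith , john@example.com \n"
    "2, Sara Miller ,  \n"
    ",,\n"
)

df = pd.read_csv(io.StringIO(messy_csv))

# Same steps as clean_and_validate_csv, applied one at a time.
df.columns = [normalize_column_name(c) for c in df.columns]
for column in df.select_dtypes(include=["object"]).columns:
    df[column] = df[column].astype(str).str.strip()
    df[column] = df[column].replace({"nan": pd.NA, "": pd.NA})
df = df.dropna(how="all").drop_duplicates()

print(list(df.columns))               # ['customer_id', 'name', 'email']
print(len(df))                        # 2
print(int(df["email"].isna().sum()))  # 1
```

The empty row and the duplicate are gone, the headers are normalized, and one missing email remains for the validation summary to report.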

What the validation summary tells you

After cleaning the CSV file, the script returns a summary dictionary. This helps you understand what happened during the cleaning process.

  • original_rows: how many rows existed before cleaning
  • duplicate_rows_removed: how many duplicate rows were removed
  • final_rows: how many rows remain in the cleaned file
  • columns: the normalized column names
  • missing_values: columns that still contain missing data
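Because the summary is a plain dictionary, it is easy to act on in a script. A hypothetical guard that reports remaining gaps before handing the file to an import (the summary values here are made up):

```python
# Hypothetical summary, shaped like the dict clean_and_validate_csv returns.
summary = {
    "original_rows": 7,
    "duplicate_rows_removed": 1,
    "final_rows": 5,
    "missing_values": {"country": 1, "email": 1},
}

rows_dropped = summary["original_rows"] - summary["final_rows"]
print(f"Dropped {rows_dropped} rows ({summary['duplicate_rows_removed']} duplicates).")

# Flag incomplete columns before a downstream import.
if summary["missing_values"]:
    for column, count in sorted(summary["missing_values"].items()):
        print(f"warning: {column} has {count} missing value(s)")
```

In an automated pipeline, the same check could raise an exception instead of printing, so incomplete files never reach the import step.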

Where this is useful

  • Customer data cleanup
  • Sales lead validation
  • CRM import preparation
  • E-commerce order export cleanup
  • Marketing contact list validation
  • Reporting and dashboard preparation

Portfolio project

I created a GitHub repository for this workflow. The project includes a reusable cleaner, sample messy data, a demo script, documentation, and basic tests.

Conclusion

Python is a practical tool for cleaning and validating CSV files. A small script can remove repeated manual cleanup work, reduce errors, and prepare business data for reporting, dashboards, imports, and automation workflows.