Python Data Cleaning Tutorial

How to Clean and Validate CSV Files
with Python

Learn how to build a practical Python workflow that reads messy CSV files, normalizes column names, removes empty and duplicate rows, validates required fields, counts missing values, and exports clean data.

Why CSV files need cleaning

CSV files are common in business workflows. They are exported from CRMs, spreadsheets, e-commerce platforms, marketing tools, internal systems, and reporting software.

The problem is that exported CSV files are often messy. Column names may contain extra spaces, rows may be duplicated, required fields may be missing, and text values may be inconsistent. Used without cleaning, this data can break reports, dashboards, and imports.

Common CSV problems

  • Extra spaces in column names
  • Extra spaces inside text values
  • Fully empty rows
  • Duplicate rows
  • Missing values
  • Required columns missing from the file
  • Inconsistent column formatting

A good CSV cleaner should not only clean the file. It should also tell you what changed and what problems still exist.
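Most of the problems above can be detected with a few pandas calls before any cleaning happens. Here is a minimal sketch on a made-up DataFrame (the column names and values are illustrative, not from the sample file):

```python
import pandas as pd

# Hypothetical DataFrame exhibiting several of the problems listed above.
df = pd.DataFrame(
    {
        " Name ": ["  Ada ", "  Ada ", None],
        "Email": ["ada@example.com", "ada@example.com", None],
    }
)

# Column names with stray spaces.
padded_columns = [c for c in df.columns if c != c.strip()]

# Exact duplicate rows.
duplicate_rows = int(df.duplicated().sum())

# Fully empty rows.
empty_rows = int(df.isna().all(axis=1).sum())

print(padded_columns, duplicate_rows, empty_rows)  # [' Name '] 1 1
```

Detecting problems first is useful when you want to report on a file without modifying it.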

What we will build

Step 1. Read CSV: load the messy CSV file into a Pandas DataFrame.

Step 2. Clean Data: normalize column names, trim text values, and remove empty rows.

Step 3. Validate: check required columns and count missing values.

Step 4. Export: save the cleaned CSV file and return a validation summary.

Example messy CSV

This example includes extra spaces, duplicate records, missing values, and an empty row.

messy_customers.csv
 Customer ID , Name , Email , Country , Signup Date
1, John Smith , john@example.com , USA , 2026-01-05
2, Sara Miller , sara@example.com , Canada , 2026-01-07
4, Maria Garcia , maria@example.com , Spain , 2026-01-14
4, Maria Garcia , maria@example.com , Spain , 2026-01-14
5, David Chen ,  , China , 2026-01-20
6, Emma Wilson , emma@example.com ,  , 2026-01-22
,,,,
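Loading a file like this as-is shows the problems directly. A quick check, using io.StringIO to stand in for the file and a shortened copy of the data:

```python
import io

import pandas as pd

# A shortened version of the messy file above, inlined for illustration.
messy_csv = (
    " Customer ID , Name , Email , Country , Signup Date\n"
    "1, John Smith , john@example.com , USA , 2026-01-05\n"
    ",,,,\n"
)

df = pd.read_csv(io.StringIO(messy_csv))

# Header whitespace survives the read: the column names are padded.
print(list(df.columns))
# [' Customer ID ', ' Name ', ' Email ', ' Country ', ' Signup Date']

# The trailing empty row arrives as a row of NaN values.
print(df.iloc[-1].isna().all())  # True
```

Because pandas preserves the padded headers, code like `df["Customer ID"]` fails with a KeyError, which is exactly why the cleaner normalizes column names first.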

Python CSV cleaner example

The following code shows the core cleaning and validation workflow.

cleaner.py
from pathlib import Path
from typing import Iterable

import pandas as pd


def normalize_column_name(column_name: str) -> str:
    return (
        str(column_name)
        .strip()
        .lower()
        .replace(" ", "_")
        .replace("-", "_")
    )


def clean_and_validate_csv(
    input_file: str | Path,
    output_file: str | Path,
    required_columns: Iterable[str] | None = None,
) -> dict:
    input_path = Path(input_file)
    output_path = Path(output_file)

    if not input_path.exists():
        raise FileNotFoundError(f"Input file not found: {input_path}")

    output_path.parent.mkdir(parents=True, exist_ok=True)

    raw_df = pd.read_csv(input_path)
    original_rows = len(raw_df)

    df = raw_df.copy()

    # Normalize headers: strip whitespace, lowercase, snake_case.
    df.columns = [normalize_column_name(column) for column in df.columns]

    # Trim text values, then treat empty strings (and the "nan" strings
    # that astype(str) produces for missing values) as missing.
    for column in df.select_dtypes(include=["object"]).columns:
        df[column] = df[column].astype(str).str.strip()
        df[column] = df[column].replace({"nan": pd.NA, "": pd.NA})

    # Drop rows where every field is missing.
    df = df.dropna(how="all")

    # Count and remove exact duplicate rows.
    duplicate_count = int(df.duplicated().sum())
    df = df.drop_duplicates()

    if required_columns:
        # Normalize the required names too, so callers can pass the
        # original headers (e.g. "Customer ID") or the cleaned form.
        normalized_required = {
            normalize_column_name(column)
            for column in required_columns
        }

        missing_columns = sorted(normalized_required - set(df.columns))

        if missing_columns:
            raise ValueError(
                "Missing required columns: " + ", ".join(missing_columns)
            )

    # Report remaining missing values, keeping only columns with gaps.
    missing_counts = df.isna().sum()
    missing_values = {
        column: int(count)
        for column, count in missing_counts.items()
        if count > 0
    }

    df.to_csv(output_path, index=False)

    return {
        "input_file": str(input_path),
        "output_file": str(output_path),
        "original_rows": original_rows,
        "duplicate_rows_removed": duplicate_count,
        "final_rows": len(df),
        "columns": list(df.columns),
        "missing_values": missing_values,
    }
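To see the core transformations in isolation, here is the same pipeline run step by step on in-memory data (the inline CSV is a shortened, made-up sample):

```python
import io

import pandas as pd


def normalize_column_name(column_name: str) -> str:
    return str(column_name).strip().lower().replace(" ", "_").replace("-", "_")


# Shortened messy input: padded headers, a duplicate row, a missing
# email, and a fully empty row.
messy_csv = (
    " Customer ID , Name , Email \n"
    "1, John Smith , john@example.com \n"
    "1, John Smith , john@example.com \n"
    "2, Sara Miller ,  \n"
    ",,\n"
)

df = pd.read_csv(io.StringIO(messy_csv))

# Same steps as clean_and_validate_csv, applied one at a time.
df.columns = [normalize_column_name(c) for c in df.columns]
for column in df.select_dtypes(include=["object"]).columns:
    df[column] = df[column].astype(str).str.strip()
    df[column] = df[column].replace({"nan": pd.NA, "": pd.NA})
df = df.dropna(how="all").drop_duplicates()

print(list(df.columns))               # ['customer_id', 'name', 'email']
print(len(df))                        # 2
print(int(df["email"].isna().sum()))  # 1
```

The empty row and the duplicate are gone, the headers are normalized, and one missing email remains for the validation summary to report.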

What the validation summary tells you

After cleaning the CSV file, the script returns a summary dictionary. This helps you understand what happened during the cleaning process.

  • original_rows: how many rows existed before cleaning
  • duplicate_rows_removed: how many duplicate rows were removed
  • final_rows: how many rows remain in the cleaned file
  • columns: the normalized column names
  • missing_values: columns that still contain missing data
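Because the summary is a plain dictionary, it is easy to act on in a script. A hypothetical guard that reports remaining gaps before handing the file to an import (the summary values here are made up):

```python
# Hypothetical summary, shaped like the dict clean_and_validate_csv returns.
summary = {
    "original_rows": 7,
    "duplicate_rows_removed": 1,
    "final_rows": 5,
    "missing_values": {"country": 1, "email": 1},
}

rows_dropped = summary["original_rows"] - summary["final_rows"]
print(f"Dropped {rows_dropped} rows ({summary['duplicate_rows_removed']} duplicates).")

# Flag incomplete columns before a downstream import.
if summary["missing_values"]:
    for column, count in sorted(summary["missing_values"].items()):
        print(f"warning: {column} has {count} missing value(s)")
```

In an automated pipeline, the same check could raise an exception instead of printing, so incomplete files never reach the import step.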

Where this is useful

  • Customer data cleanup
  • Sales lead validation
  • CRM import preparation
  • E-commerce order export cleanup
  • Marketing contact list validation
  • Reporting and dashboard preparation

Portfolio project

I created a GitHub repository for this workflow. The project includes a reusable cleaner, sample messy data, a demo script, documentation, and basic tests.

Conclusion

Python is a practical tool for cleaning and validating CSV files. A small script can remove repeated manual cleanup work, reduce errors, and prepare business data for reporting, dashboards, imports, and automation workflows.