Why CSV files need cleaning
CSV files are common in business workflows. They are exported from CRMs, spreadsheets, e-commerce platforms, marketing tools, internal systems, and reporting software.
The problem is that exported CSV files are often messy. Column names may contain extra spaces, rows may be duplicated, required fields may be missing, and text values may be inconsistent. If this data is used without cleaning, reports, dashboards, and imports can break.
Common CSV problems
- Extra spaces in column names
- Extra spaces inside text values
- Fully empty rows
- Duplicate rows
- Missing values
- Required columns missing from the file
- Inconsistent column formatting
A good CSV cleaner should not only clean the file. It should also tell you what changed and what problems still exist.
What we will build
Read CSV
Load the messy CSV file into a Pandas DataFrame.
Clean Data
Normalize column names, trim text values, and remove empty rows.
Validate
Check required columns and count missing values.
Export
Save the cleaned CSV file and return a validation summary.
Example messy CSV
This example includes extra spaces, duplicate records, missing values, and an empty row.
Customer ID , Name , Email , Country , Signup Date
1, John Smith , john@example.com , USA , 2026-01-05
2, Sara Miller , sara@example.com , Canada , 2026-01-07
4, Maria Garcia , maria@example.com , Spain , 2026-01-14
4, Maria Garcia , maria@example.com , Spain , 2026-01-14
5, David Chen , , China , 2026-01-20
6, Emma Wilson , emma@example.com , , 2026-01-22
,,,,
Python CSV cleaner example
The following code shows the core cleaning and validation workflow.
from pathlib import Path
from typing import Iterable
import pandas as pd
def normalize_column_name(column_name: str) -> str:
return (
str(column_name)
.strip()
.lower()
.replace(" ", "_")
.replace("-", "_")
)
def clean_and_validate_csv(
input_file: str | Path,
output_file: str | Path,
required_columns: Iterable[str] | None = None,
) -> dict:
input_path = Path(input_file)
output_path = Path(output_file)
if not input_path.exists():
raise FileNotFoundError(f"Input file not found: {input_path}")
output_path.parent.mkdir(parents=True, exist_ok=True)
raw_df = pd.read_csv(input_path)
original_rows = len(raw_df)
df = raw_df.copy()
df.columns = [normalize_column_name(column) for column in df.columns]
for column in df.select_dtypes(include=["object"]).columns:
df[column] = df[column].astype(str).str.strip()
df[column] = df[column].replace({"nan": pd.NA, "": pd.NA})
df = df.dropna(how="all")
duplicate_count = int(df.duplicated().sum())
df = df.drop_duplicates()
if required_columns:
normalized_required = {
normalize_column_name(column)
for column in required_columns
}
missing_columns = sorted(normalized_required - set(df.columns))
if missing_columns:
raise ValueError(
"Missing required columns: " + ", ".join(missing_columns)
)
missing_values = {
column: int(df[column].isna().sum())
for column in df.columns
if int(df[column].isna().sum()) > 0
}
df.to_csv(output_path, index=False)
return {
"input_file": str(input_path),
"output_file": str(output_path),
"original_rows": original_rows,
"duplicate_rows_removed": duplicate_count,
"final_rows": len(df),
"columns": list(df.columns),
"missing_values": missing_values,
}
What the validation summary tells you
After cleaning the CSV file, the script returns a summary dictionary. This helps you understand what happened during the cleaning process.
- original_rows: how many rows existed before cleaning
- duplicate_rows_removed: how many duplicate rows were removed
- final_rows: how many rows remain in the cleaned file
- columns: the normalized column names
- missing_values: columns that still contain missing data
Where this is useful
- Customer data cleanup
- Sales lead validation
- CRM import preparation
- E-commerce order export cleanup
- Marketing contact list validation
- Reporting and dashboard preparation
Portfolio project
I created a GitHub repository for this workflow. The project includes a reusable cleaner, sample messy data, a demo script, documentation, and basic tests.
Conclusion
Python is a practical tool for cleaning and validating CSV files. A small script can remove repeated manual cleanup work, reduce errors, and prepare business data for reporting, dashboards, imports, and automation workflows.