If project 1.2 was “get a database running,” this one is “actually put data in it.” And not just data you type by hand — data from a CSV file that you need to clean first, because real-world data is never clean.

This is where data engineering starts to feel like data engineering.

The Goal

Read a CSV, clean it up, load it into PostgreSQL. Containerize the whole thing so it’s reproducible — same command, same result, anywhere. Project 1.2 gave me the database; this one gives me the ingestion layer on top of it.

What I Built

A containerized CSV-to-PostgreSQL loader with:

  • A Python script handling read, clean, and insert in three separate functions
  • PostgreSQL container with a schema initialized automatically on first start
  • Docker Compose wiring both services together

Tech Stack

  • Python + pandas — CSV reading and data cleaning
  • psycopg2 — PostgreSQL database adapter
  • python-dotenv — Environment-based configuration
  • Docker Compose — Multi-container orchestration
  • PostgreSQL — Target database
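
The Dockerfile below installs from a requirements.txt that isn't shown; a minimal version covering this stack would be the following (I'm assuming psycopg2-binary here to avoid compiling against libpq inside the container; plain psycopg2 also works on the full python:3.10 image):

```
pandas
psycopg2-binary
python-dotenv
```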

Implementation

docker-compose.yml

version: '3.8'
services:
  db:
    image: postgres:latest
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
  loader:
    build: .
    env_file: .env
    depends_on:
      - db
volumes:
  postgres_data:

init.sql

CREATE TABLE IF NOT EXISTS users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    age INT,
    city VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Dockerfile

FROM python:3.10

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# No COPY .env here: Compose's env_file injects the variables at runtime
CMD ["python", "loader.py"]
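
One thing worth flagging: COPY . will happily bake a local .env into the image. A .dockerignore keeps secrets and local clutter out of the build context (a sketch; excluding .env only works if nothing in the Dockerfile COPYs it explicitly):

```
.env
.git/
__pycache__/
*.pyc
```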

loader.py

import os
import time

import pandas as pd
import psycopg2
from dotenv import load_dotenv

load_dotenv()

def read_csv(file_path):
    try:
        return pd.read_csv(file_path)
    except Exception as e:
        print(f"Error reading CSV file: {e}")
        return None

def clean_data(df):
    df.columns = df.columns.str.lower()
    df['name'] = df['name'].str.strip()
    df['email'] = df['email'].str.strip()
    df['age'] = df['age'].fillna(0)            # missing age defaults to 0
    df = df.dropna(subset=['name', 'email'])   # rows without name/email can't be saved
    return df

def connect_with_retries(retries=5, delay=2):
    # depends_on only waits for the container, not for Postgres itself,
    # so retry until the server actually accepts connections
    for attempt in range(1, retries + 1):
        try:
            return psycopg2.connect(
                dbname=os.getenv("POSTGRES_DB"),
                user=os.getenv("POSTGRES_USER"),
                password=os.getenv("POSTGRES_PASSWORD"),
                host="db",
                port="5432"
            )
        except psycopg2.OperationalError as e:
            print(f"Database not ready (attempt {attempt}/{retries}): {e}")
            time.sleep(delay)
    return None

def insert_data(df):
    conn = connect_with_retries()
    if conn is None:
        print("Could not connect to the database, giving up.")
        return
    try:
        cursor = conn.cursor()
        for _, row in df.iterrows():
            cursor.execute(
                "INSERT INTO users (name, email, age, city) VALUES (%s, %s, %s, %s)",
                (row['name'], row['email'], int(row['age']), row['city'])
            )
        conn.commit()

        cursor.execute("SELECT * FROM users;")
        for row in cursor.fetchall():
            print(row)

        cursor.close()
        print("Data inserted successfully!")
    except Exception as e:
        conn.rollback()
        print(f"Error inserting data: {e}")
    finally:
        conn.close()

def main():
    df = read_csv('sample_data.csv')
    if df is not None:
        df = clean_data(df)
        insert_data(df)

if __name__ == "__main__":
    main()

Sample Data (sample_data.csv)

name,email,age,city
Alice Johnson,alice@example.com,28,Paris
Bob Smith,bob@example.com,35,Lyon
Carol White,  carol@example.com  ,31,Marseille
David Brown,david@example.com,42,Toulouse
Emma Wilson,emma@example.com,26,Bordeaux
Frank Miller,frank@example.com,39,Nice

Usage

# Start the database and run the loader
docker-compose up -d
 
# Check the loader output
docker-compose logs loader
 
# Or run the loader manually
docker-compose exec loader python loader.py

What I Learned

  • pandas for cleaning: str.strip() trims whitespace, fillna() handles missing values, dropna() removes rows that are too incomplete to save. Three lines that do a lot of the actual work in a real pipeline.
  • init.sql with Docker: Drop a .sql file into /docker-entrypoint-initdb.d/ and Postgres runs it automatically on first start (only while the data volume is empty; reruns against an existing volume skip it). Schema provisioning for free, no manual setup needed.
  • Environment variables: Hardcoding credentials in source code is a bad habit to build. .env + python-dotenv keeps them out of the code, and env_file in Compose passes them to containers cleanly. The .env file stays in .gitignore.
  • Container hostnames: This one still trips me up: the loader connects to db (the service name in Compose), not localhost. Docker Compose sets up a shared network with DNS based on service names.
  • Parameterized queries: Using %s with psycopg2 instead of string formatting is not optional. It prevents SQL injection and it’s the correct way to pass values to queries, full stop.
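
The three pandas cleaning calls from the first bullet can be watched in isolation on a toy frame (column values here are made up for illustration):

```python
import pandas as pd

# A toy frame with the three problems the cleaning step handles:
# stray whitespace, a missing age, and a row missing its email entirely.
df = pd.DataFrame({
    "Name": ["  Alice ", "Bob", "Carol"],
    "Email": [" alice@example.com ", "bob@example.com", None],
    "Age": [28, None, 31],
})

df.columns = df.columns.str.lower()        # normalize header case
df["name"] = df["name"].str.strip()        # trim whitespace
df["email"] = df["email"].str.strip()
df["age"] = df["age"].fillna(0)            # missing age defaults to 0
df = df.dropna(subset=["name", "email"])   # drop rows missing required fields

print(df.to_dict("records"))
# Carol is dropped (no email); Bob's missing age becomes 0.0
```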

Challenges

Challenge: The loader container kept failing immediately because it couldn't connect to Postgres. It turns out depends_on waits for the container to start, not for the service to actually be ready, and Postgres takes a second or two to initialize. Solution: connection retries in the script. The clean solution is a health check in Compose, which I'll implement in a later project.
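That deferred health check would look roughly like this in Compose (a sketch: pg_isready ships inside the postgres image, and the doubled $$ stops Compose from interpolating the variables on the host):

```yaml
services:
  db:
    image: postgres:latest
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DB}"]
      interval: 2s
      timeout: 5s
      retries: 10
  loader:
    build: .
    depends_on:
      db:
        condition: service_healthy
```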

Challenge: Carol White’s email had leading and trailing whitespace in the CSV — carol@example.com . Solution: Added str.strip() to the email column in clean_data(). This is one of those things that looks trivial in a sample file but causes real failures in production when a UNIQUE constraint kicks back an error because two emails that look identical have different whitespace. Clean your data.
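On the same theme, re-running the loader trips that UNIQUE constraint on rows it inserted itself last time. One option for a later iteration (not in the current code) is to let Postgres skip duplicates at insert time:

```sql
INSERT INTO users (name, email, age, city)
VALUES (%s, %s, %s, %s)
ON CONFLICT (email) DO NOTHING;
```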

Challenge: I had the Postgres credentials typed directly into loader.py on my first pass. Solution: Moved everything to a .env file within the hour. It’s the kind of thing where you know it’s wrong as you’re doing it.

Result

$ docker-compose up -d
Creating 1-3_db_1     ... done
Creating 1-3_loader_1 ... done

$ docker-compose logs loader
(1, 'Alice Johnson', 'alice@example.com', 28, 'Paris', ...)
(2, 'Bob Smith', 'bob@example.com', 35, 'Lyon', ...)
(3, 'Carol White', 'carol@example.com', 31, 'Marseille', ...)
(4, 'David Brown', 'david@example.com', 42, 'Toulouse', ...)
(5, 'Emma Wilson', 'emma@example.com', 26, 'Bordeaux', ...)
(6, 'Frank Miller', 'frank@example.com', 39, 'Nice', ...)
Data inserted successfully!

One command. CSV in, clean data out, Postgres populated. The whole thing runs in Docker so it works the same everywhere. That’s what “reproducible” actually means in practice.


Project: 1.3 of 28 | Month: January | Hours: ~4h | GitHub: 1-3_Data-Loader-Script