Docker is how you package and run applications consistently across any machine. In data engineering, Docker ensures your pipeline works on your laptop, in CI/CD, and on production servers—no “it works on my machine” excuses.


Core Concepts

Images vs Containers

Image: A blueprint (like a template)

  • Defines what goes into the application
  • Built from a Dockerfile
  • Immutable (read-only)
  • Stored locally or in registries (Docker Hub, ECR)

Container: An actual running instance

  • Created from an image
  • Isolated, lightweight process
  • Can be started, stopped, deleted
  • Data in its writable layer is deleted when the container is removed (unless you use volumes)
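
A quick sketch of the relationship, using the public nginx image as a stand-in: one image can back many containers.

# Pull one image, start several containers from it
docker pull nginx:1.25
docker run -d --name web1 nginx:1.25
docker run -d --name web2 nginx:1.25
docker ps   # both containers report the same IMAGE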

Building Images: Dockerfile

Basic Dockerfile

# Start from a base image
FROM python:3.11-slim
 
# Set working directory
WORKDIR /app
 
# Copy requirements
COPY requirements.txt .
 
# Install dependencies
RUN pip install -r requirements.txt
 
# Copy application code
COPY . .
 
# Expose port (documentation only)
EXPOSE 8000
 
# Default command
CMD ["python", "main.py"]

What each instruction does, in order:

  1. FROM — Choose base image
  2. RUN — Execute commands at build time
  3. COPY — Copy files from host to container
  4. EXPOSE — Document which port the app uses
  5. CMD — Default command when container starts
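
The split matters: RUN happens once at build time, CMD every time a container starts. A tiny illustrative Dockerfile:

# RUN output appears in the 'docker build' log
FROM python:3.11-slim
RUN echo "runs at build time"
# CMD executes when the container starts
CMD ["echo", "runs at container start"]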

Building an Image

# Build image with name and tag
docker build -t my-app:1.0 .
 
# Verify it was created
docker images
 
# Output:
# REPOSITORY   TAG    IMAGE ID      CREATED
# my-app       1.0    abc123def456  5 seconds ago
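
To share the image beyond your machine, tag it for a registry and push (the account name myuser is hypothetical):

# Tag for your registry account, authenticate, push
docker tag my-app:1.0 myuser/my-app:1.0
docker login
docker push myuser/my-app:1.0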

Multi-Stage Build (Reduce Size)

# Stage 1: Build
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
 
# Stage 2: Runtime (smaller image)
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY . .
CMD ["python", "main.py"]

Why: The first stage includes build tools (large). The second stage is lean, fast, and production-ready. One caveat: pip also installs console-script entry points into /usr/local/bin, so copy that directory from the builder as well if your app calls any installed CLIs.


Running Containers

Basic Container

# Run a container from image
docker run -d --name my-app-1 my-app:1.0
 
# Flags:
# -d: detached (run in background)
# --name: container name (for reference)

Interactive Container

# Run interactively (bash shell)
docker run -it my-app:1.0 /bin/bash
 
# Flags:
# -i: interactive (keep STDIN open)
# -t: allocate a pseudo-terminal
# /bin/bash: shell to run inside the container

Port Mapping

# Map container port 8000 to host port 8000
docker run -p 8000:8000 my-app:1.0
 
# Syntax: -p HOST_PORT:CONTAINER_PORT
 
# Check what's running
docker ps
 
# Output:
# CONTAINER ID  IMAGE        PORTS                  NAMES
# abc123        my-app:1.0   0.0.0.0:8000->8000/tcp my-app-1
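
Assuming the app actually serves HTTP on port 8000, you can verify the mapping from the host:

# The request hits host port 8000 and is forwarded into the container
curl http://localhost:8000/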

Environment Variables

# Pass environment variables to container
docker run -e DATABASE_URL="postgres://user:pass@db:5432/mydb" \
           -e LOG_LEVEL="debug" \
           my-app:1.0
 
# Or from a file
docker run --env-file .env my-app:1.0
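
The --env-file format is one KEY=value pair per line (lines starting with # are treated as comments). A sketch of a matching .env, with placeholder values:

DATABASE_URL=postgres://user:pass@db:5432/mydb
LOG_LEVEL=debug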

Running Commands in Container

# Execute command in running container
docker exec -it my-app-1 /bin/bash
 
# Run one-off command
docker exec my-app-1 python script.py

Volumes: Persist Data

Problem: When a container is removed, all data in its writable layer is lost.
Solution: Mount volumes to persist data outside the container lifecycle.

Bind Mount (Host Directory)

# Mount host directory into container
docker run -v /path/on/host:/path/in/container my-app:1.0
 
# Example: share the host's SSH keys with the container
docker run -v ~/.ssh:/root/.ssh my-app:1.0
 
# Read-only mount
docker run -v /config:/app/config:ro my-app:1.0

Named Volume (Docker-managed)

# Create named volume
docker volume create my-data
 
# Mount named volume
docker run -v my-data:/app/data my-app:1.0
 
# List volumes
docker volume ls
 
# Inspect volume
docker volume inspect my-data
 
# Delete volume
docker volume rm my-data
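
A quick way to confirm persistence, using the small alpine image: write through one container, read back from another.

# The file outlives both containers because it lives in the volume
docker run --rm -v my-data:/data alpine sh -c 'echo hello > /data/greeting.txt'
docker run --rm -v my-data:/data alpine cat /data/greeting.txt   # prints: hello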

Volume in Dockerfile

FROM postgres:15
 
# Declare volume (helps document persistence)
VOLUME /var/lib/postgresql/data
 
# Files written to /var/lib/postgresql/data land in a volume (anonymous
# unless you mount a named one) and survive container removal

Networking: Connect Containers

User-Defined Bridge Network

# Containers on the same user-defined bridge network can reach each other
# by name (the default bridge network provides no automatic DNS)
docker network create my-network
 
# Run containers on network
docker run -d --name db --network my-network postgres:15
docker run -d --name app --network my-network my-app:1.0
 
# From the app container, connect to db using the hostname 'db'
# Connection string: postgresql://user:pass@db:5432/mydb
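
To confirm name resolution from inside the app container (getent ships with Debian-based images; adjust for other bases):

# Resolve the 'db' hostname from the app container
docker exec app getent hosts db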

Host Network (Advanced)

# Container shares host's network interface
docker run --network host my-app:1.0
 
# Use when you need maximum network performance
# Warning: no network isolation from the host; fully supported only on Linux

Container Lifecycle

Common Commands

# Check running containers
docker ps
 
# Check all containers (including stopped)
docker ps -a
 
# View logs
docker logs my-app-1
docker logs -f my-app-1  # Follow logs (like tail -f)
 
# Stop container
docker stop my-app-1
 
# Start container
docker start my-app-1
 
# Remove container (stop it first, or force with -f)
docker rm my-app-1
 
# Remove image
docker rmi my-app:1.0
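
Stopped containers and old images accumulate over time; Docker has a built-in cleanup command:

# Remove stopped containers, dangling images, and unused networks
docker system prune
 
# Also reclaim unused volumes (careful: deletes data)
docker system prune --volumes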

Container Health Check

FROM my-app:1.0
 
# Periodically check if the app is healthy
# (curl must be installed in the image; slim bases often omit it)
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1
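
Once the container is running, Docker tracks the health state; one way to read it:

# Prints 'starting', 'healthy', or 'unhealthy'
docker inspect --format '{{.State.Health.Status}}' my-app-1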

Real-World Data Engineering Example

ETL Pipeline in Docker

FROM python:3.11-slim
 
WORKDIR /app
 
# Install system dependencies
RUN apt-get update && apt-get install -y \
    postgresql-client \
    && rm -rf /var/lib/apt/lists/*
 
# Copy pipeline code
COPY requirements.txt .
RUN pip install -r requirements.txt
 
COPY pipeline/ ./pipeline/
 
# Health check (placeholder: always passes; replace with a real check)
HEALTHCHECK CMD python -c "import sys; sys.exit(0)" || exit 1
 
# Run pipeline
CMD ["python", "-m", "pipeline.etl"]
# Run with environment config
docker run \
  -e SOURCE_DB="postgresql://source:5432/raw" \
  -e TARGET_DB="postgresql://warehouse:5432/prod" \
  -e LOG_LEVEL="info" \
  -v /data/logs:/app/logs \
  my-etl-pipeline:1.0
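
For one-shot batch runs, adding --rm stops exited containers from piling up, a common pattern for ETL jobs:

docker run --rm \
  -e SOURCE_DB="postgresql://source:5432/raw" \
  -e TARGET_DB="postgresql://warehouse:5432/prod" \
  my-etl-pipeline:1.0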

Best Practices

  • Use specific base image versions (reproducibility): FROM python:3.11.1, not FROM python:latest
  • Install only what you need (smaller images): don't install curl if you don't use it
  • Use multi-stage builds (smaller final image): build in one stage, copy artifacts into a slim stage
  • Exploit layer caching (faster builds): put stable commands (RUN pip install) before frequently changing code
  • Use .dockerignore (leaner build context): like .gitignore, but for Docker builds
  • Run as a non-root user (security): USER appuser in the Dockerfile
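
A minimal sketch of the non-root pattern from the last point (appuser is an arbitrary name):

FROM python:3.11-slim
# Create an unprivileged user and switch to it
RUN useradd --create-home appuser
USER appuser
WORKDIR /home/appuser/app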

Tips & Gotchas

  • Container IDs are long. Use --name to give them readable names.
docker run --name my-pipeline my-app:1.0  # Better than container ID
  • Ports must be unique. Two containers can’t use the same host port.
# ❌ Error: Port 8000 already in use
docker run -p 8000:8000 app1:1.0
docker run -p 8000:8000 app2:1.0
 
# ✅ Use different ports
docker run -p 8000:8000 app1:1.0
docker run -p 8001:8000 app2:1.0
  • Container data is deleted with the container. Use volumes!
# ❌ Data lost when the container is removed
docker run my-app:1.0
docker rm container_id
# Data gone!
 
# ✅ Data persists
docker run -v my-data:/app/data my-app:1.0
  • Layers are cached. Put frequently changing code at the end of the Dockerfile.
# ❌ Rebuilds pip install every time code changes
FROM python:3.11
COPY . .
RUN pip install -r requirements.txt
 
# ✅ Caches pip install, only rebuilds your code
FROM python:3.11
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .


Key Takeaway:
Docker = Image (blueprint) → Container (running instance). Use Dockerfile to build images, docker run to start containers, and volumes to persist data. Master Docker and you can deploy anywhere.