Designing DAGs for Complex Tasks
Directed Acyclic Graphs are the backbone of modern workflow engines, build systems, and AI pipelines. Learn how to model, structure, and optimize them.
What is a Directed Acyclic Graph?
A DAG is a graph where edges have a direction (A → B) and no path leads back to the same node — no cycles. This constraint makes DAGs ideal for modeling task dependencies: you can always determine a valid execution order.
Key Properties
- Every node has a unique topological sort position
- Source nodes have zero in-degree (no predecessors)
- Sink nodes have zero out-degree (no successors)
- Longest path = critical path for scheduling
- Subgraphs inherit the acyclic guarantee
Common Use Cases
- CI/CD pipelines (GitHub Actions, Jenkins)
- Data workflows (Apache Airflow, Prefect)
- Build systems (Bazel, Make, Gradle)
- AI agent task planning
- Package dependency resolution
- Spreadsheet formula evaluation
Enumerate all tasks
List every discrete unit of work. Each task should be atomic — a single responsibility that produces a clear, testable output. Avoid bundling unrelated work into one node.
Identify dependencies
For each task, ask: what must finish before this can begin? Draw a directed edge from the prerequisite to the dependent. Be strict — only include true data or execution dependencies.
Detect and break cycles
Run a depth-first search or Kahn’s algorithm. If a cycle exists, it signals a design flaw — split tasks, introduce versioning, or reconsider the dependency relationship entirely.
Find the critical path
Use forward and backward passes (CPM) to find the longest dependency chain. This path determines the minimum execution time and identifies tasks with zero float — your bottlenecks.
Maximize parallelism
Nodes with no shared ancestors can execute simultaneously. Identify wide levels in the topological sort and schedule them as parallel workers, thread pools, or distributed agents.
Building a DAG in Python
A simple adjacency-list representation with topological sort using Kahn’s algorithm.
from collections import defaultdict, deque
class DAG:
def __init__(self):
self.graph = defaultdict(list) # node → [successors]
self.in_deg = defaultdict(int) # node → in-degree
self.nodes = set()
def add_edge(self, u, v):
# directed edge u ──► v
self.graph[u].append(v)
self.in_deg[v] += 1
self.nodes |= {u, v}
def topological_sort(self):
# Kahn's algorithm
queue = deque(n for n in self.nodes if self.in_deg[n] == 0)
order, seen = [], 0
while queue:
node = queue.popleft()
order.append(node); seen += 1
for succ in self.graph[node]:
self.in_deg[succ] -= 1
if self.in_deg[succ] == 0:
queue.append(succ)
if seen != len(self.nodes):
raise ValueError("Cycle detected — not a valid DAG")
return order
# ── Usage ──────────────────────────────
dag = DAG()
for edge in [("start","A"), ("start","B"),
("A","C"), ("B","C"), ("B","D"),
("C","end"), ("D","end")]:
dag.add_edge(*edge)
print(dag.topological_sort())
# → ['start', 'A', 'B', 'C', 'D', 'end']Design Principles
Keep nodes fine-grained
Smaller tasks enable better parallelism, easier retries on failure, and clearer observability. If a node takes more than ~15 minutes, consider splitting it.
Prefer immutable data contracts
Each edge should carry a well-defined schema. Version your data contracts so upstream changes don’t silently break downstream consumers.
Model retries as sub-graphs
Don’t rely on implicit retry loops. Instead, model fallback logic explicitly as conditional branches within the DAG, making retry behavior visible and auditable.
Instrument the critical path
Add latency telemetry to every node. The critical path changes as data volumes grow — what’s fine today may become your bottleneck at 10× scale.
Common Patterns
- Fan-out: one node feeds many parallel workers
- Fan-in: reduce/join results before continuing
- Diamond: fan-out then fan-in via same path
- Chain: sequential pipeline with no branching
- Forest: multiple independent root trees
Common Pitfalls
- Hidden temporal dependencies not modeled as edges
- Overly wide fan-outs overwhelming downstream systems
- Global shared state bypassing the graph’s ordering
- Neglecting idempotency in task implementations
- Confusing logical and physical DAG layers

