Disjoint-set data structure
In computer science, a disjoint-set data structure (also called a union–find data structure or merge–find set) is a data structure that tracks a set of elements partitioned into a number of disjoint (non-overlapping) subsets. It provides near-constant-time operations (bounded by the inverse Ackermann function) to add new sets, to merge existing sets, and to determine whether elements are in the same set. In addition to many other uses (see the Applications section), disjoint-sets play a key role in Kruskal's algorithm for finding the minimum spanning tree of a graph.
Disjoint-set/Union-find forest
Type: multiway tree
Invented: 1964
Invented by: Bernard A. Galler and Michael J. Fischer

History
Disjoint-set forests were first described by Bernard A. Galler and Michael J. Fischer in 1964.[2] In 1973, their time complexity was bounded to O(log* n), the iterated logarithm of n, by Hopcroft and Ullman.[3] In 1975, Robert Tarjan was the first to prove the O(m α(n)) (inverse Ackermann function) upper bound on the algorithm's time complexity,[4] and, in 1979, showed that this was the lower bound for a restricted case.[5] In 1989, Fredman and Saks showed that Ω(α(n)) (amortized) words must be accessed by any disjoint-set data structure per operation,[6] thereby proving the optimality of the data structure.
In 1991, Galil and Italiano published a survey of data structures for disjoint-sets.[7]
In 1994, Richard J. Anderson and Heather Woll described a parallelized version of Union–Find that never needs to block.[8]
In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a persistent version of the disjoint-set forest data structure, allowing previous versions of the structure to be efficiently retained, and formalized its correctness using the proof assistant Coq.[9] However, the implementation achieves the same asymptotic performance only if it is used ephemerally or if the same version of the structure is repeatedly used with limited backtracking.
Representation
A disjoint-set forest consists of a number of elements, each of which stores an id, a parent pointer, and, in efficient algorithms, either a size or a "rank" value.
The parent pointers of elements are arranged to form one or more trees, each representing a set. If an element's parent pointer points to no other element, then the element is the root of a tree and is the representative member of its set. A set may consist of only a single element. However, if the element has a parent, the element is part of whatever set is identified by following the chain of parents upwards until a representative element (one without a parent) is reached at the root of the tree.
Forests can be represented compactly in memory as arrays in which parents are indicated by their array index.
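As a minimal sketch of this array layout, assume the elements are identified by the integers 0 to n − 1. The name `parent` and the example forest below are illustrative only and do not come from any particular library.

```python
# parent[i] holds the index of element i's parent.
# A root is its own parent, matching the self-pointer convention of MakeSet.
parent = [0, 0, 1, 3, 3]

# This encodes two trees:
#   0 <- 1 <- 2      (the set {0, 1, 2}, represented by 0)
#   3 <- 4           (the set {3, 4}, represented by 3)

# The representatives are exactly the self-parented entries.
roots = [i for i, p in enumerate(parent) if p == i]  # [0, 3]
```

Storing only indices keeps the whole forest in one contiguous list; a rank or size list of the same length can sit alongside it.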
Operations
MakeSet
The MakeSet operation makes a new set by creating a new element with a unique id, a rank of 0, and a parent pointer to itself. The parent pointer to itself indicates that the element is the representative member of its own set.
The MakeSet operation has O(1) time complexity, so initializing n sets has O(n) time complexity.
Pseudocode:
function MakeSet(x) is
    if x is not already present then
        add x to the disjoint-set tree
        x.parent := x
        x.rank := 0
        x.size := 1
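The MakeSet pseudocode could be sketched in Python roughly as follows. The class name `DisjointSet` and the use of three dictionaries keyed by element are assumptions made for illustration; any hashable value can serve as an element here.

```python
class DisjointSet:
    """Illustrative container; only MakeSet is shown."""

    def __init__(self):
        self.parent = {}  # element -> parent element
        self.rank = {}    # element -> rank (meaningful at roots)
        self.size = {}    # element -> subtree size (meaningful at roots)

    def make_set(self, x):
        # Adding an element that is already present is a no-op.
        if x in self.parent:
            return
        self.parent[x] = x  # a new element is its own representative
        self.rank[x] = 0
        self.size[x] = 1


ds = DisjointSet()
for item in ("a", "b", "a"):  # the repeated "a" is ignored
    ds.make_set(item)
```

Each call does a constant number of dictionary operations, which is where the constant per-call cost comes from (in expectation, for hash-based dictionaries).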
Find
Find(x) follows the chain of parent pointers from x up the tree until it reaches a root element, whose parent is itself. This root element is the representative member of the set to which x belongs, and may be x itself.
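A plain Find, with no restructuring of the tree, might look like the sketch below. It assumes the forest is stored as a Python list `parent` indexed by integer element ids, which is an illustrative choice rather than a required layout.

```python
def find(parent, x):
    """Return the representative of x by walking parent links to the root."""
    while parent[x] != x:
        x = parent[x]
    return x


# Example forest: 0 <- 1 <- 2 and 3 <- 4
parent = [0, 0, 1, 3, 3]
same_set = find(parent, 2) == find(parent, 1)   # True: both reach root 0
different = find(parent, 2) == find(parent, 4)  # False: roots 0 and 3
```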
Path compression
Path compression flattens the structure of the tree by making every node point to the root whenever Find is used on it. This is valid, since each element visited on the way to a root is part of the same set. The resulting flatter tree speeds up future operations not only on these elements, but also on those referencing them.
Tarjan and Van Leeuwen also developed one-pass Find algorithms that are more efficient in practice while retaining the same worst-case complexity: path splitting and path halving.[4]
Path halving
Path halving makes every other node on the path point to its grandparent.
Path splitting
Path splitting makes every node on the path point to its grandparent.
Pseudocode
Path compression:

function Find(x)
    if x.parent ≠ x
        x.parent := Find(x.parent)
    return x.parent

Path halving:

function Find(x)
    while x.parent ≠ x
        x.parent := x.parent.parent
        x := x.parent
    return x

Path splitting:

function Find(x)
    while x.parent ≠ x
        x, x.parent := x.parent, x.parent.parent
    return x
Path compression can be implemented using iteration by first finding the root then updating the parents:
function Find(x) is
    root := x
    while root.parent ≠ root
        root := root.parent

    while x.parent ≠ root
        parent := x.parent
        x.parent := root
        x := parent

    return root
Path splitting can be represented without multiple assignment (where the right-hand side is evaluated first):

function Find(x)
    while x.parent ≠ x
        next := x.parent
        x.parent := next.parent
        x := next
    return x

or

function Find(x)
    while x.parent ≠ x
        prev := x
        x := x.parent
        prev.parent := x.parent
    return x
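To compare the three strategies side by side, the sketch below implements each over a list-based forest and applies it to the same five-element chain. The function names and the list representation are assumptions made for illustration.

```python
def find_compress(parent, x):
    """Path compression, iterative: find the root, then repoint the whole path."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        nxt = parent[x]
        parent[x] = root
        x = nxt
    return root


def find_halve(parent, x):
    """Path halving: every other node on the path is repointed to its grandparent."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def find_split(parent, x):
    """Path splitting: every node on the path is repointed to its grandparent."""
    while parent[x] != x:
        nxt = parent[x]
        parent[x] = parent[nxt]
        x = nxt
    return x


chain = [0, 0, 1, 2, 3]  # 0 <- 1 <- 2 <- 3 <- 4

compressed = chain.copy()
find_compress(compressed, 4)  # compressed == [0, 0, 0, 0, 0]

halved = chain.copy()
find_halve(halved, 4)         # halved == [0, 0, 0, 2, 2]

split = chain.copy()
find_split(split, 4)          # split == [0, 0, 0, 1, 2]
```

All three return the same root; they differ only in how much of the path is rewritten and in whether a second pass (or recursion) is needed.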
Union
Union(x, y) uses Find to determine the roots of the trees that x and y belong to. If the roots are distinct, the trees are combined by attaching the root of one to the root of the other. If this is done naively, such as by always making x a child of y, the height of the trees can grow as O(n). To prevent this, union by rank or union by size is used.
by rank
Union by rank always attaches the shorter tree to the root of the taller tree. Thus, the resulting tree is no taller than the originals unless they were of equal height, in which case the resulting tree is taller by one node.
To implement union by rank, each element is associated with a rank. Initially a set has one element and a rank of zero. If two sets are unioned and have the same rank, the resulting set's rank is one larger; otherwise, if two sets are unioned and have different ranks, the resulting set's rank is the larger of the two. Ranks are used instead of height or depth because path compression will change the trees' heights over time.
by size
Union by size always attaches the tree with fewer elements to the root of the tree having more elements.
Pseudocode
Union by rank:

function Union(x, y) is
    xRoot := Find(x)
    yRoot := Find(y)

    // x and y are already in the same set
    if xRoot = yRoot then
        return

    // x and y are not in same set, so we merge them
    if xRoot.rank < yRoot.rank then
        xRoot, yRoot := yRoot, xRoot // swap xRoot and yRoot

    // merge yRoot into xRoot
    yRoot.parent := xRoot
    if xRoot.rank = yRoot.rank then
        xRoot.rank := xRoot.rank + 1

Union by size:

function Union(x, y) is
    xRoot := Find(x)
    yRoot := Find(y)

    // x and y are already in the same set
    if xRoot = yRoot then
        return

    // x and y are not in same set, so we merge them
    if xRoot.size < yRoot.size then
        xRoot, yRoot := yRoot, xRoot // swap xRoot and yRoot

    // merge yRoot into xRoot
    yRoot.parent := xRoot
    xRoot.size := xRoot.size + yRoot.size
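Both Union variants can be sketched in Python over list-based storage, as below. The list layout, the choice of path halving inside `find`, and the boolean return value (True when a merge happened) are assumptions added for illustration; the pseudocode above returns nothing.

```python
def find(parent, x):
    """Find with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union_by_rank(parent, rank, x, y):
    x_root, y_root = find(parent, x), find(parent, y)
    if x_root == y_root:
        return False  # already in the same set
    if rank[x_root] < rank[y_root]:
        x_root, y_root = y_root, x_root  # keep the higher rank as the new root
    parent[y_root] = x_root
    if rank[x_root] == rank[y_root]:
        rank[x_root] += 1  # equal ranks: the merged tree is one rank higher
    return True


def union_by_size(parent, size, x, y):
    x_root, y_root = find(parent, x), find(parent, y)
    if x_root == y_root:
        return False  # already in the same set
    if size[x_root] < size[y_root]:
        x_root, y_root = y_root, x_root  # keep the larger tree as the new root
    parent[y_root] = x_root
    size[x_root] += size[y_root]
    return True


# Union by rank on four singletons.
p, rank = list(range(4)), [0] * 4
union_by_rank(p, rank, 0, 1)  # rank[0] becomes 1
union_by_rank(p, rank, 2, 3)  # rank[2] becomes 1
union_by_rank(p, rank, 1, 3)  # equal ranks again, so rank[0] becomes 2

# Union by size on four singletons.
q, size = list(range(4)), [1] * 4
union_by_size(q, size, 0, 1)  # size[0] becomes 2
union_by_size(q, size, 2, 0)  # the singleton 2 is attached under root 0
```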
Time complexity
Without path compression (or a variant), union by rank, or union by size, the height of trees can grow unchecked as O(n), implying that Find and Union operations will take O(n) time.
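The linear growth can be observed on a small input. The sketch below, using illustrative helper names and a list-based forest, builds one set of eight elements twice: once with a naive Union that always hangs the first root under the second, and once with union by size. Neither version uses path compression, so the depths reflect the linking rule alone.

```python
def find(parent, x):
    """Plain Find with no path compression."""
    while parent[x] != x:
        x = parent[x]
    return x


def depth(parent, x):
    """Number of parent links between x and its root."""
    d = 0
    while parent[x] != x:
        x = parent[x]
        d += 1
    return d


def naive_union(parent, x, y):
    x_root, y_root = find(parent, x), find(parent, y)
    if x_root != y_root:
        parent[x_root] = y_root  # always make x's root a child of y's root


def union_by_size(parent, size, x, y):
    x_root, y_root = find(parent, x), find(parent, y)
    if x_root == y_root:
        return
    if size[x_root] < size[y_root]:
        x_root, y_root = y_root, x_root
    parent[y_root] = x_root
    size[x_root] += size[y_root]


n = 8

naive = list(range(n))
for i in range(1, n):
    naive_union(naive, 0, i)  # builds the chain 0 -> 1 -> ... -> 7

sized, size = list(range(n)), [1] * n
for i in range(1, n):
    union_by_size(sized, size, 0, i)  # every element ends up directly under 0

naive_depth = depth(naive, 0)                         # 7, i.e. n - 1
sized_depth = max(depth(sized, i) for i in range(n))  # 1
```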
Using path compression alone gives a worst-case running time of Θ(n + f · (1 + log_{2+f/n} n)),[10] for a sequence of n MakeSet operations (and hence at most n − 1 Union operations) and f Find operations.
Using union by rank alone gives a running time of O(m log₂ n) (tight bound) for m operations of any sort of which n are MakeSet operations.[10]
Combining path compression, splitting, or halving with union by rank or by size ensures that the amortized time per operation is only O(α(n)),[4][5] which is optimal,[6] where α(n) is the inverse Ackermann function. This function has a value α(n) < 5 for any value of n that can be written in this physical universe, so the disjoint-set operations take place in essentially constant time.
Applications
Disjoint-set data structures model the partitioning of a set, for example to keep track of the connected components of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The Union–Find algorithm is used in high-performance implementations of unification.[11]
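The cycle test described above can be sketched as follows. The function name `has_cycle` and the edge-list input are assumptions for illustration, and the sketch links roots without a rank or size heuristic to stay short. Parallel edges and self-loops are reported as cycles.

```python
def find(parent, x):
    """Find with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def has_cycle(n, edges):
    """Return True if the undirected graph on vertices 0..n-1 contains a cycle."""
    parent = list(range(n))
    for u, v in edges:
        ru, rv = find(parent, u), find(parent, v)
        if ru == rv:
            # u and v are already connected, so this edge closes a cycle.
            return True
        parent[ru] = rv
    return False


path_graph = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (2, 0)]
results = (has_cycle(3, path_graph), has_cycle(3, triangle))  # (False, True)
```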
This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also a key component in implementing Kruskal's algorithm to find the minimum spanning tree of a graph.
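As a rough illustration of the role the structure plays in Kruskal's algorithm, the sketch below scans edges in order of increasing weight and keeps an edge only when its endpoints lie in different sets. The function signature and the `(weight, u, v)` edge format are assumptions for this example, and the code is unrelated to the Boost implementation.

```python
def kruskal(n, edges):
    """Return (total_weight, chosen_edges) for vertices 0..n-1.

    edges is an iterable of (weight, u, v) tuples. If the graph is not
    connected, the result is a minimum spanning forest.
    """
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, chosen = 0, []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # the edge would close a cycle
        if size[ru] < size[rv]:
            ru, rv = rv, ru  # union by size
        parent[rv] = ru
        size[ru] += size[rv]
        total += w
        chosen.append((u, v))
    return total, chosen


edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
mst = kruskal(4, edges)  # (7, [(0, 1), (1, 2), (2, 3)])
```

The edge of weight 3 is skipped because vertices 0 and 2 are already connected through vertex 1 when it is examined.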
Note that the implementation as disjoint-set forests doesn't allow the deletion of edges, even without path compression or the rank heuristic.
Sharir and Agarwal report connections between the worst-case behavior of disjoint-sets and the length of Davenport–Schinzel sequences, a combinatorial structure from computational geometry.[12]
See also
 Partition refinement, a different data structure for maintaining disjoint sets, with updates that split sets apart rather than merging them together
 Dynamic connectivity
References
 Tarjan, Robert Endre (1975). "Efficiency of a Good But Not Linear Set Union Algorithm". Journal of the ACM. 22 (2): 215–225. doi:10.1145/321879.321884.
 Galler, Bernard A.; Fischer, Michael J. (May 1964). "An improved equivalence algorithm". Communications of the ACM. 7 (5): 301–303. doi:10.1145/364099.364331. The paper originating disjoint-set forests.
 Hopcroft, J. E.; Ullman, J. D. (1973). "Set Merging Algorithms". SIAM Journal on Computing. 2 (4): 294–303. doi:10.1137/0202024.
 Tarjan, Robert E.; van Leeuwen, Jan (1984). "Worst-case analysis of set union algorithms". Journal of the ACM. 31 (2): 245–281. doi:10.1145/62.2160.
 Tarjan, Robert Endre (1979). "A class of algorithms which require non-linear time to maintain disjoint sets". Journal of Computer and System Sciences. 18 (2): 110–127. doi:10.1016/0022-0000(79)90042-4.
 Fredman, M.; Saks, M. (May 1989). "The cell probe complexity of dynamic data structures". Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing: 345–354.
Theorem 5: Any CPROBE(log n) implementation of the set union problem requires Ω(m α(m, n)) time to execute m Find's and n−1 Union's, beginning with n singleton sets.
 Galil, Z.; Italiano, G. (1991). "Data structures and algorithms for disjoint set union problems". ACM Computing Surveys. 23 (3): 319–344. doi:10.1145/116873.116878.
 Anderson, Richard J.; Woll, Heather (1994). Wait-free Parallel Algorithms for the Union-Find Problem. 23rd ACM Symposium on Theory of Computing. pp. 370–380.
 Conchon, Sylvain; Filliâtre, Jean-Christophe (October 2007). "A Persistent Union-Find Data Structure". ACM SIGPLAN Workshop on ML. Freiburg, Germany.
 Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009). "Chapter 21: Data structures for Disjoint Sets". Introduction to Algorithms (Third ed.). MIT Press. pp. 571–572. ISBN 978-0-262-03384-8.
 Knight, Kevin (1989). "Unification: A multidisciplinary survey" (PDF). ACM Computing Surveys. 21: 93–124. doi:10.1145/62029.62030.
 Sharir, M.; Agarwal, P. (1995). Davenport-Schinzel sequences and their geometric applications. Cambridge University Press.
External links
 C++ implementation, part of the Boost C++ libraries
 A Java implementation with an application to color image segmentation, Statistical Region Merging (SRM), IEEE Trans. Pattern Anal. Mach. Intell. 26(11): 1452–1458 (2004)
 Java applet: A Graphical Union–Find Implementation, by Rory L. P. McGuire
 A Matlab Implementation which is part of the Tracker Component Library
 Python implementation
 Visual explanation and C# code