I have seen this question and its variants during a couple of interviews, so it is worth exploring.
The simplest form of the question is:
Given an array of integers, determine whether there two distinct elements in it that sum to an integer \(C\).
The brute-force approach: Look at every distinct pair of elements \(a_i\), \(a_j\), and check whether \(a_i + a_j = C\). There are \((n-1) + (n-2) \ldots + (1) = \frac{1}{2}n(n-1)\) distinct tuples, giving us a complexity of \(O(n^2)\).
The code to solve this could look something like this:
We can try and do better. What if we sorted the list? That sometimes helps, and it’s \(O(n\operatorname{log}(n))\) complexity. We now have a sorted list and are trying to find two distinct elements \(a_i, a_j\), such that \(a_i + a_j = C\). We also know that that, since the array is sorted, \(i > j \leftrightarrow a_i \geq a_j\). Then, if we have \(a_i + a_j > C, i > j)\), we know that we must lower one of the terms. Since \(a_j > a_i\), we can try \(a_{j-1}\), since \(a_{j_1} \leq a_j\), and then just maybe \(a_i + a_{j-1} = C\). Symmetrically, if we have \(a_i + a_j < C, i > j\), then we can try \(a_{i+1} + a_j\). This gives us an algorithmic idea. Start with two pointers - one to \(a_0\) and one to \(a_{n-1}\), the last element. If the sum of the two elements is \(C\), stop. Otherwise, if it’s smaller than C, move right the left pointer, and if it’s greater than \(C\), move the right pointer to the left.
In code, it might look like this:
Since sorting is \(O(n\operatorname{log}(n))\), and the second pass through the array is linear, the total running time of this algorithm is \(O(n\operatorname{log}(n))\), which is improvement over the last algorithm.
But maybe we can do even better than that. What do we actually need to determine? We need to determine whether there exist \(i\), and \(j\), such that \(a_i + a_j = C\), or alternatively, \(a_i = C - a_j\). Hence, if we are scanning the array linearly, and are looking at \(a_j\), if the array has in it the number \(C - a_j = a_i\), we have found our solution. This leads to the following idea: Go through the array \(a\), and add every member to a set, so that we can query array membership in \(O(1)\) time. Then, iterate through the array again, and for each element \(a_j\), use the precomputed set to check whether \(C-a_j\) is in the array. This clearly gives us an \(O(n)\) algorithm. A straightforward implementation might look like this (Java):
Let’s check that it works with input a = [5,8,29,1,-4,3], C = 25
. This should return true as \(29-4 = 25\).
Seems to be working. Let’s try \(C=10\). This should return false since no tuple sums to \(10\) in this example.
The problem is that by putting everything into a set, we have lost the ability to tell how many instances of a number there were in the original array. The code above allows us to use the \(5\) twice. Similarly, if we gave the code a = [5], C=10
, it would still return true. Here is one way to fix it: Create a hashmap that maps a number to the index it appeared at in the original array. Then, scan the array again, and for every \(a_j\), check if \(C-a_j\) is in the hashmap. If it is, check if it appears somewhere besides index \(j\). If it does, we found our tuple. An implementation might look something like this:
Assuming good implementations of hashing functions, this algorithm should be linear, \(O(n)\). This covers the case for \(k=2\). I’ll now look at cases where \(k > 2\).
0.1 Case for \(k > 2\)
Let’s look at \(k = 3\). We now need to find three numbers that sum to \(C\) : \(a_x + a_y + a_z = C\). Hence, \(a_x = C - a_y - a_z\). The obvious brute-force way is to look at every pair of three elements, giving us an \(O(n^3)\) algorithm. We can use a similar technique as last time to improve this. Store all the array elements in a set, and for every two distinct elements \(a_y, a_z\), check if \(C - a_y - a_z\) is in the array. This gives an \(O(n^2)\) algorithm.
For \(k = 4\), compute a set \(S\) of sums of all 2-element subsets of the original array. Then, for every two elements \(a_e, a_f\) in the original array, check if \(C - a_e - a_f\) is in \(S\). If it is, we have found a solution, again in \(O(n^2)\) time.
In general, when k is even, we have the following algorithm:
- Compute a set \(S\) that stores all the sums of all the \(k/2\)-element tuples of the original array
- Look at every \(k/2\) tuple of elements in the original array, and compute its sum \(s\). If \(C - s \in S\), we found a solution
This is \(O(n^{\frac{k}{2}})\)
When \(k\) is odd, we have the following algorithm:
- Compute a set \(S\) that stores all the sums of all the \((k-1)/2\)-element tuples of the original array.
- Look at every \((k+1)/2\)-element tuple in the original array, and compute its sum \(s\). If \(C - s\), we found a solution.
This is \(O(n^{\frac{k+1}{2}})\)
Notice the above algorithms allow the same element to be used more than once. Fixing this is an implementation detail; we have to keep track as to which indices a sum came from, and ensure that there are no duplicates. This will increase the runtime of our algorithm.