Google Interview Question
Software EngineersCountry: UK
Interview Type: In-Person
Can you please clarify about the sorting?
From what I understand, by splitting the sorting between machines, are you sorting the arrays partially? I find it confusing when you're using map reduce and heap sort.
Please explain a little more about that.
Thanks.
1. I dont think we need to sort both arrays. (==nlog(n))
For the first one, hash every value and point it into the array. (this is log(n) as you run on the entire array
then, run on the second array, ad for each element , look if it exists on the hash, if it does, pull the element and operate F on it.
[There is an assumption that the values are fixed and it is not a stream of data]
2. balancing is relatively easy,hash the first array [or split the array, hash every part on different machines and combine the hashmaps --> which is a basic MapReduce).
then you can splot the second array and operate each machine on different part.
3. Per the first answer above, find a different master (Like done in Hadoop)
Can you please clarify about the sorting?
From what I understand, by splitting the sorting between machines, are you sorting the arrays partially? I find it confusing when you're using map reduce and heap sort.
Please explain a little more about that.
Thanks.
I refer to Foo as A, Bar as B.
1. Assuming every element in A has exactly 1 corresponding element in B, sort both A and B then apply F(A[i], B[i]) for i = 0 to length of A or B.
We can do this with just machine 1 and 2. However we really should also utilize machine 3. One simple algorithm is to partition (like in quicksort) the arrays A and B into 24 parts (this is a power of 2 evenly divided by 3). Then put 8 parts of each array onto each machine. Each machine can then sort each sub-partition and run function F on corresponding values from A and B like mentioned above.
2. I assume by scale up, this means you end up with more spare machines. Same idea as above, we want to place the same number of elements onto each machine via partitioning the arrays such that each machine gets the same number of elements to process, then letting each machine handle their load at their own pace.The nice thing about the recursive partitioning is that it doesn't have to be done by a single machine. Once you partition an array once, the 2 halves now don't care about each other and can be treated independently on different machines. This means, once machine 1 breaks A into A[0...i] and A[i+1, A.length-1], A can offload A[i+1, A.length-1] to be processed by another machine.
3. Need to somehow distribute the control. In the extreme case, every machine could know what every other machine is working on and report process to each other. Ideally, you would have a balance of masters and slaves where only the master needs to coordinate slaves and slaves need to communicate with the master, but still know about other slaves in the same cluster. If the master is non-responsive, a new master needs to be created from another slave in the cluster, that slave needs to figure out what work was done and what work remains, and continue from there.
Part 1: The arrays need to be sorted first and then the function can be applied. This can be accomplished by splitting the sorting between the machines, then merging the results and then exchanging the parts of the arrays for application of the function (a form of a map-reduce algorithm). Heap sort would be particularly useful as it produces consecutive (and sorted) elements during sorting which could already be used for application of the function.
- tested.candidate March 21, 2015Part 2: Either run a small subset first to get an idea if it is linear distribution and then divide statically according to that or try to adapt during processing depending on the load of the machines.
Part 3: Use master election algorithms (gossip algorithms, etc.).