Google Interview Question
Software Engineer
Country: United States
Interview Type: In-Person
It's essentially a sliding-window aggregation, the kind of function stream-processing systems like Flink and Storm provide.
To store the data, use a FIFO list that holds every element with its timestamp. Each time a new element arrives at the tail of the list, evict the expired elements by timestamp and update the sum.
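The FIFO idea above can be sketched on a single machine like this (a minimal sketch; the class and method names are my own, and I pick the convention that an element exactly one window old is still evicted):

```python
from collections import deque

class SlidingWindowSum:
    """Rolling sum over the values seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.items = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0

    def add(self, timestamp, value):
        # Append the new element at the tail...
        self.items.append((timestamp, value))
        self.total += value
        # ...then evict expired elements from the head and update the sum.
        while self.items and self.items[0][0] < timestamp - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value

w = SlidingWindowSum(window_seconds=60)
w.add(0, 5)
w.add(30, 7)
w.add(90, 1)    # the element from t=0 has expired and is evicted
print(w.total)  # 8
```

This assumes timestamps arrive in order; out-of-order events would need buffering or watermarks, which is exactly what the stream-processing systems mentioned above handle for you.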
Use a MapReduce-like approach to shard messages across multiple machines, then either sort the merged counts or keep a min-heap of the top K.
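The reduce-and-heap step could look like this sketch (the shard layout and function name are assumptions for illustration):

```python
import heapq
from collections import Counter

def top_k(shard_counts, k):
    """Merge per-shard frequency maps (the 'reduce' step) and keep the top K."""
    totals = Counter()
    for counts in shard_counts:
        totals.update(counts)
    # heapq.nlargest maintains a bounded min-heap of size k internally,
    # so this avoids fully sorting all distinct keys.
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

shards = [
    {"#a": 5, "#b": 2},
    {"#b": 4, "#c": 1},
]
print(top_k(shards, 2))  # [('#b', 6), ('#a', 5)]
```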
That's a system design question, not a coding one, I assume.
- Chris December 09, 2017

- I define the thing I want rolling statistics for as a fact (e.g. user id, hashtag, ...)
- Then I would split the time window into slots in such a way that the window length divided by the slot count gives a reasonable update rate for the statistics, while keeping the number of slots small enough for a feasible solution.
- Now I have two things. One is a queue whose items represent time slots: each queue item holds a hash table that counts the occurrences (frequency) of the facts seen in that slot.
- The other is a hash table counting total occurrences: it has the fact as key and a reference to a tree item as value; the tree item has the frequency as key and the fact as value.
- If an item arrives, I update the head of the queue, or create a new queue item if the timestamp of the arriving item touches a new slot. I also increment the frequency of that item in my hash-table -> tree-item structure.
- I also check the oldest items in the queue; if they have fallen out of the window relative to the newly arrived item, I subtract their key-value pairs from my main statistics and throw that queue item away.
Now I can always query my tree for the top-k keys and receive the top-k items.
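A single-machine sketch of the slot structure above, with one simplification: instead of the frequency-ordered tree, the top-k query here scans the totals with a bounded heap (a balanced tree would make queries cheaper at a higher update cost). All names are my own, and in-order timestamps are assumed:

```python
import heapq
from collections import Counter, deque

class SlottedTopK:
    """Rolling top-k over a moving window split into fixed-size slots."""

    def __init__(self, window_seconds, num_slots):
        self.slot_len = window_seconds / num_slots
        self.num_slots = num_slots
        self.slots = deque()     # (slot_index, Counter) pairs, oldest first
        self.totals = Counter()  # fact -> total frequency across live slots

    def add(self, timestamp, fact):
        slot = int(timestamp // self.slot_len)
        # Open a new slot at the head if this timestamp touches one.
        if not self.slots or self.slots[-1][0] != slot:
            self.slots.append((slot, Counter()))
        self.slots[-1][1][fact] += 1
        self.totals[fact] += 1
        # Evict slots that fell out of the window, subtracting their counts.
        while self.slots and self.slots[0][0] <= slot - self.num_slots:
            _, old_counts = self.slots.popleft()
            self.totals.subtract(old_counts)
            self.totals += Counter()  # in-place add drops zero/negative entries

    def top_k(self, k):
        return heapq.nlargest(k, self.totals.items(), key=lambda kv: kv[1])

tk = SlottedTopK(window_seconds=60, num_slots=6)
for t, tag in [(1, "#a"), (2, "#a"), (3, "#b"), (65, "#b")]:
    tk.add(t, tag)
print(tk.top_k(1))  # [('#b', 1)] -- the events from the first slot were evicted
```

Note that eviction only happens when new items arrive, which is exactly the staleness caveat discussed below.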
This design has a few assumptions and weak points:
- it is optimized for a high query rate on the top-k items.
- it is slightly inaccurate, because I add items to the hash-table -> tree-item structure as they arrive, but remove them only at the coarser granularity of one time slot.
- I only update the statistics when items arrive, on the assumption that items always arrive; if they don't, the relative statistics stay unchanged (which is true, but perhaps the top-k should become an empty list once no items have arrived for longer than the moving window).
For high availability and fault tolerance of single servers:
- a number of 'frequency aggregators' that receive the events and assign them to time slots (using either synchronized clocks or fine-grained timestamps; the whole time discussion could fill a couple of minutes here). I could place a load balancer in front, or, if I control the clients, assign each client a different aggregator with a fallback if its primary fails.
- multiple redundant 'statistics runners' that receive data from the aggregators and sum it into the final statistics as described in the prior section. An aggregator translates thousands or hundreds of thousands of requests into just one per time slot: a list of key-value pairs that it forwards to the main server. It might already cut away the long tail to minimize traffic, but that's another discussion because it's not that easy; one might need to re-elevate long-tail facts that peak, and with very small time slots a fact that occurs exactly once in a slot might need re-elevation as well, otherwise the statistics would be wrong.
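The aggregator's batching step could be as simple as this sketch (the names and the `forward` callback are assumptions; real code would forward over an RPC or message queue, and this version ignores the long-tail cutting discussed above):

```python
from collections import Counter

class FrequencyAggregator:
    """Collapses many incoming events into one key-value list per time slot."""

    def __init__(self, slot_seconds, forward):
        self.slot_seconds = slot_seconds
        self.forward = forward  # callable(slot_index, [(fact, count), ...])
        self.current_slot = None
        self.counts = Counter()

    def on_event(self, timestamp, fact):
        slot = int(timestamp // self.slot_seconds)
        if self.current_slot is not None and slot != self.current_slot:
            self.flush()  # a new slot started: ship the finished one
        self.current_slot = slot
        self.counts[fact] += 1

    def flush(self):
        # One forwarded request per slot, however many events arrived in it.
        if self.current_slot is not None and self.counts:
            self.forward(self.current_slot, sorted(self.counts.items()))
        self.counts = Counter()

sent = []
agg = FrequencyAggregator(slot_seconds=10,
                          forward=lambda s, kv: sent.append((s, kv)))
for t, tag in [(1, "#a"), (2, "#b"), (3, "#a"), (11, "#b")]:
    agg.on_event(t, tag)
agg.flush()
print(sent)  # [(0, [('#a', 2), ('#b', 1)]), (1, [('#b', 1)])]
```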
Now I can query any one of the redundant statistics runners. However, I need to account for statistics runners that are temporarily overloaded or didn't receive all updates from the aggregators, because they will return different, wrong statistics.