Google Interview Question
Software Engineer
Country: United States
Interview Type: In-Person
It's essentially a sliding-window aggregation, the kind of function stream-processing systems like Flink and Storm provide.
To store the data, use a FIFO list that holds every element with its timestamp. Each time a new element arrives at the tail of the list, evict the expired elements by timestamp and update the sum.
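The FIFO idea above can be sketched on a single machine like this (a minimal sketch; the class and method names are my own, and I pick the convention that an element exactly one window old is still evicted):

```python
from collections import deque

class SlidingWindowSum:
    """Rolling sum over the values seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.items = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0

    def add(self, timestamp, value):
        # Append the new element at the tail...
        self.items.append((timestamp, value))
        self.total += value
        # ...then evict expired elements from the head and update the sum.
        while self.items and self.items[0][0] < timestamp - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value

w = SlidingWindowSum(window_seconds=60)
w.add(0, 5)
w.add(30, 7)
w.add(90, 1)    # the element from t=0 has expired and is evicted
print(w.total)  # 8
```

This assumes timestamps arrive in order; out-of-order events would need buffering or watermarks, which is exactly what the stream-processing systems mentioned above handle for you.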
Use a MapReduce-like approach to shard messages across multiple machines, then either sort the merged counts or keep a min-heap of the top K.
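The reduce-and-heap step could look like this sketch (the shard layout and function name are assumptions for illustration):

```python
import heapq
from collections import Counter

def top_k(shard_counts, k):
    """Merge per-shard frequency maps (the 'reduce' step) and keep the top K."""
    totals = Counter()
    for counts in shard_counts:
        totals.update(counts)
    # heapq.nlargest maintains a bounded min-heap of size k internally,
    # so this avoids fully sorting all distinct keys.
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

shards = [
    {"#a": 5, "#b": 2},
    {"#b": 4, "#c": 1},
]
print(top_k(shards, 2))  # [('#b', 6), ('#a', 5)]
```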
That's a system design question, not a coding one, I assume.
- Chris December 09, 2017

- I define the thing I want rolling statistics for as a fact (e.g. user id, hashtag, ...)
- Then I would split the time window into slots in such a way that the window length divided by the slot count gives a reasonable update rate for the statistics, while keeping the number of slots small enough for a feasible solution.
- Now I have two things. One is a queue whose items represent time slots: each queue item holds a hash table that counts the occurrences (frequency) of the facts seen in that slot.
- The other is a hash table counting total occurrences: it has the fact as key and a reference to a tree item as value; the tree item has the frequency as key and the fact as value.
- If an item arrives, I update the head of the queue, or create a new queue item if the timestamp of the arriving item touches a new slot. I also increment the frequency of that item in my hash-table -> tree-item structure.
- I also check the oldest items in the queue; if they have fallen out of the window relative to the newly arrived item, I subtract their key-value pairs from my main statistics and throw that queue item away.
Now I can always query my tree for the top-k keys and receive the top-k items.
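A single-machine sketch of the slot structure above, with one simplification: instead of the frequency-ordered tree, the top-k query here scans the totals with a bounded heap (a balanced tree would make queries cheaper at a higher update cost). All names are my own, and in-order timestamps are assumed:

```python
import heapq
from collections import Counter, deque

class SlottedTopK:
    """Rolling top-k over a moving window split into fixed-size slots."""

    def __init__(self, window_seconds, num_slots):
        self.slot_len = window_seconds / num_slots
        self.num_slots = num_slots
        self.slots = deque()     # (slot_index, Counter) pairs, oldest first
        self.totals = Counter()  # fact -> total frequency across live slots

    def add(self, timestamp, fact):
        slot = int(timestamp // self.slot_len)
        # Open a new slot at the head if this timestamp touches one.
        if not self.slots or self.slots[-1][0] != slot:
            self.slots.append((slot, Counter()))
        self.slots[-1][1][fact] += 1
        self.totals[fact] += 1
        # Evict slots that fell out of the window, subtracting their counts.
        while self.slots and self.slots[0][0] <= slot - self.num_slots:
            _, old_counts = self.slots.popleft()
            self.totals.subtract(old_counts)
            self.totals += Counter()  # in-place add drops zero/negative entries

    def top_k(self, k):
        return heapq.nlargest(k, self.totals.items(), key=lambda kv: kv[1])

tk = SlottedTopK(window_seconds=60, num_slots=6)
for t, tag in [(1, "#a"), (2, "#a"), (3, "#b"), (65, "#b")]:
    tk.add(t, tag)
print(tk.top_k(1))  # [('#b', 1)] -- the events from the first slot were evicted
```

Note that eviction only happens when new items arrive, which is exactly the staleness caveat discussed below.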
This design has a few assumptions and weak points:
- it is optimized for a high query rate on the top-k items.
- it is slightly inaccurate, because I add items to the hash-table -> tree-item structure as they arrive, but remove them only at the coarser granularity of one time slot.
- I only update the statistics when items arrive, on the assumption that items always arrive; if they don't, the relative statistics stay unchanged (which is true, but perhaps the top-k should become an empty list once no items have arrived for longer than the moving window).
For high availability and fault tolerance of single servers:
- a number of 'frequency aggregators' that receive the events and assign them to time slots (using either synchronized clocks or fine-grained timestamps; the whole time discussion could fill a couple of minutes here). I could place a load balancer in front, or, if I control the clients, assign each client a different aggregator with a fallback if its primary fails.
- multiple redundant 'statistics runners' that receive data from the aggregators and sum it into the final statistics as described in the prior section. An aggregator translates thousands or hundreds of thousands of requests into just one per time slot: a list of key-value pairs that it forwards to the main server. It might already cut away the long tail to minimize traffic, but that's another discussion because it's not that easy; one might need to re-elevate long-tail facts that peak, and with very small time slots a fact that occurs exactly once in a slot might need re-elevation as well, otherwise the statistics would be wrong.
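The aggregator's batching step could be as simple as this sketch (the names and the `forward` callback are assumptions; real code would forward over an RPC or message queue, and this version ignores the long-tail cutting discussed above):

```python
from collections import Counter

class FrequencyAggregator:
    """Collapses many incoming events into one key-value list per time slot."""

    def __init__(self, slot_seconds, forward):
        self.slot_seconds = slot_seconds
        self.forward = forward  # callable(slot_index, [(fact, count), ...])
        self.current_slot = None
        self.counts = Counter()

    def on_event(self, timestamp, fact):
        slot = int(timestamp // self.slot_seconds)
        if self.current_slot is not None and slot != self.current_slot:
            self.flush()  # a new slot started: ship the finished one
        self.current_slot = slot
        self.counts[fact] += 1

    def flush(self):
        # One forwarded request per slot, however many events arrived in it.
        if self.current_slot is not None and self.counts:
            self.forward(self.current_slot, sorted(self.counts.items()))
        self.counts = Counter()

sent = []
agg = FrequencyAggregator(slot_seconds=10,
                          forward=lambda s, kv: sent.append((s, kv)))
for t, tag in [(1, "#a"), (2, "#b"), (3, "#a"), (11, "#b")]:
    agg.on_event(t, tag)
agg.flush()
print(sent)  # [(0, [('#a', 2), ('#b', 1)]), (1, [('#b', 1)])]
```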
Now I can query any one of the redundant statistics runners. However, I need to account for statistics runners that are temporarily overloaded or didn't receive all updates from the aggregators, because they will return different, wrong statistics.