Amazon Interview Question
Software Development Managers
Country: United States
Interview Type: In-Person
Here is my idea for solving the problem.
First, I assume that clients request updates (as in NTP) and the server does not push time to anyone.
Second, I assume that once a sync completes, the client's clock matches the server's with something like 99.99999% accuracy, so not every client needs to be synced at exactly the same instant (which would be impossible anyway).
Third, I assume each sync takes between 3 and 6 seconds per client, and that the server serves clients asynchronously, so in the worst case it takes up to 6 seconds for the server to finish all 10 concurrent clients.
Since we can't trust the servers' internal clocks, we can't ask them to start syncing at staggered times, so we have to assume they may all request a sync at the same time. In practice the requests will reach the server in some order, the server will accept the first 10 clients, and the other 90 will be refused. A refused client retries after a delay chosen randomly between 3 and 6 seconds to reduce collisions, and keeps retrying after every collision until it is synced. In the worst case, the last client is synced after 60 seconds.
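The worst-case arithmetic and the randomized backoff above can be sketched like this (a toy model, not real NTP; the capacity and timing constants come from the problem statement, and `retry_delay` is this answer's assumed policy):

```python
import math
import random

MAX_CONCURRENT = 10          # server can sync at most 10 clients at once
SYNC_MIN, SYNC_MAX = 3, 6    # assumed per-client sync duration bounds, seconds

def retry_delay(rng=random):
    """Random backoff between 3 and 6 seconds, chosen by a refused
    client to spread retries out and reduce collisions."""
    return rng.uniform(SYNC_MIN, SYNC_MAX)

def worst_case_seconds(num_clients):
    """Worst case: every 6-second round the server finishes at most 10
    clients, so 100 clients need ceil(100 / 10) * 6 = 60 seconds."""
    return math.ceil(num_clients / MAX_CONCURRENT) * SYNC_MAX

print(worst_case_seconds(100))  # 60
```

This matches the 60-second worst case above; the random delay only affects how close to that bound a real run gets.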
Can we improve this?
I suppose we can modify the architecture a little and let only machines 1 through 10 sync with the central server, dividing the remaining clients into groups of 9, one group per machine. Clients try to sync with their assigned machine and are rejected if it is not synced yet. Say machine 3 is synced but the others are not; machines 19 to 27 can then start syncing with it. In this scenario it takes at most 12 seconds to sync all machines.
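The two-tier scheme can be sketched as follows. The contiguous-block assignment is an assumption (the "19 to 27" example above doesn't quite line up with even 9-client blocks, so the exact ranges are illustrative):

```python
def assign_groups(tier1=10, per_group=9):
    """Machines 1..10 sync with the central server; each first-tier
    machine k then serves a contiguous block of 9 of the remaining
    machines (block boundaries are this sketch's assumption)."""
    groups = {}
    nxt = tier1 + 1
    for k in range(1, tier1 + 1):
        groups[k] = list(range(nxt, nxt + per_group))
        nxt += per_group
    return groups

def worst_case_seconds(sync_max=6):
    # Tier 1 syncs with the central server (<= 6 s), then each tier-1
    # machine syncs its own 9 clients in parallel (<= 6 s more).
    return 2 * sync_max

groups = assign_groups()
print(worst_case_seconds())  # 12
```

With two levels, the worst case drops from 60 seconds to 12, at the cost of second-tier machines syncing against an already once-removed clock.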
Assuming the central "time server" can be customized, I'd extend it into an observer that holds the list of all servers that need to be synced. Each server (listener) registers itself with the time server when it starts; the time server responds with the time and decides the timestamp at which that server's next sync request should come. "Decides" means it spreads the registered servers across a time period so that no more than 10 requests arrive at the same point in time. The time server can also periodically push sync requests (along with renewed next-sync timestamps and health checks) to the registered servers in parallel, instead of only receiving requests. If a request to a server fails (the server is unreachable), the time server de-registers it; the next time the broken server starts, it simply registers itself again.
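A minimal sketch of this registration/slot-spreading idea, assuming a customizable time server (the class, the slot policy, and the constants are this sketch's assumptions, not a real NTP feature):

```python
class TimeServer:
    """Observer-style time server: servers register themselves, and the
    time server assigns each one a staggered renewal offset so that no
    more than MAX_CONCURRENT sync requests land at the same instant."""
    MAX_CONCURRENT = 10
    SLOT_SECONDS = 6          # assumed worst-case duration of one sync

    def __init__(self):
        self.registered = []  # server ids, in registration order

    def register(self, server_id):
        """Register a server and return its next-sync offset in seconds."""
        if server_id not in self.registered:
            self.registered.append(server_id)
        slot = self.registered.index(server_id) // self.MAX_CONCURRENT
        return slot * self.SLOT_SECONDS

    def deregister(self, server_id):
        """Drop a server whose sync/health-check request failed; it will
        re-register when it comes back up."""
        if server_id in self.registered:
            self.registered.remove(server_id)

ts = TimeServer()
offsets = [ts.register(f"srv-{i}") for i in range(100)]
# first 10 registrants get offset 0, next 10 get 6, ..., last 10 get 54
```

With 100 registrants, the load is spread over ten 6-second slots, so no slot ever holds more than 10 requests.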
- confused_coder August 21, 2016

Way I:
Assuming each server takes the same time to sync, you could divide the cluster into groups of 10 and have the central time server sync them group by group.
pros:
- easy to maintain grouping
- grouping map can be stored on the central time server
cons:
- if the central server fails, resyncing may be needed
- if one or more servers in a group fail to sync or take too long, the other servers in the group are kept waiting
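The "kept waiting" con can be made concrete with a small sketch: under fixed groups of 10 synced in sequence, each round lasts as long as its slowest member (the per-server durations here are made-up inputs for illustration):

```python
def group_sync_time(durations, group_size=10):
    """Way I: the central server syncs fixed groups of 10 in sequence,
    so each round costs the maximum sync time within that group."""
    total = 0.0
    for i in range(0, len(durations), group_size):
        total += max(durations[i:i + group_size])
    return total

# 100 servers all finishing in 3 s, except one straggler taking 30 s:
durations = [3.0] * 100
durations[42] = 30.0
print(group_sync_time(durations))  # 57.0 -- one slow server stalls its whole group
```

With uniform 3-second syncs the total would be 30 seconds; a single 30-second straggler pushes it to 57, which is exactly the weakness Way II addresses.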
Way II:
Let the central server keep three lists: sync'd, syncing, and to-be-sync'd servers.
The central server can start with a random 10 servers, changing their state from to-be-sync'd to syncing and updating the lists.
When a server finishes syncing, its state changes to sync'd and it is added to the sync'd list. A new server can then start syncing as soon as a slot is available.
pros:
- servers don't wait unnecessarily
- a priority can be assigned to certain servers if dependencies exist
- can retry on sync errors
cons:
- if the central server fails, syncing has to be redone. To mitigate this, each server can store its own state, so when the central server comes back online it can poll the cluster to see what remains to be done and rebuild its lists
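Way II can be sketched as a small slot-based scheduler (class and method names are this sketch's assumptions; persistence of per-server state is omitted):

```python
from collections import deque

class SyncScheduler:
    """Way II: track to-be-sync'd, syncing, and sync'd servers, and
    start a new sync whenever one of the 10 slots frees up, so no
    server waits on a slow group-mate."""
    MAX_CONCURRENT = 10

    def __init__(self, servers):
        self.pending = deque(servers)  # to-be-sync'd
        self.syncing = set()
        self.done = set()              # sync'd
        self._fill()

    def _fill(self):
        # Promote pending servers into any free syncing slots.
        while self.pending and len(self.syncing) < self.MAX_CONCURRENT:
            self.syncing.add(self.pending.popleft())

    def finished(self, server_id, ok=True):
        """Report a sync result; failures are re-queued for retry."""
        self.syncing.discard(server_id)
        if ok:
            self.done.add(server_id)
        else:
            self.pending.append(server_id)  # retry on sync error
        self._fill()

sched = SyncScheduler(range(100))
# 10 servers syncing immediately; each completion immediately admits the next
```

Priorities could be added by replacing the deque with a priority queue keyed on dependency order.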