Amazon Interview Question
SDE-2s | Country: India
A Bloom filter seems like a feasible solution: read the file sequentially in chunks and compare each word against the filter. It needs two scans of the file.
1. During the first scan, add every word you encounter to a Bloom filter b1. Whenever a word tests positive in b1 (i.e., it was seen before), add it to a second Bloom filter b2.
2. During the second scan, drop every word that tests positive in b2.
The remaining words have frequency 1 and appear in their original order.
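The two-pass idea above can be sketched as follows. This is a minimal illustration, not the poster's exact code: the Bloom filter parameters (one megabit, 4 hashes) are arbitrary, and since a standard Bloom filter does not support deletion, repeated words are simply added to b2 rather than "removed" from b1 as the answer phrases it. False positives in b2 could wrongly drop a rare unique word; a counting filter or a verification pass would fix that.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, word):
        # Derive k positions by salting the word with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, word):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))

def unique_words_in_order(read_words):
    """read_words: zero-argument callable returning a fresh iterator
    over the file's words, so the 'file' can be scanned twice."""
    b1, b2 = BloomFilter(), BloomFilter()
    # Scan 1: b1 marks "seen at least once", b2 marks "probably repeated".
    for w in read_words():
        if w in b1:
            b2.add(w)
        else:
            b1.add(w)
    # Scan 2: emit words not flagged as repeated, in original order.
    for w in read_words():
        if w not in b2:
            yield w
```

For example, on the word stream `a b c a d b e` this yields `c`, `d`, `e` in order.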
Since RAM is much smaller than the file, we can't load the whole file into memory at once, hence the chunking logic:
1. Divide the file into 30-40 chunks such that at least two chunks fit in memory at a time.
2. Load the i-th chunk into memory. Its words will be compared against the remaining chunks, one at a time.
(a) Load the following chunks one by one and remove from each any word that also occurs in the i-th chunk. After each chunk is processed this way, write it back to the file at its designated position.
(b) Remove the duplicate words within the i-th chunk itself, then write it back to the file at its designated position as well.
3. Repeat step 2 for every chunk until we reach EOF (end of file).
Note for step 2: the start position of the next chunk to be compared is the position immediately after the end of the previously updated chunk (the one written back to the file in 2(b)).
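A sketch of the pairwise chunk comparison, with small in-memory word lists standing in for the on-disk chunks (the real version would read and write chunk files). Note this deduplicates the file, keeping each word's first occurrence, which is how step 2 reads; it differs from the Bloom-filter variant above, which removes every occurrence of a repeated word.

```python
def dedupe_chunks(chunks):
    """Deduplicate words across chunks, keeping first occurrences.

    `chunks` is a list of word lists standing in for the on-disk
    chunks; conceptually only two chunks are in memory at a time.
    """
    for i in range(len(chunks)):
        # Step 2(b): drop repeats inside chunk i, keeping first occurrences.
        seen, deduped = set(), []
        for w in chunks[i]:
            if w not in seen:
                seen.add(w)
                deduped.append(w)
        chunks[i] = deduped  # "write back" the updated chunk
        # Step 2(a): remove from every later chunk any word in chunk i.
        words_i = set(chunks[i])
        for j in range(i + 1, len(chunks)):
            chunks[j] = [w for w in chunks[j] if w not in words_i]
    return chunks
```

On disk this costs roughly one full rewrite of the remainder of the file per chunk, i.e. O(n^2) in the number of chunks, which is the price of staying within the memory budget without sorting.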
Since we are constrained to 100MB of RAM, we can do an external (disk-based) sort of the file in chunks. Once the chunks are sorted, we load each sorted chunk and compare adjacent words to determine whether a word is unique. Once we have the unique words, we revisit the original chunks to restore their order of first appearance.
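The external-sort approach can be sketched like this. It is a toy model: in-memory lists stand in for the sorted run files on disk, and the tiny `chunk_size` stands in for "whatever fits in 100MB"; a real implementation would write each sorted run to a temp file and stream the merge.

```python
import heapq
import itertools

def unique_via_external_sort(words, chunk_size=3):
    """Find frequency-1 words using sorted runs plus a k-way merge."""
    # Phase 1: sort fixed-size runs (each run fits in the memory budget).
    runs = [sorted(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
    # Phase 2: k-way merge the runs; equal words become adjacent, so a
    # word is unique iff its group in the merged stream has length 1.
    unique = {w for w, grp in itertools.groupby(heapq.merge(*runs))
              if sum(1 for _ in grp) == 1}
    # Phase 3: rescan the original stream to restore original order.
    return [w for w in words if w in unique]
```

This costs two sequential passes plus the merge, and unlike the Bloom-filter variant it gives exact answers with no false-positive risk.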
- Beginner December 16, 2017