Facebook Interview Question
Software Engineer / Developers
Country: United States
This is the solution in my mind as well, but the solution to this problem is a lot of code (creating multiple files, file I/O code, merge sort for the individual files, and then an N-way merge). Does Facebook expect a complete end-to-end solution?
If you have O(n) space for creating a max heap, why not use a hash map to avoid sorting, which has O(n log n) runtime complexity?
Look at the solution provided above and let me know your opinion.
You are right. A max heap alone is not a solution to me either, but external sort is. In an external sort, typically, the sorted data is saved back to the HDD and then you perform an N-way merge on the files.
Linux command:
cat filename.txt | tr -cs "[:alnum:]" "\n" | tr "[:lower:]" "[:upper:]" | awk '{h[$1]++} END {for (i in h) print h[i] " " i}' | sort -nr | cat -n | head -n 10
Your heap can go out of memory; you need to write out the intermediate results. Then create a min-heap of size 10, read back the counts computed in the previous steps, and insert them into the heap, evicting the minimum whenever it grows past 10. Finally, the heap will contain the 10 most frequent strings.
Well, there could be two different definitions of "word".
1. A word in an English dictionary; the biggest I have seen so far has 400k words.
2. A sequence of characters separated by separators such as spaces, full stops, etc. If we can assume that every such sequence in our terabyte of input is a word in some language's dictionary, then the total set of distinct words is bounded and can easily be loaded into an in-memory hash table.
Instead of using hash tables, we can use a different data structure to track the frequency of occurrences: a trie. I've written a basic program for any number of words based on the input.
Each node holds the count of occurrences of the word ending at that node. Based on the next letter, we create an array of 26 child nodes (assuming all letters are lowercase; mixed case would just need a larger alphabet) from the first node, and the process repeats until the last letter.
#include <iostream>
#include <string>

// Trie node: count of words ending here, plus one child array per node.
struct node
{
    int count;
    node* child;
};

int main()
{
    // Value-initialize so count is 0 and child is NULL in every node
    // (a plain `new node[26]` would leave both fields uninitialized).
    node* fnode = new node[26]();
    std::string str;
    int numStr;
    std::cin >> numStr;
    for (int i = 0; i < numStr; i++)
    {
        std::cin >> str;
        node* n = fnode + (str[0] - 'a');  // assumes lowercase a-z input
        for (size_t j = 1; j < str.length(); j++)
        {
            if (n->child == NULL)
            {
                n->child = new node[26]();
            }
            n = n->child + (str[j] - 'a');
        }
        n->count++;
        std::cout << n->count << "\n";  // running count for this word
    }
}
create a min-heap of 10 nodes. The node structure would look like:
struct node {
    int frequency;
    char* word;
    node* left;
    node* right;
};
The min-heap is ordered by word frequency. Also maintain a concurrent hash map backing it. On every hash-map update, check whether the frequency of the updated element is greater than the minimum frequency currently at the top of the min-heap; if so, delete the min-heap element and place it back into the hash map (it could later re-enter the min-heap), then insert the updated element into the min-heap and heapify. The insertion into the min-heap has to be an atomic operation. I think this would have good throughput.
Alternatively, map-reduce could also be the way to go.
Since there is a terabyte of strings, a modified trie combined with a min-heap would be a good fit.
In the trie, each node holds not only an "IsEnd" boolean but also a count.
When a string is inserted, its count is incremented. If the count is greater than the top element of the min-heap and the heap already holds 10 elements, remove the top element and add this one.
When all the strings have been read, pop the elements and store them in an array.
Print the array in reverse order, which gives the top 10 most frequent elements.
You can if you buffer line by line and keep a map of word to counter. I can't think of a solution if you have one terabyte of strings with all unique words.
What if the string is the list of all words in a dictionary? All words are unique and it takes much more than one terabyte in a hashmap given its overhead.
One mitigation to having one terabyte worth of unique words is to use a trie. The words (or keys to the hash table) will be compressed to the degree of common prefixes between them.
If the string contains all unique English words (worst case), that means about 150,000 words. If the length of each word is on average 10 bytes, plus 4 bytes for keeping the number of occurrences, that is 14 bytes per word in total; the hash table will be 2GB, which is not unreasonable and can be handled by a desktop PC or a laptop.
@Zasa1010
First of all, a data structure that hogs 2GB of memory IS unreasonable in any scaling application; 2GB for one data structure is ridiculous. Secondly, Oxford has about 300k words, so the requirement is 4GB. Thirdly, in real-life applications like Facebook, words are not limited to just English; there are other things like usernames, which can themselves go into the millions.
So in short, storing everything in memory won't work. Even if it's just the English dictionary, I am sure the interviewer will not be happy with such a program.
Is it just me or is all the math in this thread incorrect? 300,000 words, at an average of 10 bytes each and 4 bytes to hold their count is 300000 * 14 = 4,200,000 bytes = 4.2MB, not 2-4GB as previously stated. Or am I missing something?
exp is right; it's 4.2 MB, not 4.2 GB. This solution is fine if the interviewer says to assume English. I would be interested in how to solve the problem if it's not English... like UTF-32 strings, even.
I think whenever an interviewer specifically mentions a "terabyte of strings", it does not imply an in-memory solution, because otherwise "terabyte" has no significance. It should be an out-of-memory solution, such as a disk-based or distributed approach like map-reduce. But the point is that these solutions are not possible to code in a real-time interview environment.
Nasser is right. Consider two parallel buckets. One comes up with a top ten where the last entry is Y at 91 occurrences and the 11th is X at 90. The other bucket's last entry is Z at 91, its 11th is X at 90, and it has no Y at all. X actually appeared 180 times and Y only 91, but if you merge only the top 10 of each, the answer will not have X in it.
Would tracking more than the top 10 in each bucket and merging more than 10 from each guarantee the correct top 10? I don't know.
1. Do External Sort (merge sort in file).
- Martin, August 27, 2012
2. Collapse duplicates to "num, word", e.g. "1025, the".
3. While doing step 2, keep a max heap of word frequencies.
This way there is no extra storage needed, but the complexity is O(N*log(N)).
If extra storage is permitted, you can build a hash file, where line number corresponds to hash index. You'll probably end up with using twice the original storage though. Also, random access on disk will probably be less efficient than the sorting in External Sort, since lines will be read sequentially and written back sequentially.