Facebook Interview Question
Software Engineer / Developers
Country: United States
This is the solution in my mind as well, but the solution to this problem is a lot of code (creating multiple files, file I/O code, merge sort for the individual files, and then an N-way merge). Does Facebook expect a complete end-to-end solution?
If you have O(n) space for creating a max heap, why not use a hash map to avoid sorting, which has O(n log n) runtime complexity?
Look at the solution provided above and let me know your opinion.
You are right. A max heap alone is not a solution to me either, but external sort is. In an external sort, typically, the sorted data is saved back to the HDD and then you perform an N-way merge on the files.
Linux command:
cat filename.txt | tr -cs "[:alnum:]" "\n" | tr "[:lower:]" "[:upper:]" | awk '{h[$1]++} END {for (i in h) print h[i] " " i}' | sort -nr | cat -n | head -n 10
Your heap can go out of memory; you need to write out the intermediate results. Then create a min-heap of size 10, read back the counts computed in the previous steps, and insert them into the heap, evicting the minimum whenever it grows past 10. Finally, the heap will contain the 10 most frequent strings.
Well, there could be two different definitions of "word".
1. A word in an English dictionary; the biggest I have seen so far has 400k words.
2. A sequence of characters separated by separators such as spaces, full stops, etc. If we can assume that every such sequence in our terabyte of input is a word in some language's dictionary, then the total set of distinct words is bounded and can easily be loaded into an in-memory hash table.
Instead of using hash tables, we can use a different data structure to track the frequency of occurrences: a trie. I've written a basic program for any number of words based on the input.
Each node holds the count of occurrences of the word ending at that node. Based on the next letter, we create an array of 26 child nodes (assuming all letters are lowercase; mixed case would just need a larger alphabet) from the first node, and the process repeats until the last letter.
#include <iostream>
#include <string>

// Trie node: count of words ending here, plus one child array per node.
struct node
{
    int count;
    node* child;
};

int main()
{
    // Value-initialize so count is 0 and child is NULL in every node
    // (a plain `new node[26]` would leave both fields uninitialized).
    node* fnode = new node[26]();
    std::string str;
    int numStr;
    std::cin >> numStr;
    for (int i = 0; i < numStr; i++)
    {
        std::cin >> str;
        node* n = fnode + (str[0] - 'a');  // assumes lowercase a-z input
        for (size_t j = 1; j < str.length(); j++)
        {
            if (n->child == NULL)
            {
                n->child = new node[26]();
            }
            n = n->child + (str[j] - 'a');
        }
        n->count++;
        std::cout << n->count << "\n";  // running count for this word
    }
}
create a min-heap of 10 nodes. The node structure would look like:
struct node {
    int frequency;
    char* word;
    node* left;
    node* right;
};
The min-heap is ordered by word frequency. Also maintain a concurrent hash map backing it. On every hash-map update, check whether the frequency of the updated element is greater than the minimum frequency currently at the top of the min-heap; if so, delete the min-heap element and place it back into the hash map (it could later re-enter the min-heap), then insert the updated element into the min-heap and heapify. The insertion into the min-heap has to be an atomic operation. I think this would have good throughput.
Alternatively, map-reduce could also be the way to go.
Since there is a terabyte of strings, a modified trie combined with a min-heap would be a good fit.
In the trie, each node holds not only an "IsEnd" boolean but also a count.
When a string is inserted, its count is incremented. If the count is greater than the top element of the min-heap and the heap already holds 10 elements, remove the top element and add this one.
When all the strings have been read, pop the elements and store them in an array.
Print the array in reverse order, which gives the top 10 most frequent elements.
You can if you buffer line by line and keep a map of word to counter. I can't think of a solution if you have one terabyte of strings with all unique words.
What if the string is the list of all words in a dictionary? All words are unique and it takes much more than one terabyte in a hashmap given its overhead.
One mitigation to having one terabyte worth of unique words is to use a trie. The words (or keys to the hash table) will be compressed to the degree of common prefixes between them.
If the string contains all unique English words (worst case), that means about 150,000 words. If the length of each word is on average 10 bytes, plus 4 bytes for keeping the number of occurrences, that is 14 bytes per word in total; the hash table will be 2GB, which is not unreasonable and can be handled by a desktop PC or a laptop.
@Zasa1010
First of all, a data structure that hogs 2GB of memory IS unreasonable in any scaling application; 2GB for one data structure is ridiculous. Secondly, Oxford has about 300k words, so the requirement is 4GB. Thirdly, in real-life applications like Facebook, words are not limited to just English; there are other things like usernames, which can themselves go into the millions.
So in short, storing everything in memory won't work. Even if it's just the English dictionary, I am sure the interviewer will not be happy with such a program.
Is it just me or is all the math in this thread incorrect? 300,000 words, at an average of 10 bytes each and 4 bytes to hold their count is 300000 * 14 = 4,200,000 bytes = 4.2MB, not 2-4GB as previously stated. Or am I missing something?
exp is right; it's 4.2 MB, not 4.2 GB. This solution is fine if the interviewer says to assume English. I would be interested in how to solve the problem if it's not English... like UTF-32 strings, even.
I think whenever an interviewer specifically mentions a "terabyte of strings", it does not imply an in-memory solution, because otherwise "terabyte" has no significance. It should be an out-of-memory solution, such as a disk-based or distributed approach like map-reduce. But the point is that these solutions are not possible to code in a real-time interview environment.
Nasser is right. Consider two parallel buckets. One comes up with a top ten where the last entry is Y at 91 occurrences and the 11th is X at 90. The other bucket's last entry is Z at 91, its 11th is X at 90, and it has no Y at all. X actually appeared 180 times and Y only 91, but if you merge only the top 10 of each, the answer will not have X in it.
Would tracking more than the top 10 in each bucket and merging more than 10 from each guarantee the correct top 10? I don't know.
1. Do External Sort (merge sort in file).
- Martin, August 27, 2012
2. Collapse duplicates to "num, word", e.g. "1025, the".
3. While doing step 2, keep a max heap of word frequencies.
This way there is no extra storage needed, but the complexity is O(N*log(N)).
If extra storage is permitted, you can build a hash file, where line number corresponds to hash index. You'll probably end up with using twice the original storage though. Also, random access on disk will probably be less efficient than the sorting in External Sort, since lines will be read sequentially and written back sequentially.