Google Interview Question
SDE-2 | Country: United States
Taking into account that a username is mostly ASCII with a max length of seven (a common standard), and assuming 4 bytes per char (plain ASCII needs only 1 byte; 4 covers a wide/UTF-32 encoding), for a billion usernames that is:
4 * 7 * 1,000,000,000 = 28 * 10^9 bytes ≈ 26.08 GiB
On a 32-bit system with more than 4 GB of RAM (e.g. a GNU/Linux OS with a PAE kernel), split the file into 2 GB chunks (13 portions of the file, external sort) to be processed in parallel by 13 separate (not forked) processes; MapReduce is one approach here.
A trie might be one option to "compress" these files, e.g. the root branches to a -> b -> c -> ... -> z (using just a..z), and so on recursively at every level.
Assuming the username can use any of 24 characters:
For a 24-ary tree the levels hold 24^0, 24^1, 24^2, ... nodes, so the total number of nodes in a full tree is nodes = (k^(h+1) - 1)/(k - 1); with h = 6 (7 chars) and k = 24, at 4 bytes per node that gives roughly 3.46 GB as an upper bound, which again cannot be allocated by a 32-bit process (the maximum allocatable is around 3.2 GB). That is not entirely true, though: since only 7-char usernames are stored, the trie will never actually reach that 3.4 GB worst case. The complexity of searching a username depends only on the tree height.
Another approach is bit manipulation: use a bit vector (an int) per node to mark which next characters are present, e.g. 0000 0000 0000 0000 0001 0101 means a, c, e are used. This cuts each node down to at most 3 bytes (24 bits), although storing it as an integer still takes 4 bytes anyway; the nodes can be kept in an array int[10^9], about 3.7 GB, keeping in mind that the bits over the array are navigated as a tree.
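A toy sketch of that bit-vector node in Java, assuming a 24-character lowercase alphabet; the class and method names are made up for illustration only:

// Toy sketch of a 24-bit child mask per trie node: bit i is set when the
// child for character ('a' + i) exists. Names here are illustrative only.
public class ChildMask {
    private int bits; // only the low 24 bits are used

    public void setChild(char c) {
        bits |= 1 << (c - 'a');          // mark child as present
    }

    public boolean hasChild(char c) {
        return (bits & (1 << (c - 'a'))) != 0;
    }

    public static void main(String[] args) {
        ChildMask node = new ChildMask();
        node.setChild('a');
        node.setChild('c');
        node.setChild('e');
        System.out.println(Integer.toBinaryString(node.bits)); // 10101
        System.out.println(node.hasChild('c'));                // true
        System.out.println(node.hasChild('b'));                // false
    }
}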
IMHO
Two approaches:
1. Build a hash set from the smaller file and check each username of the other file for a match: O(n) time, O(n) space (sketched right after this list).
2. Sort both files, O(n log n), then find the matches with a two-finger (two-pointer) pass, O(n); total complexity O(n log n) + O(n) = O(n log n).
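A minimal sketch of approach 1 in Java, assuming the smaller file fits in memory (the file names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Approach 1: load the smaller file into a hash set, then stream the larger
// file and report every username that is already in the set.
public class HashIntersect {
    public static void main(String[] args) throws IOException {
        Set<String> smaller = new HashSet<>();
        try (BufferedReader r = new BufferedReader(new FileReader("smaller.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                smaller.add(line);
            }
        }
        try (BufferedReader r = new BufferedReader(new FileReader("larger.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (smaller.contains(line)) {
                    System.out.println(line); // username appears in both files
                }
            }
        }
    }
}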
I think the point of mentioning "billion usernames" is that they won't fit in memory. So the question is more about exploring other options
What I would do, since clearly I cannot load both files into memory (probably not even either one alone): I would still sort them (external merge sort, see Wikipedia). Then I load as much of each file as fits in memory, for example 512 KB from each, and perform a merge where I keep only the last element and the list it came from; when the same username shows up from both lists, I add it to my result set. When a chunk is exhausted, I load the next one.
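A minimal sketch of the merge step in Java, assuming both files have already been externally sorted line by line; BufferedReader takes care of the chunked reading, and the file names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Merge step over two already-sorted username files: advance the reader whose
// current line is smaller, and report a username when both lines match.
public class SortedIntersect {
    public static void main(String[] args) throws IOException {
        try (BufferedReader a = new BufferedReader(new FileReader("sorted_a.txt"));
             BufferedReader b = new BufferedReader(new FileReader("sorted_b.txt"))) {
            String x = a.readLine();
            String y = b.readLine();
            while (x != null && y != null) {
                int cmp = x.compareTo(y);
                if (cmp == 0) {
                    System.out.println(x);   // present in both files
                    x = a.readLine();
                    y = b.readLine();
                } else if (cmp < 0) {
                    x = a.readLine();        // a is behind, advance it
                } else {
                    y = b.readLine();        // b is behind, advance it
                }
            }
        }
    }
}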
How about using Bloom filters? They can help in checking whether an element is present or not.
They might lead to false positives, though (reporting an element as present when it is not). In that case, fall back to other approaches, like hash functions on the smaller files.
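A toy sketch of the idea, with an untuned bit-set size and two ad-hoc hash functions; all parameters here are illustrative assumptions, not a production filter:

import java.util.BitSet;

// Toy Bloom filter: k = 2 hash functions over an m-bit BitSet. mightContain()
// can return false positives but never false negatives.
public class TinyBloomFilter {
    private final BitSet bits;
    private final int m;

    public TinyBloomFilter(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    private int h1(String s) {
        return Math.floorMod(s.hashCode(), m);
    }

    private int h2(String s) {
        return Math.floorMod(s.hashCode() * 31 + s.length(), m);
    }

    public void add(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    public boolean mightContain(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1 << 20);
        filter.add("alice");
        System.out.println(filter.mightContain("alice")); // true
        System.out.println(filter.mightContain("bob"));   // probably false
    }
}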
Your question about hashing, which you posted right after this one, suggests you already figured out what to do here.
This is a sort of open-ended question. It is crucial to ask the interviewer for more details:
- Are we using a single core or a cluster of CPUs?
- How much memory is available? Do the files fit in memory?
- Can we create auxiliary files?
Assuming 1 CPU core, a possible approach could be to partition the usernames into buckets, each bucket containing a portion of the valid usernames.
Assume for simplicity that the alphabet used in the usernames is only the letters A-Z. One possible partition would be to send all usernames starting with A to one file, B to another file, etc. Then each file will hold a fraction of the billion usernames and we can intersect the usernames from the log files.
The above approach is very simple and won't partition the usernames into the buckets uniformly. We should discuss it further with the interviewer. We could check the usual distribution of usernames over the alphabet and partition accordingly (starting with A is probably more likely than starting with Q). Of course we don't need to split just by the starting character; we could use the first 2, 3, etc. characters.
Note: don't assume the usernames use the Latin charset. Ask first whether other characters like Chinese or Japanese are used (Unicode in general).
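A minimal sketch of that bucketing pass, assuming A-Z usernames and placeholder file names; run it once per input file, then intersect the matching bucket files with any in-memory method:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Partition a username file into 26 bucket files by first character (A-Z),
// so that each bucket is small enough to intersect in memory later.
public class BucketPartition {
    public static void main(String[] args) throws IOException {
        PrintWriter[] buckets = new PrintWriter[26];
        for (int i = 0; i < 26; i++) {
            buckets[i] = new PrintWriter("bucket_" + (char) ('A' + i) + ".txt");
        }
        try (BufferedReader r = new BufferedReader(new FileReader("usernames.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                int idx = Character.toUpperCase(line.charAt(0)) - 'A';
                if (idx >= 0 && idx < 26) {
                    buckets[idx].println(line); // route to its bucket file
                }
            }
        } finally {
            for (PrintWriter w : buckets) {
                w.close();
            }
        }
    }
}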
Sorting looks like overkill to me. Instead I would build an index of string hash codes, saving entries like {hashcode, stringFileLocationArray} in an AVL tree; that takes O(n log n) to build.
Once this structure is in place, scan the second file: pick each username, look up its hash code in the index (O(n log n) overall); if the hash code matches, compare against the actual string in the file.
You can use a cache of strings read from the file (some kind of LRU, etc.) to speed up the matching. You can also do some I/O optimization by adding all matched hash codes to a queue and ordering the queue entries by file location.
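A minimal sketch of such an index, using java.util.TreeMap (a red-black tree) as a stand-in for the AVL tree; the class and method names are illustrative:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Index of hashCode -> file offsets of the usernames that have that hash code.
// A TreeMap (red-black tree) stands in for the AVL tree described above.
public class HashCodeIndex {
    private final TreeMap<Integer, List<Long>> index = new TreeMap<>();

    // Record that the username at the given file offset has this hash code.
    public void add(String username, long fileOffset) {
        index.computeIfAbsent(username.hashCode(), k -> new ArrayList<>())
             .add(fileOffset);
    }

    // Candidate offsets in the first file whose stored username *might* equal
    // this one; the caller still reads the file and compares the strings,
    // because different usernames can share a hash code.
    public List<Long> candidates(String username) {
        return index.getOrDefault(username.hashCode(), List.of());
    }
}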
Using a trie (dictionary) is better.
Let the length of a username be O(k).
First, load the first file and build the dictionary: O(kn) time and O(kn) space.
Second, load the second file; for each username, check whether it exists in the dictionary, and output it if it does: O(kn) time.
In the general case k is a small constant, so the algorithm achieves O(n) time and O(n) space.
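A minimal trie sketch in Java, assuming a lowercase a-z alphabet:

// Minimal trie over lowercase a-z: insert usernames from the first file,
// then call contains() for each username from the second file.
public class Trie {
    private static final int ALPHABET = 26;

    private static class Node {
        Node[] children = new Node[ALPHABET];
        boolean isWord; // true if a username ends at this node
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node cur = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (cur.children[c] == null) {
                cur.children[c] = new Node();
            }
            cur = cur.children[c];
        }
        cur.isWord = true;
    }

    public boolean contains(String word) {
        Node cur = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (cur.children[c] == null) {
                return false;
            }
            cur = cur.children[c];
        }
        return cur.isWord;
    }
}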
How about hashing both files simultaneously: take one username from one file and put it in a hash map (bucket hashing), then take one username from the second file and do the same. Do this for all 2 billion usernames. Once done, space is O(2n) = O(n).
Pick any username and you can check in O(1) whether it has a duplicate (worst case O(m), m being the bucket size, which depends on the hashing function).
Do this for all 2N records, which is O(2n) = O(n).
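A small sketch of that idea, tagging each username with the file(s) it was seen in; the file names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Put usernames from both files into one map, tagging which file(s) each name
// was seen in; names tagged with both files are the duplicates we want.
public class TaggedHashing {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> seen = new HashMap<>();
        addFile(seen, "file_a.txt", 1); // bit 0: seen in file A
        addFile(seen, "file_b.txt", 2); // bit 1: seen in file B
        for (Map.Entry<String, Integer> e : seen.entrySet()) {
            if (e.getValue() == 3) {
                System.out.println(e.getKey()); // present in both files
            }
        }
    }

    private static void addFile(Map<String, Integer> seen, String path, int tag)
            throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = r.readLine()) != null) {
                seen.merge(line, tag, (a, b) -> a | b);
            }
        }
    }
}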
I believe the only way to solve this efficiently is the famous B-tree data structure. Building the tree takes O(N log N), but it has to be done only once. You can keep only some parts of the tree, called pages, in memory and store the rest on the file system, so a huge number of usernames is no problem, and you still get O(log N) to check whether a username already exists in the tree.
How about this -
Use a trie-like data structure storing all valid usernames. Populate it in one pass over the first list, marking the words present.
Next, pass over the second list and look each username up in the trie.
Report the ones already present in the trie.