Google Interview Question
SDE-2 | Country: United States
Taking into account that a username is mostly ASCII with a max length of seven (a common standard), and assuming 4 bytes per char (plain ASCII needs only 1 byte; 4 covers a wide/UTF-32 encoding), for a billion usernames that is:
4 * 7 * 1,000,000,000 = 28 * 10^9 bytes ≈ 26.08 GiB
On a 32-bit system with more than 4 GB of RAM (e.g. a GNU/Linux OS with a PAE kernel), split the file into 2 GB chunks (13 portions of the file, external sort) to be processed in parallel by 13 separate (not forked) processes; MapReduce is one approach here.
A trie might be one option to "compress" these files, e.g. the root branches to a -> b -> c -> ... -> z (using just a..z), and so on recursively at every level.
Assuming the username can use any of 24 characters:
For a 24-ary tree the levels hold 24^0, 24^1, 24^2, ... nodes, so the total number of nodes in a full tree is nodes = (k^(h+1) - 1)/(k - 1); with h = 6 (7 chars) and k = 24, at 4 bytes per node that gives roughly 3.46 GB as an upper bound, which again cannot be allocated by a 32-bit process (the maximum allocatable is around 3.2 GB). That is not entirely true, though: since only 7-char usernames are stored, the trie will never actually reach that 3.4 GB worst case. The complexity of searching a username depends only on the tree height.
Another approach is bit manipulation: use a bit vector (an int) per node to mark which next characters are present, e.g. 0000 0000 0000 0000 0001 0101 means a, c, e are used. This cuts each node down to at most 3 bytes (24 bits), although storing it as an integer still takes 4 bytes anyway; the nodes can be kept in an array int[10^9], about 3.7 GB, keeping in mind that the bits over the array are navigated as a tree.
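A toy sketch of that bit-vector node in Java, assuming a 24-character lowercase alphabet; the class and method names are made up for illustration only:

// Toy sketch of a 24-bit child mask per trie node: bit i is set when the
// child for character ('a' + i) exists. Names here are illustrative only.
public class ChildMask {
    private int bits; // only the low 24 bits are used

    public void setChild(char c) {
        bits |= 1 << (c - 'a');          // mark child as present
    }

    public boolean hasChild(char c) {
        return (bits & (1 << (c - 'a'))) != 0;
    }

    public static void main(String[] args) {
        ChildMask node = new ChildMask();
        node.setChild('a');
        node.setChild('c');
        node.setChild('e');
        System.out.println(Integer.toBinaryString(node.bits)); // 10101
        System.out.println(node.hasChild('c'));                // true
        System.out.println(node.hasChild('b'));                // false
    }
}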
IMHO
Two approaches:
1. Build a hash set from the smaller file and check each username of the other file for a match: O(n) time, O(n) space (sketched right after this list).
2. Sort both files, O(n log n), then find the matches with a two-finger (two-pointer) pass, O(n); total complexity O(n log n) + O(n) = O(n log n).
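A minimal sketch of approach 1 in Java, assuming the smaller file fits in memory (the file names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Approach 1: load the smaller file into a hash set, then stream the larger
// file and report every username that is already in the set.
public class HashIntersect {
    public static void main(String[] args) throws IOException {
        Set<String> smaller = new HashSet<>();
        try (BufferedReader r = new BufferedReader(new FileReader("smaller.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                smaller.add(line);
            }
        }
        try (BufferedReader r = new BufferedReader(new FileReader("larger.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (smaller.contains(line)) {
                    System.out.println(line); // username appears in both files
                }
            }
        }
    }
}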
I think the point of mentioning "billion usernames" is that they won't fit in memory. So the question is more about exploring other options
What I would do, since clearly I cannot load both files into memory (probably not even either one alone): I would still sort them (external merge sort, see Wikipedia). Then I load as much of each file as fits in memory, for example 512 KB from each, and perform a merge where I keep only the last element and the list it came from; when the same username shows up from both lists, I add it to my result set. When a chunk is exhausted, I load the next one.
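A minimal sketch of the merge step in Java, assuming both files have already been externally sorted line by line; BufferedReader takes care of the chunked reading, and the file names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Merge step over two already-sorted username files: advance the reader whose
// current line is smaller, and report a username when both lines match.
public class SortedIntersect {
    public static void main(String[] args) throws IOException {
        try (BufferedReader a = new BufferedReader(new FileReader("sorted_a.txt"));
             BufferedReader b = new BufferedReader(new FileReader("sorted_b.txt"))) {
            String x = a.readLine();
            String y = b.readLine();
            while (x != null && y != null) {
                int cmp = x.compareTo(y);
                if (cmp == 0) {
                    System.out.println(x);   // present in both files
                    x = a.readLine();
                    y = b.readLine();
                } else if (cmp < 0) {
                    x = a.readLine();        // a is behind, advance it
                } else {
                    y = b.readLine();        // b is behind, advance it
                }
            }
        }
    }
}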
How about using Bloom filters? They can help in checking whether an element is present or not.
They might lead to false positives, though (reporting an element as present when it is not). In that case, fall back to other approaches, like hash functions on the smaller files.
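A toy sketch of the idea, with an untuned bit-set size and two ad-hoc hash functions; all parameters here are illustrative assumptions, not a production filter:

import java.util.BitSet;

// Toy Bloom filter: k = 2 hash functions over an m-bit BitSet. mightContain()
// can return false positives but never false negatives.
public class TinyBloomFilter {
    private final BitSet bits;
    private final int m;

    public TinyBloomFilter(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    private int h1(String s) {
        return Math.floorMod(s.hashCode(), m);
    }

    private int h2(String s) {
        return Math.floorMod(s.hashCode() * 31 + s.length(), m);
    }

    public void add(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    public boolean mightContain(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1 << 20);
        filter.add("alice");
        System.out.println(filter.mightContain("alice")); // true
        System.out.println(filter.mightContain("bob"));   // probably false
    }
}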
Your question about hashing, which you posted right after this one, suggests you already figured out what to do here.
This is a sort of open-ended question. It is crucial to ask the interviewer for more details:
- Are we using a single core or a cluster of CPUs?
- How much memory is available? Do the files fit in memory?
- Can we create auxiliary files?
Assuming 1 CPU core, a possible approach could be to partition the usernames into buckets, each bucket containing a portion of the valid usernames.
Assume for simplicity that the alphabet used in the usernames is only the letters A-Z. One possible partition would be to send all usernames starting with A to one file, B to another file, etc. Then each file will hold a fraction of the billion usernames and we can intersect the usernames from the log files.
The above approach is very simple and won't partition the usernames into the buckets uniformly. We should discuss it further with the interviewer. We could check the usual distribution of usernames over the alphabet and partition accordingly (starting with A is probably more likely than starting with Q). Of course we don't need to split just by the starting character; we could use the first 2, 3, etc. characters.
Note: don't assume the usernames use the Latin charset. Ask first whether other characters like Chinese or Japanese are used (Unicode in general).
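A minimal sketch of that bucketing pass, assuming A-Z usernames and placeholder file names; run it once per input file, then intersect the matching bucket files with any in-memory method:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Partition a username file into 26 bucket files by first character (A-Z),
// so that each bucket is small enough to intersect in memory later.
public class BucketPartition {
    public static void main(String[] args) throws IOException {
        PrintWriter[] buckets = new PrintWriter[26];
        for (int i = 0; i < 26; i++) {
            buckets[i] = new PrintWriter("bucket_" + (char) ('A' + i) + ".txt");
        }
        try (BufferedReader r = new BufferedReader(new FileReader("usernames.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                int idx = Character.toUpperCase(line.charAt(0)) - 'A';
                if (idx >= 0 && idx < 26) {
                    buckets[idx].println(line); // route to its bucket file
                }
            }
        } finally {
            for (PrintWriter w : buckets) {
                w.close();
            }
        }
    }
}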
Sorting looks like overkill to me. Instead I would build an index of string hash codes, saving entries like {hashcode, stringFileLocationArray} in an AVL tree; that takes O(n log n) to build.
Once this structure is in place, scan the second file: pick each username, look up its hash code in the index (O(n log n) overall); if the hash code matches, compare against the actual string in the file.
You can use a cache of strings read from the file (some kind of LRU, etc.) to speed up the matching. You can also do some I/O optimization by adding all matched hash codes to a queue and ordering the queue entries by file location.
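A minimal sketch of such an index, using java.util.TreeMap (a red-black tree) as a stand-in for the AVL tree; the class and method names are illustrative:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Index of hashCode -> file offsets of the usernames that have that hash code.
// A TreeMap (red-black tree) stands in for the AVL tree described above.
public class HashCodeIndex {
    private final TreeMap<Integer, List<Long>> index = new TreeMap<>();

    // Record that the username at the given file offset has this hash code.
    public void add(String username, long fileOffset) {
        index.computeIfAbsent(username.hashCode(), k -> new ArrayList<>())
             .add(fileOffset);
    }

    // Candidate offsets in the first file whose stored username *might* equal
    // this one; the caller still reads the file and compares the strings,
    // because different usernames can share a hash code.
    public List<Long> candidates(String username) {
        return index.getOrDefault(username.hashCode(), List.of());
    }
}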
Using a trie (dictionary) is better.
Let the length of a username be O(k).
First, load the first file and build the dictionary: O(kn) time and O(kn) space.
Second, load the second file; for each username, check whether it exists in the dictionary, and output it if it does: O(kn) time.
In the general case k is a small constant, so the algorithm achieves O(n) time and O(n) space.
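A minimal trie sketch in Java, assuming a lowercase a-z alphabet:

// Minimal trie over lowercase a-z: insert usernames from the first file,
// then call contains() for each username from the second file.
public class Trie {
    private static final int ALPHABET = 26;

    private static class Node {
        Node[] children = new Node[ALPHABET];
        boolean isWord; // true if a username ends at this node
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node cur = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (cur.children[c] == null) {
                cur.children[c] = new Node();
            }
            cur = cur.children[c];
        }
        cur.isWord = true;
    }

    public boolean contains(String word) {
        Node cur = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (cur.children[c] == null) {
                return false;
            }
            cur = cur.children[c];
        }
        return cur.isWord;
    }
}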
How about hashing both files simultaneously: take one username from one file and put it in a hash map (bucket hashing), then take one username from the second file and do the same. Do this for all 2 billion usernames. Once done, space is O(2n) = O(n).
Pick any username and you can check in O(1) whether it has a duplicate (worst case O(m), m being the bucket size, which depends on the hashing function).
Do this for all 2N records, which is O(2n) = O(n).
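A small sketch of that idea, tagging each username with the file(s) it was seen in; the file names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Put usernames from both files into one map, tagging which file(s) each name
// was seen in; names tagged with both files are the duplicates we want.
public class TaggedHashing {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> seen = new HashMap<>();
        addFile(seen, "file_a.txt", 1); // bit 0: seen in file A
        addFile(seen, "file_b.txt", 2); // bit 1: seen in file B
        for (Map.Entry<String, Integer> e : seen.entrySet()) {
            if (e.getValue() == 3) {
                System.out.println(e.getKey()); // present in both files
            }
        }
    }

    private static void addFile(Map<String, Integer> seen, String path, int tag)
            throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = r.readLine()) != null) {
                seen.merge(line, tag, (a, b) -> a | b);
            }
        }
    }
}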
I believe the only way to solve this efficiently is the famous B-tree data structure. Building the tree takes O(N log N), but it has to be done only once. You can keep only some parts of the tree, called pages, in memory and store the rest on the file system, so a huge number of usernames is no problem, and you still get O(log N) to check whether a username already exists in the tree.
How about this -
Use a trie-like data structure storing all valid usernames. Populate it in one pass over the first list, marking the words present.
Next, pass over the second list and look each username up in the trie.
Report the ones already present in the trie.