Google Interview Question
Java DevelopersCountry: United States
Opposed to exact content, one might want to have a comparison that is robust against typos, layout dependent differences (e.g. table layout vs. column layout), intended introduction of minor differences, etc.
To minimize false positives, the approaches previously described work fine. If you want to have a balance between false positives and false negatives, the following approach is be better suited:
1) strip html tags off, take the text, strip filling words (words with no content, like "and", "it", "a", ...) from these words, do word stemming (reading = read, etc...) of remaining words and then count the occurrence of each word, so you get two multi-dimensional vectors (k dimensions, for each word stem a dimension)
2) feed the two vectors into a comparison function (like cosine-similarity) and now define a threshold when do you consider the page same content and when not. Clearly, if you have a database to test against (where you know the result you want to achieve) you have an advantage.
I would iterate the whole text and eliminate the tags (i.e. <div> <ul> and so on). After that I would remove the white spaces and compare the text.
- sosuliviu December 02, 2017For a very long string I have no idea. Above I am iterating the text only on time.