Facebook Interview Question
Country: United States
I would use a hash map in which the keys are the pages and the values are the lists of users that visited them. Below is Python code that illustrates how to solve the queries using the proposed data structure:
from collections import defaultdict

def build_query_structure(data):
    d = defaultdict(list)
    for user, page in data:
        d[page].append(user)
    return d

query_data = [('u1', 'p4'), ('u3', 'p2'), ('u7', 'p9')]
query = build_query_structure(query_data)

# Query 1: pages visited by exactly 100 (distinct) users
print([p for p in query if len(set(query[p])) == 100])

# Query 2: pages visited by only one user, exactly 15 times
print([p for p in query if len(query[p]) == 15 and len(set(query[p])) == 1])

# Query 3: pages visited by u3 more than 20 times
print([p for p in query if query[p].count('u3') > 20])
Struct PageUserStat {
pages: array of pointers to Page objects, sorted by visitorCount
users: array of pointers to User objects, sorted by userId
}
Page {
pageId: int
visitorCount: int
visitCount: int
}
User {
userId: int
pageCounts: array of pointers to Page objects, sorted by visitCount
}
This provides:
Query 1/2: Θ/Ω(log P) best case, O(P) worst case; the worst case occurs when a large number of pages satisfies the condition.
Query 3: Θ/Ω(log U + log P) best case, O(P) worst case; the worst case occurs when a large number of pages satisfies the condition.
U is the number of distinct users.
P is the number of distinct pages.
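A minimal sketch of this idea in Python (identifiers hypothetical), using a sorted array plus bisect in place of the pointer array sorted by visitorCount, which gives the O(log P + k) lookup claimed above:

```python
import bisect
from collections import defaultdict, Counter

visits = [('u1', 'p1'), ('u2', 'p1'), ('u3', 'p2')]  # sample (user, page) log

# visitorCount per page (distinct users) and visitCount per (user, page)
visitors = defaultdict(set)
visit_count = Counter()
for user, page in visits:
    visitors[page].add(user)
    visit_count[(user, page)] += 1

# "pages" array sorted by visitorCount -> binary search with bisect
pages_by_visitors = sorted(visitors, key=lambda p: len(visitors[p]))
keys = [len(visitors[p]) for p in pages_by_visitors]

def pages_with_visitor_count(n):
    """Return all pages with exactly n distinct visitors, O(log P + k)."""
    lo = bisect.bisect_left(keys, n)
    hi = bisect.bisect_right(keys, n)
    return pages_by_visitors[lo:hi]

print(pages_with_visitor_count(2))  # pages visited by exactly 2 users: ['p1']
```

The same sorted-array trick stands in for the per-user pageCounts array in Query 3.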
I see the bulk of the work for this problem as building trees of page hits and user hits (user visits). We need a tree for all page hits, one for the page hits of a specific user, and one for the user hits (user visits) of a given page.
With very large data sets, additional data structures are needed to build all the users, pages, and hits quickly, so I've added dictionaries (hash tables) to allow fast construction of the tree data.
using System;
using System.Collections.Generic; // for Dictionary
// To execute C#, please define "static void Main" on a class
// named Solution.
class Solution
{
class PageHit
{
public Page Page; // the page
public int HitCount; // hit count
public PageHit Left; // tree of hit counts <= this.HitCount
public PageHit Right; // tree of hit counts > this.HitCount
}
class UserHit
{
public User User; // the user
public int HitCount; // hit count
public UserHit Left; // tree of hit counts <= this.HitCount
public UserHit Right; // tree of hit counts > this.HitCount
}
class Page
{
// tree of user hits (user visits) balanced/sorted by hit count
private UserHit m_userHitTree;
// fast lookup of user hits by user
private Dictionary<User, UserHit> m_userHitsByUser;
public void AddUserHit( User user )
{
// get UserHit from m_userHitsByUser, create if not found
// increment UserHit.HitCount
// sort (or resort) into m_userHitTree
}
public User FindUserByVisitCount( int visitCount )
{
// search m_userHitTree O(log(n))
}
}
class User
{
// user identifier
public string UserId;
// tree of page hits balanced/sorted by page hit
private PageHit m_pageHitTree;
// fast lookup of page hits by page
private Dictionary<Page, PageHit> m_pageHitsByPage;
public void AddPageHit( Page page )
{
// get PageHit from m_pageHitsByPage, create if not found
// increment PageHit.HitCount
// sort (or resort) PageHit into m_pageHitTree
}
public Page FindPageByHitCount( int hitCount )
{
// search m_pageHitTree O(log(n))
}
}
class PageManager
{
// tree of page hits balanced/sorted by pagehit
private PageHit m_pageHitTree;
// fast lookup of page hit by page
private Dictionary<Page, PageHit> m_pageHitsByPage;
// fast lookup of page by url
private Dictionary<string, Page> m_pagesByUrl;
public Page GetPage( string url, bool createIfNotFound = false )
{
// get existing page from m_pagesByUrl
// possibly create if not found
// return page
}
public void AddPageHit( Page page )
{
// get PageHit from m_pageHitsByPage, create if not found
// increment PageHit.HitCount
// sort (or resort) PageHit in m_pageHitTree
}
public Page FindPageByHitCount( int hitCount, Page searchStart = null )
{
// search binary tree for hit count starting at searchStart
// or from m_pageHitTree root node
// O(log(n)) from root or O(1) if given searchStart
}
}
class UserManager
{
private Dictionary<string, User> m_usersById;
public User GetUser( string userId, bool createIfNotFound = false )
{
// get existing user in m_usersById
// possibly create if not found
// return user
}
}
static void Main(string[] args)
{
// data structs
UserManager userManager = new UserManager();
PageManager pageManager = new PageManager();
// load data from logs
foreach (var (userId, pageUrl) in log)
{
User user = userManager.GetUser( userId, true );
Page page = pageManager.GetPage( pageUrl, true );
pageManager.AddPageHit( page );
page.AddUserHit( user );
user.AddPageHit( page );
}
// which page was visited by 100 users in a day
Page pageWith100UserVisits = pageManager.FindPageByHitCount( 100 );
// which page was visited by only one user exactly 15 times in a day
Page pageWith1UserVisiting15Times = null;
Page prevPage = null;
while (true)
{
Page page = pageManager.FindPageByHitCount( 1, prevPage );
if (page == null)
{
break;
}
if (page.FindUserByVisitCount( 15 ) != null)
{
pageWith1UserVisiting15Times = page;
break;
}
prevPage = page;
}
// which page was visited by u3 more than 20 times in a day
User user3 = userManager.GetUser( "u3" );
if (user3 != null)
{
Page pageU3VisitedMoreThan20Times = user3.FindPageByHitCount( 20 );
}
}
}
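For comparison, the load-and-query flow above can be condensed into a short Python sketch (Counter stands in for the balanced trees, so lookups here are linear scans rather than O(log n); identifiers and sample data are hypothetical):

```python
from collections import Counter, defaultdict

log = [('u1', 'p1'), ('u2', 'p1'), ('u3', 'p2'), ('u3', 'p2')]  # sample log

page_users = defaultdict(Counter)   # page -> per-user visit counts
user_pages = defaultdict(Counter)   # user -> per-page visit counts
for user_id, page_url in log:
    page_users[page_url][user_id] += 1
    user_pages[user_id][page_url] += 1

# pages visited by exactly 2 users
print([p for p, c in page_users.items() if len(c) == 2])        # ['p1']
# pages visited by only one user, exactly 2 times
print([p for p, c in page_users.items()
       if len(c) == 1 and sum(c.values()) == 2])                # ['p2']
# pages u3 visited more than 1 time
print([p for p, n in user_pages['u3'].items() if n > 1])        # ['p2']
```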
I think there should be a directed graph with users and pages as vertices. Edges go from user to page. Each edge has a weight recording the number of visits by that user to that page; on each visit, increase the weight by 1.
So the answers to the three questions:
1. Which page was visited by exactly 100 users in a day? Get the list of all "Page" vertices with exactly 100 incoming edges.
2. Which page was visited by only one user exactly 15 times in a day? Find the "Page" (not "User") vertices with exactly one incoming edge, of weight 15.
3. Which page was visited by u3 more than 20 times a day? Follow all outgoing edges from the vertex u3 with weight greater than 20.
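A minimal sketch of this weighted directed graph as an adjacency map in Python (sample data and thresholds scaled down for illustration):

```python
from collections import defaultdict

# adjacency map: user -> {page: weight}; weight = number of visits
graph = defaultdict(lambda: defaultdict(int))

def visit(user, page):
    graph[user][page] += 1  # increase the edge weight by 1 per visit

for u, p in [('u1', 'p1'), ('u2', 'p1'), ('u3', 'p2'), ('u3', 'p2')]:
    visit(u, p)

# 1. pages with exactly 2 incoming edges (distinct users)
incoming = defaultdict(set)
for u, pages in graph.items():
    for p in pages:
        incoming[p].add(u)
print([p for p, us in incoming.items() if len(us) == 2])    # ['p1']

# 2. pages whose single incoming edge has weight 2
print([p for p, us in incoming.items()
       if len(us) == 1 and graph[next(iter(us))][p] == 2])  # ['p2']

# 3. outgoing edges from u3 with weight > 1
print([p for p, w in graph['u3'].items() if w > 1])         # ['p2']
```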
/*
Too much convolution in all the answers
Actually, things can be done much faster
*/
def UserActivity : {
user_id : "user-id" ,
actions_at : [/* list of timestamps */ ]
}
def PageActivityForDay : {
page_id : "page-id" ,
day : "YYYY-MM-dd" ,
user_activity : { /* map of User Activities keyed by user-id */ }
}
// which pages have exactly 100 users in a day?
yo = select( list_of_page_activities_per_day ) where { size( $.user_activity ) == 100 }
// Which page was visited by only one user exactly 15 times in a day?
meh = select( list_of_page_activities_per_day ) where {
exists ( $.user_activity ) where { size( $.actions_at ) == 15 }
}
// Which page was visited by u3 more than 20 times in a day?
foo = select( list_of_page_activities_per_day ) where {
"u3" @ $.user_activity && size( $.user_activity["u3"].actions_at ) > 20
}
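The select/where pseudo-queries above translate directly into Python comprehensions over the same records (sample data and thresholds scaled down for illustration):

```python
# PageActivityForDay records: user_activity maps user-id -> list of timestamps
activities = [
    {'page_id': 'p1', 'day': '2017-08-13',
     'user_activity': {'u1': [1, 2], 'u2': [3]}},
    {'page_id': 'p2', 'day': '2017-08-13',
     'user_activity': {'u3': [1, 2, 3]}},
]

# pages with exactly 2 users in a day
yo = [a for a in activities if len(a['user_activity']) == 2]

# pages where some user made exactly 3 visits
meh = [a for a in activities
       if any(len(ts) == 3 for ts in a['user_activity'].values())]

# pages u3 visited more than 2 times
foo = [a for a in activities
       if len(a['user_activity'].get('u3', [])) > 2]

print([a['page_id'] for a in yo])   # ['p1']
print([a['page_id'] for a in meh])  # ['p2']
print([a['page_id'] for a in foo])  # ['p2']
```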
edited (mixed up terms)
- Chris August 13, 2017

I would try to approach the question in a structured way: where do these queries come from, and what do we expect to ask tomorrow, so we don't have to change this structure all the time. Anyway, for the exact question:
page(s) visited by 100 users:
- would store for each number of unique users per day the pages
- as a B-tree with the # of unique users as key and the page id and number of total visits as value (I would lose which user it was with this aggregation), but I could have fast statistics for a page ranking based on user visits.
- to ask which page was visited by 100 users, I'd have an O(lg(m) + k) query, where m is the number of distinct page frequencies per day and k the number of pages with exactly 100 users per day. Instead of a B-tree I could use a hash table with the frequency as index, making it O(k); however, I'd expect range queries such as > 100, > 1000, or > 100 and < 200, etc.
pages visited by 1 user 15 times:
- same as above: I would get all pages visited by 1 unique user and ask which one had exactly 15 visits in total. If that's not good enough, because there are many pages with 1 unique user a day, I would use a different index on the B-tree from the first query.
That index would consist of two integers, the # of unique users and the # of total page visits (ordered primarily on unique users, secondarily on total page visits). So I could ask both questions on the same index with logarithmic upper bounds (depending on what I want exactly, e.g. one result or multiple results).
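A minimal sketch of this composite index in Python (a sorted key list plus bisect stands in for the B-tree; page stats are hypothetical sample data):

```python
import bisect
from collections import defaultdict

# composite index: (unique users, total visits) -> pages, via a sorted key list
stats = {'p1': (100, 250), 'p2': (1, 15), 'p3': (1, 40)}  # page -> (#users, #visits)
index = defaultdict(list)
for page, key in stats.items():
    index[key].append(page)
keys = sorted(index)

# query 1: exactly 100 unique users -> range scan on the primary component
lo = bisect.bisect_left(keys, (100, 0))
hi = bisect.bisect_left(keys, (101, 0))
print([p for k in keys[lo:hi] for p in index[k]])   # ['p1']

# query 2: 1 unique user with exactly 15 total visits -> lookup on both components
print(index[(1, 15)])                               # ['p2']
```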
page visited by u3 more than 20 times (may be more than one...first, all > 20?, any if exists):
- I would assume the vector describing #visits / page for a single user and day is sparse.
- hash-table with user id as key and a vector as value
- the vector contains the tuple page-id, #visits
- to find, I'd have O(1) to get to the vector and a linear scan over it. If I sort the vector, I'd have O(lg(n)) to find the min(> 20), O(1) to find any or none, or O(n) to find all > 20.
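The sorted-vector variant can be sketched in a few lines of Python (sample counts scaled down; identifiers hypothetical):

```python
import bisect

# user id -> sparse vector of (#visits, page-id), kept sorted by #visits
by_user = {'u3': sorted([(5, 'p1'), (25, 'p2'), (30, 'p3')])}

vec = by_user['u3']                      # O(1) hash-table lookup
counts = [v for v, _ in vec]
i = bisect.bisect_right(counts, 20)      # O(lg n): first entry with > 20 visits
print(vec[i:])                           # all pages with > 20 visits
```

Everything from index i onward satisfies the predicate, so "any if exists" is just a check that i < len(vec).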