Question

Facebook has cooked some unique features into their search -- possibly some are even patented. The features I speak of are driven by three distinct requirements:

  1. The fact that their database is gigantic and they can't just JOIN their way over to the data they need as they need it, the way you typically can in a single-homed business app with fewer than a million records.
  2. The expectations of their users are shaped by other search experiences, namely Google, so that long-tail search queries are done by appending keywords to the person's name being searched for, such as "Orlando, Florida" or "Rotary Club" (or some other identifying value like an employer name).
  3. The data architecture appears to be shallow, based on the window we have on it looking in from the application (of course it's not actually shallow). What I'm saying is that beyond the so-called "Basic Information" in a user profile, such as gender and current city, much of what makes a profile unique is not rigidly assigned to logical columns.

So, then, complexity exists in the needs associated with the size of the dataset, BUT along with it comes a need to deliver relevant results to a user community that's not savvy in search but has had its expectations and training provided by The Google.

Given all of that (a refinement of my question):

a.) What search features are necessary for Facebook that we should take note of and deploy in our own search apps/engines? By necessary, I mean driven either by the massive size of the data set or by the expectations of the users, and by the need for the site to organically grow and increase the relationships among its data -- I mean, its users.

b.) What search features are innovative and worthy of attention by data and/or search architects?

Some are obvious, such as using synonyms for first names -- fuzzy matching a query for "Bill" with a "William" record. You can do this in Solr with a list of synonyms. I'd call this a basic feature that is necessary, not innovative, of course.
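As a rough, purely illustrative sketch of the concept (separate from Solr's own synonym-filter mechanism), query-time expansion of nicknames might look like this; the nickname map and query syntax below are my own assumptions:

```python
# Hypothetical sketch: expand common nickname forms at query time,
# mimicking what a Solr synonym list does at index or query time.
NICKNAMES = {
    "bill": ["william", "will", "billy"],
    "bob": ["robert", "rob", "bobby"],
    "peggy": ["margaret", "meg"],
}

def expand_first_name(term: str) -> str:
    """Turn 'Bill' into an OR-group such as '(Bill OR william OR will OR billy)'."""
    variants = [term] + NICKNAMES.get(term.lower(), [])
    return "(" + " OR ".join(variants) + ")"

print(expand_first_name("Bill"))  # (Bill OR william OR will OR billy)
```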

Others, which are innovative, deserve our attention. The first example of innovation I can call attention to is that their search relevancy is custom per user. If I type "John Smith" I get a different set of results than another searcher would (theoretically better matches for me: people in my network, friends of friends, etc.). Before you say that's not innovative because you can type just "Pizza" into Google and get relevant results thanks to your locale being appended to the query, follow along, please. My hope is that the answers and discussions to this question will frame some of the technical requirements as well as provide ideas to include as features in search.

For instance...

  • Would you guess they run a regular batch process to denormalize the data? (i.e., a batch job that materializes a link table of first-degree connections, second-degree connections, and so on -- see the sketch after this list.)
  • From such a batch or denormalization, does it then limit the number of hits? Returning only the logically nearest "John Smith" matches suggests it does. However, searches for uncommon names [such as my own first and last name] do not seem to be affected by any limit on results; the search will look around the world, completely outside of those "few degrees" of separation.
  • Are they increasing the relevance scoring by age, giving more weight to matches near the searcher's own age group? (comment: it seems they should; it could be at least a minor speed bump against intergenerational communications/meetings that should not happen -- euphemistic, I know)
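To make the first bullet concrete, here is a minimal sketch of the kind of batch job that could materialize a second-degree "friend of friend" link table from a plain friendship table. SQLite and the table/column names are purely illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friend (person_id INTEGER, friend_id INTEGER);
    INSERT INTO friend VALUES (1, 2), (2, 3), (2, 4), (1, 5);
""")

# Batch step: self-join the link table to derive second-degree connections,
# excluding yourself and people who are already direct friends.
conn.executescript("""
    CREATE TABLE friend_of_friend AS
    SELECT a.person_id, b.friend_id AS fof_id
    FROM friend a
    JOIN friend b ON a.friend_id = b.person_id
    WHERE b.friend_id <> a.person_id
      AND b.friend_id NOT IN (
          SELECT friend_id FROM friend WHERE person_id = a.person_id
      );
""")

print(conn.execute("SELECT * FROM friend_of_friend").fetchall())
# e.g. [(1, 3), (1, 4)] -- person 1 reaches 3 and 4 through person 2
```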

Technically, on the back end, is it best to do a denormalization process at the database level and THEN index the "documents"? (clarification: for those uninitiated to enterprise search, a "document" is more or less similar in concept to a database record)

OR is there no database denormalization at all? In place of that, the process of writing the search index would write into each "document" the related information and the people who are "in-network" or just a few degrees apart.

CERTAINLY it's necessary to pre-process such info. Without having done this exact thing in practice myself, it seems to me that it's advantageous to denormalize in batches at the database level, the reason being that the search server is good at finding matches super fast, but the database server is better at assembling the matching data (assuming it extends out to related columns which are not in the search index).
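Either way, the end product is the same flat "document." Here is a rough sketch of assembling one at index-writing time by pulling the related rows, flattening them, and posting them to Solr's JSON update handler. The schema, field names, and local Solr URL are illustrative assumptions, not anything Facebook has published:

```python
import json
import sqlite3
import urllib.request

# Assumed local Solr core and toy relational schema -- purely illustrative.
SOLR_UPDATE_URL = "http://localhost:8983/solr/people/update?commit=true"

conn = sqlite3.connect("people.db")

def build_document(person_id: int) -> dict:
    """Flatten a person row plus related rows into one search 'document'."""
    name, city = conn.execute(
        "SELECT name, city FROM person WHERE id = ?", (person_id,)
    ).fetchone()
    employers = [r[0] for r in conn.execute(
        "SELECT employer FROM employment WHERE person_id = ?", (person_id,)
    )]
    friend_ids = [r[0] for r in conn.execute(
        "SELECT friend_id FROM friend WHERE person_id = ?", (person_id,)
    )]
    return {
        "id": person_id,
        "name": name,
        "city": city,
        "employer": employers,     # multi-valued field
        "friend_ids": friend_ids,  # lets queries boost in-network matches
    }

def index_documents(ids):
    """POST the flattened documents to Solr as a JSON array."""
    payload = json.dumps([build_document(i) for i in ids]).encode("utf-8")
    req = urllib.request.Request(
        SOLR_UPDATE_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```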

Further expanding upon the concept of search relevancy being dependent upon the user-searcher, notice that it is also derived from the user's recent browsing activity. For example, a search for "John Smith Orlando" might never produce the "right" John Smith, but after visiting the correct John Smith's Facebook page (suppose you got his URL in an email), even without adding John Smith as a friend, a subsequent search on "John Smith" will actually return that result the very next time. [I wonder how long before that ages out, or whether it ages out at all?]
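Speculating, of course: one crude way to approximate that per-searcher behavior in a Solr-based setup (not Facebook's actual method) is to add boost queries built from the searcher's own context -- their ID appearing in a document's friend list, or profile IDs they recently viewed. Field names and IDs here are hypothetical:

```python
import urllib.parse
import urllib.request

# Hypothetical per-searcher context, e.g. pulled from a session store.
searcher_id = 1001
recently_viewed_ids = [20345, 99210]

params = {
    "q": "John Smith",
    "defType": "edismax",
    "qf": "name^3 city employer",   # field weights
    "bq": [
        # Boost documents whose friend list contains the searcher...
        f"friend_ids:{searcher_id}^10",
        # ...and documents the searcher looked at recently.
        "id:(" + " OR ".join(str(i) for i in recently_viewed_ids) + ")^5",
    ],
    "wt": "json",
}

url = ("http://localhost:8983/solr/people/select?"
       + urllib.parse.urlencode(params, doseq=True))
print(urllib.request.urlopen(url).read()[:200])
```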

I used Facebook as an example here because they're huge. Their size forces a well-thought-out architecture -- such as what stays in its normal form and what cannot, because you just can't JOIN a 100-million-record table repeatedly (re-joining the same person table from another "fork" off of a link table or a derived table can produce the "friends of friends" effect).

The practice of relevancy tuning is really almost an art. Data sets, business rules, and users' expectations are unique enough that a multipurpose scoring template, or even a set of best practices, is nearly impossible to create.

That being said, by looking at the big sites that are pulling off search well enough, there are techniques to emulate, aren't there?

What are those techniques in place at Facebook? Given their size, they can't just fetch what the user needs when they need it via an ORM (not a slam at ORM champions) -- this requires well-planned normalization, SQL-level indexing, denormalization, and search-server indexing.

Can anyone suggest some of the techniques in place there? For that matter, any large site with similar search (and a large data set) would also provide good, on-topic suggestions.


Solution

For the database, Facebook uses MySQL because of its speed and reliability. MySQL is used primarily as a key-value store, with data randomly distributed among a large set of logical instances. These logical instances are spread out across physical nodes, and load balancing is done at the physical-node level. As far as customizations are concerned, Facebook has developed a custom partitioning scheme in which a global ID is assigned to all data. They also have a custom archiving scheme based on how frequently and recently data is accessed, on a per-user basis. Most data is distributed randomly.
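Facebook's actual scheme isn't public in detail, but the general idea of hashing a global ID onto a fixed set of logical shards (which can then be moved between physical nodes independently) can be sketched like this; the shard counts and node mapping are assumptions for illustration:

```python
# Illustrative sketch of global-ID partitioning: many logical shards,
# each mapped to whichever physical node currently hosts it.
NUM_LOGICAL_SHARDS = 4096

# Logical shard -> physical node; rebalancing just updates this mapping.
shard_to_node = {shard: f"db{shard % 8:02d}" for shard in range(NUM_LOGICAL_SHARDS)}

def locate(global_id: int) -> tuple[int, str]:
    """Return (logical shard, physical node) for a globally assigned ID."""
    shard = global_id % NUM_LOGICAL_SHARDS
    return shard, shard_to_node[shard]

print(locate(123_456_789))  # (3349, 'db05')
```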

For some parts, like the inbox, Facebook uses a NoSQL database that is "eventually consistent": when you query a cluster of nodes you get "the best answer at that time," and not necessarily what is correct.

From parts of your question it appears you're trying to take practices that work for social media and apply them more widely. Eventual consistency won't work in accounting, trading, medical, or research systems. If it's Auntie Fannie's latest picture of her cat, no one cares if the FB page doesn't show the most recent one ALL THE TIME. You're willing to sacrifice that accuracy for such banality.

Turning every third-normal-form business app into key-value pairs because FB does it isn't a train I'm willing to board.

OTHER TIPS

The question is kind of vague and we can only speculate as to what Facebook does.

But we can discuss instead how a typical Solr-powered search works, which is a more concrete topic. Yes, you have to denormalize data (here are some good tips on Solr schema design) when loading data into the Solr index. This ETL process can be done with the Data Import Handler, or a custom ETL process. Data sources can be anything, not just relational databases. How you design your schema depends largely on what kind of searches you'll be performing.

Full denormalization (Solr really has a flat schema) means no joins, so it's pretty scalable (see Solr shards and replication).
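For the curious, Solr's classic distributed search lets a single query fan out across shards via the `shards` parameter; SolrCloud automates the routing, but the idea is the same. The hosts and core name below are placeholders:

```python
import urllib.parse
import urllib.request

# Placeholder shard hosts -- classic distributed search takes an explicit
# shard list; SolrCloud handles this routing for you.
params = {
    "q": "name:smith",
    "shards": "solr1:8983/solr/people,solr2:8983/solr/people",
    "wt": "json",
}
url = "http://solr1:8983/solr/people/select?" + urllib.parse.urlencode(params)
print(urllib.request.urlopen(url).read()[:200])
```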

Your other concern was relevancy in search results. Here, Solr is very tunable (see the Relevancy Cookbook and FAQ). Yes, it's almost an art, as you say: every application has a different concept of relevancy, so it needs to be tuned differently. And yet the default relevancy is usually acceptable for an out-of-the-box Solr instance (kudos to the Solr and Lucene devs for that).
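As a tiny illustration of the kinds of knobs involved (the field names are hypothetical; the recip-based recency boost is the standard Solr recipe from its docs), a tuned request might carry parameters like these:

```python
# Hypothetical edismax tuning parameters for a people-search core.
tuned_params = {
    "defType": "edismax",
    "qf": "name^5 nickname^3 city employer",  # weight name matches highest
    "mm": "2<75%",                            # require most terms on longer queries
    # Standard Solr recency recipe: newer last_active_dt scores higher.
    "bf": "recip(ms(NOW,last_active_dt),3.16e-11,1,1)",
}
```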

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow