How does Google Search work?
Hi Matt, could you please explain how Google's ranking and website evaluation process works starting with the crawling and analysis of a site, crawling timelines, frequencies, priorities, indexing and filtering processes within the databases etc.
00:00
MATT CUTTS: Hi, everybody.
We got a really interesting and
very expansive question
from RobertvH in Munich.
RobertvH wants to know--
Hi Matt, could you please
explain how Google's ranking
and website evaluation process
works starting with the
crawling and analysis of a site,
crawling timelines,
frequencies, priorities,
indexing and filtering
processes within the databases,
et cetera?
OK.
So that's basically
just like, tell me
everything about Google.
Right?
That's a really expansive
question.
It covers a lot of
different ground.
And in fact, I have given
orientation lectures to
engineers when they come in.
And I can talk for an hour
about all those different
topics, and even talk for an
hour about a very small subset
of those topics.
So let me talk for a while and
see how much of a feel I can
give you for how the Google
infrastructure works, how it
all fits together, how our
crawling and indexing and
serving pipeline works.
Let's dive right in.
So there's three things that you
really want to do well if
you want to be the world's
best search engine.
You want to crawl the web
comprehensively and deeply.
You want to index those pages.
And then you want to rank or
serve those pages and return
the most relevant ones first.
Crawling is actually
more difficult
than you might think.
Whenever Google started,
whenever I joined back in
2000, we didn't manage to crawl
the web for something
like three or four months.
And we had to have a war room.
But a good way to think about
the mental model is we
basically take PageRank as
the primary determinant.
01:28
And the more PageRank you
have-- that is, the more
people who link to you and the
more reputable those people
are-- the more likely it is
we're going to discover your
page relatively early
in the crawl.
In fact, you could imagine
crawling in strict PageRank
order, and you'd get the CNNs of
the world and The New York
Times of the world and really
very high PageRank sites.
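[Editor's sketch] The idea of crawling in priority order can be illustrated with a small priority-queue frontier. This is only a toy model under stated assumptions, not Google's crawler: the link graph, the page names, and the numeric scores (standing in for PageRank) are all hypothetical.

```python
import heapq

def crawl(seed, links, score):
    """Visit pages reachable from `seed`, always taking the
    highest-scored discovered page next (toy stand-in for PageRank order)."""
    frontier = [(-score[seed], seed)]  # negate so the min-heap acts as a max-heap
    seen = {seed}
    order = []
    while frontier:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score[nxt], nxt))
    return order

# Hypothetical link graph and scores, for illustration only.
links = {"hub": ["news", "blog"], "news": ["story"], "blog": ["story"]}
score = {"hub": 5, "news": 9, "blog": 2, "story": 4}
print(crawl("hub", links, score))  # ['hub', 'news', 'story', 'blog']
```

The low-scored "blog" page is discovered early but fetched last, which mirrors the point that well-linked pages tend to be reached relatively early in the crawl.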
And if you think about how
things used to be, we used to
crawl for 30 days.
So we'd crawl for
several weeks.
And then we would index
for about a week.
And then we would push
that data out.
And that would take
about a week.
And so that was what the
Google dance was.
Sometimes you'd hit one data
center that had old data.
And sometimes you'd hit a data
center that had new data.
Now there's various
interesting tricks
that you can do.
For example, after you've
crawled for 30 days, you can
imagine recrawling the high PageRank
guys so you can see
if there's anything new or
important that's hit on the
CNN home page.
But for the most part, this
is not fantastic.
Right?
Because if you're trying to
crawl the web and it takes you
30 days, you're going
to be out-of-date.
So eventually, in 2003, I
believe, we switched as part
of an update called Update Fritz
to crawling a fairly
interesting significant chunk
of the web every day.
And so if you imagine breaking
the web into a certain number
of segments, you could imagine
crawling that part of the web
and refreshing it every night.
And so at any given point, your
main base index would
only be so out of date.
Because then you'd loop back
around and you'd refresh that.
And that works very,
very well.
Instead of waiting for
everything to finish, you're
incrementally updating
your index.
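[Editor's sketch] One way to picture "breaking the web into segments and refreshing one every night" is to assign each URL to a segment and pick the segment by day number. This is a minimal sketch under assumptions: the hash-based assignment, the function names, and the example URLs are hypothetical, and the talk does not say how segments were actually chosen.

```python
import zlib

def segment_of(url, num_segments):
    """Map a URL to a stable segment number using a CRC32 hash."""
    return zlib.crc32(url.encode("utf-8")) % num_segments

def urls_to_refresh(urls, day, num_segments):
    """Return the URLs whose segment is due on the given day."""
    due = day % num_segments
    return [u for u in urls if segment_of(u, num_segments) == due]

# Hypothetical URLs, for illustration only.
urls = ["a.example/1", "b.example/2", "c.example/3", "d.example/4"]
for day in range(3):
    print(day, urls_to_refresh(urls, day, num_segments=3))
```

With this scheme no page is ever more than `num_segments` days stale, which is the "only so out of date" property described above.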
And we've gotten even
better over time.
So at this point, we can
get very, very fresh.
03:14
Any time we see updates,
we can usually
find them very quickly.
And in the old days, you would
have not just a main or a base
index, but you could have what
were called supplemental
results, or the supplemental
index.
And that was something that we
wouldn't crawl and refresh
quite as often.
But it was a lot
more documents.
And so you could almost imagine
having really fresh
content, a layer of our main
index, and then more documents
that are not refreshed quite
as often, but there's a lot
more of them.
So that's just a little bit
about the crawl and how to
crawl comprehensively.
What you do then is you
pass things around.
And you basically say, OK, I
have crawled a large fraction
of the web.
And within that web you have,
for example, one document.
03:58
And indexing is basically taking
things in word order.
Well, let's just work
through an example.
Suppose you say Katy Perry.
In a document, Katy Perry
appears right
next to each other.
But what you want in an index
is which documents does the
word Katy appear in, and which
documents does the word
Perry appear in?
So you might say Katy appears in
documents 1, and 2, and 89,
and 555, and 789.
And Perry might appear in
documents number 2, and 8, and
73, and 555, and 1,000.
And so the whole process of
doing the index is reversing,
so that instead of having the
documents in word order, you
have the words, and they have
it in document order.
So it's, OK, these are all
the documents that a
word appears in.
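[Editor's sketch] The reversal described here is usually called an inverted index. A minimal version, assuming whitespace tokenization and lowercase matching (both simplifications not taken from the talk), with made-up document text:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Turn {doc_id: text} into {word: sorted list of doc_ids}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visit docs in id order so each list stays sorted
        for word in set(docs[doc_id].lower().split()):
            index[word].append(doc_id)
    return dict(index)

# Hypothetical documents, for illustration only.
docs = {1: "Katy went home", 2: "Katy Perry concert", 8: "Perry Mason"}
index = build_inverted_index(docs)
print(index["katy"])   # [1, 2]
print(index["perry"])  # [2, 8]
```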
Now when someone comes to Google
and they type in Katy
Perry, you want to say, OK,
what documents might match
Katy Perry?
Well, document one has Katy,
but it doesn't have Perry.
So it's out.
Document number two has both
Katy and Perry, so that's a
possibility.
Document eight has Perry
but not Katy.
89 and 73 are out because they
don't have the right
combination of words.
555 has both Katy and Perry.
And then these two
are also out.
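[Editor's sketch] The walk-through above is an intersection of two sorted document lists. The sketch below uses the document numbers from the example; the two-pointer merge is one standard way to do it and is an assumption here, not something the talk specifies.

```python
def intersect(a, b):
    """Return the doc ids present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear in b any more
        else:
            j += 1  # b[j] cannot appear in a any more
    return out

katy = [1, 2, 89, 555, 789]
perry = [2, 8, 73, 555, 1000]
print(intersect(katy, perry))  # [2, 555]
```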
And so when someone comes to
Google and they type in
Chicken Little, Britney Spears,
Matt Cutts, Katy
Perry, whatever it is, we find
the documents that we believe
have those words, either on
the page or maybe in backlinks,
in anchor text pointing
to that document.
Once you've done what's called
document selection, you try to
figure out, how should
you rank those?
And that's really tricky.
We use PageRank as well as over
200 other factors in our
rankings to try to say, OK,
maybe this document is really
authoritative.
It has a lot of reputation
because it has
a lot of PageRank.
But it only has the
word Perry once.
And it just happens to have the
word Katy somewhere else
on the page.
Whereas here is a document that
has the word Katy and
Perry right next to each other,
so there's proximity.
And it's got a lot
of reputation.
It's got a lot of links
pointing to it.
So we try to balance that off.
You want to find reputable
documents that are also about
what the user typed in.
And that's kind of the secret
sauce, trying to figure out a
way to combine those 200
different ranking signals in
order to find the most
relevant document.
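[Editor's sketch] To make the trade-off between reputation and proximity concrete, here is a toy scoring function with just two signals. Everything in it is an assumption for illustration: the formula, the weights, the reputation numbers, and the sample text are invented, and the real ranking combines over 200 signals in ways the talk does not describe.

```python
def min_gap(words, first, second):
    """Smallest distance in word positions between the two terms, or None."""
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def toy_score(text, reputation, first, second, w_rep=1.0, w_prox=1.0):
    """Weighted sum of a reputation value and a proximity value (hypothetical)."""
    gap = min_gap(text.lower().split(), first, second)
    if gap is None:
        return 0.0  # document does not contain both words
    proximity = 1.0 / max(gap, 1)  # adjacent words give 1.0, distant words less
    return w_rep * reputation + w_prox * proximity

# Hypothetical documents: (text, reputation in [0, 1]).
far = ("perry said hello to everyone including katy", 0.9)
near = ("katy perry tour dates", 0.8)
print(toy_score(*far, "katy", "perry") < toy_score(*near, "katy", "perry"))  # True
```

Here the slightly less reputable page wins because the two words sit next to each other; changing the weights shifts that balance.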
So at any given time, hundreds
of millions of times a day,
someone comes to Google.
We try to find the closest
data center to them.
They type in something
like Katy Perry.
We send that query out to
hundreds of different machines
all at once, which look through
their little tiny
fraction of the web that
we've indexed.
And we find, OK, these are
the documents that
we think best match.
All those machines return
their matches.
And we say, OK, what's the
crème de la crème?
What's the needle
in the haystack?
What's the best page that
matches this query across our
entire index?
And then we take that page and
we try to show it with a
useful snippet.
So you show the keywords in the
context of the document.
And you get it all back in
under half a second.
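[Editor's sketch] Sending one query to many machines and merging their answers is often called scatter-gather. The sketch below imitates it with threads over in-memory "shards". The shard layout, the precomputed scores, and the function names are hypothetical, and threads in one process only stand in for separate machines.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query_words):
    """Return (score, doc_id) for each doc in this shard containing every query word."""
    return [(score, doc_id)
            for doc_id, (words, score) in shard.items()
            if query_words <= words]

def search(shards, query, k=3):
    """Query all shards in parallel, then keep the k best hits overall."""
    query_words = set(query.lower().split())
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query_words), shards))
    merged = [hit for part in partials for hit in part]
    return [doc_id for _, doc_id in heapq.nlargest(k, merged)]

# Hypothetical shards: doc_id -> (set of words on the page, precomputed score).
shards = [
    {1: ({"katy"}, 0.9), 2: ({"katy", "perry", "tour"}, 0.7)},
    {8: ({"perry"}, 0.5), 555: ({"katy", "perry"}, 0.95)},
]
print(search(shards, "Katy Perry", k=2))  # [555, 2]
```

Each shard only looks at its own fraction of the documents, and the final merge picks the best matches across all of them.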
07:06
So that's probably about as long
as we can go on without
straining YouTube.
But that just gives you a little
bit of a feel about how
the crawling system works, how
we index documents, how things
get returned in under half a
second through that massive
parallelization.
I hope that helps.
And if you want to know more,
there's a whole bunch of
articles and academic papers
about Google, and PageRank,
and how Google works.
But you can also apply to--
there's jobs@google.com, I
think, or google.com/jobs, if
you're interested in learning
a lot more about how search
engines work.
OK.
Thanks very much.