Google's US appl. 20050071741
has been the subject of some discussion.
The first claim of the '741 application recites:
1. A method for scoring a document, comprising:
identifying a document;
obtaining one or more types of history data associated with the document; and
generating a score for the document based on the one or more types of history data.
Doesn't citation analysis meet these claim elements?
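Read as plain computation, claim 1 reduces to three steps: identify a document, obtain whatever history data is associated with it, and fold that data into a score. Here is a minimal Python sketch of that shape; the particular history-data types ("ctr", "link_growth") and the linear weighting are my own illustrative assumptions, not anything disclosed in the application:

```python
# Hypothetical sketch of claim 1's three steps. The data types and the
# linear weighting scheme are assumptions for illustration only.
def score_document(doc_id, history_store, weights):
    """Identify a document, obtain its history data, generate a score."""
    history = history_store.get(doc_id, {})            # obtain history data
    score = 0.0
    for data_type, value in history.items():           # combine into one score
        score += weights.get(data_type, 0.0) * value
    return score

history_store = {"doc-1": {"ctr": 0.12, "link_growth": 0.4}}
weights = {"ctr": 5.0, "link_growth": 2.0}
print(score_document("doc-1", history_store, weights))
```

Which is exactly why the citation-analysis question above bites: almost any "combine historical signals into a number" routine fits this outline.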
Claim 15 of the '741 reads:
The method of claim 1, wherein the one or more types of history data includes information relating to how often the document is selected when the document is included in a set of search results; and wherein the generating a score includes: determining an extent to which the document is selected over time when the document is included in a set of search results, and scoring the document based, at least in part, on the extent to which the document is selected over time when the document is included in the set of search results.
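In code terms, claim 15 is about measuring the extent to which a document is selected over time when it appears in results. A hedged sketch follows; the exponential recency weighting and the half-life value are my assumptions, not anything recited in the claim:

```python
# Illustrative only: "extent to which the document is selected over time."
# The decay weighting and half-life are assumptions, not from the claim.
def selection_rate_over_time(observations, half_life_days=30.0):
    """observations: list of (days_ago, times_shown, times_clicked).
    Recent selections count more via exponential decay."""
    weighted_clicks = weighted_shows = 0.0
    for days_ago, shown, clicked in observations:
        w = 0.5 ** (days_ago / half_life_days)
        weighted_shows += w * shown
        weighted_clicks += w * clicked
    return weighted_clicks / weighted_shows if weighted_shows else 0.0
```

The resulting rate could then feed into the claim-1 style score as one history-data type among several.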
Claim 22 gets into links:
The method of claim 1, wherein the one or more types of history data includes information relating to behavior of links over time; and wherein the generating a score includes: determining behavior of links associated with the document, and scoring the document based, at least in part, on the behavior of links associated with the document.
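"Behavior of links over time" is left broad in the claim, but one natural reading is that long-lived back links are worth more than freshly minted ones. A toy sketch under that reading; the age-based weighting and the 180-day maturity figure are assumptions of mine, not from the application:

```python
# Hypothetical reading of claim 22: weight each back link by its age,
# so long-lived links count more. Maturity period is an assumption.
def link_behavior_score(link_ages_days, maturity_days=180.0):
    """link_ages_days: age in days of each back link to the document."""
    return sum(min(age / maturity_days, 1.0) for age in link_ages_days)
```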
The spec notes: In the description to follow, documents may be described as having links to other documents and/or links from other documents. For example, when a document includes a link to another document, the link may be referred to as a "forward link." [LBE note: citing] When a document includes a link from another document, the link may be referred to as a "back link." [LBE note: cited] When the term "link" is used, it may refer to either a back link or a forward link.
Claim 52 (the second independent claim) is directed to a "system":
52. A system for scoring a document, comprising: means for identifying a document; means for obtaining a plurality of types of history data associated with the document; and means for generating a score for the document based, at least in part, on the plurality of types of history data.
The patent application has a "conclusion" section which recites:
 Systems and methods consistent with the principles of the invention may use history data to score documents and form high quality search results.
 The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. For example, while a series of acts has been described with regard to FIG. 4, the order of the acts may be modified in other implementations consistent with the principles of the invention. Also, non-dependent acts may be performed in parallel.
 Further, it has generally been described that server 120 performs most, if not all, of the acts described with regard to the processing of FIG. 4. In another implementation consistent with the principles of the invention, one or more, or all, of the acts may be performed by another entity, such as another server 130 and/or 140 or client 110.
 It will also be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the principles of the invention is not limiting of the present invention. Thus, the operation and behavior of the aspects were described without reference to the specific software code--it being understood that one of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.
Jill Whalen wrote:
I wasn’t surprised about the stuff in the patent that corresponded with Google’s aging delay and its “sandbox” as I had already seen a lot of discussion on this. For those who aren’t familiar with the aging delay and the sandbox, you’ll want to note that there is a lot of disagreement over what causes a site to be thrown in the sandbox.
However, based on my own observations and the experiences of some trusted SEO friends, it’s my belief that the sandbox is basically a purgatory database where Google places certain URLs based on a variety of predetermined criteria. (Much of this is spelled out in the first part of the patent application.)
The aging delay, on the other hand, is actually a subset of the sandbox. In other words, the aging delay is just *one* reason why a URL might get placed in the sandbox.
**UPDATE on 10 July 06**
There was later discussion by Darren Yates of the '741 application; however, keep in mind that the mere fact that Google filed a patent application does NOT establish that Google is practicing what is in the application.
It's common knowledge that Google relies heavily on inbound relevant links to rank a site. Now they explain exactly how it works.
As well as the number, quality, and anchor-text factors of a link, Google also seems to consider historical factors. Apparently the Google 'sandbox' or aging delay begins counting down the minute links to a new site are discovered.
Google records the discovery of a link, link changes over time, the speed at which a site gains links, and the link life span. With this in mind, fast link acquisition may be a strong indicator of potential search-engine spam.
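The "speed of link acquisition" idea can be pictured as a simple burst detector over the per-day count of newly discovered back links. This is my own toy illustration of the concept, not Google's method; the one-week window and 5x threshold are arbitrary assumptions:

```python
# Illustrative burst detector for fast link acquisition. The window and
# threshold are assumptions; nothing here comes from the application.
def fast_link_acquisition(new_links_per_day, burst_ratio=5.0):
    """new_links_per_day: count of newly discovered back links per day.
    Flags a burst when the last week's rate far exceeds the earlier rate."""
    if len(new_links_per_day) <= 7:
        return False
    earlier = new_links_per_day[:-7]
    baseline = sum(earlier) / len(earlier)
    recent = sum(new_links_per_day[-7:]) / 7
    return recent > burst_ratio * max(baseline, 1.0)
```

A site gaining two links a day that suddenly gains fifty a day would trip a check like this; steady growth would not.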
Gone are the days of pages and pages full of links. You must grow your links slowly to stay below the radar and be careful who you exchange links with. That means no more buying hundreds of links at once or other underhand tactics.
PR (PageRank) is now very valuable.
Your link anchor text should vary but remain consistent with your site content. No more using your main keywords on every link exchange you gain; that's 'anchor spam'. Instead, vary them around your top five to ten keywords.
Link exchanges are still very important, but you must build and use them ethically. If you don't and you get caught, recovery from a ban can be months in coming, and your host and IP may also be recorded.
Softly, softly seems to be the message. The fact is that fewer but better-quality links will benefit you more anyway, and they are much more likely to be long-term, which is also good.
• Site click through rates (CTR)
CTR may now be monitored through cache, temporary files, bookmarks and favorites via the Google toolbar or desktop tools. Many have suspected for some time that sites are rewarded for good CTR with a rise in ranking, similar to how AdWords works.
CTR is monitored to see if fresh or stale content is preferred for a search result.
CTR is also analyzed for increases or decreases relating to trends or seasons.
• Web page rankings are recorded and monitored for changes.
• The traffic to a web page is recorded and monitored over time.
• Sites can be ranked seasonally. A ski site may rank higher in the winter than in the summer.
Google can monitor and rank pages by recording CTR changes by season.
• Bookmarks and favorites could be monitored for changes, deletions or additions.
• User behaviour in general could be monitored.
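The seasonal point above is easy to make concrete: bucket a page's impressions and clicks by calendar month and the seasonal CTR pattern falls out. This is my own illustration of the idea, not anything quoted from the application:

```python
# Illustrative seasonal-CTR calculation: bucket click data by calendar
# month. A ski site would show higher CTR in winter months.
from collections import defaultdict

def seasonal_ctr(click_log):
    """click_log: list of (month 1-12, times_shown, times_clicked).
    Returns the observed CTR for each month."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for month, s, c in click_log:
        shown[month] += s
        clicked[month] += c
    return {m: clicked[m] / shown[m] for m in shown if shown[m]}
```

Comparing this month's CTR against the same month in previous years would distinguish a seasonal dip from a genuine decline.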
As Google is capable of tracking traffic to your site, you should closely monitor the small amount of copy returned in search results. Ideally you want to integrate a call to action in there to increase your listing's CTR.
Clicks away from your site back to the search results are also monitored. Make your site as sticky as possible to keep visitors there longer. As mentioned above, it may also help if you can get your visitors to bookmark you.
• The frequency and amount of page updates are monitored and recorded, as is the number of pages.
Mass updates of hundreds of files will see you pop up on the radar. On the other hand, few or small updates to your site could see your rankings slide, unless your CTR is good. A stale page that receives good traffic may hold its own and not require an update. So don't update for the sake of it.
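The "stale but well-clicked page can hold its own" point above boils down to a simple rule of thumb. Here it is as a toy check; the 180-day staleness cutoff and 10% CTR bar are arbitrary assumptions of mine:

```python
# Toy rule of thumb, not Google's logic: a stale page only "needs" an
# update when its CTR is also poor. Thresholds are assumptions.
def needs_update(days_since_update, ctr, stale_days=180, good_ctr=0.10):
    """A stale page with good CTR may hold its own; flag the rest."""
    return days_since_update > stale_days and ctr < good_ctr
```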
A further indicator that Google is really cracking down on spam is made clear in the following extract from the patent application. Mention is made of changing the focus of multiple pages at once.
Here's the quote -
“A significant change over time in the set of topics associated with a document may indicate that the document has changed owners and previous document indicators, such as score, anchor text, etc., are no longer reliable.
Similarly, a spike in the number of topics could indicate Spam. For example, if a particular document is associated with a set of one or more topics over what may be considered a ’stable’ period of time and then a (sudden) spike occurs in the number of topics associated with the document, this may be an indication that the document has been taken over as a ‘doorway’ document.
Another indication may include the sudden disappearance of the original topics associated with the document. If one or more of these situations are detected, then [Google] may reduce the relative score of such documents and/or the links, anchor text, or other data associated with the document.”
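The spike test described in that extract is straightforward to picture: compare the latest count of distinct topics against the count over the "stable" period. This is my own sketch of the described check; the 3x spike threshold is an assumption, not a value from the application:

```python
# Sketch of the topic-spike ("doorway document") check described in the
# quoted extract. The spike threshold is an assumption for illustration.
def topic_spike(topic_counts, spike_ratio=3.0):
    """topic_counts: number of distinct topics seen at each successive
    crawl of one document. True when the latest count jumps well above
    the baseline from the earlier, stable period."""
    if len(topic_counts) < 2:
        return False
    *stable, latest = topic_counts
    baseline = sum(stable) / len(stable)
    return latest > spike_ratio * baseline
```

A document steadily about three topics that suddenly reports a dozen would trip this; per the extract, a hit could then discount the document's score, anchor text, and links.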
There's still more to look out for:
• Changes in on-page keyword density are monitored and recorded, as are changes to anchor text.
• The domain name owner's address is considered, most likely to help with local search results.
• The technical and admin contact details are checked for consistency. These are often falsified for spam domains.
• Your host's IP address. If you are on a shared server, it's possible somebody else on that server is using dirty tactics or spamming. If so, your site will suffer, since you share the same IP.
The impression I get here is that Google have learned from the spam 'attack' they suffered in early 2004 and are determined to eradicate it from their listing results.