Saturday, June 25, 2005

Google's US appl. 20050071741

Google's US application 20050071741 (Information retrieval based on historical data)
has been the subject of some discussion.

The first claim of the '741 application recites:

1. A method for scoring a document, comprising:
identifying a document;
obtaining one or more types of history data associated with the document; and generating a score for the document based on the one or more types of history data.

Doesn't citation analysis meet these claim elements?
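To make the question concrete, here is a minimal sketch of a citation-based scorer laid out against the three steps of claim 1. It is an illustration only: the `Paper` structure, the `citation_score` function, and the half-life discount are hypothetical choices of mine and do not come from the '741 application or from any particular citation index.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Paper:
    """Step 1 (identify a document): a paper identified by its title."""
    title: str
    # Step 2 (obtain history data): year -> citations received in that year.
    citations_by_year: Dict[int, int] = field(default_factory=dict)


def citation_score(paper: Paper, current_year: int, half_life: float = 5.0) -> float:
    """Step 3 (generate a score): sum citations, discounting older ones.

    The half-life weighting is an assumed choice; a plain citation count
    would equally be a score based on history data.
    """
    score = 0.0
    for year, count in paper.citations_by_year.items():
        age = max(current_year - year, 0)
        score += count * 0.5 ** (age / half_life)
    return score


example = Paper("Example paper", {2000: 4, 2005: 2})
example_score = citation_score(example, current_year=2005)
print(example_score)
```

Whether such a scorer actually anticipates the claim is a legal question; the sketch only shows that the three recited steps are easy to line up with ordinary citation counting.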

Claim 15 of the '741 reads:

The method of claim 1, wherein the one or more types of history data includes information relating to how often the document is selected when the document is included in a set of search results; and wherein the generating a score includes: determining an extent to which the document is selected over time when the document is included in a set of search results, and scoring the document based, at least in part, on the extent to which the document is selected over time when the document is included in the set of search results.
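One way to picture "the extent to which the document is selected over time" is as a click-through rate per time period plus its trend. The sketch below assumes a made-up history format, a list of (times shown, times selected) pairs, and a made-up weighting; the application does not disclose a formula of this kind.

```python
from typing import List, Tuple


def selection_rates(history: List[Tuple[int, int]]) -> List[float]:
    """history holds (times_shown, times_selected) per period, oldest first."""
    return [selected / shown if shown else 0.0 for shown, selected in history]


def selection_score(history: List[Tuple[int, int]], trend_weight: float = 0.5) -> float:
    """Average selection rate, adjusted by the change from first to last period.

    trend_weight is an assumed tuning knob, not a value from the application.
    """
    rates = selection_rates(history)
    if not rates:
        return 0.0
    average = sum(rates) / len(rates)
    trend = rates[-1] - rates[0]
    return average + trend_weight * trend


rising = [(100, 10), (100, 20), (100, 30)]
falling = [(100, 30), (100, 20), (100, 10)]
print(selection_score(rising), selection_score(falling))
```

With equal average rates, the document whose selection rate is rising scores higher than the one whose rate is falling.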

Claim 22 gets into links:

The method of claim 1, wherein the one or more types of history data includes information relating to behavior of links over time; and wherein the generating a score includes: determining behavior of links associated with the document, and scoring the document based, at least in part, on the behavior of links associated with the document.

The spec notes: In the description to follow, documents may be described as having links to other documents and/or links from other documents. For example, when a document includes a link to another document, the link may be referred to as a "forward link." [LBE note: citing] When a document includes a link from another document, the link may be referred to as a "back link." [LBE note: cited] When the term "link" is used, it may refer to either a back link or a forward link.

Claim 52 (the second independent claim) is directed to a "system":

52. A system for scoring a document, comprising: means for identifying a document; means for obtaining a plurality of types of history data associated with the document; and means for generating a score for the document based, at least in part, on the plurality of types of history data.

The patent application has a "conclusion" section which recites:


[0134] Systems and methods consistent with the principles of the invention may use history data to score documents and form high quality search results.

[0135] The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. For example, while a series of acts has been described with regard to FIG. 4, the order of the acts may be modified in other implementations consistent with the principles of the invention. Also, non-dependent acts may be performed in parallel.

[0136] Further, it has generally been described that server 120 performs most, if not all, of the acts described with regard to the processing of FIG. 4. In another implementation consistent with the principles of the invention, one or more, or all, of the acts may be performed by another entity, such as another server 130 and/or 140 or client 110.

[0137] It will also be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the principles of the invention is not limiting of the present invention. Thus, the operation and behavior of the aspects were described without reference to the specific software code--it being understood that one of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.

Jill Whalen wrote:

I wasn’t surprised about the stuff in the patent that corresponded with Google’s aging delay and its “sandbox” as I had already seen a lot of discussion on this. For those who aren’t familiar with the aging delay and the sandbox, you’ll want to note that there is a lot of disagreement over what causes a site to be thrown in the sandbox.

However, based on my own observations and the experiences of some trusted SEO friends, it’s my belief that the sandbox is basically a purgatory database where Google places certain URLs based on a variety of predetermined criteria. (Much of this is spelled out in the first part of the patent application.)

The aging delay, on the other hand, is actually a subset of the sandbox. In other words, the aging delay is just *one* reason why a URL might get placed in the sandbox.

**UPDATE on 10 July 06**

There was later discussion by Darren Yates of the '741 application; however, keep in mind that the mere fact that Google filed a patent application does NOT establish that Google is practicing what is in the application.

It's common knowledge that Google relies heavily on inbound relevant links to rank a site. Now they explain exactly how it works.

As well as the number, quality and anchor text factors of a link, Google seems to also consider historical factors. Apparently the Google 'sandbox' or aging delay begins its countdown the minute links to a new site are discovered.

Google records the discovery of a link, link changes over time, the speed at which a site gains links and the link life span. With this in mind fast link acquisition may be a strong indicator of potential search engine Spam.
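The "fast link acquisition" idea can be illustrated with a small sketch that counts newly discovered links per week and flags a week that dwarfs the rest. The function names, the 5x ratio, and the 10-link floor are all assumptions for illustration; nothing here is claimed to be Google's actual test.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple


def links_per_week(discovered: Iterable[date]) -> Dict[Tuple[int, int], int]:
    """Count newly discovered inbound links per ISO (year, week)."""
    counts: Counter = Counter()
    for day in discovered:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def looks_like_burst(discovered: Iterable[date], ratio: float = 5.0,
                     min_links: int = 10) -> bool:
    """True when the busiest week has at least min_links links and more than
    ratio times the mean of the other weeks that saw any new links."""
    weekly = sorted(links_per_week(discovered).values())
    if not weekly or weekly[-1] < min_links:
        return False
    others = weekly[:-1]
    if not others:
        return True  # every link was discovered in a single week
    return weekly[-1] > ratio * (sum(others) / len(others))


# One new link a week for twelve weeks, then fifty links in a single day.
slow = [date(2005, 1, 3) + timedelta(weeks=i) for i in range(12)]
burst = slow + [date(2005, 4, 4)] * 50
print(looks_like_burst(slow), looks_like_burst(burst))
```

Weeks with no new links are simply absent from the counts, so a real monitor would probably want to fill in zeros before comparing.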

Gone are the days of pages and pages full of links. You must grow your links slowly to stay below the radar and be careful who you exchange links with. That means no more buying hundreds of links at once or other underhand tactics.

PR is now very valuable.

Your link anchor text should vary but remain consistent with your site content. No more using your main keywords on every link exchange you gain. That's 'anchor Spam'. Instead vary them around your top five to ten keywords.

Link exchanges are still very important but you must work and utilize them ethically. If you don't and you get caught the recovery from a ban can be months in coming and your host and IP may also be recorded.

Softly softly seems to be the message. The fact is fewer but better quality links will benefit you more anyway and they will be much more likely to be long-term which is also good.

• Site click through rates (CTR)

CTR may now be monitored through cache, temporary files, bookmarks and favorites via the Google toolbar or desktop tools. Many have suspected for some time that sites are rewarded for good CTR with a rise in ranking, similar to how AdWords works.

CTR is monitored to see if fresh or stale content is preferred for a search result.

CTR is also analyzed for increases or decreases relating to trends or seasons.

• Web page rankings are recorded and monitored for changes.

• The traffic to a web page is recorded and monitored over time.

• Sites can be ranked seasonally. A ski site may rank higher in the winter than in the summer.

Google can monitor and rank pages by recording CTR changes by season.

• Bookmarks and favorites could be monitored for changes, deletions or additions.

• User behaviour in general could be monitored.

As Google is capable of tracking traffic to your site, you should closely monitor the small amount of copy returned in search results. Ideally you want to integrate a call to action in there to increase your listing's CTR.

Clicks away from your site back to the search results are also monitored. Make your site as sticky as possible to keep visitors there longer. As mentioned above it may also help if you could get your visitors to bookmark you.

• The frequency and amount of page updates is monitored and recorded as is the number of pages.

Mass updates of hundreds of files will see you pop up on the radar. On the other hand, few or small updates to your site could see your rankings slide, unless your CTR is good. A stale page that receives good traffic may hold it's [sic: its] own and not require an update. So don't update for the sake of it.

A further indicator that Google is really cracking down on Spam is made clear in the following extract from the Patent [application]. Mention is made of changing the focus of multiple pages at once.

Here's the quote -

“A significant change over time in the set of topics associated with a document may indicate that the document has changed owners and previous document indicators, such as score, anchor text, etc., are no longer reliable.

Similarly, a spike in the number of topics could indicate Spam. For example, if a particular document is associated with a set of one or more topics over what may be considered a ’stable’ period of time and then a (sudden) spike occurs in the number of topics associated with the document, this may be an indication that the document has been taken over as a ‘doorway’ document.

Another indication may include the sudden disappearance of the original topics associated with the document. If one or more of these situations are detected, then [Google] may reduce the relative score of such documents and/or the links, anchor text, or other data associated the document.”
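The two signals in that passage, a spike in the number of topics and the disappearance of the original topics, can be sketched as a comparison between a document's latest topic set and its earlier history. How topics are extracted is left out entirely; the `topic_change_flags` name and the 3x spike ratio are assumptions for illustration, not anything stated in the application.

```python
from typing import Dict, List, Set


def topic_change_flags(snapshots: List[Set[str]],
                       spike_ratio: float = 3.0) -> Dict[str, bool]:
    """Compare the latest topic set with the earlier, 'stable' snapshots.

    snapshots are ordered oldest first; the last one is the current state.
    """
    if len(snapshots) < 2:
        return {"spike": False, "originals_gone": False}
    *stable, latest = snapshots
    typical = sum(len(topics) for topics in stable) / len(stable)
    original = stable[0]
    return {
        # Far more topics than the document usually carries.
        "spike": typical > 0 and len(latest) > spike_ratio * typical,
        # None of the earliest topics survive in the latest snapshot.
        "originals_gone": bool(original) and not (original & latest),
    }


ski_site = [{"skiing", "snow"}] * 3
steady = ski_site + [{"skiing", "snow", "lift passes"}]
doorway = ski_site + [{"casino", "loans", "pills", "poker",
                       "insurance", "mortgage", "travel"}]
print(topic_change_flags(steady), topic_change_flags(doorway))
```

A scorer following the quoted passage would then reduce the document's score, or discount its links and anchor text, when either flag is set.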

There's still more to look out for:-

• Changes in on-page keyword density are monitored and recorded, as are changes to anchor text.

• The domain name owner address is considered, most likely to help in a local search result.

• The technical and admin contact details are checked for consistency. These are often falsified for Spam domains.

• Your host's IP address. If you are on a shared server it's possible somebody else on that server is using dirty tactics or Spamming. If so your site will suffer since you share the same IP.

The impression I get here is that Google have learned from the Spam 'attack' they suffered in early 2004 and they are determined to eradicate it from their listing results.


Blogger Lawrence B. Ebert said...

The Law of Googling

Greg Lastowka, Rutgers School of Law, Camden, has published "Google's Law." Here is the abstract.
Google has become, for the majority of Americans, the index of choice for online information. Through dynamically generated results pages keyed to a near-infinite variety of search terms, Google steers our thoughts and our learning online. It tells us what words mean, what things look like, where to buy things, and who and what is most important to us. Google's control over results constitutes an awesome ability to set the course of human knowledge.

As this paper will explain, fortunes are won and lost based on Google's results pages, including the fortunes of Google itself. Because Google's results are so significant to e-commerce activities today, they have already been the subject of substantial litigation. Today's courtroom disputes over Google's results are based primarily, though not exclusively, in claims about the requirements of trademark law. This paper will argue that the most powerful trademark doctrines shaping these cases, initial interest confusion and trademark use, are not up to the task they have been given, but that trademark law must continue to stay engaged with Google's results.

The current application of initial interest confusion to search results represents a hyper-extension of trademark law past the point of its traditional basis in preventing consumer confusion. Courts should reject initial interest confusion doctrine due to its tendency to grant trademark owners rights over search results that could easily operate against the greater public interest. On the other hand, the recent innovation of trademark use doctrine improperly relieves trademark law of any role in the supervision of the shape of Google's search results. The absence of any state involvement in the shape of Google's results will effectively cede the structure of our primary online index to Google's law. Google may enjoy substantial public goodwill, but what is best for Google will not always be what is best for society.

Part I of this article describes the history of Google and its business model. Google is not the only search engine today, but it is the leading search engine in terms of United States market share. Additionally, Google is playing the most important role today in search engine litigation. It is a unique search engine in many respects. During its evolution, Google followed a very different path than many of its competitors. Today its competitors are largely imitating its model, yet are unable to dethrone its centrality in search. Understanding how Google rose to prominence is essential to understanding its motives and how it might act in the future.

Part II of this article sets forth the contemporary law pertaining to search results. It begins with a short discussion of recent (failed) attempts to regulate Google's results through laws other than trademark. It then describes current theories of trademark law. It concludes by summarizing how trademark law has been applied to search engines, starting with early meta tag cases and concluding with Google's current attempts to insulate itself from liability under an expanded doctrine of trademark use.

Part III criticizes the current application of trademark law to search engines. It argues that the judicial innovations of both initial interest confusion and trademark use are inconsistent with the traditional purpose of trademark law and the new realities of the e-commerce marketplace. It concludes that a simple focus on the likelihood of confusion standard, which some courts have already supported, is overdue. It concludes by explaining why, despite the fact that trademark law today will likely permit Google's current practices, Google's bid for the carte blanche freedom permitted by trademark use doctrine should be rejected by courts. In its relatively new role as a protector of the social value of indices, trademark law must retain the ability to curb potential abuses of the commercial power enjoyed by Google.

