Friday, November 28, 2008

Robots.txt at the USPTO

Patently-O wrote:

The BPAI opinions are stored on the site Unfortunately that site is not crawled by web indexes such as Google or Yahoo. The problem is that the site includes a "robots.txt" file that excludes such activity. See Robots are also blocked from crawling the USPTO patent database ( and the PAIR site ( Also blocked from crawling is this site: - what is that?

A comment under the Patent Hawk post "Suppression" (about Hal Wegner's assertion that the USPTO is suppressing information on the outcome of petitions to the USPTO) said:

robots.txt only blocks if it is respected.
google respects it.
there is nothing to stop some guy from the Ukraine from scraping and indexing the site.
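
The commenter's point, that robots.txt is a request rather than a lock, can be sketched in a few lines of Python. The robots.txt text, the bot name "ExampleBot", and the example.gov URL below are made up for illustration and are not the USPTO's actual file or addresses. The check is something a polite crawler chooses to run; a crawler that skips it is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def polite_can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical URL; a well-behaved bot would run this check and back off.
url = "https://example.gov/opinions/decision1.pdf"
print(polite_can_fetch(ROBOTS_TXT, "ExampleBot", url))
```

Nothing here touches the server. The file only tells a crawler what the site owner would like, and compliance is up to the crawler.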

ClickZ notes:

Traditional robots.txt exclusion protocols were pretty easy to understand, but they never really kept up with the times. I'm thankful that in early 2008, Microsoft, Yahoo, and Google joined up to expand their robots exclusion protocol to honor a series of new characters and directives that the traditional robots.txt protocol doesn't address. This move represented a vast expansion of the power Webmasters have to control bot access to their content, not only with what shows up in SERPs but also how it appears.

The perfect complement to this expanded robots protocol is Google's robots.txt analysis tool in Webmaster Tools. This tool is a sandbox that enables you to plug in different URL exclusion patterns, pit them against specific URLs from your site, and see exactly how Google's bot (and in theory, any major search engine crawler) will react to the code.
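
The kind of test ClickZ describes, pitting an exclusion pattern against specific URLs, can be roughed out as follows. This is only a sketch under assumptions: it covers the two extended characters the major engines document (`*` for any run of characters, and a trailing `$` to anchor the end of the URL), the function name `rule_matches` and the sample paths are hypothetical, and it does not model Allow/Disallow precedence or any engine's actual implementation.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    '*' matches any run of characters and a trailing '$' anchors the end
    of the path; without '$' the pattern behaves as a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

# Hypothetical patterns and paths, checked one pair at a time.
for pattern, path in [
    ("/*.pdf$", "/opinions/fd2008.pdf"),
    ("/*.pdf$", "/opinions/fd2008.pdf?page=2"),
    ("/archive", "/archive/2008/"),
]:
    print(pattern, path, rule_matches(pattern, path))
```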

The IHT noted: Instead, they [search engines] use a 15-year-old program called robots.txt. To ensure that their articles turn up in searches, publishers also have to keep using robots.txt, which gives them little control over what happens to their material after it has been released on the Internet.

Rubin said adoption of the new protocol could encourage publishers to make additional information available in digital form. Some newspaper publishers, for instance, have been reluctant to open their archives online.

Complicating the battles between search engines and copyright owners is a disagreement over how best to profit from the rise of the Internet. Some newspaper publishers, for instance, try to make it as easy as possible for search engines to find their articles in an effort to attract more Web traffic and to sell more online advertising.

Critics of the Automated Content Access Protocol have compared it to the "digital rights management" systems imposed on some online music services, saying such restrictions inhibit the development of Internet business models.

Of scraping, from cogniview:

Scraping - Scraping (note, not “scrapping”), is taking content that was not published in a feed but by reading the HTML source. Google scrapes in fact, and really they should provide an opt-in rather than an opt-out via robots.txt. We let Google get away with it because they benefit us with traffic, your scraper+adsense site will not get the same treatment.
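
For readers unsure what "reading the HTML source" amounts to, here is a minimal sketch using Python's standard-library HTML parser. The class name `TitleAndLinks`, the helper `scrape`, and the sample page are all invented for illustration; a real scraper would fetch the page over HTTP first and would have to cope with far messier markup.

```python
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    """Collect the page title and every link target from raw HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page source standing in for a fetched document.
SAMPLE_HTML = (
    "<html><head><title>Sample post</title></head><body>"
    "<p>See <a href='/a.html'>A</a> and <a href='/b.html'>B</a>.</p>"
    "</body></html>"
)

def scrape(html: str):
    """Return (title, links) pulled out of the HTML text."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title, parser.links

print(scrape(SAMPLE_HTML))
```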

Also, note published US patent application 20060174106, "System and method for obtaining a digital certificate for an endpoint," which discusses public key infrastructure (PKI).

