I was trying to trace the source of the quote “Any sufficiently advanced financial instrument is indistinguishable from fraud.” If you run a quoted Google search restricted to a custom date range, an interesting problem shows up.
The results contain pages originally published in 2005 but re-indexed recently. During re-indexing, the author’s current tweets were visible to the crawler and got indexed along with the original article. This makes it seem as if the quoted text was first mentioned in 2005, whereas it’s actually only a recent meme.
One way to avoid this might be to identify dynamic widgets such as Twitter, news, and weather feeds and exclude them from the index. The HTTP header (pasted below) shows only a recent date (November 2010), which probably means that Google is getting the 2005 date either from the first time it indexed the post or from the URL itself. Whatever the case, it’s an interesting problem to distinguish between the ‘original’ content and other dynamically added elements on a page.
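To make the widget idea a bit more concrete, here is a rough sketch in Python of how a crawler might drop feed-like blocks before indexing. It is only an illustration of the idea, not how Google or any real crawler works. The names `WIDGET_HINTS`, `StaticTextExtractor` and `indexable_text` are made up for this example, and the assumption that widgets can be spotted from their `id` or `class` attributes is a guess that will miss plenty of real pages.

```python
from html.parser import HTMLParser

# Hypothetical substrings that hint at a dynamic widget (an assumption).
WIDGET_HINTS = ("twitter", "tweet", "weather", "newsfeed", "widget")

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StaticTextExtractor(HTMLParser):
    """Collects page text while skipping subtrees that look like widgets."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.chunks = []

    def _looks_dynamic(self, attrs):
        for name, value in attrs:
            if name in ("id", "class") and value:
                if any(hint in value.lower() for hint in WIDGET_HINTS):
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in ("script", "style") or self._looks_dynamic(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def indexable_text(html):
    """Return the text a crawler would keep after dropping widget blocks."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The depth counter assumes reasonably well-formed markup. An unclosed `<p>` or `<li>` inside a widget would throw it off, so a real implementation would need a proper DOM parser.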
    HTTP/1.1 200 OK
    Server: nginx
    Date: Tue, 23 Nov 2010 17:50:20 GMT
    Content-Type: text/html; charset=UTF-8
    Transfer-Encoding: chunked
    Connection: close
    Vary: Cookie, Accept-Encoding
    X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
    X-Pingback:
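For what it's worth, here is a small Python sketch of the two date sources mentioned above: the dates in the response header and a date embedded in the URL. The header above has a `Date` field and, in the part shown, no `Last-Modified` field. The function names and the example URL pattern (`/YYYY/MM/DD/`) are assumptions for illustration, since the actual post URL isn't shown here, and this is not a claim about how Google dates a page.

```python
import re
from datetime import date
from email import message_from_string
from email.utils import parsedate_to_datetime

# A few fields copied from the header above (status line omitted,
# since the email parser expects only "Name: value" lines).
RAW_HEADERS = """\
Server: nginx
Date: Tue, 23 Nov 2010 17:50:20 GMT
Content-Type: text/html; charset=UTF-8
"""


def header_dates(raw):
    """Return whichever of Date / Last-Modified are present, parsed."""
    msg = message_from_string(raw)
    found = {}
    for name in ("Date", "Last-Modified"):
        value = msg.get(name)
        if value is not None:
            found[name] = parsedate_to_datetime(value)
    return found


def date_from_url(url):
    """Guess a publication date from a /YYYY/MM/DD/ path segment, if any."""
    match = re.search(r"/(\d{4})/(\d{1,2})/(\d{1,2})/", url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. /2005/13/40/ is not a real date
        return None
```

With a hypothetical permalink such as `https://example.com/2005/03/14/some-post/`, `date_from_url` would yield a 2005 date even though the header only carries the 2010 response date.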
P.S. An interesting way to advertise. You have read this header anyway so you might want to apply for the job :)
P.P.S. On second thought, it’s not much of a ‘challenge’ per se. It’s just an interesting problem.