musings on self and search

As you may be aware, I developed SelfSolved to avoid cluttering up this blog with “I fixed this!” posts. I’m trying to get into the habit of writing at a higher conceptual level, instead of letting this become just another technical problems blog. It turns out that without writing about random things I fixed, I have little else to say. When I wrote up problems for this blog, I posted monthly. Now I’ve gone three months without writing anything.

It’s odd, since Ph.D. students (and former startup founders) tend to hold forth in very long blog posts, especially on technical topics they care about (and sometimes on topics they don’t). It may speak to my lack of aptitude for Ph.D. work; I hope not.

When I implemented SelfSolved, I created a sitemap for search engine crawlers as a matter of course, but did not do so immediately. A lot of content was already on the site before I completed the sitemap feature. It turns out that search engines like the GOOG seem to ignore certain content in the sitemap; so much for indexing all the world’s knowledge. I thought the crawler would get around to it eventually, but that doesn’t appear to be the case after nearly a year.
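For the curious, a sitemap is not much code. What follows is a minimal sketch in Python of the general idea, not the actual SelfSolved implementation: the function name `build_sitemap` and the example.com URLs are made up for illustration, and I’m assuming each entry is just a URL plus a last-modified date.

```python
from datetime import date
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML (as a string) for (url, last_modified) pairs.

    `entries` is a hypothetical input shape chosen for this sketch.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # The protocol accepts W3C dates such as YYYY-MM-DD.
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Placeholder URLs, not real SelfSolved addresses.
    print(build_sitemap([
        ("https://example.com/problems/1", date(2010, 1, 2)),
        ("https://example.com/problems/2", date(2010, 3, 4)),
    ]))
```

A real site would generate the entries from its database and serve the result at a URL that the crawler is told about, for example through a `Sitemap:` line in robots.txt.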

It’s interesting, because there don’t seem to be any rules. For example, “turning caching off completely in Pylons” has a very short problem statement and solution, so I could imagine a crawler skipping it. Other posts, however, are fairly normal: “SVN directory is viewable through Apache” is by all accounts a normal-sized SelfSolved problem, with references and a full solution writeup. As of this post, neither is indexed.

As soon as I wrote this blog post on this WordPress installation, it became available in GOOG’s main index. If I search for the phrases above, I get an entry back immediately. For a site like SelfSolved, though, entries don’t show up for days or weeks, even though SelfSolved also publishes an Atom feed of its content and pings the same GOOG notification URIs. This suggests that building your site on a well-used CMS, rather than on custom-designed software, gets you into the index in near real time instead of days. If you run a for-profit, content-driven site (which SelfSolved is not), that could mean a fair bit of money.

It’s no wonder that SEO consultancies are flourishing. There seem to be hidden rules to crawling that go beyond the simple “create good content” and “get a sitemap” guidelines that the GOOG would have you believe are all that matter.