XML Sitemap and Robots.txt Considerations for WordPress Users

Photo by Cristian Labarca

If you have a WordPress blog on your own web site (not through WordPress.com) you should configure both a robots.txt file and an XML sitemap in order to provide indexing information to search robots.

If you don’t know anything about configuring robots.txt files, I encourage you to learn more about them by visiting http://www.robotstxt.org/robotstxt.html. If you don’t know anything about configuring XML sitemaps, I encourage you to learn more about them by visiting http://www.sitemaps.org/protocol.php.

Below is an outline of special considerations that WordPress users need to be aware of when setting up these files for their blogs.

Creating Your XML Sitemap

XML sitemaps help search robots index your web site, especially if your site is new and has few inbound links. If you have a standard web site, creating this XML file is a fairly easy task. If you have a blog, though, you’ll be continually adding pages, which means you’ll need to alter your sitemap every time you post something new. The easiest way to avoid this annoying task is to use a WordPress plugin. My recommendation is Arne Brachhold’s Google XML Sitemaps Generator. This plugin will create your XML sitemap for you, and then update it every time you add content to your blog.
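If you're curious what such a file contains, here is a minimal Python sketch that builds one by hand. It is not how the plugin works internally; the function name build_sitemap and the post URL are made up for illustration, and a real sitemap may also carry optional fields such as the last-modified date.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

sitemap = build_sitemap([
    "http://www.mywebsite.com/",
    "http://www.mywebsite.com/blog/hello-world/",  # hypothetical post URL
])
print(sitemap)
```

Every new post means one more entry in that list, which is exactly the chore a plugin takes off your hands.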

Uploading Your Robots.txt File

Instead of using WordPress to manage their entire site, many people use it to add a blog section to an existing site. In this case, the blog sits in its own directory (e.g., www.mywebsite.com/blog/). Even if your blog sits in its own directory, you still need to place your robots.txt file at the root level of the site (e.g., www.mywebsite.com/robots.txt) because robots.txt files are written to be site-specific and not directory-specific. In other words, if you don’t put your robots.txt file at the root level, search robots won’t find it, and the file will be useless.
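To make the root-level rule concrete, the following Python sketch works out where a robot would look for robots.txt, given any page on the site. The helper name robots_txt_url and the sample post URL are my own inventions for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the one location where robots look for robots.txt on this host."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, discard the page's own path entirely.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# A post deep inside the /blog/ directory (hypothetical URL).
print(robots_txt_url("http://www.mywebsite.com/blog/2009/some-post/"))
# prints: http://www.mywebsite.com/robots.txt
```

Notice that the /blog/ part of the path plays no role in the result, which is why a robots.txt file saved inside the blog directory goes unread.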

Listing Your Sitemap in Your Robots.txt File

Major search engines now support listing a URL for your XML sitemap right in your robots.txt file. Listing your sitemap makes it easier for search robots to discover your pages, which can help them get indexed more quickly. While it’s true that some smaller search engines don’t yet recognize this sitemap listing, there’s no disadvantage to adding it to your robots.txt file. Smaller search engines won’t penalize you for it.

To add your XML sitemap to your robots.txt file, follow these simple steps:

  1. Create your XML sitemap and save it, at the root level of your site, as “sitemap.xml”.
  2. At the bottom of your robots.txt file, write “Sitemap: http://www.mywebsite.com/sitemap.xml”, making sure to replace “www.mywebsite.com” with your own domain name.
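
The two steps above can be sketched in Python as follows. This is only an illustration: the helper add_sitemap_line is hypothetical, and the existing robots.txt contents (including the /wp-admin/ rule) are an assumed example rather than a recommendation. The check at the end uses the standard library's robots.txt parser, whose site_maps() method requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

def add_sitemap_line(robots_txt, sitemap_url):
    """Append a Sitemap line to robots.txt text unless it is already there."""
    line = "Sitemap: " + sitemap_url
    if line in robots_txt.splitlines():
        return robots_txt
    return robots_txt.rstrip("\n") + "\n\n" + line + "\n"

# Assumed existing file contents, for illustration only.
existing = "User-agent: *\nDisallow: /wp-admin/\n"
updated = add_sitemap_line(existing, "http://www.mywebsite.com/sitemap.xml")
print(updated)

# Confirm that a robots.txt parser actually picks up the sitemap listing.
parser = RobotFileParser()
parser.parse(updated.splitlines())
print(parser.site_maps())  # ['http://www.mywebsite.com/sitemap.xml']
```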

Note: if you use a WordPress plugin to create your sitemap, it may save the sitemap as “sitemap.xml.gz”. Don’t be alarmed, as this is a perfectly valid sitemap file. The “.gz” file extension simply means that your sitemap has been compressed with gzip, which is great for large sites, as it saves bandwidth.
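
If you want to convince yourself that a gzipped sitemap is the same file underneath, this small Python sketch compresses a sample sitemap and restores it. The sitemap contents here are a made-up one-URL example, not output from any plugin.

```python
import gzip

# A tiny sample sitemap (hypothetical contents).
sitemap_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    '  <url><loc>http://www.mywebsite.com/</loc></url>\n'
    '</urlset>\n'
).encode("utf-8")

compressed = gzip.compress(sitemap_xml)   # what would be saved as sitemap.xml.gz
restored = gzip.decompress(compressed)    # what a search robot sees after unpacking

print(restored == sitemap_xml)  # True
```

The savings only become noticeable on sitemaps with many URLs; on a file this small the difference is negligible.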

Don’t Necessarily Disallow Directories Just Because You Don’t Want Them Indexed

Many web designers, developers, and bloggers who are new to SEO (Search Engine Optimization) quickly grow afraid of having duplicate content on their web sites. For those of you who don’t know about the dangers of duplicate content, I encourage reading up on the subject at Google.com.

In the case of a WordPress blog, it’s quite easy to have duplicate content on your site. For instance, if you go to the following pages on this web site, you’ll see that the content is almost identical:

http://www.opensourceartistry.com/category/film-video-television/
http://www.opensourceartistry.com/category/film-video-television/independent-film-video/
http://www.opensourceartistry.com/tag/film/

If I allowed search robots to index all of these pages, I could be flagged for duplicate content, and my search rankings could suffer. The easiest way to avoid this is to Disallow my Category, Tag, Author, Archive, Search, and RSS directories in my robots.txt file, but this would be a serious mistake.

Every time you cut off an avenue for a search robot to enter your site, you lose opportunities for your site to be indexed and for your site to rise in search rankings. The best way to solve the problem of duplicate content on a WordPress blog, then, is to tell search engines not to index your Category, Tag, Author, Archive, Search, and RSS directories, but to crawl them anyway, so the robots can still get to the content you do want indexed.
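
To see why Disallow is the wrong tool here, consider this Python sketch, which uses the standard library's robots.txt parser to ask whether a robot may fetch one of my tag pages at all. Both robots.txt variants are assumed examples written for this illustration, and can_crawl is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt, url):
    """Return True if a generic robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)

# The tempting approach: shut robots out of the archive directories.
blocking = "User-agent: *\nDisallow: /category/\nDisallow: /tag/\n"
# The approach argued for here: leave crawling open (empty Disallow allows all).
permissive = "User-agent: *\nDisallow:\n"

url = "http://www.opensourceartistry.com/tag/film/"
print(can_crawl(blocking, url))    # False: the robot never sees the links on this page
print(can_crawl(permissive, url))  # True: the robot can follow links to the posts
```

A blocked robot cannot read the page, so it cannot follow the page's links either. Keeping the page crawlable and marking it as not-to-be-indexed avoids that loss.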

“How do you tell search engines to do this?” you might ask. Well, another WordPress plugin can help. I recommend installing the Robots Meta Plugin by Joost de Valk. Once you’ve installed the plugin, access it through your control panel and check off all the directories you don’t want to be indexed. Below is a list of directories I recommend not having indexed. Just keep in mind that you might choose a different configuration, depending on how you have your blog set up and which other plugins you have installed.

  • All RSS Feeds
  • The Site’s Search Results Page
  • The Login and Register Pages
  • All Admin Pages
  • Subpages of the Homepage
  • Author Archives
  • Date-based Archives
  • Category Archives
  • Tag Archives

Note: if you want to make sure your Robots Meta configuration is working properly, go to one of the pages you’ve chosen not to be indexed, right-click on the page, and view the page’s source. If the Robots Meta plugin is working correctly, you should see the following code toward the top of the page:

<meta name="robots" content="noindex,follow" />
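
If you would rather not read through page source by hand, a short Python sketch like the one below can pull the robots directives out of a page's HTML. The class RobotsMetaFinder and the function robots_directives are names I made up for this example, and the sample page is a stripped-down stand-in for a real archive page; in practice you would pass in the HTML you saved or downloaded.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]

def robots_directives(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives

# Stand-in for the source of a page you chose not to have indexed.
page = '<html><head><meta name="robots" content="noindex,follow" /></head><body></body></html>'
print(robots_directives(page))  # ['noindex', 'follow']
```

An empty list would mean the tag is missing, so the page carries no noindex instruction.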

If you have any questions regarding this post, please don’t hesitate to register and leave a comment. Happy blogging!

Tags: Blogging, Blogs, Plugins, Robots.txt, SEO, Sitemaps, WordPress, XML
