Thursday, May 27, 2010

XmlSitemap Solution for SharePoint 2007

I've just finished an initial version of an XmlSitemap generator solution for SharePoint 2007. XmlSitemaps are meant to help search engines like Google index your website better. For many companies this is an important part of SEO.

The following XmlSitemap solutions already existed for SharePoint 2007:
http://blog.mastykarz.nl/imtech-xml-sitemap-free-sharepoint-feature/
http://www.thesug.org/Blogs/lsuslinky/archive/2009/04/17/SharePoint_SiteMap_Generator__Version_2.aspx.aspx
http://www.kwizcom.com/ProductPage.asp?ProductID=737&ProductSubNodeID=738

I played mix & match to create my own solution.

I started out by generating an XmlDocument and uploading it as a file (sitemap.xml) to the root web of a site collection. Unfortunately I had to drop this easy solution when I realized that the URLs in the XmlSitemap protocol have to be absolute URLs. This would pose a problem when:
* ContentDeployment is enabled, because the generated file would be deployed to a different farm, which most likely has a different DNS name.
* Alternate Access Mappings (AAM) are used, because only one URL would be present in the sitemap file. It would be technically possible to store all AAM URLs in one file, since search engines should select only the URLs that match the URL of the sitemap itself. But this is not very neat, as internal DNS entries would also be exposed.
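
To illustrate the second point, here is a minimal Python sketch (not the SharePoint code itself) that expands the same server-relative paths into one list of absolute URLs per public URL. The host `www.example.com` and the helper name `absolute_urls` are made up for illustration.

```python
from urllib.parse import urljoin

def absolute_urls(public_url, server_relative_paths):
    """Turn server-relative paths into absolute URLs for one public URL."""
    base = public_url.rstrip("/") + "/"
    return [urljoin(base, path.lstrip("/")) for path in server_relative_paths]

# Hypothetical access mappings: one internal, one public.
mappings = ["http://hbrmosdev01:41238", "http://www.example.com"]
paths = ["/Pages/default.aspx", "/de/Pages/default.aspx"]

# One separate URL list per mapping, so internal host names never
# end up in the sitemap that is served on the public host.
url_sets = {mapping: absolute_urls(mapping, paths) for mapping in mappings}
```

Keeping one list per mapping is what makes it possible to serve each host only its own URLs.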

I ended up with a solution consisting of a Job which runs periodically (once a day, at night). I also had to implement two HttpHandlers for serving the sitemap.xml files.

The Job creates an XmlSitemap index file and (if needed) multiple sitemap.xml files. The files are stored together as a set, as a Persisted Object under the WebApplication. For each site collection and AAM, a set of sitemap files is generated and stored.
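
As a rough sketch of what such a set could look like, the Python below builds an index plus numbered sitemap files for one public URL and returns them as a dictionary of file name to XML. This is not the job's actual code: the function name `build_sitemap_set`, the index file name `sitemap.xml` and the tuple layout of the input are assumptions.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
DECLARATION = '<?xml version="1.0" encoding="utf-8"?>\n'

def to_xml(root):
    """Serialize an element tree to a string with an XML declaration."""
    return DECLARATION + ET.tostring(root, encoding="unicode")

def build_sitemap_set(base_url, chunks, today):
    """Return {file name: xml} for one site collection and public URL.

    chunks is a list of lists of (loc, lastmod) tuples, one list per file.
    """
    files = {}
    index = ET.Element("sitemapindex", xmlns=NS)
    for i, chunk in enumerate(chunks):
        name = f"sitemap.{i}.xml"
        urlset = ET.Element("urlset", xmlns=NS)
        for loc, lastmod in chunk:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
            ET.SubElement(url, "lastmod").text = lastmod
        files[name] = to_xml(urlset)
        # Reference the file from the index with an absolute URL.
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/{name}"
        ET.SubElement(entry, "lastmod").text = today
    files["sitemap.xml"] = to_xml(index)  # index file name is an assumption
    return files

files = build_sitemap_set(
    "http://hbrmosdev01:41238",
    [[("http://hbrmosdev01:41238/Pages/default.aspx", "2010-04-21")]],
    "2010-05-27",
)
```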

The HttpHandlers look up which set of sitemap files should be used and write the stored XML directly to the response.
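
The lookup can be pictured as follows. This is a simplified Python sketch, not the real HttpHandler: it assumes the stored sets are keyed by host name (including port), and the function name `handle_request` is hypothetical.

```python
def handle_request(store, host, path):
    """Return (status, body) for a sitemap request.

    store maps a host (including port) to {file name: stored xml}.
    """
    files = store.get(host.lower())
    if files is None:
        return 404, ""
    body = files.get(path.lstrip("/").lower())
    if body is None:
        return 404, ""
    # The XML was generated earlier, so it is returned as-is.
    return 200, body

store = {
    "hbrmosdev01:41238": {
        "sitemap.xml": "<sitemapindex/>",
        "sitemap.0.xml": "<urlset/>",
    }
}
status, body = handle_request(store, "HBRMOSDEV01:41238", "/sitemap.0.xml")
```

Because the handler only does two dictionary lookups, the cost of a request does not depend on the size of the site.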

This is an example of a generated XmlSitemap Index file:
<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
  <loc>http://hbrmosdev01:41238/sitemap.0.xml</loc>
  <lastmod>2010-05-27</lastmod>
 </sitemap>
 <sitemap>
  <loc>http://hbrmosdev01:41238/sitemap.1.xml</loc>
  <lastmod>2010-05-27</lastmod>
 </sitemap>
 <sitemap>
  <loc>http://hbrmosdev01:41238/sitemap.2.xml</loc>
  <lastmod>2010-05-27</lastmod>
 </sitemap>
 <sitemap>
  <loc>http://hbrmosdev01:41238/sitemap.3.xml</loc>
  <lastmod>2010-05-27</lastmod>
 </sitemap>
</sitemapindex>

As you might notice, multiple separate sitemap.xml files are referenced. The XmlSitemap protocol requires that the files do not get too large (at most 50,000 URLs or 10 MB per file).
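
Splitting on the URL count is straightforward. The Python sketch below only enforces the 50,000 URL limit; it does not check the size limit, which would need the serialized length of each file. The function name `split_into_files` is made up.

```python
def split_into_files(urls, max_urls=50000):
    """Split a list of URLs into chunks of at most max_urls entries."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]

# 120,001 URLs need three files: 50,000 + 50,000 + 20,001.
chunks = split_into_files([f"/page{i}.aspx" for i in range(120001)])
```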

This is a snippet of sitemap.0.xml:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>http://hbrmosdev01:41238/Pages/default.aspx</loc>
  <lastmod>2010-04-21</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
 </url>
 <url>
  <loc>http://hbrmosdev01:41238/de/Pages/default.aspx</loc>
  <lastmod>2010-03-10</lastmod>
  <priority>1.0</priority>
 </url>
...
</urlset>

The priority is set to "0.5" by default. The priority of welcome pages is set to "1.0". The change frequency is calculated from the list item versions: the average interval between modification dates is calculated, and an algorithm maps it to a changefreq of daily, weekly, monthly or yearly. I've decided not to implement always and never.
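
One way to turn version dates into a changefreq value is sketched below in Python. The thresholds (1, 7 and 31 days) and the choice to return nothing when there are fewer than two versions are assumptions for illustration, not the exact algorithm used in the solution.

```python
from datetime import date

def change_frequency(modified_dates):
    """Map the average interval between modifications to a changefreq."""
    if len(modified_dates) < 2:
        return None  # no interval to measure; assumed to mean "omit the element"
    ordered = sorted(modified_dates)
    # Average gap equals the total span divided by the number of gaps.
    average_days = (ordered[-1] - ordered[0]).days / (len(ordered) - 1)
    if average_days <= 1:
        return "daily"
    if average_days <= 7:
        return "weekly"
    if average_days <= 31:
        return "monthly"
    return "yearly"

versions = [date(2010, 1, 1), date(2010, 1, 8), date(2010, 1, 15)]
frequency = change_frequency(versions)  # average gap of 7 days
```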

For my needs it was important that no processing is needed when the sitemap files are requested, as that would not scale well to very large websites.

This solution should be compatible with multiple WFEs. Only one machine generates the sitemap files and stores them in the configuration database.

Code is available here

The sitemap can be submitted to search engines, but it can also be referenced from the robots.txt file.
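
A robots.txt reference is a single `Sitemap:` line with the absolute URL of the sitemap. The small Python sketch below renders such a file; the host `www.example.com`, the file name `sitemap.xml` and the helper name `robots_txt` are assumptions.

```python
def robots_txt(sitemap_url):
    """Render a minimal robots.txt that allows everything and names the sitemap."""
    lines = [
        "User-agent: *",
        "Disallow:",  # an empty Disallow blocks nothing
        f"Sitemap: {sitemap_url}",
    ]
    return "\n".join(lines) + "\n"

content = robots_txt("http://www.example.com/sitemap.xml")
```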

1 comment:

  1. Hi Sander,

    very interesting article.
    I am quite new to SharePoint. Your sitemap is exactly what I am looking for. But I do not know which code to paste where. As far as I can see, there are several pieces of code available.

    BR
