GoogleSitemaps

Summary: How to submit a complete list of web pages to Google
Version:
Prerequisites:
Status:
Maintainer:
Categories: RSS, Integration, Robots

Search engines, and especially Google, are a major source of visitors for many, if not most, websites.

Thorough indexing of a website by a search engine spider (for example Googlebot) is key to achieving good search engine results.

A spider visits a web page, indexes it, and then crawls on by following the links on that page. PmWiki ensures proper linkage between the wiki pages and makes it easy to generate a sitemap with the (:pagelist:) directive. Still, because a spider indexes a website step by step, it can take a while before a site is fully indexed, and a while longer before added or changed pages are re-spidered.

Recently Google introduced a new method to have a website indexed: Google Sitemaps, released, as usual, as a beta program.

Google Sitemaps allows a webmaster to submit a complete list of web pages to Google. Several content management systems already provide a way to use Google Sitemaps. I think it's time for PmWiki to do so as well.

Using RSS

One way to provide a (partial) index to Google Sitemaps is to use the RSS feed that PmWiki provides, based on, for example, Main.AllRecentChanges:

http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=rss
NOTE: With recent changes to PmWiki you should now use: http://yoursite.com/pmwiki.php?n=Site.AllRecentChanges&action=rss

  • The RSS module must be enabled: include_once("scripts/rss.php");

Do not use a URL of the form ..../pmwiki.php/Main/AllRecentChanges?action=rss. Why? From Google:

The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://yoursite.com/catalog/sitemap.xml can include any URLs starting with http://yoursite.com/catalog/ but cannot include URLs starting with http://yoursite.com/images/.

Thus a feed at the URL above could not add pages under .../pmwiki.php/Cookbook/... to the index, because they lie outside .../pmwiki.php/Main/.
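The scope rule quoted from Google can be checked mechanically. The following is only a sketch in POSIX shell: the function names in_sitemap_scope and check are made up for illustration, and the rule is simplified to "the page URL must start with the directory part of the sitemap URL".

```shell
#!/bin/sh
# Succeeds (exit status 0) when page_url lies under the directory that
# contains sitemap_url. in_sitemap_scope is a hypothetical helper name.
in_sitemap_scope() {
    sitemap_url=$1
    page_url=$2
    no_query="${sitemap_url%%\?*}"   # drop the query string, if any
    scope="${no_query%/*}/"          # drop the file name, keep the trailing slash
    case "$page_url" in
        "$scope"*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Print a one-line verdict for a sitemap URL and a page URL.
check() {
    if in_sitemap_scope "$1" "$2"; then
        echo "in scope:     $2"
    else
        echo "out of scope: $2"
    fi
}

# The two cases from Google's own example.
check "http://yoursite.com/catalog/sitemap.xml" "http://yoursite.com/catalog/item1.html"
check "http://yoursite.com/catalog/sitemap.xml" "http://yoursite.com/images/logo.png"

# Path-style feed URL: only pages under .../pmwiki.php/Main/ are covered.
check "http://yoursite.com/pmwiki.php/Main/AllRecentChanges?action=rss" \
      "http://yoursite.com/pmwiki.php/Cookbook/GoogleSitemaps"

# Query-style feed URL: the scope is the whole site.
check "http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=rss" \
      "http://yoursite.com/pmwiki.php?n=Cookbook.GoogleSitemaps"
```

With the ?n= form the feed sits directly under http://yoursite.com/, which is why that form is recommended here.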

Set parameters for a more complete list

It might be useful to tweak the RSS feed a little, because by default it only displays the last 20 changes:

  if ( $action=="sitemap" ) {
    $RssMaxItems=50000;                           # maximum items to display
    $RssSourceSize=0;                        # max size to build desc from
    $RssDescSize=0;                          # max desc size
    $action="rss";
  }
  include_once("scripts/rss.php");

Set .htaccess to overcome directory layout restrictions

Google is quite strict about the directory layout: the sitemap URL must be in the top directory of your website. However, redirects are accepted, so a little tweak in .htaccess can overcome that restriction:

 Redirect /sitemap.rss http://gnuada.sourceforge.net/index.php/Site/AllRecentChanges?action=sitemap

Now use a syntax like:

http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=sitemap

Submit this link to Google Sitemaps using the ping link or the web form (see the Google pages for details).

Using XML-Sitemap

Google provides a special XML schema for this purpose.

The benefit of using the XML sitemap schema is its extra tags:

  • priority: how important this page is relative to the other pages on the site
  • changefreq: how often the page is updated
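To make the two tags concrete, here is a minimal sketch of what sitemap entries could look like, printed from a shell function. The function names xml_escape and sitemap_entry, the page URLs, and the values are illustrative assumptions and not part of this recipe; the element names and the namespace are those of the sitemaps.org 0.9 protocol.

```shell
#!/bin/sh
# Escape the characters that are special in XML text (& must be handled first).
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Print one <url> entry.
# Arguments: location, last modification time, change frequency, priority.
sitemap_entry() {
    printf '  <url>\n'
    printf '    <loc>%s</loc>\n' "$(xml_escape "$1")"
    printf '    <lastmod>%s</lastmod>\n' "$2"
    printf '    <changefreq>%s</changefreq>\n' "$3"
    printf '    <priority>%s</priority>\n' "$4"
    printf '  </url>\n'
}

printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
sitemap_entry "http://yoursite.com/pmwiki.php?n=Main.HomePage" \
              "2008-03-05T03:34:00+00:00" "daily" "0.8"
sitemap_entry "http://yoursite.com/pmwiki.php?n=Cookbook.GoogleSitemaps&action=print" \
              "2008-03-05" "monthly" "0.5"
printf '%s\n' '</urlset>'
```

Note that an ampersand in a URL has to be written as &amp; inside the XML, which is what xml_escape takes care of.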

The changefreq could be derived from the page history. I’m not sure yet how to determine the priority of a page; probably by using some pattern array.
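One possible way to derive changefreq from a page history is to average the time between edits and map that to one of the values the protocol allows. This is a sketch under assumptions, not what the recipe's script does: the function names and the thresholds are invented for illustration, and the timestamps are assumed to be Unix epoch seconds, oldest first.

```shell
#!/bin/sh
# Map an average number of seconds between edits to a changefreq value.
# The thresholds are arbitrary choices for this sketch.
changefreq_for_interval() {
    secs=$1
    if   [ "$secs" -le 3600 ];    then echo hourly
    elif [ "$secs" -le 86400 ];   then echo daily
    elif [ "$secs" -le 604800 ];  then echo weekly
    elif [ "$secs" -le 2678400 ]; then echo monthly   # 31 days
    else                               echo yearly
    fi
}

# Arguments: edit timestamps in epoch seconds, oldest first.
changefreq_from_history() {
    if [ $# -lt 2 ]; then
        echo yearly          # a single revision gives no interval; assume rare changes
        return
    fi
    first=$1
    intervals=$(( $# - 1 ))
    while [ $# -gt 1 ]; do shift; done   # leave only the newest timestamp in $1
    changefreq_for_interval $(( ($1 - $first) / $intervals ))
}

# Three edits, one day apart.
changefreq_from_history 1204600000 1204686400 1204772800   # prints "daily"
```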

Any thoughts are welcome.

A Basic script

Changelog

1.7: support EnablePageListProtect. Tested with pmwiki 2.1beta14-15
 Added Site to exclude pattern

Comments

  • Other Site Maps exist, like the extension module for FireFox, which offer the opportunity to have a Navigation bar… Should not your module be called GoogleSiteMap.php, with action=googlesitemap?
  • The sitemap's encoding is UTF-8, while pmwiki.php uses ISO-8859-1. For "é" I get %e9, which is, I think, ISO-8859-1 and not UTF-8. You should probably transcode the characters.
  • The time modification is not reported yet. Have you tested with the full time format (for me, with +02:00)?
  • For the priority, maybe we could increase it for the home page, and reduce it for the Recent Changes ones.
  • Note that the priority and the changefrequency are not mandatory. If the priority is always the same, I suggest not to write it in the file.
  • Could the priority be set by a page text variable? That is, pages without the variable would have a default priority, but page authors could mark pages higher or lower by setting the PTV.
    Ben Stallings December 11, 2007, at 03:02 PM
  • I'm using the clean URLs recipe and have encountered a problem. ?action=sitemap works as it is supposed to (pages are displayed as http://wiki.spounison.org/Main/Homepage), but in sitemap.xml.gz the page URLs are displayed as ?n=*** (e.g. http://wiki.spounison.org/pmwiki.php?n=Main.Homepage). That's my problem...
    ArSoron March 05, 2008, at 03:14 AM
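The suggestion above to raise the priority of the home page and lower it for the Recent Changes pages could be expressed as a small pattern table. This is only a sketch: the function name priority_for_page, the page names, and the numbers are assumptions chosen for illustration. 0.5 is the default priority in the sitemap protocol.

```shell
#!/bin/sh
# Return a sitemap priority (0.0 to 1.0) for a page name of the form Group.Name.
# The first matching pattern wins.
priority_for_page() {
    case "$1" in
        Main.HomePage)                       echo 1.0 ;;   # the site's front page
        *.RecentChanges|*.AllRecentChanges)  echo 0.1 ;;   # change logs
        *)                                   echo 0.5 ;;   # protocol default
    esac
}

for page in Main.HomePage Cookbook.GoogleSitemaps Main.RecentChanges Site.AllRecentChanges; do
    printf '%-26s %s\n' "$page" "$(priority_for_page "$page")"
done
```

A page whose priority comes out at the default could simply omit the tag, in line with the comment that priority is not mandatory.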

Older Comments

  • Can the script be made to exclude password protected groups?
Solved in version 1.7.
  • For the frequency, I think you should write at least "hourly" for Recent Changes (Group or Main).
Actually, pages like RecentChanges are not included in the sitemap. Since the sitemap already includes change times, having RecentChanges in the sitemap is not necessary.

1] It's not clear how to generate the .gz sitemap. I have set $SitemapDelay=0, made a wiki edit, and still I don't see the file. The XML is shown in the browser correctly. I temporarily made the pmwiki directory writable by all, with no success. (ref http://www.mr2wiki.com/?action=sitemap). DaveG
2] Same issue as DaveG here: ?action=sitemap returns working XML, but I'm struggling to find out how to generate the .xml.gz file. Gilrim


Here's my hack: adding a script on a linux or OS X system as a (daily? hourly?) cronjob. Say I make a bash script called "makesitemap" for each wiki on my system and put it in the webroot for the site.

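 #!/bin/bash
 # Fetch the sitemap from the wiki and keep a gzipped copy in the webroot.
 cd "$(dirname "$0")" || exit 1    # cron may start us in another directory
 # Quote the URL so the shell does not treat "?" as a wildcard.
 curl -o sitemap.xml "http://www.myurl.org/index.php?action=sitemap"
 rm -f sitemap.xml.gz              # -f: no error when there is no old copy yet
 gzip sitemap.xml
 chmod 644 sitemap.xml.gz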

I had to remove the old sitemap or the gzip command asks for overwrite verification

Now I just need a cronjob to run it. Most advanced cPanel type webhosts give you a user crontab. No, this won't work for everyone, but people worried about Google sitemaps are already getting a bit advanced :) XES
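A variation on the makesitemap hack: instead of deleting the old sitemap.xml.gz before compressing, the new archive can be written to a temporary file and renamed over the old one, so gzip never prompts and the file is never missing while the job runs. This is a sketch only; the function name compress_sitemap and the temporary file naming are assumptions, and the demonstration uses a throwaway file rather than fetching anything.

```shell
#!/bin/sh
# Compress SRC into DST without prompting and without removing DST first:
# write to a temporary file next to DST, then rename it into place.
compress_sitemap() {
    src=$1
    dst=$2
    tmp="$dst.tmp.$$"
    if gzip -9 -c "$src" > "$tmp"; then
        chmod 644 "$tmp"
        mv -f "$tmp" "$dst"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Demonstration against a throwaway directory.
workdir=$(mktemp -d)
printf '<urlset></urlset>\n' > "$workdir/sitemap.xml"
compress_sitemap "$workdir/sitemap.xml" "$workdir/sitemap.xml.gz"
compress_sitemap "$workdir/sitemap.xml" "$workdir/sitemap.xml.gz"   # second run overwrites silently
gzip -dc "$workdir/sitemap.xml.gz"
rm -rf "$workdir"
```

To run such a script every night at 03:00, a user crontab line could look like `0 3 * * * /path/to/webroot/makesitemap`, where the path is a placeholder for wherever the script actually lives.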


Okay ... I can't run bash on my server, so I figured there had to be a way of doing the same thing as above with PHP ... so I gleaned the net and came up with the following by hacking other people's code ... because I am not a programmer by any means ... ARNOLD

 <?php
 // Fetch the sitemap from the wiki and save it as sitemap.xml.
 $url  = "http://www.myurl.org/index.php?action=sitemap";
 $file = "sitemap.xml";

 $ch = curl_init($url);
 $fp = fopen($file, "w") or die("Unable to open $file for writing.\n");

 curl_setopt($ch, CURLOPT_FILE, $fp);            // write the response body to $fp
 curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP errors (>= 400) as failures
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects

 $ok = curl_exec($ch);
 curl_close($ch);
 fclose($fp);

 if (!$ok) {
     // Stop here so a failed download is not compressed over a good sitemap.xml.gz.
     die("Unable to fetch $url.\n");
 }

 // Gzip $srcName into $dstName at maximum compression.
 function compress($srcName, $dstName)
 {
     $data = file_get_contents($srcName);
     $zp = gzopen($dstName, "w9");
     gzwrite($zp, $data);
     gzclose($zp);
 }

 // Compress the downloaded file
 compress($file, "$file.gz");
 ?>

Page last modified on March 05, 2008, at 03:34 AM