Plugins to fetch information from websites

These plugins fetch information about items in the default collections from websites and fill the item fields with everything they can retrieve. This page describes how to create such a plugin. Please keep the Coding conventions in mind when writing one.

Preparation

The easiest way to begin a new plugin is to copy an existing one. Plugins are found in lib/gcstar/GCPlugins/GCxxx, where xxx is the kind of collection the plugin concerns. For example, plugins that fetch information for movies are in lib/gcstar/GCPlugins/GCfilms.

You can also start from the template provided with the GCstar sources: GCSiteTemplate.pm in the templates directory.

Give your file an explicit name. The first two letters must always be GC and the suffix must be .pm.

The first line contains something like this:

package GCPlugins::GCxxx::GCyyy;

Change yyy so that it matches your file name (without its suffix). A few lines below, there is this text:

package GCPlugins::GCxxx::GCPluginyyy;

Don’t change GCPlugin in the last part, but replace yyy with the same value as before.

Interface

Here is the list of methods your plugin should implement. The interface is object-oriented, so the first parameter of each method is always a reference to an object. This object is an instance of your package and does the actual work. The same instance may be reused throughout a user session, but this is not guaranteed, so your package must work either way. In practice, reset any internal values that must not carry over from one fetch to the next, and avoid relying on values stored by a previous fetch.

new

Parameters
Package name.
Returns
A blessed reference to the created object (it should be a hash reference). You need to call the constructor of the base class, GCPlugins::GCxxx::GCxxxPluginsBase.
Description
This is the constructor of the plugin object. You may initialize any internal values here. You must also initialize a field (hasField) containing a reference to a hash whose keys are the names of the fields a search will return. These fields are listed in the .gcm file describing the collection, between the results tags (in collection/options/fields). The associated value is 1 when the plugin returns a value for this field during a search, and 0 when it doesn’t.
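
The constructor described above could look roughly like the following minimal sketch for a films plugin. The package name GCPluginExample and the field names (title, date, director, actors) are assumptions made for illustration; take the real field names from the .gcm file. The stub base package is there only so the snippet runs on its own; a real plugin inherits from the GCstar base class instead of defining it.

```perl
use strict;
use warnings;

# Stub standing in for the real GCstar base class, so that this sketch
# runs on its own. A real plugin does not define this package.
package GCPlugins::GCfilms::GCfilmsPluginsBase;
sub new { my ($class) = @_; return bless({}, $class); }

# Hypothetical plugin package (the name "Example" is an assumption).
package GCPlugins::GCfilms::GCPluginExample;
our @ISA = ('GCPlugins::GCfilms::GCfilmsPluginsBase');

sub new
{
    my $proto = shift;
    my $class = ref($proto) || $proto;

    # Let the base class build the object, then bless it into our package
    my $self = $class->SUPER::new();
    bless($self, $class);

    # 1: the field is filled in search results; 0: it is not.
    # Field names are assumptions; use the ones from the .gcm file.
    $self->{hasField} = {
        title    => 1,
        date     => 1,
        director => 0,
        actors   => 0,
    };

    return $self;
}
```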

getName

Parameters
None.
Returns
A string containing the plugin name.
Description
The returned name is the one displayed in the application, so it should be explicit and unique to avoid confusion.

getAuthor

Parameters
None.
Returns
A string containing the author’s name.
Description
You can return your real name or a nickname here. It is displayed in the application when the user selects a plugin.

getLang

Parameters
None.
Returns
A string containing the website language.
Description
This helps users know which plugins they can use. GCstar also uses this value internally to automatically select plugins in the same language as the user’s. It therefore has to be a two-letter language code. If the language is already supported by GCstar, make sure you use the same code as the one used for the translation.

getCharset

Parameters
None.
Returns
A string containing the website charset.
Description
You can find this information in the header of the website’s pages. If you don’t implement this method, the default value is ISO-8859-1.

getSearchCharset

Parameters
None.
Returns
A string containing the encoding to use for the search term.
Description
The character set used to encode the search term. It can sometimes differ from the page encoding, but it defaults to the page charset. “utf8” is usually a safe choice when searches containing international characters need to work.
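
The descriptive methods above are usually one-liners. The following is a sketch only: the package name, the site name, the author and the charset values are all placeholders, and the exact form of the language code ('EN' here) is an assumption to be checked against the code GCstar uses for its translations.

```perl
use strict;
use warnings;

# Hypothetical plugin package; every returned value is a placeholder.
package GCPlugins::GCfilms::GCPluginExample;

# Name displayed in the application: explicit and unique
sub getName          { return 'Example Movie Site'; }

# Real name or nickname of the plugin author
sub getAuthor        { return 'Jane Doe'; }

# Two-letter language code of the website (form assumed here)
sub getLang          { return 'EN'; }

# Charset of the pages served by the website
sub getCharset       { return 'UTF-8'; }

# Charset used to encode the search term
sub getSearchCharset { return 'utf8'; }
```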

getSearchUrl

Parameters
Text to search.
Returns
URL of the page containing search results. Optionally, the POST parameters.
Description
This method should build the full URL of the page containing the results for the user’s query. The value it receives has already been prepared for direct use in a URL without any conversion (i.e. all special characters have been escaped).
If the website uses the GET method for its search form, everything is contained in the URL. Some websites use the POST method instead; the parameters must then be returned along with the URL, in a reference to an array of alternating keys and values. Here is an example of a getSearchUrl for a website that uses POST:

sub getSearchUrl
{
    my ($self, $word) = @_;
    return ('http://www.example.com/search.php', ['query' => $word, 'type' => 'movies']);
}

In this example, the website receives 2 parameters, query and type, with the corresponding values.
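
For comparison, a GET-based search needs only the URL. This is a minimal sketch reusing the same made-up address and parameter names as the POST example; they are illustrative, not those of a real website.

```perl
use strict;
use warnings;

sub getSearchUrl
{
    my ($self, $word) = @_;

    # $word arrives already escaped, so it is appended without conversion.
    # The address and parameter names are placeholders.
    return 'http://www.example.com/search.php?type=movies&query=' . $word;
}
```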

getItemUrl

Parameters
The URL of an item or none.
Returns
Full URL of the page describing an item.
Description
This method can be called in 2 different contexts. During searches, the results may contain only a relative URL to the description page; this method then has to prepend the website address so that it returns a full URL. It is also called when a URL is dragged and dropped from a web browser onto GCstar. In that case the application calls this method with no parameter and tries to match the dropped URL against the one the plugin returns, so it knows which plugin should parse the page.

preProcess

Parameters
Full content of the page.
Returns
Modified content of the page.
Description
Before a page is parsed (see next section), you may want to change its content, for instance to remove unused parts or fix broken tags. This method is the place to do it. You can also test $self->{parsingList}, as described later.
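
As an illustration of the kind of clean-up preProcess can perform, here is a hedged sketch. The particular substitutions are assumptions chosen for the example (no specific website is implied); a real plugin would apply whatever fixes its target pages need.

```perl
use strict;
use warnings;

sub preProcess
{
    my ($self, $html) = @_;

    # Drop script and style blocks, which carry no item data
    $html =~ s{<script\b.*?</script>}{}gis;
    $html =~ s{<style\b.*?</style>}{}gis;

    if ($self->{parsingList})
    {
        # Results page only: turn non-breaking spaces into plain spaces
        $html =~ s/&nbsp;/ /g;
    }

    # Always return the (possibly modified) page content
    return $html;
}
```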

getNumberPasses

Parameters
None.
Returns
An integer representing the number of times the plugin needs to search.
Description
Most plugins only need to search once, so this defaults to one pass. For some collections, such as TV episodes, the plugin needs to search twice: once for the series name and once for the episode number. The current pass number is stored in $self->{pass}. See the GCTvdb.pm file for a working example.

isPreferred

Parameters
None.
Returns
True if the plugin should be the default one GCstar uses. Defaults to false.
Description
Determines whether the plugin should be the default plugin GCstar uses. Plugins with this attribute set appear first in the plugin list and are highlighted with a star icon.
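
Both getNumberPasses and isPreferred are optional and, when present, tend to be very short. A sketch for a hypothetical two-pass plugin that also asks to be preferred might be:

```perl
use strict;
use warnings;

# Two searches are needed (for instance series name, then episode number)
sub getNumberPasses
{
    return 2;
}

# Any true value marks the plugin as preferred
sub isPreferred
{
    return 1;
}
```

Omitting both methods keeps the defaults described above: a single pass, and not preferred.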

Parsing the pages

The plugins are event-based HTML parsers. That means they go through an HTML page, and some functions are called when certain events occur.

When a tag (such as <p> or <a href=...>) starts, the start method is called. When there is some textual content, the text method is called. When a tag ends, the end method is called. Refer to the HTML::Parser documentation for more information, as it is the base package of your package (provided you didn’t remove the use base clause during preparation). We assume here that the reference to the current object is in a variable named $self.

Inside these methods, there are 2 main blocks depending on the value of $self->{parsingList}. If this is a true value, that means we are parsing a results page (the list of items that match a query). If this is a false value, we are parsing the information for a given item.

parsing the search results

When parsing search results, you have to fill an array named $self->{itemsList}. Each item of this array is a reference to a hash. Each key of a hash is the name of a field (the same names as in $self->{hasField}, initialized in the new method). The values are the ones extracted from the parsed page.
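
To make this concrete, here is a sketch of handlers filling $self->{itemsList}. It is written as a standalone package so that it runs without GCstar or HTML::Parser; in a real plugin the same methods are invoked by the parser through the base class. The package name, the assumed markup (a link with class "result" per hit), the isTitle flag and the use of itemIdx as a running index are all assumptions for illustration.

```perl
use strict;
use warnings;

package ExampleResultsHandlers;    # hypothetical name

sub new
{
    my ($class) = @_;
    # itemIdx starts at -1 so that the first result lands at index 0
    return bless({ parsingList => 1, itemsList => [],
                   itemIdx => -1, isTitle => 0 }, $class);
}

sub start
{
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    return unless $self->{parsingList};

    # Assumed markup: <a class="result" href="/item/42">Title</a>
    if ($tagname eq 'a' && ($attr->{class} || '') eq 'result')
    {
        $self->{itemIdx}++;
        $self->{itemsList}[$self->{itemIdx}]->{url} = $attr->{href};
        $self->{isTitle} = 1;    # the next text event is the title
    }
}

sub text
{
    my ($self, $origtext) = @_;
    return unless $self->{parsingList} && $self->{isTitle};

    $self->{itemsList}[$self->{itemIdx}]->{title} = $origtext;
    $self->{isTitle} = 0;
}

sub end
{
    my ($self, $tagname) = @_;
    $self->{isTitle} = 0 if $tagname eq 'a';
}
```

Calling start, text and end by hand with the values a parser would pass is enough to try the logic out before wiring it into a plugin.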

handling of unique results

Some websites don’t return a results list but go straight to the item description page when there is exactly one search result. The plugin needs to detect this and tell GCstar to treat the page as the item description page rather than a results page. The following code from GCImdb.pm is an example of how to do this. For IMDb, this is achieved by checking that the page heading doesn’t include “Title Search”.

if (($self->{inside}->{h1})
  && ($origtext !~ m/IMDb\s*Title\s*Search/i))
    {
        $self->{parsingEnded} = 1;
        $self->{itemIdx} = 0;
        $self->{itemsList}[0]->{url} = $self->{loadedUrl};
    }

parsing the item description

When parsing item description, the values have to be stored in $self->{curInfo}->{fieldName}, where fieldName is the same name as the one in the .gcm file.
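
The item branch can be sketched the same way. Again this is a standalone package so that it runs without GCstar; the package name, the assumption that the page's h1 holds the title, and the way the inside counters are maintained in start and end are illustrative choices, not a description of the base class.

```perl
use strict;
use warnings;

package ExampleItemHandlers;    # hypothetical name

sub new
{
    my ($class) = @_;
    # parsingList is false: we are parsing an item description page
    return bless({ parsingList => 0, curInfo => {}, inside => {} }, $class);
}

sub start
{
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    $self->{inside}->{$tagname}++;    # remember which tags are open
}

sub end
{
    my ($self, $tagname) = @_;
    $self->{inside}->{$tagname}--;
}

sub text
{
    my ($self, $origtext) = @_;
    return if $self->{parsingList};    # results pages are handled elsewhere

    # Assumed markup: the <h1> of the page holds the title
    if ($self->{inside}->{h1})
    {
        $self->{curInfo}->{title} = $origtext;
    }
}
```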

Drag'n'Drop and refresh support

(from a forum post by zombiepig) For drag and drop support to work correctly, there are basically two parts that need to be fulfilled within the getItemUrl function.

Here’s an example, from the boardgamegeek plugin:

    sub getItemUrl
    {
        my ($self, $url) = @_;
        
        if (!$url)
        {
            # If we're not passed a url, return a hint so that gcstar knows what type
            # of addresses this plugin handles
            $url = "http://www.boardgamegeek.com";
        }
        elsif (index($url,"xmlapi") < 0)
        {
            # Url isn't for the bgg api, so we need to find the game id
            # and return a url corresponding to the api page for this game       
            $url =~ m|/([0-9]+)[/]*|;
            my $id = $1;
            $url = "http://www.boardgamegeek.com/xmlapi/boardgame/".$id;
        }
        return $url;
    }

So there are two parts to it. First, if getItemUrl is called without a url, the plugin needs to return a sample url showing the domain the plugin handles (in this case www.boardgamegeek.com). This lets GCstar determine which plugin can handle a url that has been dragged and dropped. The second part only applies sometimes, mostly to plugins that use an API, where the page to parse differs from the page the user drags and drops. In the example above, the bgg plugin checks whether the url is for the xmlapi. If it isn’t, it extracts the game id from the url and returns the url of the actual page to parse. If those two conditions are met, drag and drop should work fine. For scraped pages, there is mostly nothing more required. E.g., for IMDb, the function is only:

    sub getItemUrl
    {
        my ($self, $url) = @_;
        
        return $url if $url =~ /^http:/;
        return "http://www.imdb.com".$url;
    }

This is the main criterion for the “update” button to work (I’m going to change that string to “refresh” I think). It uses the same routines as the drag and drop code to turn the stored item url into the url the plugin needs to parse. So if drag and drop works, refresh should work as well.
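
Note that the IMDb function shown earlier only recognizes addresses starting with http:, so an https address would get the site prefix prepended. A hedged variant that handles all three cases (no argument, absolute url, relative url) might look like this; www.example.com is a placeholder domain and relative links are assumed to start with a slash.

```perl
use strict;
use warnings;

sub getItemUrl
{
    my ($self, $url) = @_;

    # No argument: return a sample url so GCstar can match dropped links
    return 'http://www.example.com/' if !$url;

    # Already absolute (http or https): use it as it is
    return $url if $url =~ m{^https?://}i;

    # Relative link from a results page: prepend the site address
    return 'http://www.example.com' . $url;
}
```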

Inform webmasters

While GCstar only performs the same operations a web browser would, it is courteous to inform the webmaster of what you are doing. Look for the contact information on the website you are writing a plugin for, and send them an email about it. You may include a link to the Information for webmasters page.

Since it may be unclear what GCstar is, you will probably have to stress that GCstar is for personal use only. Also, users always know where the information they fetch comes from. The application in no way aims to hide which website is used, since these websites do useful and great work.

 
en/websites_plugins.txt · Last modified: 28/08/2010 14:39 by Tian


