Plagger hack: Engrish.com Recent Discoveries custom feed

I've been playing with Plagger, the Pluggable RSS/Atom Aggregator, to create a custom feed for Engrish.com's "recent discoveries". The HTML on that page is pretty convoluted, consisting of at least six or seven levels of nested tables, rows and cells, but a combination of Web::Scraper and XPath made it easy to get at the required information.

Here is the Plagger workflow:

global:
  log:
    level: error
plugins:
  - module: Subscription::Config
    config:
      feed:
        - script:///home/gr/plagger/feeds/engrish_com-recent_discoveries.pl
  - module: CustomFeed::Script
  - module: Publish::Feed
    config:
      dir: /path/to/htdocs/feeds
      filename: engrish_com-recent_discoveries.rss
      format: RSS

And here is the script (engrish_com-recent_discoveries.pl) referenced by the workflow:

#!/usr/bin/env perl

use strict;
use warnings;
use Web::Scraper;
use URI;
use YAML;
use DateTime;
use DateTime::Format::W3CDTF;

my $uri = URI->new('http://engrish.com/recent.php');

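# Each discovery is a link to recent_detail.php wrapping an image; take the
# link's href as the entry link and the image's alt text as the entry title.
my $s = scraper {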
    process '//a[contains(@href,"recent_detail.php")]',
        'entries[]' => {
            link  => '@href',
            title => scraper {
                process 'img', title => '@alt';
                result 'title';
            },
        }
};

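my $feed = $s->scrape($uri);

# keep only the first ten entries; guard against a missing or short list so
# that splice does not warn about an offset past the end of the array
$feed->{entries} ||= [];
splice @{ $feed->{entries} }, 10 if @{ $feed->{entries} } > 10;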

$feed->{title} = 'Engrish.com Recent Discoveries';
$feed->{link} = $uri->as_string;

for my $entry (@{ $feed->{entries} }) {

    # extract the date from the 'link' key; it looks something like
    #
    # recent_detail.php?imagename=foobar.jpg&category=Clothing&date=2007-09-12

    if ($entry->{link} =~ /date=(\d{4})-(\d{2})-(\d{2})/) {
        my $date = DateTime->new(
            year  => $1,
            month => $2,
            day   => $3,
        );
        $entry->{date} = DateTime::Format::W3CDTF->format_datetime($date);
    }

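    # match up to the next '&' or the end of the link, so this still works
    # if imagename happens to be the last parameter
    if ($entry->{link} =~ /imagename=([^&]+)/) {
        $entry->{body} = qq!<img src="http://engrish.com/image/engrish/$1" />!;
    }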

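    # resolve against the page URL; this works whether the scraped href is
    # relative or has already been made absolute
    $entry->{link} = URI->new_abs($entry->{link}, $uri)->as_string;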
}

print Dump $feed;
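
The two regular expressions in the loop both pick values out of the link's query string. As a rough sketch of the same idea using only core Perl, the hypothetical helper query_params below (not part of the original script) splits the sample link from the comment above into a hash of parameters. It assumes plain key=value pairs and does no URL decoding.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: split the query
# string of a link into a hash reference of parameter names and values.
# Assumes simple key=value pairs separated by '&'; no URL decoding is done.
sub query_params {
    my ($link) = @_;
    my (undef, $query) = split /\?/, $link, 2;
    return {} unless defined $query;

    my %param;
    for my $pair (split /&/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        $param{$key} = defined $value ? $value : '';
    }
    return \%param;
}

# Sample link taken from the comment in the script above.
my $link  = 'recent_detail.php?imagename=foobar.jpg&category=Clothing&date=2007-09-12';
my $param = query_params($link);

print "image: $param->{imagename}\n";
print "date:  $param->{date}\n";
```

Since the script already loads the URI module, its query_form method is probably the sturdier choice in practice, as it also takes care of escaped characters. The hand-rolled version is only meant to show what the regular expressions are reaching for.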

At the moment I run this workflow once per hour, and you can subscribe to the resulting feed. I read it in Bloglines. Combined with the Plagger workflow that takes Bloglines feeds and publishes them to Gmail, this means that new Engrish discoveries now show up for me inside Gmail.


posted at: 08:44 | path: /dev
