September 14, 2007
Plagger hack: Engrish.com Recent Discoveries custom feed
I've been playing with Plagger, the Pluggable RSS/Atom Aggregator, to create a custom feed for Engrish.com's "recent discoveries". The HTML on that page is pretty convoluted, consisting of at least six or seven levels of nested tables, rows and cells, but a combination of Web::Scraper and XPath made it easy to get at the required information.
Here is the Plagger workflow:
global:
  log:
    level: error
plugins:
  - module: Subscription::Config
    config:
      feed:
        - script:///home/gr/plagger/feeds/engrish_com-recent_discoveries.pl
  - module: CustomFeed::Script
  - module: Publish::Feed
    config:
      dir: /path/to/htdocs/feeds
      filename: engrish_com-recent_discoveries.rss
      format: RSS
And here is the script (engrish_com-recent_discoveries.pl) referenced by the workflow:
#!/usr/bin/env perl

use strict;
use warnings;
use Web::Scraper;
use URI;
use YAML;
use DateTime;
use DateTime::Format::W3CDTF;

my $uri = URI->new('http://engrish.com/recent.php');

my $s = scraper {
    process '//a[contains(@href,"recent_detail.php")]',
        'entries[]' => {
            link  => '@href',
            title => scraper {
                process 'img', title => '@alt';
                result 'title';
            },
        }
};

my $feed = $s->scrape($uri);
splice @{ $feed->{entries} || [] }, 10;
$feed->{title} = 'Engrish.com Recent Discoveries';
$feed->{link}  = $uri->as_string;

for my $entry (@{ $feed->{entries} }) {

    # extract the date from the 'link' key; it looks something like
    #
    # recent_detail.php?imagename=foobar.jpg&category=Clothing&date=2007-09-12

    if ($entry->{link} =~ /date=(\d{4})-(\d{2})-(\d{2})/) {
        my $date = DateTime->new(
            year  => $1,
            month => $2,
            day   => $3,
        );
        $entry->{date} = DateTime::Format::W3CDTF->format_datetime($date);
    }

    if ($entry->{link} =~ /imagename=(.*?)&/) {
        $entry->{body} = qq!<img src="http://engrish.com/image/engrish/$1" />!;
    }

    $entry->{link} = 'http://engrish.com/' . $entry->{link};
}

print Dump $feed;
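The per-entry post-processing in that loop can be tried on its own, without fetching the page. What follows is only a rough sketch using core Perl: the helper names enrich_entry and first_n are hypothetical and not part of the script above, and the date is formatted with a plain sprintf as a stand-in for DateTime::Format::W3CDTF, so the exact string that module produces may differ. The image-name pattern here also accepts a parameter at the end of the query string, which is a small deviation from the original regex.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): derive the extra
# feed fields from one relative link scraped off the page.
sub enrich_entry {
    my ($link) = @_;
    my %entry = (link => "http://engrish.com/$link");

    # date=YYYY-MM-DD becomes a W3CDTF-style timestamp at midnight.
    # Assumption: sprintf stands in for DateTime::Format::W3CDTF here.
    if ($link =~ /date=(\d{4})-(\d{2})-(\d{2})/) {
        $entry{date} = sprintf '%04d-%02d-%02dT00:00:00', $1, $2, $3;
    }

    # imagename=... ends at the next '&' or at the end of the string.
    if ($link =~ /imagename=([^&]+)/) {
        $entry{body} = qq!<img src="http://engrish.com/image/engrish/$1" />!;
    }

    return \%entry;
}

# Hypothetical helper: keep at most $max items. Unlike a bare splice,
# this does not warn when the list is shorter than $max.
sub first_n {
    my ($max, @items) = @_;
    my $last = $max < @items ? $max - 1 : $#items;
    return @items[0 .. $last];
}

my @links = (
    'recent_detail.php?imagename=foobar.jpg&category=Clothing&date=2007-09-12',
);

for my $entry (map { enrich_entry($_) } first_n(10, @links)) {
    print "$_: $entry->{$_}\n" for sort keys %$entry;
}
```

Keeping this step in small functions makes it easy to feed in sample links by hand when the site changes its URL layout.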
At the moment, I run this workflow once per hour, and you can subscribe to the feed. I subscribe to it using Bloglines. Combined with the Plagger workflow that takes Bloglines feeds and publishes them to Gmail, this means that I can now see new Engrish discoveries from within the Gmail reader.
posted at: 08:44 | path: /dev