Look at the Latest Fark Headlines
Originally 17 Sep 2004. Overhauled on 3 January 2007
The Problem
I want to look at the Fark headlines without opening a browser. Why? I dunno, maybe I just want to see what's new since the last time I looked, without being distracted by the SbB girls. Mind you, they are quite lovely. It's just that I often forget what I was doing when that flashing gif starts loading.
Now, I could just turn off images and go to the site, and that would work fine. Actually, it would work quite well. No need for this article, then. I'm off for some coffee ...
What? I have to write something in here about Ruby? ... okay.
ahem.
I want to look at the Fark headlines without opening a browser. Why? Well, as it so happens, I am logged into a machine via ssh, and using lynx or links to load the page will result in a lot of extra clutter from text versions of ads which obscure the headlines. I'm just interested in seeing what interesting, scary, or amusing things have been posted on Fark since the last time I checked.
Finding a Solution
Well, we could always just dump the Fark page to the console:
require 'open-uri'
# URI.open is needed on Ruby 3.0 and later, where a bare open() no longer fetches URLs.
headlines = URI.open('http://www.fark.com/').read
puts headlines
$ ruby fark.rb
...
<div class="copyright">
<a target="_top" href="http://www.fark.com/nomirror/"></a>Copyright © 1999-2007 Fark.com, LLC<br>
</div>
<div class="footnote">
Last updated: Wed Jan 3 18:58:11 2007<br>
Terms of Service: Text comments, AudioEdit submissions, and Photoshopped images
posted on Fark by registered users may not be reposted or broadcast without the
express written permission or license from Fark.com, and must attribute Fark.com
as the source. Fark.com is the legal owner of all copyrights in the content on
this site. <a target="_top" href="http://www.fark.com/farq/legal.shtml">(Legal/privacy policy)</a>
<br>
</div>
<div class="finalfootnote">
<noscript><img src="http://go.fark.com/cgi/fark/ll.pl?l=H2SfqTIxK1_rLUlKWEBYKGFr4c4cU7RogwDr3cbVJkllZMVNKaPk5j.." width="4" height="1" alt=""></noscript>
</div>
</body>
</html>
But running this doesn't quite get the result I was looking for. I just want the headlines, and I want them without the HTML, thank you very much.
Fark and a lot of other news sites make RSS feeds available. These are special XML files containing mainly, you guessed it, the headlines, without the HTML. You're very welcome.
require 'open-uri'
url = 'http://www.fark.com/fark.rss'
headlines = URI.open(url).read
puts headlines
$ ruby fark.rb
...
<title>[Obvious] Nick Saban, one week ago: &quot;I'm not going to be the Alabama coach.&quot; Alabama: &quot;We'll give you $35 million.&quot; Saban: &quot;Oh, I see what you did there&quot;</title>
<description><![CDATA[ESPN]]></description>
<link>http://forums.fark.com/cgi/fark/comments.pl?IDLink=2514190</link>
</item>
<item>
<title>[Interesting] Nepalese authorities baffled by four dozen missing rhinos, begin production of gigantic milk cartons</title>
<description><![CDATA[Yahoo]]></description>
<link>http://forums.fark.com/cgi/fark/comments.pl?IDLink=2514051</link>
</item>
<item>
<title>[Interesting] Ontario helps cause the breakdown of the family by callously ensuring that boy has 50 percent more nagging than all the other kids</title>
<description><![CDATA[Toronto Star]]></description>
<link>http://forums.fark.com/cgi/fark/comments.pl?IDLink=2513949</link>
</item>
<item>
<title>[Unlikely] Super bug set to destroy the world EVERYBODY PANIC</title>
<description><![CDATA[Toronto Star]]></description>
<link>http://forums.fark.com/cgi/fark/comments.pl?IDLink=2513884</link>
</item>
</channel>
</rss>
Now I get the RSS file dumped to the console. That's a little better, I guess. At least the story headlines are a little easier to find. To get the behavior I want, though, we're going to need to chop out the bits we don't care about and get straight to the headlines. This task is straightforward in Ruby, thanks to the [RSS](http://ruby-doc.org/stdlib/libdoc/rss/rdoc/index.html) library. The RSS library has recently been made an official part of the standard libs, which makes a lot of this exercise much easier.
require 'open-uri'
require 'rss'
rss_url = 'http://www.fark.com/fark.rss'
document = URI.open(rss_url).read
rss = RSS::Parser.parse(document)
# Print the title of every item in the feed.
rss.items.each do |item|
  puts item.title
end
$ ruby fark.rb
[Obvious] Wal-Mart finds sinister new way to make employees' lives hell: Chan\
ging from fixed shifts to scheduling them based on how many customers are in \
the store at any given time
[Cool] Auto-parts store manager pulls a TJ Hooker, calls 911 from atop speedi\
ng car
[NewsFlash] Today's school shooting story brought to you by Tacoma, Washington
[Strange] Greased, naked guy slips out of prison by sliding through prison ba\
rs. No word on whether he was deaf
[PSA] Tax time is right around the corner. Make sure to include bribes, sale \
of illegal drugs, kickbacks and stolen property on your Schedule C as income
[PSA] Drew will be talking Fark with Chip Franklin on WBAL-AM 1090 Baltimore \
MD at 11am
[Obvious] Nick Saban, one week ago: &quot;I'm not going to be the Alabama coa\
ch.&quot; Alabama: &quot;We'll give you $35 million.&quot; Saban: &quot;Oh, I\
 see what you did there&quot;
[Interesting] Nepalese authorities baffled by four dozen missing rhinos, begi\
n production of gigantic milk cartons
[Interesting] Ontario helps cause the breakdown of the family by callously en\
suring that boy has 50 percent more nagging than all the other kids
[Unlikely] Super bug set to destroy the world EVERYBODY PANIC
This is even better still, but that's an awful lot of headlines. How about just the most recent ones? How about the last 10?
require 'open-uri'
require 'rss'
rss_url = 'http://www.fark.com/fark.rss'
limit = 10
document = URI.open(rss_url).read
rss = RSS::Parser.parse(document)
rss.items.each_with_index do |item, index|
  # Stop once we have printed `limit` headlines.
  break if index >= limit
  puts item.title
end
$ ruby fark.rb
[Obvious] ExxonMobil has been borrowing a page from big tobacco's playbook by\
 funding front groups that question Global Warming
[Strange] Meteor-like object crashes into New Jersey home. Residents nervous,\
 act as if it's a growing trend (w/video)
[Amusing] NYC taxis covered with fake fur to look like cows. No changes neede\
d to the interior due to existing authentic &quot;downwind of the barn&quot; \
cab stench
[Cool] Spiders on Drugs
[Amusing] &quot;U.S. Mines Still Not Safe Enough, Experts Say.&quot; Apparent\
ly they keep exploding when you step on them
[Amusing] Tara Reid counting down to 2007 and completely blowing it
[Stupid] Iranian police force launches women's fashion line, which allow wome\
n to show obscene amounts of ankle and wrist
[Unlikely] Not content to let the girls have all the fun with the F-bomb drop\
ping Bratz doll, Tek Nek sells F-bomb dropping toy police belt
[Obvious] With Democrats now running Congress, Bush suddenly remembers he's s\
upposed to be a fiscal conservative
[Silly] &quot;Unidentified Goat Found.&quot; In related news--well, there is \
no related news. In fact, there appears to be no news at all
Now we've got it down to the freshest 10, but each item is still filling up a lot of space. One way to cut down the length of each line is to split each headline into multiple lines. Let's start by cutting the category and title into two separate lines:
require 'open-uri'
require 'rss'
rss_url = 'http://www.fark.com/fark.rss'
limit = 10
# Capture the bracketed category, then the rest of the headline.
title_pattern = %r{\[(.+?)\]\s(.+)$}
document = URI.open(rss_url).read
rss = RSS::Parser.parse(document)
rss.items.each_with_index do |item, index|
  break if index >= limit
  title_match = title_pattern.match(item.title)
  if title_match
    puts title_match[1].upcase, title_match[2]
  end
end
$ ruby fark.rb
OBVIOUS
U.S. military on Saddam execution:&quot;Would have done it differently.&quot;\
 Probably would have been many, many more pictures, less clothing, electrodes
INTERESTING
Archaeologists find ancient 2000 year old latrine in Qumran. Ancient grafitti\
 on door says &quot;For a good time, call Mary Magdalene.&quot;
OBVIOUS
ExxonMobil has been borrowing a page from big tobacco's playbook by funding f\
ront groups that question Global Warming
STRANGE
Meteor-like object crashes into New Jersey home. Residents nervous, act as if\
 it's a growing trend (w/video)
AMUSING
NYC taxis covered with fake fur to look like cows. No changes needed to the i\
nterior due to existing authentic &quot;downwind of the barn&quot; cab stench
COOL
Spiders on Drugs (repeat from the video tab, but worth it)
AMUSING
&quot;U.S. Mines Still Not Safe Enough, Experts Say.&quot; Apparently they ke\
ep exploding when you step on them
AMUSING
Tara Reid counting down to 2007 and completely blowing it
STUPID
Iranian police force launches women's fashion line, which allow women to show\
 obscene amounts of ankle and wrist
UNLIKELY
Not content to let the girls have all the fun with the F-bomb dropping Bratz \
doll, Tek Nek sells F-bomb dropping toy police belt
Hmm, I see some HTML entities in there. &quot;, stuff like that. Let's fix that problem before we move on to anything else.
require 'open-uri'
require 'rss'
require 'cgi'
rss_url = 'http://www.fark.com/fark.rss'
limit = 10
title_pattern = %r{\[(.+?)\]\s(.+)$}
document = URI.open(rss_url).read
rss = RSS::Parser.parse(document)
rss.items.each_with_index do |item, index|
  break if index >= limit
  title_match = title_pattern.match(item.title)
  if title_match
    # Turn entities such as &quot; back into plain characters.
    category = CGI.unescapeHTML(title_match[1]).upcase
    title = CGI.unescapeHTML(title_match[2])
    puts category, title
  end
end
$ ruby fark.rb
OBVIOUS
U.S. military on Saddam execution:"Would have done it differently." Probably \
would have been many, many more pictures, less clothing, electrodes
INTERESTING
Archaeologists find ancient 2000 year old latrine in Qumran. Ancient grafitti\
 on door says "For a good time, call Mary Magdalene."
OBVIOUS
ExxonMobil has been borrowing a page from big tobacco's playbook by funding f\
ront groups that question Global Warming
STRANGE
Meteor-like object crashes into New Jersey home. Residents nervous, act as if\
 it's a growing trend (w/video)
AMUSING
NYC taxis covered with fake fur to look like cows. No changes needed to the i\
nterior due to existing authentic "downwind of the barn" cab stench
COOL
Spiders on Drugs (repeat from the video tab, but worth it)
AMUSING
"U.S. Mines Still Not Safe Enough, Experts Say." Apparently they keep explodi\
ng when you step on them
AMUSING
Tara Reid counting down to 2007 and completely blowing it
STUPID
Iranian police force launches women's fashion line, which allow women to show\
 obscene amounts of ankle and wrist
UNLIKELY
Not content to let the girls have all the fun with the F-bomb dropping Bratz \
doll, Tek Nek sells F-bomb dropping toy police belt
Wherever possible, I'm using standard library tools to get my work done. I'm too lazy to remember escaping every possible HTML entity, and I would rather spend a few minutes searching through the Standard Lib docs to find what I need. It's a good habit, and you might want to try it yourself.
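If you'd like to poke at the splitting and unescaping without hitting the network, here is a minimal sketch that runs the same pattern and the same CGI call against a hard-coded title. The `split_title` helper and the `TITLE_PATTERN` constant are names I made up for this example; they are not part of the script above.

```ruby
require 'cgi'

# Same pattern as the script: "[Category] Headline text"
TITLE_PATTERN = %r{\[(.+?)\]\s(.+)$}

# Hypothetical helper: split a raw feed title into [CATEGORY, headline],
# unescaping HTML entities along the way.
# Returns nil when the title does not follow the "[Category] text" shape.
def split_title(raw_title)
  match = TITLE_PATTERN.match(raw_title)
  return nil unless match

  [CGI.unescapeHTML(match[1]).upcase, CGI.unescapeHTML(match[2])]
end

# Sample title in the feed's format, with HTML entities left in.
sample = '[Amusing] &quot;U.S. Mines Still Not Safe Enough, Experts Say.&quot; ' \
         'Apparently they keep exploding when you step on them'

category, headline = split_title(sample)
puts category
puts headline
```

Returning nil for titles that don't match mirrors the `if title_match` guard in the script, so an oddly formatted item gets skipped instead of blowing up.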
Maybe I only care about particular types of headline. Say, I want to be interested, but not amused.
require 'open-uri'
require 'rss'
require 'cgi'
rss_url = 'http://www.fark.com/fark.rss'
limit = 10
title_pattern = %r{\[(.+?)\]\s(.+)$}
preferred_category = 'INTERESTING'
document = URI.open(rss_url).read
rss = RSS::Parser.parse(document)
rss.items.each_with_index do |item, index|
  break if index >= limit
  title_match = title_pattern.match(item.title)
  if title_match
    category = CGI.unescapeHTML(title_match[1]).upcase
    # Only print headlines in the category we care about.
    if category == preferred_category
      title = CGI.unescapeHTML(title_match[2])
      puts title
    end
  end
end
$ ruby fark.rb
Archaeologists find ancient 2000 year old latrine in Qumran. Ancient grafitti\
 on door says "For a good time, call Mary Magdalene."
That's pretty nifty, except that it only looks for Interesting items out of the last 10 headlines, rather than looking for the last 10 Interesting headlines.
require 'open-uri'
require 'rss'
require 'cgi'
rss_url = 'http://www.fark.com/fark.rss'
limit = 10
title_pattern = %r{\[(.+?)\]\s(.+)$}
preferred_category = 'Interesting'
document = URI.open(rss_url).read
rss = RSS::Parser.parse(document)
# Counts matching headlines printed so far, not items seen.
index = 0
rss.items.each do |item|
  title_match = title_pattern.match(item.title)
  if title_match
    category = CGI.unescapeHTML(title_match[1])
    if category.upcase == preferred_category.upcase
      title = CGI.unescapeHTML(title_match[2])
      puts title
      index += 1
      break if index >= limit
    end
  end
end
$ ruby fark.rb
Archaeologists find ancient 2000 year old latrine in Qumran. Ancient grafitti\
 on door says "For a good time, call Mary Magdalene."
Documents show Iran is supporting Sunni and Shia terrorists in Iraq, apparent\
ly favoring a neighbor in chaos over a Shia client state
Drop a quarter in the slot or the little man gets it
Well, we can only look at today's headlines. I guess we can't be sure of ten interesting things happening every day. Still, at least I know I'm getting all of the interesting headlines that are available, up to my limit.
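The difference between "Interesting items among the last 10" and "the last 10 Interesting items" is just a matter of filtering before limiting. Here's a rough sketch of that idea on a hard-coded list, so it runs without the feed. The `headlines_in` method and the `SAMPLE_TITLES` list are my own stand-ins for this example, and the lazy enumerator is one way to do it, not the way the script does it.

```ruby
# Hypothetical sample titles standing in for rss.items.map(&:title).
SAMPLE_TITLES = [
  '[Interesting] Drop a quarter in the slot or the little man gets it',
  '[Cool] Spiders on Drugs',
  '[Interesting] Nepalese authorities baffled by four dozen missing rhinos, ' \
    'begin production of gigantic milk cartons',
  '[Amusing] Tara Reid counting down to 2007 and completely blowing it',
  '[Unlikely] Super bug set to destroy the world EVERYBODY PANIC'
].freeze

TITLE_PATTERN = %r{\[(.+?)\]\s(.+)$}

# Return up to `limit` headlines whose category matches, ignoring case.
# Filtering happens first, so the limit counts matches rather than items seen.
def headlines_in(titles, category, limit)
  titles.lazy
        .map { |title| TITLE_PATTERN.match(title) }
        .select { |match| match && match[1].casecmp?(category) }
        .map { |match| match[2] }
        .first(limit)
end

puts headlines_in(SAMPLE_TITLES, 'interesting', 2)
```

Because the chain is lazy, it stops scanning as soon as it has `limit` matches, which is the same effect as the `break` in the loop above.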
Next problem: the only way I can fetch different headline types is to manually dig in to the source code and change the category.
require 'open-uri'
require 'rss'
require 'cgi'
require 'optparse'
rss_url = 'http://www.fark.com/fark.rss'
limit = 10
title_pattern = %r{\[(.+?)\]\s(.+)$}
preferred_category = nil
# Get the preferred category, if any, from the command line.
# The outer variable is named `parser` so the block parameter does not shadow it.
parser = OptionParser.new do |opts|
  opts.banner = "#{$0} [options]"
  opts.separator ""
  opts.separator "Specific Options"
  opts.on("-c", "--category CATEGORY",
          "Only grab headlines in specified category") do |cat|
    preferred_category = cat
  end
  opts.on_tail("-h", "--help", "Show this usage display") do
    puts opts
    exit
  end
end
parser.parse!(ARGV)
document = URI.open(rss_url).read
rss = RSS::Parser.parse(document)
index = 0
rss.items.each do |item|
  title_match = title_pattern.match(item.title)
  if title_match
    category = CGI.unescapeHTML(title_match[1])
    # With no category given, every headline qualifies.
    if preferred_category.nil? ||
       category.upcase == preferred_category.upcase
      title = CGI.unescapeHTML(title_match[2])
      puts title
      index += 1
      break if index >= limit
    end
  end
end
$ ruby fark.rb -c amusing
Newspaper fails to consult the Urban dictionary. Vocabularity ensues
NYC taxis covered with fake fur to look like cows. No changes needed to the i\
nterior due to existing authentic "downwind of the barn" cab stench
"U.S. Mines Still Not Safe Enough, Experts Say." Apparently they keep explodi\
ng when you step on them
Tara Reid counting down to 2007 and completely blowing it
Rare 1913 Liberty Head nickel, estimated to be worth $5 million, fetches abso\
lutely nothing at auction. In other news, that fancy word for coin collectors\
 is numismatists
If you left your Porsche keys, Elton John's sunglasses and six feet of snakes\
kin in your hotel room, Travelodge would like to have a word with you
That works. Pretty nicely, I might add. OptionParser (from the optparse library) is a great tool for handling command-line arguments.
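One more thing OptionParser will do for you is convert and validate argument types. The sketch below parses a hand-built argument list instead of ARGV, so you can try it without touching the real script. The `parse_options` method and the `-n`/`--limit` option are assumptions I've added for illustration; the script above only knows about `--category`.

```ruby
require 'optparse'

# Hypothetical helper: parse a list of command-line words and
# return the chosen settings as a hash.
def parse_options(argv)
  settings = { category: nil, limit: 10 }

  parser = OptionParser.new do |opts|
    opts.banner = "#{$0} [options]"
    opts.on('-c', '--category CATEGORY',
            'Only grab headlines in specified category') do |cat|
      settings[:category] = cat
    end
    # Assumed extra option: passing Integer makes OptionParser convert the
    # value and raise OptionParser::InvalidArgument if it is not a number.
    opts.on('-n', '--limit COUNT', Integer,
            'Maximum number of headlines') do |count|
      settings[:limit] = count
    end
  end

  # parse (without the bang) leaves the caller's array untouched.
  parser.parse(argv)
  settings
end

p parse_options(%w[-c amusing --limit 5])
```

Keeping the parsing in a method that takes an array makes it easy to test, and the settings hash keeps the defaults in one place.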
This program now does everything that I set out to do, and then some. I might choose to do some refactoring to "bulletproof" the code, or wrap it up in some OO niceness to make it pretty. The truth is that this application is exactly what it needs to be for now, and I think that I shouldn't overwork something that I may never come back to. Maybe later I'll come back to it when I think of new features or find new bugs, and then I can overwork it to my heart's content.
I hope you enjoyed working along with me as much as I enjoyed sitting here and typing random nonsense to myself.
What Else?
I may be done with this exercise for now, but here are a few ideas about features that can be added to make it a little cooler. Go ahead and try them out!
- Add word wrap to make the output a little more readable.
- Add a parameter to change the number of headlines grabbed.
- Modify so that this script will work with other newsfeeds.
- Modify so that the functionality of this script can be embedded in other Ruby programs.
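If you want a head start on the first idea, here is one possible sketch of a word-wrap helper. The `word_wrap` name and the default width of 72 are arbitrary choices of mine, and this is only one of several reasonable ways to do it.

```ruby
# Hypothetical helper: wrap text at word boundaries so that no line is longer
# than `width` characters. A single word longer than `width` is left
# unbroken on a line of its own.
def word_wrap(text, width = 72)
  lines = []
  current = []
  text.split.each do |word|
    candidate = (current + [word]).join(' ')
    if current.empty? || candidate.length <= width
      current << word
    else
      # The word does not fit: finish this line and start the next one.
      lines << current.join(' ')
      current = [word]
    end
  end
  lines << current.join(' ') unless current.empty?
  lines.join("\n")
end

headline = 'Meteor-like object crashes into New Jersey home. Residents ' \
           "nervous, act as if it's a growing trend (w/video)"
puts word_wrap(headline, 40)
```

In the script, this could be applied to each title just before it is printed.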
Revision History
| 3 January 2007 | Major rewrite to incorporate RSS library and changes at Fark |
| 19 September 2004 | Changed the network library used from 'net/http' to 'open-uri' in the refactoring stage. This is from a suggestion that was made by Gavin Sinclair, Frederick Ros, and others. It's a good suggestion, and I'm not going to ignore a good suggestion! |
| 17 September 2004 | Initial version released. |
