newsfiler - simple RSS to file with fulltext support

avx · 2011-10-20 21:03:43

situation: imho, all RSS-readers suck
thoughts: "everything is a file", why not news?
motivation: handle news like normal text files, so it can be linked, shared, searched, archived, ...

(my) solution: newsfiler

newsfiler is a simple script that iterates over a list of feed URLs and turns each feed's items into text files.

Short description of functionality:

a) read in a list of feed URLs
b) extract the direct link of each news item
c) pass these links to a readability port to get the full text, since most feeds still don't provide it themselves
d) do some formatting
e) write each item to $HOME/news/$feedtitle/[YYYY-mm-dd] @ [HH:mm], with the following content:

Feed Title: Item Title [DATE TIME]
Original Link

Full Content
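
Steps d) and e) could look roughly like the sketch below. This is not the actual newsfiler code: the function names `item_path`, `render_item` and `write_item` are made up for illustration, and it assumes that the square brackets in the file name pattern are placeholders rather than literal characters.

```python
import os
from datetime import datetime


def item_path(news_root, feed_title, published):
    """Build <news_root>/<feed title>/YYYY-mm-dd @ HH:MM for one item."""
    # Assumption: a '/' in a feed title would break the path, so swap it out.
    safe_title = feed_title.replace("/", "_")
    filename = published.strftime("%Y-%m-%d @ %H:%M")
    return os.path.join(news_root, safe_title, filename)


def render_item(feed_title, item_title, published, link, content):
    """Lay out one item: header line, original link, blank line, full text."""
    stamp = published.strftime("%Y-%m-%d %H:%M")
    return "%s: %s [%s]\n%s\n\n%s\n" % (feed_title, item_title, stamp, link, content)


def write_item(news_root, feed_title, item_title, published, link, content):
    """Write one item to its file and return the path that was used."""
    path = item_path(news_root, feed_title, published)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_item(feed_title, item_title, published, link, content))
    return path
```

Calling `write_item(os.path.expanduser("~/news"), ...)` would then create the feed directory on first use and one file per item.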

available features:

receive fulltext for any feed
fix feeds not giving a date on their items
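
One way to handle undated items is sketched below. It is only an assumption about how such a fallback might work, not necessarily what newsfiler does: `item_datetime` is a hypothetical helper, and stamping undated items with the fetch time is my choice for the example. The keys `published_parsed` and `updated_parsed` are the ones feedparser uses for parsed dates.

```python
from datetime import datetime


def item_datetime(entry, fallback=None):
    """Return a datetime for a feed entry, even if the feed gave no date."""
    # feedparser exposes parsed dates as time.struct_time values under
    # these keys; either one may be missing or None.
    for key in ("published_parsed", "updated_parsed"):
        parsed = entry.get(key)
        if parsed:
            return datetime(*parsed[:6])
    # Assumption: stamp undated items with the time they were fetched.
    return fallback or datetime.now()
```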

todo:

replace links with 'linktext [X]' and add a list of '[X] reallink' at the end
fetch in parallel threads somehow; it's currently not really fast
some kind of per-feed configuration, e.g. timestamps so a feed is only fetched at certain intervals
~~pass the site hosting the feed as referrer~~ *done*
maybe a mode to write every item of a feed to the same file, so it could easily be followed with e.g. `tail -f`
maybe a mode to keep the HTML, so it can easily be formatted and viewed in the browser
code clean-up, maybe try to get rid of some deps
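
The first todo item (link footnotes) could be approached as in the sketch below, using only the standard library's `html.parser`. This is an idea, not existing newsfiler code: `LinkFootnoter` and `footnote_links` are made-up names, and the sketch also drops all other markup, which may or may not be wanted.

```python
from html.parser import HTMLParser


class LinkFootnoter(HTMLParser):
    """Turn <a href=...>text</a> into 'text [N]' and collect the targets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # pieces of the plain-text output
        self.links = []       # link targets, in order of appearance
        self.open_links = []  # footnote numbers of currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.open_links.append(len(self.links))
            else:
                # Anchor without a target: keep its text, add no footnote.
                self.open_links.append(None)

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            number = self.open_links.pop()
            if number is not None:
                self.parts.append(" [%d]" % number)

    def handle_data(self, data):
        self.parts.append(data)


def footnote_links(html):
    """Return plain text with 'linktext [N]' markers and a '[N] url' list."""
    parser = LinkFootnoter()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    if not parser.links:
        return text
    notes = "\n".join("[%d] %s" % (i, url) for i, url in enumerate(parser.links, 1))
    return text + "\n\n" + notes
```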

written in: Python
deps: feedparser, BeautifulSoup, readability-lxml

download: http://phorcix.org/code/newsfiler

contact: here, or see the email address in a comment in the source

Ideas/patches welcome, flame too

Enjoy, or not?!
avx

Edit 1:

quick fix/addition: added a dupe-check based on the article's title and date ~~(where available)~~ (needs a proper fix for feeds without dates).
some cleanup, more Pythonic code
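
A dupe-check on title and date can be as small as the sketch below. It is a simplified illustration, not the code in the script: `item_key` and `is_new` are hypothetical names, the `seen` set only lives in memory (a real run would have to persist it somewhere between invocations), and falling back to the title alone for undated items is an assumption that shows why feeds without dates need a proper fix.

```python
import hashlib


def item_key(title, date=None):
    """Fingerprint an item from its title and, where available, its date."""
    # Assumption: an undated item is keyed by title alone, so a later item
    # reusing the same title in an undated feed would be treated as a dupe.
    raw = title.strip() + "\n" + (date or "")
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def is_new(seen, title, date=None):
    """Return True the first time an item is seen, and remember it."""
    key = item_key(title, date)
    if key in seen:
        return False
    seen.add(key)
    return True
```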

Edit 2:

fixed a stupid error that kept anything from being saved
added support for custom user-agent and referrer
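
For reference, sending a custom user-agent and the feed's site as referrer can be done with the standard library alone, roughly as below. This is a sketch under assumptions, not the script's code: `site_of`, `build_request` and the default user-agent string are invented for the example.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def site_of(feed_url):
    """Return scheme://host/ of the site hosting the feed."""
    parts = urlsplit(feed_url)
    return "%s://%s/" % (parts.scheme, parts.netloc)


def build_request(item_url, feed_url, user_agent="newsfiler-sketch/0.1"):
    """Prepare a request with a custom User-Agent and the feed's site as Referer."""
    return Request(item_url, headers={
        "User-Agent": user_agent,
        "Referer": site_of(feed_url),  # HTTP spells the header with one 'r'
    })
```

The resulting object can be passed to `urllib.request.urlopen` to actually fetch the page.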

Last edited by avx (2011-10-22 23:18:16)

