
#1 2010-10-30 09:23:35

Dieter@be
Forum Fellow
From: Belgium
Registered: 2006-11-05
Posts: 2,001
Website

Looking for cli tool that can convert ordinary html into rss

Does a tool like this exist?
What I'm thinking of:

cat $html | tool_i_need --match-post='<p class="newspost">' > output.rss

Note: I am not interested in "online services" that do it for me.

The main use case: a lot of blogs have comments which I want to follow, but no RSS feeds for the comments, so I want to create my own feed.

Last edited by Dieter@be (2010-10-30 09:24:13)


< Daenyth> and he works prolifically
4 8 15 16 23 42

Offline

#2 2010-10-30 10:54:29

keenerd
Package Maintainer (PM)
Registered: 2007-02-22
Posts: 647
Website

Re: Looking for cli tool that can convert ordinary html into rss

Yes.  It is called "Python".  Or any other programming language.  Basic parser transformation.  HTML parser -> filter out the interesting stuff -> generate the RSS subset of XML.

Beautiful Soup is probably the easiest to use HTML parser.  (It does have downsides, but worry about them later.)  soup.findAll('p', {'class':'newspost'})  gets you halfway there.
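To make the parser -> filter -> generate pipeline concrete, here is a rough sketch. Note that it swaps Beautiful Soup for the standard library's html.parser so it runs without extra packages. The names PostCollector and html_to_rss are made up for illustration, and taking the first 60 characters of a post as its title is purely an assumption. It also assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET


class PostCollector(HTMLParser):
    """Collect the text of every <tag class="css_class"> element."""

    # Elements without a closing tag; they must not change the nesting depth.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self, tag="p", css_class="newspost"):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0    # > 0 while inside a matching element
        self.posts = []   # one string per matched element

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.posts.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.posts[-1] += data


def html_to_rss(html, title, link, tag="p", css_class="newspost"):
    """Return an RSS 2.0 document (as a string) with one item per match."""
    collector = PostCollector(tag, css_class)
    collector.feed(html)
    collector.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for post in collector.posts:
        text = " ".join(post.split())  # collapse whitespace
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text[:60]  # assumed title rule
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "description").text = text
    return ET.tostring(rss, encoding="unicode")
```

Feeding it sys.stdin.read() and printing the result would give roughly the "cat page.html | tool > output.rss" workflow from the first post. Per-item links and dates are left out because where they live is site specific.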

Offline

#3 2010-10-30 11:50:48

Dieter@be
Forum Fellow
From: Belgium
Registered: 2006-11-05
Posts: 2,001
Website

Re: Looking for cli tool that can convert ordinary html into rss

Thanks, I'll keep that in mind.  I'd prefer to save some time though, so if something like this already exists (e.g. example code), I'd rather start from that.



Offline

#4 2010-10-30 15:28:44

oupsemma
Member
Registered: 2010-01-01
Posts: 70

Offline

#5 2010-10-30 15:38:30

Dieter@be
Forum Fellow
From: Belgium
Registered: 2006-11-05
Posts: 2,001
Website

Re: Looking for cli tool that can convert ordinary html into rss

I'm looking for a generic tool, not something that only supports some specific sites.



Offline

#6 2010-10-30 15:56:11

oupsemma
Member
Registered: 2010-01-01
Posts: 70

Re: Looking for cli tool that can convert ordinary html into rss

Offline

#7 2010-10-30 17:04:59

diegonc
Member
Registered: 2008-12-13
Posts: 42

Re: Looking for cli tool that can convert ordinary html into rss

For XHTML (blogs are using it, aren't they? :D) there is xsltproc. You just need a stylesheet like this:

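<?xml version="1.0" encoding="UTF-8" ?>
<!-- Assumes the page declares the XHTML namespace; the h: prefix is bound to it. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h">

        <xsl:output method="xml" indent="yes"/>

        <!-- Title and description default to the page title. -->
        <xsl:param
                name="channel-title"
                select="/h:html/h:head/h:title" />
        <xsl:param
                name="channel-description"
                select="/h:html/h:head/h:title" />
        <!-- These two cannot be guessed from the page; pass them in. -->
        <xsl:param
                name="channel-url" />
        <xsl:param
                name="page-url" />

        <xsl:template match="/">
                <rss version="2.0">
                        <channel>
                                <title><xsl:value-of select="$channel-title"/></title>
                                <link><xsl:value-of select="$channel-url"/></link>
                                <description><xsl:value-of select="$channel-description"/></description>
                                <!-- Visit only the posts; a bare apply-templates would copy all other page text too. -->
                                <xsl:apply-templates select="//h:p[@class = 'newspost']"/>
                        </channel>
                </rss>
        </xsl:template>

        <xsl:template match="h:p[@class = 'newspost']">
                <item>
                        <!-- 'title-id' is a placeholder for whatever marks the title on the site. -->
                        <title><xsl:value-of select=".//*[@id = 'title-id']"/></title>
                        <link><xsl:value-of select="$page-url"/>#<!-- find anchor--></link>
                        <description><xsl:value-of select="."/></description>
                </item>
        </xsl:template>
</xsl:stylesheet>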

The param values that can't be guessed from the source file can be set from the command line (xsltproc --stringparam name value).
While xsltproc is a generic tool, stylesheets are mostly site specific :(
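
One thing that tends to bite with XHTML input: the elements live in the XHTML namespace, so an unprefixed name like p in an XPath expression will not match them. A small Python sketch of the effect, using the standard library's ElementTree rather than xsltproc (the sample page is made up):

```python
from xml.etree import ElementTree as ET

# Prefix "h" is an arbitrary choice; only the namespace URI matters.
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample page, not taken from any real blog.
page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Some blog</title></head>
  <body>
    <p class="newspost">First comment</p>
    <p class="other">Sidebar</p>
    <p class="newspost">Second comment</p>
  </body>
</html>"""

root = ET.fromstring(page)

# Unprefixed names are looked up in no namespace, so nothing matches.
unprefixed = root.findall(".//p[@class='newspost']")

# Names bound to the XHTML namespace find the posts.
posts = root.findall(".//h:p[@class='newspost']", XHTML_NS)
title = root.findtext("h:head/h:title", namespaces=XHTML_NS)

print(len(unprefixed), [p.text for p in posts], title)
```

The same applies to match and select expressions in a stylesheet: bind a prefix to the XHTML namespace and use it on every element name.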

Dieter@be: do you have a link for your sample? I'd like to test this crap :)

Last edited by diegonc (2010-10-30 17:05:32)

Offline
