
#1 2020-08-29 15:03:19

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,595

Convert html to rss

I want to follow HTML pages that do not offer an RSS feed in newsboat.  I keep finding "free" web-based services that do this, but I would prefer something I can run myself to do the conversion.
Googling for a shell script or Perl script to do this is surprisingly difficult.  Is anyone else doing this and willing to share a link to a script that does it?

Initial target is: https://mirrors.edge.kernel.org/pub/lin … le-review/


CPU-optimized Linux-ck packages @ Repo-ck  • AUR packagesZsh and other configs


#2 2020-08-29 15:27:24

progandy
Member
Registered: 2012-05-17
Posts: 5,184

Re: Convert html to rss

I found this project: https://html2rss.github.io/

Sources:
https://github.com/gildesmarais/html2rss
https://github.com/gildesmarais/html2rss-configs
https://github.com/gildesmarais/html2rss-web

Otherwise, you should be able to use xidel to extract the data and generate the RSS: http://www.videlibri.de/xidel.html
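
As a rough illustration of the "extract data" step, here is a minimal Python sketch that uses only the standard library instead of xidel. It assumes the page is an autoindex-style listing in which each link is followed by a date and a time column, such as `24-Aug-2020 08:30`. The names `ListingParser` and `parse_listing` are made up for this example.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect each link's href plus the plain text that follows the link."""

    def __init__(self):
        super().__init__()
        self.entries = []          # list of [href, trailing text]
        self._in_link = False
        self._collecting = False

    def handle_starttag(self, tag, attrs):
        # Any new tag ends the trailing text of the previous link
        self._collecting = False
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.entries.append([href, ""])
                self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            self._collecting = True

    def handle_data(self, data):
        # Text after a closed link holds the date, time and size columns
        if self._collecting and self.entries:
            self.entries[-1][1] += data


def parse_listing(html):
    """Return (href, 'DD-Mon-YYYY HH:MM') pairs; the stamp is '' if absent."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return [(href, " ".join(tail.split()[:2])) for href, tail in parser.entries]
```

The resulting pairs could then be filtered by file name and turned into feed items.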

Last edited by progandy (2020-08-29 15:30:16)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |


#3 2020-08-29 16:16:37

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,595

Re: Convert html to rss

Thanks, I stumbled upon the Ruby project as well.  I started hacking a shell script together too.  Will check out xidel.



#4 2020-08-29 18:15:46

progandy
Member
Registered: 2012-05-17
Posts: 5,184

Re: Convert html to rss

Here's an example for xidel with xquery3. The most annoying part is creating a valid date string.

xidel https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/stable-review/ --extract-kind=xquery3 --extract-file=kernelrss.xq --output-format xml >test.rss

kernelrss.xq:

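(: Convert a listing timestamp such as "29-Aug-2020 15:03" into an RFC 822 style date for <pubDate>. :)
declare function formatTime($t) {
	let $months:=("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
	let $days:=("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
	(: $DT[1] is the date, $DT[2] the time; $D is (day, month name, year) :)
	let $DT:=tokenize($t, " ")
	let $D:=tokenize($DT[1], "-")
	let $m:=format-integer(index-of($months, $D[2]), "00")

	let $date:=xs:dateTime($D[3] || "-" || $m || "-" || $D[1] || "T" || $DT[2] || ":00")
	(: 1970-01-05 was a Monday, so the day offset modulo 7 indexes into $days :)
	let $dow:=((int((xs:date($date) - xs:date('1970-01-05')) div xs:dayTimeDuration('P1D')) mod 7 + 7) mod 7) +1
	(: [m] is the minute; [M] would print the month instead :)
	return format-dateTime($date, $days[$dow]||", [D,2] "||$months[int($m)]||" [Y,4] [H,2]:[m,2]:[s,2] [Z] UTC")
};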

<rss version="2.0">

  <channel>
    <title>Linux kernel v5.x stable review</title>
    <link>https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/stable-review/</link>
    <description>Signed releases for linux kernel v5.x stable review</description>
    <language>en-us</language>
    <copyright>kernel.org</copyright>
    <pubDate>{formatTime(normalize-space(//a[@href="sha256sums.asc"]/following-sibling::node()[1]))}</pubDate>

	{for $a in reverse(//a[starts-with(@href, "patch") and ends-with(@href,"z")])
    return <item>
      <title>{$a/text()}</title>
      <description>Release of {$a/text()}</description>
      <link>https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/stable-review/</link>
      <author>kernel.org</author>
      <guid>{$a/text()}</guid>
      <pubDate>{formatTime(normalize-space($a/following-sibling::node()[1]))}</pubDate>
    </item>
	}

  </channel>

</rss>
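
For comparison, the date handling in formatTime can be sketched in a few lines of Python with the standard library. This is only an illustration under two assumptions: the listing timestamps look like `29-Aug-2020 15:03`, and they are in UTC (the listing itself does not state a zone). The helper name `listing_to_rfc822` is made up for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# English month abbreviations, spelled out so parsing does not depend on the locale
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def listing_to_rfc822(stamp: str) -> str:
    """Turn 'DD-Mon-YYYY HH:MM' into an RFC 822 date usable in <pubDate>."""
    date_part, time_part = stamp.split()
    day, month_name, year = date_part.split("-")
    hour, minute = time_part.split(":")
    moment = datetime(int(year), MONTHS.index(month_name) + 1, int(day),
                      int(hour), int(minute), tzinfo=timezone.utc)
    # usegmt=True emits the zone as "GMT", which RFC 822 allows
    return format_datetime(moment, usegmt=True)


print(listing_to_rfc822("29-Aug-2020 15:03"))  # Sat, 29 Aug 2020 15:03:00 GMT
```

The weekday comes from the datetime object here, so no manual day-of-week arithmetic is needed.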

Last edited by progandy (2020-08-29 18:22:32)



#5 2020-08-29 18:22:18

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,595

Re: Convert html to rss

@progandy - Thanks for sharing that code.  I looked into xidel, but it seems a bit complex, so I wrote a bash script instead.  The script seems to work, but I cannot get newsboat to read its output.  See: https://bbs.archlinux.org/viewtopic.php?id=258620

Last edited by graysky (2020-08-29 18:23:12)



#6 2020-08-29 18:36:20

firecat53
Member
From: Lake Stevens, WA, USA
Registered: 2007-05-14
Posts: 1,542

Re: Convert html to rss

The Python feedgen module might do what you want and is fairly easy to use if you already have the HTML. Hopefully it would work with a local file path for the RSS link instead of a web URL.
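
To give an idea of how little is needed once the entries are parsed, here is a minimal sketch that builds an RSS 2.0 document with Python's standard library (`xml.etree.ElementTree`) rather than feedgen itself. The function name `build_rss` and the shape of the `items` tuples are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Return an RSS 2.0 document as a string.

    items is a list of (title, link, published) tuples, where published
    is a timezone-aware datetime in UTC.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item_title, item_link, published in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
        # The title doubles as the guid; it is not a URL, so say so
        ET.SubElement(item, "guid", isPermaLink="false").text = item_title
        ET.SubElement(item, "pubDate").text = format_datetime(published, usegmt=True)
    return ET.tostring(rss, encoding="unicode")


feed = build_rss(
    "Example feed", "https://example.org/", "Illustrative feed",
    [("patch-5.8.5-rc1.gz", "https://example.org/",
      datetime(2020, 8, 29, 15, 3, tzinfo=timezone.utc))],
)
```

The returned string can be written to a local file. Whether a feed reader accepts that file in place of a web URL depends on the reader and is not checked here.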
