
#1 2021-04-13 19:06:55

Registered: 2020-01-18
Posts: 100

How would you have parsed this Web page?

A while ago I uploaded a list of games to Neocities. Recently I discovered a template system called Mustache, so I decided to move all the data into a JSON file and let the C implementation of Mustache generate the Web page for me. It produced a visually identical version from this template:

$ cat list-of-vehicle-building-games.mustache
<!DOCTYPE html>
<html lang="en">
		<meta charset="UTF-8">
		<link rel="icon" type="image/svg+xml" href="./logo.svg">
		<title>{{title}} - aqueduct1</title>
				{{#resources}}[<a href="{{url}}">{{name}}{{#index}}<sup>{{.}}</sup>{{/index}}</a>{{#index}}<sup>{{/index}}{{#archive}} (archive){{/archive}}{{#mention}} (mention){{/mention}}{{#index}}</sup>{{/index}}{{#alts}}
				<sup><a href="{{url}}">{{index}}</a>{{#archive}} (archive){{/archive}}{{#mention}} (mention){{/mention}}</sup>{{/alts}}]

But to get the JSON... I suffered. This cursed Bash script using html-xml-utils, grep and sed spat out broken JSON that saved me a lot of time, but also required a lot of manual corrections until jq stopped complaining:

$ cat extract-games-to-json

shopt -s extglob


IFS=$'\n' resources=($(cat list-of-vehicle-building-games.html | hxremove sup | hxselect -s "\n" -c ul li a | sort | uniq))
IFS=$'\136'   games=($(cat list-of-vehicle-building-games.html | hxselect -c -s $'\136' li ))

echo -en "{\n"
echo -en "\t\"title\" : \"$(cat list-of-vehicle-building-games.html | hxselect -c h1)\",\n"
echo -en '\t"games" : ['

for game_i in "${!games[@]}"; do
        IFS=$'\n' game=($(echo "${games[game_i]}"))
        unset "game[${#game[@]} - 1]"

        notes="${game[${#game[@]} - 1]##*([[:space:]])}"
        unset "game[0]"

        json_games[game_i]="\n\t\t{\n\t\t\t\"name\" : \"$name\""

        if [[ "$notes" =~ ^\(.*\)$ ]]; then
                json_games[game_i]+=",\n\t\t\t\"notes\" : \"$(echo "$notes" | sed 's/"/\\"/g')\""
                unset "game[${#game[@]} - 1]"
        fi

        if [[ ${#game[@]} -gt 0 ]]; then
                json_games[game_i]+=",\n\t\t\t\"resources\" : ["


                for resource in "${game[@]}"; do
                        if [[ "$resource" =~ [.*] ]]; then
                                resources+=" {\"url\" : $(echo "$resource" | grep -o '".*"'), \"name\" : \"$(echo "$resource" | grep -o '>.*<')\"}"
                        fi
                done

                json_games[game_i]+=$(IFS=, echo -n "${resources[*]}")
        fi
done


(IFS=,; echo -en "${json_games[*]}")

echo -en '\n\t]\n}'

At this point I contemplated Python... I think I'll give it a shot. Sample of the hairiest parts of the JSON:

$ jq '.games[] | select(.name == "Fraxy")' list-of-vehicle-building-games.json
{
  "name": "Fraxy",
  "resources": [
    {
      "url": "",
      "name": "site",
      "archive": true,
      "index": 1,
      "alts": [
        {
          "index": 2,
          "url": "",
          "archive": false
        }
      ]
    },
    {
      "url": "",
      "name": "wiki",
      "index": 1,
      "alts": [
        {
          "index": 2,
          "url": ""
        },
        {
          "index": 3,
          "url": "",
          "archive": true
        },
        {
          "index": 4,
          "url": "",
          "archive": true
        },
        {
          "index": 5,
          "url": "",
          "archive": true
        },
        {
          "index": 6,
          "url": ""
        },
        {
          "index": 7,
          "url": ""
        }
      ]
    },
    {
      "url": "",
      "name": "forum",
      "index": 1,
      "alts": [
        {
          "index": 2,
          "url": "",
          "archive": true
        },
        {
          "index": 3,
          "url": "",
          "archive": true
        },
        {
          "index": 4,
          "url": "",
          "archive": true
        }
      ]
    },
    {
      "url": "",
      "name": "YouTube"
    },
    {
      "url": "",
      "name": "TV Tropes"
    }
  ]
}
$ jq '.games[] | select(.name == "Block Tech Sandbox")' list-of-vehicle-building-games.json
{
  "name": "Block Tech Sandbox",
  "notes": "(aka Block Tech: Epic Sandbox, or Block Tech: Epic Sandbox Craft Simulator Online)",
  "resources": [
    {
      "url": "",
      "name": "Play Store",
      "index": "free",
      "alts": [
        {
          "index": "gold",
          "url": ""
        }
      ]
    },
    {
      "url": "",
      "name": "App Store"
    },
    {
      "url": "",
      "name": "Crazy Games"
    },
    {
      "url": "",
      "name": "Silver Games"
    }
  ]
}

Maybe I could've learned some of those Python libraries. Maybe I missed some program that could've been more useful. Maybe I need to learn more Bash. What do you think?
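
For comparison, here is a minimal sketch of the Python route using only the standard library's html.parser. It assumes each game sits in its own explicitly closed <li>, with the name as the leading text and each resource as an <a> element; the class name GameListParser, the helper parse_games, the sample markup and the example.org URLs are all made up for illustration. It does not handle the <sup> indices, alternates or notes.

```python
import json
from html.parser import HTMLParser


class GameListParser(HTMLParser):
    """Collect one dict per <li>: leading text as the name, each <a> as a resource."""

    def __init__(self):
        super().__init__()
        self.games = []
        self.current = None  # dict for the <li> being read, or None outside one
        self.link = None     # dict for the <a> being read, or None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.current = {"name": "", "resources": []}
        elif tag == "a" and self.current is not None:
            self.link = {"url": dict(attrs).get("href", ""), "name": ""}

    def handle_data(self, data):
        if self.current is None:
            return
        if self.link is not None:
            self.link["name"] += data
        elif not self.current["resources"]:
            # Text before the first link is treated as the game name.
            self.current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.link is not None:
            self.current["resources"].append(self.link)
            self.link = None
        elif tag == "li" and self.current is not None:
            # Drop surrounding whitespace and the "[" that opens the first link.
            self.current["name"] = self.current["name"].strip(" \n\t[")
            self.games.append(self.current)
            self.current = None


def parse_games(html_text):
    parser = GameListParser()
    parser.feed(html_text)
    parser.close()
    return parser.games


# Hypothetical markup, loosely modelled on the template above.
sample = """
<ul>
<li>Fraxy
[<a href="https://example.org/site">site</a>]
[<a href="https://example.org/wiki">wiki</a>]
</li>
<li>Block Tech Sandbox
[<a href="https://example.org/store">Play Store</a>]
</li>
</ul>
"""

print(json.dumps({"games": parse_games(sample)}, indent=2))
```

Because json.dumps does the quoting and escaping, the output is valid JSON by construction, which is the part the shell version struggled with.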

Behemoth, wake up!


#2 2021-04-13 19:41:14

Registered: 2012-09-02
Posts: 886

Re: How would you have parsed this Web page?

I would have entered this data in a spreadsheet, exported it as CSV or TSV and then created the HTML with Python. You are making it unnecessarily difficult by storing data in JSON that is basically just a big table.
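
A rough sketch of that workflow, assuming a TSV export with one row per game and resource pair. The column names (game, resource, url), the sample rows and the function name render_list are hypothetical, and the alternates and notes from the original list are left out.

```python
import csv
import html
import io
from itertools import groupby

# Hypothetical TSV export: one row per (game, resource) pair.
TSV = (
    "game\tresource\turl\n"
    "Fraxy\tsite\thttps://example.org/fraxy\n"
    "Fraxy\twiki\thttps://example.org/fraxy-wiki\n"
    "Block Tech Sandbox\tPlay Store\thttps://example.org/bts\n"
)


def render_list(tsv_text):
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    items = []
    # groupby only merges consecutive rows, so rows for one game must be adjacent.
    for game, group in groupby(rows, key=lambda row: row["game"]):
        links = " ".join(
            '[<a href="{}">{}</a>]'.format(
                html.escape(row["url"]), html.escape(row["resource"])
            )
            for row in group
        )
        items.append("<li>{} {}</li>".format(html.escape(game), links))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


page = render_list(TSV)
print(page)
```

The flat layout is easy to edit in a spreadsheet, but nested data such as numbered alternate links would need extra columns or a second sheet.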

And I don't think bash is ever the correct answer to any programming problem. :)


#3 2021-04-14 02:50:07

From: US/Eastern
Registered: 2020-06-21
Posts: 696

Re: How would you have parsed this Web page?

Out of habit, I reach for Python to manage JSON, since JSON's concepts map nicely onto Python's data types, at least in my experience.
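
To illustrate the mapping, here is a small sketch using the standard json module on a trimmed copy of the structure from the first post (URLs left empty, as they appear there). The variable names are arbitrary.

```python
import json

# Trimmed copy of the data shown in the first post.
DOC = """
{
  "games": [
    {"name": "Fraxy",
     "resources": [
       {"url": "", "name": "site", "archive": true, "index": 1,
        "alts": [{"index": 2, "url": "", "archive": false}]},
       {"url": "", "name": "YouTube"}
     ]},
    {"name": "Block Tech Sandbox",
     "notes": "(aka Block Tech: Epic Sandbox, or Block Tech: Epic Sandbox Craft Simulator Online)",
     "resources": [{"url": "", "name": "Play Store", "index": "free"}]}
  ]
}
"""

# Objects become dicts, arrays become lists, true/false become True/False.
data = json.loads(DOC)

# Rough equivalent of: jq '.games[] | select(.name == "Fraxy")'
fraxy = [game for game in data["games"] if game["name"] == "Fraxy"]

# Optional keys such as "archive" may be absent, so read them with .get().
archived = [r["name"] for r in fraxy[0]["resources"] if r.get("archive")]

print(json.dumps(fraxy[0], indent=2))
print(archived)
```

Going the other way, json.dumps on a dict built in Python takes care of commas, quoting and escaping.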

My repos | Some snippets

Heisenberg might have been here.

