
#1 2021-04-13 19:06:55

porcelain1
Member
Registered: 2020-01-18
Posts: 97

How would you have parsed this Web page?

A while ago I uploaded a list of games to Neocities. Recently I discovered a template system called Mustache, so I decided to move all the data into a JSON file and let the C implementation of Mustache generate the Web page for me. It produced a visually identical page from this template:

$ cat list-of-vehicle-building-games.mustache
<!DOCTYPE html>
<html lang="en">
	<head>
		<meta charset="UTF-8">
		<link rel="icon" type="image/svg+xml" href="./logo.svg">
		<title>{{title}} - aqueduct1</title>
	</head>
	<body>
		<h1>{{title}}</h1>
		<ul>{{#games}}
			<li>
				{{name}}
				{{#resources}}[<a href="{{url}}">{{name}}{{#index}}<sup>{{.}}</sup>{{/index}}</a>{{#index}}<sup>{{/index}}{{#archive}} (archive){{/archive}}{{#mention}} (mention){{/mention}}{{#index}}</sup>{{/index}}{{#alts}}
				<sup><a href="{{url}}">{{index}}</a>{{#archive}} (archive){{/archive}}{{#mention}} (mention){{/mention}}</sup>{{/alts}}]
				{{/resources}}
				{{{notes}}}
			</li>
		{{/games}}</ul>
		[...]
	</body>
</html>

But to get the JSON... I suffered. This cursed Bash script, using html-xml-utils, grep and sed, spat out broken JSON that saved me a lot of time, but also required a lot of manual corrections until jq stopped complaining:

$ cat extract-games-to-json
#!/bin/bash

shopt -s extglob

json_games=()

IFS=$'\n' resources=($(cat list-of-vehicle-building-games.html | hxremove sup | hxselect -s "\n" -c ul li a | sort | uniq))
IFS=$'\136'   games=($(cat list-of-vehicle-building-games.html | hxselect -c -s $'\136' li ))

echo -en "{\n"
echo -en "\t\"title\" : \"$(cat list-of-vehicle-building-games.html | hxselect -c h1)\",\n"
echo -en '\t"games" : ['

for game_i in "${!games[@]}"; do
        IFS=$'\n' game=($(echo "${games[game_i]}"))
        unset "game[${#game[@]} - 1]"

        name="${game[0]##*([[:space:]])}"
        notes="${game[${#game[@]} - 1]##*([[:space:]])}"
        unset "game[0]"

        json_games[game_i]="\n\t\t{\n\t\t\t\"name\" : \"$name\""

        if [[ "$notes" =~ ^\(.*\)$ ]]; then
                json_games[game_i]+=",\n\t\t\t\"notes\" : \"$(echo "$notes" | sed 's/"/\\"/g')\""
                unset "game[${#game[@]} - 1]"
        fi

        if [[ ${#game[@]} -gt 0 ]]; then
                json_games[game_i]+=",\n\t\t\t\"resources\" : ["

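                resources=()

                for resource in "${game[@]}"; do
                        # Match lines of the form [<a ...>...</a>]. The brackets must be escaped;
                        # a bare [.*] is a bracket expression matching a literal "." or "*".
                        if [[ "$resource" =~ \[.*\] ]]; then
                                # Append a new array element; += with a bare string only extends resources[0].
                                resources+=(" {\"url\" : $(echo "$resource" | grep -o '".*"'), \"name\" : \"$(echo "$resource" | grep -o '>.*<')\"}")
                        fi
                done

                # IFS has to be assigned before the expansion happens, hence ";" instead of a command prefix.
                json_games[game_i]+=$(IFS=,; echo -n "${resources[*]}")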
                json_games[game_i]+="\n\t\t\t]"
        fi

        json_games[game_i]+="\n\t\t}"
done

(IFS=,; echo -en "${json_games[*]}")

echo -en '\n\t]\n}'

At this point I contemplated Python... I think I'll give it a shot. Here is a sample of the hairiest parts of the JSON:

$ jq '.games[] | select(.name == "Fraxy")' list-of-vehicle-building-games.json
{
  "name": "Fraxy",
  "resources": [
    {
      "url": "https://web.archive.org/web/20200129130946/http://monz.sp.land.to/wp/fraxy/",
      "name": "site",
      "archive": true,
      "index": 1,
      "alts": [
        {
          "index": 2,
          "url": "http://fraxyhq.net/",
          "archive": false
        }
      ]
    },
    {
      "url": "https://shmup.fandom.com/wiki/Fraxy",
      "name": "wiki",
      "index": 1,
      "alts": [
        {
          "index": 2,
          "url": "https://tig.fandom.com/wiki/Fraxy"
        },
        {
          "index": 3,
          "url": "https://web.archive.org/web/20100310233900/http://fraxy.kafuka.org/wiki/Main_Page",
          "archive": true
        },
        {
          "index": 4,
          "url": "https://web.archive.org/web/20141218150554/http://wiki.fraxy.net/index.php?title=Main_Page",
          "archive": true
        },
        {
          "index": 5,
          "url": "https://web.archive.org/web/20090317233751/http://fraxycompendium.pbwiki.com/",
          "archive": true
        },
        {
          "index": 6,
          "url": "http://fraxyacademy.pbworks.com/w/page/8284158/FrontPage"
        },
        {
          "index": 7,
          "url": "http://fraxy.pbworks.com/w/page/8284047/The%20Bosses"
        }
      ]
    },
    {
      "url": "http://fraxyhq.net/forums/index.php",
      "name": "forum",
      "index": 1,
      "alts": [
        {
          "index": 2,
          "url": "https://web.archive.org/web/20180911174817/http://acmlm.kafuka.org/board/forum.php?id=51",
          "archive": true
        },
        {
          "index": 3,
          "url": "https://web.archive.org/web/20120313164954/http://fraxy.forumi.biz/",
          "archive": true
        },
        {
          "index": 4,
          "url": "https://web.archive.org/web/20121107142027/http://fraxyhq.com:80/forums/index.php",
          "archive": true
        }
      ]
    },
    {
      "url": "https://www.youtube.com/results?search_query=%22fraxy%22",
      "name": "YouTube"
    },
    {
      "url": "https://tvtropes.org/pmwiki/pmwiki.php/VideoGame/Fraxy",
      "name": "TV Tropes"
    }
  ]
}
$ jq '.games[] | select(.name == "Block Tech Sandbox")' list-of-vehicle-building-games.json
{
  "name": "Block Tech Sandbox",
  "notes": "(aka Block Tech: Epic Sandbox, or Block Tech: Epic Sandbox Craft Simulator Online)",
  "resources": [
    {
      "url": "https://play.google.com/store/apps/details?id=com.NGG.BlockTech",
      "name": "Play Store",
      "index": "free",
      "alts": [
        {
          "index": "gold",
          "url": "https://play.google.com/store/apps/details?id=com.NGG.BlockTechGold"
        }
      ]
    },
    {
      "url": "https://apps.apple.com/app/block-tech-sandbox-online/id1465592382",
      "name": "App Store"
    },
    {
      "url": "https://www.crazygames.com/game/block-tech-epic-sandbox",
      "name": "Crazy Games"
    },
    {
      "url": "https://www.silvergames.com/block-tech-epic-sandbox",
      "name": "Silver Games"
    }
  ]
}

Maybe I could've learned some of those Python libraries. Maybe I missed some program that could've been more useful. Maybe I need to learn more Bash. What do you think?
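
For comparison, here is a minimal sketch of the Python route using only the standard library's html.parser. It assumes the page looks like the output of the template above (one closed <li> per game, each resource as [<a href="...">name</a>], and alternates wrapped in <sup>). The class name GameListParser and the SAMPLE markup are made up for illustration. Notes, alts and the archive/mention flags are deliberately left out.

```python
import json
from html.parser import HTMLParser


class GameListParser(HTMLParser):
    """Collect each <li> as {"name": ..., "resources": [{"url", "name"}, ...]}."""

    def __init__(self):
        super().__init__()
        self.games = []
        self.game = None    # the <li> currently being read, if any
        self.link = None    # the top-level <a> currently being read, if any
        self.sup_depth = 0  # > 0 while inside <sup>, whose content is skipped

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.game = {"name": "", "resources": []}
        elif tag == "sup":
            self.sup_depth += 1
        elif tag == "a" and self.game is not None and self.sup_depth == 0:
            self.link = {"url": dict(attrs).get("href", ""), "name": ""}

    def handle_endtag(self, tag):
        if tag == "li" and self.game is not None:
            # Drop surrounding whitespace and the "[" that opens the first resource.
            self.game["name"] = self.game["name"].strip().rstrip("[").rstrip()
            self.games.append(self.game)
            self.game = None
        elif tag == "sup" and self.sup_depth > 0:
            self.sup_depth -= 1
        elif tag == "a" and self.link is not None:
            self.game["resources"].append(self.link)
            self.link = None

    def handle_data(self, data):
        if self.game is None or self.sup_depth:
            return
        if self.link is not None:
            self.link["name"] += data
        elif not self.game["resources"]:
            # Assumption: text before the first link is the game's name.
            self.game["name"] += data


# Hypothetical markup, shaped like the template's output.
SAMPLE = """
<ul>
  <li>
    Fraxy
    [<a href="http://fraxyhq.net/forums/index.php">forum<sup>1</sup></a>
    <sup><a href="https://web.archive.org/web/20120313164954/http://fraxy.forumi.biz/">2</a> (archive)</sup>]
    [<a href="https://tvtropes.org/pmwiki/pmwiki.php/VideoGame/Fraxy">TV Tropes</a>]
  </li>
</ul>
"""

parser = GameListParser()
parser.feed(SAMPLE)
print(json.dumps({"games": parser.games}, indent=2))
```

Because json.dumps does the quoting and escaping, the output is valid JSON by construction, which is the part the Bash script struggled with.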



#2 2021-04-13 19:41:14

Morn
Member
Registered: 2012-09-02
Posts: 886

Re: How would you have parsed this Web page?

I would have entered this data in a spreadsheet, exported it as CSV or TSV and then created the HTML with Python. You are making it unnecessarily difficult by using JSON to store data that is basically just a big table.

And I don't think Bash is ever the correct answer to any programming problem. :)
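
A rough sketch of the spreadsheet route, assuming a TSV export with one row per link and three made-up columns (game, resource, url). The render function is hypothetical. The nested alternates from the JSON sample would need extra columns that are not shown here.

```python
import csv
import html
import io

# Hypothetical TSV export: one row per link.
TSV = "\n".join([
    "game\tresource\turl",
    "Fraxy\tforum\thttp://fraxyhq.net/forums/index.php",
    "Fraxy\tTV Tropes\thttps://tvtropes.org/pmwiki/pmwiki.php/VideoGame/Fraxy",
    "Block Tech Sandbox\tCrazy Games\thttps://www.crazygames.com/game/block-tech-epic-sandbox",
])


def render(tsv_text):
    """Group rows by game (keeping first-seen order) and emit an HTML list."""
    games = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        games.setdefault(row["game"], []).append(row)

    lines = ["<ul>"]
    for name, rows in games.items():
        links = " ".join(
            '[<a href="{}">{}</a>]'.format(
                html.escape(r["url"], quote=True), html.escape(r["resource"])
            )
            for r in rows
        )
        lines.append("\t<li>{} {}</li>".format(html.escape(name), links))
    lines.append("</ul>")
    return "\n".join(lines)


output = render(TSV)
print(output)
```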


#3 2021-04-14 02:50:07

GaKu999
Member
From: US/Eastern
Registered: 2020-06-21
Posts: 696

Re: How would you have parsed this Web page?

Out of habit, I reach for Python when working with JSON, since JSON's concepts translate nicely to Python, at least in my experience.
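
To illustrate that mapping, a small sketch: json.loads turns JSON objects into dicts and arrays into lists, so the jq filter from the first post becomes a list comprehension. The DOC string below is a trimmed, hand-written excerpt of the JSON shown earlier, not the full file.

```python
import json

# Trimmed excerpt of the structure from the first post.
DOC = """
{
  "games": [
    {"name": "Fraxy", "resources": [
      {"url": "https://tvtropes.org/pmwiki/pmwiki.php/VideoGame/Fraxy", "name": "TV Tropes"}
    ]},
    {"name": "Block Tech Sandbox", "resources": [
      {"url": "https://www.crazygames.com/game/block-tech-epic-sandbox", "name": "Crazy Games"},
      {"url": "https://www.silvergames.com/block-tech-epic-sandbox", "name": "Silver Games"}
    ]}
  ]
}
"""

data = json.loads(DOC)

# Equivalent of: jq '.games[] | select(.name == "Fraxy")'
fraxy = [g for g in data["games"] if g["name"] == "Fraxy"]

# Every resource name across all games; .get covers games without resources.
resource_names = [r["name"] for g in data["games"] for r in g.get("resources", [])]

print(json.dumps(fraxy, indent=2))
print(resource_names)
```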

