
#1 2017-01-03 20:31:04

saveman71
Member
Registered: 2017-01-03
Posts: 9

Downloading heavy files in PKGBUILD

Hello,

I'm currently in the process of creating a PKGBUILD for libpostal, a C library for normalizing addresses around the world, trained on OpenStreetMap's data (I'm not affiliated with them, I just want to see a package for Arch).

I currently have a working PKGBUILD:

# Maintainer: Antoine Bolvy <antoine.bolvy@gmail.com>

_pkgname=libpostal
pkgname=$_pkgname"-git"
pkgrel=1
pkgver=r1731.d61e90a3
epoch=
pkgdesc="A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data. (git version)"
arch=("any")
url="https://github.com/openvenues/$_pkgname"
license=('MIT')
groups=()
depends=()
makedepends=('git' 'make' 'automake' 'autoconf' 'libtool' 'curl' 'pkg-config' 'snappy')
checkdepends=()
optdepends=()
provides=("$_pkgname=$pkgver" "$pkgname=$pkgver")
conflicts=("$_pkgname")
replaces=()
backup=()
options=()
install=
changelog=
source=("git+git://github.com/openvenues/$_pkgname.git")
noextract=()
md5sums=('SKIP')
validpgpkeys=()

pkgver() {
  cd "$_pkgname"
  ( set -o pipefail
    git describe --long --tags 2>/dev/null | sed 's/\([^-]*-g\)/r\1/;s/-/./g' ||
    printf "r%s.%s" "$(git rev-list --count HEAD)" "$(git rev-parse --short HEAD)"
  )
}

prepare() {
    mkdir -p data
}

build() {
    cd "$_pkgname"
    ./bootstrap.sh
    ./configure --datadir="${srcdir}/data"
    make
}

package() {
    make -C "$_pkgname" DESTDIR=${pkgdir} install
}

My concern is that compiling this package relies on downloading multiple files of training data (from S3, more than 2 GB at the moment), and that the download is currently "embedded" in the compilation itself. Here are the build instructions from libpostal's README:

git clone https://github.com/openvenues/libpostal
cd libpostal
./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
make
sudo make install

# On Linux it's probably a good idea to run
sudo ldconfig

As you can see, it relies on a data directory: it checks whether the files are already there (and then downloads nothing, or only updates), and otherwise downloads the data with the libpostal_data utility. From what I can gather, it is possible to run libpostal_data beforehand, so that running it again during installation is a no-op.

My questions are:

  • Is the PKGBUILD good as-is? As in, the download occurs in the build() function.

  • Can we warn the user (who may have a restricted internet connection, data caps, whatever) that a few GB are going to be downloaded? How?

  • Can the download be done beforehand, and outside $srcdir? I ask because some AUR helpers (e.g. pacaur) cache the build directory across package upgrades, so it might be useful to cache the downloaded data as well, as makepkg will clear $srcdir before each run.

Any other feedback is always appreciated!

EDIT: Also, do I need to execute ldconfig? If yes where, when?

Last edited by saveman71 (2017-01-03 20:40:36)


#2 2017-01-03 20:52:22

eschwartz
Fellow
Registered: 2014-08-08
Posts: 4,097

Re: Downloading heavy files in PKGBUILD

You do not need to repeat "git+git://" in the source array.
I would use ${pkgname%-git} instead of $_pkgname.
Fix your provides, it doesn't need to provide itself.
About half your makedepends are already in the base-devel group, which is expected to be installed on all Arch systems that will be running makepkg. Do not add depends/makedepends for packages in base and base-devel.
In package() you need to quote "${pkgdir}" (for the same reason you quoted "${srcdir}" earlier).

You can warn the user with an echo statement ;) but if the data is available for download at a known URL, simply add it to the source array and have makepkg copy or extract it to where it is needed.

ldconfig is already taken care of by pacman, just as it is for all installed packages containing shared libraries.
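
To illustrate the two shell points above (the ${pkgname%-git} expansion and quoting ${pkgdir}), here is a minimal sketch that can be run outside makepkg. The pkgdir value containing a space and the count_args helper are made up for the demonstration; they are not part of any real PKGBUILD.

```shell
# ${pkgname%-git} strips a trailing "-git", so no separate _pkgname is needed.
pkgname="libpostal-git"
upstream="${pkgname%-git}"
printf 'upstream name: %s\n' "$upstream"

# Hypothetical build path containing a space, to show why quoting matters.
pkgdir="/tmp/build dir/pkg"

# Helper (illustration only): prints how many arguments it received.
count_args() { printf '%s\n' "$#"; }

# Unquoted, the shell splits the value on the space.
unquoted=$(count_args DESTDIR=${pkgdir})
# Quoted, it stays a single argument, which is what make expects.
quoted=$(count_args DESTDIR="${pkgdir}")

printf 'unquoted: %s argument(s), quoted: %s argument(s)\n' "$unquoted" "$quoted"
```

If the build directory path ever contains whitespace, the unquoted form hands make a truncated DESTDIR plus a stray extra argument.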


Managing AUR repos The Right Way -- aurpublish (now a standalone tool)


#3 2017-01-03 21:05:10

saveman71
Member
Registered: 2017-01-03
Posts: 9

Re: Downloading heavy files in PKGBUILD

Thank you for your message.

Here is the updated PKGBUILD:

# Maintainer: Antoine Bolvy <antoine.bolvy@gmail.com>

pkgname="libpostal-git"
pkgrel=1
pkgver=v0.3.2.r19.gd61e90a3
epoch=
pkgdesc="A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data. (git version)"
arch=("any")
url="https://github.com/openvenues/${pkgname%-git}"
license=('MIT')
groups=()
depends=()
makedepends=('git' 'curl' 'snappy')
checkdepends=()
optdepends=()
provides=("${pkgname%-git}")
conflicts=("${pkgname%-git}")
replaces=()
backup=()
options=()
install=
changelog=
source=("git://github.com/openvenues/${pkgname%-git}.git")
noextract=()
md5sums=('SKIP')
validpgpkeys=()

pkgver() {
  cd "${pkgname%-git}"
  ( set -o pipefail
    git describe --long --tags 2>/dev/null | sed 's/\([^-]*-g\)/r\1/;s/-/./g' ||
    printf "r%s.%s" "$(git rev-list --count HEAD)" "$(git rev-parse --short HEAD)"
  )
}

prepare() {
    mkdir -p data
}

build() {
    cd "${pkgname%-git}"
    ./bootstrap.sh
    ./configure --datadir="${srcdir}/data"
    make
}

package() {
    make -C "${pkgname%-git}" DESTDIR=${pkgdir} install
}

Also, as I mentioned, the data is downloaded from multiple URLs by the "libpostal_data" utility script. It handles downloading, extracting, etc. the data, so I cannot really mimic its behaviour in the PKGBUILD.

Another issue is that the library actually needs the training data at run time. (I hadn't figured this out before because the compilation seems to "hard code" the data folder: in my local compilation the data was still available, but in the makepkg build directory it had been deleted.)
One solution I can see is to install the data to some directory like "/opt/libpostal_data". Am I heading in the right direction there?
The issue I can see is that, once again, no caching happens between the old install and the new package. Another issue is that the package will be 2.2 GB in size :/

I'm starting to see why there was no AUR package for this library :')

EDIT: https://github.com/openvenues/libpostal/issues/120 is an interesting read: it is apparently possible to decouple the download step from the build. The data is still required to run libpostal, though. Because libpostal installs the download tool, maybe it's possible to simply display a message telling the user "Alright, now you have to download the data to THAT directory with libpostal_data".

Last edited by saveman71 (2017-01-03 21:25:00)


#4 2017-01-03 22:20:44

eschwartz
Fellow
Registered: 2014-08-08
Posts: 4,097

Re: Downloading heavy files in PKGBUILD

Wait, so --datadir is I assume the autoconf-standard datadir, and during compilation it... doesn't know how to look in the pre-installation build-time location for the data? Does this stuff even need to be used during compilation, or does their inefficient build system simply not know how to handle the data?

What should happen: data is not needed at build time. A separate package (libpostal-data) can hold the data and this package can depend on it.
If the build process (optionally?) downloads the data, it does so into a temporary location during the build, then moves the data together with everything else according to the --datadir and DESTDIR.

It sounds like your best option is to NOT download the data files, tell it the data files will be available at e.g. /usr/share/libpostal/ and then separately download the data and package it for /usr/share/libpostal, possibly in a second package.
(Is the software useful at all without the data files? If not, you probably shouldn't separate the packages.)



#5 2017-01-03 22:25:26

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,523
Website

Re: Downloading heavy files in PKGBUILD

Wow - this looks like it will be of great use on a project I've been limping by with querying google maps for.  This sounds like a great approach for address format checking.

On the topic of building, I've yet to look into this specific package much (I will be looking into it this week), but it's not uncommon to see SOMEPKG and SOMEPKG-data.  I don't know if having a separate data-package which would be a dependency of the main package would help, but it's an option.

EDIT: I'm just tinkering with building this ... it's churning away now.  But I don't get the impression that the data is needed for the compilation/installation.  You just need to specify a path to where the data will be stored when the installed binary is run (the path is passed as a preprocessor definition).  So, most likely, you'd just want to use a path like /usr/share/libpostal/ or something like that.  Perhaps changing permissions on that dir and/or adding a libpostal group so that the installed program can write to that directory as a regular user would be good.

EDIT 2: the compile locked up - 1 CPU pegged to 100% for 5 minutes with no other activity and the compiler output was frozen on a single gcc line.  That doesn't look good.  But in any case I had to cancel out for now, I'll try again later.


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman


#6 2017-01-03 22:43:27

saveman71
Member
Registered: 2017-01-03
Posts: 9

Re: Downloading heavy files in PKGBUILD

@Eschwartz Honestly, I'm not really sure. I don't know anything about autoconf, I'm more of a cmake guy ;) I'm going to try to compile once without the data, and report back.
As for the separate package, it looks like a good idea. The main package would still provide the library and the download tool (maybe other things too), but the download tool is still useful in itself, I guess.

Also, as much as I'd like to have this as an AUR package, the more I play with it, the less it seems worth the work (for my own use). The library looks like it's still in very active development (which is good), but there is no proper documentation of the build system, how to install, etc.

@Trilby, yeah, it happened to me too. I think restarting the build fixed it, but I'm not sure I remember correctly; it might have been an issue with the download too, as I was poking around.

Keep in mind this is only the second time I'm trying to create an AUR package, and the first time was nowhere near as complicated!

EDIT: Alright, I'm stuck too. I tried setting the datadir, not setting it, clearing it: no effect. I can't build anymore. So weird.

Last edited by saveman71 (2017-01-03 22:52:04)


#7 2017-01-03 23:54:42

eschwartz
Fellow
Registered: 2014-08-08
Posts: 4,097

Re: Downloading heavy files in PKGBUILD

This wouldn't be the only software which is difficult to build properly. ;) If you can get it working properly there is no reason not to submit it to the AUR.
Perhaps via your exploration you can encourage the devs to polish their build system. :)

I forgot to mention earlier, you still haven't fixed the unquoted ${pkgdir}.
Since you don't seem to be planning on using all those empty variables and arrays, you should delete them.
Your updated PKGBUILD has a pkgver with an actual tagged version in it, no longer *just* the revision count. So you can remove the fallback handling in pkgver() which will now never get used (git describe will always return a useful result).
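
For reference, a sketch of what the simplified pkgver() could look like. Stripping the leading "v" from the tag is an extra suggestion along the lines of the Arch VCS packaging guidelines, not something discussed above, and the format_describe helper exists only so the sed expression can be tried without a git checkout.

```shell
# Helper (illustration only): turn `git describe --long --tags` output
# into a pkgver-style string.
format_describe() {
    # s/^v//              drop a leading "v" (assumption: tags look like v0.3.2)
    # s/\([^-]*-g\)/r\1/  prefix the commit count with "r"
    # s/-/./g             pkgver may not contain hyphens
    sed 's/^v//;s/\([^-]*-g\)/r\1/;s/-/./g'
}

# In the PKGBUILD the input comes from git, with no fallback branch.
pkgver() {
    cd "${pkgname%-git}" || return 1
    git describe --long --tags | format_describe
}

# Try the transformation on a describe-style string matching the pkgver above.
sample=$(printf '%s\n' 'v0.3.2-19-gd61e90a3' | format_describe)
printf '%s\n' "$sample"
```

Without the first sed expression the result keeps the "v", which is what the current pkgver shows.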

...

I think Trilby may be right, and this software may be happier if there is a writable directory to hold the data in that doesn't rely on continually-repackaged updates. I guess it depends on how often the data gets updated.



#8 2017-01-04 01:52:15

saveman71
Member
Registered: 2017-01-03
Posts: 9

Re: Downloading heavy files in PKGBUILD

Okay, I think I got a solution, it might not be the prettiest thing you've ever seen, but eh, it works.

PKGBUILD:

# Maintainer: Antoine Bolvy <antoine.bolvy@gmail.com>

pkgname="libpostal-git"
pkgrel=1
pkgver=v0.3.2.r19.gd61e90a3
pkgdesc="A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data. (git version)"
arch=('x86_64')  # compiled C library, so not "any"
url="https://github.com/openvenues/${pkgname%-git}"
license=('MIT')
depends=('curl' 'snappy')
makedepends=('git')
provides=("${pkgname%-git}")
conflicts=("${pkgname%-git}")
source=("git://github.com/openvenues/${pkgname%-git}.git")
md5sums=('SKIP')
install="${pkgname}.install"

pkgver() {
    cd "${pkgname%-git}"
    git describe --long --tags 2>/dev/null | sed 's/\([^-]*-g\)/r\1/;s/-/./g'
}

build() {
    cd "${pkgname%-git}"
    ./bootstrap.sh
    ./configure --disable-data-download
    make
}

package() {
    make -C "${pkgname%-git}" DESTDIR="${pkgdir}" install
}

libpostal-git.install

post_install() {
    echo ":: You need to download libpostal's trained data to"
    echo "   /usr/local/share/libpostal prior to using this library."
    echo "   This can be done with the libpostal_data command, e.g.:"
    echo "   libpostal_data download all /usr/local/share/libpostal"
}

post_upgrade() {
    echo ":: You might want to update libpostal's trained data."
    echo "   This can be done with the libpostal_data command, e.g.:"
    echo "   libpostal_data download all /usr/local/share/libpostal"
}

post_remove() {
    echo ":: You might still have libpostal's trained data located in"
    echo "   /usr/local/share/libpostal (usually manually downloaded)"
    echo "   You should be able to safely remove this directory."
}

@Eschwartz: Hopefully it has no more style issues.
@Trilby: Okay, so the build that was stuck turns out to succeed after a while. I have no clue why. I may submit an issue to the repository on GitHub though; I'm sure this is not correct behavior.

Personally, I like this solution because it leaves the user some choice on where and how to store the data. For example, I was a little short on space on my root partition, so I just created a symlink to the location I had downloaded the files before (on another partition) when I was building manually.
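
A rough sketch of that symlink setup. It runs inside a throwaway directory created by mktemp so it can be tried safely; the "bigdisk" path is a stand-in for wherever the data actually lives, and on a real system the link would be created as root at /usr/local/share/libpostal.

```shell
# Throwaway root so the demo never touches the real filesystem.
demo_root=$(mktemp -d)

# Stand-in for the partition that has room for the data (hypothetical path).
data_dir="$demo_root/bigdisk/libpostal"
# Stand-in for the location the library was configured to read from.
link_path="$demo_root/usr/local/share/libpostal"

mkdir -p "$data_dir" "$(dirname "$link_path")"

# Point the expected location at the directory that holds the data.
ln -s "$data_dir" "$link_path"

# Confirm where the link leads.
resolved=$(readlink "$link_path")
printf '%s -> %s\n' "$link_path" "$resolved"
```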

EDIT: Also, I moved 'curl' and 'snappy' to depends because they are required to run libpostal_data.
EDIT 2: Actually, I have another problem: the library is unusable because it is installed in "/usr/local/lib", which is not in LD_LIBRARY_PATH by default, so programs won't work without running "export LD_LIBRARY_PATH=/usr/local/lib" beforehand. Is this only my problem (i.e. it's the user's responsibility), or am I missing something here?

Last edited by saveman71 (2017-01-04 03:40:29)


#9 2017-01-04 14:22:57

eschwartz
Fellow
Registered: 2014-08-08
Posts: 4,097

Re: Downloading heavy files in PKGBUILD

/usr/local is for things installed outside of the package manager. There should be an option in ./configure to set the prefix to /usr.
But AFAIK ldconfig should automatically look there :/
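
As a sketch, here are the build() and package() functions from the previous post with the prefix set. --prefix is the standard autoconf option. That the data directory then becomes /usr/share/libpostal is an assumption based on the /usr/local/share/libpostal default mentioned in that post, so the messages in the .install file would need the same change.

```shell
pkgname="libpostal-git"

build() {
    cd "${pkgname%-git}" || return 1
    ./bootstrap.sh
    # Install under /usr instead of the default /usr/local.
    ./configure --prefix=/usr --disable-data-download
    make
}

package() {
    # Stage the install into makepkg's package directory.
    make -C "${pkgname%-git}" DESTDIR="${pkgdir}" install
}
```

With the library in /usr/lib, the dynamic linker finds it through its default search path, so no LD_LIBRARY_PATH export should be needed.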

Last edited by eschwartz (2017-01-04 14:25:25)



#10 2017-01-11 00:26:07

saveman71
Member
Registered: 2017-01-03
Posts: 9

Re: Downloading heavy files in PKGBUILD

Hey!

For info, the package is live: https://aur.archlinux.org/packages/libpostal-git

Cheers
