I have nltk-data 3.0a3-1 installed, although admittedly I haven't used it much. When I went to upgrade today, I saw that the size difference between the old package and the new one (3.2.1-1) is enormous. The old package is already big, at an installed size of 1.7GB (to be expected, it's a bunch of corpora of natural language text) but the new package is a whopping 7.8GB installed size.
I ignored the upgrade for now, but I had to wonder... is such a huge increase in size expected, or might it be a bug? I don't see anyone talking about it, but I would think any time a package grows in size by over 6GB it would at least warrant a mention. Does the new version install the data uncompressed instead of leaving it compressed? Maybe there's just 4x as much data now?
Anyway, it's not a big problem, but it was surprising to me that nobody else seems to have taken notice. Seems Alexander Rødseth is the package maintainer, so maybe I should get in contact with him, but I'm not really sure how. Maybe this post will do the trick?
Last edited by mDuo13 (2016-06-21 21:41:54)
I don't know about nltk-data, but the diff between versions 3.0a3 and 3.2.1 in the PKGBUILD [1] is just a version bump, so it seems that the package is just big. You're welcome to trim it down of course...
[1] https://git.archlinux.org/svntogit/comm … caaabe2fcb
Edit: Maybe the "NoExtract" option of pacman.conf might help...
Last edited by a821 (2016-06-02 12:48:40)
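Before trimming anything, it may help to see which part of the package takes up the space. A minimal sketch, assuming the data is installed under /usr/share/nltk_data/corpora (the path that appears in the listing later in this thread); `largest_entries` is a hypothetical helper name, not something shipped with pacman or nltk:

```shell
# Print the N largest entries directly under a directory, biggest first.
# Usage: largest_entries DIR [N]   (N defaults to 10)
largest_entries() {
    dir=${1:?usage: largest_entries DIR [N]}
    count=${2:-10}
    # du -sk reports each entry's total size in kilobytes;
    # sort -rn orders the numbers descending, head keeps the top N.
    du -sk "$dir"/* | sort -rn | head -n "$count"
}
```

For example, `largest_entries /usr/share/nltk_data/corpora 5` would list the five biggest corpora. `pacman -Qi nltk-data` shows the total installed size for comparison.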
The latest version of nltk includes support for the PanLex Lite database, which is huge:
$ ls -lh /usr/share/nltk_data/corpora/panlex_lite/db.sqlite
-rw-r--r-- 1 root root 5.1G May 13 2016 23:00 /usr/share/nltk_data/corpora/panlex_lite/db.sqlite
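If PanLex Lite isn't needed, the "NoExtract" option mentioned above could probably be used to keep it off disk. This is only a sketch, assuming the whole panlex_lite directory can be skipped without breaking the corpora you actually use; NoExtract paths are written relative to the root, without a leading slash:

```ini
# /etc/pacman.conf: add the NoExtract line to the existing [options] section
[options]
# Assumption: nothing you use depends on PanLex Lite.
NoExtract = usr/share/nltk_data/corpora/panlex_lite/*
```

As far as I know this only affects future installs and upgrades, so files from an already-installed version would still need to be removed by hand.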