zeptodb: tiny, fast command-line DBM tools

jakobcreutzfeldt · 2013-06-07 21:32:19

So, this started out as a coding exercise a while ago, then it was to become part of a bigger project, and then that bigger project never went anywhere. So, the code's just been sitting around gathering bitdust. I finally decided to clean it up and make it available on its own in case someone finds it useful.

zeptodb is a small collection of relatively tiny command-line tools for interacting with DBM databases. For the uninitiated, DBM databases are flat (non-relational) databases; in other words, they are persistent key-value hash tables. Typically they are created via a library for C, Python, Perl, etc. zeptodb fills a gap by providing general-purpose command-line tools for them. Some DBM libraries come with really basic binaries for manipulating the databases, but they are not designed to be very flexible.
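
To make "persistent key-value hash table" a bit more concrete, here is a minimal sketch of the library route using Python's standard dbm module. It is not zeptodb code, and the store_and_fetch helper and the gene_locations file name are only illustrative assumptions. Which on-disk format you get depends on the back-end Python picks on your system.

```python
import dbm
import os
import tempfile


def store_and_fetch(path, records):
    """Write records to a DBM file, reopen it, and read them back.

    Hypothetical helper for illustration; not part of zeptodb.
    """
    # "c" opens the database read-write and creates it if it is missing.
    with dbm.open(path, "c") as db:
        for key, value in records.items():
            # DBM stores raw bytes, so encode both key and value.
            db[key.encode()] = value.encode()

    # Reopen read-only: the data comes from disk, not from memory.
    with dbm.open(path, "r") as db:
        return {key: db[key.encode()].decode() for key in records}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "gene_locations")
        print(store_and_fetch(path, {"RGS9": "17:63133549-63223821:1"}))
```

That is roughly the amount of code needed just to store and fetch one value, which is the gap the command-line tools are meant to cover.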

These tools may be helpful in scripts and pipelines, for example, when persistent data storage is needed but a full database would be overkill. They may also be useful if, for whatever reason, you would like to manipulate DBM databases created by other programs from the command line or from scripts.

That's it. Comments welcome.

For more information, see the included documentation (man & info).

http://zeptodb.invergo.net
https://aur.archlinux.org/packages/zeptodb/

Last edited by jakobcreutzfeldt (2013-10-24 16:27:23)

jakobcreutzfeldt · 2013-06-10 22:18:27

I updated to version 1.1 today, which shouldn't do much for most people; it just conditionally includes some gnulib code for portability on systems that don't use the GNU C Library. For everyone else, nothing changes, not even the binary size (ok, slightly longer configure script; oh well).

Also, for fun, a comparison between searching for genetic locations in a database of 5 records versus one of more than 60k:

$ time echo "RGS9" | zdbf gene_locations.db
17:63133549-63223821:1

real    0m0.002s
user    0m0.000s
sys     0m0.000s

$ zdbf --all all_genes.db | wc -l
62192

$ time echo "ENSG00000141194" | zdbf all_genes.db 
17:56232494-56233517:1

real    0m0.002s
user    0m0.000s
sys     0m0.000s

(different keys were used in the two databases but that doesn't matter).

So, if you want O(1) disk-bound data retrieval in your scripts, zeptodb might be handy.
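
For a rough idea of why the lookup time stays flat as the database grows, here is a sketch that contrasts a line-by-line scan of a text file with a keyed lookup. It uses Python's standard dbm module as a stand-in, not zeptodb. The "key|value" line format and the function names are assumptions made for the example.

```python
import dbm
import os
import tempfile


def scan_lookup(text_path, key):
    """Linear search: read lines until one starts with 'key|'."""
    prefix = key + "|"
    with open(text_path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(prefix):
                return line[len(prefix):].rstrip("\n")
    return None


def build_db(text_path, db_path):
    """One-time cost: copy every 'key|value' line into a DBM file."""
    with open(text_path, encoding="utf-8") as handle, dbm.open(db_path, "c") as db:
        for line in handle:
            key, _, value = line.rstrip("\n").partition("|")
            db[key.encode()] = value.encode()


def db_lookup(db_path, key):
    """Keyed search: the library locates the record from the key itself."""
    with dbm.open(db_path, "r") as db:
        value = db.get(key.encode())
    return None if value is None else value.decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        text_path = os.path.join(tmp, "records.txt")
        with open(text_path, "w", encoding="utf-8") as handle:
            for i in range(50000):
                handle.write(f"key{i}|value {i}\n")
        db_path = os.path.join(tmp, "records_db")
        build_db(text_path, db_path)
        # Both approaches return the same value; only the work done differs.
        print(scan_lookup(text_path, "key49999"))
        print(db_lookup(db_path, "key49999"))
```

The scan has to read up to the whole file for each query, while the keyed lookup does about the same amount of work no matter how many records there are. The price is building the database once up front.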

Last edited by jakobcreutzfeldt (2013-06-10 22:19:59)

jakobcreutzfeldt · 2013-06-11 09:11:47

One more fun test: a database with data for all the single nucleotide polymorphisms (SNPs) on chromosome 22 in Europeans.

$ du -h chr22_w11_annotations.csv
159M    chr22_w11_annotations.csv

$ wc -l chr22_w11_annotations.csv 
816755 chr22_w11_annotations.csv

$ time grep chr22_EUR_1_11_rs8138356_22 chr22_w11_annotations.csv | sed 's/chr22_EUR_1_11_rs8138356_22|//'
51234159 51243297 14387 758 29.1538 26 42 2 2.8733 7 46.026 3.2308 812.4603 0.2578 21.6154 94.8954 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA no_anno

real    0m0.103s
user    0m0.070s
sys     0m0.033s

$ time sed -n '/chr22_EUR_1_11_rs8138356_22/{s/chr22_EUR_1_11_rs8138356_22|\(.*\)/\1/;p}' chr22_w11_annotations.csv
51234159 51243297 14387 758 29.1538 26 42 2 2.8733 7 46.026 3.2308 812.4603 0.2578 21.6154 94.8954 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA no_anno

real    0m0.584s
user    0m0.523s
sys     0m0.057s

$ du -h snps_db 
260M    snps_db

$ time echo chr22_EUR_1_11_rs8138356_22 | zdbf snps_db 
51234159 51243297 14387 758 29.1538 26 42 2 2.8733 7 46.026 3.2308 812.4603 0.2578 21.6154 94.8954 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA no_anno

real    0m0.004s
user    0m0.000s
sys     0m0.000s

Last edited by jakobcreutzfeldt (2013-06-11 09:22:39)

Spyhawk · 2013-06-11 09:28:52

Looks... impressive.

jakobcreutzfeldt · 2013-06-11 09:32:03

Thanks! Admittedly, all the hard work was done by the GDBM devs.

Last edited by jakobcreutzfeldt (2013-06-11 14:17:17)

jakobcreutzfeldt · 2013-07-03 22:03:15

I've just released zeptodb 2.0:

Kyoto Cabinet back-end is now available. At compile time, you can now opt to use the Kyoto Cabinet DBM library instead of the GDBM library. Note that any databases created with the GDBM library will *not* be accessible with the Kyoto Cabinet library.
Database creation now has a dedicated program. Previously, zeptodb databases were created automatically upon storing the first value in them. Now, there is the zdbc tool dedicated to creating new databases. This tool allows you to control some parameters of the database, namely the number of hash buckets and the size of the memory-mapped region that it uses. As a result, zdbs will no longer automatically create new databases.
New --verbose mode. All tools now have a --verbose option, which causes them to display more information while they work. This is particularly handy when you are working with millions of entries and you want to keep an eye on progress.
General code refactoring. To cleanly handle the two DBM back-ends, the code was significantly refactored, making it easier to read and edit.
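
Since the two back-ends write incompatible files, it can be handy to check what format an existing database is in before pointing a tool at it. The sketch below is not a zeptodb feature. It uses dbm.whichdb from Python's standard library, which can recognize GDBM files (among a few others) but knows nothing about Kyoto Cabinet, so a Kyoto Cabinet file would most likely be reported as unrecognized. The describe_format helper and the example file name are assumptions for illustration.

```python
import dbm
import dbm.dumb
import os
import tempfile


def describe_format(path):
    """Report which DBM format Python thinks a file is in.

    Hypothetical helper for illustration; not part of zeptodb.
    """
    kind = dbm.whichdb(path)
    if kind is None:
        # whichdb returns None when the file is missing or unreadable.
        return "missing or unreadable"
    if kind == "":
        # An empty string means the file exists but the format is unknown.
        return "exists, but not a format Python recognizes"
    # Otherwise this is a module name, e.g. "dbm.gnu" for GDBM files.
    return kind


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example_db")
        # dbm.dumb is always available, so it serves as the demo format.
        with dbm.dumb.open(path, "c") as db:
            db[b"RGS9"] = b"17:63133549-63223821:1"
        print(describe_format(path))
        print(describe_format(os.path.join(tmp, "no_such_db")))
```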

jakobcreutzfeldt · 2013-07-04 21:20:49

Minor version bump to 2.0.1, fixing a couple of minor bugs that only showed up when running on a system based on Linux 2.6.18 (but bugs nevertheless).

jakobcreutzfeldt · 2013-11-17 16:36:50

I've just released version 2.0.2. You can read about the release here. Long story short: I've cleaned up all the memory handling so there shouldn't be any leaks, I've fixed building on Mac OS X, and I've dipped my toes into POSIX standards-compliant packaging (see the above link for more info), though that doesn't matter around here since you'll probably install it via the AUR.

I've also added a tutorial to the documentation, since people weren't 100% sure about how to use the programs.

edit: small update to 2.0.2b, fixing some documentation problems that I missed

Last edited by jakobcreutzfeldt (2013-11-17 17:01:15)

Arch Linux
