
#1 2013-08-14 18:46:20

ajrl
Member
Registered: 2013-05-18
Posts: 22

Indexing Email

I'm looking for a way to store 302,367 emails (and counting) that doesn't require keeping 302,367+ separate mbox files and that I can easily add on to as I store additional email locally. I'm mostly concerned with preserving text content, but it would be nice to be able to save HTML and attachments as well. Does anyone have experience indexing a massive amount of email? If so, what application(s) did you use to index, search, etc.?

Last edited by ajrl (2013-08-15 04:42:32)


¡A la máquina!


#2 2013-08-14 18:57:30

WonderWoofy
Member
From: Los Gatos, CA
Registered: 2012-05-19
Posts: 8,414

Re: Indexing Email

Wait... I thought that the idea behind the mbox format was to have a single file that is concatenated when new mail is added.  So you wouldn't have 300,000+ files, but rather one gargantuan file that holds all your mail.  Are you thinking of mail stored in maildir format?  This is where there is an actual directory that then descends into various "folders" and separate pieces of mail.
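For anyone unsure which layout a mail client has left on disk, here is a minimal sketch of a check based on the distinction above. The name `mail_store_type` is made up for illustration, and the test is only a heuristic: a maildir is taken to be a directory containing cur, new and tmp, and an mbox a regular file whose first line begins with "From ".

```shell
# Hypothetical helper (not part of any mail tool): guess how mail is stored at a path.
mail_store_type() {
	local path="$1" first=""
	if [ -d "$path/cur" ] && [ -d "$path/new" ] && [ -d "$path/tmp" ]; then
		# A maildir is a directory with these three subdirectories
		echo maildir
	elif [ -f "$path" ]; then
		# An mbox file starts with a "From " separator line
		IFS= read -r first < "$path" || true
		case $first in
			"From "*) echo mbox ;;
			*) echo unknown ;;
		esac
	else
		echo unknown
	fi
}
```

Running `mail_store_type ~/Mail/inbox` (the path is just an example) prints maildir, mbox or unknown.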

I prefer the maildir format, but this is only a personal preference.  So I use mutt with offlineimap to read and store my mail, respectively.  But in order to search through my mail, I use notmuch-mutt to index everything.  It is fast, effective, and pretty good in its various search terms.  But this would require that you keep those >302,367 separate files in order to store things.


#3 2013-08-14 19:28:38

ajrl
Member
Registered: 2013-05-18
Posts: 22

Re: Indexing Email

You're right. They're definitely separate files, but they don't have file extensions. My file browser just calls them "mailbox file{s}," which isn't necessarily the same as mbox files now that I think about it. In any case, I exported these all into KMail, so they're in whatever format KMail uses by default.

I'm not totally averse to maildir format for most of my emails, but a lot of these 300K emails are actually text messages that I'd like to put into just a few text files or something. I definitely don't need to be able to index and search through that many texts on demand (or ever). For the rest of my emails, I think I'm going to go with your suggestion and start using Mutt and Notmuch. I had been leaning that way for a while, but I've always been kind of intimidated by Mutt . . .
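If the goal is just to fold a directory of one-message files into a single text file, something like the sketch below might do. It is only an illustration under assumptions, not a KMail or maildir tool: `merge_messages` is a made-up name, the separator line is arbitrary, and the result is a plain text dump rather than a valid mbox.

```shell
# Hypothetical helper: append every regular file in a directory to one text file.
# Usage: merge_messages <source-dir> <output-file>
# Keep the output file outside the source directory.
merge_messages() {
	local src="$1" out="$2" f
	for f in "$src"/*; do
		# Skip subdirectories and the unexpanded pattern of an empty directory
		[ -f "$f" ] || continue
		{
			printf '===== %s =====\n' "${f##*/}"
			cat "$f"
			printf '\n'
		} >> "$out"
	done
}
```

For a maildir you would point it at the cur directory, for example `merge_messages ~/Mail/texts/cur ~/texts.txt` (both paths are examples).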


¡A la máquina!


#4 2013-08-14 19:32:13

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,442
Website

Re: Indexing Email

Mutt, offlineimap, and notmuch are each completely separate tools (as an aside, I love all three too).  While they do work together quite well, offlineimap and notmuch will also integrate very well with nearly any other mail program.


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman


#5 2013-08-14 19:48:54

ajrl
Member
Registered: 2013-05-18
Posts: 22

Re: Indexing Email

What's the difference between the packages notmuch and notmuch-mutt? The latter has 23 dependencies, but the former only has three. Is notmuch-mutt essential for integrating with Mutt?


¡A la máquina!


#6 2013-08-14 19:51:52

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,442
Website

Re: Indexing Email

No, definitely not essential.  I don't have notmuch-mutt.

I don't know anything more about it than that though.  I *think* it is basically a collection of configurations that make it work more smoothly "out of the box" - I prefer to set things up my own way.


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman


#7 2013-08-14 21:13:58

WonderWoofy
Member
From: Los Gatos, CA
Registered: 2012-05-19
Posts: 8,414

Re: Indexing Email

notmuch-mutt is just the indexing functionality of notmuch, which can be used with mutt.  notmuch is apparently a mail client in its own right, but since its indexer is so great, people used to install the whole package only to use that one feature with mutt.  Of course, in order to make this work, I think that you used to have to patch the notmuch source.  So eventually the upstream notmuch people realized it would be better to simply merge and maintain this functionality themselves.

At least I think that is how the story goes.

In any case, notmuch is not at all essential for mutt/offlineimap functionality.  Mutt has its own search feature, but if you are searching through thousands upon thousands of emails, you will get better performance with notmuch-mutt.  I don't think I necessarily *need* it, but I like it, so I use it.

There is also the option of using isync/mbsync instead of offlineimap.  It is written in C I think, so it is supposedly better performance-wise.  It also features much more fine-grained control over what gets synced in which direction.  I have just always used offlineimap, and have that set up and working, so I never saw the need to change.


#8 2013-08-14 21:18:41

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,442
Website

Re: Indexing Email

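## from bashrc
mgrep() {
	# Index any newly arrived mail first
	notmuch new > /dev/null
	# Build a throwaway maildir to hold the matches
	mkdir -p /tmp/mgrep/{cur,new,tmp}
	# Copy matching message files (one path per line, so spaces are safe; GNU xargs/cp)
	notmuch search --output=files "$@" | xargs -r -d '\n' cp -t /tmp/mgrep/cur
	mutt -f /tmp/mgrep/
	rm -r /tmp/mgrep/
}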

That's how I put them together.  I don't need to search from within mutt, as I don't typically have mutt open unless I am reading or replying to an email.  When I want to search for something in an email, I just use `mgrep <search terms>` and I get all the matching mail, but in my familiar mutt interface.


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman


#9 2013-08-14 21:22:03

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,739

Re: Indexing Email

ajrl wrote:

I'm looking for a way to store 302,367 emails (and counting)

As an aside, might I ask:   Why?


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way


#10 2013-08-14 21:22:17

WonderWoofy
Member
From: Los Gatos, CA
Registered: 2012-05-19
Posts: 8,414

Re: Indexing Email

Trilby wrote:
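## from bashrc
mgrep() {
	# Index any newly arrived mail first
	notmuch new > /dev/null
	# Build a throwaway maildir to hold the matches
	mkdir -p /tmp/mgrep/{cur,new,tmp}
	# Copy matching message files (one path per line, so spaces are safe; GNU xargs/cp)
	notmuch search --output=files "$@" | xargs -r -d '\n' cp -t /tmp/mgrep/cur
	mutt -f /tmp/mgrep/
	rm -r /tmp/mgrep/
}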

That's how I put them together.  I don't need to search from within mutt, as I don't typically have mutt open unless I am reading or replying to an email.  When I want to search for something in an email, I just use `mgrep <search terms>` and I get all the matching mail, but in my familiar mutt interface.

Nice!  I like this.  I always keep mutt open, but I might give this a whirl.


#11 2013-08-15 02:41:10

ajrl
Member
Registered: 2013-05-18
Posts: 22

Re: Indexing Email

Okay, thank you both for all the info.

This is slightly off-topic, but I'm having a problem accessing my Gmail account through Mutt. My Google password is normally 100 characters (the maximum number allowed by Google), but I get a "login failed" message from Mutt whether I store the password in the config file or enter it manually. I tried using a 12-character password, and it worked fine. I tried a 64-character password with only letters and numbers, but it didn't work. I'm definitely not planning on using a 12-character password for my Google account. Any idea what the problem could be?

ewaller wrote:

As an aside, might I ask:   Why?

I honestly have no idea how I accumulated this many messages nor why I have the desire to keep them . . .


¡A la máquina!


#12 2013-08-15 03:35:10

WonderWoofy
Member
From: Los Gatos, CA
Registered: 2012-05-19
Posts: 8,414

Re: Indexing Email

ajrl wrote:
ewaller wrote:

As an aside, might I ask:   Why?

I honestly have no idea how I accumulated this many messages nor why I have the desire to keep them . . .

Not too long ago, I was trying to do something with my mail... I think I was trying to use isync/mbsync in place of offlineimap.  I accidentally configured something totally wrong and wiped out all my mail in one account.  It was honestly a pretty great thing actually.  I panicked at first, but after a while I realized that I don't really ever access old mail anyway.  So I was just keeping it for... nothing.


#13 2013-08-15 04:13:55

firecat53
Member
From: Lake Stevens, WA, USA
Registered: 2007-05-14
Posts: 1,542
Website

Re: Indexing Email

1. For indexing I prefer mairix (along with mutt/offlineimap, of course) over notmuch because of the smaller database size. Although jasonwryan had a nice post here about notmuch that almost got me to switch :)

2. I found archivemail and wrote a little script to automatically add any mail over 180 days old to a gzipped archive. This keeps my active Gmail maildir folders to a manageable size. If I need to search the archive, it just gets unzipped to an mbox file that can be opened by mutt or indexed by mairix. I can provide more details on this if anyone is interested.
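The script itself isn't posted, so the following is only a rough stand-in for the idea in point 2, and it swaps in plain GNU find and tar instead of archivemail. `archive_old_mail` is a made-up name. It goes by file modification time, which may differ from the date in the message headers, and it leaves the original messages in place.

```shell
# Hypothetical helper: pack maildir messages older than N days into a .tar.gz.
# Assumes GNU find and GNU tar; nothing is deleted from the maildir.
# Usage: archive_old_mail <maildir> <days> <archive.tar.gz>
archive_old_mail() {
	local maildir="$1" days="$2" archive="$3"
	# -mtime +N matches files last modified more than N days ago;
	# NUL-separated names keep odd maildir filenames intact
	find "$maildir/cur" "$maildir/new" -type f -mtime +"$days" -print0 |
		tar --null -czf "$archive" -T -
}
```

Something like `archive_old_mail ~/Mail/gmail/INBOX 180 ~/mail-archive.tar.gz` (example paths) would match the 180-day cutoff mentioned above. Check the archive with `tar -tzf` before removing anything from the maildir by hand.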

Scott


#14 2013-08-15 04:18:04

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,963
Website

Re: Indexing Email

ewaller wrote:
ajrl wrote:

I'm looking for a way to store 302,367 emails (and counting)

As an aside, might I ask:   Why?

We appreciate your concern, citizen, but there is nothing to see here. Everything is under control. Please move along.

*nods to 2 ungentlemen with tasers*

Off you go now.


My Arch Linux Stuff | Forum Etiquette | Community Ethos - Arch is not for everyone


#15 2013-08-15 06:32:20

ball
Member
From: Germany
Registered: 2011-12-23
Posts: 164

Re: Indexing Email

I use mu as my mail indexer, for searching mail and completing email addresses. It integrates well with mutt. It's in the AUR. For configuration I used the corresponding paragraphs in this guide: http://dev.gentoo.org/~tomka/mail.html


#16 2013-08-15 07:41:54

vacant
Member
From: downstairs
Registered: 2004-11-05
Posts: 816

Re: Indexing Email

This is an interesting thread - as much a philosophical discussion as technical.

I'm currently using Thunderbird, so I have quite large mbox-style files which can be in sub-directories. It sounds like the OP might prefer one file per email folder.  Each year I move stuff into a large Archive hierarchy (family, shopping, tech etc).

I would suggest moving that Archive to a removable hard drive that could be attached as needed - as a mount point in your email hierarchy.

Here's how I use the method with my entire local mail system. Instead of running thunderbird from an icon click, I plug in my device (USB flash drive) and run a script:

cat mountmail

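#!/bin/bash
# Unlock the LUKS volume, mount it as Thunderbird's local mail folder,
# run Thunderbird, then unmount and lock the volume again on exit.
mailbox="/home/paul/.thunderbird/someprofile.default/Mail/Local Folders"
sudo cryptsetup luksOpen /dev/disk/by-uuid/72179a6b-8964-4745-8f1b-46678267e893 mail || exit 1
sudo mount -o user /dev/mapper/mail "$mailbox" || { sudo cryptsetup luksClose mail; exit 1; }
thunderbird
sudo umount /dev/mapper/mail
sudo cryptsetup luksClose mail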

I'm using this setup because I don't want to keep my private stuff in the cloud. I'm using a Chromebook for browsing but switch to Crouton for private email (plugging in the USB drive). When I plug in the USB drive to use email on my laptop/server, "mountmail" includes an rsync after thunderbird to back it up. I want any mail sent from any device, using any email identity (and any smtp server), to be stored in my "local/sent" folder.
