I finally finished importing all the old email I could find (back to 1996 or so).
I used Eudora on my old Macs, starting whenever I first got Internet access. Eudora was a great client for its day. It used something that looks a lot like mbox format, only it used \r instead of \n for line breaks. Surprisingly, most of the emails had Message-ID headers, which made checking for duplicates rather easy…
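For anyone facing the same line-break problem, here is a minimal sketch of the conversion in Python. The function names and file paths are made up for illustration and this is not the script I actually used. It only rewrites line endings, on the assumption that the rest of the file is close enough to mbox for other tools to cope.

```python
def cr_to_lf(data: bytes) -> bytes:
    """Normalize bare CR line endings (classic Mac style) to LF.

    CRLF pairs are collapsed first so they become one LF, not two.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a mailbox file as bytes and write an LF-terminated copy.

    Both paths are hypothetical; point them at your own files.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(cr_to_lf(data))
```

Working on bytes rather than text avoids having to guess the character encoding of decade-old mail.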
…which is more than I can say for Outlook. I don't know why I ever thought Outlook was a decent mail client, but I did. I was stupid enough to use it for about two years early in college. Each month, I backed up my PST file and burned it to CD, just in case Outlook crapped its pants. To Microsoft's credit, I don't think Outlook ever actually soiled itself, but I was probably lucky.
To convert my PST files into a usable format, I installed libpst. I took the mbox files that readpst produced, copied their contents to an import Maildir, and ran my script to check for duplicates. I then filed any messages the script didn't already know about.
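The duplicate check boils down to comparing Message-ID headers. The following is a rough sketch of that idea using Python's standard mailbox module, not my actual script. The function names known_ids and split_new are invented for this example, and it assumes both the archive and the import folder are Maildirs.

```python
import mailbox


def known_ids(box):
    """Collect the Message-ID value of every message in a mailbox."""
    ids = set()
    for msg in box:
        mid = msg.get("Message-ID")
        if mid:
            ids.add(str(mid).strip())
    return ids


def split_new(import_box, seen):
    """Sort the keys of import_box into (new, duplicate, no_id) lists.

    `seen` is a set of Message-IDs already archived; it is updated so
    that duplicates inside the import box itself are caught too.
    """
    new, dup, no_id = [], [], []
    for key, msg in import_box.items():
        mid = msg.get("Message-ID")
        if not mid:
            no_id.append(key)       # needs a manual look
            continue
        mid = str(mid).strip()
        if mid in seen:
            dup.append(key)
        else:
            new.append(key)
            seen.add(mid)
    return new, dup, no_id
```

Calling it would look something like `split_new(mailbox.Maildir("Mail/import"), known_ids(mailbox.Maildir("Mail/archive")))`, where both paths are placeholders. Messages without a Message-ID land in their own list rather than being silently treated as new.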
This was working great until I hit a large chunk of mail which didn't have Message-ID headers. I could have guessed at duplicates based on the combination of Subject and Date, but I resorted to checking each message individually. Sigh.
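For the record, the Subject-and-Date guess I skipped might look something like the sketch below. It is a guess by design: two different messages with the same subject sent in the same second would collide, and the function name fallback_key is invented for this example.

```python
import hashlib
from datetime import timezone
from email.utils import parsedate_to_datetime


def fallback_key(msg):
    """Build a stand-in identifier from Subject and Date.

    Whitespace in the subject is collapsed, and the date is converted
    to UTC when it parses, so cosmetic differences don't matter.
    """
    subject = " ".join(str(msg.get("Subject", "")).split())
    raw_date = str(msg.get("Date", ""))
    try:
        parsed = parsedate_to_datetime(raw_date)
        if parsed.tzinfo is not None:
            parsed = parsed.astimezone(timezone.utc)
        date = parsed.isoformat()
    except (TypeError, ValueError):
        # Nonstandard date: fall back to the raw header text.
        date = raw_date.strip()
    return hashlib.sha1(f"{subject}\n{date}".encode("utf-8")).hexdigest()
```

The resulting key could stand in for the Message-ID in a set of already-seen messages, with the caveat that any match deserves a second look before deleting anything.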
readpst seemed to have the most trouble with messages in my sent folder. I guess Outlook didn't add any useful information to the message until it sent it to the SMTP server. Basic stuff like To, From, and Date was present, but in nonstandard formats of course. Much of the To data included the person's name (e.g. BobbyNewmark - note the lack of spaces) but no email address. Presumably I had these people in my address book and Outlook waited until the last possible second to use the person's email address. Bonus.
I'll probably write a tiny script to convert these messages into a more useful format, but that will require guessing email addresses for people I haven't emailed in almost three years, in some cases.
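A first stab at that script might look like the sketch below. Everything in it is hypothetical: the address book is a hand-built dictionary with a placeholder address, and it assumes the squashed names are capitalized so they can be split back into words. Names with no entry come back as None so they can be fixed by hand.

```python
import re
from email.utils import formataddr

# Hypothetical, hand-maintained mapping from squashed names (lowercased)
# to addresses. The address here is a placeholder, not a real one.
ADDRESS_BOOK = {
    "bobbynewmark": "bobby@example.com",
}


def split_name(squashed):
    """Insert a space wherever a lowercase letter meets an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", squashed)


def repair_recipient(value, address_book):
    """Turn a bare name like 'BobbyNewmark' into 'Name <address>'.

    Values that already contain an address are returned unchanged;
    names missing from the address book return None.
    """
    value = value.strip()
    if "@" in value:
        return value
    address = address_book.get(value.lower())
    if address is None:
        return None
    return formataddr((split_name(value), address))
```

The guessing part stays manual: the script only applies whatever addresses end up in the dictionary.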
[12:40:45 dwc@dulcinea ~]$ find Mail -type f | wc -l
80723
Why is this technology an anathema to me?
Posted by dwc in Internet at 12:35 PM