Monday, October 31, 2016

Searching e-mail headers is easy

(This is a very specific post about events deemed newsworthy at the end of October 2016, namely the announcement that the FBI has reopened an investigation because emails found on a computer might be relevant to a previous case. For those who don't know me, I'm a software engineer who has worked at Google and Bing and written dozens of papers about language and search, so while I don't know all the details of this particular laptop, I'm not just ranting without expertise.)
My professional opinion: the notion that the FBI may take days to figure out whether any of 650K emails went through a particular server is nonsense. This is a one-line grep-the-headers command that may take as much as a few minutes on a standard laptop. A more thorough preliminary analysis, checking whether a few individuals are mentioned by name or by synonym, is a bit more work - maybe a couple of hours for a beginner programmer. More detailed topical analysis is where it gets more fun - about a day's work for an NLP specialist using tools that have been readily available for over a decade.
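To make the "minutes, not days" claim concrete, here is a minimal sketch in Python of the first two steps: checking which messages list a given server in their Received headers, and checking which messages mention a person by any of several names. It assumes each message is available as raw text. The function names, the server name and the sample names are all hypothetical, made up for illustration, and nothing here reflects how the FBI's actual tooling works.

```python
import re
from email.parser import HeaderParser


def routed_through(raw_message, server):
    """True if any Received header of the message mentions `server`.

    Only the header block is parsed, so the body is never inspected.
    """
    headers = HeaderParser().parsestr(raw_message)
    hops = headers.get_all("Received", [])
    return any(server.lower() in str(hop).lower() for hop in hops)


def count_routed_through(raw_messages, server):
    """Count the messages whose Received headers mention `server`."""
    return sum(1 for msg in raw_messages if routed_through(msg, server))


def mentions_any(text, names):
    """True if `text` contains any of `names` as a whole word or phrase.

    Matching is case-insensitive. An empty list of names never matches.
    """
    if not names:
        return False
    alternatives = "|".join(r"\b" + re.escape(name) + r"\b" for name in names)
    return re.search(alternatives, text, re.IGNORECASE) is not None


if __name__ == "__main__":
    # Hypothetical two-message corpus; a real run would read files from disk.
    corpus = [
        "Received: from mail.example.org by mx.example.net\n"
        "From: a@example.org\nSubject: hello\n\nLunch with Pat Smith?",
        "Received: from other.example.com by mx.example.net\n"
        "From: b@example.com\nSubject: fyi\n\nNothing to see here.",
    ]
    print(count_routed_through(corpus, "mail.example.org"))
    print([mentions_any(msg, ["Pat Smith", "P. Smith"]) for msg in corpus])
```

Both functions are a single linear pass over the text, which is why 650K messages is a small job for a laptop. Real mail would need a little more care (encoded headers, attachments, misspelled names), but that is the "couple of hours" part, not the "weeks" part.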
If you've read any "tech news" articles telling you that FBI agents might spend weeks looking for something that sounds vaguely like AI blabber, such as "metadata", please realize that you're being deceived. The notion that this task takes days is utter nonsense. American readers and voters are being played and manipulated by the claim that something easy is in fact hard.
From a language-processing point of view, the claim that James Comey had to notify Congress that the FBI had found a "big box of something", but that it would take until after the election to have any idea whether that big box contained anything relevant, is incompetent nonsense.
If instead they're saying that out of those 650K emails, they have to read every one by hand because maybe, just maybe, someone-said-something-to-someone that could count as referring to something that was once at some grade of classification, then that's quite a different proposition. It's not a standard that is applied to any politician other than Hillary Clinton, and the notion that until you've done this you have "no idea what the box of stuff contains" is just not true.
At the very least, we should have a Director of the FBI whose sense of how hard it might be to search through fewer than 1 million email headers has advanced beyond the mid-1980s.


My, this seems dated now. As you probably know, a week or so later, just before the election, the FBI reported that they didn't have anything new after all. Hillary Clinton went from a decisive lead in the polls to a dip, won the popular vote, but lost the presidential election. Given how close the elections were in the swing states that mattered, I suspect that history will tell the story of an election that was marred by scandal, and was decided in the end by the exact timing of when scandals, real or imagined, were reported.