Finding Hidden Haiku


Somewhere in this extract from an article by John Paul Rathbone lurks a haiku. It was not placed there deliberately. It, and others like it, have been sitting in plain sight, unrecognised for what they are.

Lastly, Mr Temer needs to build a Congressional majority. Although Mr Temer is a wily negotiator, this is perhaps his hardest task — especially as the Petrobras corruption scandal has fractured Congress into myriad whirlpools of seething factions, not only between parties but within them too. Mr Temer has already had to shelve earlier plans to reduce the number of ministers because such appointments are a traditional way of dealing out pork and thus building coalitions.

… and here it is:

  • has fractured Congress
    into myriad whirlpools
    of seething factions

There are plenty more such haiku: identified by computer algorithm, selected by a human, and brought to light in ft.com/hidden-haiku.

What is a haiku?

As Wikipedia describes it, “A haiku in English is a very short poem in the English language, following to a greater or lesser extent the form and style of the Japanese haiku”, sometimes having “a three-line format with 17 syllables arranged in a 5–7–5 pattern”.

For the purposes of this project, we have concentrated more on finding this 5-7-5 syllable structure, and less on “a focus on some aspect of nature or the seasons” and the other, more subtle, criteria. That being said, some of the haiku we have found have come pleasingly close to being powerful pieces of prose in their own right.

Recognising haiku

Inspired by the lovely New York Times Haiku project, we conducted a series of mini-experiments that looked at our content in new ways: manipulating search results, and automatically identifying and reworking ‘accidental’ poetry*. Along the way, perhaps inevitably, the tool we built was easily tweaked to become a haiku detector, and it seemed wrong not to point it at the FT articles to see what we could find. After a conversation with Jacob Harris (@harrisj), formerly of the NYT, we concluded that we’ve ended up in a similar haiku place.

The code for our haiku detector is available but comes with plenty of caveats, chief among which is that it was My First Golang Program (™). The tool itself is live but, at the time of writing, the part involving haiku is restricted to FT Staff accounts.

On startup, the haiku detector reads in the 134K defined words from the enormous and enormously useful phoneme dataset, the CMU Pronouncing Dictionary, plus some extra items added as part of this project (more details below).
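To give a flavour of the startup step, here is a minimal Python sketch (not the Go code of the tool itself). It assumes the dictionary's plain-text layout of one `WORD  PHONEME PHONEME ...` entry per line, comment lines starting with `;;;`, and alternate pronunciations written as `WORD(1)`. The function name `parse_pronunciations` and the BITCOIN pronunciation are made up for illustration.

```python
def parse_pronunciations(lines):
    """Map each word to its list of phonemes, keeping the first pronunciation seen."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):   # skip blank lines and comments
            continue
        word, _, phonemes = line.partition(" ")
        word = word.split("(")[0].upper()         # "WORD(1)" marks an alternate pronunciation
        entries.setdefault(word, phonemes.split())
    return entries


sample = [
    ";;; a comment line",
    "HAIKU  HH AY1 K UW0",
    "BITCOIN  B IH1 T K OY2 N",   # illustrative extra entry, pronunciation assumed
]
entries = parse_pronunciations(sample)
print(entries["HAIKU"])
```

Extra project-specific items could be appended to the same list of lines before parsing, which is one simple way of layering additions over the original dataset.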

An example defined word is: HAIKU: HH AY1 K UW0. There are two numbered phonemes, AY1 and UW0, so the word has two syllables: the primary emphasis is on the phoneme marked with a 1, and a 0 marks an unstressed syllable. Longer words can have a syllable with a secondary emphasis, marked with a 2, e.g. DEFENESTRATION: D IY0 F EH2 N EH0 S T R EY1 SH AH0 N.

A word’s phonemes are then mapped to a string of emphasis points, one digit per syllable. In the case of the example word HAIKU, this would be “10”, indicating this is a two-syllable word with emphasis on the first syllable.
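The mapping is small enough to sketch in a few lines of Python (an illustration only, not the tool's Go code). The function name `emphasis` is hypothetical; it simply keeps the trailing digit of every numbered phoneme.

```python
def emphasis(phonemes):
    """Collect the stress digits of a pronunciation, one per syllable."""
    return "".join(p[-1] for p in phonemes.split() if p[-1].isdigit())


print(emphasis("HH AY1 K UW0"))                          # HAIKU: prints 10
print(emphasis("D IY0 F EH2 N EH0 S T R EY1 SH AH0 N"))  # DEFENESTRATION: prints 02010
```

The length of the resulting string is the syllable count, which is all a 5-7-5 haiku match needs. The digits themselves only matter for meters that demand a particular emphasis.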

The extra items added to the dictionary include

  • straightforward definitions, such as BITCOIN and EUROSCEPTIC, which were missing from the original
  • alternate (mis)spellings, such as CRITICISED → CRITICIZED

and extra functionality

  • regexes for word boundaries, such as /\w+[‘’][sStdmM]/
  • regexes for transforming awkward text, such as apostrophes, /(\w+)’([sStTdDmM])/$1’$2/
  • marking certain words as being unsuited to terminating a phrase in poetry, such as AND and BUT
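As an illustration of the word-boundary rule, here is a Python sketch of a tokeniser. The apostrophe pattern is copied from the regex quoted above. The fallback to a plain run of word characters, the trailing `(?!\w)` check and the name `tokenize` are assumptions, not details taken from the tool.

```python
import re

# A word followed by an apostrophe and one of the listed letters stays in one piece;
# anything else falls back to a plain run of word characters.
WORD = re.compile(r"\w+[‘’][sStdmM](?!\w)|\w+")


def tokenize(text):
    """Split text into words, keeping possessives and contractions whole."""
    return WORD.findall(text)


print(tokenize("Argentina’s new president, Mauricio Macri"))
```

Keeping “Argentina’s” as a single token means it can be given a single dictionary entry and emphasis string, rather than being looked up as two unrelated fragments.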

The user specifies the structure of the haiku as “..... ....... .....”, representing contiguous blocks of 5+7+5 syllables. No particular emphasis is required, but an individual word cannot be split across two blocks of syllables. This structure, aka meter, is turned into a regular expression, which is then matched against concatenations of strings of emphasis points.
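One way the meter could be turned into a regular expression is sketched below in Python. This is a guess at the technique rather than the tool's actual Go code, and `meter_to_regex` is a hypothetical name. It assumes the emphasis strings of consecutive words are joined with single spaces, that a dot in the meter accepts any emphasis, and that a digit in the meter demands that exact emphasis.

```python
import re


def meter_to_regex(meter):
    """Compile a meter such as '..... ....... .....' into a regex over emphasis strings."""
    blocks = []
    for block in meter.split():
        # "." accepts any stress level; a digit in the meter demands that exact stress.
        syllables = ["[012]" if ch == "." else re.escape(ch) for ch in block]
        # Inside a block, consecutive syllables may or may not be separated by a word break.
        blocks.append(" ?".join(syllables))
    # Between blocks a word break is compulsory, so no word straddles two blocks.
    body = " ".join(blocks)
    # The match must also start and end on a word break (or the ends of the text).
    return re.compile(r"(?<![^ ])" + body + r"(?![^ ])")


haiku = meter_to_regex("..... ....... .....")
print(bool(haiku.search("1 10 10 10 100 10 1 10 10")))   # words fall on the 5/7/5 boundaries
print(bool(haiku.search("10 10 10 10 10 10 10 10 1")))   # 17 syllables, but a word straddles a boundary
```

The first search succeeds and the second fails, even though both strings contain 17 syllables, because every fifth syllable in the second lands in the middle of a two-syllable word.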

The user also specifies some articles, such as those currently on the FT.com homepage (usually 15), or the most recently published articles, or Lucy Kellaway’s latest thoughts. The content of each article is pulled in via our Search and Content APIs, and split into words. Each word is looked up in the dictionary to obtain its emphasis string. If there is no matching definition, the emphasis string defaults to “?”, which ensures the word cannot be part of any match for the meter regex. The emphasis strings for all the words in the article are concatenated into a space-separated string, and the meter regex is applied to look for matching sequences of emphasis points. The matches are then unpacked to give the original article text.
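The lookup, concatenation and unpacking steps might be sketched in Python as follows. This is an illustration under assumptions, not the tool's Go implementation: `find_fragments` is a hypothetical name, the emphasis strings in `stresses` are made up for the example, and the 5-7-5 regex is built by hand. Each word is tried as a starting point so that overlapping candidates are all found.

```python
import re


def find_fragments(words, stresses, pattern):
    """Return every run of words whose joined emphasis strings match the meter regex."""
    emphases = [stresses.get(w.upper(), "?") for w in words]   # "?" marks an unknown word
    line = " ".join(emphases)

    # Character offset at which each word's emphasis string starts.
    starts, offset = [], 0
    for e in emphases:
        starts.append(offset)
        offset += len(e) + 1

    fragments = []
    for i, start in enumerate(starts):
        m = pattern.match(line, start)          # anchored at this word, so overlaps are kept
        if m:
            n_words = m.group().count(" ") + 1  # one emphasis string per word
            fragments.append(words[i:i + n_words])
    return fragments


# Illustrative emphasis strings, not looked up in the real dictionary.
stresses = {"SCANDAL": "10", "HAS": "1", "FRACTURED": "10", "CONGRESS": "10",
            "INTO": "10", "MYRIAD": "100", "WHIRLPOOLS": "10", "OF": "1",
            "SEETHING": "10", "FACTIONS": "10"}
words = "scandal has fractured Congress into myriad whirlpools of seething factions".split()

# Hand-built 5-7-5 regex: any emphasis per syllable, word breaks forced between blocks.
blocks = [" ?".join(["[012]"] * n) for n in (5, 7, 5)]
pattern = re.compile(r"(?<![^ ])" + " ".join(blocks) + r"(?![^ ])")

for fragment in find_fragments(words, stresses, pattern):
    print(" ".join(fragment))
```

On this ten-word sample the sketch finds two overlapping candidates, one starting at “scandal” and one at “has”. A real run would also have to decide where the three line breaks fall, which can be recovered from the syllable counts.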

The haiku detector lists all the specified articles, and for each one lists the fragments of text which match the haiku meter and do not end with unsuitable words, presented in such a way as to make it easy for the user to visually scan through large numbers of them.

The ‘unsuitable’ haiku are listed at the end, along with all the unrecognised words, such as (at the time of writing) BREWDOG and WHISTLEBLOWING.
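The split into suitable and ‘unsuitable’ candidates could be as simple as the Python sketch below. It assumes the check applies only to the final word of the fragment, and it uses just the two example words given earlier (AND and BUT); the tool's real list is presumably longer. The names `UNSUITABLE_ENDINGS` and `split_by_ending` are hypothetical.

```python
# Only the two examples named in the text; the tool's real list is presumably longer.
UNSUITABLE_ENDINGS = {"AND", "BUT"}


def split_by_ending(fragments, unsuitable=UNSUITABLE_ENDINGS):
    """Partition word lists into (suitable, rejected) according to their last word."""
    suitable, rejected = [], []
    for words in fragments:
        target = rejected if words[-1].upper() in unsuitable else suitable
        target.append(words)
    return suitable, rejected


suitable, rejected = split_by_ending([
    "of seething factions".split(),
    "not only between parties but".split(),
])
print(suitable)
print(rejected)
```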

Numbers, dates, percentages, etc, could be converted into text, and then be candidates to match within a haiku, but that has been left as an exercise for (maybe, but probably not) later.

Choosing haiku

There’s no subtle way of saying it: you have to wade through an awful lot of nonsense to get to the good ones. No concrete stats yet, but a haiku hit rate of 1 in 100 seems about right, i.e. 1 “Hm, maybe” to 100 “No”s.

Returning to John Paul Rathbone’s article, here are all the matching, suitable haiku:

  • week is a supposed
    procedural glitch announced
    on Monday that could

  • the lower house will
    decide any differently
    than it did last month

  • faces four daunting
    challenges although even
    these are not unique

  • rule Mauricio
    Macri Argentina’s new
    president faces

  • they or candidates
    of a similar stature
    can inject a dose

  • private Brazilian
    companies can currently
    pay to their partners

  • Congress including
    the head of the lower house
    Eduardo Cunha

  • scandal has fractured
    Congress into myriad
    whirlpools of seething

  • has fractured Congress
    into myriad whirlpools
    of seething factions

  • into myriad
    whirlpools of seething factions
    not only between

  • whirlpools of seething
    factions not only between
    parties but within

  • of seething factions
    not only between parties
    but within them too

  • here is that in both
    countries most citizens care
    little for party

  • reasonably well
    run and are willing to give
    fresh leaders a chance

  • run and are willing
    to give fresh leaders a chance
    to do that at least

Not shown here are the matching ‘unsuitable’ haiku, of which there are perhaps twice as many.

The selection of haiku from this list of candidates is highly subjective, and done by eye, at speed. No doubt there are still some nice haiku which remain hidden even after this scan.

Categorising haiku

It is early days, but over the course of 3 months we have accumulated approximately 300 haiku which pass the “Hm, maybe” test. There are enough to attempt to categorise them into different types, ranging from the mundane to the really rather profound. Here are some of the (overlapping) categories:

  • Cropping changing meaning
  • Imagery
    • the two main parties
      taking turns to rip themselves
      apart in public

  • Reportage
    • toilet was destroyed
      in a controlled explosion
      by army experts

Publishing haiku

We are taking some small, early steps to establish if there is interest among our readers (existing or prospective) in having these haiku brought to their attention, with a weekly collection being published as an article: ft.com/hidden-haiku. These will be tweeted and posted onto Facebook.

Perhaps we will provide an option for our readers to get all their news in haiku form. You heard it here first.

* Awful poetry

Although it is not used in this haiku project, having the details of the syllable emphasis within each word, plus the final syllable (e.g. K UW0, sounds like “coo”), gives the raw material from which to auto-generate cringingly awful, metered, rhyming poetry. The user can specify other meters, such as iambic pentameter, “0101010101”, and the resulting matches are sorted by final syllable.
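A rough Python sketch of the sort key is below. It is a guess at the idea, not the tool's code: `final_syllable` is a hypothetical name, and the rule of taking the last numbered phoneme plus one preceding consonant is an assumption chosen because it reproduces the “K UW0” example.

```python
def final_syllable(phonemes):
    """Return the tail of a pronunciation used as a crude rhyme key."""
    parts = phonemes.split()
    vowels = [i for i, p in enumerate(parts) if p[-1].isdigit()]
    if not vowels:
        return ""
    start = vowels[-1]
    # Assumption: include one leading consonant, which reproduces the "K UW0" example.
    if start > 0 and not parts[start - 1][-1].isdigit():
        start -= 1
    return " ".join(parts[start:])


pronunciations = {
    "HAIKU": "HH AY1 K UW0",
    "DEFENESTRATION": "D IY0 F EH2 N EH0 S T R EY1 SH AH0 N",
}
print(final_syllable(pronunciations["HAIKU"]))   # prints K UW0
by_rhyme = sorted(pronunciations, key=lambda w: final_syllable(pronunciations[w]))
print(by_rhyme)
```

Sorting candidate lines by the key of their last word puts lines with the same ending next to each other, which is the starting point for picking rhyming pairs.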

The reader is spared some examples here, and may have to wait for a follow-up post.