xenofem's diceware wordlist

2020-07-17 // 1500 words

tldr: I made a dice-based passphrase wordlist where each word has a unique 3-letter prefix, so you only need to type the first 3 letters of each word in your passphrase. You can also get it as a print-and-staple zine, or as a Python script that generates passphrases.

How to use it

  1. Roll four 6-sided dice.
  2. Look up that combination of four numbers in the wordlist – for instance, if you rolled 5, 2, 4, 1, you’d take word 5241: picture.
  3. Repeat this until you have enough words for your use-case[1]. Each word gives slightly over 10 bits of entropy.
  4. Take all your words and make up some sort of mental image or story as a mnemonic for them. For instance, if you rolled 5241 picture, 1445 banana, 2613 erudite, 1435 bacterium, 5351 pumice, maybe you could imagine a picture of a banana speaking eruditely about the bacteria in pumice stones? idk how your brain works, come up with something that’s memorable in the right way for you[2].
  5. String together just the first three letters of each word to get the actual password you type in. In the example above, your password would be picbanerubacpum. It’s just 15 lowercase letters, nice and easy to type out, but it has 51.7 bits of entropy. If a website demands numbers or special characters or whatever, just throw in a 0 or a . at the end or something.
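
If you'd rather see the steps above as code, here's a minimal sketch. It is not the script mentioned at the top: the function names are made up for illustration, and the `WORDLIST_EXCERPT` dict only contains the five example words, where the real list would have all 6^4 = 1296 entries. It also shows where the "slightly over 10 bits" and "51.7 bits" figures come from.

```python
import math
import secrets

# Hypothetical excerpt for illustration: only the five words from the
# example above. The full list has one word per four-dice roll (6**4 = 1296).
WORDLIST_EXCERPT = {
    "5241": "picture",
    "1445": "banana",
    "2613": "erudite",
    "1435": "bacterium",
    "5351": "pumice",
}


def roll_key(num_dice=4):
    """Simulate rolling `num_dice` six-sided dice, giving a key like '5241'."""
    return "".join(str(secrets.randbelow(6) + 1) for _ in range(num_dice))


def typed_password(words, prefix_len=3):
    """Join the first `prefix_len` letters of each word."""
    return "".join(word[:prefix_len] for word in words)


def entropy_bits(num_words, list_size=6**4):
    """Entropy of `num_words` independent, uniform picks from the list."""
    return num_words * math.log2(list_size)


def words_needed(target_bits, list_size=6**4):
    """Smallest number of words that reaches `target_bits` of entropy."""
    return math.ceil(target_bits / math.log2(list_size))


example_rolls = ["5241", "1445", "2613", "1435", "5351"]
example_words = [WORDLIST_EXCERPT[roll] for roll in example_rolls]

print(typed_password(example_words))  # picbanerubacpum
print(round(entropy_bits(5), 1))      # 51.7
print(words_needed(44))               # 5
```

`secrets` is used instead of `random` because it draws from the operating system's cryptographic randomness source. Physical dice work just as well.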

Background

In 2016, the EFF released a new set of wordlists for generating high-entropy random passphrases using dice, attempting to improve on some of the flaws of the classic Diceware wordlist. I was particularly interested in their list of words with unique 3-letter prefixes; one of my biggest obstacles to using Diceware more in practice is that typing 6 whole words into a password prompt is a pain, and there are too many opportunities for random typos even if you remember the actual words correctly. The EFF speculate about a future password prompt with an autocompletion feature based on their list, filling in the whole word after you type in the first three letters; I’m perfectly happy here and now just memorizing a phrase and typing the first three letters of each word as my password. This is a super cool project, and I’m grateful to the EFF for putting it together. However, looking at the list and reading through their process for generating it, there are a number of things I’m unhappy about:

Confusability

The EFF’s word list with unique 3-letter prefixes contains several pairs of words that, in my opinion, are too close to synonyms to be usable as distinct elements of a passphrase mnemonic. For example, it includes both “backpack” and “knapsack”, and both “idiocy” and “imbecile”. I’d like a word list where words are chosen to be easily conceptually distinguishable.

What gets included…

Now that we mention it, I’m not thrilled about “idiocy” and “imbecile” being on the list in the first place. The creators of this list attempted to remove offensive words through a combination of manual review and published word filter lists, but not every reviewer or list curator will have the same standards for what constitutes offensive language. I’m also not fond of “policeman” or “jailhouse”, among other words.

… and what gets left out

In addition to leaving in some not-great stuff, the curated filter lists used by the EFF also unnecessarily exclude some words. One of the offensive word lists they cite, by Luis von Ahn (cw: lots of slurs), includes the words “lesbian” and “gay” as words to filter out (“heterosexual” is also filtered out, I guess), as well as many other innocuous words. There are contexts where it makes some amount of sense to filter strictly like this, but for making a static word list that will already be subjected to manual review (rather than, say, a Twitter chatbot that learns from random users who interact with it), I’d say this filter is kinda unhelpful. I want my passwords to have less cops and more gays :p

Data ethics

A lot of other data went into the EFF’s word lists to reduce the intensive labor of manually picking out suitable words, mostly from Ghent University’s Center for Reading Research. This includes data on how commonly-known various words are, as well as data on the concreteness of various words; the EFF chose to target more concrete words for easier memorization. Unfortunately, Ghent University’s word concreteness data was collected using Amazon’s Mechanical Turk platform. Vast amounts of this kind of disposable underpaid human labor are sadly ubiquitous in the background of countless research projects in computer language processing and image processing, and I wanted to try constructing my word list without relying on data from exploitative sources.

My attempt to do better

With these issues in mind, I put together a wordlist I like better, taking inspiration from the EFF’s list while trying to improve on its flaws. Below, I’ll talk about my process for generating this list, and share resources for anyone interested in building on my work here.

Conceptual distance

The first issue I highlighted with the EFF’s list was words that are too conceptually similar. To avoid this problem, I used ConceptNet Numberbatch, a dataset of word embeddings for use in machine learning projects. As best I can tell, ConceptNet primarily uses data from Wikipedia and voluntary surveys, rather than exploitative sources like MTurk, so I feel more comfortable using this dataset. Each word is associated with a 300-dimensional vector, and I used the distances between these vectors as a measure of how conceptually distinct various words are. As I added new words to my wordlist, I was able to see what other words might be too similar, and remove words that were too close to existing words. This approach had its pluses and minuses, as close conceptual distances didn’t always correlate perfectly with confusability. For instance, the closest pair of words in my finished list according to Numberbatch is “piano” and “violin”, with a distance of 0.64. Sure, they’re both musical instruments, but I’m not too worried about someone losing track of which is which. For the most part, I tried to maintain a spacing of around 0.9 or higher between word vectors, with occasional exceptions for cases like this.
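
To make the idea concrete, here's a rough sketch of a "is this candidate too close to anything already on the list?" check. Heavy caveats: the post doesn't say which distance metric produced the 0.64 and 0.9 figures, so the cosine distance used here is an assumption, and the threshold only means something relative to whichever metric you pick. The 3-dimensional vectors are toy values made up for the example, not real Numberbatch data (those are 300-dimensional), and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity. Assumes both vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbor(candidate_vec, chosen):
    """Return (word, distance) for the closest already-chosen word."""
    return min(
        ((word, cosine_distance(candidate_vec, vec)) for word, vec in chosen.items()),
        key=lambda pair: pair[1],
    )


def is_far_enough(candidate_vec, chosen, min_distance=0.9):
    """True if no chosen word is closer than `min_distance` to the candidate."""
    if not chosen:
        return True
    return nearest_neighbor(candidate_vec, chosen)[1] >= min_distance


# Toy vectors, invented for illustration only.
chosen = {
    "piano": [1.0, 0.0, 0.0],
    "pumice": [0.0, 1.0, 0.0],
}
violin_like = [0.9, 0.1, 0.0]   # points almost the same way as "piano"
unrelated = [0.0, 0.0, 1.0]     # orthogonal to everything chosen so far

print(nearest_neighbor(violin_like, chosen)[0])  # piano
print(is_far_enough(violin_like, chosen))        # False
print(is_far_enough(unrelated, chosen))          # True
```

A check like this only flags candidates for a human to look at. As the piano/violin case shows, a small distance doesn't automatically mean two words are confusable.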

Manual review

While I used the ConceptNet Numberbatch embeddings as a guide, and used them to produce suggestions for words I might want to add that were sufficiently distant from the words I already had, all of the actual decision making about what words to include or exclude was done manually by me. Yes, this took forever, but I couldn’t really find other data sources that would be useful to simplify my search. For a while I tried using Ghent University’s word prevalence data as a source of words that were commonly known, but their corpus just wasn’t big enough and left out a lot of usable words.

Non-goals

The EFF did a few things in creating their word list that I wasn’t interested in reproducing or wasn’t able to reproduce. For instance, their word list with distinct 3-letter prefixes has a minimum edit distance of 3 between any two of its words. This would potentially be useful for typo correction… if I were typing out entire words. Since my intended usage of this list is to memorize a passphrase as a mnemonic but only actually type the first 3 letters of each word, typo resistance really doesn’t seem useful to me. For the same reason, I also didn’t care as much about avoiding words that have confusing or ambiguous spellings, as long as common misspellings or alternate spellings didn’t affect the first 3 letters.
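
The one spelling property the list does depend on is that no two words share their first three letters. Here's a small sketch of how that could be checked. The function name is made up for illustration and this isn't taken from the linked repository.

```python
from collections import defaultdict


def prefix_collisions(words, prefix_len=3):
    """Group words by their first `prefix_len` letters and return only the
    groups with more than one word, i.e. the prefixes that are not unique."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:prefix_len].lower()].append(word)
    return {prefix: ws for prefix, ws in groups.items() if len(ws) > 1}


words = ["picture", "banana", "erudite", "bacterium", "pumice"]
print(prefix_collisions(words))               # {}
print(prefix_collisions(words + ["picnic"]))  # {'pic': ['picture', 'picnic']}
```

An empty result means every 3-letter prefix maps back to exactly one word, which is what lets you type just the prefixes.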

Code

If you’re interested in building on this work, you can find my extremely janky code at https://git.xeno.science/xenofem/diceware

In conclusion

I’ll end this article the same way the EFF ended theirs: Hopefully I’ve made something useful, but there’s plenty of room for more research and experimentation in this area, and I hope people keep exploring!


  1. xkcd #936 suggests 44 bits of entropy (so, 5 words from this list) as a good level for a password for a web service; that recommendation is over a decade old now, but the rate at which you can spam a web server with password attempts doesn’t have quite as much to do with CPU speeds, so maybe that’s still a reasonable baseline? I’ll generally use at least 6 words, more if it’s an encryption password for something sensitive, but I don’t really have much solid data here either.

  2. Since your password will only use the first three letters of each word, you can also substitute in entirely different words that start with the same letters and stick in your brain better. Maybe a picnic banquet for erudite bachelor pumas?
