Breaking the VIC cipher

Cryptography is the art of hiding a message in plain sight. Although I've always thought it cool (and really useful), I never dabbled in it until I recently read Cryptonomicon by Neal Stephenson, an exciting book about hidden treasures, WWII codebreaking, and a hacker outwitting the NSA, all woven together in a compelling way. It reminds me a bit of Mitchell's Cloud Atlas in terms of structure, but I think it is much better executed. Recommended read.

Anyway, getting back to the topic at hand, I thought it might be fun to play around with a simple encryption algorithm, and a friend brought up the VIC cipher, a pen-and-paper cipher once used by spies. The VIC cipher involves turning the letters in your message into a set of numbers by setting up a 3 x 10 table. A letter in the first row is encoded by its column number alone; a letter in one of the other two rows is encoded by its row number followed by its column number. I suppose the four extra cells can correspond to whatever you would like, but it seems that two are typically left empty in the first row (their column numbers serve as the labels of the other two rows) and the remaining two are used for punctuation. The key is then an 8-letter word and the locations of the 4 extra cells (although these are not too hard to find if the key word is known).

VIC cipher checkerboard

  0 1 2 3 4 5 6 7 8 9
  c     o m p u t e r
1 a b d f g h i j / k
2 . l n q s v w x y z

I wrote up some code to encrypt a message that takes a key word and the locations of the extra characters. This code only handles the letters of the alphabet and treats the "/" and "." as fillers. The above checkerboard is from the call

code = vic_cipher.encoder(['-c','computer','-p','12','-m',text,'-s','18,20'])
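The vic_cipher module itself isn't reproduced in this post, so here is a minimal sketch of the checkerboard step it performs. The names build_checkerboard and encode are made up for illustration and are not the module's actual interface; the sketch assumes a key with no repeated letters, lowercase-only input, and the filler positions 18 and 20 from the call above.

```python
import string


def build_checkerboard(key, blanks=(1, 2), extras=None):
    """Return a dict mapping each symbol to its code string.

    Hypothetical helper: `blanks` are the empty top-row columns (they double
    as row labels) and `extras` maps a two-digit code to a filler symbol.
    """
    if extras is None:
        extras = {18: "/", 20: "."}
    table = {}
    # Top row: key letters go into the columns not reserved as row labels.
    top_columns = [col for col in range(10) if col not in blanks]
    for col, letter in zip(top_columns, key):
        table[letter] = str(col)
    # Lower rows: the rest of the alphabet in order, skipping the filler cells.
    rest = iter(ch for ch in string.ascii_lowercase if ch not in key)
    for row in blanks:
        for col in range(10):
            code = 10 * row + col
            symbol = extras[code] if code in extras else next(rest)
            table[symbol] = str(code)
    return table


def encode(text, table):
    """Concatenate the code of every character in `text`."""
    return "".join(table[ch] for ch in text)


board = build_checkerboard("computer")
print(encode("iginlamade", board))  # 161416222110410128
```

Letters in the top row cost one digit and everything else costs two, which is why the most common letters are usually placed in the key word.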

The text is a few paragraphs from Jarome Iginla's Wikipedia page. (Disclaimer: I am not endorsing this player. I don't know anything about hockey, but he happened to be featured on the front page of Wikipedia when I went searching for some text.)

iginlamadehisnhldebutinthestanleycupplayoffsashewassignedtoacontractandflowntocalgaryimmediatelyafterhisjuniorseasonendedinkamloopsheappearedintwogamesfortheflamesintheirseriesagainstthechicagoblackhawksindoingsohebecamethefirstyearoldtoplayfortheflamessincedanquinnininhisfirstnhlgameiginlaassistedonatheorenfleurygoaltorecordhisfirstpointhescoredhisfirstgoalinhissecondgameheremainedwiththeflamesandplayedhisfirstnhlseasoninheearnedaspotonthatyearsnhlallrookieteamandfinishedastherunneruptobryanberardinvotingforthecaldermemorialtrophyasrookieoftheyearafterleadingallfirstyearplayersinscoringwithpointsbyhisthirdseasoniginlaledtheflamesingoalswithhissuccesscomplicatednegotiationsforanewcontractasheandtheflamesstruggledtoagreeonanewdealfollowingtheseasonhopingtohelpresolvethecontractimpasseiginlaagreedtoattendtrainingcampwithoutacontractandpurchasedhisowninsuranceastheteamwouldnothavebeenresponsiblefinanciallyifhesufferedaninjuryheremainedwithoutacontractatthestartoftheseasonandmissedthefirstthreegamesasaholdoutbeforesigningathreeyeardealworthusmillionplusbonuseshefinishedtheyearwithcareerhighsingoalsandpointshethentoppedbothmarksinbyrecordinggoalsandpointsafterparticipatingincanadasolympicsummercampbeforetheseasoniginlaagainsetnewpersonalhighsinwhenheregisteredgoalsandpointsthisseasonelevatediginlatosuperstarstatusheearnedtheartrossandmauricerichardtrophiesasthenhlsleadingpointandgoalscorerrespectivelyhewasalsoawardedthelesterbpearsonawardastheleaguesmostvaluableplayerasvotedbyhispeersandwasanomineeforboththehartmemorialtrophyandthekingclancymemorialtrophytheharttrophyvotingprovedtobecontroversialiginlatiedcanadiensgoaltenderjosthodoreinvotingpointsbutreceivedfewerfirstplacevotesthanthodorehoweveronevoterrumouredtobefromquebecthodoreandthecanadienshomeprovinceinexplicablyleftiginlaoffhisballotasaresultofthecontroversythatfollowedtheprofessionalhockeywritersassociationchangedtherulesonhowitsmembersvotedfortheawardtopreventarecurrencetherewerefearsiginlawouldagainholdoutaft
erhiscontractexpiredfollowingtheseasontheywereunfoundedhoweverashesignedatwoyearmilliondealbeforetheseasonandwaslookedontoagainleadtheflamesoffensivelyiginlafellbacktopointsinasinjuriesincludingalingeringfingerdislocationfollowingafightdiminishedhisplayhisgoalswerestillenoughtoleadtheflamesforthefourthtimeinfiveseasonsdespitehisoffensivecontributionstheflamesmissedtheplayoffs
    

The resulting encrypted message is


    

If you want to decode, you would call

   decode = vic_cipher.decode(['-c','computer','-p','12','-m',code,'-s','18,20'])
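Decoding with a known checkerboard is just a left-to-right read of the digits. The following is a rough sketch rather than the module's own decoder: ROWS hard-codes the checkerboard shown earlier in the post, and decode_digits is a hypothetical name.

```python
# The checkerboard from the post, one string per row; a space marks an empty cell.
ROWS = {"": "c  omputer", "1": "abdfghij/k", "2": ".lnqsvwxyz"}

# Invert it: code string -> symbol.
CODE_TO_SYMBOL = {
    prefix + str(col): symbol
    for prefix, row in ROWS.items()
    for col, symbol in enumerate(row)
    if symbol != " "
}


def decode_digits(digits, prefixes=("1", "2")):
    """Read a digit string left to right, taking two digits after a row label."""
    out = []
    i = 0
    while i < len(digits):
        # A row label (1 or 2) means the code number is two digits wide.
        width = 2 if digits[i] in prefixes else 1
        out.append(CODE_TO_SYMBOL[digits[i:i + width]])
        i += width
    return "".join(out)


print(decode_digits("161416222110410128"))  # iginlamade
```

Because no top-row letter sits in column 1 or 2, the split into code numbers is never ambiguous.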

This looks encrypted to me. I wouldn't be able to read this off, but if you can, maybe you should be working for the NSA. But I'm sure this is easily breakable! Let's make this slightly easier for ourselves and imagine that we know we're working with the VIC cipher, so if we can identify some of the key components, maybe that will be sufficient.

I think the obvious place to start is to look at the frequency distribution of the digits. We know that there are two prefix digits for the last two rows of the VIC checkerboard, so those should be pretty easy to identify. Indeed, just from counting frequencies, 1 and 2 are clearly important. Let's take these as our prefixes, and now we can identify separate numbers in the code. Our code broken up into numbers is then "16 14 16 22 21 10 4 10..."

Frequency of digits in raw code that hasn't been separated into separate code numbers.

Thus, we look at the frequency of the different numbers...

Frequencies of numbers in the code.
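The two counting steps can be sketched in a few lines. This is an illustration under assumptions, not the code behind the plots: split_codes is a made-up helper, the sample is only the first ten code numbers of the message, and taking the two most common digits as the prefixes is a guess that happens to work here.

```python
from collections import Counter


def split_codes(digits, prefixes):
    """Cut a digit string into code numbers, two digits wide after a prefix."""
    tokens = []
    i = 0
    while i < len(digits):
        width = 2 if digits[i] in prefixes else 1
        tokens.append(digits[i:i + width])
        i += width
    return tokens


sample = "161416222110410128"

# Step 1: count raw digits and guess that the two most common are the prefixes.
digit_counts = Counter(sample)
prefixes = {digit for digit, _ in digit_counts.most_common(2)}
print(sorted(prefixes))  # ['1', '2']

# Step 2: split into code numbers and count those instead.
tokens = split_codes(sample, prefixes)
print(tokens)  # ['16', '14', '16', '22', '21', '10', '4', '10', '12', '8']
token_counts = Counter(tokens)
```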

What might be useful here is to know how frequently different letters in the alphabet appear. Let's go to Project Gutenberg and take Joyce's Ulysses. I'm not quite sure how representative of Wikipedia English Ulysses is, but who said we knew what the source of this text was? I don't start right at the beginning since I noticed a line of Latin, so the text I scraped consists of many lines including and after these:

Solemnly he came forward and mounted the round gunrest. He faced about
and blessed gravely thrice the tower, the surrounding land and the
awaking mountains. Then, catching sight of Stephen Dedalus, he bent
towards him and made rapid crosses in the air, gurgling in his throat
and shaking his head. Stephen Dedalus, displeased and sleepy, leaned
his arms on the top of the staircase and looked coldly at the shaking
gurgling face that blessed him, equine in its length, and at the light
untonsured hair, grained and hued like pale oak.

Buck Mulligan peeped an instant under the mirror and then covered the
bowl smartly.

--Back to barracks! he said sternly.

He added in a preacher's tone:

--For this, O dearly beloved, is the genuine Christine: body and soul
and blood and ouns. Slow music, please. Shut your eyes, gents. One
moment. A little trouble about those white corpuscles. Silence, all.
    

The corresponding letter frequencies are below:

Frequency of letters in Joyce's Ulysses taken to be representative of the English language.

Obviously, "e" is the most common letter, and the frequencies of t, a, o, i, and n that follow differ from one another by only about a percentage point. It makes sense to match up "8" with "e", and we can be pretty confident about that replacement. Since our text isn't that long, it's not clear that we would be able to resolve a difference of about a percent between frequencies in order to match up the following letters with the next most frequent code numbers. But I might as well demonstrate that this doesn't work by simply lining up the code numbers with the corresponding letters by frequency. The resulting text is mostly gibberish.

amasltctderansrldebfoasorenotslepgfwwltpiuuntnreytnnamsedoitgisohtgotsduliysoigtlmthpaccedatoelptuoehranjfsaihnetnisesdedasvtcliiwnretwwethedasoyimtcenuihoreultcenasoreahnehaentmtasnooregragtmibltgvrtyvnasdiasmnirebegtceoreuahnopethildoiwltpuihoreultcennasgedtsxfassasasranuahnosrlmtceamaslttnnanoedistoreihesulefhpmitloihegihdranuahnowiasorengihedranuahnomitlasrannegisdmtcerehectasedyaororeultcentsdwltpedranuahnosrlnetnisasreethsedtnwioisortopethnsrltllhiivaeoetctsduasanredtnorehfssehfwoibhptsbehthdaskioasmuihoregtldehcecihatlohiwrptnhiivaeiuorepethtuoehletdasmtlluahnopethwltpehnasngihasmyaorwiasonbpranorahdnetnisamasltledoreultcenasmitlnyaorrannfggenngicwlagtoedsemioatoaisnuihtseygisohtgotnretsdoreultcennohfmmledoitmheeistseydetluilliyasmorenetnisriwasmoirelwhenilkeoregisohtgoacwtnneamaslttmheedoitooesdohtasasmgtcwyaorifotgisohtgotsdwfhgrtnedraniysasnfhtsgetnoreoetcyifldsiortkebeeshenwisnableuastsgatllpaurenfuuehedtsasjfhprehectasedyaorifotgisohtgotoorenothoiuorenetnistsdcannedoreuahnoorheemtcentntrildifobeuihenamsasmtorheepethdetlyihorfncallaiswlfnbisfnenreuasanredorepethyaorgtheehramrnasmitlntsdwiasonreoresoiwwedbiorcthvnasbphegihdasmmitlntsdwiasontuoehwthoagawtoasmasgtstdtnilpcwagnfccehgtcwbeuiheorenetnisamaslttmtasneoseywehnistlramrnasyresrehemanoehedmitlntsdwiasonorannetniselektoedamasltoinfwehnothnotofnreethsedorethohinntsdctfhagehagrthdohiwraentnoresrlnletdasmwiasotsdmitlngihehhenwegoakelpreytntlnitythdedorelenoehbwethnistythdtnoreletmfencinoktlftblewltpehtnkioedbpranweehntsdytntsicaseeuihbiororerthocecihatlohiwrptsdorevasmgltsgpcecihatlohiwrporerthoohiwrpkioasmwhikedoibegisohikehnatlamasltoaedgtstdaesnmitloesdehjinoridiheaskioasmwiasonbfohegeakedueyehuahnowltgekioenortsoridiheriyekehisekioehhfcifhedoibeuhicxfebegoridihetsdoregtstdaesnricewhikasgeaseqwlagtblpleuoamasltiuuranbtlliotnthenfloiuoregisohikehnportouilliyedorewhiuennaistlrigvepyhaoehntnnigatoaisgrtsmedorehflenisriyaoncecbehnkioeduihoretythdoiwhekesothegfhhesgeoreheyeheuethnamasltyifldtmtasrildifotuo
ehrangisohtgoeqwaheduilliyasmorenetnisorepyehefsuifsdedriyekehtnrenamsedtoyipethcallaisdetlbeuiheorenetnistsdytnliivedisoitmtasletdoreultceniuuesnakelpamasltuellbtgvoiwiasonastnasjfhaenasglfdasmtlasmehasmuasmehdanligtoaisuilliyasmtuamrodacasanredranwltpranmitlnyehenoallesifmroiletdoreultcenuihoreuifhoroaceasuakenetnisndenwaoeraniuuesnakegisohabfoaisnoreultcencannedorewltpiuun
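The naive rank matching that produced the text above can be sketched as follows. rank_match is a hypothetical name and the tiny inputs are invented so the result is easy to check by hand; on real text, ties and near-ties in the counts are exactly what makes this approach unreliable.

```python
from collections import Counter


def rank_match(tokens, reference_text):
    """Pair the n-th most frequent code number with the n-th most frequent letter."""
    letters = [ch for ch in reference_text.lower() if ch.isalpha()]
    letter_rank = [ch for ch, _ in Counter(letters).most_common()]
    token_rank = [tok for tok, _ in Counter(tokens).most_common()]
    # zip stops at the shorter list, so unmatched code numbers are simply left out.
    return dict(zip(token_rank, letter_rank))


tokens = "8 7 8 10 8 7".split()
mapping = rank_match(tokens, "eee tt a")
print(mapping)  # {'8': 'e', '7': 't', '10': 'a'}
print("".join(mapping[tok] for tok in tokens))  # eteaet
```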
    

There are some sensible combinations inside here, but nothing regular that pops out immediately from skimming it with the eye. I think we can afford to be conservative for now and claim this small victory of identifying the e's. I haven't tried this, but it might also be a good idea to match up the least frequent letters. The position of those relative to other letters is quite informative and probably makes it easier to tell whether you have them mislabeled or not. Frequent letters like vowels, however, tend to be promiscuous and harder to position.

Talking about positions, why don't we look at the simplest relation between letters, pairs of letters?

Frequency of pairs of letters in alphabetical order in Ulysses.

The most frequent pair combinations are "he", "th", "in", "er", "an", "es", "re", "st", "nd", "ed". Some of the least frequent pairs (of the ones that appear at all) are "zw", "qr", "mx", "jg", "qy", "zq", "zj", "cx", "kz", "jn". Remember that some of these less frequent combinations are not happening in words but between them. Some of the weird combinations are also coming from names like "Jno. Henry Menton". (Maybe of interest: Stephens and Bialek (2010) have a paper on the "Statistical mechanics of letters in words.")

This is a lot of information, but we will have even less resolution on the relative frequencies of pairs of letters than we had for single letters. Our text is nearly 2400 characters in length, so if the pairs were independent, we would expect the error bar on a pair that appears ~2% of the time to be √((0.02 × 0.98)/2400) ≈ 0.003, or roughly 14% of the frequency itself (such a pair is only expected to show up about 2400 × 0.02 = 48 times). Maybe we can do better. One thing I tried for a little bit is to look only at pairs of letters preceding "e"; after all, those should be easier to identify. The most frequent pair combinations preceding "e" are "th", "sh", "st", "on", "om", "at", "ar", "av", "nc", and "nd". Some of the least frequent pairs (of the ones that appear at all) are "ii", "ij", "lz", "jb", "pj", "jd", "je", "ve", "jg", and "jh". The last one "jh" happens because there are characters with initials "J. H."
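Both kinds of pair counting described above fit in a few lines. This is a sketch with made-up helper names (pair_counts, pairs_before); it works on a string of letters or on a list of code numbers alike, and the short inputs are invented for illustration.

```python
from collections import Counter


def pair_counts(seq):
    """Count adjacent pairs in a string of letters or a list of code numbers."""
    return Counter(zip(seq, seq[1:]))


def pairs_before(seq, target):
    """Count the pairs that sit immediately before each occurrence of `target`."""
    return Counter(
        (a, b) for a, b, c in zip(seq, seq[1:], seq[2:]) if c == target
    )


letters = "thetheshe"
print(pair_counts(letters).most_common(2))  # [(('h', 'e'), 3), (('t', 'h'), 2)]
print(pairs_before(letters, "e").most_common(1))  # [(('t', 'h'), 2)]

# The same text as code numbers, using 7, 15, 8 and 24 for t, h, e and s.
codes = "7 15 8 7 15 8 24 15 8".split()
print(pairs_before(codes, "8").most_common(1))  # [(('7', '15'), 2)]
```

Running pairs_before on the reference text with "e" and on the code numbers with "8" gives two ranked lists that can then be compared from the top.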

One thing to try again is to align the most frequent pair of letters preceding "e" with the most frequent pair of numbers. The most frequent coding pair preceding "e" is (7, 15) whereas the most frequent letters from English are "th", so let's make the guess 7→t and 15→h. If you're really good, this might be enough to figure out the message. But I'm going to do one more thing, which let me figure out the rest just by looking at it. Instead of the complicated strategy of looking at all pairs, why don't we just look at pairs of repeated letters? After all, there are only 26 of them.

Repeated pairs of letters in Ulysses.
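Counting doubled symbols is the simplest of the pair statistics. A minimal sketch, with doubled as a made-up name and the code numbers below spelling "allrookie" under the checkerboard from earlier in the post:

```python
from collections import Counter


def doubled(seq):
    """Count symbols that appear twice in a row (overlapping runs count each step)."""
    return Counter(a for a, b in zip(seq, seq[1:]) if a == b)


print(doubled("balloon"))  # Counter({'l': 1, 'o': 1})

codes = "10 21 21 9 3 3 19 16 8".split()
print(doubled(codes))  # Counter({'21': 1, '3': 1})
```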

Since there are only 26, we should be able to plug and play. If plugging in a guess leads to a combination of letters that seems really improbable, we've probably made a mistake. Since the most common pair is "ll", I plugged "l" in for "24" and a quick reading turned up the sequence "thelt", which doesn't correspond to any words I know (sure, maybe "lt" is short for "lieutenant" but it's not the most likely thing to turn up). The letter of the next most frequent pair, "t", is already taken, so I tried "s" and found no contradictions. After some probing, 21→l seems to work. That quickly led me to "i", "w", "n", "d", etc. The dominoes kept cascading. You can see the sequence I explored in the code below:

    # Code numbers in the order I explored them, paired with the letter tried for each.
    nextMessage = replace_letters(message.split(),['24','21','16','26','22','12','5','3','10','28','0','6','9'],
                                  ['s','l','i','w','n','d','p','o','a','y','c','u','r'])
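The replace_letters helper is not shown in the post. A minimal sketch of what such a helper might look like, assuming it takes the list of code numbers plus two parallel lists and leaves unknown code numbers as digits (which is how the partially decrypted message reads):

```python
def replace_letters(tokens, codes, letters):
    """Hypothetical stand-in: swap each known code number for its guessed letter.

    Code numbers without a guess are left as they are, so the gaps stay visible.
    """
    mapping = dict(zip(codes, letters))
    return "".join(mapping.get(tok, tok) for tok in tokens)


message = "16 14 16 22 21 10 4 10"
partial = replace_letters(message.split(), ["16", "22", "21", "10"], ["i", "n", "l", "a"])
print(partial)  # i14inla4a
```

Leaving the unresolved numbers in place is what makes the remaining guesses easy: "i14inla4ade" only needs 14→g and 4→m to read as a name followed by "made".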
    

The partially decrypted message is

i14inla4adehisnhlde11utinthestanleycupplayo1313sashewassi14nedtoacontractand13lowntocal14aryi44ediatelya13terhis17uniorseasonendedin19a4loopsheappearedintwo14a4es13orthe13la4esintheirseriesa14ainstthechica14o11lac19haw19sindoin14sohe11eca4ethe13irstyearoldtoplay13orthe13la4essincedan23uinnininhis13irstnhl14a4ei14inlaassistedonatheoren13leury14oaltorecordhis13irstpointhescoredhis13irst14oalinhissecond14a4ehere4ainedwiththe13la4esandplayedhis13irstnhlseasoninheearnedaspotonthatyearsnhlallroo19ietea4and13inishedastherunnerupto11ryan11erardin25otin1413orthecalder4e4orialtrophyasroo19ieo13theyeara13terleadin14all13irstyearplayersinscorin14withpoints11yhisthirdseasoni14inlaledthe13la4esin14oalswithhissuccessco4plicatedne14otiations13oranewcontractasheandthe13la4esstru1414ledtoa14reeonanewdeal13ollowin14theseasonhopin14tohelpresol25ethecontracti4passei14inlaa14reedtoattendtrainin14ca4pwithoutacontractandpurchasedhisowninsuranceasthetea4wouldnotha25e11eenresponsi11le13inanciallyi13hesu1313eredanin17uryhere4ainedwithoutacontractatthestarto13theseasonand4issedthe13irstthree14a4esasaholdout11e13oresi14nin14athreeyeardealworthus4illionplus11onuseshe13inishedtheyearwithcareerhi14hsin14oalsandpointshethentopped11oth4ar19sin11yrecordin1414oalsandpointsa13terparticipatin14incanadasoly4picsu44erca4p11e13oretheseasoni14inlaa14ainsetnewpersonalhi14hsinwhenhere14istered14oalsandpointsthisseasonele25atedi14inlatosuperstarstatusheearnedtheartrossand4auricerichardtrophiesasthenhlsleadin14pointand14oalscorerrespecti25elyhewasalsoawardedthelester11pearsonawardasthelea14ues4ost25alua11leplayeras25oted11yhispeersandwasano4inee13or11oththehart4e4orialtrophyandthe19in14clancy4e4orialtrophytheharttrophy25otin14pro25edto11econtro25ersiali14inlatiedcanadiens14oaltender17osthodorein25otin14points11utrecei25ed13ewer13irstplace25otesthanthodorehowe25erone25oterru4ouredto11e13ro423ue11ecthodoreandthecanadiensho4epro25inceine27plica11lyle13ti14inlao1313his11allotasaresulto13thecontro25ersythat13ollowed
thepro13essionalhoc19eywritersassociationchan14edtherulesonhowits4e411ers25oted13ortheawardtopre25entarecurrencetherewere13earsi14inlawoulda14ainholdouta13terhiscontracte27pired13ollowin14theseasontheywereun13oundedhowe25erashesi14nedatwoyear4illiondeal11e13oretheseasonandwasloo19edontoa14ainleadthe13la4eso1313ensi25elyi14inla13ell11ac19topointsinasin17uriesincludin14alin14erin1413in14erdislocation13ollowin14a13i14htdi4inishedhisplayhis14oalswerestillenou14htoleadthe13la4es13orthe13ourthti4ein13i25eseasonsdespitehiso1313ensi25econtri11utionsthe13la4es4issedtheplayo1313s
    

That was quick once we had the proper crib, a strategy that is well known to any codebreaker. I'm not sure what the etymology of the word "crib" is, but it reminds me of getting something "rocking" gently until the small pushes add up and cause the cipher to rock violently and blow apart. The hard part was identifying the first few frequent letters, and once that was done, we could easily break the rest of the code.

The clear lesson is that one needs to hide the obvious statistical signals to fend off codebreakers. One can easily make the VIC cipher much tougher by adding a few other steps before or after the encryption, as is detailed on the Wikipedia page.

One day, I will properly set up a GitHub profile, but in the meantime you can use this to play on your own (although it's really not that much trouble to code one up yourself).
