Posted April 24, 2014
One of my current projects at the C4 group is to probe the dynamics of conflict behavior in pigtailed macaques, with the hope that signatures in these data might hint at some of the important principles underlying conflict in social systems. We have found some interesting results already, but have been searching around to see whether these findings generalize. One data set that I thought might be particularly relevant for comparison is the large compilation of events from the Afghanistan and Iraq wars that has been released by Wikileaks. I thought it might be fun to discuss some of the details and difficulties of working with a data set like this; mostly I'll be writing about the Iraq data.
I had thought that with all the widespread reporting on these data, they would be easy to find online, but it turned out to be a little more difficult than anticipated. First, there seems to be no reputable website that hosts downloads of these data. Obviously, Wikileaks has the data, but not in any easily accessible format. One could crawl their website, but this seemed to be much more effort than just getting the database in raw format. I would think that news agencies and other groups like human rights organizations have these data somewhere, but they seem to avoid posting them online. In the end, I found the Afghanistan War Diary (AWD) on archive.org. The Iraq War Logs (IWL) were there too, and also hosted on a file server, but those formats turned out to be very cumbersome (a separate HTML page for each event, of which there are nearly 700,000). Instead, I found a CSV file on BitTorrent which was both easier to work with and more complete.
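For what it's worth, a CSV of this size is easy to load with Python's standard csv module (pandas would also work). Here's a minimal sketch using an in-memory sample; the column names are my own guesses and will differ from the real file's header.

```python
import csv
import io

# Hypothetical sample mimicking the IWL CSV layout; the real header
# and column names are assumptions for illustration.
sample = io.StringIO(
    '"report_key","date","category","summary"\n'
    '"A1","2008-03-06","IED Explosion","AT 0635C TF %%% REPORTED AN IED DETONATED..."\n'
    '"A2","2008-04-12","Direct Fire","PATROL ENGAGED WITH SAF..."\n'
)

# DictReader maps each row to a dict keyed by the header fields
reader = csv.DictReader(sample)
events = list(reader)

print(len(events))             # number of parsed events
print(events[0]["category"])   # -> IED Explosion
```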
Obviously, this poses some difficult questions about authenticity, but I'm not sure what checks one could do besides comparing the two different data sets and checking a sample of events against news reports. Another limitation is that the IWL only report SIGACTS, which are "significant actions" determined by some arbitrary criterion. In the end, what I did was to compare a small sample of events against the Wikileaks data by hand.
I think the most interesting part of looking at these data is that you get a first-hand account of the action on the ground. Most of us have heard about the war through the news, which rarely conveys the thoughts and observations of the soldiers. Although these accounts are by no means diary entries in the usual sense, they tell you much more about what a day in this war is like. Here's an example of an entry from the data set.
WHEN:%%% WHERE: Ninewah Province, East Mosul, Rte %%% WHAT: %%% Grenade Attack - Confirmed (//-%%%) - Effective HOW: At 1325hrs, //-%%% was attacked with a grenade at %%% while conducting movement back to FOB %%% resulting in %%% x CF WIA and %%% x MRAP with minor damages. The TC received minor burns to the right side of his face and the %%% ringing in his ears. //-%%% immediately proceeded to the CSH following the attack to evaluate the injuries. Update: At 1345hrs, //-%%% reported that a LN jumped out in front of a parked bus, threw a %%% grenade at the first vehicle in the convoy, and escaped back behind the bus and into a crowd. Update: At 1545hrs CDR//-%%% reported that the TC received %%% degree minor burns to the right side of his face as well as a concussion. The %%% a concussion. Both soldiers RTD. S2 Assessment: This attack marks the %%% attack in the past %%% days with a total of %%% grenades thrown. A recent report indicates that a shipment of %%% grenades was changing hands in Mosul. RTE %%% continues to be a key engagement area for grenade attacks. This may be due to the ease of exit provided by the route. %%% elements are able to flee the scene of the incident and travel by car straight into old town Mosul where it is most difficult to track someone. EOD Assessment: %%% x %%% BDA: %%% x CF WIA Minor Wounds and %%% x MRAP Minor Damage
I find it riveting to read a direct account of this action. Knowing that the filter between the actual event and me is minimal makes it much more compelling. Here's another account, about a suicide bomber.
HOW: %%% manager, confirmed by -/%%% IA BN XO, reports that a suicide bomber detonated a suicide vest in %%%, killing 3x local nationals and himself. The suicide bomber was targeting (%%%) while he was meeting with %%% and his son %%% at their truck.
Immediately, we can see that the entries have been redacted, as signaled by a "%%%". Even this is not consistent, since the marker is sometimes preceded by a "//-", and this inconsistency is only one of many quirks that make these data much harder to process. Examples of shorthand include "x" for a count ("3x" means three of something), "CF" (coalition forces), "WIA" (wounded in action), and the more obscure "TCP" (traffic control point). One can find many of these clarified on the Wikileaks website.
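Before any tokenization, it helps to collapse the redaction-marker variants into a single placeholder. A minimal sketch (the exact set of variants handled here is an assumption):

```python
import re

def normalize(summary: str) -> str:
    """Collapse IWL redaction markers into one placeholder token.

    Summaries mark redactions with "%%%", sometimes preceded by "//-";
    treating all variants as a single REDACTED token simplifies later
    tokenization. This is a sketch -- the real logs may contain other
    marker variants not handled here.
    """
    # "//-%%%" or bare "%%%" both become a single placeholder token
    summary = re.sub(r"(//-)?%{3}", "REDACTED", summary)
    # collapse any repeated whitespace left behind
    return re.sub(r"\s+", " ", summary).strip()

print(normalize("At 1325hrs, //-%%% was attacked with a grenade at %%%"))
# -> At 1325hrs, REDACTED was attacked with a grenade at REDACTED
```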
As I mentioned earlier, there are nearly 700,000 events in the IWL that I have access to, and so there's no way one person could read through all these entries and really understand everything that is happening. To do this, I rely on some old-fashioned (or is it new-fashioned?) data analytics.
Nearly every event in these data has been labeled with an event category. Most of the events fall under a very few categories as I show below. We can see that the most reported SIGACT event relates to IEDs. The second most frequent event has to do with "fire" from weapons. Does this mean that these types of events are the most frequent SIGACT events, or does it mean that these categories are just so broad that they serve as a catch-all?
Frequency of appearance of each category in data
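Tallying the category labels is a one-liner with Python's Counter; the labels below are a hypothetical stand-in for the real category column of the ~700,000 events:

```python
from collections import Counter

# Hypothetical per-event category labels, standing in for the real
# "category" column of the IWL events.
categories = [
    "IED Explosion", "Direct Fire", "IED Explosion", "Indirect Fire",
    "IED Explosion", "Direct Fire", "Murder",
]

# Counter tallies occurrences; most_common() sorts by frequency
counts = Counter(categories)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```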
It's not clear what the answer is, but we could try distinguishing these different categories by the sorts of descriptions that they have (like the summaries I was showing above).
Since meticulous manual curation is out of the question, I turn to a tool developed by Google researcher Tomas Mikolov for learning semantic associations between words. This tool, called word2vec, is a neat bag of tricks (pun intended). Most basically, what it does is look at where a word is placed relative to the words around it in order to associate that word with some high-dimensional vector. A vector is a list of numbers, and each number here corresponds to some semantic dimension, e.g. the first dimension could be valence (is it a positive or negative word?) or it could be whether the word refers to an object or a person. Of course, I just made these up as examples, but that's the basic idea. It turns out that this space is organized intuitively with respect to our understanding of words, because simple vector operations lead to neat relationships between words. For example, we can compute "Paris" - "France" + "Italy" and get back "Rome". In a sense, the model has learned that Paris relates to France in the same way that Rome relates to Italy, which is a wonderful example! So, by simple addition and subtraction, we can represent analogies...weird. There are many such analogies one can test, and that is how Mikolov et al. evaluated the algorithm--a neat result!
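The vector-offset idea behind these analogies can be demonstrated without any training at all. Below is a toy sketch with hand-made three-dimensional vectors chosen so that the capital/country offset is consistent; real word2vec vectors have hundreds of dimensions learned from context, but the arithmetic is the same.

```python
import numpy as np

# Hand-made 3-D "embeddings" in which capital = country + (0, 1, 0);
# these numbers are invented purely to illustrate the vector arithmetic.
vectors = {
    "paris":   np.array([1.0, 1.0, 0.0]),
    "france":  np.array([1.0, 0.0, 0.0]),
    "rome":    np.array([0.0, 1.0, 1.0]),
    "italy":   np.array([0.0, 0.0, 1.0]),
    "berlin":  np.array([0.5, 1.0, 0.5]),
    "germany": np.array([0.5, 0.0, 0.5]),
}

def most_similar(query, vocab):
    """Return the word whose vector is closest to the query in cosine
    similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(query, vocab[w]))

# "paris" - "france" + "italy" should land near "rome"
query = vectors["paris"] - vectors["france"] + vectors["italy"]
# exclude the query words themselves, as word2vec's demo does
candidates = {w: v for w, v in vectors.items()
              if w not in ("paris", "france", "italy")}
print(most_similar(query, candidates))  # -> rome
```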
Returning to the problem at hand, we can train this algorithm on our data and see which words are similar to one another, and the results seem to make sense. If we put in "saf" (small arms fire), we get out
Word: saf  Position in vocabulary: 41

Word                  Cosine distance
------------------------------------------------------------------------
safrpg                0.730477
rpgsaf                0.669158
idfsaf                0.610775
safidf                0.560729
iedrpg                0.536905
sporadic              0.534531
idf                   0.532000
mortarsaf             0.500157
rpgssaf               0.498307
precision             0.494308
originating           0.494009
unknown-caliber       0.492614
iedrpgsaf             0.487029
small_arms            0.479189
heavy_volumes         0.477986
automatic_gunfire     0.477143
rpg-saf               0.474320
heavy_volume          0.472217
hardened_tcp          0.471142
Or if we put in "crashed":
Word: crashed  Position in vocabulary: 3514

Word                  Cosine distance
------------------------------------------------------------------------
uav_crashed           0.670964
engine_failure        0.646415
scan_eagle            0.626466
chute                 0.574162
uav                   0.572886
hard_landing          0.572373
takeoff               0.552746
take-off              0.550813
epic                  0.550638
uav_epic              0.542496
downed_uav            0.540679
lost                  0.525179
crashing              0.512446
And "children":

Word: children  Position in vocabulary: 1150

Word                  Cosine distance
------------------------------------------------------------------------
women                 0.750357
females               0.745454
adults                0.708191
child                 0.672946
kids                  0.629073
adult                 0.625351
adult_females         0.611993
adult_male            0.605669
woman                 0.594059
males                 0.588210
infants               0.587606
Although not everything makes sense:
Word: the  Position in vocabulary: 2

Word                  Cosine distance
------------------------------------------------------------------------
and                   0.512416
injuriesdamage_cm     0.498597
passive_measures      0.493837
ramp_measures         0.482025
passenger-side_flank  0.477169
behave_suspiciously   0.471563
excited               0.471364
extremely_agitated    0.468746
immediately           0.462687
baghdadmosul          0.457104
which                 0.456538
We might expect article words like "the" and "and" to be most similar to other articles, but from reading the summaries we can see that such filler words are not actually that frequent there. Instead, other words might be serving similar filler roles.
Now that we know which words are similar to one another, we can cluster them into similar groups and see what classes of words the descriptions tend to use. Hopefully, different categories of SIGACTS use different kinds of words. As one view, we can look at how different pairs of summaries from two different categories are from one another, as below:
Along each axis, we show a sample of 1,000 summaries from two different categories. When compared within a category (top left and bottom right blocks), the summaries are more similar to one another than when compared across categories (top right and bottom left). However, there is a fair amount of heterogeneity within each block, part of which comes from the fact that some summaries are much shorter than others. The distances shown are the squared sine of the angle between summary vectors.
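To spell out the distance measure: sin²θ = 1 − cos²θ between two summary vectors, so parallel vectors get distance 0 and orthogonal ones get distance 1. A minimal sketch with made-up vectors:

```python
import numpy as np

def sin2_distance(u, v):
    """Squared sine of the angle between two summary vectors:
    sin^2(theta) = 1 - cos^2(theta)."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - cos ** 2

# Hypothetical word-class count vectors for three summaries.
a = np.array([3.0, 0.0, 1.0])
b = np.array([6.0, 0.0, 2.0])   # same direction as a
c = np.array([0.0, 4.0, 0.0])   # orthogonal to a

print(sin2_distance(a, b))  # parallel: essentially 0
print(sin2_distance(a, c))  # orthogonal: exactly 1.0
```

Note that the measure only cares about the direction of the vectors, not their length, which partly (but, as the figure shows, not fully) compensates for summaries of very different lengths.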
If we take this method as a reasonable way of comparing the summaries, we should expect to see this pattern throughout the different categories. For each of the top ten categories, I show the minimum distance between pairs of summaries at the top and the average distances at the bottom. So, focusing on the bottom, the blue points are the average distance between summaries in "IED explosion". The red points show, for each category on the x-axis, the average distance between summaries in that category. The black dots show the average distance between "IED explosion" and the relevant category on the x-axis. What one would hope for is that the average distance between categories (black) is higher than the average distance within categories (blue and red). This does seem to hold for most categories, but not within one standard deviation (error bars).
Pairwise distances between top ten categories
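The within- versus between-category averages can be computed directly from a pairwise distance matrix. A sketch with a hypothetical matrix for four summaries, two each from made-up categories A and B:

```python
import numpy as np

# Hypothetical symmetric distance matrix for 4 summaries; the first two
# belong to category A and the last two to category B.
D = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])
cat = np.array(["A", "A", "B", "B"])

def mean_distance(c1, c2):
    """Average pairwise distance between summaries of category c1 and
    category c2, excluding self-distances when c1 == c2."""
    block = D[np.ix_(cat == c1, cat == c2)]
    if c1 == c2:
        n = block.shape[0]
        return block.sum() / (n * (n - 1))   # mean over off-diagonal entries
    return block.mean()

print(mean_distance("A", "A"))   # intra-category average
print(mean_distance("A", "B"))   # inter-category average
```

With numbers like these, the inter-category average exceeds the intra-category one, which is the pattern one hopes to see in the real data.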
Several things to note here. One is that the intra-category distance is largest for "Murder". If anything, our worry about "IED explosion" and "Direct fire" being catch-all categories finds no clear support: most categories have comparable distances within the group. Also, larger intra-category distance seems to go with larger inter-category distance. This means that categories that are more diverse are not just spreading out into the space of other categories, but actually spreading out into an entirely different space of their own.
It might be worthwhile to take a moment here to ask why a simpler method of classifying summaries, like word frequency comparison, might not be better. Part of the reason is that there are many possible words in use, so a straightforward counting of words may not yield a very substantive result. Another reason is that context matters for words! It is important here to consider whether words are often used in conjunction with others or predominantly in a certain context. In fact, this mutual interdependence between words allows us to figure out whether words that do not appear in a summary could nevertheless "belong" there. By classifying words into just a few groups, we've reduced the apparent dimensions to the effective ones and hopefully made the problem easier. Of course, I should mention the caveat that this is all preliminary, so none of these statements are completely rigorous.
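The word-grouping step can be sketched as a bare-bones k-means over the word vectors. Everything below (the words, the 2-D coordinates, the initialization) is invented for illustration; the real vectors are high-dimensional.

```python
import numpy as np

# Hypothetical 2-D word vectors with two obvious groups, standing in for
# the high-dimensional word2vec vectors learned from the summaries.
words = ["ied", "vbied", "detonated", "crashed", "uav", "takeoff"]
X = np.array([
    [1.0, 0.1], [0.9, 0.0], [1.1, 0.2],   # explosion-flavored words
    [0.0, 1.0], [0.1, 0.9], [0.2, 1.1],   # aviation-flavored words
])

def kmeans(X, init, n_iter=20):
    """Bare-bones k-means: assign each vector to its nearest centroid,
    then move each centroid to the mean of its members."""
    centroids = X[init].astype(float)
    for _ in range(n_iter):
        # distance from every point to every centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Seed one centroid in each apparent group (indices 0 and 3).
labels = kmeans(X, init=[0, 3])
for word, label in zip(words, labels):
    print(word, label)
```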
Continuing with these comparisons, we might suspect that wide variation within a category has to do with variation in the lengths of summaries, misclassification, SIGACTS that involve multiple types of encounters, etc. It does turn out that many short summaries have no words in a certain category.
Number of empty categories against the length of the summary. I've added a bit of random noise to make it easier to pick out the points in the categories. If you look closely, you can see that there are two summaries with only entries in one category, but these are not just the shortest summaries. However, we can trace the leftmost boundary and see that summaries of a certain length must have words in multiple categories.
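Counting how many word classes a summary leaves empty is straightforward once each word has been assigned to a class. A sketch with a hypothetical class assignment:

```python
# Hypothetical word-class assignments (from the clustering step) and two
# tokenized summaries; all names here are invented for illustration.
word_class = {
    "ied": 0, "detonated": 0, "vbied": 0,
    "patrol": 1, "convoy": 1,
    "wia": 2, "kia": 2,
}
n_classes = 3

def empty_classes(tokens):
    """Number of word classes a summary has no words in."""
    present = {word_class[t] for t in tokens if t in word_class}
    return n_classes - len(present)

short = ["ied", "detonated"]                   # touches only class 0
long_ = ["ied", "detonated", "patrol", "wia"]  # touches all three classes
print(empty_classes(short), empty_classes(long_))  # -> 2 0
```

Plotting this count against summary length for every summary gives the scatter described above.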
But the most worrying problem would be if some of these categories are not different from each other at all. As a check on this, we can pull out the cluster of most similar summaries in every category and see whether these are more similar to one another and less similar to other categories. Note that these two criteria are not mutually exclusive. By definition, they have to be more similar to one another, but it would be bad if they were also more similar to other categories. The outcome of this test is below:
It does look like the stereotypical summaries (defined as the ones most similar to one another) for each category are different from the others. This is reassuring, and as a further check we can even look at the entries themselves to see what the most similar versus the most different entries look like. For the "IED explosion" category, one of the most different pairs (with some text modification) is
at 0635c tf %%% reported an ied detonated on a route %%% patrol on asr %%% no injuries or damage
1x vbied exploded near the drink factory with 3x civ killed and 7x civ injured nfi
One of the most similar pairs is
mnd )%%% event %%% unit )%%% cav %%% whojss %%% whatunk explosion when 062000mar08 where %%% at the %%% mosque ied type unk ied description black %%% golf kia %%% wia %%% equipment bda none enemy bda none enemy detainee none sect unk crew system duke other ied defeat equip rhino %%% door kits installed yes %%% of patrol 2x %%% 2x %%% initial report- %%% jss reports %%% in %%% ia reports that the explosion was at the %%% mosque ia reports no damage or injuries sending qrf to investigate update - %%% mitt reports that the explosion was a black %%% golf vbied - %%% arrives on site and has cordon set %%% line ied as follows %%% blackjack %%% black %%% golf %%% unk %%% cf %%% halted %%% cordon set %%% immediate %%% request eod team with %%% qrf reports that the vehicle reported belongs to %%% a body guard for the mosque %%% reported that he did not see anyone around the car and it detonated at the end of prayer there was no secondary device found at this time %%% eod arrives on site %%% eod is sending their robot out to interrogate the vbied blast site %%% qrf reports that eod has cleared the vehicle %%% qrf reports eod is complete and they are moving off the scene s2 assessment this attack was possibly an attempted %%% ied placed under the vehicle it appears as thought the device was set on a timer set to blow after prayer it is currently unknown why %%% was targeted in the attack the attack was most likely a continuation of the recent attacks carried out by the ue in an attempt to destabilize the area expect to see propaganda about the attack within the next two days summary %%% x ied explosion %%% x inj %%% x damage closed
mnd )%%% sigact %%% mnd )%%% %%% mnc )%%% %%% mnd )%%% event %%% unit %%% who %%% ia )%%% cav )%%% ar bn ied type ied kia %%% ia wia %%% ia equipment bda none enemy bda none enemy detainee none sect %%% crew system none other ied defeat equip none %%% door kit installed no %%% of patrol unknown ia what %%% ia conduct a show of force patrol on rte %%% they come upon a suspicious black bag in the street and set a cordon while the cordon is set a small ied detonated with %%% x ia receiving minor wounds when the other soldiers went to give aid a second larger ied detonated on the ia resulting in %%% x additional wia the second ied was believed to be hidden near a tree or light pole when %%% apr %%% where %%% closest isfsoi cp cp %%% unit %%% grid %%% distance %%% initial report ied detonates on %%% ia patrol killing the ia platoon leader lt %%% wounding %%% others %%% line eod not requested due to delay in ia reporting and no cf present on scene timeline %%% x ied detonate on ia patrol resulting in %%% x ia kia dow and %%% x wia %%% ia medevac %%% x wia to csh %%% x ia dow at the csh %%% ia mitt send updated report - %%% arrive on site to investigate and question civilians in the area s2 assessment this attack is likely perpetrated by the same aqi cell that is responsible for the %%% apr and %%% apr bombings on rte %%% and rte %%% respectively as they have used a very similar ttp in all three attacks we are unable to determine if this aqi cell is subordinate to %%% or %%% summary %%% x ied strike %%% x ia kia %%% x ia wia %%% x dmg closed
This pairing makes sense. The short summaries are very different by virtue of the fact that they don't fall along many categories. Since the probability that they happen to coincide is small, they will be very different. When we get longer summaries, we have a better idea of what the summary is about and so we can actually compare them. We can see some similarities in what they deal with here. Obviously, these summaries are both about IEDs, both mention a secondary device (although "no secondary device" was found in the first), both provide assessment referring to groups that might be responsible, etc.--what seems to be a reasonable similarity.
From other samples, we find that the most different pairs tend to be very short summaries, where we do not have a very good idea of the content of the summary, and reading them shows (like above) that they still seem to fit within the category.
There is no grand conclusion to take away from these thoughts. What we can take away from this brief look is that data sets like this are very messy. Here, we have to check everything, because there is no one we can ask about the validity of the data or of the labels. In truth, there is probably no one person we could ask, since these data are a collection of many different reports and reporters. Instead, we must check for internal consistency, and in this case it does seem that, at least for the category labels, they are not haphazard. It was also a fun opportunity to use a neat tool on a very interesting data set!