E_WP4: Context-driven Behaviour Monitoring & Anomaly Detection

This work package investigates the identification and classification of behaviours as normal or abnormal, safe or threatening, from an irregular and often heterogeneous sensor network. Our approach to anomaly detection is based on the premise that richer models of normality are required before more complex anomalies can be detected. As such, this work package aims to identify techniques for improving behaviour models using spatio-temporal context acquired through pattern-of-life learning from wide-area surveillance.

Our focus has been pattern-of-life learning in wide-area surveillance. Prior work has largely targeted small-scale scenes such as street intersections [2] and transport hubs [3]. However, long-range wide-field sensors are increasingly being deployed (e.g. wide area motion imagery, WAMI). Processing such data raises several `big-data' challenges that remain unsolved. By `big data' we mean streaming sensors/sensor networks that monitor very wide areas (≥ 35 km²), observe many targets (≥ 1000) per sample (e.g. video frame), and operate over long durations (days/weeks/years).

We have proposed a hierarchical structure comprising spatial and spatio-temporal layers to model normal scene behaviour. At the spatial layer, clustering is used to construct chains of Gaussian distributions representing common sub-trajectories. Each node of the chain is then augmented with a non-parametric model of temporal behaviour. This structure is illustrated in Figure 1, and a code sketch of it follows below.
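For illustration, the hierarchy might be represented as follows. This is a minimal sketch only; the class and field names (ChainNode, GaussianChain) are ours and do not reflect the actual implementation.

```python
# Minimal sketch of the hierarchical model: chains of spatial Gaussians,
# each node augmented with raw time-of-day samples for a non-parametric
# temporal model. Names are illustrative, not the actual implementation.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ChainNode:
    mean: np.ndarray                            # 2-D spatial mean of segment
    cov: np.ndarray                             # 2x2 spatial covariance
    times: list = field(default_factory=list)   # observed times of traversal

    def spatial_logpdf(self, x):
        # Log-density of a 2-D point under this node's Gaussian.
        d = np.asarray(x) - self.mean
        return -0.5 * (d @ np.linalg.inv(self.cov) @ d
                       + np.log((2 * np.pi) ** 2 * np.linalg.det(self.cov)))

@dataclass
class GaussianChain:
    nodes: list = field(default_factory=list)   # ordered ChainNode sequence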

We have recently been evaluating our model against public datasets. We report anomaly detection accuracy, since it is the metric most commonly quoted in prior work, defined as: Acc = (TP+TN)/(TP+FP+FN+TN), where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive and false negative detections respectively. However, we believe accuracy alone is not a sufficiently descriptive metric for anomaly detection, since a large number of TNs will dominate the calculation and yield high accuracy even when no anomalies are detected. Therefore, in addition to accuracy we also report false positive rate/fall-out (FPR), true positive rate/sensitivity (TPR), positive predictive value/precision (PPV), and our preferred metric, the F1-score, defined as:

F1 = 2 · (PPV · TPR) / (PPV + TPR).

The F1-score is an evenly weighted (harmonic mean) combination of PPV and TPR and is thus a useful metric for comparing overall classification performance.
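The dominance of TNs in the accuracy calculation is easy to demonstrate. The following sketch computes the reported metrics from raw counts; note how a detector that finds nothing still scores an accuracy of roughly 0.95 while its F1-score is zero.

```python
# Sketch: computing the reported metrics from raw detection counts.
def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0   # fall-out
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0
    return dict(acc=acc, fpr=fpr, tpr=tpr, ppv=ppv, f1=f1)

# A detector that misses all 10 anomalies among 200 benign trajectories:
print(metrics(tp=0, tn=200, fp=0, fn=10))  # acc ~0.95, but f1 = 0.0
```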

Streaming synthetic trajectories

We used synthetic trajectories from Laxhammar and Falkman to evaluate sequential anomaly detection and incremental learning [4]. The dataset contains 100 random trajectory subsets, each containing 2,000 two-dimensional trajectories with approximately 1% anomalies. We partitioned the data into training and test subsets by randomly sampling without replacement, varying the size of the training set from 50 to 1,950 trajectories. We compared performance against published results for the sequential Hausdorff nearest neighbour conformal anomaly detector (SHNN-CAD) and a sequential version of the Discords algorithm.
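For clarity, the partitioning scheme amounts to something like the following (a sketch using NumPy; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trajectories = 2000
train_size = 50                        # varied from 50 up to 1950

# Sampling without replacement: permute all indices once, then split.
idx = rng.permutation(n_trajectories)
train_idx, test_idx = idx[:train_size], idx[train_size:]
```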

Figure 2 shows a distinct advantage for our approach, with as much as a 24% improvement over competing approaches. This lead narrows slowly as the size of the training set increases. Our approach also achieves earlier detections, with a mean detection delay of 5.1 observations versus 9.7 and 10.3 for Discords and SHNN-CAD respectively. This is a significant improvement considering that each complete trajectory is only 16 observations long. Finally, the mean processing time for one subset of trajectories (2,000 trajectories, 100 runs) was 18.2 seconds for our approach, compared with 19.1 seconds for SHNN-CAD and 45.0 seconds for Discords.

GPS trajectories – spatial

To address the limited size of publicly available datasets, we gathered a dataset of 766 GPS trajectories over a seven-month period, henceforth referred to as the HW-POL1 dataset. Trajectories represent the movements of a single individual going about their daily routine and were gathered at a sampling interval of 5 seconds. Each trajectory was hand-annotated with a ground-truth label indicating whether it was spatially and/or temporally anomalous, which, although subjective, provides a ground truth to compare against. A spatial anomaly was defined as any trajectory/sub-trajectory that the collector perceived to be travelled less than once every six months, and a temporal anomaly as any trajectory travelled at a time of day perceived as inconsistent with the person's historical behaviour. Figure 3 illustrates two concrete examples from the dataset: area (A) shows a trajectory labelled `normal', with its corresponding frequency and time-of-day distribution; area (B) illustrates a spatio-temporal anomaly for a trajectory travelled only once. Note that by definition a spatial anomaly must also be a temporal anomaly; however, a temporal anomaly need not be spatial. For example, traversing the trajectory in (A) at 2am would constitute only a temporal anomaly. The dataset contains a total of 127 spatio-temporal anomalies and 24 temporal anomalies.

We first considered the spatial anomaly detection layer in isolation. Since our Gaussian Chain (GC) algorithm demonstrated superior performance to competing approaches on the previous datasets, we did not perform further comparisons for spatial anomaly detection and instead performed all experiments using the GC algorithm.

We evaluated the performance of our approach while varying the minimum candidate length, that is, the minimum number of consecutive observations (γ) that must be detected as anomalous before the whole trajectory is classified as anomalous. γ has a clear impact on FPR, which decreased from 0.19 to 0.11 as the minimum candidate length was increased from 15 to 75 seconds. A candidate length of 45 seconds (γ = 9 at the 5-second sampling interval) was judged a reasonable cut-off for filtering false-positive classifications (FPR: 0.14) without compromising the F1-score (0.70).
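The filtering rule itself is simple; a sketch follows (the function name and interface are illustrative):

```python
# A trajectory is flagged only when at least gamma consecutive observations
# are individually anomalous; at the 5 s sampling interval, gamma = 9
# corresponds to the 45-second candidate length chosen above.
def trajectory_is_anomalous(point_flags, gamma=9):
    run = 0
    for flagged in point_flags:          # per-observation anomaly booleans
        run = run + 1 if flagged else 0
        if run >= gamma:
            return True
    return False
```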

GPS trajectories – spatio-temporal

Turning next to the temporal layer in isolation, we compared KDE and the kNN NCM with varying k (number of neighbours). We configured the sensitivity to 0.05, in line with the standard p-value threshold for hypothesis testing, where the null hypothesis is that the observation is normal. KDE considerably outperformed the kNN NCM with respect to F1-score, TPR and PPV. FPR variation was negligible in all cases, ranging from 0.01 (kNN NCM) to 0.03 (KDE). Similarly, variation in accuracy was negligible, with all techniques achieving 0.95–0.96.
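To illustrate the temporal test, a minimal KDE-based sketch follows. It flags a new observation when its estimated density falls below the empirical 5% quantile of the training densities; this quantile-based threshold is an illustrative stand-in for the exact sensitivity mechanism, not the precise formulation used in our model.

```python
import numpy as np
from scipy.stats import gaussian_kde

def temporal_anomaly(historical_times, t_new, alpha=0.05):
    """Flag t_new (seconds since midnight) as anomalous if its KDE density
    lies in the lowest alpha tail of the historical densities."""
    kde = gaussian_kde(historical_times)            # fit time-of-day density
    threshold = np.quantile(kde(historical_times), alpha)
    return kde([t_new])[0] < threshold
```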

Finally, we considered the combined spatio-temporal anomaly detection performance. A clear improvement was observed when temporal anomalies were considered in addition to spatial anomalies. Specifically, KDE accuracy increased by 12 percentage points to 0.84, true-positive rate by 5 points to 0.74, and F1-score by 8 points to 0.78. Smaller increases were generally seen for the kNN NCM method.

Wide area motion imagery trajectories

Next we used the publicly available Wright Patterson Air Force Base (WPAFB) wide area motion imagery dataset. The dataset is a 17-minute flyover of an air force base and the surrounding area, and is well suited to our target application. Target tracks of vehicles were provided by the dataset authors and formed the input to our model. There are 10,014 target trajectories in the training set (frames 100–611) and 10,297 in the test set (frames 612–1124).

Evaluation on this dataset is made difficult by the lack of ground truth regarding the anomalousness of trajectories. To evaluate the algorithm quantitatively, we collected anomalous/benign classification labels from 10 independent human coders. In the spirit of [5] we established a ``human consensus'' ground truth, using Fleiss' Kappa test [6] to determine the statistical significance of classification labels from different coders. Fleiss' Kappa can be applied to any fixed number of coders and is a statistical measure of agreement that accounts for the level of agreement one would expect to occur by chance. It has range κ ∈ [-1, 1], where a score of 1 indicates complete agreement among the coders and κ ≤ 0 indicates no agreement beyond chance.
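Fleiss' Kappa is straightforward to compute. A sketch follows (our own implementation, written directly from the standard definition):

```python
import numpy as np

def fleiss_kappa(counts):
    """counts[i, j]: number of coders assigning trajectory i to category j
    (here k = 2 categories: benign/anomalous), with a fixed coder count."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                      # coders per trajectory
    p_j = counts.sum(axis=0) / counts.sum()        # category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()  # observed vs chance
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 10 coders, 3 trajectories, complete agreement -> kappa = 1.
print(fleiss_kappa([[10, 0], [0, 10], [10, 0]]))
```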

Figure 4 shows the agreement between coders for trajectories labelled as anomalous. At one extreme, only 13 anomalies were identified and agreed upon by all coders, while 165 were identified and agreed upon by just two coders. We omit anomalies identified by only one coder since our goal is to identify a consensus. The large variability of agreement is striking and in itself highlights one of the great challenges for anomaly detection research: obtaining agreement about what constitutes an anomaly.

Looking at the κ scores for different levels of minimum agreement: a 40% minimum agreement level included all anomalies identified by four or more of the ten coders, with all other anomalies re-labelled as benign. Increasing the minimum support level gradually increased the κ score, reaching κ = 1 (complete statistical agreement) when the support was set to 100%, indicating that the κ score is well calibrated.
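Deriving the consensus ground truth at a given minimum support level then reduces to a simple threshold on the vote counts (a sketch; names are ours):

```python
import numpy as np

def consensus_labels(anomaly_votes, n_coders=10, min_support=0.2):
    """anomaly_votes[i]: number of coders labelling trajectory i anomalous.
    At 20% support with 10 coders, 2 or more votes => anomalous."""
    return np.asarray(anomaly_votes) >= np.ceil(min_support * n_coders)
```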

It should be highlighted that the `known anomaly' (determined by a domain expert) was labelled as anomalous by only 2 of the human coders. Since the 20% support level shows good statistical agreement (κ = 0.68), we use this ground truth for our final validation of classification performance. Of the 241 trajectories labelled as anomalous by the human coders, 71% (TPR = 0.71) were also identified as anomalous by our algorithm. A precision (PPV) of 0.92 was also achieved, indicating very few false-positive detections, giving an overall F1-score of 0.80.

Runtime performance on WAMI

Finally, we evaluated the runtime performance of our algorithm on WAMI data, comparing it against the Piciarelli model [7], which shares some similarities with our own.

We used the first 20 frames of the WPAFB dataset on an Intel Core i7 at 2.2 GHz with 16 GB RAM. Two metrics were considered: the number of clusters created and the runtime per frame.

Focusing initially on the number of clusters, Figure 5 shows that both algorithms exhibit linear growth, although ours grows more slowly and produces significantly fewer clusters. This is because the Piciarelli distance function performs poorly when the number of distinct trajectory starting locations is large, conflicting with the model's implicit assumption that the number of starting locations is small. Furthermore, our merging and concatenation operations are less constrained than those of the Piciarelli model, reducing the overall number of clusters required. Runtime is highly dependent on the number of clusters for both algorithms. Figure 5 shows that our algorithm required under 0.3 seconds per frame by frame 20, making real-time performance tractable. In contrast, the Piciarelli model required almost an hour to process the twentieth frame.

Dr Neil Robertson and Rolf Baxter, Vision Lab, Heriot-Watt University