Violence Detection in Videos: Related Work

1 Jun 2024


(1) Praveen Tirupattur,  University of Central Florida.

Violence detection is a sub-task of activity recognition in which violent activities are to be detected in a video. It can also be considered a kind of multimedia event detection. Several approaches have been proposed for this problem, and they fall into three categories: (i) approaches that use only visual features, (ii) approaches that use only audio features, and (iii) approaches that use both audio and visual features. The category of interest here is the third one, where both video and audio are used. This chapter provides an overview of some of the previous approaches belonging to each of these categories.

2.1. Using Audio and Video

The earliest attempt to detect violence using both audio and visual cues is by Nam et al. [41]. Their work exploits both audio and visual features to detect violent scenes and to generate indexes that allow content-based searching of videos. A spatio-temporal dynamic activity signature is extracted for each shot to categorize it as violent or non-violent. This feature is based on the amount of dynamic motion present in the shot.

The greater the spatial motion between the frames of a shot, the larger the value of the feature. The reasoning behind this approach is that most action scenes involve rapid and significant movement of people or objects. To calculate the spatio-temporal activity feature for a shot, motion sequences are obtained from the shot and normalized by its length, so that only short shots with high spatial motion between frames receive a high value of the activity feature.
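The length normalization can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of Nam et al.: frames are taken to be grayscale images given as nested lists, mean absolute frame differencing stands in for their motion measure, and `frame_motion` and `shot_activity` are hypothetical names.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # grayscale frame as rows of pixel intensities


def frame_motion(prev: Frame, curr: Frame) -> float:
    """Mean absolute intensity difference between two equally sized frames."""
    total = 0.0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += abs(b - a)
            count += 1
    return total / count if count else 0.0


def shot_activity(frames: List[Frame]) -> float:
    """Total inter-frame motion of a shot divided by its length in frames."""
    if len(frames) < 2:
        return 0.0
    motion = sum(frame_motion(p, c) for p, c in zip(frames, frames[1:]))
    return motion / len(frames)
```

Dividing by the shot length means that a long shot containing one brief burst of motion scores lower than a short shot with the same burst.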

Apart from this, to detect flames from gunshots or explosions, sudden variations in the intensity values of pixels between frames are examined. To eliminate false positives, such as intensity variations caused by camera flashes, a pre-defined color table with values close to flame colors such as yellow, orange and red is used. Similarly, to detect blood, which is common in most violent scenes, pixel colors within a frame are matched against a pre-defined color table containing blood-like colors. These visual features by themselves are not enough to detect violence effectively. Hence, audio features are also considered.
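Matching pixels against a pre-defined color table could look roughly like the sketch below. The table entries, the per-channel tolerance `tol`, and the function names are all assumptions made for illustration; the actual tables used by Nam et al. are not reproduced here.

```python
from typing import Iterable, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical blood-like colors, chosen only for this example.
BLOOD_LIKE: List[RGB] = [(120, 0, 0), (160, 10, 10), (200, 20, 20)]


def matches_table(pixel: RGB, table: Iterable[RGB], tol: int = 30) -> bool:
    """True if the pixel is within `tol` of some table color on every channel."""
    return any(
        all(abs(p - t) <= tol for p, t in zip(pixel, entry)) for entry in table
    )


def matching_fraction(pixels: List[RGB], table: List[RGB], tol: int = 30) -> float:
    """Fraction of pixels in a frame that match the color table."""
    if not pixels:
        return 0.0
    hits = sum(1 for px in pixels if matches_table(px, table, tol))
    return hits / len(pixels)
```

A frame could then be flagged when `matching_fraction` exceeds some threshold, which would have to be tuned on labeled data.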

A sudden change in the energy level of the audio signal is used as an audio cue. The energy entropy is calculated for each frame, and a sudden change in this value is used to identify violent events such as explosions or gunshots. The audio and visual cues are time-synchronized to obtain shots containing violence with higher accuracy. One of the main contributions of this paper is to highlight the need for both audio and visual cues to detect violence.
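One common way to compute an energy entropy is to split each audio frame into sub-blocks, normalize the sub-block energies, and take the Shannon entropy of the result. The sketch below follows that definition, which is assumed here and may differ in detail from the one used by Nam et al.; the block count and the change threshold are illustrative parameters.

```python
import math
from typing import List, Sequence


def energy_entropy(frame: Sequence[float], n_blocks: int = 4) -> float:
    """Shannon entropy (bits) of the energy distribution over sub-blocks."""
    block_len = len(frame) // n_blocks
    if block_len == 0:
        return 0.0
    energies = [
        sum(x * x for x in frame[i * block_len:(i + 1) * block_len])
        for i in range(n_blocks)
    ]
    total = sum(energies)
    if total == 0.0:
        return 0.0
    return -sum((e / total) * math.log2(e / total) for e in energies if e > 0.0)


def sudden_changes(values: List[float], threshold: float) -> List[int]:
    """Indices i where values[i] differs from values[i - 1] by more than threshold."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) > threshold
    ]
```

A frame whose energy is spread evenly has high entropy, while a frame dominated by a short burst, as in a gunshot, has low entropy, so the entropy drops sharply at such events.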

Gong et al. [27] also used both visual and audio cues to detect violence in movies. They describe a three-stage approach. In the first stage, low-level visual and auditory features are extracted for each shot in the video. These features are used to train a classifier to detect candidate shots with potentially violent content. In the next stage, high-level audio effects are used to detect candidate shots. To detect these effects, an SVM classifier is trained for each category of audio effect using low-level audio features such as power spectrum, pitch, MFCC (Mel-Frequency Cepstral Coefficients) and harmonicity prominence (Cai et al. [7]). The output of each SVM is mapped through a sigmoid to a continuous value in [0,1], which can be interpreted as a probability (Platt et al. [46]). In the last stage, the probabilistic outputs of the first two stages are combined using boosting, and the final violence score for a shot is calculated as a weighted sum of the scores from the first two stages.
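The sigmoid mapping and the weighted combination can be illustrated as follows. The sigmoid has the standard form used in Platt scaling, with parameters `a` and `b` that would normally be fitted on held-out data; the weights shown in the usage note are placeholders, and the boosting step of Gong et al. is not reproduced.

```python
import math
from typing import Sequence


def platt_probability(decision_value: float, a: float, b: float) -> float:
    """Platt scaling: P(y=1 | f) = 1 / (1 + exp(a * f + b)); a is typically negative."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of stage scores, with the weights normalized to sum to 1."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total
```

For example, `weighted_score([p_stage1, p_stage2], [3.0, 1.0])` would give the first stage three times the influence of the second.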

These weights are calculated using a validation dataset and are chosen to maximize the average precision. The work by Gong et al. [27] concentrates only on detecting violence in movies where universal film-making rules are followed, for instance the use of fast-paced sound during action scenes. Violent content is identified by detecting fast-paced scenes and audio events associated with violence, such as explosions and gunshots. The training and testing data come from a collection of four Hollywood action movies which contain many violent scenes. Even though this approach produced good results, it should be noted that it is optimized to detect violence only in movies which follow certain film-making rules, and it is not expected to work on videos uploaded by users to websites such as Facebook and YouTube.

In the work by Lin and Wang [38], a video sequence is divided into shots, and for each shot the audio and the video features are separately classified as violent or non-violent; the outputs are then combined using co-training. A modified pLSA algorithm (Hofmann [30]) is used to detect violence from the audio segment. The audio segment is split into clips of one second each, and each clip is represented by a feature vector containing low-level features such as power spectrum, MFCC, pitch, zero-crossing rate (ZCR) ratio and harmonicity prominence (Cai et al. [7]). These vectors are clustered, and the resulting cluster centers form an audio vocabulary. Each audio segment is then represented using this vocabulary as an audio document. The Expectation Maximization algorithm (Dempster et al. [20]) is used to fit an audio model which is later used to classify audio segments.

To detect violence in a video segment, three common visual indicators of violence are used: motion, flames/explosions and blood. Motion intensity is used to detect areas with fast motion and to extract motion features for each frame, which are then used to classify the frame as violent or non-violent. Color models and motion models are used to detect flames and explosions in a frame and to classify them. Similarly, a color model and motion intensity are used to detect regions containing blood; if the region is larger than a pre-defined value, the frame is classified as violent. The final violence score for the video segment is the weighted sum of these three individual scores. The features used here are the same as the ones used by Nam et al. [41].

Co-training is used to combine the classification scores from the video and the audio streams. For training and testing, a dataset consisting of five Hollywood movies is used, and a precision of around 0.85 and a recall of around 0.90 are obtained in detecting violent scenes. This work, too, targets violence detection only in movies and not in the videos available on the web. Nevertheless, the results suggest that visual features such as motion and blood are crucial for violence detection.
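The step from one-second clip features to an "audio document" can be sketched as below: each clip vector is assigned to its nearest cluster center (its audio word), and the segment is summarized by the word counts. The cluster centers are assumed to have been computed beforehand, and the pLSA model and EM fitting of Lin and Wang are not shown; the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(vec: Vector, vocabulary: List[Vector]) -> int:
    """Index of the cluster center closest to `vec` (squared Euclidean distance)."""
    def dist2(center: Vector) -> float:
        return sum((v - c) ** 2 for v, c in zip(vec, center))
    return min(range(len(vocabulary)), key=lambda k: dist2(vocabulary[k]))


def audio_document(clips: List[Vector], vocabulary: List[Vector]) -> List[int]:
    """Histogram of audio-word counts over the clips of one audio segment."""
    counts = [0] * len(vocabulary)
    for clip in clips:
        counts[nearest_word(clip, vocabulary)] += 1
    return counts
```

The resulting count vector plays the same role as a word-count vector in text analysis, which is what makes a topic model such as pLSA applicable.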

2.2. Using Audio or Video

All the approaches mentioned so far use both audio and visual cues, but there are others which use either video or audio alone to detect violence, and some which try to detect only one specific kind of violence, such as fist fights. A brief overview of these approaches is presented next.

One of the few works which used audio alone to detect semantic context in videos is by Cheng et al. [11], where a hierarchical approach based on Gaussian mixture models and Hidden Markov models is used to recognize gunshots, explosions, and car-braking. Datta et al. [14] tried to detect person-on-person violence in videos, restricted to fist fighting, kicking, hitting with objects and similar actions, by analyzing violence at the object level rather than at the scene level as most approaches do. Here, the moving objects in a scene are detected and a person model is used to retain only the objects which represent persons. From these, the motion trajectory and orientation information of a person’s limbs are used to detect person-on-person fights.

Clarin et al. [12] developed an automated system named DOVE to detect violence in motion pictures. Here, blood alone is used to detect violent scenes. The system extracts key frames from each scene and passes them to a trained Self-Organizing Map, which labels each pixel as skin, blood or nonskin/nonblood. Labeled pixels are then grouped together through connected components and are observed for possible violence. A scene is considered violent if there is a large change in the pixel regions with skin and blood components. Another work on fight detection is by Nievas et al. [42], in which a Bag-of-Words framework is used along with the action descriptors Space-Time Interest Points (STIP - Laptev [37]) and Motion Scale-invariant feature transform (MoSIFT - Chen and Hauptmann [10]). The authors introduced a new video dataset consisting of 1,000 videos, divided into two groups, fights and non-fights. Each group has 500 videos and each video has a duration of one second. Their experiments produced an accuracy of 90% on a dataset with fights from action movies.
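The connected-components grouping used in DOVE can be illustrated with a simple breadth-first labeling over a binary mask. This is a generic sketch, not the DOVE implementation: it assumes a mask in which 1 marks pixels of one label (for example, blood) and uses 4-connectivity, both of which are choices made here for illustration.

```python
from collections import deque
from typing import List


def component_sizes(mask: List[List[int]]) -> List[int]:
    """Sizes of the 4-connected regions of 1-pixels in a binary label mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    sizes: List[int] = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over the region that contains (r, c).
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes
```

Comparing the region sizes between consecutive key frames gives one possible measure of the "large change" in skin and blood regions mentioned above.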

Deniz et al. [21] proposed a novel method to detect violence in videos using extreme acceleration patterns as the main feature. This method is 15 times faster than state-of-the-art action recognition systems and also has very high accuracy in detecting scenes containing fights. It is therefore well suited to real-time violence detection systems, where speed matters as well as accuracy. The approach compares the power spectra of two consecutive frames to detect sudden motion, and depending on the amount of motion, a scene is classified as violent or non-violent. Because the method does not use feature tracking to detect motion, it is immune to blurring. Hassner et al. [28] introduced an approach for real-time detection of violence in crowded scenes. This method considers the change of flow-vector magnitudes over time. These changes, computed over short frame sequences, are called Violent Flows (ViF) descriptors. The descriptors are then used to classify scenes as violent or non-violent using a linear Support Vector Machine (SVM). As this method uses only flow information between frames and forgoes high-level shape and motion analysis, it is capable of operating in real time. For this work, the authors created their own dataset by downloading videos containing violent crowd behavior from YouTube.
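The idea of tracking changes in flow-vector magnitudes over time can be sketched as follows. The code assumes that per-pixel optical-flow magnitudes have already been computed for each frame, marks a pixel as changed when its magnitude change exceeds the mean change for that frame pair (a strict comparison chosen here), and averages the binary maps over the sequence. The published ViF descriptor involves further steps that are not shown.

```python
from typing import List

Grid = List[List[float]]  # per-pixel optical-flow magnitudes for one frame


def change_map(prev: Grid, curr: Grid) -> List[List[int]]:
    """1 where the magnitude change exceeds the mean change of the frame pair."""
    diffs = [[abs(c - p) for p, c in zip(rp, rc)] for rp, rc in zip(prev, curr)]
    flat = [d for row in diffs for d in row]
    theta = sum(flat) / len(flat)
    return [[1 if d > theta else 0 for d in row] for row in diffs]


def mean_change(magnitudes: List[Grid]) -> Grid:
    """Per-pixel average of the binary change maps over a short frame sequence."""
    if len(magnitudes) < 2:
        raise ValueError("need at least two frames of flow magnitudes")
    maps = [change_map(p, c) for p, c in zip(magnitudes, magnitudes[1:])]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
        for r in range(rows)
    ]
```

Values close to 1 indicate pixels whose flow magnitude changes significantly in almost every frame, which is the kind of erratic motion associated with crowd violence.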

All these works use different approaches to detect violence from videos and all of them use their own datasets for training and testing. They all have their own definition of violence. This demonstrates a major problem for violence detection, which is the lack of independent baseline datasets and a common definition of violence, without which the comparison between different approaches is meaningless.

To address this problem, Demarty et al. [16] presented a benchmark for automatic detection of violent segments in movies as part of the multimedia benchmarking initiative MediaEval-2011 [1]. This benchmark is very useful as it provides a consistent and substantial dataset with a common definition of violence, together with evaluation protocols and metrics. The provided dataset is discussed in detail in Section 4.1. Recent works on violence recognition in videos have used this dataset, and details about some of them are provided next.

2.3. Using MediaEval VSD

Acar et al. [1] proposed an approach that merges visual and audio features in a supervised manner using one-class and two-class SVMs for violence detection in movies. Low-level visual and audio features are extracted from video shots of the movies and then combined in an early fusion manner to train SVMs. MFCC features are extracted to describe the audio content and SIFT (Scale-Invariant Feature Transform - Lowe [39]) based Bag-of-Words approach is used for visual content.

Jiang et al. [33] proposed a method to detect violence based on a set of features derived from the appearance and motion of local patch trajectories (Jiang et al. [34]). Along with these patch trajectories, other features such as SIFT, STIP, and MFCC features are extracted and are used to train an SVM classifier to detect different categories of violence. Score and feature smoothing are performed to increase the accuracy.

Lam et al. [36] evaluated the performance of low-level audio/visual features for the violent scene detection task using the datasets and evaluation protocols provided by MediaEval. In this work, both local and global visual features are used along with motion and MFCC audio features. All these features are extracted for each keyframe in a shot and are pooled to form a single feature vector for that shot. An SVM classifier is trained to classify the shots as violent or non-violent based on this feature vector. Eyben et al. [23] applied large-scale segmental feature extraction along with audiovisual classification for detecting violence. The audio feature extraction is done with the open-source feature extraction toolkit openSMILE (Eyben and Schuller [22]). Low-level visual features such as Hue-Saturation-Value (HSV) histograms, optical flow analysis, and Laplacian edge detection are computed and used for violence detection. Linear SVM classifiers are used for classification, and simple score averaging is used for fusion.
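The works in this section differ mainly in where the modalities are combined. The sketch below contrasts the three operations mentioned: pooling keyframe features into a shot-level vector, early fusion by concatenating features before a single classifier, and late fusion by averaging classifier scores. Mean pooling is an assumption, since the pooling operator is not specified above, and the function names are hypothetical.

```python
from typing import List, Sequence


def mean_pool(keyframe_features: List[Sequence[float]]) -> List[float]:
    """Average per-keyframe vectors into one shot-level vector."""
    n = len(keyframe_features)
    return [sum(col) / n for col in zip(*keyframe_features)]


def early_fusion(visual: Sequence[float], audio: Sequence[float]) -> List[float]:
    """Concatenate modality features before training a single classifier."""
    return list(visual) + list(audio)


def late_fusion(scores: Sequence[float]) -> float:
    """Average the scores of separately trained per-feature classifiers."""
    return sum(scores) / len(scores)
```

Early fusion lets one classifier learn interactions between audio and visual features, whereas late fusion keeps the classifiers independent and only requires their scores to be on a comparable scale.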

2.4. Summary

In summary, almost all methods described above try to detect violence in movies using different audio and visual features, with the exception of only a couple (Nievas et al. [42], Hassner et al. [28]), which use video data from surveillance cameras or from other real-time video systems. It can also be observed that these works do not all use the same dataset and that each has its own definition of violence. The introduction of the MediaEval dataset for Violent Scene Detection (VSD) in 2011 has solved this problem. The recent version of the dataset, VSD2014, also includes video content from YouTube in addition to Hollywood movies and encourages researchers to test their approaches on user-generated video content.

2.5. Contributions

The proposed approach presented in Chapter 3 is motivated by the earlier works on violence detection, discussed in Chapter 2. In the proposed approach, both audio and visual cues are used to detect violence. MFCC features are used to describe the audio content, while blood, motion and SentiBank features are used to describe the video content. SVM classifiers are used to classify each of these features and late fusion is applied to fuse the classifier scores.

Even though this approach is based on earlier works on violence detection, its important contributions are: (i) Detection of different classes of violence. Earlier works on violence detection concentrated only on detecting the presence of violence in a video. The proposed approach is one of the first to distinguish between classes of violence. (ii) Use of the SentiBank feature to describe the visual content of a video. SentiBank is a visual feature which is used to describe the sentiments in an image. This feature was earlier used to detect adult content in videos (Schulze et al. [52]). In this work, it is used for the first time to detect violent content. (iii) Use of a 3-dimensional color model, generated using images from the web, to detect pixels representing blood. This color model is very robust and has shown very good results in detecting blood. (iv) Use of information embedded in a video codec to generate motion features. This approach is very fast when compared to the others, as the motion vectors for each pixel are precomputed and stored in the video codec. A detailed explanation of the proposed approach is presented in the next chapter, Chapter 3.
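To make the idea of a 3-dimensional color model concrete, the sketch below builds a quantized RGB histogram from labeled sample pixels and estimates how likely a color is to be blood. It is only an illustration under assumed choices (8 bins per channel, a simple count ratio, hypothetical names); the model actually used in this work is described in Chapter 3 and may differ.

```python
from collections import Counter
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]


def rgb_bin(pixel: RGB, bins_per_channel: int = 8) -> Tuple[int, int, int]:
    """Quantize an 8-bit RGB pixel into a cell of a 3-D color histogram."""
    step = 256 // bins_per_channel
    return (pixel[0] // step, pixel[1] // step, pixel[2] // step)


class ColorModel:
    """Counts of blood and non-blood training pixels per histogram cell."""

    def __init__(self, blood: Iterable[RGB], other: Iterable[RGB]) -> None:
        self.blood = Counter(rgb_bin(p) for p in blood)
        self.other = Counter(rgb_bin(p) for p in other)

    def blood_probability(self, pixel: RGB) -> float:
        """Fraction of training pixels in this pixel's cell that were labeled blood."""
        cell = rgb_bin(pixel)
        b, o = self.blood[cell], self.other[cell]
        return b / (b + o) if b + o else 0.0
```

Cells never seen in training return 0.0 here; a real model would need a more careful treatment of unseen colors.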

This paper is available on arXiv under a CC 4.0 license.