Inspiration

Safety and security are things we expect to take for granted in the United States, yet in many public areas that sense of peace has vanished. Of the last ten years, 2023 saw the most major crimes (arson, robbery, and homicide, among others), exceeding the next closest year by almost 300 cases. Existing alert systems have a latency of 10-15 minutes between an active threat and an alert being sent out. Our project, OptiGuard, aims to close that gap with a full-stack computer vision application that instantly detects violence and alerts users before a threat escalates. It uses a custom MobileNet Bi-LSTM network together with a YOLOv5 face-detection convolutional neural network to identify violence in real time and track the individuals involved across multiple live feeds. We also leveraged homography mapping to place the aggressors on a geo-spatial map and track their movements in real time.

What it does

OptiGuard takes two or more live feeds, detects violence, and tracks the people involved across every feed. It then uses the video to build a geo-spatial map that updates the assailant's and victim's locations in real time as they move. Tracking the same people across multiple camera angles is something novel we built from scratch, as there is no publicly available code to do so. The design is meant to scale: with multithreading, the same violence detection and tracking can be replicated across a large network of CCTV cameras covering a public space.

Cross-Camera Tracking: OptiGuard tracks assailants and victims across different camera angles. People whose faces were not identified during a violent attack are ignored, and individuals who were in frame during the attack but not engaged in a violent action are not tracked with a bounding box either.

Mapping: It takes in a camera feed and projects the assailant's and victim's locations onto a geo-spatial map, updating their positions in real time as they move.

Multiple Live Streams: It runs two live streams concurrently, with multiple models running on them at the same time. Users can see each camera's live feed along with the outputs of our application.

People Detection: It detects people's faces in both live streams at the same time, and it tags and monitors anyone involved in a violent conflict across cameras with a label and a bounding box. A person in the second feed is not tagged unless they were involved in the violence in the first feed.

Violence Detection: It takes in a live feed and detects whether or not violence is occurring.

User Interface: We built a user interface that lets administrators track violence across an aesthetic display of live feeds. This includes video playback of the violent incident, a live feed of the incident area, and the ability to switch between the available cameras. Members of the general public can also upload images with descriptions when they witness violence, aiding our models, and can report feedback. A notification system pings users' phones when violence is detected near their local area.

How we built it

Detecting Violence: We use a 16-frame sliding window over the live feed as input to our MobileNet Bi-LSTM neural network, which identifies violence from the frames.
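
Below is a minimal sketch of that sliding window, assuming OpenCV frame capture, a 64x64 input resolution, and a `model.predict` call on the MobileNet Bi-LSTM; the exact resolution and preprocessing in our pipeline may differ.

```python
from collections import deque

import cv2
import numpy as np

WINDOW = 16
frames = deque(maxlen=WINDOW)  # oldest frame is dropped automatically

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    resized = cv2.resize(frame, (64, 64)) / 255.0        # normalize for the CNN (assumed size)
    frames.append(resized)
    if len(frames) == WINDOW:
        clip = np.expand_dims(np.array(frames), axis=0)  # shape (1, 16, 64, 64, 3)
        # score = model.predict(clip)[0][0]              # MobileNet Bi-LSTM confidence
cap.release()
```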

MobileNet Bi-LSTM: We trained this neural network on the RWF-2000 dataset to identify violence in videos. It consists of an input layer, a MobileNet layer, a Bidirectional LSTM layer, and a sequence of alternating dense and dropout layers. This let us run the model on our camera threads and accurately detect violence in a live feed.
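
A minimal Keras sketch of that architecture is below: MobileNet applied per frame via TimeDistributed, a Bidirectional LSTM over the 16-frame clip, then alternating dense and dropout layers. The specific layer sizes, dropout rates, and the MobileNetV2 variant are illustrative assumptions, not our exact configuration.

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (LSTM, Bidirectional, Dense, Dropout,
                                     GlobalAveragePooling2D, Input, TimeDistributed)
from tensorflow.keras.models import Model, Sequential

# Per-frame feature extractor (MobileNet backbone).
cnn = Sequential([
    MobileNetV2(include_top=False, weights="imagenet", input_shape=(64, 64, 3)),
    GlobalAveragePooling2D(),
])

inputs = Input(shape=(16, 64, 64, 3))          # 16-frame clips from the sliding window
x = TimeDistributed(cnn)(inputs)               # MobileNet features for each frame
x = Bidirectional(LSTM(32))(x)                 # temporal context in both directions
x = Dense(64, activation="relu")(x)
x = Dropout(0.3)(x)
x = Dense(32, activation="relu")(x)
x = Dropout(0.3)(x)
outputs = Dense(1, activation="sigmoid")(x)    # violence probability
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```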

Violence Metric: If the confidence level output by the MobileNet Bi-LSTM on the current window is above 0.3, we output a percentage (0-100) representing how violent the situation is. We save this percentage for administrators to access later.
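
A sketch of that conversion, assuming the confidence is simply scaled to a percentage (the 0.3 threshold comes from above; the scaling itself is our illustrative guess):

```python
from typing import Optional

def violence_metric(confidence: float, threshold: float = 0.3) -> Optional[int]:
    """Map model confidence to a 0-100 violence percentage, or None if below the threshold."""
    if confidence < threshold:
        return None
    return round(confidence * 100)

print(violence_metric(0.82))  # -> 82
```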

Tracking assailants across different cameras: Whenever the violence metric fires, we capture the individuals' faces with a pre-trained YOLOv5 face-detection model that we modified for our purposes, adding support for capturing more than one face and for visual display with naming. We crop the bounding boxes returned by the YOLO model and store the face crops so they can be compared via the SSIM metric from the scikit-image library. To save memory and runtime, we avoid storing repeat faces by checking the SSIM score of each new YOLO face crop against all faces already stored. When faces appear in the other live feed, we compare them to the faces stored from the violent interaction to determine whether those people were involved; this way, we don't tag innocent bystanders.
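
A sketch of the face-capture step, loading a YOLOv5 face-detection checkpoint through torch.hub and cropping each detected box; the weights filename and confidence threshold are placeholder assumptions.

```python
import torch

# Any YOLOv5 checkpoint trained on faces can be loaded the same way.
model = torch.hub.load("ultralytics/yolov5", "custom", path="yolov5s-face.pt")

def crop_faces(frame, conf_threshold=0.5):
    """Return a list of face crops detected in a BGR frame."""
    results = model(frame)
    crops = []
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        if conf >= conf_threshold:
            crops.append(frame[int(y1):int(y2), int(x1):int(x2)])
    return crops
```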

SSIM: We use the SSIM (structural similarity) metric to determine how similar two faces are. It is the basic measure our pipeline uses to compare faces against each other: we standardize the face crops our live feeds produce and compare them to decide which individuals to keep tracking in another camera angle.
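
A minimal sketch of that comparison and de-duplication, assuming face crops are resized to a common size and converted to grayscale before scoring; the 96x96 size and 0.5 similarity threshold are illustrative.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def face_similarity(face_a, face_b, size=(96, 96)):
    """SSIM between two face crops after resizing and grayscaling them."""
    a = cv2.cvtColor(cv2.resize(face_a, size), cv2.COLOR_BGR2GRAY)
    b = cv2.cvtColor(cv2.resize(face_b, size), cv2.COLOR_BGR2GRAY)
    return ssim(a, b)

def add_if_new(face, stored_faces, threshold=0.5):
    """Store a face crop only if it isn't already similar to one we keep."""
    if all(face_similarity(face, known) < threshold for known in stored_faces):
        stored_faces.append(face)
```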

Concurrent camera live feeds/multithreading: We used multithreading to run two live streams concurrently. Each live stream runs a YOLO model, and one of them additionally runs the violence detection model. After using the SSIM metric to store only unique images of the assailant and victim, we apply the same metric to faces detected in the other live stream to check whether each one belongs to the assailant or the victim. If it does, we draw a bounding box; otherwise, we don't.
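
A sketch of the threading layout, with each camera handled by its own worker; the per-feed processing (face detection, SSIM matching, violence scoring) is indicated in comments because it relies on the helpers sketched above, and the camera indices are assumptions.

```python
import threading

import cv2

def camera_worker(camera_index, run_violence_model):
    """Read one camera feed on its own thread and run per-feed processing."""
    cap = cv2.VideoCapture(camera_index)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # faces = crop_faces(frame)              # YOLO face detection on every feed
        # if run_violence_model:                 # only one feed runs the Bi-LSTM
        #     ...push the frame into the 16-frame sliding window...
        # ...compare faces to stored assailant/victim crops via SSIM and draw boxes...
    cap.release()

threads = [
    threading.Thread(target=camera_worker, args=(0, True)),   # camera 0: faces + violence
    threading.Thread(target=camera_worker, args=(1, False)),  # camera 1: faces only
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```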

Visualization: Using OpenCV, we output the live feeds from each camera and overlay the bounding boxes from YOLO along with the result of the violence detection model.
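
The overlay itself is straightforward OpenCV drawing; a small sketch with assumed colors and label format:

```python
import cv2

def annotate(frame, boxes, violence_pct):
    """Draw labeled bounding boxes and the violence score onto a frame in place."""
    for x1, y1, x2, y2, label in boxes:
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)
        cv2.putText(frame, label, (x1, y1 - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
    if violence_pct is not None:
        cv2.putText(frame, f"Violence: {violence_pct}%", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)
    cv2.imshow("OptiGuard", frame)
```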

Homography/geo-spatial mapping: We isolate the people of interest via the cross-camera tracking described above. For demo purposes we use arbitrary coordinates, but the setup can easily be modified to derive coordinates from whichever camera is being viewed. We then use a homography mapping to project each tracked person's location onto a 2D geo-map. As a person moves in the camera, their location on the map updates in real time.
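
A minimal sketch of the projection, assuming four reference points in the camera frame matched to four points on the map (all coordinates here are placeholders); each tracked person's foot point is pushed through the resulting homography.

```python
import cv2
import numpy as np

# Pixel coordinates of known reference points in the camera frame...
camera_pts = np.float32([[100, 400], [540, 400], [620, 80], [20, 80]])
# ...and their matching coordinates on the geo-spatial map.
map_pts = np.float32([[0, 0], [10, 0], [10, 20], [0, 20]])

H, _ = cv2.findHomography(camera_pts, map_pts)

def project_to_map(foot_point):
    """Project an (x, y) image point (e.g. the bottom of a bounding box) onto the map."""
    pt = np.float32([[foot_point]])           # shape (1, 1, 2) as required
    mapped = cv2.perspectiveTransform(pt, H)
    return tuple(mapped[0, 0])

print(project_to_map((320, 240)))
```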

Challenges we ran into

Training violence detection: The model took a while to fine-tune and train given the large dataset; switching to GPU computation made it much faster.

Concurrent Model Use: Running the YOLO models at the same time as the violence detection model was originally too computationally expensive for our laptops. By moving the two YOLO models to the GPU, we were able to run everything at once.

Tracking across cameras: There is no publicly available approach for tracking the same individuals across cameras, so we had to come up with a novel solution that compares live faces using an SSIM similarity function. The hard part was figuring out how to store and compare faces properly without keeping duplicates. We solved this by cropping faces with the YOLOv5 face-detection model we leveraged and comparing each new crop to the faces already stored, which both prevents duplicates and saves runtime.

Accomplishments that we're proud of

  • Creating our own novel tracking framework across multiple camera feeds
  • Training and fitting a MobileNet Bi-LSTM to accurately detect violence occurring in a live feed
  • Integrating our MobileNet model into our full-stack application
  • Mapping the assailant and victim to geo-spatial coordinates and tracking them in real time by feeding in interpolated positions

What we learned

  • Running multiple models on concurrent live streams via multithreading
  • Training our own MobileNet Bi-LSTM neural network
  • Performing facial recognition across different camera angles
  • Using homography mapping to project geospatial locations

What's next for OptiGuard

  • Database: With a database, we could store a much larger number of people. The number of people involved in violent interactions grows over time, so a database would make the system viable in larger public settings like stadiums or streets and would let us detect faces more accurately from a larger pool of people.
  • Tracking: We would like to improve the effectiveness and efficiency of tracking across the two cameras by using convolutional neural networks for face matching, which are more accurate than our SSIM-based comparison. This can be done with better technology and more resources.
  • High-Frame-Rate Integration of Live Feeds with the UI: We hope to stream live camera frames to our hosted server and quickly categorize and assess them for efficient display on our webpage. This can be done with access to more resources instead of running all of these models on one laptop.
