
    Video-Based Object Detection (VOD), Object Detection at the Edge (ODE) and Activity Recognition (AR) in the 7SHIELD Architecture 

    • Outline

      1. Short Description

      2. Main Purpose and Benefits

      3. Main Functions

      4. Integration with other Modules

      5. Infrastructure Requirements

      6. Operational Manual



      • Content

        1. Short Description

        The Video-Based Object Detection (VOD) and Activity Recognition (AR) module is responsible for interpreting the video input provided to it. Its operation is three-fold, meaning that three submodules exist which complement each other in their task. The submodules comprise two object detectors and one activity recognizer. Both detectors perform the same task, but they are used differently. More specifically, the first one is Video-Based Object Detection (VOD), a generic object detection tool for video input. It is trained to detect any instance of the objects of interest inside each frame and is designed to be executed on a dedicated workstation equipped with GPU hardware. The second one is Object Detection at the Edge (ODE), an object detector specially designed to run on on-board GPU facilities such as those found in UAVs. Last, the Activity Recognition (AR) submodule is responsible for recognising any action of interest depicted in the video inputs.


        2. Main purpose and Benefits

        As mentioned before, the main purpose of the module is to interpret the image scenery provided to it accurately, timely and in an automatic way. This means the analysis is performed automatically, without human involvement, and thus it can easily be integrated into the project's pipeline. The input of the module is video, and the corresponding output is a list of all objects of interest detected inside it. The term detected refers to both recognising (identifying which class each object belongs to) and localising (indicating where in the frame the object has been spotted). By combining this information, an analysis of the monitored situation can be deduced. Also, since the detection is performed at frame level, meaning a sequence of consecutive images is processed, the tool is able to provide temporal information along with spatial information. As a result, the module is ultimately able to provide spatial information (where in the image an object has been detected), temporal information (at which frame the object has been detected and for how long) and, finally, the type of the detected object (which category it belongs to). A clarification regarding the classes: the term object in fact includes all detectable entities, regardless of whether they are alive or not (so humans and animals can be detected too, if required). In fact, the person class is typically of crucial importance for scene analysis, since most actions are performed by humans, and thus it is really important to be informed when a person instance is present.

        Regarding the benefits of using the module, several contribute to its importance. First, the module is fully automatic, meaning it can monitor an area continuously and thus exceed any human's capacity to watch the same area. Second, it is not affected by fatigue or the time of day and performs consistently around the clock. Third, for the ODE submodule specifically, since the input is provided by UAVs, it can cover areas which are difficult for humans to access and thus provide an accurate and timely alert if necessary. The module can easily be incorporated into modular pipelines and operate at various levels of alert status. Finally, the accumulated detected-object information can be further analysed in order to acquire a more thorough understanding of the monitored scenery, which opens the opportunity for a second-level analysis.


        3. Main Functions

        The main functionality of the module is likewise divided into three parts, corresponding to the three submodules that comprise it.

        VOD is responsible for detecting and reporting all objects of interest inside each frame. The analysis is performed at frame level, which means that each frame is analysed independently of the others. Nevertheless, there is a connection between consecutive frames via a tracker. A tracker is a component responsible for updating the relative position of each object across the processed frames. In other words, a tracker connects the objects detected in a frame with the objects detected in the following (or previous) frames. Thus, the use of a tracker extends the module into the time dimension and, additionally, provides the whole pipeline with the ability to inspect the monitored area over time. VOD and ODE use the same tracker, while AR does not require any tracker because it already operates in the time domain. In fact, AR is responsible for recognising the performed action in a sequence of frames and thus requires a sequence of frames in order to operate.
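
        As an illustration of this idea, the following minimal Python sketch assigns persistent object ids by matching each new detection to the existing track with the highest bounding-box overlap (IoU). It is only a sketch: the actual tracker integrated in VOD and ODE is not detailed in this document, and all names used here are illustrative.

        # Minimal IoU-based tracker sketch; illustrative only, not the
        # actual tracker integrated in VOD/ODE.

        def iou(a, b):
            """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (b[2] - b[0]) * (b[3] - b[1]) - inter)
            return inter / union if union > 0 else 0.0

        class SimpleTracker:
            """Keeps the same object id across frames for overlapping detections."""

            def __init__(self, iou_threshold=0.5):
                self.iou_threshold = iou_threshold
                self.tracks = {}          # object id -> last known box
                self.next_id = 0

            def update(self, boxes):
                """Return one object id per box; unmatched boxes get fresh ids."""
                ids, used = [], set()
                for box in boxes:
                    best_id, best_iou = None, self.iou_threshold
                    for tid, prev in self.tracks.items():
                        if tid in used:
                            continue
                        score = iou(box, prev)
                        if score > best_iou:
                            best_id, best_iou = tid, score
                    if best_id is None:   # no sufficient overlap: a newly appeared object
                        best_id, self.next_id = self.next_id, self.next_id + 1
                    used.add(best_id)
                    self.tracks[best_id] = box
                    ids.append(best_id)
                return ids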

        3.1 Function 01

        As mentioned before, VOD and ODE are both detectors and operate in a similar fashion. They process video inputs and their output is two-fold:

        They report the exact location at which every object has been detected inside the image (relative frame coordinates), as well as the class label the module believes each object belongs to. Along with the class label, the module also provides a confidence score expressing how certain it is about the label assignment. Finally, through the integrated tracker, objects are connected across consecutive frames by keeping the same object id, so that an object which has simply changed position is recognised as such and not as a newly appeared object. One plausible shape for such an output record is sketched below, after the figures.


        Figure 1. The output of VOD or ODE


        Figure 2. Object ids assigned by the tracker
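
        For concreteness, one plausible shape for a single detection record, combining the spatial, temporal and class information described above, is sketched here. The field names are hypothetical; the module's actual schema is not specified in this document.

        # Illustrative shape of a single VOD/ODE detection record.
        # All field names are hypothetical.
        detection = {
            "frame": 1042,                     # index of the analysed frame
            "object_id": 7,                    # persistent id kept by the tracker
            "label": "person",                 # class the module believes it belongs to
            "confidence": 0.91,                # certainty of the label assignment
            "bbox": [0.42, 0.10, 0.58, 0.85],  # relative frame coordinates (x1, y1, x2, y2)
        }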

        3.2 Function 02

        AR operates as an extension of VOD and differs from the detectors in a few respects. First, it functions on a sequence of frames and thus no tracker is used or needed per se. Also, since it operates in the time domain, it inherently provides information about the evolution of the actions performed by every person. A sketch of this sliding-window style of operation follows Figure 3.


        Figure 3. The output of the AR submodule
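
        Because AR needs a full sequence of frames before it can operate, its input handling can be pictured as a sliding window over the frame stream. The sketch below is illustrative; recognise_action stands in for the actual AR model, and the window length is an assumption.

        from collections import deque

        WINDOW = 16                            # assumed clip length, in frames

        def recognise_action(clip):
            """Placeholder for the actual AR model (not detailed in this document)."""
            raise NotImplementedError

        def run_ar(frame_stream):
            """Yield one recognised action per fully filled window of frames."""
            buffer = deque(maxlen=WINDOW)
            for frame in frame_stream:
                buffer.append(frame)
                if len(buffer) == WINDOW:      # AR requires a full sequence to operate
                    yield recognise_action(list(buffer))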

        4. Integration with other Modules 

        The VOD, ODE and AR are backend modules and do not provide any user interface for direct communication with the user.

        There are two ways of communicating with the rest of the modules.

        1) The first and more common approach is to use the project’s messaging system. Through this functionality the module can be triggered (if needed) and can also propagate its outputs. Since the VOD submodule is expected to operate on CCTV cameras, the triggering functionality is not enabled for it; when launched, the module consumes the video stream propagated from the cameras and proceeds with the analysis of the frames. After analysing a number of frames, it collects the results (the detected objects found in these frames) and propagates them to the Geospatial Complex Event Processing Engine (G-CEP) module. AR functions in a similar fashion and also propagates its output to G-CEP. In addition, the VOD submodule offers the possibility to store snapshots of the annotated frames (i.e. frames on which the detected objects have been annotated using bounding boxes).
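
        The exchange can be pictured as in the sketch below. The publish function is a hypothetical stand-in for the project's messaging client, and the topic name and payload fields are assumptions, not the documented message schema.

        import json
        import time

        def publish(topic, payload):
            """Hypothetical stand-in for the project's messaging client."""
            raise NotImplementedError

        def report_detections(detections, snapshot_url=None):
            """Collect analysed-frame results and propagate them towards G-CEP."""
            message = {
                "timestamp": time.time(),
                "detections": detections,      # list of detection records
                "snapshot": snapshot_url,      # optional annotated-frame snapshot link
            }
            publish("detections.gcep", json.dumps(message))  # topic name is assumed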

        2) A faster and more direct communication channel is used by the ODE submodule to communicate with the UAVs via User Datagram Protocol (UDP) messages. The information gathered this way is the telemetry of the UAV, which is incorporated into the output message, as well as signals for starting and stopping the detection. The UDP messages are used solely as input for the ODE submodule, while the output is propagated in the same way as in the VOD and AR cases, via the messaging system.
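
        A minimal UDP listener of this kind, built only on the Python standard library, is sketched below; the port number and the exact message format are assumptions for illustration.

        import socket

        PORT = 5005                            # assumed port; the real one is deployment-specific

        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", PORT))

        detecting = False
        while True:
            data, addr = sock.recvfrom(4096)   # telemetry or a control signal from the UAV
            text = data.decode(errors="replace")
            if text == "START":                # signal for starting the detection
                detecting = True
            elif text == "STOP":               # signal for stopping the detection
                detecting = False
            elif detecting:
                telemetry = text               # incorporated later into the output message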

        5. Infrastructure Requirements

        All three submodules can be operated both locally and remotely, but the latter case usually introduces some delay into the video streaming and, thus, into the whole pipeline.

        The main requirement for the operation of the module is the existence of a dedicated GPU for performing the core detection operations. The GPU is not a strict requirement per se, but it is highly advisable, since otherwise the efficiency of the module is reduced significantly and the whole pipeline underperforms.
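
        Assuming, for illustration, a PyTorch-based detector (the module's actual framework is not stated in this document), the fallback behaviour can be expressed as follows.

        import torch

        # Prefer the dedicated GPU; fall back to CPU at the cost of a
        # significantly reduced throughput for the whole pipeline.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Detection will run on: {device}")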

        6. Operational Manual

        6.1 Set-up

        The module can be launched using either a Docker container or a virtual environment with all the required packages preinstalled. Since GPU availability is necessary, the module is tested on specific workstations which have this hardware.

        The typical execution requires a functional video stream link to be provided to the VOD submodule. This means that wherever VOD is executed, a functional video stream should be available. Usually this depends on each premise and the way its security has been set up, and involves a private network, a firewall-bypassing mechanism, or similar. AR does not require any video stream, since it operates on the VOD output. ODE uses the video captured by the UAV camera, so no additional requirements exist for processing its frames.
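
        For illustration, consuming such a stream with OpenCV could look like the sketch below; OpenCV is an assumed dependency and the RTSP URL is a placeholder, not a documented endpoint.

        import cv2

        STREAM_URL = "rtsp://example.invalid/cctv1"   # placeholder stream link

        cap = cv2.VideoCapture(STREAM_URL)
        if not cap.isOpened():
            raise RuntimeError("video stream is not reachable from this host")

        while True:
            ok, frame = cap.read()
            if not ok:                         # stream dropped; VOD needs a functional link
                break
            # ... hand `frame` to the detector/tracker here ...

        cap.release()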

        For all submodules, an active connection to the internet (or, equivalently, to the place where the messaging system is installed) is required in order to publish the module's results.

        6.2 Getting Started

        As long as the requirements regarding connectivity are fulfilled, the module can operate and consume any video stream provided to it.

        6.3 Nominal Operations

        6.3.1 Notifications

        The module responds to any ping sent to it, confirming that it is available in the project’s pipeline.

        6.3.2 Data Entry

        For the VOD submodule, the required data are the propagated video streams. The AR submodule does not require any additional data, while ODE acquires the required data from the UAV itself (through UDP messages).

        6.3.3 User Inputs

        No direct user inputs are inserted into the module.

        6.3.4 User Output

        The output of the module is propagated to the users through various intermediate modules, such as the G-CEP, SPGU and CRCL modules. Additionally, the end users can inspect screenshots of the detected objects by using the links provided in the propagated messages.


        • Acronyms

          AR          Activity Recognition

          CCTV        Closed-Circuit Television

          CI          Critical Infrastructure

          CIP         Critical Infrastructure Protection

          C/P         Cyber/Physical

          DCEP        Data Collection and Edge Processing

          EC          European Commission

          EU          European Union

          G-CEP       Geospatial Complex Event Processing Engine

          GPU         Graphics Processing Unit

          ODE         Object Detection at the Edge

          SGS         Satellite Ground Station

          UAV         Unmanned Aerial Vehicle

          UDP         User Datagram Protocol

          VOD         Video-Based Object Detection