Panoptic Video Scene Graph Generation

¹S-Lab, Nanyang Technological University ²SenseTime Research
*Main Contributors ✉Corresponding Author
Accepted to CVPR 2023

The PVSG dataset comprises 400 videos characterized by their length (76.5 seconds on average), diverse perspectives (first- and third-person views across different scenarios), and dynamism (significant camera and object motion). Its annotations include video panoptic segmentation and temporal scene graphs on 150K frames, plus video-level dense captions and QA pairs. Some examples are shown below. All 400 visual examples are [here].

Dense Caption

0000-0076: A woman (adult-1) assists a little girl (child-1) as she starts riding a bicycle (bike-1).
0077-0223: The little girl (child-1) rides on her own for a distance, with the woman (adult-1) following closely beside her.
0224-0272: The woman (adult-1) helps the little girl (child-1) turn around on the bike.
0273-0311: The woman (adult-1) assists the little girl (child-1) as she starts again.
0312-0363: The little girl (child-1) rides on her own, gaining confidence in her cycling skills.
0364-0374: The little girl (child-1) dismounts from the bike (bike-1), and the woman (adult-1) praises her for her efforts.
0375-0464: The little girl (child-1) walks on the ground (ground-0) with a sense of accomplishment.
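
The caption lines above follow a regular pattern: a zero-padded frame range, a colon, and free text in which each entity is tagged as `(class-id)`. A minimal Python sketch of reading that pattern follows. The function name `parse_caption` and the field names are hypothetical, chosen for illustration; the released PVSG annotation files may store this information in a different form.

```python
import re
from typing import NamedTuple

class Caption(NamedTuple):
    start: int       # first frame index of the segment
    end: int         # last frame index of the segment
    text: str        # caption text with entity tags left in place
    entities: list   # unique entity tags such as "adult-1", in order of appearance

# "0000-0076: some text" -> frame range and caption body
_LINE = re.compile(r"^(\d+)-(\d+):\s*(.+)$")
# "(adult-1)", "(bike-1)", "(ground-0)" -> class name plus instance id
_ENTITY = re.compile(r"\(([a-z_]+-\d+)\)")

def parse_caption(line: str) -> Caption:
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a caption line: {line!r}")
    start, end, text = int(match.group(1)), int(match.group(2)), match.group(3)
    # dict.fromkeys removes duplicates while keeping first-seen order
    entities = list(dict.fromkeys(_ENTITY.findall(text)))
    return Caption(start, end, text, entities)

example = parse_caption(
    "0000-0076: A woman (adult-1) assists a little girl (child-1) "
    "as she starts riding a bicycle (bike-1)."
)
print(example.start, example.end, example.entities)
```

Keeping the entity tags as plain strings makes it easy to join captions with the segmentation masks and scene-graph nodes that carry the same identifiers.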

QA Pair

0000: Q: Why does the woman (adult-1) need to support the bike (bike-1)? A: Because it's challenging to start riding.
0080: Q: Why is the woman (adult-1) following closely beside the little girl (child-1)? A: Because she's afraid the little girl (child-1) might fall.
0377: Q: Why is the little girl (child-1) smiling? A: Because she received praise and she has learned how to ride the bike.
0385: Q: What should the little girl (child-1) do next? A: She should probably head home with her parents and receive a reward for her achievement.
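
Each QA line pairs a frame index with a question and its answer. The sketch below shows one way to split such a line in Python. `parse_qa` and `QAPair` are hypothetical names, and the regular expression is an assumption fitted to the examples on this page rather than an official format specification.

```python
import re
from typing import NamedTuple

class QAPair(NamedTuple):
    frame: int      # frame index at which the question is asked
    question: str
    answer: str

# Tolerates an optional colon and optional spaces between the frame index and "Q:"
_QA = re.compile(r"^(\d+)\s*:?\s*Q:\s*(.+?)\s*A:\s*(.+)$")

def parse_qa(line: str) -> QAPair:
    match = _QA.match(line.strip())
    if match is None:
        raise ValueError(f"not a QA line: {line!r}")
    return QAPair(int(match.group(1)), match.group(2), match.group(3))

qa = parse_qa(
    "0080: Q: Why is the woman (adult-1) following closely beside the little girl "
    "(child-1)? A: Because she's afraid the little girl (child-1) might fall."
)
print(qa.frame, "|", qa.question, "|", qa.answer)
```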

Dense Caption

0000-0052: A friend (adult-2) and I (adult-1) each drew a card from our decks at the same time. I got the King of Spades (card-8).
0052-0103: I (adult-1) drew another card, the 6 of Hearts (card-9), and placed it on top of cards of the same suit (card-3).
...

QA Pair

0052: Q: Where should I (adult-1) place the King of Spades (card-8) that I drew? A: Since there's no matching card for it, it should be discarded.
0142: Q: Why didn't I (adult-1) place the 5 of Hearts (card-10) on the 6 of Hearts (card-9)? A: I (adult-1) thought the odds were low.
0803: Q: What might be my (adult-1) emotional state at this point? A: Frustrated.

Dense Caption

0000-0035: A person (adult-1) holds a spatula (spatula-1) and stirs the vegetables (vegetable-1) in the frying pan (pan-1), shaking the pan as well.
0036-0143: A person (adult-1) picks up a cutting board (board-1) and pours the chopped vegetables (vegetable-1) from the cutting board (board-1) into the frying pan (pan-1).
...

QA Pair

0036: Q: Why does the person (adult-1) turn to the cutting board (board-1)? A: To pour the vegetables (vegetable-1) from the cutting board (board-1) into the pan (pan-1).
0144: Q: Why does the person (adult-1) pick up the bag (bag-2) from the table? A: To pour the seasoning from the bag (bag-2) into the pan (pan-1).

Dense Caption

0000-0109: I (adult-1) squatted down and used a rag (rag-1) to clean the floor (floor-0).
0189-0280: I (adult-1) placed the rag (rag-1) into a bucket (bucket-1) for cleaning, moved the chair (chair-1) and small table (table-2) to clean the floor, and then cleaned the rag (rag-1) again.
0280-0420: I (adult-1) picked up the rag (rag-1) after sweeping the floor and then reset the chair (chair-1).

QA Pair

0205: Q: Why did I (adult-1) move the chair (chair-1)? A: Because the chair (chair-1) was obstructing the cleaning process.
0420: Q: Why did I use a rag (rag-1) to clean the floor (floor-0)? A: Because a rag (rag-1) can clean up dirt and stains effectively and in detail.

Dense Caption

0000-0022: A horse (horse-1) carries a man (adult-1) wearing a hat (hat-1), and a woman (adult-2) returns from a distance.
0022-0090: The woman (adult-2) dismounts from the horse (horse-1) and the man (adult-1) who controls the reins helps another woman (adult-5) onto the horse.
0090-0098: The man (adult-1) and the woman (adult-5) ride off together on the horse.
0098-0130: Another man (adult-6) and another woman (adult-7) return to the starting point on horseback (horse-2).
0130-0166: A man (adult-3) helps the woman (adult-7) dismount from the horse (horse-2).
0166-0220: The man (adult-6) on the horse (horse-2) pulls the woman (adult-4) who is standing on the ground (ground-0) up onto the horse.
0220-0327: The man (adult-6) rides the horse (horse-2) with the woman (adult-4) and they leave the starting point with smiles.
0327-0370: The man (adult-1) and the woman (adult-5), who departed earlier, return on horseback (horse-1).

QA Pair

0070: Q: Why doesn't the man (adult-1) dismount? A: It can be reasonably inferred that the man (adult-1) and his colleague, the other man (adult-2), are both instructors at the horse stable and they are providing services to the visitors.

Dense Caption

0000-0005: A man (adult-1) bends over to roll up his pants, and a photographer (adult-4) hands a bottle of water (bottle-1) to a woman (adult-2).
0005-0029: After taking the bottle (bottle-1), the woman (adult-2) turns around. The photographer (adult-4) raises the camera (camera-1) to prepare for a photo. The man (adult-1) walks into the pool (rock-0) after rolling up his pants.
0029-0040: The man (adult-1) walks over to the woman (adult-2), embraces her, and they share a kiss.
0040-0053: The man (adult-1) helps the woman (adult-2) pick up the bottle (bottle-1) and feeds her some water. After she finishes drinking, she offers the man (adult-1) some water in return.
0053-0065: The woman (adult-2) puts down the bottle (bottle-1) and engages in conversation with the people around her.
0065-0071: A nearby woman (adult-6) exits the pool (rock-0).
0071-0073: The woman (adult-2) and the man (adult-1) move closer and hand the bottle (bottle-1) to the photographer (adult-4).

QA Pair

0044: Q: Why did the man (adult-1) and the woman (adult-2) embrace and kiss? A: They wanted the photographer (adult-4) to capture the moment of their love in front of the water pool (rock-0).

Dense Caption

0000-0080: Mom (adult-1) carries the little boy (baby-1) from the room to the living room.
0080-0089: Mom (adult-1) picks up a toy (toy-1) from above the fireplace (rock-0).
0090-0227: Mom (adult-1) holds the toy (toy-1).
0095-0106: The dog (dog-1) jumps up from the floor (floor-0).
0228-0325: The little boy (baby-1) grabs the toy (toy-1).
0288-0335: Mom (adult-1) makes the little boy (baby-1) face the camera.

QA Pair

0000: Q: Why does the little boy (baby-1) look a bit sluggish when coming out of the room? A: He might have just woken up from a nap.
0086: Q: Why did the little boy (baby-1) widen his eyes when he saw the toy (toy-1)? A: It was a pleasant surprise for him.
0097: Q: Why did the dog (dog-1) approach the toy (toy-1)? A: Because the dog (dog-1) was curious about the object.
0295: Q: Why is the little boy (baby-1) looking at the camera? A: Probably because the cameraman is his father.
0322: Q: Why did the little boy (baby-1) cover his mouth with his hand? A: To express gratitude to his father.

Dense Caption

0000-0039: A woman (adult-2) puts a toy plane (toy-1) on the little boy (child-1).
0040-0097: The woman (adult-2) places sunglasses (glasses-1) on the face of the little boy (child-1).
0065-0097: The little boy (child-1) grabs a microphone (microphone-1).
0097-0175: The little boy (child-1) picks up the microphone (microphone-1) and starts speaking.
0120-0175: Two women (adult-1)(adult-2) talk to the little boy (child-1).

QA Pair

0030: Q: Why did they put the toy plane (toy-1) on the little boy (child-1)? A: To playfully amuse him.
0110: Q: Why does the little boy (child-1) look very happy? A: Because he feels like a pilot.
0097: Q: What did the little boy (child-1) say when he picked up the microphone (microphone-1)? A: He might be praising himself or expressing his excitement.
0175: Q: What should the little boy (child-1) do next? A: He should continue playing with the toy plane (toy-1) and have fun.

Abstract

Towards building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic video scene graph generation (PVSG). PVSG is related to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects localized with bounding boxes in videos. However, bounding boxes are limited in detecting non-rigid objects and backgrounds, which often causes VidSGG systems to miss key details that are crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute a high-quality PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine-grained temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work.
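
To make the idea of mask-grounded nodes and temporal relations concrete, here is a minimal Python sketch of one possible in-memory representation. All class and field names (`MaskTube`, `Relation`, `VideoSceneGraph`) are hypothetical, the predicate `riding` is used only as an illustration, and masks are stored as pixel sets purely for readability; this is not the format of the released dataset.

```python
from dataclasses import dataclass, field

@dataclass
class MaskTube:
    """One scene-graph node: an object tracked through the video."""
    object_id: str   # e.g. "child-1"
    category: str    # e.g. "child"
    # frame index -> set of (row, col) pixels covered by the object in that frame
    masks: dict = field(default_factory=dict)

@dataclass
class Relation:
    """One scene-graph edge, valid over a closed frame interval."""
    subject_id: str
    predicate: str
    object_id: str
    start: int
    end: int

@dataclass
class VideoSceneGraph:
    nodes: dict = field(default_factory=dict)       # object_id -> MaskTube
    relations: list = field(default_factory=list)   # list of Relation

    def relations_at(self, frame: int) -> list:
        """Return (subject, predicate, object) triplets active at a frame."""
        return [
            (r.subject_id, r.predicate, r.object_id)
            for r in self.relations
            if r.start <= frame <= r.end
        ]

graph = VideoSceneGraph()
graph.nodes["child-1"] = MaskTube("child-1", "child", {0: {(10, 12), (10, 13)}})
graph.nodes["bike-1"] = MaskTube("bike-1", "bike", {0: {(11, 12), (11, 13)}})
graph.relations.append(Relation("child-1", "riding", "bike-1", 77, 223))
print(graph.relations_at(100))
```

The difference from a box-based VidSGG graph lies in the node: each node carries a per-frame pixel mask instead of four box coordinates, so background regions and non-rigid objects can be represented without covering unrelated pixels.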

Dataset Overview

The PVSG dataset statistics. The PVSG dataset contains 400 third-person and egocentric videos from diverse environments, as shown in (a). The statistics of object classes and relation classes are shown in (b) and (c).

PVSG Dataset Annotation Pipeline. The construction of the PVSG dataset can be divided into VPS annotation and relation annotation. For VPS annotation, we select a few key frames and use an off-the-shelf video object segmentation (VOS) model, AOT (Yang et al., 2021), to propagate the annotated objects to the whole video. We then perform frame-level mask fusion using the predefined layer order to obtain a coarse VPS annotation for further revision. The relations are annotated based on the description of the key information in the video.
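
The frame-level mask fusion step can be illustrated with a small sketch. Per-object masks propagated independently may overlap, while a panoptic annotation assigns each pixel exactly one label, so the masks are painted in a fixed order. The function name `fuse_masks`, the list-based mask format, and the back-to-front reading of "layer order" are assumptions made for this example; the source does not spell out these details.

```python
def fuse_masks(object_masks, layer_order, height, width, void_label=0):
    """Merge per-object binary masks into one label map for a single frame.

    object_masks: dict mapping an integer object label to a 2D list of 0/1 values
    layer_order:  labels listed from back-most to front-most (assumed convention);
                  later labels overwrite earlier ones wherever masks overlap
    """
    fused = [[void_label] * width for _ in range(height)]
    for label in layer_order:
        mask = object_masks[label]
        for row in range(height):
            for col in range(width):
                if mask[row][col]:
                    fused[row][col] = label
    return fused

# Toy 2x3 frame: label 1 is a background region, label 2 an object in front of it.
masks = {
    1: [[1, 1, 1],
        [1, 1, 0]],
    2: [[0, 1, 1],
        [0, 0, 0]],
}
panoptic = fuse_masks(masks, layer_order=[1, 2], height=2, width=3)
print(panoptic)  # [[1, 2, 2], [1, 1, 0]]
```

Pixels covered by no mask keep the void label, which is where the later manual revision would fill in or correct the coarse annotation.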

Methods and Experimental Results

Framework for the PVSG Task in Two Stages. The first stage generates video panoptic segmentation masks and a video-length feature tube for each object, with two methodological options. The second stage predicts inter-object relations from the feature tubes, with four methods available for a thorough evaluation. The stages are connected through the feature tubes, which are paired according to the ground truth and then processed with strategies ranging from simple fully-connected layers to a transformer encoder, so that relations are classified with temporal dynamics taken into account.
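
As a rough illustration of the second stage, the sketch below pairs two feature tubes, applies a fixed temporal filter, and scores relations per frame with a fully-connected layer. It is a toy in plain Python under stated assumptions: the function names, the smoothing kernel, the sigmoid output, and all weights are hypothetical, and it does not reproduce the authors' models.

```python
import math

def pair_tubes(subject_tube, object_tube):
    """Concatenate per-frame features of a subject tube and an object tube."""
    return [s + o for s, o in zip(subject_tube, object_tube)]

def temporal_filter(sequence, kernel):
    """Apply the same 1D kernel along time to every feature channel.

    The sequence is zero-padded so the output has one vector per input frame.
    """
    half = len(kernel) // 2
    length, channels = len(sequence), len(sequence[0])
    output = []
    for t in range(length):
        frame = [0.0] * channels
        for k, weight in enumerate(kernel):
            source = t + k - half
            if 0 <= source < length:
                for c in range(channels):
                    frame[c] += weight * sequence[source][c]
        output.append(frame)
    return output

def classify(sequence, weights, bias):
    """Per-frame fully-connected layer plus sigmoid: one score per relation class."""
    scores = []
    for frame in sequence:
        logits = [sum(w * x for w, x in zip(row, frame)) + b
                  for row, b in zip(weights, bias)]
        scores.append([1.0 / (1.0 + math.exp(-z)) for z in logits])
    return scores

# Two toy tubes: 4 frames, 2 feature channels each.
subject = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
obj = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
pairs = pair_tubes(subject, obj)                      # 4 frames x 4 channels
smoothed = temporal_filter(pairs, kernel=[0.25, 0.5, 0.25])
weights = [[1.0, 0.0, 0.0, 1.0]]                      # one hypothetical relation class
scores = classify(smoothed, weights, bias=[-1.0])
print(len(scores), len(scores[0]))                    # 4 1
```

Replacing the fixed kernel with learned weights gives a 1D convolution, and replacing the filter with self-attention over frames gives a transformer-style variant; skipping the filter entirely leaves only the per-frame fully-connected layer.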

Comparative Analysis of PVSG Task Models. For the second stage, the table shows the transformer encoder giving the best results and the 1D convolutional approach surpassing the handcrafted method, indicating the merit of learnable parameters. Even the basic vanilla methods achieve some recall, suggesting that the task is feasible given a solid first-stage model. For the first stage, end-to-end VPS models underperform the IPS+T baselines, particularly on the challenging, dynamic videos of the PVSG dataset. The table also shows that both efficiency and accuracy leave room for improvement.
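
The text refers to recall without defining it here. As a hedged illustration, the sketch below computes recall at K over relation triplets, a metric commonly used in scene graph work. It is a simplification: it matches triplets by identifier only and ignores any mask-overlap criterion that a full evaluation would apply. The function name `recall_at_k`, the predicates, and the scores are made up for the example.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the k highest-scored predictions.

    predictions:  list of (score, (subject, predicate, object))
    ground_truth: set of (subject, predicate, object)
    """
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    return len(top_k & ground_truth) / len(ground_truth)

ground_truth = {
    ("adult-1", "holding", "spatula-1"),
    ("adult-1", "pouring", "vegetable-1"),
}
predictions = [
    (0.9, ("adult-1", "holding", "spatula-1")),
    (0.6, ("adult-1", "holding", "pan-1")),
    (0.4, ("adult-1", "pouring", "vegetable-1")),
]
print(recall_at_k(predictions, ground_truth, k=2))  # 0.5
print(recall_at_k(predictions, ground_truth, k=3))  # 1.0
```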

IPS+T + Transformer Visualization

IPS+T + Transformer Visualization and Ground Truth

IPS+T + Transformer Visualization

IPS+T model as first stage, video: P28_19.

VPS model as first stage, video: P28_19.

BibTeX

@inproceedings{yang2023pvsg,
  title={Panoptic video scene graph generation},
  author={Yang, Jingkang and Peng, Wenxuan and Li, Xiangtai and Guo, Zujin and Chen, Liangyu and Li, Bo and Ma, Zheng and Zhou, Kaiyang and Zhang, Wayne and Loy, Chen Change and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18675--18685},
  year={2023}
}
