Dense Caption
0000-0076: A woman (adult-1) assists a little girl (child-1) as she starts riding a bicycle (bike-1).
0077-0223: The little girl (child-1) rides on her own for a distance, with the woman (adult-1) following closely beside her.
0224-0272: The woman (adult-1) helps the little girl (child-1) turn around on the bike.
0273-0311: The woman (adult-1) assists the little girl (child-1) as she starts again.
0312-0363: The little girl (child-1) rides on her own, gaining confidence in her cycling skills.
0364-0374: The little girl (child-1) dismounts from the bike (bike-1), and the woman (adult-1) praises her for her efforts.
0375-0464: The little girl (child-1) walks on the ground (ground-0) with a sense of accomplishment.
QA Pair
0000:Q: Why does the woman (adult-1) need to support the bike (bike-1)? A: Because it's challenging to start riding.
0080:Q: Why is the woman (adult-1) following closely beside the little girl (child-1)? A: Because she's afraid the little girl (child-1) might fall.
0377:Q: Why is the little girl (child-1) smiling? A: Because she received praise and she has learned how to ride the bike.
0385:Q: What should the little girl (child-1) do next? A: She should probably head home with her parents and receive a reward for her achievement.
Dense Caption
0000-0052: A friend (adult-2) and I (adult-1) simultaneously drew a card from our decks. I got the King of Spades (card-8).
0052-0103: I (adult-1) drew another card, the 6 of Hearts (card-9), and placed it on top of cards of the same suit (card-3).
...
QA Pair
0052: Q: Where should I (adult-1) place the King of Spades (card-8) that I drew? A: Since there's no matching card for it, it should be discarded.
0142: Q: Why didn't I (adult-1) place the 5 of Hearts (card-10) on the 6 of Hearts (card-9)? A: I (adult-1) thought the odds were low.
0803: Q: What might be my (adult-1) emotional state at this point? A: Frustrated.
Dense Caption
0000-0035: A person (adult-1) holds a spatula (spatula-1) and stirs the vegetables (vegetable-1) in the frying pan (pan-1), shaking the pan as well.
0036-0143: A person (adult-1) picks up a cutting board (board-1) and pours the chopped vegetables (vegetable-1) from the cutting board (board-1) into the frying pan (pan-1).
...
QA Pair
0036: Q: Why does the person (adult-1) turn to the cutting board (board-1)? A: To pour the vegetables (vegetable-1) from the cutting board (board-1) into the pan (pan-1).
0144: Q: Why does the person (adult-1) pick up the bag (bag-2) from the table? A: To pour the seasoning from the bag (bag-2) into the pan (pan-1).
Dense Caption
0000-0109: I (adult-1) squatted down and used a rag (rag-1) to clean the floor (floor-0).
0189-0280: I (adult-1) placed the rag (rag-1) into a bucket (bucket-1) for cleaning, moved the chair (chair-1) and small table (table-2) to clean the floor, and then cleaned the rag (rag-1) again.
0280-0420: I (adult-1) picked up the rag (rag-1) after sweeping the floor and then reset the chair (chair-1).
QA Pair
0205:Q: Why did I (adult-1) move the chair (chair-1)? A: Because the chair (chair-1) was obstructing the cleaning process.
0420:Q: Why did I (adult-1) use a rag (rag-1) to clean the floor (floor-0)? A: Because a rag (rag-1) can clean up dirt and stains effectively and in detail.
Dense Caption
0000-0022: A horse (horse-1) carrying a man (adult-1) wearing a hat (hat-1) and a woman (adult-2) returns from a distance.
0022-0090: The woman (adult-2) dismounts from the horse (horse-1) and the man (adult-1) who controls the reins helps another woman (adult-5) onto the horse.
0090-0098: The man (adult-1) and the woman (adult-5) ride off together on the horse.
0098-0130: Another man (adult-6) and another woman (adult-7) return to the starting point on horseback (horse-2).
0130-0166: A man (adult-3) helps the woman (adult-7) dismount from the horse (horse-2).
0166-0220: The man (adult-6) on the horse (horse-2) pulls the woman (adult-4) who is standing on the ground (ground-0) up onto the horse.
0220-0327: The man (adult-6) rides the horse (horse-2) with the woman (adult-4) and they leave the starting point with smiles.
0327-0370: The man (adult-1) and the woman (adult-5), who departed earlier, return on horseback (horse-1).
QA Pair
0070: Q: Why doesn't the man (adult-1) dismount? A: It can be reasonably inferred that the man (adult-1) and his colleague, the other man (adult-6), are both instructors at the horse stable and are providing services to the visitors.
Dense Caption
0000-0005: A man (adult-1) bends over to roll up his pants, and a photographer (adult-4) hands a bottle of water (bottle-1) to a woman (adult-2).
0005-0029: After taking the bottle (bottle-1), the woman (adult-2) turns around. The photographer (adult-4) raises the camera (camera-1) to prepare for a photo. The man (adult-1) walks into the pool (rock-0) after rolling up his pants.
0029-0040: The man (adult-1) walks over to the woman (adult-2), embraces her, and they share a kiss.
0040-0053: The man (adult-1) helps the woman (adult-2) pick up the bottle (bottle-1) and feeds her some water. After she finishes drinking, she then offers the man (adult-1) some water in return.
0053-0065: The woman (adult-2) puts down the bottle (bottle-1) and engages in conversation with the people around her.
0065-0071: A nearby woman (adult-6) exits the pool (rock-0).
0071-0073: The woman (adult-2) and the man (adult-1) move closer and hand the bottle (bottle-1) to the photographer (adult-4).
QA Pair
0044: Q: Why did the man (adult-1) and the woman (adult-2) embrace and kiss? A: They wanted the photographer (adult-4) to capture the moment of their love in front of the water pool (rock-0).
Dense Caption
0000-0080: Mom (adult-1) carries the little boy (baby-1) from the room to the living room.
0080-0089: Mom (adult-1) picks up a toy (toy-1) from above the fireplace (rock-0).
0090-0227: Mom (adult-1) holds the toy (toy-1).
0095-0106: The dog (dog-1) jumps up from the floor (floor-0).
0228-0325: The little boy (baby-1) grabs the toy (toy-1).
0288-0335: Mom (adult-1) makes the little boy (baby-1) face the camera.
QA Pair
0000: Q: Why does the little boy (baby-1) look a bit sluggish when coming out of the room? A: He might have just woken up from a nap.
0086: Q: Why did the little boy (baby-1) widen his eyes when he saw the toy (toy-1)? A: It was a pleasant surprise for him.
0097: Q: Why did the dog (dog-1) approach the toy (toy-1)? A: Because the dog (dog-1) was curious about the object.
0295: Q: Why is the little boy (baby-1) looking at the camera? A: Probably because the cameraman is his father.
0322: Q: Why did the little boy (baby-1) cover his mouth with his hand? A: To express gratitude to his father.
Dense Caption
0000-0039: A woman (adult-2) puts a toy plane (toy-1) on the little boy (child-1).
0040-0097: The woman (adult-2) places sunglasses (glasses-1) on the little boy (child-1)'s face.
0065-0097: The little boy (child-1) grabs a microphone (microphone-1).
0097-0175: The little boy (child-1) picks up the microphone (microphone-1) and starts speaking.
0120-0175: Two women (adult-1)(adult-2) talk to the little boy (child-1).
QA Pair
0030:Q: Why did they put the toy plane (toy-1) on the little boy (child-1)? A: To playfully amuse him.
0110:Q: Why does the little boy (child-1) look very happy? A: Because he feels like a pilot.
0097:Q: What did the little boy (child-1) say when he picked up the microphone (microphone-1)? A: He might be praising himself or expressing his excitement.
0175:Q: What should the little boy (child-1) do next? A: He should continue playing with the toy plane (toy-1) and have fun.
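The annotation lines above follow a loose but regular pattern: a caption line starts with a four-digit range such as `0077-0223:`, a QA line starts with a single four-digit index followed by `Q:` and `A:`, and objects are referenced with tags such as `(adult-1)`. A minimal Python sketch of how such lines could be parsed is shown below. It is an illustration only, not an official loader: the names `Caption`, `QAPair`, `find_entities`, and `parse_line` are hypothetical, and reading the four-digit numbers as frame indices is an assumption.

```python
import re
from dataclasses import dataclass

# Entity tags look like "(adult-1)" or "(ground-0)": class name, hyphen, integer id.
ENTITY_RE = re.compile(r"\(([a-z_]+)-(\d+)\)")
# Caption lines look like "0077-0223: text".
CAPTION_RE = re.compile(r"^(\d{4})-(\d{4}):\s*(.+)$")
# QA lines look like "0080:Q: ... A: ..."; the colon and spaces after the index vary.
QA_RE = re.compile(r"^(\d{4}):?\s*Q:\s*(.+?)\s*A:\s*(.+)$")


@dataclass
class Caption:  # hypothetical container for one caption line
    start: int
    end: int
    text: str
    entities: list


@dataclass
class QAPair:  # hypothetical container for one QA line
    frame: int
    question: str
    answer: str
    entities: list


def find_entities(text):
    """Return entity tags in order of first appearance, without duplicates."""
    seen = []
    for name, idx in ENTITY_RE.findall(text):
        tag = f"{name}-{idx}"
        if tag not in seen:
            seen.append(tag)
    return seen


def parse_line(line):
    """Parse one annotation line; return None for headers and elided lines."""
    line = line.strip()
    m = CAPTION_RE.match(line)
    if m:
        text = m.group(3)
        return Caption(int(m.group(1)), int(m.group(2)), text, find_entities(text))
    m = QA_RE.match(line)
    if m:
        question, answer = m.group(2), m.group(3)
        return QAPair(int(m.group(1)), question, answer,
                      find_entities(question + " " + answer))
    return None
```

Section headers such as `Dense Caption` and `QA Pair`, as well as elided `...` lines, return `None`, so a caller can use them to split the text into per-video blocks.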
Toward building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic video scene graph generation (PVSG). PVSG is related to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects localized with bounding boxes in videos. However, the limitations of bounding boxes in detecting non-rigid objects and backgrounds often cause VidSGG systems to miss key details that are crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute a high-quality PVSG dataset, which consists of 400 videos (289 third-person and 111 egocentric) with a total of 150K frames labeled with panoptic segmentation masks as well as fine-grained temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work.
@inproceedings{yang2023pvsg,
title={Panoptic video scene graph generation},
author={Yang, Jingkang and Peng, Wenxuan and Li, Xiangtai and Guo, Zujin and Chen, Liangyu and Li, Bo and Ma, Zheng and Zhou, Kaiyang and Zhang, Wayne and Loy, Chen Change and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18675--18685},
year={2023}
}