You have likely encountered presentation-style videos that combine slides, figures, tables, and spoken explanations. These videos have become a widely used medium for delivering information, particularly after the COVID-19 pandemic, when stay-at-home measures were implemented. While videos are an engaging way to access content, they have significant drawbacks: they are time-consuming, since a viewer must often watch an entire video to find a specific piece of information, and their large file sizes take up considerable storage space.
Researchers led by Professor Hyuk-Yoon Kwon at Seoul National University of Science and Technology in South Korea aimed to address these issues with PV2DOC, a software tool that converts presentation videos into summarized documents. Unlike many video summarizers, which require a transcript alongside the video and become ineffective when only the video is available, PV2DOC works from the video alone, combining its visual and audio data to produce a document.
This paper was made available online on October 11, 2024, and was published in Volume 28 of the journal SoftwareX on December 1, 2024.
"For users who need to watch and study numerous videos, such as lectures or conference presentations, PV2DOC generates summarized reports that can be read within two minutes. Additionally, PV2DOC manages figures and tables separately, connecting them to the summarized content so users can refer to them when needed," explains Prof. Kwon.
For image processing, PV2DOC extracts frames from the video at one-second intervals. It compares each frame with the previous one using the structural similarity index (SSIM) to keep only the unique frames. Objects in each frame, such as figures, tables, graphs, and equations, are then detected with two object detection models, Mask R-CNN and YOLOv5. During this process, a single figure may be detected as several fragments because of whitespace or sub-figures; PV2DOC resolves this with a figure-merge technique that identifies overlapping regions and combines them into a single figure. Next, the system applies optical character recognition (OCR) using the Google Tesseract engine to extract text from the images, and the extracted text is organized into a structured format, such as headings and paragraphs.
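The paper does not publish the exact thresholds PV2DOC uses, but the frame-sampling and deduplication step might look like the following minimal sketch, assuming OpenCV and scikit-image (the function name and the 0.9 similarity threshold are illustrative):

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def extract_unique_frames(video_path, ssim_threshold=0.9):
    """Sample one frame per second and keep a frame only if it differs
    substantially (by SSIM) from the last frame that was kept."""
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 1  # frames per second of the source video
    unique_frames, prev_gray, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % fps == 0:  # one sampled frame per second of video
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # A low SSIM score means the slide content has changed.
            if prev_gray is None or ssim(prev_gray, gray) < ssim_threshold:
                unique_frames.append(frame)
                prev_gray = gray
        index += 1
    cap.release()
    return unique_frames
```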
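The figure-merge step combines detections that overlap. One way to illustrate the idea, though not PV2DOC's actual code, is a fixed-point merge of axis-aligned bounding boxes, where overlapping boxes are repeatedly fused until none overlap:

```python
def merge_overlapping_boxes(boxes):
    """Repeatedly merge overlapping (x1, y1, x2, y2) boxes until no two
    overlap, so sub-figures split by whitespace end up as one region."""
    boxes = list(boxes)
    changed = True
    while changed:  # each merging pass strictly reduces the box count
        changed = False
        merged = []
        for box in boxes:
            for i, m in enumerate(merged):
                # Axis-aligned boxes intersect iff they overlap on both axes.
                if box[0] <= m[2] and m[0] <= box[2] and box[1] <= m[3] and m[1] <= box[3]:
                    merged[i] = (min(m[0], box[0]), min(m[1], box[1]),
                                 max(m[2], box[2]), max(m[3], box[3]))
                    changed = True
                    break
            else:
                merged.append(box)
        boxes = merged
    return boxes

# Example: two overlapping fragments become one figure region.
print(merge_overlapping_boxes([(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]))
# [(0, 0, 20, 20), (30, 30, 40, 40)]
```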
Simultaneously, PV2DOC extracts the audio from the video and converts it into written text using the Whisper model, an open-source speech-to-text (STT) tool. The transcript is then condensed into a summary of its main points using the TextRank algorithm. The extracted images and summarized text are combined into a Markdown document, which can be converted into a PDF file. The final document presents the video's content, including text, figures, and formulas, in a clear and organized way, following the structure of the original video.
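A rough sketch of this audio pipeline, assuming the open-source openai-whisper package, ffmpeg for audio extraction, and the summa implementation of TextRank (the model size, summary ratio, and file paths are illustrative; PV2DOC's actual configuration may differ):

```python
import subprocess
import whisper                           # openai-whisper
from summa.summarizer import summarize   # TextRank-based summarizer

def summarize_presentation_audio(video_path, audio_path="audio.wav", ratio=0.2):
    # Extract the audio track as 16 kHz mono WAV, the format Whisper expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ar", "16000", "-ac", "1", audio_path],
        check=True,
    )
    # Transcribe the speech to text with an open-source Whisper model.
    model = whisper.load_model("base")        # model size is illustrative
    transcript = model.transcribe(audio_path)["text"]
    # Condense the transcript to its main points with TextRank.
    return summarize(transcript, ratio=ratio)
```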
By converting unstructured video data into structured, searchable documents, PV2DOC makes video content easier to access and reduces the storage space needed to share and store it. "This software simplifies data storage and facilitates data analysis for presentation videos by transforming unstructured data into a structured format, thus offering significant potential from the perspectives of information accessibility and data management. It provides a foundation for more efficient utilization of presentation videos," says Prof. Kwon.
The researchers plan to further streamline video content into accessible formats. Their next goal is to train a large language model (LLM), similar to ChatGPT, to offer a question-answering service in which users can ask questions about a video's content and the model generates accurate, contextually relevant answers.