You have likely encountered presentation-style videos that combine slides, figures, tables, and spoken explanations. These videos have become a widely used medium for delivering information, particularly since the COVID-19 pandemic, when stay-at-home measures were in effect. While videos are an engaging way to access content, they have significant drawbacks: they are time-consuming, since one must watch an entire video to find a specific piece of information, and they take up considerable storage space due to their large file size.
Researchers led by Professor Hyuk-Yoon Kwon at Seoul National University of Science and Technology in South Korea aimed to address these issues with PV2DOC, a software tool that converts presentation videos into summarized documents. Unlike other video summarizers, which require a transcript alongside the video and become ineffective when only the video is available, PV2DOC combines visual and audio data to convert the video itself into a document.
This paper was made available online on October 11, 2024, and was published in Volume 28 of the journal SoftwareX on December 1, 2024.
“For users who need to watch and study numerous videos, such as lectures or conference presentations, PV2DOC generates summarized reports that can be read within two minutes. Additionally, PV2DOC manages figures and tables separately, connecting them to the summarized content so users can refer to them when needed,” explains Prof. Kwon.
For image processing, PV2DOC extracts frames from the video at one-second intervals. It compares each frame with the previous one using the structural similarity index (SSIM) to identify unique frames. Objects in each frame, such as figures, tables, graphs, and equations, are then detected by two object detection models, Mask R-CNN and YOLOv5. During this process, a single figure may be detected as several fragments because of whitespace or sub-figures. To resolve this, PV2DOC uses a figure-merge technique that identifies overlapping regions and combines them into a single figure. Next, the system applies optical character recognition (OCR) with the Google Tesseract engine to extract text from the images. The extracted text is then organized into a structured format, such as headings and paragraphs.
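The figure-merge step described above can be illustrated with a minimal sketch: detected regions whose bounding boxes overlap are unioned into one box, repeating until no overlaps remain. The paper does not publish this exact routine; the `(x1, y1, x2, y2)` box format and the merge-until-stable loop below are assumptions for illustration only.

```python
def boxes_overlap(a, b):
    """Return True if two (x1, y1, x2, y2) boxes intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge_two(a, b):
    """Return the union (enclosing box) of two boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_boxes(boxes):
    """Repeatedly merge overlapping boxes until none intersect."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        out = []
        while boxes:
            cur = boxes.pop()
            i = 0
            while i < len(boxes):
                if boxes_overlap(cur, boxes[i]):
                    cur = merge_two(cur, boxes.pop(i))
                    merged = True
                else:
                    i += 1
            out.append(cur)
        boxes = out
    return boxes

# Two fragments of one figure plus a separate table region:
fragments = [(10, 10, 50, 50), (40, 40, 90, 80), (200, 10, 260, 60)]
print(sorted(merge_boxes(fragments)))  # → [(10, 10, 90, 80), (200, 10, 260, 60)]
```

In practice the detector's boxes would come from Mask R-CNN or YOLOv5, and a small padding margin is often added before testing overlap so that fragments separated by thin whitespace still merge.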
Simultaneously, PV2DOC extracts the audio from the video and uses the Whisper model, an open-source speech-to-text (STT) tool, to convert it into written text. The transcribed text is then summarized using the TextRank algorithm, creating a summary of the main points. The extracted images and text are combined into a Markdown document, which can be turned into a PDF file. The final document presents the video’s content—such as text, figures, and formulas—in a clear and organized way, following the structure of the original video.
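The TextRank stage is extractive: sentences become nodes in a graph, edges are weighted by word overlap, and a PageRank-style iteration scores each sentence so the top-ranked ones form the summary. Below is a minimal pure-Python sketch of that idea; the similarity function and damping factor are textbook TextRank defaults, not details taken from PV2DOC's source code.

```python
import math
import re

def sentence_similarity(a, b):
    """Word-overlap similarity normalized by sentence length (TextRank-style)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank_summary(text, n=2, d=0.85, iters=50):
    """Score sentences with a PageRank-style iteration over the similarity
    graph; return the top-n sentences in their original order."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    k = len(sents)
    if k <= n:
        return sents
    sim = [[0.0 if i == j else sentence_similarity(sents[i], sents[j])
            for j in range(k)] for i in range(k)]
    out_weight = [sum(row) for row in sim]  # total outgoing edge weight per node
    scores = [1.0] * k
    for _ in range(iters):
        scores = [(1 - d) + d * sum(sim[j][i] / out_weight[j] * scores[j]
                                    for j in range(k) if out_weight[j] > 0)
                  for i in range(k)]
    top = sorted(range(k), key=lambda i: -scores[i])[:n]
    return [sents[i] for i in sorted(top)]

transcript = ("PV2DOC converts presentation videos into documents. "
              "The system extracts video frames and detects figures in each frame. "
              "Audio is transcribed and the transcript is summarized. "
              "The summarized document links figures to the relevant text.")
print(textrank_summary(transcript, n=2))
```

A production system would swap in the Whisper transcript as input and a tuned sentence splitter, but the ranking mechanics are the same.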
By converting unorganized video data into structured, searchable documents, PV2DOC enhances the accessibility of the video and reduces the storage space needed for sharing and storing the video. “This software simplifies data storage and facilitates data analysis for presentation videos by transforming unstructured data into a structured format, thus offering significant potential from the perspectives of information accessibility and data management. It provides a foundation for more efficient utilization of presentation videos,” says Prof. Kwon.
The researchers plan to further streamline video content into accessible formats. Their next goal is to train a large language model (LLM), similar to ChatGPT, to offer a question-answering service, where users can ask questions based on the content of the videos, with the model generating accurate, contextually relevant answers.
***
Reference
DOI: 10.1016/j.softx.2024.101922
About Seoul National University of Science and Technology (SEOULTECH)
Seoul National University of Science and Technology, commonly known as SEOULTECH, is a national university located in Nowon-gu, Seoul, South Korea. Founded in April 1910, SEOULTECH has grown into a large, comprehensive university with a campus covering 504,922 m².
It comprises 10 undergraduate schools, 35 departments, 6 graduate schools, and has an enrollment of approximately 14,595 students.
Website: https://en.seoultech.ac.kr/
About Associate Professor Hyuk-Yoon Kwon
Prof. Kwon is currently an Associate Professor in the ITM Division, Department of Industrial Engineering/Graduate School of Data Science, Seoul National University of Science and Technology, Seoul, South Korea, where he leads the Big Data-Driven AI Laboratory (https://bigdata.seoultech.ac.kr). Before that, he worked as a researcher at the Ministry of National Defense from 2014 to 2018 and as a postdoctoral researcher at KAIST from 2013 to 2014. He received his Ph.D. in computer science from KAIST in 2013. He was a visiting scholar at the Georgia Institute of Technology from 2024 to 2025 and a research intern at Microsoft Research Asia from 2011 to 2012. His research interests include data-driven AI/ML, big data management, distributed and cloud computing, federated and distributed learning, databases, data-centric cybersecurity, and fair data scraping and analysis. He has presented and published in top-tier conferences and journals in databases, big data, artificial intelligence, and data mining, including ACM SIGMOD, NeurIPS, AAAI, IEEE ICDM, IEEE BigData, IEEE TKDE, and IEEE TII (https://scholar.google.co.kr/citations?user=INJzI3IAAAAJ).
Journal
SoftwareX
Method of Research
Data/statistical analysis
Subject of Research
Not applicable
Article Title
PV2DOC: Converting the presentation video into the summarized document
Article Publication Date
1-Dec-2024
COI Statement
Hyuk-Yoon Kwon reports financial support was provided by Seoul National University of Science & Technology. If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.