Adventures in Generative AI: Text-to-Video 101


Tags
AI
Published
November 13, 2023

🤖 This post is the first in a future series by Gideon Crawley. You can find Gideon on LinkedIn, or hire him and other AI explorers over at the Quorum1 services page. 🤖

In our digital world, technology continues to advance at a breakneck pace, and innovative solutions are constantly reshaping the way we interact with, consume, and share information.

One such groundbreaking advancement is the emergence of Text-to-Video Artificial Intelligence, a transformative technology that offers a bridge between the written word and the visual realm, enabling the automated transformation of text-based content into engaging and dynamic video presentations.

This blog post aims to provide a brief introduction to this innovative technology, along with an overview of the history of text-to-video. Additionally, you will find links to some more comprehensive articles in case you’d like to take a deeper dive into the topic.

We’ll close with a demonstrative walkthrough of Fliki, a text-to-video AI tool that I have been experimenting with, followed by “The Divine Proportion”, an educational video that I recently made using Fliki.

The Evolution of Text-To-Video

Historical information sourced from: https://www.fabianmosele.com/ai-timeline

Early Development (2000s)

In the early 2000s, the technology that would eventually allow for basic text-to-video systems began to emerge, initially allowing simple animations or slideshows to be generated from written text. However, these systems were limited in terms of complexity and naturalness.

Convolutional Neural Networks (CNNs) are a special kind of multi-layer neural network, designed to recognize visual patterns directly from pixel images with minimal preprocessing. One of the earliest convolutional neural networks was LeNet-5, introduced in 1998 and designed for handwritten and machine-printed character recognition.

This was followed by ImageNet in 2009. ImageNet is a dataset of more than 14 million images with human-annotated labels describing their contents. The largest of its kind at the time, it greatly advanced computer vision research.

Depiction of the ImageNet data set (Source: Roboflow)
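
To make the idea of recognizing visual patterns directly from pixels a little more concrete, here is a minimal sketch of the sliding-filter operation at the heart of a convolutional layer. It is an illustration only, not how LeNet-5 itself was implemented: the function name `convolve2d`, the tiny 4x4 image, and the hand-written edge filter are all assumptions made for this example, and a real CNN learns its filter values during training.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Like most deep learning libraries, this computes a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 grayscale image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-written vertical-edge filter (an assumption; real filters are learned).
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The middle column of the resulting feature map lights up exactly where the image switches from dark to bright, which is the sense in which a filter "detects" a pattern in raw pixels.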

Improved Natural Language Processing / Image Classification (2010s)

During the 2010s, advancements in Natural Language Processing (NLP) greatly improved the understanding of context, sentiment, and semantics within written text. This enhanced understanding formed the basis for the more sophisticated text-to-video AI systems to come.

During the 2012 ImageNet competition (ILSVRC12), AlexNet, a convolutional neural network trained on ImageNet, revolutionized the approach to image classification, outperforming all the other entries that year and winning the competition. The margin was large: AlexNet achieved a top-5 test error rate of 15.3%, while the next-best entry managed only 26.2%.
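
For readers unfamiliar with the metric, a top-5 error rate counts a prediction as correct whenever the true label appears anywhere among the model’s five highest-ranked guesses. The sketch below shows one way that could be computed; the function name `top_k_error` and the toy predictions are made up for illustration and have nothing to do with the actual ILSVRC evaluation code or data.

```python
def top_k_error(ranked_predictions, true_labels, k=5):
    """Fraction of examples whose true label is NOT among the top-k guesses."""
    misses = sum(
        1
        for guesses, truth in zip(ranked_predictions, true_labels)
        if truth not in guesses[:k]
    )
    return misses / len(true_labels)


# Toy data (hypothetical): each list is a model's guesses, best guess first.
predictions = [
    ["cat", "dog", "fox", "wolf", "lynx", "bear"],     # truth is 3rd guess: hit
    ["car", "bus", "van", "truck", "tram", "bike"],    # truth is 6th guess: miss
    ["oak", "elm", "pine", "fir", "ash", "yew"],       # truth is 1st guess: hit
    ["tea", "milk", "cola", "juice", "wine", "beer"],  # truth not listed: miss
]
labels = ["fox", "bike", "oak", "water"]

top5 = top_k_error(predictions, labels, k=5)
top1 = top_k_error(predictions, labels, k=1)
print(f"top-5 error: {top5:.2f}, top-1 error: {top1:.2f}")
```

Top-5 error is more forgiving than top-1 error, which is why it was a popular headline number for a benchmark with a thousand categories, many of them visually similar.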

Introduction of Generative Adversarial Networks / GANs (2014)

Following the release of Microsoft’s COCO (Common Objects in Context), a large-scale object detection, segmentation, and captioning dataset with more than 200 thousand labeled images, and Google’s convolutional neural network GoogLeNet, 2014 also saw the introduction of Generative Adversarial Networks. A GAN is a machine learning framework in which two neural networks, a generator and a discriminator, are trained against each other: the generator tries to produce convincing images, while the discriminator tries to tell them apart from real ones.

Source: Coco Project

The introduction of Generative Adversarial Networks provided a significant boost to the field of AI. GANs enabled the generation of more realistic and nuanced visuals from textual input, paving the way for more refined text-to-video algorithms.
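
The tug-of-war inside a GAN can be sketched with nothing more than the two loss functions. The example below is a simplified illustration under stated assumptions: there are no actual networks here, the function names are made up, and the discriminator scores (probabilities that an image is real) are hypothetical numbers chosen to mimic an early and a later stage of training. The generator loss shown is the common "non-saturating" variant.

```python
import math


def discriminator_loss(scores_on_real, scores_on_fake):
    """Binary cross-entropy: low when real images score near 1 and fakes near 0."""
    real_term = -sum(math.log(p) for p in scores_on_real) / len(scores_on_real)
    fake_term = -sum(math.log(1.0 - p) for p in scores_on_fake) / len(scores_on_fake)
    return real_term + fake_term


def generator_loss(scores_on_fake):
    """Non-saturating generator loss: low when the discriminator rates fakes near 1."""
    return -sum(math.log(p) for p in scores_on_fake) / len(scores_on_fake)


# Hypothetical discriminator outputs, not taken from any real model.
early_real, early_fake = [0.9, 0.8], [0.1, 0.2]  # fakes are easy to spot
late_real, late_fake = [0.6, 0.5], [0.5, 0.4]    # fakes are now convincing

print("early:", discriminator_loss(early_real, early_fake), generator_loss(early_fake))
print("late: ", discriminator_loss(late_real, late_fake), generator_loss(late_fake))
```

As the generator improves, its loss falls while the discriminator’s loss rises, and that opposing pressure is what pushes the generated images toward realism.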

Deep Learning and Neural Networks (Mid-2010s)

Deep learning and neural networks became central to text-to-video advancements, allowing for more complex modeling of both textual data and visual content. This enabled the development of systems that could create dynamic, engaging videos based on the intricacies of the input text.

Arrival of Transformer Architecture (2017)

The introduction of the transformer architecture in 2017 revolutionized NLP tasks. The transformer is a type of neural network architecture that, in its original form, uses an encoder-decoder structure to transform one sequence into another. Transformers, with their attention mechanisms, significantly improved the ability to process and understand large amounts of text, further enhancing text-to-video capabilities.
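
The attention mechanism mentioned above can be sketched in a few lines. This is a minimal, single-head version of scaled dot-product attention written in plain Python for readability. It leaves out the learned projection matrices, multiple heads, and masking that real transformers use, and the two-dimensional token vectors are invented for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each output is a weighted mix of `values`."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[dim] for w, v in zip(weights, values))
            for dim in range(len(values[0]))
        ])
    return outputs


# Three hypothetical token vectors. Using them as queries, keys, and values
# at once is a simplification of self-attention (no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(tokens, tokens, tokens)
for row in mixed:
    print([round(x, 3) for x in row])
```

Each output row is a blend of all the token vectors, weighted by how strongly that token "attends" to the others, which is how a transformer lets every word in a sentence take the rest of the sentence into account.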

Commercial Applications and Integration (Late 2010s - Early 2020s)

Towards the end of the 2010s and into the early 2020s, several companies and startups began integrating AI into their products and services. One such company is Expedia. The popular travel-planning website has integrated conversational AI assistance into its services. Rather than searching for flights, hotels or destinations, customers can plan their vacations as though they are chatting with a friendly, knowledgeable travel agent.

In 2023, Adobe introduced a new family of generative AI models called Firefly, bringing generative AI into Adobe’s suite of apps and services to generate media content.

Canva now uses AI to turn text into images, providing a great example of how AI can be integrated into digital marketing. Here is a video that I made using Canva’s AI:

Video created by the author using Canva’s AI

And here is a great list of real-world corporate use cases for ChatGPT:

The following text-to-video tools have been used in various fields such as marketing, e-learning platforms, and social media to convert text-based content into engaging video formats.

  1. Runway Gen-2: This tool stands out as one of the best AI video generators. It offers over 30 AI features dedicated to video editing, and its newly introduced Gen-2 feature enables the creation of innovative videos from text and/or images. However, it does not support text-to-video with realistic AI avatars or generate speech from text directly.
  2. Synthesia AI: Synthesia is a leading AI video generator that allows you to create videos with AI presenters from text. It includes more than 60 video templates you can start from, and you can choose between more than 140 AI avatars that can speak your text in more than 120 languages and accents.
  3. Kapwing: Kapwing’s text-to-video generator takes any length of text and creates a professional-looking video complete with stock footage, background music, text overlays, subtitles, transitions, and more. It also has a feature to convert text into realistic human-like audio for your video content.

Enhanced Realism and Deepfake Technology (Late 2010s - Early 2020s)

On a bit of a darker note: the early 2020s saw a rise in deepfake technology, which uses AI to manipulate and generate highly realistic videos. Because deepfakes can be hard to distinguish from reality, the technology raises some very legitimate ethical concerns, especially in the realm of politics, as is evidenced in the following NY Times article:

Current State and Future Prospects (2020s and Beyond)

As of the 2020s, text-to-video AI has achieved a level of sophistication that allows for the creation of compelling, high-quality videos from text inputs. The field continues to evolve, with ongoing research focusing on improving accuracy, reducing biases, ensuring ethical use, and expanding the applications of this technology across various domains.

Diagram of the LeNet architecture (Source: SuperAnnotate)

The following article contains a few examples of videos created from a text prompt using the new ModelScope video generator, which is available on Hugging Face. As you can see, it is still quite raw and very surreal. Also pretty entertaining, if you ask me:

Hands-On Text-To-Video Review: Fliki

The future of text-to-video AI holds immense promise, with ongoing efforts to integrate it seamlessly into our digital communication landscape. This evolution continues to transform how we interact with and interpret information, bringing us closer to a future where communication is not just informative but visually captivating and immersive.

One of the more impressive tools I tested that creates video from a text prompt is Fliki.

Fliki creates informational, educational, marketing, motivational or tutorial videos from a single one-line text prompt, a blog post, or a presentation. It writes the entire script, chooses stock video segments, and includes a variety of built-in AI voices you can choose from to narrate the videos.

The following Motivational Short was created from a one-line text prompt, using the “Idea to Video” option. Here is the prompt I used:

Failure is not the opposite of success, it is the opportunity for growth and improvement.

I chose “Motivational” from a drop-down list of video styles, tried a few different voices until I landed on one that I liked, and did a little editing, including swapping one video clip for another here and there and making sure the audio narration lined up with the text subtitles. Other than that, the tool did most of the work. The resulting video is below!

Video created by the author using Fliki

I found myself pretty impressed with these results, and was really curious how the tool would handle “Blog-to-Video”. This ended up taking significantly more work before I had an end result that I was happy with. Honestly, I could have posted the video it initially gave me as-is and it still would have been a coherent, well put-together video that made sense. But it wouldn’t really have been my work, would it? It would not have been imbued with the energy of my attention or intention. I think that while these tools are incredible performance enhancers, ultimately they lack the actual experience of being human, so anything they create on their own is going to feel lacking.

During my bachelor’s degree studies, I researched and wrote an article for my Sociology course about the golden ratio and the Fibonacci sequence, a subject that really fascinates me. I’ve recently become interested in creating educational and/or inspirational video content for younger folks, and feel this is the perfect topic to help spark an interest in mathematics.

The Fibonacci Sequence and The Golden Ratio:

I decided on using this article to test the “Blog-to-Video” option and fed my entire article to Fliki. It summarized the whole thing, breaking it up into scenes by topic. I ended up replacing all the video segments with other video clips, some that I found on Fliki’s site, most that I found elsewhere, and a couple that I created myself using Canva.

I also edited quite a bit of the script, which I found to be a bit stiff and generic, since the text in the script was not my text from the original article. That said, Fliki did a very good job of summarizing my writing and separating it into segments by theme! I then picked the AI voice that I felt best fit the narrative, and added my own background music. As a child I used to love the narrated astronomy lessons at the planetarium, and was sort of going for a planetarium exhibit vibe in this video. I think it captures that quite well.

While it still ended up taking quite a bit of work to complete, the entire video would have taken countless tedious hours to put together from scratch, without the help of Artificial Intelligence.

After much arranging, rearranging, editing and re-editing, here is the final product. Enjoy!

Video created by the author by converting a previously written blog post using Fliki.

We truly have entered a new age of enhanced productivity. Yes, AI is probably going to eliminate thousands of jobs. On the other hand, it can empower the individual to get things done that would have taken an entire company to do before.

I encourage everyone to research what is now possible with the help of AI as it relates to their particular niche / métier. Download some tools, fork some repos, join some Discord servers, ask some questions, learn the basics, conduct some experiments… empower yourselves!

That’s what it’s for.

Want help with your AI project?

Quorum1 is a professional collective filled with experts, aspiring experts, and explorers on the cutting edge of AI. We’re a great partner, whether you’re just ideating an early stage offering or working to deploy a new model to millions of customers.

Use the link below to reach out.