Delving into Multimodal Deep Learning in the Real World with Issac Godfried
Recently, I had a chance to watch Issac Godfried’s presentation at the PyData London 2024 conference. He delivered a great talk about the academic background of multimodal deep learning models and their real-life use cases, along with a quick tutorial on how to use some of these models. For beginners on this topic, his talk is a good starting point. He started with what multimodal deep learning is, why these models are hard to use, and in which cases they can be utilized. After explaining these key points, he continued with recent papers and projects. This part was enlightening: he covered the CLIP paper from researchers at OpenAI and mentioned Idefics, an open-source multimodal model developed by Hugging Face, one of the most popular model and dataset platforms. Then he moved on to short scripts showing how to use these models in code, demonstrating three different multimodal deep learning models. Finally, he ended his presentation with challenges and suggestions for the whole process. It was an informative talk with various examples. Based on this presentation, this blog post will explore multimodal deep-learning algorithms in more detail. Let’s go :)
What is Multimodal Deep Learning? Like regular models weren’t confusing enough :)
Multimodal deep learning is a subfield of machine learning that mainly focuses on finding relationships across different types of data. We can easily give examples from today’s popular models. The first example that comes to my mind is the recent ChatGPT model, GPT-4o. You can talk to it (audio input), text it (text input), or give it an image (visual input) and get a response. Another application is autonomous driving systems. The algorithm not only takes the camera views around the car but also combines them with sensor data (LIDAR, ultrasonic, GPS…). This way, it detects the type of an object and its distance from the car at the same time.
Why is designing multimodal systems important?
Multimodal deep learning algorithms are crucial for numerous reasons; here I will give two of the most important:
1) We live in a world where we must handle different input types!
Think about your daily life: you are going to meet a friend on a crowded street. You arrive early and try to spot them. Now suppose you have only one input to achieve this task: you can only hear the street, not see who is on it. You can hear people talking, cars, horns… but you cannot pick out your friend’s voice from the crowd. However, once you can use your eyes, you can see your friend, and all the noise in the street is no longer a problem. This is how multimodal deep learning algorithms help us. If a model does not have enough data on a subject, it cannot be successful. Therefore, in real-life problems, using a variety of inputs matters.
2) Different types of inputs enable more robust models.
When a model has multiple types of data, it can more easily get rid of noise and validate its predictions against the other modalities. Let’s go back to the example above, finding a friend on a crowded street. When you only hear the street, it is literally just noise: you cannot tell what a person looks like or how far away they are from the sounds alone. But with the visual data from your eyes, you can pair the noises with people and cars. This way you get rid of the noise and have more structured data, which enables you to make more precise predictions. Similarly, think about autonomous cars. Visual data alone is not enough for safe travel; the car needs LIDAR, ultrasonic, GPS and other data to detect the objects around it. Since an object can have a complex shape or be moving, these multiple types of data are crucial for a more robust model.
Difficulties in designing multimodal systems
Even if they seem beneficial to our lives, designing a multimodal system has some difficulties. Since Issac has a great slide on this matter, I want to summarize his points here.
1) Complexity
The process is more complex than for unimodal systems. The model has more layers and, most of the time, is computationally more expensive. Debugging is also harder than for other models because of this complexity.
2) Large number of models to choose from
The field is improving rapidly, and every day a new model comes to market. It is hard to pick the best one from hundreds of models.
3) Needs multiple separate deep-learning services
Unlike unimodal models, multimodal systems often require multiple separate services in production.
4) Engineering challenges
Since each data type is paired with the others, the data needs to be stored and versioned accordingly. In addition, creating the infrastructure to train the system is also a challenge.
5) Needs different types of specialists to have a viable product
Most of the time, the model needs specialists from different fields to succeed. For example, if a model takes text and images as input, an NLP engineer and a computer vision specialist will probably need to work together on the project.
After seeing what a multimodal deep learning algorithm is and its pros and cons, let’s explore the production steps and see some examples from the field.
Key aspects for developing a multimodal deep learning system
1) A single multimodal model or multiple single-modal models?
This question has multiple aspects; I will mention some of them:
- Task complexity and integration: If you have multiple correlated data sources, it is better to use a single multimodal system, since it can process and combine the different sources in real time. A common example of this is autonomous driving systems. On the other hand, if the datasets are not correlated, it can be better to use multiple single-modal models, since they are simpler. We will talk about the advantages of this simplicity below.
- Efficiency: This aspect depends on the project. While a single multimodal model can be efficient because it does not have to communicate with other models, it can also be less efficient if it is computationally expensive.
- Flexibility: Since a single multimodal model is tailored to a specific job as one model, it is commonly more complex and less flexible. On the other hand, multiple single-modal models form a system together, which keeps each of them simple. This simplicity lets them be updated separately, making the system more modular. One of the biggest advantages of this is ease of maintenance.
2) Should I fine-tune the model for more domain-specific results?
On this question, Issac mentions that a couple of years ago the answer was always yes, but today, with the rise of really big models like LLMs or LLMs + images, we have tools to get precise results without the need to fine-tune.
3) Scaling: Another key question is how much you want to spend to make the model more scalable. Currently, many models meter usage with a token system, where a token can correspond to characters, words, or some other unit defined by the model (see the sketch after this list).
4) Latency: Nobody wants to wait hours or days for an answer from a model. It should be as quick as possible, ideally real time. However, this comes with a cost: building a model with real-time responses for every user is expensive.
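To make the token idea a bit more concrete, here is a minimal sketch (not from the talk) that counts tokens with the tiktoken library and turns the count into a rough cost estimate. The encoding name and the price per 1,000 tokens are placeholder assumptions; every provider uses its own tokenizer and pricing.
import tiktoken
# cl100k_base is one common OpenAI encoding; other providers use different tokenizers.
enc = tiktoken.get_encoding("cl100k_base")
text = "Multimodal deep learning combines text, images and sensor data."
n_tokens = len(enc.encode(text))
# Hypothetical price per 1,000 input tokens -- check your provider's pricing page.
price_per_1k_tokens = 0.005
print(f"{n_tokens} tokens, estimated cost: ${n_tokens / 1000 * price_per_1k_tokens:.6f}")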
Examples from the field
I want to start this section the way Issac started his presentation: with the CLIP paper.
CLIP (Contrastive Language–Image Pretraining)
CLIP was first introduced by researchers at OpenAI in the paper “Learning Transferable Visual Models From Natural Language Supervision”. CLIP is a breakthrough model for training vision systems using natural language supervision, as noted by ChatGPT (OpenAI, 2024). We can say that this paper and this model were a tipping point on the road to today’s big LLMs. The model works with a contrastive learning approach, learning from paired image-text inputs while distinguishing them from mismatched pairs. It is also highly successful at “zero-shot” tasks, i.e., tasks it has not been explicitly trained on. CLIP’s applications include image classification, object detection, and image captioning, but are not limited to these. Consequently, the paper Learning Transferable Visual Models From Natural Language Supervision pioneered multimodal systems that are highly adaptable and powerful across diverse tasks.
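Neither the talk nor the paper walks through code for this, but as a rough sketch of how zero-shot classification with CLIP looks in practice, here is a minimal example using the Hugging Face transformers implementation; the image path and candidate labels are placeholders I made up.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
# Load a publicly available CLIP checkpoint
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("cat.jpg")  # any local image; this path is just a placeholder
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Encode the image and the candidate captions into the shared embedding space
inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(candidate_labels, probs[0]):
    print(f"{label}: {p.item():.3f}")
The label with the highest probability is the zero-shot prediction, even though the model was never trained on these specific classes.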
Idefics
Idefics is an open-source multimodal LLM from Hugging Face that pairs an image encoder with a language model. It is reported to be on par with ChatGPT in several benchmark tests. If you want to check out the latest version of this model, here is their blog post: https://huggingface.co/blog/idefics2.
Earthformer
Earthformer is a model published by Amazon researchers in 2022. It applies multimodal deep learning to Earth system forecasting, for example predicting precipitation from spatio-temporal Earth observation data. It uses a cuboid attention mechanism, which makes it computationally more efficient while delivering better performance.
CrossViViT
CrossViViT was introduced in 2023 in the paper Improving day-ahead Solar Irradiance Time Series Forecasting by Leveraging Spatio-Temporal Context. Issac says this is the first major paper that fuses numerical time series data and imagery together in a meaningful way. If you want to check the paper and code, see the CrossViViT repository: https://github.com/gitbooo/CrossViViT.
Now that we have the theoretical background of multimodal deep learning algorithms, let’s talk about some practical examples. Since coding is not the main topic of this blog post, I will give only the key parts of the scripts.
1- Optical Character Recognition (OCR) model + Gemini
In this application, he uses docTR, an open-source OCR package for Python. He also uses Google’s Gemini to summarize the text he extracts.
Code:
!pip install python-doctr
!pip install google-cloud-aiplatform
!pip install tf2onnx
from datetime import datetime
import os
from google.colab import auth
import vertexai
from vertexai.generative_models import GenerativeModel, Part
# Authenticate the Colab session with Google Cloud
auth.authenticate_user()
import doctr
from doctr.models import ocr_predictor
# Pretrained OCR pipeline: DB-ResNet50 text detector + MASTER text recognizer
ocr_model = ocr_predictor(det_arch="db_resnet50", reco_arch="master", pretrained=True)
In this project, he used Google Colab as the environment. He installed and imported the related packages: vertexai (part of the Google Cloud AI services), doctr (the OCR package), google-cloud-aiplatform, and tf2onnx (a tool for converting TensorFlow models to the ONNX, Open Neural Network Exchange, format).
from doctr.io import DocumentFile
pdf_doc = DocumentFile.from_pdf("/content/example_doxs.pdf")
Loading the PDF document.
result = ocr_model(pdf_doc)
result.pages[0].render()
Running inference and rendering the first page of the result to check it.
vertexai.init(project="dashcam-examination", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-preview-0409")
model.start_chat()
Setting up the Vertex AI environment, instantiating a generative model, and starting the chat.
# `prompt` holds the summarization instruction defined earlier in his notebook
res = model.generate_content(
    Part.from_text("""{prompt} - -{text} - -""".format(prompt=prompt, text=result.render()))
)
print(res.text)
Formatting the prompt and the extracted text for Gemini, then generating the response and printing it.
2- Using Idefics:
In another example, Issac uses the Idefics model from Hugging Face. He says this model is a good option for structured OCR. In this example, he extracts information from an old newspaper.
Code:
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda:0"
# Load the pretrained Idefics2 processor
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
Importing the necessary libraries, including torch (the PyTorch library for neural networks), PIL (the Python Imaging Library), BytesIO (for handling binary data), and transformers (the Hugging Face library). Then he sets the device for computation (unfortunately he was not able to find an adequate GPU for this). Finally, he loads the pretrained processor “HuggingFaceM4/idefics2-8b”.
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b-chatty").to(DEVICE)
Loading the pretrained model and moving it to the GPU.
# `inputs` holds the preprocessed image + prompt (see the sketch below)
generated_ids = model.generate(**inputs, max_new_tokens=1000)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
Generating text using a pre-trained model from Hugging Face.
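One detail the excerpt leaves out is how `inputs` is built. As a hedged sketch (not Issac’s exact code), Idefics2 inputs are typically prepared with the processor’s chat template, roughly as below, reusing the processor, load_image and DEVICE defined above; the image URL and the question are placeholder assumptions.
# Hypothetical reconstruction of the missing preprocessing step
image = load_image("https://example.com/old_newspaper_page.jpg")  # placeholder URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the headline and publication date from this page."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)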
After the Idefics example, he continues with creating multimodal embeddings and querying them. He uses old writings to test it; however, there was no GPU available at the time of the presentation, so he was not able to run the code. Still, he explains the code and how it works in his talk.
He mentioned that the model layers should be model agnostic: if you want to change your model, you should be able to easily swap it in and out. You want it to be interchangeable.
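He did not show this part of the code, but the idea of a model-agnostic, swappable layer can be sketched with a simple Python interface. The class and function names below are illustrative assumptions, not his implementation.
from typing import List, Protocol
import torch
from transformers import CLIPModel, CLIPProcessor
class Embedder(Protocol):
    """Anything that can turn a piece of content into a vector."""
    def embed(self, content: str) -> List[float]: ...
class ClipTextEmbedder:
    """One possible backend, built on CLIP's text encoder (illustrative only)."""
    def __init__(self, checkpoint: str = "openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(checkpoint)
        self.processor = CLIPProcessor.from_pretrained(checkpoint)
    def embed(self, content: str) -> List[float]:
        inputs = self.processor(text=[content], return_tensors="pt", padding=True)
        with torch.no_grad():
            features = self.model.get_text_features(**inputs)
        return features[0].tolist()
def index_documents(embedder: Embedder, documents: List[str]) -> List[List[float]]:
    # The indexing code depends only on the Embedder interface,
    # so swapping in a different model requires no changes here.
    return [embedder.embed(doc) for doc in documents]
Because the indexing code depends only on the interface, replacing CLIP with another embedding model just means writing one new backend class.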
Architecture of the system
- Django (main app/front end)
- Cloud Function (GCS)
- PubSub
- Docker Flask (model serving layer; see the sketch after this list)
  - Multiple models: OCR + Gemini or Llama 3
  - Single model: Idefics model (Hugging Face) or OpenAI
- Deploy with Kubernetes
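To give a feel for the “Docker Flask (model serving layer)” piece referenced in the list above, here is a minimal, hypothetical sketch of such a service; the endpoint name and the summarize_document placeholder are my assumptions, not code from the talk.
from flask import Flask, jsonify, request
app = Flask(__name__)
def summarize_document(document_bytes: bytes) -> str:
    # Placeholder: in the architecture above this would call the OCR model
    # and then Gemini / Llama 3 (multiple models), or a single Idefics model.
    return "summary goes here"
@app.route("/summarize", methods=["POST"])
def summarize():
    document = request.files["document"].read()
    return jsonify({"summary": summarize_document(document)})
if __name__ == "__main__":
    # In production this would run behind a WSGI server inside Docker,
    # deployed on Kubernetes as listed above.
    app.run(host="0.0.0.0", port=8080)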
Conclusion
In this post, we have taken a deep dive into multimodal deep learning models based on Issac Godfried’s presentation at PyData London 2024. We explored one of the most popular topics in the AI field, multimodal deep learning algorithms. The developments in this area have been increasing exponentially. Maybe these multimodal deep learning algorithms will become the foundation of the much-discussed “Artificial General Intelligence”. I learned a lot while preparing this post, and I hope you feel the same pleasure reading it.
Thank you for taking the time to read this post.
References
Buhl, N. (2024, November 4). Introduction to multimodal deep learning. https://encord.com/blog/multimodal-learning-guide/
Deep learning for geophysics: Current and future trends [Scientific figure]. ResearchGate. https://www.researchgate.net/figure/An-illustration-of-multimodal-deep-learning_fig21_350058755 (accessed 4 Nov 2024)
Gao, Z., Shi, X., Wang, H., Zhu, Y., Wang, Y., Li, M., & Yeung, D. (2022, July 12). Earthformer: Exploring Space-Time Transformers for Earth System Forecasting. arXiv.org. https://arxiv.org/abs/2207.05833
Gitbooo. (n.d.). GitHub — gitbooo/CrossViVit: This repository contains code for the paper “Improving day-ahead Solar Irradiance Time Series Forecasting by Leveraging Spatio-Temporal Context.” GitHub. https://github.com/gitbooo/CrossViVit
Introducing Idefics2: A Powerful 8B Vision-Language Model for the community. (n.d.). https://huggingface.co/blog/idefics2
OpenAI. (2024). ChatGPT (November 4 Version) [Large language model]. OpenAI.
PyData. (2024, June 21). Issac Godfried — Multimodal Deep Learning in the Real World | PyData London 2024 [Video]. YouTube. https://www.youtube.com/watch?v=9Qy0b0prepk
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021, February 26). Learning transferable visual models from natural language supervision. arXiv.org. https://arxiv.org/abs/2103.00020