Vicuna chatbot
Author: r | 2025-04-24
Vicuna: An Open-Source Chatbot
Vicuna is an open-source chatbot, fine-tuned from LLaMA, that developers can build on and deploy with ease. A quick way to try it locally is sketched below. Further reading:
- Building chatbots with Vicuna-13B - an article on using Vicuna to create chatbots
- Comparing LLMs for chat: LLaMA v2 vs Vicuna - a comparison between LLaMA 2 and Vicuna
- LangChain chat models: an overview - LangChain is a popular framework for building chat applications
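As a quick, hedged illustration of running Vicuna as a local chatbot, the sketch below uses the FastChat command-line interface, which is commonly used to serve Vicuna weights. FastChat is not mentioned on this page, and the package name, model identifier, and flags are assumptions to verify against the FastChat documentation.

```bash
# Hedged sketch: chat with a Vicuna checkpoint locally via FastChat.
# Assumed (not from this page): the "fschat" PyPI package, the
# lmsys/vicuna-7b-v1.5 Hugging Face model id, and the --load-8bit flag.
pip install fschat

# Interactive command-line chat; weights are downloaded on first use.
python -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5

# On GPUs with limited memory, load the model in 8-bit instead of fp16.
python -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --load-8bit
```

The projects below (MiniGPT-4 and SmartEdit) instead consume prepared Vicuna weights directly, as described in their setup instructions.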
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
Deyao Zhu* (on the job market!), Jun Chen* (on the job market!), Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. *Equal contribution. King Abdullah University of Science and Technology.

News
We now provide a pretrained MiniGPT-4 aligned with Vicuna-7B! Demo GPU memory consumption can now be as low as 12GB.

Online Demo
Click the image to chat with MiniGPT-4 about your images.

Examples
More examples can be found on the project page.

Introduction
MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer. We train MiniGPT-4 in two stages. The first, traditional pretraining stage uses roughly 5 million aligned image-text pairs and takes about 10 hours on 4 A100s. After this stage, Vicuna is able to understand the image, but its generation ability is heavily impacted.
To address this issue and improve usability, we propose a novel way to create high-quality image-text pairs using the model itself together with ChatGPT. From this we build a small (3,500 pairs in total) yet high-quality dataset.
The second, finetuning stage trains on this dataset with a conversation template, which significantly improves generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes on a single A100. MiniGPT-4 yields many emerging vision-language capabilities similar to those demonstrated in GPT-4.

Getting Started

Installation
1. Prepare the code and the environment
Clone our repository, create a Python environment, and activate it with the following commands:
git clone MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4

2. Prepare the pretrained Vicuna weights
The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B. Please refer to our instructions here to prepare the Vicuna weights. The final weights should sit in a single folder with a structure similar to:
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
Then set the path to the Vicuna weights in the model config file here at Line 16.

3. Prepare the pretrained MiniGPT-4 checkpoint
Download the pretrained checkpoint matching the Vicuna model you prepared (Checkpoint aligned with Vicuna 13B: Download | Checkpoint aligned with Vicuna 7B: Download). Then set the path to the pretrained checkpoint in the evaluation config file eval_configs/minigpt4_eval.yaml at Line 11.

Launching Demo Locally
Try out our demo demo.py on your local machine by running:
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
To save GPU memory, Vicuna loads in 8-bit by default, with a beam search width of 1. This configuration requires about 23 GB of GPU memory for Vicuna 13B and 11.5 GB for Vicuna 7B. On more powerful GPUs, you can run the model in 16-bit by setting low_resource to False in the config file minigpt4_eval.yaml and use a larger beam search width.
Thanks to @WangRongsheng, you can also run our code on Colab.
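Condensing the installation and demo steps above, a minimal end-to-end sketch could look like the following. The repository URL is left as a placeholder because it is not given on this page, and the two config edits still have to be made by hand as described above.

```bash
# Minimal MiniGPT-4 demo setup, assuming prepared Vicuna v0 weights and a
# downloaded MiniGPT-4 checkpoint as described above.
# REPO_URL is a placeholder; the page does not give the repository address.
REPO_URL="<MiniGPT-4 repository URL>"
git clone "$REPO_URL" MiniGPT-4
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4

# Edit the model config (Line 16) to point at your vicuna_weights folder and
# eval_configs/minigpt4_eval.yaml (Line 11) to point at the checkpoint, then:
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0

# Vicuna loads in 8-bit by default (about 23 GB of GPU memory for 13B,
# 11.5 GB for 7B); set low_resource to False in minigpt4_eval.yaml for 16-bit.
```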
Training
The training of MiniGPT-4 contains two alignment stages.

1. First pretraining stage
In the first pretraining stage, the model is trained on image-text pairs from the LAION and CC datasets to align the vision and language models. To download and prepare the datasets, please check our first-stage dataset preparation instructions. After the first stage, the visual features are mapped into a space the language model can understand.
To launch the first-stage training, run the following command. In our experiments, we use 4 A100s. You can change the save path in the config file train_configs/minigpt4_stage1_pretrain.yaml.
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
A MiniGPT-4 checkpoint with only stage-one training can be downloaded here. Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.

2. Second finetuning stage
In the second stage, we use a small, high-quality image-text pair dataset created by ourselves and convert it to a conversation format to further align MiniGPT-4. To download and prepare our second-stage dataset, please check our second-stage dataset preparation instructions.
To launch the second-stage alignment, first specify the path to the checkpoint file trained in stage 1 in train_configs/minigpt4_stage2_finetune.yaml. You can also specify the output path there. Then run the following command. In our experiments, we use 1 A100.
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
After the second-stage alignment, MiniGPT-4 is able to talk about images coherently and in a user-friendly way (a consolidated sketch of both training launches follows this section).

Acknowledgement
BLIP-2: the model architecture of MiniGPT-4 follows BLIP-2. Don't forget to check out this great open-source work if you don't know it already!
Lavis: this repository is built upon Lavis.
Vicuna: the fantastic language ability of Vicuna with only 13B parameters is just amazing. And it is open source!
If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
@misc{zhu2022minigpt4,
  title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
  author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
  year={2023},
}

License
This repository is under the BSD 3-Clause License. Many codes are based on Lavis, with its BSD 3-Clause License here.
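Stepping back to the training recipe above, the two MiniGPT-4 alignment stages reduce to two torchrun launches. This is only a consolidated restatement of the commands already given; NUM_GPU is a placeholder and the config edits are assumed to have been made beforehand.

```bash
# MiniGPT-4 two-stage training, consolidated from the instructions above.
# NUM_GPU is a placeholder for the number of GPUs actually available.

# Stage 1: vision-language alignment on LAION/CC image-text pairs (4x A100 in the paper).
torchrun --nproc-per-node NUM_GPU train.py \
    --cfg-path train_configs/minigpt4_stage1_pretrain.yaml

# Stage 2: conversation-style finetuning on the small curated dataset (1x A100).
# Point the stage-2 config at the stage-1 checkpoint before launching.
torchrun --nproc-per-node NUM_GPU train.py \
    --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```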
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models (CVPR 2024 Highlight)
[Paper] [Project Page] [Demo]
🔥🔥 2024.04: SmartEdit is released!
🔥🔥 2024.04: SmartEdit is selected as a highlight by CVPR 2024!
🔥🔥 2024.02: SmartEdit is accepted by CVPR 2024!
If you are interested in our work, please star ⭐ our project.

SmartEdit Framework
SmartEdit on Understanding Scenarios
SmartEdit on Reasoning Scenarios

Dependencies and Installation
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url
pip install -r requirements.txt
git clone
cd flash-attention
pip install . --no-build-isolation
cd ..

Training model preparation
Please put the prepared checkpoints in the folder checkpoints.
- Prepare the Vicuna-1.1-7B/13B checkpoints: please download Vicuna-1.1-7B and Vicuna-1.1-13B from the link.
- Prepare the LLaVA-1.1-7B/13B checkpoints: please follow the LLaVA instructions to prepare the LLaVA-1.1-7B/13B weights.
- Prepare the InstructDiffusion checkpoint: please download InstructDiffusion (v1-5-pruned-emaonly-adaption-task.ckpt) and the repo from the link, then run:
python convert_original_stable_diffusion_to_diffusers.py --checkpoint_path "./checkpoints/InstructDiffusion/v1-5-pruned-emaonly-adaption-task.ckpt" --original_config_file "./checkpoints/InstructDiffusion/configs/instruct_diffusion.yaml" --dump_path "./checkpoints/InstructDiffusion_diffusers"

Training dataset preparation
Please put the prepared datasets in the folder dataset.
- Prepare the CC12M dataset.
- Prepare the InstructPix2Pix and MagicBrush datasets: these two datasets are prepared on the diffusers website. Download them first, then use python process_HF.py to convert them from "parquet" files to "arrow" files.
- Prepare the RefCOCO, GRefCOCO and COCOStuff datasets: please follow InstructDiffusion to prepare them.
- Prepare the LISA ReasonSeg dataset: please follow LISA to prepare it.
- Prepare our synthetic editing dataset: please download it from the link.

Stage-1: textual alignment with CC12M
Use the scripts to train:
bash scripts/TrainStage1_7b.sh
bash scripts/TrainStage1_13b.sh
Then use the script to run inference:
python test/TrainStage1_inference.py --model_name_or_path "./checkpoints/vicuna-7b-v1-1" --LLaVA_model_path "./checkpoints/LLaVA-7B-v1" --save_dir './checkpoints/stage1_CC12M_alignment_7b/Results-100000' --pretrain_model "./checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-150000.bin" --get_orig_out --LLaVA_version "v1.1-7b"
python test/TrainStage1_inference.py --model_name_or_path "./checkpoints/vicuna-13b-v1-1" --LLaVA_model_path "./checkpoints/LLaVA-13B-v1" --save_dir './checkpoints/stage1_CC12M_alignment_13b/Results-100000' --pretrain_model "./checkpoints/stage1_CC12M_alignment_13b/embeddings_qformer/checkpoint-150000.bin" --get_orig_out --LLaVA_version "v1.1-13b"

Stage-2: SmartEdit training
Use the scripts to train first:
bash scripts/MLLMSD_7b.sh
bash scripts/MLLMSD_13b.sh
Then use the scripts to train:
bash scripts/SmartEdit_7b.sh
bash scripts/SmartEdit_13b.sh
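As a consolidated view of the 7B training path described above, the sketch below strings the stages together. The script names and paths are taken from the instructions; the checkpoint step counts are only the examples used above and may differ for your runs.

```bash
# SmartEdit 7B training pipeline, consolidated from the steps above.
# Assumes checkpoints/ and dataset/ have been populated as described.

# Stage 1: textual alignment with CC12M.
bash scripts/TrainStage1_7b.sh

# Sanity-check the stage-1 alignment (paths and step counts follow the example above).
python test/TrainStage1_inference.py \
    --model_name_or_path "./checkpoints/vicuna-7b-v1-1" \
    --LLaVA_model_path "./checkpoints/LLaVA-7B-v1" \
    --save_dir './checkpoints/stage1_CC12M_alignment_7b/Results-100000' \
    --pretrain_model "./checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-150000.bin" \
    --get_orig_out --LLaVA_version "v1.1-7b"

# Stage 2: train MLLMSD first, then SmartEdit itself.
bash scripts/MLLMSD_7b.sh
bash scripts/SmartEdit_7b.sh
```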
Inference
Please download the SmartEdit-7B and SmartEdit-13B checkpoints and put them in the folder checkpoints. Please download the Reason-Edit evaluation benchmark and put it in the folder dataset.
Use the script to run inference on understanding and reasoning scenes:
python test/DS_SmartEdit_test.py --is_understanding_scenes True --model_name_or_path "./checkpoints/vicuna-7b-v1-1" --LLaVA_model_path "./checkpoints/LLaVA-7B-v1" --save_dir './checkpoints/SmartEdit-7B/Understand-15000' --steps 15000 --total_dir "./checkpoints/SmartEdit-7B" --sd_qformer_version "v1.1-7b" --resize_resolution 256
python test/DS_SmartEdit_test.py --is_reasoning_scenes True --model_name_or_path "./checkpoints/vicuna-7b-v1-1" --LLaVA_model_path "./checkpoints/LLaVA-7B-v1" --save_dir './checkpoints/SmartEdit-7B/Reason-15000' --steps 15000 --total_dir "./checkpoints/SmartEdit-7B" --sd_qformer_version "v1.1-7b" --resize_resolution 256
python test/DS_SmartEdit_test.py --is_understanding_scenes True --model_name_or_path "./checkpoints/vicuna-13b-v1-1" --LLaVA_model_path "./checkpoints/LLaVA-13B-v1" --save_dir './checkpoints/SmartEdit-13B/Understand-15000' --steps 15000 --total_dir "./checkpoints/SmartEdit-13B" --sd_qformer_version "v1.1-13b" --resize_resolution 256
python test/DS_SmartEdit_test.py --is_reasoning_scenes True --model_name_or_path "./checkpoints/vicuna-13b-v1-1" --LLaVA_model_path "./checkpoints/LLaVA-13B-v1" --save_dir './checkpoints/SmartEdit-13B/Reason-15000' --steps 15000 --total_dir "./checkpoints/SmartEdit-13B" --sd_qformer_version "v1.1-13b" --resize_resolution 256
You can use a different resolution for inference on reasoning scenes:
python test/DS_SmartEdit_test.py --is_reasoning_scenes True --model_name_or_path "./checkpoints/vicuna-7b-v1-1" --LLaVA_model_path "./checkpoints/LLaVA-7B-v1" --save_dir './checkpoints/SmartEdit-7B/Reason-384-15000' --steps 15000 --total_dir "./checkpoints/SmartEdit-7B" --sd_qformer_version "v1.1-7b" --resize_resolution 384
python test/DS_SmartEdit_test.py --is_reasoning_scenes True --model_name_or_path "./checkpoints/vicuna-13b-v1-1" --LLaVA_model_path "./checkpoints/LLaVA-13B-v1" --save_dir './checkpoints/SmartEdit-13B/Reason-384-15000' --steps 15000 --total_dir "./checkpoints/SmartEdit-13B" --sd_qformer_version "v1.1-13b" --resize_resolution 384

Explanation of new tokens
The original vocabulary size of LLaMA-1.1 (both 7B and 13B) is 32000, while LLaVA-1.1 (both 7B and 13B) has 32003 tokens, adding three special tokens at indices 32000, 32001 and 32002. In SmartEdit, we keep two of LLaVA's added tokens and remove the third. Besides, we add one special token called "img" for the system message to generate an image, and 32 tokens to summarize image and text information for the conversation system. Therefore, the vocabulary size of SmartEdit is 32035, where "img"=32000, the two retained LLaVA tokens are 32001 and 32002, and the 32 new tokens are 32003~32034. Only the 32 new tokens are effective embeddings for the QFormer.
We explain the meaning of the new embeddings here to avoid misunderstanding; there is no need to merge LoRA weights after you download the SmartEdit checkpoints. If you downloaded the SmartEdit checkpoints before 2024.4.28, please re-download only the checkpoints in the LLM-15000 folder. Besides, when preparing the LLaVA checkpoints, you must first convert the LLaMA delta weights, since LLaMA is under policy protection and LLaVA fine-tunes the whole LLaMA weights.

Metrics Evaluation
Use the script to compute metrics on Reason-Edit (256x256 resolution):
python test/metrics_evaluation.py --edited_image_understanding_dir "./checkpoints/SmartEdit-7B/Understand-15000" --edited_image_reasoning_dir "./checkpoints/SmartEdit-7B/Reason-15000"
python test/metrics_evaluation.py --edited_image_understanding_dir "./checkpoints/SmartEdit-13B/Understand-15000" --edited_image_reasoning_dir "./checkpoints/SmartEdit-13B/Reason-15000"

Todo List
- Release checkpoints that could conduct the "add" functionality.
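To tie the evaluation steps together, here is a parameterized sketch of the 7B understanding/reasoning runs followed by metric computation. The variables only restate the example paths above and should be adjusted to wherever you actually stored the checkpoints.

```bash
# SmartEdit-7B evaluation on Reason-Edit, consolidated from the commands above.
VICUNA="./checkpoints/vicuna-7b-v1-1"
LLAVA="./checkpoints/LLaVA-7B-v1"
TOTAL_DIR="./checkpoints/SmartEdit-7B"
STEPS=15000

# Understanding scenes.
python test/DS_SmartEdit_test.py --is_understanding_scenes True \
    --model_name_or_path "$VICUNA" --LLaVA_model_path "$LLAVA" \
    --save_dir "$TOTAL_DIR/Understand-$STEPS" --steps $STEPS \
    --total_dir "$TOTAL_DIR" --sd_qformer_version "v1.1-7b" --resize_resolution 256

# Reasoning scenes.
python test/DS_SmartEdit_test.py --is_reasoning_scenes True \
    --model_name_or_path "$VICUNA" --LLaVA_model_path "$LLAVA" \
    --save_dir "$TOTAL_DIR/Reason-$STEPS" --steps $STEPS \
    --total_dir "$TOTAL_DIR" --sd_qformer_version "v1.1-7b" --resize_resolution 256

# Metrics on the edited images (256x256).
python test/metrics_evaluation.py \
    --edited_image_understanding_dir "$TOTAL_DIR/Understand-$STEPS" \
    --edited_image_reasoning_dir "$TOTAL_DIR/Reason-$STEPS"
```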
There are several AI players in the market right now, including ChatGPT, Google Bard, Bing AI Chat, and many more. However, all of them require an internet connection to interact with the AI. What if you want to install a similar Large Language Model (LLM) on your computer and use it locally, as an AI chatbot you can use privately and without internet connectivity? With new GUI desktop apps like LM Studio and GPT4All, you can run a ChatGPT-like LLM offline on your computer effortlessly. So on that note, let's go ahead and learn how to use an LLM locally without an internet connection.

Run a Local LLM Using LM Studio on PC and Mac
1. First of all, download LM Studio for your PC or Mac from here.
2. Next, run the setup file and LM Studio will open up.
3. Go to the "search" tab and find the LLM you want to install. You can find the best open-source AI models from our list. You can also explore more models from HuggingFace and the AlpacaEval leaderboard.
4. I am downloading the Vicuna model with 13B parameters. Depending on your computer's resources, you can download even more capable models. You can also download coding-specific models like StarCoder and WizardCoder.
5. Once the LLM model is installed, move to the "Chat" tab in the left menu.
6. Here, click on "Select a model to load" and choose the model you have downloaded.
7. You can now start chatting with the AI model right away using your computer's resources locally. All your chats are private, and you can use LM Studio in offline mode as well.
8. Once you are done, you can click on "Eject Model", which will offload the model from RAM.
9. You can also move to the "Models" tab and manage all your downloaded models.
So this is how you can locally run a ChatGPT-like LLM on your computer.

Run a Local LLM on PC, Mac, and Linux Using GPT4All
GPT4All is another desktop GUI app that lets you locally run a ChatGPT-like LLM on your computer in a private manner. The best part about GPT4All is that it does not even require a dedicated GPU, and you can also upload your documents to train the model locally. No API or coding is required. So let's go ahead and find out how to use GPT4All locally.
1. Download GPT4All from here. It supports Windows, macOS, and Ubuntu.
2. Next, run the installer; it will download some additional packages during installation.
3. After that, download one of the models based on your computer's resources. You must have at least 8GB of RAM to use any of the AI models.
4. Now, you can simply start chatting. Due to low memory, I faced some performance issues and GPT4All stopped working midway. However, if you have a computer with beefy specs, it would work much better.
Editor's Note: Earlier, this guide included the step-by-step process to set up LLaMA and Vicuna.
Become the Franchise Owner of an already established chatbot platform. Grow your business revenue by offering chatbot services without hassle. Resell a white-label chatbot with your domain, your colors, your logo, and your control. Offer your own branded chatbot as a highly profitable SaaS product, without any massive investment, and leverage the ever-growing demand for chatbots to add a new income stream to your business.
With our resell white-label chatbot service, you can offer a personalized and efficient chatbot solution to your clients, increasing their customer satisfaction and loyalty. Our easy-to-use platform allows you to customize your chatbot with your preferred features and integrations, making it a versatile tool for various industries and use cases.

The Power of Reselling White-Label Chatbots with Botsify
With Botsify's chatbot development service, you can now offer your own branded chatbot to your clients as a highly profitable SaaS product. Without any massive investment, you can leverage the ever-growing demand for chatbots and add a new income stream to your business.

Unlock the Potential of Your Chatbot with Botsify's Analytics Solutions
Get the insights you need to optimize your chatbot's performance with Botsify's Chatbot Analytics. Our advanced analytics tools provide real-time data and insights into customer interactions, allowing you to continuously improve your chatbot and achieve your marketing goals. Experience the power of data-driven chatbot marketing with Botsify.

A Turnkey Solution for Chatbot Resellers
Botsify's white-label chatbot solution is a turnkey offering that makes it easy for chatbot resellers to get started. With your own branded platform, you can offer your clients a complete chatbot solution, including AI-powered chatbots, live chat integration, sentiment analysis, and much more.

Grow Your Business with Botsify's White-Label Chatbot Solution
With Botsify's white-label chatbot solution, you can tap into the growing demand for chatbots and grow your business. Whether you're an agency looking to add a new service, a freelancer looking to expand your offerings, or a business looking for a new way to generate income, Botsify's white-label solution is the perfect way to get started. So what are you waiting for? Sign up for Botsify today and grow your business with our white-label chatbot solution!
Introducing SearchBlox 10.7: the new paradigm. SearchBlox 10.7 represents a significant advancement in personalizing chatbot capabilities. The new updates are designed around user-centric feedback, creating a solution that streamlines workflows within the chatbot, engages users with targeted actions, ensures uninterrupted chatbot service, and much more.

A Whole New Conversation
The new release bolsters our Enterprise Search and SearchAI ChatBot capabilities. It comes with a 30-day trial, during which you can deploy and test out the new features.

ChatBot Actions
Users often encounter passive chatbots that require multiple steps to complete tasks, leading to inefficiencies and drop-offs. With the ChatBot Actions update, your SearchAI ChatBot can proactively trigger targeted actions based on specific keywords or phrases. Available actions allow the chatbot to initiate calls, download files, send emails, and open URLs.
Imagine a potential customer visiting your website and expressing interest in a particular product or service. They might use phrases like "schedule a demo" or "download brochure." With ChatBot Actions, your chatbot can recognize these trigger phrases, automatically schedule a demo or provide a download link for the brochure, and capture the user's contact information for further follow-up.

SearchAI Agents
Personalizing User Journeys
Context is great, but adding more focus to personalized conversations can seamlessly guide users toward their desired outcomes.