Cycle Consistency as Reward: Learning
Image-Text Alignment without Human Preferences

Hyojin Bahng*
Caroline Chan*
Frédo Durand
Phillip Isola
* Equal contribution, alphabetical order
MIT CSAIL

Method Overview: (1) Given an input image x, we generate multiple candidate captions F_i(x), F_j(x) using different captioning models. Each caption is mapped back to the image domain via a text-to-image model G, and the reconstruction is compared against the original image. Captions whose reconstructions G(F(x)) are more similar to the original image are preferred; those with low similarity are rejected. (2) These comparison pairs are used to train a reward model. (3) At inference time, the reward model assigns higher scores to preferred captions. We apply the same process to text-to-image generation.
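
As an illustration of how comparison pairs can supervise a reward model, here is a minimal sketch assuming a Bradley-Terry style pairwise loss. This is a common choice for preference learning, but the overview above does not state the exact training objective, so treat it as an assumption. The function name pairwise_preference_loss and the scalar scores are hypothetical.

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred item outranks the rejected one.

    Assumes a Bradley-Terry model: P(preferred > rejected) = sigmoid(s_p - s_r).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the reward model scores the preferred candidate further above the rejected one, and grows when the order is reversed.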



Abstract

Learning alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation.
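
The ranking step described above can be sketched generically. The code below is a minimal illustration, not the authors' implementation: the backward map and the similarity function are passed in as callables, and the toy stand-ins (word sets and Jaccard overlap) are hypothetical placeholders for a real text-to-image model and a perceptual similarity metric.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")  # item in the source domain (e.g., an image)
Y = TypeVar("Y")  # candidate in the target domain (e.g., a caption)

def rank_by_cycle_consistency(
    source: X,
    candidates: List[Y],
    backward: Callable[[Y], X],
    similarity: Callable[[X, X], float],
) -> List[Tuple[Y, float]]:
    """Score each candidate by how well its reconstruction matches the source."""
    scored = [(c, similarity(source, backward(c))) for c in candidates]
    # Higher similarity means a more faithful round trip, so sort descending
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins (NOT real models): an "image" is a set of object names,
# the backward map turns a caption into the set of words it mentions,
# and similarity is Jaccard overlap between the two sets.
def toy_backward(caption: str) -> frozenset:
    return frozenset(caption.split())

def toy_similarity(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

image = frozenset({"snow", "street", "people"})
ranked = rank_by_cycle_consistency(
    image, ["snow street people", "street", "beach"], toy_backward, toy_similarity
)
```

The same function covers the text-to-image direction by swapping the roles: the source is a prompt, the candidates are generated images, the backward map is a captioner, and the similarity is computed between texts.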


Installation and Usage

Installation:

pip install cyclereward

Usage:

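from cyclereward import cyclereward
from PIL import Image
import torch

# Use a GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the reward model and its image preprocessing function
model, preprocess = cyclereward(device=device)

caption = "a photo of a cat"
# Preprocess the image and add a batch dimension
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
# Score the image-caption pair
score = model.score(image, caption)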
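
The abstract mentions using the reward model as a verifier for Best-of-N sampling. A minimal sketch of that selection step follows, assuming only a scoring callable with the same shape as model.score(image, caption) and that higher scores are better. The helper best_of_n and the toy scorer are hypothetical and are not part of the cyclereward package.

```python
from typing import Any, Callable, Sequence

def best_of_n(image: Any, captions: Sequence[str],
              score_fn: Callable[[Any, str], float]) -> str:
    """Return the candidate caption with the highest reward score."""
    if not captions:
        raise ValueError("need at least one candidate caption")
    return max(captions, key=lambda c: float(score_fn(image, c)))

# Toy scorer for illustration only: it simply rewards longer captions.
def toy_score(_image: Any, caption: str) -> float:
    return float(len(caption.split()))

choice = best_of_n(None, ["a cat", "a gray cat on a sofa", "cat"], toy_score)
```

In practice one would pass a wrapper around the loaded reward model as score_fn and the N sampled captions as candidates.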


CyclePrefDB Dataset

Our dataset of cycle consistency-based preferences consists of 398K comparison pairs for image-to-text generation and 468K pairs for text-to-image synthesis. For image-to-text generation, we observe that preferred and rejected texts vary in comprehensiveness, hallucination, and density. For text-to-image generation, we find that images containing fine-grained details produce better text reconstructions and are therefore more often preferred. You can find our dataset here: image-to-text, text-to-image
Image-to-text example (Example 0): 11 candidate captions for one input image, ordered from best (#1) to worst (#11) by the DreamSim similarity between the input image and the image reconstructed from each caption.

Candidate #1 (Best) caption:
The image depicts a bustling outdoor scene in a snowy town square, likely in a mountainous region, given the presence of snow-covered ground and trees. The setting appears to be a popular tourist destination, possibly a ski resort or a winter festival, given the presence of people dressed in winter clothing and the festive decorations. The square is lined with various shops and buildings,

DreamSim Similarity: 0.64
Captioning model: InternVL2-2B

Candidate #2 caption:
The image captures a lively scene in a European town square. The square is bustling with people, all bundled up in winter clothing, indicating a chilly day. The ground is covered in a layer of snow, adding to the wintry atmosphere. The square is lined with buildings on both sides, their architecture hinting at the town's European charm.

DreamSim Similarity: 0.60
Captioning model: llava-v1.6-mistral-7b-hf

Candidate #3 caption:
The image captures a bustling scene at the Winter Wonderland in Lake Geneva, Switzerland. The image is taken from a low angle, providing a unique perspective of the street. The street is filled with people, all dressed in winter clothing, walking in various directions, creating a lively atmosphere. The ground is covered in a blanket of snow, reflecting the overcast sky above. The buildings lining the

DreamSim Similarity: 0.59
Captioning model: llava-onevision-qwen2-0.5b-ov-hf

Candidate #4 caption:
a snowy street with people walking

DreamSim Similarity: 0.59
Captioning model: blip2-flan-t5-xxl

Candidate #5 caption:
The image captures a lively scene in a European town square, blanketed in a layer of fresh snow. The square is bustling with people, their winter attire a splash of color against the white landscape. They are seen walking in various directions, some carrying shopping bags, suggesting a day of holiday shopping. The square is framed by traditional European buildings, their architecture a testament to the region's

DreamSim Similarity: 0.56
Captioning model: llava-onevision-qwen2-7b-ov-hf

Candidate #6 caption:
The image depicts a bustling winter scene in a quaint, picturesque town square. The ground is covered with a thin layer of snow, indicating a cold, wintry day. The square is paved with cobblestones, and the snow is mostly melted, leaving patches of wetness on the ground. People are walking around, dressed in winter clothing such as coats,

DreamSim Similarity: 0.56
Captioning model: InternVL2-8B

Candidate #7 caption:
The image depicts a bustling street scene in a snowy, mountainous town. The street is lined with traditional wooden buildings, some of which have colorful facades and large windows. The architecture suggests a European alpine village, possibly in a region like the Alps. The buildings are adorned with snow, indicating a cold, winter climate. In the foreground, there

DreamSim Similarity: 0.55
Captioning model: InternVL2-40B

Candidate #8 caption:
The image depicts a bustling city street filled with people walking and enjoying the winter atmosphere. There are numerous individuals scattered throughout the scene, some closer to the foreground and others further back. A few people are carrying handbags, with one handbag located near the center of the scene and another one closer to the right side. Additionally, a

DreamSim Similarity: 0.54
Captioning model: llava-1.5-7b-hf

Candidate #9 caption:
The image depicts a bustling winter scene in what appears to be a small town or village. The setting is characterized by a cobblestone street that stretches through the center of the image, flanked by various buildings and shops. The street is covered with a light dusting of snow, indicating a cold, wintry climate. The sky is partly cloudy, with patches of

DreamSim Similarity: 0.53
Captioning model: InternVL2-26B

Candidate #10 caption:
The image captures a lively scene in a European town square. The square is bustling with people, their colorful winter coats adding a vibrant touch to the otherwise gray day. The square is paved with cobblestones, adding a rustic charm to the scene. In the center of the square, a large Christmas tree stands tall, adorned with twinkling lights that cast

DreamSim Similarity: 0.51
Captioning model: llava-v1.6-34b-hf

Candidate #11 (Worst) caption:
The image depicts a bustling city street filled with a large crowd of people walking down the sidewalk. The pedestrians are carrying various items, such as handbags and backpacks, as they navigate the busy street. Some people are walking in pairs or small groups, while others are walking individually. The street is lined with buildings,

DreamSim Similarity: 0.51
Captioning model: llava-1.5-13b-hf

Text-to-image example (Example 0): 12 candidate generated images for one input prompt, ordered from best (#1) to worst (#12) by the SBERT similarity between the input prompt and the text reconstructed from each image.

Input prompt: A small wooden dock with containers, potted plants, and a building with a blue porch in the background, surrounded by lush greenery and reflections on the water.

Candidate #1 (Best) text reconstruction:
The image features a small blue house situated on a dock, surrounded by a variety of potted plants. The house is positioned near the water, creating a serene and picturesque scene. There are several potted plants of different sizes and shapes, some placed close to the house and others scattered around the dock. The plants add a touch of greenery and life to the scene

SBERT similarity: 0.81
Text-to-image model: Stable Diffusion 1.5

Candidate #2 text reconstruction:
The image features a small blue house situated on a dock, surrounded by a serene environment. The house is positioned near a body of water, possibly a lake or a river. The house is surrounded by a lush green forest, adding to the picturesque setting. In the scene, there are several potted plants placed around the house, with one near the left

SBERT similarity: 0.75
Text-to-image model: Stable Diffusion XL

Candidate #3 text reconstruction:
The image features a small blue house situated on a pier overlooking a body of water. The house is surrounded by a variety of potted plants, adding a touch of greenery to the scene. There are at least six potted plants placed around the house, with some located closer to the water and others near the house itself. In addition to the house and plants,

SBERT similarity: 0.73
Text-to-image model: Stable Diffusion 3

Candidate #4 text reconstruction:
The image features a small wooden house situated on a pier overlooking a body of water. The house is painted blue and has a white roof, giving it a charming appearance. The pier extends out into the water, providing a serene and peaceful atmosphere. In the vicinity of the house, there are several potted plants scattered around, adding a touch of green

SBERT similarity: 0.72
Text-to-image model: Stable Diffusion XL (Turbo)

Candidate #5 text reconstruction:
The image features a small blue house situated on a body of water, possibly a lake or a river. The house is surrounded by a lush green forest, creating a serene and picturesque setting. The house is equipped with a dock, allowing for easy access to the water. In addition to the house, there are several potted plants scattered around the area,

SBERT similarity: 0.72
Text-to-image model: Stable Diffusion 1.5

Candidate #6 text reconstruction:
The image features a blue house situated on a pier overlooking a body of water. The house is surrounded by a variety of potted plants, creating a lush and vibrant atmosphere. There are at least 13 potted plants of different sizes and shapes, some placed closer to the house and others further away. In addition to the plants, there are several

SBERT similarity: 0.70
Text-to-image model: Stable Diffusion XL

Candidate #7 text reconstruction:
The image features a blue house situated on a pier overlooking a body of water. The house is surrounded by a variety of potted plants, creating a lush and vibrant atmosphere. There are at least 13 potted plants of different sizes and shapes, some placed closer to the house and others further away. In addition to the plants, there are two

SBERT similarity: 0.70
Text-to-image model: Stable Diffusion 3

Candidate #8 text reconstruction:
The image features a serene scene with a body of water, possibly a lake or a pond, surrounded by lush greenery. There are two blue shipping containers situated on the water, creating a unique and interesting sight. The containers are positioned close to each other, with one slightly larger than the other. In addition to the containers, there are two small

SBERT similarity: 0.70
Text-to-image model: Stable Diffusion 1.5

Candidate #9 text reconstruction:
The image features a serene scene with a wooden pier extending out into a body of water. On the pier, there are two small blue houses, one on the left side and the other on the right side. These houses are situated on stilts, providing a unique living experience. In addition to the houses, there are several potted plants scattered around the pier,

SBERT similarity: 0.69
Text-to-image model: Stable Diffusion XL (Turbo)

Candidate #10 text reconstruction:
The image features a small blue house situated on a pier overlooking a body of water. The house has a wooden deck with a few chairs and potted plants placed around it. There are three chairs in total, with one located near the left side of the house, another in the middle, and the third one on the right side. In addition to the

SBERT similarity: 0.69
Text-to-image model: Stable Diffusion 3

Candidate #11 text reconstruction:
The image features a small blue house situated on a wooden deck, surrounded by a variety of potted plants. The house is located near a body of water, possibly a lake or a river. The plants surrounding the house are of different sizes and types, adding a lush and vibrant atmosphere to the scene. In addition to the house and plants, there are several

SBERT similarity: 0.68
Text-to-image model: Stable Diffusion XL

Candidate #12 (Worst) text reconstruction:
The image features a small blue house situated on a pier over a body of water. The house is surrounded by a lush green forest, creating a serene and peaceful atmosphere. There are two chairs placed outside the house, one on the left side and the other on the right side, providing a comfortable spot for relaxation or enjoying the view. In addition

SBERT similarity: 0.65
Text-to-image model: Stable Diffusion XL (Turbo)
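
In the text-to-image direction, each candidate is scored by comparing the input prompt with its text reconstruction. A minimal sketch of such a comparison follows, assuming the SBERT similarity is a cosine similarity between sentence embeddings. That is an assumption: the embedding model and the exact similarity are not specified here. The function takes plain vectors, so any sentence encoder could supply them.

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Embedding the input prompt and each text reconstruction, then sorting candidates by this value in descending order, would reproduce a best-to-worst ordering like the one shown in the example.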




Acknowledgements

Research was sponsored by the Department of the Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was supported in part by a Packard Fellowship and a Sloan Research Fellowship to P.I., by the MIT-IBM Watson AI Lab, by the Sagol Weizmann-MIT Bridge Program, by ONR MURI grant N00014-22-1-2740, by the MIT-Google program for computing innovation, by the Amazon Science Hub, and by an MIT-GIST grant.


Bibtex

@article{bahng2025cyclereward,
 title={Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences},
 author={Bahng, Hyojin and Chan, Caroline and Durand, Fredo and Isola, Phillip},
 journal={arXiv preprint arXiv:2506.02095},
 year={2025}
}