I don't know if anyone else is facing this issue, but I see a mismatch in the order of captions between the provided reference "none_filled.json" and the "captions.xlsx" file I extracted from "NasrinImp/Final_defactify_test_new".
This leads to an erroneous score on CodaLab.
*HOW DO I KEEP THE ORDER THE SAME?*
Examples:
1. Last 10 captions from "none_filled.json":
"""
A bike parked on a sidewalk near a street.
A room with lots of seats next to an entrance.
A black and white cat is laying on a green pillow on top of a desk.
A black and white dog chasing sheep in a field.
there are many bikes parked outside a pizzeria
Warning signs outside a fence at a transit station
red traffic lights at an intersection on a well lighted street
A man in blue hat pointing towards a street.
The apples on the table are being peeled.
An intersection with traffic lights hanging above it.
A dead end street sign sitting next to abandoned furniture and garbage
"""
2. Last 10 captions from the "captions.xlsx" generated by "hf_test_data_download.py":
"""
A large airplane is on the airport runway.
A car, a moped, and a bike rider are stopped at an intersection.
Three red two deck buses along a city street.
People putting luggage onto a bus outside beside a building
A green rail bridge spanning over the width of a river.
A white bicycle parked next to a brown brick wall.
A train engine carrying carts down a track.
A person sitting on top of a bench next to a light pole.
A laptop computer is propped up on a cluttered desk.
a couple of men that are on some motor cycles
"""
#### hf_test_data_download.py ####
import os
from datasets import load_dataset
from PIL import Image
import pandas as pd
# Load the dataset
ds = load_dataset("data")
# Get the directory where the dataset is saved
dataset_dir = os.path.dirname(os.path.abspath(__file__)) # Get current script directory (assuming script is in the same directory as the dataset)
# Create directories for saving images and captions inside the dataset's directory
base_dir = os.path.join(dataset_dir, "Test")
os.makedirs(base_dir, exist_ok=True)
# List to hold the captions and their indices
captions_data = []
# Loop through each example in the dataset
for i, example in enumerate(ds['train']):
    # Extract the image and caption
    img = example['image']  # Assuming the image column is named 'image'
    caption = example['caption']
    # Create a filename for the image
    img_filename = f"image_{i}.jpg"
    img_path = os.path.join(base_dir, img_filename)
    # Save the image to the file system
    img.save(img_path)
    # Save the caption data for later use
    captions_data.append({'Index': i, 'Caption': caption})
    # Print progress every 50 images
    if (i + 1) % 50 == 0:
        print(f"Saved {i + 1} images...")
# Save the captions data to an Excel file
captions_df = pd.DataFrame(captions_data)
captions_df.to_excel('captions.xlsx', index=False)
print(f"All images and captions have been saved successfully in {base_dir}!")
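To narrow down where the two files diverge, something like the sketch below might help. It only compares two plain Python lists of caption strings. Loading them from "none_filled.json" and "captions.xlsx" is left out because the JSON layout is not shown here, and the function name `compare_caption_order` is made up for illustration.

```python
from collections import Counter


def compare_caption_order(reference, extracted):
    """Compare two caption lists.

    Returns (same_items, first_diff):
      same_items - True if both lists hold the same captions, ignoring order
      first_diff - index of the first position where they differ, or None
    """
    # Same captions with the same multiplicities, regardless of position
    same_items = Counter(reference) == Counter(extracted)

    # First index at which the two sequences disagree
    first_diff = next(
        (i for i, (r, e) in enumerate(zip(reference, extracted)) if r != e),
        None,
    )
    # If one list is a strict prefix of the other, they differ where the shorter ends
    if first_diff is None and len(reference) != len(extracted):
        first_diff = min(len(reference), len(extracted))

    return same_items, first_diff
```

If `same_items` is True but `first_diff` is not None, the two files contain the same captions in a different order, which would point to an ordering problem rather than missing samples.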
Posted by: skmalviya @ Dec. 5, 2024, 1:20 p.m.
Hi,
Can I also request that, once it is done, you please share the updated "none_filled.json" with us? It will help us make sure the order of samples in our output matches the reference.
Thanks.
Apologies for the delayed reply. Here is the link to the updated "new_none_filled.json" file: https://drive.google.com/file/d/1uihgASssS7-qx7DF1R8iMZJDX8dE2Epn/view.
This will help us ensure that the order of samples in both the output and the reference matches correctly.
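As a rough illustration of how the order could be matched on the participant side, the sketch below reorders output rows so that their captions follow the reference order. It assumes the captions have already been read into a Python list and that each output row is a dict with a 'Caption' key, as in the rows built by "hf_test_data_download.py". The function name `align_to_reference` is hypothetical, and matching on the caption text alone is an assumption that may not hold if the texts differ between the two files.

```python
from collections import defaultdict, deque


def align_to_reference(reference_captions, rows, key="Caption"):
    """Return rows reordered so that their captions follow reference_captions.

    Duplicate captions are matched in the order in which they appear in rows.
    Raises KeyError if a reference caption has no remaining matching row.
    """
    # Group rows by caption text; a deque keeps duplicates in arrival order
    buckets = defaultdict(deque)
    for row in rows:
        buckets[row[key].strip()].append(row)

    aligned = []
    for caption in reference_captions:
        queue = buckets.get(caption.strip())
        if not queue:
            raise KeyError(f"No remaining row for reference caption: {caption!r}")
        aligned.append(queue.popleft())
    return aligned
```

For example, with rows in the order "b", "a" and a reference order of "a", "b", the function returns the "a" row first. Rows whose captions do not occur in the reference are simply left out of the result.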
If you need any further assistance, feel free to reach out!
Posted by: RajRoy1243 @ Dec. 8, 2024, 10:30 a.m.