Getting Context: The Unseen Foundation of Accurate Scanlation

In scanlation, the journey from raw manga pages to a polished, translated chapter involves more than just cleaning images and swapping text. One of the most critical, yet often invisible, steps is getting context. Without a deep understanding of who is speaking, what they're referring to, and the correct flow of dialogue, even the most technically perfect image can lead to a clumsy or inaccurate translation. This stage is all about building that essential framework for clarity and precision.

What is Context in Scanlation?

At its heart, "context" in scanlation refers to understanding the narrative landscape of a manga page. This includes:

  • Who is speaking in each bubble, and whether the text is spoken dialogue, inner thought, or narration.
  • Which named entities (characters, locations, organizations, special terms) appear, and how their names should be rendered in English.
  • The correct reading order of panels and speech bubbles, so the dialogue flows as the author intended.

Why is Context so Important?

Simply put, you can't translate correctly without knowing the context. Imagine trying to translate a conversation without knowing who is speaking, or reading dialogue out of order. The result would be nonsensical, potentially changing plot points, character relationships, and emotional beats. Context ensures that the translated text accurately reflects the original intent, character personalities, and narrative progression. It's the difference between a rough transliteration and a truly immersive reading experience.

Manga's Unique Contextual Challenges

Manga presents specific challenges for context extraction that differ from standard text-based stories:

  • Pages read right-to-left and top-to-bottom, and panel layouts are often dynamic, with panels overlapping or bleeding into one another.
  • Dialogue is scattered across many speech bubbles, and a single sentence can be split across two or more bubbles.
  • The speaker usually has to be inferred visually, from bubble tails, bubble shape (speech, thought, narration, shouting), and character proximity, rather than from the text alone.
  • Japanese frequently omits subjects and pronouns, so an accurate translation depends on knowing exactly who is speaking to whom.

Leveraging AI and Automation to "Extract" Context

Manually identifying every character and determining the reading order for hundreds of speech bubbles across a chapter would be incredibly time-consuming. This is where AI and automation become indispensable tools, helping us "extract" context efficiently.

1. Identifying "Who" is There: Named Entity Recognition (NER)

The first step in understanding "who said what" is to identify the named entities (characters, locations, organizations) present in the chapter. This also establishes how each name should be rendered in English, something AI models often get wrong when given no additional context.

We use AI to perform Named Entity Recognition (NER) on the Japanese OCR text. The AI's task is to pinpoint specific names and terms.


NER_PROMPT = """You are an expert assistant specialized in manga translation and analysis. Your task is to perform Named Entity Recognition (NER) on a given chapter of Japanese manga text and suggest appropriate English translations for the identified entities for human review.

**Input:**
You will receive text extracted from a manga chapter. The text is formatted as a series of lines, each representing a speech bubble or text box, like this:
`ID {{id}}: {{japanese_text}}`
Where `{{id}}` is the unique identifier for the text's source bubble/box, and `{{japanese_text}}` is the original Japanese content.

**Instructions:**
1.  Carefully read through all the provided Japanese text lines.
2.  Identify all named entities present within the text. Focus on entities such as:
    * **Persons:** Character names, titles, aliases.
    * **Locations:** Cities, regions, countries, specific buildings, landmarks, or fictional places.
    * **Organizations:** Groups, teams, companies, institutions, factions.
    * **Items/Concepts (Optional but helpful):** Significant objects, named techniques/attacks, special abilities, unique terms specific to the manga's world (if easily identifiable as named entities).
3.  For each **unique** named entity you identify:
    * Note the `id` of the *first* line (speech bubble/box) where this specific entity appears in the input text.
    * Record the exact original Japanese text of the named entity (`named_entity_japanese`).
    * Provide a contextually appropriate and plausible English translation (`named_entity_english_translation`). Aim for consistency if the entity appears multiple times, but base the output entry on its first appearance.
    * Translate names using their Japanese readings. Do not use Chinese readings.

**Begin Input Japanese text:**
{input_data}
**End Input Japanese text**

**Begin Common names:**
{common_ner}
**End Common names**
"""
        

After the AI identifies these named entities, a crucial manual verification step follows. We cross-reference the suggested English names with established translations found on official manga wikis, such as Kingdom Wiki or Captain Tsubasa Wiki. This ensures consistent and accurate naming throughout the scanlation.
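
For illustration, here is a minimal sketch of how the prompt above might be filled in and sent off. The `ocr.csv` path, the `common_ner` seed value, and the `call_llm` client are placeholders for whatever OCR output file and LLM SDK you actually use:


import pandas as pd

def build_ner_input(ocr_csv_path):
    # Format each OCR row as "ID {id}: {japanese_text}", the layout the prompt expects.
    df = pd.read_csv(ocr_csv_path, encoding='utf-8')
    return "\n".join(f"ID {idx}: {row['text']}" for idx, row in df.iterrows())

input_data = build_ner_input("ocr.csv")   # placeholder path
common_ner = "信 -> Shin"                  # previously verified names, if any
prompt = NER_PROMPT.format(input_data=input_data, common_ner=common_ner)
# ner_suggestions = call_llm(prompt)  # call_llm stands in for your LLM client of choice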

2. Determining Reading Order: Algorithmic Panel and Bubble Sorting

Correct reading order is paramount for flow. Manga layouts can be incredibly dynamic, with panels and bubbles overlapping or being arranged in non-linear patterns to convey motion or emotional intensity. Our system employs a sophisticated algorithm to determine this order:

The process roughly follows these steps:

  1. Panel Detection and Merging:
    • The algorithm first identifies individual panels on the page using image processing techniques (like thresholding and contour detection).
    • It then filters out very small or nested contours to ensure only distinct, meaningful panels are considered.
    • A crucial step involves merging overlapping panels. Manga often uses a "borderless" or "dynamic" layout where panels bleed into each other or overlap to create a sense of action or depth. The algorithm merges these visually connected panels into single logical units.
  2. Recursive Row/Column Grouping:
    • The algorithm recursively groups panels into rows or columns. It starts by attempting to group panels horizontally (into rows).
    • If a row contains multiple panels, it then recursively applies column grouping to those panels to sort them right-to-left within that row.
    • If a column contains multiple panels, it recursively applies row grouping to sort them top-to-bottom within that column. This "toggling" between row and column grouping allows it to handle complex, nested layouts.
  3. Bubble-to-Panel Assignment and Sorting:
    • Once the panels are ordered, each individual speech bubble (from the OCR output, which includes their bounding box coordinates) is assigned to its enclosing panel.
    • Bubbles within each panel are then sorted based on their proximity to the panel's top-right corner, respecting the right-to-left, top-to-bottom reading convention within a panel.

The Python code snippet below outlines the core logic behind these operations, ensuring each bubble gets a precise reading order:


import os
import cv2
import numpy as np
from PIL import Image
import math
import pandas as pd

def calculate_bbox_intersection_area(bbox1, bbox2):
    if not isinstance(bbox1, dict) or not isinstance(bbox2, dict):
        return 0
    xmin1, ymin1, xmax1, ymax1 = bbox1.get('xmin', 0), bbox1.get('ymin', 0), bbox1.get('xmax', 0), bbox1.get('ymax', 0)
    xmin2, ymin2, xmax2, ymax2 = bbox2.get('xmin', 0), bbox2.get('ymin', 0), bbox2.get('xmax', 0), bbox2.get('ymax', 0)
    inter_xmin = max(xmin1, xmin2)
    inter_ymin = max(ymin1, ymin2)
    inter_xmax = min(xmax1, xmax2)
    inter_ymax = min(ymax1, ymax2)
    inter_width = max(0, inter_xmax - inter_xmin)
    inter_height = max(0, inter_ymax - inter_ymin)
    return inter_width * inter_height

def get_bounding_box(box_points):
    x_coords = [p[0] for p in box_points]
    y_coords = [p[1] for p in box_points]
    xmin, xmax = min(x_coords), max(x_coords)
    ymin, ymax = min(y_coords), max(y_coords)
    width = xmax - xmin
    height = ymax - ymin
    center_x = xmin + width / 2
    center_y = ymin + height / 2
    return {
        'xmin': xmin, 'ymin': ymin, 'xmax': xmax, 'ymax': ymax,
        'width': width, 'height': height,
        'center_x': center_x, 'center_y': center_y,
        'box_points': box_points
    }

# Detect candidate panels via Otsu thresholding and external contours,
# then drop tiny contours and boxes nested inside larger panels.
def detect_panels_filtered(image_path, min_area=10000, nested_threshold=0.80):
    image = cv2.imread(image_path, 0)
    if image is None:
        print(f"Error: Image not found at {image_path}")
        return None, []
    panel_image_vis = cv2.cvtColor(image, cv2.COLOR_GRAY2BGR)
    ret, mask = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    initial_candidates = []
    for i, contour in enumerate(contours):
        area = cv2.contourArea(contour)
        if area < min_area:
            continue
        try:
            rect = cv2.minAreaRect(contour)
            box_points = np.int0(cv2.boxPoints(rect))
            bbox = get_bounding_box(box_points)
            if bbox['width'] <= 0 or bbox['height'] <= 0:
                continue
            initial_candidates.append({
                'temp_id': i,
                'contour': contour,
                'area': area,
                'bbox': bbox,
                'box_points': box_points
            })
        except Exception as e:
            print(f"Warning: Could not process contour {i}. Error: {e}")
            continue
    indices_to_remove = set()
    num_candidates = len(initial_candidates)
    for i in range(num_candidates):
        if i in indices_to_remove:
            continue
        panel_i = initial_candidates[i]
        bbox_i = panel_i['bbox']
        area_i = panel_i['area']
        epsilon = 1e-6
        for j in range(i + 1, num_candidates):
            if j in indices_to_remove:
                continue
            panel_j = initial_candidates[j]
            bbox_j = panel_j['bbox']
            area_j = panel_j['area']
            if not (bbox_i['xmin'] < bbox_j['xmax'] and bbox_i['xmax'] > bbox_j['xmin'] and \
                    bbox_i['ymin'] < bbox_j['ymax'] and bbox_i['ymax'] > bbox_j['ymin']):
                continue
            intersection_area = calculate_bbox_intersection_area(bbox_i, bbox_j)
            if area_i > epsilon and (intersection_area / area_i) > nested_threshold and area_i < area_j:
                indices_to_remove.add(i)
                break
            if area_j > epsilon and (intersection_area / area_j) > nested_threshold and area_j < area_i:
                indices_to_remove.add(j)
    final_panel_data_list = []
    final_id_counter = 0
    for i in range(num_candidates):
        if i not in indices_to_remove:
            panel_data = initial_candidates[i]
            panel_data['id'] = final_id_counter
            final_panel_data_list.append(panel_data)
            final_id_counter += 1
    print(f"Detected {len(contours)} raw contours, {len(initial_candidates)} passed min area, kept {len(final_panel_data_list)} panels after nested filtering.")
    for panel_data in final_panel_data_list:
        box = panel_data['box_points']
        bbox = panel_data['bbox']
        panel_id = panel_data['id']
        cv2.polylines(panel_image_vis, [box], True, (0, 0, 255), 2)
        cv2.putText(panel_image_vis, str(panel_id),
                    (int(bbox['center_x'] - 10), int(bbox['center_y'] + 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    return image, final_panel_data_list

def euclidean_distance(pt1, pt2):
    x1, y1 = pt1
    x2, y2 = pt2
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5

def order_corners(corners):
    corners = sorted(corners, key=lambda corner: (corner[1], corner[0]))
    top_corners = sorted(corners[0:2], key=lambda corner: corner[0])
    bottom_corners = sorted(corners[2:4], key=lambda corner: corner[0])
    return top_corners[0], top_corners[1], bottom_corners[1], bottom_corners[0]

def transform_panel(image, corners):
    top_left, top_right, bottom_right, bottom_left = order_corners(corners)
    widthA = euclidean_distance(bottom_right, bottom_left)
    widthB = euclidean_distance(top_right, top_left)
    maxWidth = max(int(widthA), int(widthB))
    heightA = euclidean_distance(top_right, bottom_right)
    heightB = euclidean_distance(top_left, bottom_left)
    maxHeight = max(int(heightA), int(heightB))
    source_coordinates = np.float32([
        top_left, top_right, bottom_right, bottom_left
    ])
    target_coordinates = np.float32([
        [0, 0],
        [maxWidth - 1, 0],
        [maxWidth - 1, maxHeight - 1],
        [0, maxHeight - 1]
    ])
    perspective_matrix = cv2.getPerspectiveTransform(
        source_coordinates, target_coordinates
    )
    return cv2.warpPerspective(image, perspective_matrix, (maxWidth, maxHeight))

# Repeatedly merge panels whose bounding boxes overlap by more than
# min_intersection_area pixels, so borderless/dynamic layouts collapse
# into single logical panels (the merged box is the union of the group).
def merge_overlapping_panels(panel_list, min_intersection_area=10):
    if not panel_list:
        return []
    current_panels = list(panel_list)
    while True:
        merged_made_in_pass = False
        next_pass_panels = []
        num_current = len(current_panels)
        processed_indices_this_pass = set()
        for i in range(num_current):
            if i in processed_indices_this_pass:
                continue
            panel_i_data = current_panels[i]
            bbox_i = panel_i_data.get('bbox')
            if not bbox_i:
                continue
            group_to_merge_indices = {i}
            potential_group_panels = [panel_i_data]
            for j in range(num_current):
                if i == j:
                    continue
                if j not in processed_indices_this_pass:
                    panel_j_data = current_panels[j]
                    bbox_j = panel_j_data.get('bbox')
                    if not bbox_j:
                        continue
                    intersection_area = calculate_bbox_intersection_area(bbox_i, bbox_j)
                    if intersection_area > min_intersection_area:
                        overlaps_with_group = False
                        for p_in_group in potential_group_panels:
                            if calculate_bbox_intersection_area(p_in_group['bbox'], bbox_j) > min_intersection_area:
                                overlaps_with_group = True
                                break
                        if overlaps_with_group:
                            group_to_merge_indices.add(j)
                            potential_group_panels.append(panel_j_data)
            if len(group_to_merge_indices) > 1:
                merged_made_in_pass = True
                all_bboxes_in_group = [current_panels[idx]['bbox'] for idx in group_to_merge_indices]
                union_xmin = min(b['xmin'] for b in all_bboxes_in_group)
                union_ymin = min(b['ymin'] for b in all_bboxes_in_group)
                union_xmax = max(b['xmax'] for b in all_bboxes_in_group)
                union_ymax = max(b['ymax'] for b in all_bboxes_in_group)
                merged_bbox = {
                    'xmin': union_xmin, 'ymin': union_ymin,
                    'xmax': union_xmax, 'ymax': union_ymax,
                    'width': union_xmax - union_xmin,
                    'height': union_ymax - union_ymin,
                    'center_x': (union_xmin + union_xmax) / 2,
                    'center_y': (union_ymin + union_ymax) / 2,
                    'box_points': None
                }
                original_ids = [current_panels[idx].get('id', current_panels[idx].get('temp_id')) for idx in group_to_merge_indices]
                merged_panel = {
                    'bbox': merged_bbox,
                    'contour': None,
                    'box_points': None,
                    'merged_from_ids': [id for id in original_ids if id is not None]
                }
                next_pass_panels.append(merged_panel)
                processed_indices_this_pass.update(group_to_merge_indices)
            elif i not in processed_indices_this_pass:
                next_pass_panels.append(panel_i_data)
                processed_indices_this_pass.add(i)
        current_panels = next_pass_panels
        if not merged_made_in_pass:
            break
    final_merged_panels = []
    for idx, panel_data in enumerate(current_panels):
        panel_data['id'] = idx
        final_merged_panels.append(panel_data)
    return final_merged_panels

def detect_panels(image_path, min_area=5000, nested_threshold=0.80, merge_overlap_pixels=10):
    image, filtered_panels = detect_panels_filtered(image_path, min_area, nested_threshold)
    if image is None or not filtered_panels:
        print("No panels found after initial detection and nested filtering.")
        return image, []
    print(f"\nAttempting to merge {len(filtered_panels)} panels based on overlap (min intersection: {merge_overlap_pixels} pixels)...")
    merged_panel_list = merge_overlapping_panels(filtered_panels, merge_overlap_pixels)
    print(f"Resulted in {len(merged_panel_list)} panels after merging.")
    return image, merged_panel_list

# Group panels into horizontal rows, using vertical overlap and a tolerance
# derived from the average panel height.
def group_by_row_recursive(panel_group, y_tolerance_ratio=0.2):
    if not panel_group:
        return []
    panel_heights = [p['bbox']['height'] for p in panel_group if p['bbox']['height'] > 0]
    if not panel_heights:
        avg_height = 50
    else:
        avg_height = np.mean(panel_heights)
    if avg_height is None or np.isnan(avg_height) or avg_height <= 0:
        avg_height = 50
    y_tolerance = avg_height * y_tolerance_ratio
    sorted_panels = sorted(panel_group, key=lambda p: (p['bbox']['ymin'], p['bbox']['xmin']))
    rows = []
    current_row = []
    current_row_ymin = -1
    current_row_ymax = -1
    for panel in sorted_panels:
        bbox = panel['bbox']
        if not current_row:
            current_row.append(panel)
            current_row_ymin = bbox['ymin']
            current_row_ymax = bbox['ymax']
        else:
            starts_near_row_top = abs(bbox['ymin'] - current_row_ymin) < y_tolerance
            overlaps_row_span = max(current_row_ymin, bbox['ymin']) < min(current_row_ymax, bbox['ymax'])
            starts_near_row_bottom = bbox['ymin'] < current_row_ymax + y_tolerance
            if starts_near_row_top or (overlaps_row_span and starts_near_row_bottom):
                current_row.append(panel)
                current_row_ymin = min(current_row_ymin, bbox['ymin'])
                current_row_ymax = max(current_row_ymax, bbox['ymax'])
            else:
                if current_row:
                    rows.append(current_row)
                current_row = [panel]
                current_row_ymin = bbox['ymin']
                current_row_ymax = bbox['ymax']
    if current_row:
        rows.append(current_row)
    return rows

# Group panels into columns scanned right-to-left, using horizontal overlap
# and a tolerance derived from the average panel width.
def group_by_column_recursive(panel_group, x_tolerance_ratio=0.1):
    if not panel_group:
        return []
    panel_widths = [p['bbox']['width'] for p in panel_group if p['bbox']['width'] > 0]
    if not panel_widths:
        avg_width = 50
    else:
        avg_width = np.mean(panel_widths)
    if avg_width is None or np.isnan(avg_width) or avg_width <= 0:
        avg_width = 50
    x_tolerance = avg_width * x_tolerance_ratio
    sorted_panels = sorted(panel_group, key=lambda p: (p['bbox']['xmax'], p['bbox']['ymin']), reverse=True)
    columns = []
    current_column = []
    current_column_xmin = float('inf')
    current_column_xmax = -1
    for panel in sorted_panels:
        bbox = panel['bbox']
        if not current_column:
            current_column.append(panel)
            current_column_xmin = bbox['xmin']
            current_column_xmax = bbox['xmax']
        else:
            starts_near_column_right = abs(bbox['xmax'] - current_column_xmax) < x_tolerance
            overlaps_column_span = max(current_column_xmin, bbox['xmin']) < min(current_column_xmax, bbox['xmax'])
            starts_near_column_left = bbox['xmax'] > current_column_xmin - x_tolerance
            if starts_near_column_right or (overlaps_column_span and starts_near_column_left):
                current_column.append(panel)
                current_column_xmin = min(current_column_xmin, bbox['xmin'])
                current_column_xmax = max(current_column_xmax, bbox['xmax'])
            else:
                if current_column:
                    columns.append(current_column)
                current_column = [panel]
                current_column_xmin = bbox['xmin']
                current_column_xmax = bbox['xmax']
    if current_column:
        columns.append(current_column)
    return columns

# Order panels by recursively alternating row (top-to-bottom) and column
# (right-to-left) grouping; returns the panel ids in reading order.
def order_panels_recursive(panel_group, group_type="row"):
    if len(panel_group) == 1:
        return [panel_group[0]['id']]
    if not panel_group:
        return []
    ordered_panel_ids = []
    if group_type == "row":
        row_subgroups = group_by_row_recursive(panel_group)
        toggled_type = "column"
        for row_group in row_subgroups:
            ordered_panel_ids.extend(order_panels_recursive(row_group, toggled_type))
    elif group_type == "column":
        column_subgroups = group_by_column_recursive(panel_group)
        toggled_type = "row"
        for column_group in column_subgroups:
            ordered_panel_ids.extend(order_panels_recursive(column_group, toggled_type))
    else:
        print(f"Error: Unknown group type '{group_type}'")
        return [p['id'] for p in panel_group]
    return ordered_panel_ids

def get_reading_order(image_path, output_path):
    original_image, panel_data_list = detect_panels(image_path, min_area=5000)
    ordered_indices = order_panels_recursive(panel_data_list, group_type="row")
    print(f"Final calculated order: {ordered_indices}")
    panels_by_id = {p['id']: p for p in panel_data_list}
    final_ordered_image = cv2.cvtColor(original_image, cv2.COLOR_GRAY2BGR)
    count = 1
    for panel_id in ordered_indices:
        panel_data = panels_by_id[panel_id]
        bbox = panel_data['bbox']
        cv2.putText(final_ordered_image, str(count),
                    (int(bbox['center_x'] - 10), int(bbox['center_y'] + 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.5, (255, 0, 255), 3)
        count += 1
    cv2.imwrite(output_path, final_ordered_image)

# For each page: detect and order panels, assign every OCR bubble (by its center)
# to its enclosing panel, sort bubbles within a panel by distance to the panel's
# top-right corner, and write page:panel:bubble reading-order strings to the table.
def assign_reading_order_to_bubbles(input_csv_path, image_dir, output_csv_path,
                                      min_panel_area=5000, nested_threshold=0.80,
                                      merge_overlap_pixels=10):
    df = pd.read_csv(input_csv_path, encoding='utf-8')
    print(f"Read {len(df)} rows from {input_csv_path}")
    reading_orders = {}
    grouped = df.groupby('filename')
    page_counter = 0
    reading_order_output_dir = os.path.join(os.path.dirname(output_csv_path), 'reading_order')
    if not os.path.exists(reading_order_output_dir):
        os.makedirs(reading_order_output_dir)
        print(f"Created directory: {reading_order_output_dir}")
    for filename, group in grouped:
        page_counter += 1
        print(f"\nProcessing Page {page_counter}: {filename}...")
        image_path = os.path.join(image_dir, filename)
        image = cv2.imread(image_path)
        _, panel_data_list = detect_panels(image_path, min_panel_area,
                                            nested_threshold, merge_overlap_pixels)
        ordered_panel_indices = order_panels_recursive(panel_data_list, group_type="row")
        panel_order_map = {panel_id: order + 1 for order, panel_id in enumerate(ordered_panel_indices)}
        panels_by_id = {p['id']: p for p in panel_data_list}
        panel_bubbles = {panel_id: [] for panel_id in panel_order_map.keys()}
        unassigned_bubbles = []
        bubble_assignments = {}
        for idx, row in group.iterrows():
            sx, sy, ex, ey = row['startx'], row['starty'], row['endx'], row['endy']
            cx = (sx + ex) / 2
            cy = (sy + ey) / 2
            assigned = False
            for panel_id, panel_data in panels_by_id.items():
                bbox = panel_data.get('bbox')
                if not bbox:
                    continue
                if cx >= bbox['xmin'] and cx < bbox['xmax'] and cy >= bbox['ymin'] and cy < bbox['ymax']:
                    bubble_data = {
                        'original_index': idx,
                        'center_x': cx,
                        'center_y': cy,
                        'text': row['text']
                    }
                    panel_bubbles[panel_id].append(bubble_data)
                    bubble_assignments[idx] = panel_id
                    assigned = True
                    break
            if not assigned:
                unassigned_bubbles.append({
                    'original_index': idx,
                    'center_x': cx,
                    'center_y': cy,
                    'text': row['text']
                })
        if unassigned_bubbles:
            print(f"Warning: Found {len(unassigned_bubbles)} bubbles not assigned to any panel for {filename}.")
        reading_orders_for_page = {}
        for panel_read_order, panel_id in enumerate(ordered_panel_indices, 1):
            bubbles_in_panel = panel_bubbles.get(panel_id, [])
            current_panel_data = panels_by_id.get(panel_id)
            if not current_panel_data or 'bbox' not in current_panel_data:
                continue
            panel_bbox = current_panel_data['bbox']
            panel_top_right_corner = (panel_bbox['xmax'], panel_bbox['ymin'])
            sorted_bubbles = sorted(
                bubbles_in_panel,
                key=lambda b: euclidean_distance(
                    (b['center_x'], b['center_y']),
                    panel_top_right_corner
                )
            )
            for bubble_read_order, bubble_data in enumerate(sorted_bubbles, 1):
                final_order_str = f"{page_counter:02}:{panel_read_order:02}:{bubble_read_order:02}"
                reading_orders[bubble_data['original_index']] = final_order_str
                reading_orders_for_page[bubble_data['original_index']] = final_order_str
                bubble_center_x = int(bubble_data['center_x'])
                bubble_center_y = int(bubble_data['center_y'])
                cv2.putText(image, str(f"{panel_read_order}:{bubble_read_order}"),
                            (bubble_center_x - 15, bubble_center_y + 15),
                            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
        for i, bubble_data in enumerate(unassigned_bubbles, 1):
            final_order_str = f"{page_counter:02}:-1:{i:02}"
            reading_orders[bubble_data['original_index']] = final_order_str
            reading_orders_for_page[bubble_data['original_index']] = final_order_str
            bubble_center_x = int(bubble_data['center_x'])
            bubble_center_y = int(bubble_data['center_y'])
            cv2.putText(image, str(i),
                        (bubble_center_x - 15, bubble_center_y + 15),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
        output_image_path = os.path.join(reading_order_output_dir, f"reading_order_{filename}")
        cv2.imwrite(output_image_path, image)
        print(f"Saved image with reading order numbers to: {output_image_path}")
    df['reading_order'] = df.index.map(reading_orders)
    df['reading_order'] = df['reading_order'].fillna("ERROR:PROCESSING_FAILED")
    try:
        output_dir = os.path.dirname(output_csv_path)
        if output_dir and not os.path.exists(output_dir):
            os.makedirs(output_dir)
            print(f"Created output directory: {output_dir}")
        df.to_csv(output_csv_path, index=False, encoding='utf-8')  # write CSV to match the path and messages (use to_excel for an .xlsx export)
        print(f"\nSuccessfully saved output CSV with reading order to: {output_csv_path}")
    except Exception as e:
        print(f"Error saving output CSV: {e}")
        

3. Identifying "Who Said What" and Getting the "Draft Translation": Vision Language Models

Once we have a refined list of potential characters and their canonical names, we use a Vision Language Model (VLM) to determine the speaker for each speech bubble. A VLM is an AI model that can understand both images and text, making it perfect for analyzing manga pages.

We feed the VLM the OCR'd text (from a previous step) and the image of the manga page. The model then analyzes the visual cues (tails of speech bubbles, character proximity, type of bubble) along with the dialogue content to identify the speaker. This is an example prompt for translating the Kingdom manga. It will require modifications to be suitable for other manga titles.


PROMPT_KINGDOM = """You are an expert manga translator specializing in Kingdom, the historical manga by Yasuhisa Hara. Your mission is to translate manga pages from Japanese to English, preserving the spirit and cultural context of the original work.

You will receive manga page images and their corresponding OCR text. Your tasks are:

1. OCR Text Refinement ("japanese" field):
- Carefully examine the provided OCR text, line by line, and compare it to the visual representation of the original page. Identify and correct any errors in the text itself.
- Pay particular attention to potential OCR mistakes involving Japanese characters and punctuation, as these are common sources of error.
- Crucially, DO NOT change the order of the lines of text. The lines are arranged in the correct reading order, from top to bottom, as reviewed by a human.
- Your task is solely to correct the text within each individual line. Do not add, remove, or reorder any of the lines themselves.

2. Identify who is speaking ("character" field):
- Analyze the page visually to determine which character is speaking in each speech bubble. Use the image and content of the speech to identify the speaker accurately.
- If there is only one character in the speech bubble, usually, it is the character who is speaking.
- Standard Speech Bubble: It's usually round or oval with a tail pointing directly at the speaker. This type is used for regular dialogue.
- Thought Bubble: These are typically cloud-shaped, they might have a series of small circles leading to the character's head. They represent the character's internal thoughts, which are not spoken aloud.
- Narration Box: These are rectangular boxes that provide context or narration. They are not spoken by any character but rather describe the scene or provide background information. Use "Narrator" in this case.
- Shouting Speech Bubble: These bubbles often have jagged edges, sharp points, or are drawn with thicker lines to convey a loud, angry, or excited tone. Use "Shouting" in this case if you can't infer the character.
- If the speaker is not clear, use "Unknown" as a placeholder. However, if you can identify the speaker with confidence, provide their name in the "character" field.
- Kingdom uses many extra characters, for those characters, you can use "Extra Character" as the character name.

3. Context-Aware Manga Translation ("english" field):
- Translate the Japanese text within each speech bubble into fluent, natural English. Avoid word-for-word translations. Aim for idiomatic English that resonates with the tone and style of Kingdom.
- Prioritize contextual understanding. Analyze each speech bubble in relation to:
  + Its neighbors: Consider the flow of conversation and dialogue. Pay attention to sentences that are spread over multiple speech bubbles. If a character's speech is split into two or more bubbles, ensure the translation reflects the continuity of speech.
  + The overall page content: Understand the scene, character emotions, who says what, and plot progression depicted on the page.
- Capture nuances and cultural context. Be mindful of subtle implications, unspoken meanings, and Japanese cultural references embedded within the dialogue.
- If a character's speech is split into two or more bubbles (usually when the previous Japanese text ends with a comma), ensure the translation reflects the continuity of speech.
- Translate names using their established Japanese readings, as is convention in manga translations. Verify that the extracted names accurately reflect the furigana (Japanese reading aids) present in the images. The furigana provides the intended pronunciation, so ensure strict consistency.
- Remember to choose words that fit the period and setting of the manga. Kingdom is set in ancient China, so avoid modern slang or expressions that would not fit the historical context. For example, use "chancellor" instead of "prime minister."
- Retain Japanese honorifics if they are there in the Japanese text (e.g., "sama," "dono," "sensei") to preserve the nuances of social hierarchy and character relationships.
- Be careful when translating pronouns. In Japanese, the subject is often omitted, so you need to infer the correct pronoun based on the context.

4. Output: a list of SpeechBubble objects, each containing:
- The "japanese" field: the Japanese text from the corresponding speech bubble.
- The "english" field: the corresponding English translation.
- The "character" field: the character who is speaking.

### Begin OCR text
{text}
### End OCR text

### Begin Translation of names in this chapter
{ner}
### End Translation of names in this chapter
"""
        

This prompt guides the VLM not only to identify speakers (using defined bubble types like Standard, Thought, Narration, Shouting) but also to refine the OCR text and provide a draft English translation, all while considering the intricate context of a manga page.
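
As a rough sketch of the call itself: the `SpeechBubble` schema below mirrors the output fields described in the prompt, while `call_vlm` is a placeholder for whichever multimodal SDK you use, so treat the signature as illustrative rather than as the actual pipeline code.


from pydantic import BaseModel

class SpeechBubble(BaseModel):
    japanese: str   # corrected OCR text for the bubble
    english: str    # context-aware draft translation
    character: str  # speaker name, "Narrator", "Shouting", or "Unknown"

def draft_translate_page(image_path, ocr_lines, ner_table):
    """Assemble the Kingdom prompt for one page and (hypothetically) call a VLM."""
    # ocr_lines: bubble texts for this page, already in verified reading order.
    # ner_table: the human-reviewed name translations from the NER step.
    prompt = PROMPT_KINGDOM.format(text="\n".join(ocr_lines), ner=ner_table)
    # Hypothetical multimodal call: send the page image together with the prompt
    # and request structured output matching list[SpeechBubble].
    # return call_vlm(image=image_path, prompt=prompt, schema=list[SpeechBubble])
    raise NotImplementedError("plug in your VLM client here")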

The Enhanced Output: A Richer Translation Foundation

After all these context-gathering steps (NER, character identification, and reading order), the initial CSV file generated by OCR is significantly enriched. What you get now is typically an Excel (or CSV) file with the following crucial columns, providing a comprehensive foundation for the translator and typesetter:

  • filename: The original image file name.
  • startx: The X-coordinate of the top-left corner of the detected text bubble's bounding box.
  • starty: The Y-coordinate of the top-left corner of the detected text bubble's bounding box.
  • endx: The X-coordinate of the bottom-right corner of the detected text bubble's bounding box.
  • endy: The Y-coordinate of the bottom-right corner of the detected text bubble's bounding box.
  • text: The refined Japanese text string extracted by the OCR engine.
  • reading_order: A new column giving each bubble's precise position in the chapter's reading order, formatted as page:panel:bubble (e.g., 01:03:01). This ensures a smooth flow for the reader.
  • character: A new column naming the identified speaker for each bubble (e.g., "Shin", "Kyou Kai", "Narrator", "Shouting"). This is vital for accurate translation and character consistency.
  • english_draft_translation: A new column containing the initial, context-aware English draft translation produced by the VLM. This gives the human translator a strong starting point and significantly speeds up their work.

This comprehensive output transforms raw OCR data into a structured, context-rich document, empowering translators to focus on linguistic nuances rather than logistical puzzles.
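
As a small usage sketch, a translator or typesetter can consume the enriched file like this (the file path is illustrative; swap `read_csv` for `read_excel` if you export to Excel instead):


import pandas as pd

# Walk the bubbles page by page in reading order, exactly as the reader would see them.
df = pd.read_csv("chapter_with_context.csv")   # illustrative path
df = df.sort_values("reading_order")           # zero-padded "page:panel:bubble" strings sort correctly

for filename, page in df.groupby("filename", sort=False):
    print(f"--- {filename} ---")
    for _, row in page.iterrows():
        print(f"[{row['reading_order']}] {row['character']}: {row['english_draft_translation']}")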

Conclusion

Getting context is the silent hero of scanlation. By meticulously identifying named entities, attributing dialogue to characters, and establishing the correct reading order, we move beyond mere text replacement to deliver a truly authentic and enjoyable manga experience. This blend of intelligent automation and human oversight ensures that every panel and every line of dialogue resonates as the original author intended, bridging the gap between Japanese artistry and a global audience.