Seedling — actively sprouting

Marvel vs. DC: Aesthetic and tonal identity across eras. An ML study.

You merely adopted the style. I was born in it.

Tags: machine learning · multimodal · comics · portfolio
The question

Marvel and DC are the two defining institutions of the comic book world, each with decades of accumulated visual and narrative style: distinct color palettes, page geometry, and editorial voice. Avid fan bases hold strong preferences for one publisher over the other (DC is generally perceived as darker and grittier, with more powerful characters), and I can usually articulate which one I'm in the mood for and why. But art styles and writing tones shift significantly across decades, and illustrators are often given free rein over their issues. That's what makes the question interesting. Rather than treating 75 years of comics as a single dataset, this analysis splits the corpus by era and asks whether a machine learning model can distinguish the two publishers at each stage. Do they start off different, or is there a divergence point where house style hardens into something consistently legible?

This project frames publisher classification as a proxy for cultural signal detection. If a multimodal classifier can reliably separate the two publishers, and if we can isolate which modality is more diagnostic, we learn something about where institutional identity actually lives in the artifact — in the image, in the words, or in their combination.

Status · April 2026. Data pipeline underway: targeting ~200–400 labeled (cover image, description) pairs via the Comic Vine API. Scraper drafted; awaiting API key. Architecture frozen; ablations will run once data is collected.
Architecture

A late-fusion multimodal classifier

Three ablation variants, trained and compared. The fusion model is the primary interest; the two unimodal variants provide the diagnostic baselines.

Branch 01 · Image
ResNet-50
Pretrained ImageNet weights, frozen backbone. 2048-dim embedding via torchvision.models.
Branch 02 · Text
DistilBERT
CLS token, frozen encoder. 768-dim embedding via transformers.
Fusion · Late
Concatenate → MLP
Both branches projected to shared dim, concatenated, passed through dense head → softmax (Marvel / DC).

Late fusion was chosen over cross-attention or early fusion: each branch stays interpretable and the ablation comparison stays clean. If text-only matches the fused model, the image branch is redundant for this task — and vice versa.

```python
# Simplified forward pass
import torch
from torch import nn

class ComicClassifier(nn.Module):
    def forward(self, image, input_ids, attention_mask):
        img_feat = self.img_proj(self.resnet(image))                  # → shared_dim
        cls_token = self.bert(input_ids, attention_mask).last_hidden_state[:, 0]
        text_feat = self.txt_proj(cls_token)                          # → shared_dim
        fused = torch.cat([img_feat, text_feat], dim=-1)
        return self.classifier(fused)                                 # → [Marvel, DC]
```
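To make the dimensions concrete, here is a minimal runnable sketch of the fusion head alone, with random tensors standing in for the frozen backbone outputs (ResNet-50 pooled features at 2048-dim, DistilBERT CLS at 768-dim). The value shared_dim = 256 and the two-layer head are illustrative assumptions, not choices fixed by the project.

```python
import torch
from torch import nn

# Stand-ins for the frozen backbone outputs described above:
# ResNet-50 pooled features (2048-dim) and the DistilBERT CLS token (768-dim).
batch, shared_dim = 4, 256            # shared_dim is a hypothetical hyperparameter

img_feat_raw = torch.randn(batch, 2048)
txt_feat_raw = torch.randn(batch, 768)

img_proj = nn.Linear(2048, shared_dim)   # project each branch to the shared dim
txt_proj = nn.Linear(768, shared_dim)
classifier = nn.Sequential(              # dense head over the concatenation
    nn.Linear(2 * shared_dim, shared_dim),
    nn.ReLU(),
    nn.Linear(shared_dim, 2),            # logits over [Marvel, DC]
)

fused = torch.cat([img_proj(img_feat_raw), txt_proj(txt_feat_raw)], dim=-1)
logits = classifier(fused)               # shape: (batch, 2)
```

Because both backbones are frozen, only these projections and the head carry trainable parameters, which keeps all three ablation variants cheap to train.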
Results

Ablation table

Test accuracy — Comic Vine cover + description dataset
| Variant    | Modality                  | Accuracy | Notes   |
|------------|---------------------------|----------|---------|
| Image only | ResNet-50, cover image    | —        | pending |
| Text only  | DistilBERT, description   | —        | pending |
| Fused      | image + text, late fusion | —        | pending |

Hypothesis going in: text is the stronger signal. Cover art has converged aesthetically across publishers; editorial voice and character vocabulary are harder to fake. CoSMo (2025) found visual features dominate macro-structure in panel-level tasks — but cover images are a different regime than interior panels, so the result may not transfer.

Data

Comic Vine API

Starting with cover images + issue descriptions — no OCR required, freely available, labeled by publisher. Marvel publisher_id = 31, DC publisher_id = 10. Target ~200–400 balanced pairs. Each sample in the manifest: {image_path, text, label, publisher, title}.
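A minimal sketch of loading that manifest, assuming it is stored as JSON lines; the load_manifest name and the file layout are hypothetical, not the scraper's actual output format.

```python
import json
from pathlib import Path

# Publisher → class index. Comic Vine publisher ids: Marvel = 31, DC = 10.
LABELS = {"Marvel": 0, "DC": 1}

def load_manifest(path):
    """Read a JSON-lines manifest of {image_path, text, label, publisher, title} samples."""
    samples = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        assert rec["publisher"] in LABELS   # keep the dataset strictly two-class
        samples.append(rec)
    return samples
```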

Interior pages with OCR are Phase 2, pending CBZ/CBR parsing via zipfile / rarfile and EasyOCR for stylized panel text. Phase 1 is deliberately scoped narrow to establish a clean baseline first.
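The Phase 2 ingestion step can be sketched with the standard library alone, since a CBZ is just a ZIP archive of page images; list_cbz_pages is a hypothetical helper, and EasyOCR would run downstream on each extracted page.

```python
import zipfile

IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".webp")

def list_cbz_pages(cbz_path):
    """Return interior page image names from a CBZ archive, in reading order."""
    with zipfile.ZipFile(cbz_path) as zf:
        pages = [n for n in zf.namelist() if n.lower().endswith(IMAGE_EXTS)]
    # Archives conventionally number pages, so lexicographic order ≈ reading order.
    return sorted(pages)

# Phase 2 would then OCR each page, e.g. (assumed usage):
#   reader = easyocr.Reader(["en"])
#   results = reader.readtext(zf.read(page_name))
```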

Why this is interesting

Existing comics ML work focuses almost entirely on within-comic tasks: panel segmentation, speech bubble detection, emotion recognition in scenes. Publisher identity as a learnable institutional signal hasn't been explicitly treated. The closest prior work is CoSMo (ICCV 2025) — structurally adjacent but asking a different question entirely.

The framing here — publisher as cultural fingerprint, not just a class label — is the novel angle. If the classifier fails, that's also interesting: it would suggest that whatever makes Marvel feel like Marvel has been obscured by decades of aesthetic convergence between the two publishers.

References