Marvel vs. DC: Aesthetic and tonal identity across eras. An ML study.
You merely adopted the style. I was born in it.
Marvel and DC are the standout institutions of the comic book world, each with decades of accumulated visual and narrative style. They differ in color palette, page geometry, and editorial voice. Avid fan bases hold strong preferences for one publisher over the other; DC is generally considered darker and grittier, with more powerful characters. I can generally articulate which one I'm in the mood for and why. But art styles and writing tones shift significantly across decades, and illustrators are often given free rein over their issues. That's what makes the question interesting. Rather than treating 75 years of comics as a single dataset, this analysis treats each era separately, asking whether a machine learning model can distinguish the two publishers at each stage. Do they start off different, or is there a divergence point where house style hardens into something consistently legible?
This project frames publisher classification as a proxy for cultural signal detection. If a multimodal classifier can reliably separate the two publishers, and if we can isolate which modality is more diagnostic, we learn something about where institutional identity actually lives in the artifact — in the image, in the words, or in their combination.
A late-fusion multimodal classifier
Three ablation variants, trained and compared. The fusion model is the primary interest; the two unimodal variants provide the diagnostic baseline.
Late fusion was chosen over cross-attention or early fusion: each branch stays interpretable and the ablation comparison stays clean. If text-only matches the fused model, the image branch is redundant for this task — and vice versa.
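Concretely, the fused variant can be sketched as a small head over precomputed branch embeddings. This is a minimal sketch, assuming the image branch yields ResNet-50's 2048-d pooled features and the text branch DistilBERT's 768-d [CLS] vector; the hidden size and dropout rate are illustrative, not a trained configuration.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Late-fusion head: each backbone is reduced to an embedding upstream,
    and only the concatenated vector passes through a small MLP.
    Default dims match ResNet-50 (2048) and DistilBERT (768), but the
    module itself is backbone-agnostic."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=256, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),  # logits: DC vs. Marvel
        )

    def forward(self, img_emb, txt_emb):
        # Fusion happens here and nowhere else: concatenate, then classify.
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.head(fused)

# Shape check with random tensors standing in for real features.
model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Dropping one branch (zeroing or removing an input) is what makes the two unimodal ablations cheap to run against the same head architecture.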
Ablation table
| Variant | Modality | Accuracy | Notes |
|---|---|---|---|
| Image only | ResNet-50 cover image | — pending | |
| Text only | DistilBERT description | — pending | |
| Fused | Image + text, late fusion | — pending | |
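Since the dataset is balanced by construction, each pending cell is simply held-out accuracy. A trivial sketch of how the cells get filled; the labels and predictions below are made up:

```python
def accuracy(preds, labels):
    """Fraction of correct predictions; one value per table cell."""
    assert len(preds) == len(labels)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical held-out predictions for one variant (0 = DC, 1 = Marvel).
labels = [0, 1, 1, 0, 1, 0]
preds  = [0, 1, 0, 0, 1, 1]
print(accuracy(preds, labels))  # ≈ 0.667
```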
Hypothesis going in: text is the stronger signal. Cover art has converged aesthetically across publishers; editorial voice and character vocabulary are harder to fake. CoSMo (2025) found visual features dominate macro-structure in panel-level tasks — but cover images are a different regime than interior panels, so the result may not transfer.
Comic Vine API
Starting with cover images + issue descriptions — no OCR required, freely available, labeled by publisher. Marvel publisher_id = 31, DC publisher_id = 10. Target ~200–400 balanced pairs. Each sample in the manifest: {image_path, text, label, publisher, title}.
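A sketch of normalizing API records into that manifest shape. The field names on the incoming record (`image`, `description`, `volume`) mirror Comic Vine's issue payload but are assumptions here, as is the helper name:

```python
# Publisher IDs from the text above; second tuple element is the class label.
PUBLISHER_LABELS = {31: ("Marvel", 1), 10: ("DC", 0)}

def to_manifest_row(record, publisher_id):
    """Map one fetched issue record to the manifest schema
    {image_path, text, label, publisher, title}."""
    publisher, label = PUBLISHER_LABELS[publisher_id]
    return {
        "image_path": record["image"],        # local path after download
        "text": record["description"] or "",  # description may be null
        "label": label,
        "publisher": publisher,
        "title": record["volume"],
    }

row = to_manifest_row(
    {"image": "covers/asm_300.jpg",
     "description": "Venom's first full appearance.",
     "volume": "The Amazing Spider-Man"},
    publisher_id=31,
)
print(row["publisher"], row["label"])  # Marvel 1
```

Balancing to ~200–400 pairs is then just truncating each publisher's list to the smaller count before writing the manifest.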
Interior pages with OCR are Phase 2, pending CBZ/CBR parsing via zipfile / rarfile and EasyOCR for stylized panel text. Phase 1 intentionally scoped narrow to get a clean baseline first.
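Because a CBZ is just a zip archive of page images, the Phase 2 page listing needs little beyond the stdlib. A sketch under that assumption (the helper name and extension list are illustrative; CBR would route through the `rarfile` package instead):

```python
import io
import zipfile

PAGE_EXTS = (".jpg", ".jpeg", ".png", ".webp")

def list_pages(cbz_bytes):
    """Return page image names in reading order.
    CBZ convention: sorted filename order is reading order; non-image
    entries (e.g. ComicInfo.xml metadata) are skipped."""
    with zipfile.ZipFile(io.BytesIO(cbz_bytes)) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.lower().endswith(PAGE_EXTS)
        )

# Fake in-memory archive standing in for a real issue.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ["p002.jpg", "p001.jpg", "cover.png", "info.xml"]:
        zf.writestr(name, b"")
print(list_pages(buf.getvalue()))  # ['cover.png', 'p001.jpg', 'p002.jpg']
```

Each listed page would then be extracted and handed to EasyOCR for the stylized panel text.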
Existing comics ML work focuses almost entirely on within-comic tasks: panel segmentation, speech bubble detection, emotion recognition in scenes. Publisher identity as a learnable institutional signal hasn't been explicitly treated. The closest prior work is CoSMo (ICCV 2025) — structurally adjacent but asking a different question entirely.
The framing here — publisher as cultural fingerprint, not just a class label — is the novel angle. If the classifier fails, that's also interesting: it would suggest that whatever makes Marvel feel like Marvel has been obscured by decades of aesthetic convergence between the two publishers.
- [1] Serra et al. CoSMo: Multimodal Comics Scene Segmentation. ICCV 2025. SigLIP + Qwen OCR embeddings, late fusion MLP, 20,800-page dataset.
- [2] Augereau et al. Image segmentation, classification and recognition methods for comics: A decade systematic literature review. ScienceDirect, 2023.
- [3] Yao et al. Multimodal Emotion Recognition on Comics Scenes. ICDAR 2021. Panel-level multimodal classification.
- [4] Vivoli et al. Multimodal Transformer for Comics Text-Cloze. Domain-adapted ResNet on comic imagery matches larger multimodal LLMs with fewer parameters.