Marvel vs. DC: Aesthetic and tonal identity across eras. An ML study.
You merely adopted the style. I was born in it.
Marvel and DC are the standout institutions of the comic book world, each with decades of accumulated visual and narrative style. They differ in color palette, page geometry, and editorial voice. Avid fan bases hold strong preferences for one publisher over the other; DC is generally considered darker and grittier, with more powerful characters. I can generally articulate which one I'm in the mood for and why. But art styles and writing tones shift significantly across decades, and illustrators are often given free rein over their issues. That's what makes the question interesting. Rather than treating 75 years of comics as a single dataset, this analysis treats each era separately, asking whether a machine learning model can distinguish the two publishers at each stage. Do they start off different, or is there a divergence point where house style hardens into something consistently legible?
This project frames publisher classification as a proxy for cultural signal detection. If a multimodal classifier can reliably separate the two publishers, and if we can isolate which modality is more diagnostic, we learn something about where institutional identity actually lives in the artifact — in the image, in the words, or in their combination.
A late-fusion multimodal classifier
Three ablation variants, trained and compared. The fusion model is the primary interest; the two unimodal variants provide the diagnostic baseline.
Late fusion was chosen over cross-attention or early fusion: each branch stays interpretable and the ablation comparison stays clean. If text-only matches the fused model, the image branch is redundant for this task — and vice versa.
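Concretely, the fused variant can be sketched as a small head over precomputed branch embeddings. This is a minimal sketch, assuming the image branch yields ResNet-50's 2048-d pooled features and the text branch DistilBERT's 768-d [CLS] vector; the hidden size and dropout rate are illustrative, not a trained configuration.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Late-fusion head: each backbone is reduced to an embedding upstream,
    and only the concatenated vector passes through a small MLP.
    Default dims match ResNet-50 (2048) and DistilBERT (768), but the
    module itself is backbone-agnostic."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=256, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),  # logits: DC vs. Marvel
        )

    def forward(self, img_emb, txt_emb):
        # Fusion happens here and nowhere else: concatenate, then classify.
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.head(fused)

# Shape check with random tensors standing in for real features.
model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Dropping one branch (zeroing or removing an input) is what makes the two unimodal ablations cheap to run against the same head architecture.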
Ablation table
| Variant | Modality | Accuracy | Notes |
|---|---|---|---|
| Image only | ResNet-50 cover image | — pending | |
| Text only | DistilBERT description | — pending | |
| Fused | Image + text, late fusion | — pending | |
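Since the dataset is balanced by construction, each pending cell is simply held-out accuracy. A trivial sketch of how the cells get filled; the labels and predictions below are made up:

```python
def accuracy(preds, labels):
    """Fraction of correct predictions; one value per table cell."""
    assert len(preds) == len(labels)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical held-out predictions for one variant (0 = DC, 1 = Marvel).
labels = [0, 1, 1, 0, 1, 0]
preds  = [0, 1, 0, 0, 1, 1]
print(accuracy(preds, labels))  # ≈ 0.667
```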
Hypothesis going in: text is the stronger signal. Cover art has converged aesthetically across publishers; editorial voice and character vocabulary are harder to fake. CoSMo (2025) found visual features dominate macro-structure in panel-level tasks — but cover images are a different regime than interior panels, so the result may not transfer.
Comic Vine API
Starting with cover images + issue descriptions — no OCR required, freely available, labeled by publisher. Marvel publisher_id = 31, DC publisher_id = 10. Target ~200–400 balanced pairs. Each sample in the manifest: {image_path, text, label, publisher, title}.
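A sketch of normalizing API records into that manifest shape. The field names on the incoming record (`image`, `description`, `volume`) mirror Comic Vine's issue payload but are assumptions here, as is the helper name:

```python
# Publisher IDs from the text above; second tuple element is the class label.
PUBLISHER_LABELS = {31: ("Marvel", 1), 10: ("DC", 0)}

def to_manifest_row(record, publisher_id):
    """Map one fetched issue record to the manifest schema
    {image_path, text, label, publisher, title}."""
    publisher, label = PUBLISHER_LABELS[publisher_id]
    return {
        "image_path": record["image"],        # local path after download
        "text": record["description"] or "",  # description may be null
        "label": label,
        "publisher": publisher,
        "title": record["volume"],
    }

row = to_manifest_row(
    {"image": "covers/asm_300.jpg",
     "description": "Venom's first full appearance.",
     "volume": "The Amazing Spider-Man"},
    publisher_id=31,
)
print(row["publisher"], row["label"])  # Marvel 1
```

Balancing to ~200–400 pairs is then just truncating each publisher's list to the smaller count before writing the manifest.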
Interior pages with OCR are Phase 2, pending CBZ/CBR parsing via zipfile / rarfile and EasyOCR for stylized panel text. Phase 1 intentionally scoped narrow to get a clean baseline first.
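Because a CBZ is just a zip archive of page images, the Phase 2 page listing needs little beyond the stdlib. A sketch under that assumption (the helper name and extension list are illustrative; CBR would route through the `rarfile` package instead):

```python
import io
import zipfile

PAGE_EXTS = (".jpg", ".jpeg", ".png", ".webp")

def list_pages(cbz_bytes):
    """Return page image names in reading order.
    CBZ convention: sorted filename order is reading order; non-image
    entries (e.g. ComicInfo.xml metadata) are skipped."""
    with zipfile.ZipFile(io.BytesIO(cbz_bytes)) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.lower().endswith(PAGE_EXTS)
        )

# Fake in-memory archive standing in for a real issue.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ["p002.jpg", "p001.jpg", "cover.png", "info.xml"]:
        zf.writestr(name, b"")
print(list_pages(buf.getvalue()))  # ['cover.png', 'p001.jpg', 'p002.jpg']
```

Each listed page would then be extracted and handed to EasyOCR for the stylized panel text.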
Existing comics ML work focuses almost entirely on within-comic tasks: panel segmentation, speech bubble detection, emotion recognition in scenes. Publisher identity as a learnable institutional signal hasn't been explicitly treated. The closest prior work is CoSMo (ICCV 2025) — structurally adjacent but asking a different question entirely.
The framing here — publisher as cultural fingerprint, not just a class label — is the novel angle. If the classifier fails, that's also interesting: it would suggest that whatever makes Marvel feel like Marvel has been obscured by decades of aesthetic convergence between the two publishers.
- [1] Serra et al. CoSMo: Multimodal Comics Scene Segmentation. ICCV 2025. SigLIP + Qwen OCR embeddings, late fusion MLP, 20,800-page dataset.
- [2] Augereau et al. Image segmentation, classification and recognition methods for comics: A decade systematic literature review. ScienceDirect, 2023.
- [3] Yao et al. Multimodal Emotion Recognition on Comics Scenes. ICDAR 2021. Panel-level multimodal classification.
- [4] Vivoli et al. Multimodal Transformer for Comics Text-Cloze. Domain-adapted ResNet on comic imagery matches larger multimodal LLMs with fewer parameters.