A concise technical walkthrough of our MNet replication and architectural improvements
Modern medical imaging—MRI and CT in particular—often suffers from anisotropy: slices may be spaced far apart along the z-axis while in-plane resolution remains high. This mismatch creates a persistent challenge for 3D segmentation models. Pure 3D CNNs tend to overfit sparse depth information; pure 2D CNNs ignore valuable volumetric context.
MNet was proposed as a hybrid 2D/3D architecture that blends both perspectives at every stage of the network. In this project, we reproduced MNet from scratch, validated its core claims, and designed two lightweight extensions—Fusion Gating and VMamba—to push the architecture further.
Below is a brief look at what we reproduced, how we improved it, and what we learned.
Why MNet?
MNet introduces a mesh-like U-Net architecture with parallel 2D and 3D encoder-decoder branches. Instead of choosing between 2D or 3D convolutions, MNet blends both inside each block using manually defined fusion rules (add / subtract / concatenate).
According to the authors, this yields strong robustness to anisotropy across MRI and CT datasets.
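For concreteness, the fixed rules amount to something like the following (a hypothetical sketch, not the authors' code; `x2d`/`x3d` are the spatially aligned feature maps from the two branches):

```python
import torch

def fixed_fusion(x2d: torch.Tensor, x3d: torch.Tensor, mode: str) -> torch.Tensor:
    """Manually defined (non-learnable) fusion of the 2D and 3D feature streams."""
    if mode == "add":
        return x2d + x3d
    if mode == "subtract":
        return x2d - x3d
    if mode == "concatenate":
        return torch.cat([x2d, x3d], dim=1)  # stack along the channel axis
    raise ValueError(f"unknown fusion mode: {mode}")
```

Whatever the operation, it is chosen at design time and applied identically everywhere, a point we return to in Extension 1.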
Our goals were:
- Reproduce MNet and verify the original performance trends.
- Test its robustness under controlled z-spacing perturbations.
- Extend MNet by introducing:
  - Fusion Gating → learnable 2D–3D blending
  - VMamba → efficient long-range depth modeling via state-space modules
Experimental Setup
We used nnU-Net v1 pipelines for preprocessing, augmentation, and patch sampling—matching the original implementation as closely as possible.
Datasets:
- PROMISE12 (prostate MRI) – strongly anisotropic
- LiTS (CT liver & tumors) – much more isotropic, but tumor segmentation is notoriously difficult
Training was performed on 16-GB GPUs (Lightning.ai), so long schedules (500 epochs) were not feasible. Instead, we trained for 150 epochs and verified that PROMISE12 converges by epoch ~100–120.
Reproducing MNet
Our reproduction closely matched the original paper’s reported numbers:
PROMISE12
- Original MNet: 89.8% Dice
- Our reproduction: 89.0 ± 0.9%
LiTS
- Liver: 94.3 ± 1.9% (matches original 94.3%)
- Tumor: 54.6 ± 3.1% (expected drop due to 50-case subset)
Conclusion: We successfully reproduced MNet within acceptable reproducibility margins (±2–3 Dice), confirming the architecture’s validity.
Extension 1: Fusion Gating
The original MNet uses fixed operations to merge 2D and 3D feature streams. But fixed rules cannot adapt to variations in local anatomy, noise, or slice spacing.
We replaced them with learnable gates that estimate how much to “trust” the 2D versus the 3D features, in one of two ways:
- Channel-wise: one gate per feature channel
- Spatial-wise: one gate per voxel location
Gating performs soft interpolation:
\[
y = g \odot x_{2D} + (1 - g) \odot x_{3D}
\]
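As a minimal sketch of the spatial variant (our illustration, not necessarily the exact layer layout we shipped): a single 1×1×1 convolution over the concatenated streams predicts one gate per voxel:

```python
import torch
import torch.nn as nn

class SpatialFusionGate(nn.Module):
    """Voxel-wise learnable blending of 2D- and 3D-branch features.

    Sketch only: a 1x1x1 conv over the concatenated streams predicts a
    gate g in [0, 1] per voxel, then y = g * x2d + (1 - g) * x3d.
    A channel-wise variant would pool spatially first (e.g. with
    AdaptiveAvgPool3d(1)) and predict one gate per channel instead.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv3d(2 * channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x2d: torch.Tensor, x3d: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([x2d, x3d], dim=1))  # shape (B, 1, D, H, W)
        return g * x2d + (1.0 - g) * x3d             # broadcasts over channels
```

Because the gate is predicted from the features themselves, the network can lean on the 2D stream where slices are far apart and on the 3D stream where volumetric context is reliable.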
Results:
- Spatial Gating improved PROMISE12 Dice from 89.0% → 90.0%
- Added negligible compute overhead
- Provided smoother boundary predictions
Extension 2: VMamba for Depth Modeling
3D convolutions struggle with long-range dependencies when z-resolution is coarse. Attention-based models help, but are expensive in 3D.
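As a rough cost comparison (our back-of-the-envelope, for a feature map of D slices of H × W pixels):

\[
\underbrace{O\!\big((DHW)^2\big)}_{\text{full 3D self-attention}}
\;\gg\;
\underbrace{O\!\big(HW \cdot D^2\big)}_{\text{depth-only attention per pixel}}
\;>\;
\underbrace{O\!\big(HW \cdot D\big)}_{\text{depth-only SSM scan}}
\]

The SSM scan keeps a global receptive field along z at linear cost.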
We integrated VMamba, a state-space model that performs efficient z-axis scanning (a minimal sketch follows the list below):
- Models each (H, W) pixel as a sequence along depth
- Achieves O(D) complexity instead of the quadratic cost of attention
- Replaces heavy 3D convolutions in deeper bottleneck layers
- Reduces parameter count from 8.77M → 7.42M
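A minimal sketch of the depth-scanning wrapper, assuming a generic sequence module stands in for the actual VMamba block (e.g. a `Mamba` layer from the `mamba_ssm` package; the name `DepthSequenceSSM` is ours):

```python
import torch
import torch.nn as nn

class DepthSequenceSSM(nn.Module):
    """Run a state-space sequence block over the depth axis of a 3D feature map.

    `seq_block` is assumed to map (batch, length, channels) ->
    (batch, length, channels); each (H, W) location is treated as an
    independent length-D sequence.
    """

    def __init__(self, seq_block: nn.Module):
        super().__init__()
        self.seq_block = seq_block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = x.shape
        # (B, C, D, H, W) -> (B*H*W, D, C): one depth sequence per pixel.
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, d, c)
        seq = self.seq_block(seq)  # linear in D, unlike quadratic attention
        # Back to (B, C, D, H, W).
        return seq.reshape(b, h, w, d, c).permute(0, 4, 3, 1, 2)
```

Since every (H, W) location becomes its own sequence, the cost scales linearly with D, which is why we substitute this only for the heavy 3D convolutions in the low-resolution bottleneck stages, where H·W is small.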
Results:
- Strongest LiTS liver Dice: 95.8 ± 0.7%
- Improved consistency under anisotropy
- Higher variance for tumor class (due to dataset complexity)
Results Summary
✔ Reproduction Success
- PROMISE reproduction is within 0.8% of the original.
- LiTS liver segmentation matches the original exactly.
✔ Extensions Help
- Spatial Gating → Most stable and consistent improvements
- VMamba → Best liver performance; fewer parameters
✔ Robust to Anisotropy
Across PROMISE12 z-spacings (1–4 mm), Dice dropped only ~1.5 points for the extended variants, versus the baseline's ~2–3 point drop.
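For reference, a simple way to produce such perturbations (a sketch; the exact resampling protocol we used is an assumption here) is to resample volumes along z while leaving in-plane resolution untouched:

```python
import numpy as np
from scipy.ndimage import zoom

def simulate_z_spacing(volume: np.ndarray, src_mm: float, dst_mm: float) -> np.ndarray:
    """Resample a (D, H, W) volume along z to emulate coarser slice spacing."""
    factor = src_mm / dst_mm          # e.g. 1 mm -> 4 mm gives 0.25 (fewer slices)
    return zoom(volume, (factor, 1.0, 1.0), order=1)  # linear interpolation in z
```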
Key Takeaways
- MNet is indeed reproducible—its hybrid 2D/3D design is robust and stable.
- Learned fusion works better than rigid fusion. Spatial Gating provides a free performance boost with minimal overhead.
- State-space models (VMamba) are promising for 3D medical imaging, especially where z-resolution is limited.
- LiTS tumor segmentation remains a data-limited problem, not an architectural one.
Limitations & Future Directions
Due to compute limitations, LiTS results were based on a 50-case subset and 150-epoch schedules. With full dataset access and extended training, we expect larger gains—especially for VMamba.
Promising next steps:
- Full-resolution LiTS with 500-epoch training
- Stratified cross-validation to mitigate tumor sparsity issues
- Bidirectional VMamba scanning
- Combining learned fusion with modern 3D attention mechanisms
Closing Thoughts
This project shows that MNet remains a strong and elegant baseline for anisotropic medical segmentation—but also that small, thoughtful architectural tweaks can further improve robustness and generalization.
By adding adaptive fusion and efficient global depth modeling, we demonstrate meaningful gains without sacrificing MNet’s lightweight nature.