MXNet Model Zoo

MXNet features fast implementations of many state-of-the-art models reported in the academic literature. This Model Zoo is an ongoing project to collect complete models, with Python scripts, pre-trained weights, and instructions on how to build and fine-tune these models.

How to Contribute a Pre-Trained Model (and what to include)

The Model Zoo has good entries for CNNs but is seeking content in other areas.

Issue a Pull Request containing the following:

  • Gist Log
  • .json model definition
  • Model parameter file
  • Readme file (details below)

The Readme file should contain:

  • Model Location, access instructions (wget)
  • Confirmation the trained model meets published accuracy from original paper
  • Step by step instructions on how to use the trained model
  • References to any other applicable docs or arxiv papers the model is based on
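
The file checklist above can be turned into a quick pre-submission check. The following is a minimal sketch and not part of MXNet: the function name `missing_items` and the file-name patterns (a `.json` file, a `.params` file, any file whose name starts with `readme`) are assumptions made for illustration, since the zoo does not prescribe exact file names. The Gist log is not checked because it lives outside the submitted files.

```python
import os
import tempfile

# Hypothetical patterns: the zoo does not prescribe exact file names.
REQUIRED = {
    "model definition (.json)": lambda name: name.lower().endswith(".json"),
    "model parameter file (.params)": lambda name: name.lower().endswith(".params"),
    "readme file": lambda name: name.lower().startswith("readme"),
}


def missing_items(directory):
    """Return the checklist items that no file in `directory` satisfies."""
    names = os.listdir(directory)
    return [item for item, matches in REQUIRED.items()
            if not any(matches(name) for name in names)]


if __name__ == "__main__":
    # Build a throwaway directory that lacks a parameter file.
    with tempfile.TemporaryDirectory() as d:
        for name in ("net-symbol.json", "README.md"):
            open(os.path.join(d, name), "w").close()
        print(missing_items(d))
```

An empty list means every checked item is present; the contents of the Readme still need to be reviewed by hand.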

Convolutional Neural Networks (CNNs)

Convolutional neural networks are the state-of-the-art architecture for many image and video processing problems. Some available datasets include:

  • ImageNet: a large corpus of 1 million natural images, divided into 1000 categories.
  • CIFAR10: 60,000 natural images (32 x 32 pixels) from 10 categories.
  • PASCAL VOC: natural images annotated with object bounding boxes.
  • UCF101: 13,320 videos from 101 action categories.
  • Mini-Places2: Subset of the Places2 dataset. Includes 100,000 images from 100 scene categories.
  • ImageNet 11k
  • Places2: Places365-Standard has 1.6 million training images from 365 scene categories, which are used to train the Places365 CNNs, plus 50 images per category in the validation set and 900 images per category in the test set. Places365-Challenge adds 6.2 million training images, for a total of 8 million training images in the Places365 Challenge 2016; its validation and test sets are the same as those of Places365-Standard.
  • Multimedia Commons: YFCC100M (99.2 million images and 0.8 million videos from Flickr) and supplemental material (pre-extracted features, additional annotations).

For instructions on using these models, see the python tutorial on using pre-trained ImageNet models.
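
As a rough sketch of what using one of the pre-trained models listed below involves: a checkpoint saved by MXNet is a pair of files, `<prefix>-symbol.json` and `<prefix>-<four-digit epoch>.params`, which `mx.model.load_checkpoint` reads back. The prefix `resnet-152` and epoch `0` are placeholders, not the names of actual zoo files, and both helper names are hypothetical. `mxnet` is imported lazily so the file-name helper works without it installed.

```python
def checkpoint_files(prefix, epoch):
    """File names MXNet uses for a checkpoint with this prefix and epoch."""
    return "%s-symbol.json" % prefix, "%s-%04d.params" % (prefix, epoch)


def load_pretrained(prefix, epoch):
    """Load a downloaded checkpoint; requires the mxnet package."""
    import mxnet as mx  # imported here so the helper above has no dependency

    # Returns the network symbol plus its learned and auxiliary parameters.
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    return sym, arg_params, aux_params


if __name__ == "__main__":
    # Placeholder prefix and epoch; substitute those of the downloaded model.
    print(checkpoint_files("resnet-152", 0))
```

Both files must sit in the working directory (or the prefix must include their path) before `load_pretrained` is called.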

| Model Definition | Dataset | Model Weights | Research Basis | Contributors |
| --- | --- | --- | --- | --- |
| CaffeNet | ImageNet | Param File | Krizhevsky, 2012 | @jspisak |
| Network in Network (NiN) | ImageNet | Param File | Lin et al., 2014 | @jspisak |
| SqueezeNet v1.1 | ImageNet | Param File | Iandola et al., 2016 | @jspisak |
| VGG16 | ImageNet | Param File | Simonyan et al., 2015 | @jspisak |
| VGG19 | ImageNet | Param File | Simonyan et al., 2015 | @jspisak |
| Inception v3 w/BatchNorm | ImageNet | Param File | Szegedy et al., 2015 | @jspisak |
| ResidualNet152 | ImageNet | Param File | He et al., 2015 | @jspisak |
| ResNext101-64x4d | ImageNet | Param File | Xie et al., 2016 | @Jerryzcn |
| Fast-RCNN | PASCAL VOC | [Param File] | Girshick, 2015 | |
| Faster-RCNN | PASCAL VOC | [Param File] | Ren et al., 2016 | |
| Single Shot Detection (SSD) | PASCAL VOC | [Param File] | Liu et al., 2016 | |
| LocationNet | MultimediaCommons | Param File | Weyand et al., 2016 | @jychoi84 @kevinli7 |

Recurrent Neural Networks (RNNs) including LSTMs

MXNet supports many types of recurrent neural networks (RNNs), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Some available datasets include:

  • Sherlock Holmes: Text corpus with ~1 million words. The task is predicting downstream words/characters.
  • Penn Treebank (PTB): Text corpus with ~1 million words. Vocabulary is limited to 10,000 words. The task is predicting downstream words/characters.
  • Shakespeare: Complete text from Shakespeare’s works.
  • IMDB reviews: 25,000 movie reviews, labeled as positive or negative.
  • Facebook bAbI: A set of 20 question-and-answer tasks, each with 1,000 training examples.
  • Flickr8k, COCO: Images with associated captions (sentences). Flickr8k consists of 8,092 images with ~40,000 captions written by Amazon Mechanical Turk workers. COCO has 328,000 images, each with 5 captions. The COCO images also come with labeled objects using segmentation algorithms.
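
The "predicting downstream words/characters" task used by several of these corpora amounts to pairing each window of text with the token that follows it. A minimal sketch, independent of MXNet; the function name `next_token_pairs` and the window length are illustrative assumptions:

```python
def next_token_pairs(tokens, context_len):
    """Pair each run of `context_len` tokens with the token that follows it."""
    return [(tokens[i:i + context_len], tokens[i + context_len])
            for i in range(len(tokens) - context_len)]


if __name__ == "__main__":
    # Character-level example; passing a list of words gives word-level pairs.
    for context, target in next_token_pairs("holmes", 3):
        print(repr(context), "->", repr(target))
```

A recurrent model is then trained to predict each target from its context.
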

| Model Definition | Dataset | Model Weights | Research Basis | Contributors |
| --- | --- | --- | --- | --- |
| LSTM - Image Captioning | Flickr8k, MS COCO | | Vinyals et al., 2015 | @... |
| LSTM - Q&A System | bAbI | | Weston et al., 2015 | |
| LSTM - Sentiment Analysis | IMDB | | Li et al., 2015 | |

Generative Adversarial Networks (GANs)

Generative Adversarial Networks train a competing pair of neural networks: a generator network, which transforms a latent vector into content such as an image, and a discriminator network, which tries to distinguish generated content from supplied “real” training content. When properly trained, the two reach a Nash equilibrium.
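
To make the equilibrium statement concrete, the following sketch evaluates the value function from the original GAN formulation (Goodfellow et al., 2014), V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], on hand-picked discriminator outputs. It does not train anything and is independent of MXNet; the function name `gan_value` and the sample numbers are illustrative assumptions.

```python
import math


def gan_value(d_real, d_fake):
    """Estimate V(D, G) from discriminator outputs in the open interval (0, 1).

    d_real: D(x) for real samples; d_fake: D(G(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


if __name__ == "__main__":
    # A discriminator that separates real from generated samples well.
    print(gan_value([0.9, 0.8], [0.1, 0.2]))
    # A discriminator that cannot tell them apart outputs 0.5 everywhere.
    print(gan_value([0.5, 0.5], [0.5, 0.5]))
```

In that formulation, a discriminator that outputs 0.5 for every input corresponds to the equilibrium, where the value equals -log 4; a discriminator that separates the two sets more confidently yields a higher value.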

| Model Definition | Dataset | Model Weights | Research Basis | Contributors |
| --- | --- | --- | --- | --- |
| DCGANs | ImageNet | | Radford et al., 2016 | @... |
| Text to Image Synthesis | MS COCO | | Reed et al., 2016 | |
| Deep Jazz | | | Deepjazz.io | |

Other Models

MXNet supports a variety of model types beyond the canonical CNN and LSTM models, including deep reinforcement learning and linear models. Some available datasets and sources include:

  • Google News: A text corpus with a vocabulary of 3 million words, prepared for word2vec.
  • MovieLens 20M Dataset: 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
  • Atari Video Game Emulator: Stella is a multi-platform Atari 2600 VCS emulator released under the GNU General Public License (GPL).

| Model Definition | Dataset | Model Weights | Research Basis | Contributors |
| --- | --- | --- | --- | --- |
| Word2Vec | Google News | | Mikolov et al., 2013 | @... |
| Matrix Factorization | MovieLens 20M | | Huang et al., 2013 | |
| Deep Q-Network | Atari video games | | Mnih et al., 2015 | |
| Asynchronous advantage actor-critic (A3C) | Atari video games | | Mnih et al., 2016 | |