MillionCelebs: One Million Celebrity Face Recognition Dataset



Description

This page presents the MillionCelebs face recognition dataset, which contains face images of one million celebrities.

The MillionCelebs face dataset was collected from Internet image search engines according to the Freebase celebrity name list released along with MS1M. After preprocessing, it contains 87.0M face images of 1M identities. The figure shows detailed demographic statistics. Unlike many celebrity datasets in which most identities are actors, MillionCelebs covers a wide range of professions. With this abundant statistical information, MillionCelebs has been used to select subsets that meet specific research needs, such as studying racial bias and class imbalance in deep face recognition.


The dataset was first published with the paper Global-Local GCN: Large-Scale Label Noise Cleansing for Face Recognition, and was further refined over the following year. After merging it with the GlintAsia dataset, we built a cleaned version with 22.8M images of 719.6K identities.

Dataset                Year  # of photos  # of subjects  # of photos per subject
CASIA-WebFace [1]      2014  494,414      10,575         46.8
MS-Celeb-1M [2]        2016  10M          100,000        100
VGGFace2 [3]           2018  3.31M        9,131          362.6
Celeb-500k [4]         2018  50M          500K           100
Celeb-500k (clean)     2018  25M          356K           70.2
MillionCelebs          2020  87M          1M             87
MillionCelebs (clean)  2020  22,823,433   719,613        31.7

  1. Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
  2. Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. ECCV, 2016.
  3. Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. FG, 2018.
  4. Jiajiong Cao, Yingming Li, and Zhongfei Zhang. Celeb-500k: A large training dataset for face recognition. ICIP, 2018.

For more information, please refer to the paper and supplementary material.
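The last column of the table above is the number of photos divided by the number of subjects. As a quick sanity check, the sketch below recomputes it for the two rows that list exact counts. The function name is only illustrative and is not part of any released tooling.

```python
def photos_per_subject(num_photos: int, num_subjects: int) -> float:
    """Average number of photos per subject, rounded to one decimal place."""
    return round(num_photos / num_subjects, 1)


# Rows with exact counts in the table above.
print(photos_per_subject(22_823_433, 719_613))  # MillionCelebs (clean) -> 31.7
print(photos_per_subject(494_414, 10_575))      # CASIA-WebFace -> 46.8
```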

Results

A ResNet-124 network is trained on MillionCelebs with the state-of-the-art SFace face recognition loss function, using a batch size of 600 on 4 Titan X GPUs. The table shows that training on MillionCelebs reaches state of the art on the IJB-C benchmark. The model will be available soon.

IJB-C                  1e-6    1e-5    1e-4    1e-3    1e-2    1e-1
CASIA-WebFace          48.16%  68.04%  80.71%  89.34%  95.18%  98.32%
MS-RetinaFace-t1 [1]   88.62%  94.37%  96.36%  97.57%  98.49%  99.04%
MillionCelebs (clean)  92.57%  96.50%  97.49%  98.27%  98.73%  99.12%

  1. Jiankang Deng, Jia Guo, Debing Zhang, Yafeng Deng, Xiangju Lu, Song Shi. Lightweight face recognition challenge. ICCV Workshops, 2019.
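This page does not name the metric in the table above. IJB-C 1:1 verification results at operating points such as 1e-6 to 1e-1 are conventionally reported as the true accept rate (TAR) at a fixed false accept rate (FAR). Assuming that reading, the following is a minimal sketch of how such a number can be computed from similarity scores. The function name, the score lists, and the tie handling are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def tar_at_far(genuine: Sequence[float], impostor: Sequence[float], far: float) -> float:
    """True accept rate at the threshold that admits at most `far` of impostor pairs.

    genuine:  similarity scores of same-identity pairs
    impostor: similarity scores of different-identity pairs
    A pair is accepted when its score is strictly above the threshold.
    """
    ranked = sorted(impostor, reverse=True)
    allowed = int(far * len(ranked))  # impostor pairs allowed to pass
    # Using the (allowed + 1)-th highest impostor score as the threshold lets
    # exactly `allowed` impostor pairs exceed it (ignoring ties).
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine if score > threshold)
    return accepted / len(genuine)


# Toy scores, for illustration only.
genuine_scores = [0.95, 0.85, 0.75, 0.3]
impostor_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
print(tar_at_far(genuine_scores, impostor_scores, far=0.1))
```

Very low operating points such as 1e-6 are only meaningful when there are far more than one million impostor pairs. With fewer pairs, the threshold falls back to the single highest impostor score.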

Reference

If you use this dataset in your research, please kindly cite our work as:


Yaobin Zhang, Weihong Deng, Mei Wang, Jiani Hu, Xian Li, Dongyue Zhao, and Dongchao Wen. Global-Local GCN: Large-Scale Label Noise Cleansing for Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7731–7740, 2020. [pdf][supp]

BibTeX entry:
@inproceedings{zhang2020global,
	title={Global-Local GCN: Large-Scale Label Noise Cleansing for Face Recognition},
	author={Zhang, Yaobin and Deng, Weihong and Wang, Mei and Hu, Jiani and Li, Xian and Zhao, Dongyue and Wen, Dongchao},
	booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
	pages={7731--7740},
	year={2020}
}

Download the database

The full dataset is too large to host on an online drive, so we release a cleaned version on OneDrive.

There are 14 tar.gz files and one lmk.tsv file. After downloading all files, extract the 14 archives sequentially. This will create folders in "/images", each of which represents one class. The lmk.tsv file records the face bounding box and landmark information. For example, the following line in lmk.tsv

"0/0.jpg 6 13 99 129 32 64 74 52 60 80 46 104 81 94"

gives the bounding box (upper-left and lower-right corners) and five facial landmarks (left eye, right eye, nose, left corner of mouth, and right corner of mouth) of the face image "images/0/0.jpg".
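The extraction step and the lmk.tsv layout described above can be sketched in Python as follows. This is a minimal sketch under several assumptions that this page does not confirm: the archives match the pattern "*.tar.gz" in a single download folder, fields are separated by whitespace (tabs or spaces), all coordinates are integers as in the example, and each point is stored as an (x, y) pair. The names `FaceRecord`, `parse_lmk_line`, and `extract_archives` are hypothetical helpers, not part of the release.

```python
import tarfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple

# Landmark order as described for lmk.tsv.
LANDMARK_NAMES = ("left_eye", "right_eye", "nose", "mouth_left", "mouth_right")


@dataclass
class FaceRecord:
    path: str                              # image path relative to "images/"
    box: Tuple[int, ...]                   # upper-left corner, then lower-right corner
    landmarks: Dict[str, Tuple[int, int]]  # landmark name -> coordinate pair


def parse_lmk_line(line: str) -> FaceRecord:
    """Parse one lmk.tsv line: path, 4 box values, 5 landmark coordinate pairs."""
    fields = line.strip().strip('"').split()  # split on tabs or spaces
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    values = [int(v) for v in fields[1:]]  # assumption: integer coordinates
    points = values[4:]
    # Assumption: each landmark is stored as two consecutive values (x, y).
    landmarks = {
        name: (points[2 * i], points[2 * i + 1])
        for i, name in enumerate(LANDMARK_NAMES)
    }
    return FaceRecord(path=fields[0], box=tuple(values[:4]), landmarks=landmarks)


def extract_archives(download_dir, out_dir) -> List[Path]:
    """Extract every *.tar.gz archive in download_dir, one after another, in name order."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    archives = sorted(Path(download_dir).glob("*.tar.gz"))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
    return archives


record = parse_lmk_line("0/0.jpg 6 13 99 129 32 64 74 52 60 80 46 104 81 94")
print(record.path, record.box, record.landmarks["nose"])
```

Under the (x, y) assumption, all five landmarks in the example line fall inside the bounding box, which is consistent with that reading of the field order.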

The release of the complete noisy dataset is still being planned.


Contact

Please contact Yaobin Zhang (zhangyaobin@bupt.edu.cn) and Weihong Deng for questions about the database.