On this page, we present the MillionCelebs face recognition dataset, which contains face images of one million celebrities.
The MillionCelebs face dataset was collected from an Internet image search engine according to the Freebase celebrity name list released along with MS1M. After preprocessing, 87.0M face images of 1M identities were collected. The figure shows detailed demographic statistics. Unlike many celebrity datasets, in which most identities are actors, MillionCelebs covers a wide range of professions. With this abundant statistical information, MillionCelebs has been used to select subsets that meet specific research needs, such as studying racial bias and class imbalance in deep face recognition.
The dataset was first published with the paper Global-Local GCN: Large-Scale Label Noise Cleansing for Face Recognition, and was further refined over the following year. After merging it with the GlintAsia dataset, we built a cleaned version with 22.8M images of 719.6K identities.
Datasets | Year | # of photos | # of subjects | # photos per subject |
---|---|---|---|---|
CASIA-WebFace [1] | 2014 | 494,414 | 10,575 | 46.8 |
MS-Celeb-1M [2] | 2016 | 10M | 100,000 | 100 |
VGGFace2 [3] | 2018 | 3.31M | 9,131 | 362.6 |
Celeb-500k [4] | 2018 | 50M | 500K | 100 |
Celeb-500k (clean) | 2018 | 25M | 356K | 70.2 |
MillionCelebs | 2020 | 87M | 1M | 87 |
MillionCelebs (clean) | 2020 | 22,823,433 | 719,613 | 31.7 |
For more information, please refer to the paper and supplementary material.
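The last column of the table above is simply the number of photos divided by the number of subjects. As a quick sanity check, the following sketch recomputes it for the two rows that list exact counts; the function name `photos_per_subject` is only illustrative.

```python
# Recompute the "# photos per subject" column for rows with exact counts.
# The counts are copied from the table above; the helper name is illustrative.
EXACT_ROWS = {
    "CASIA-WebFace": (494_414, 10_575),
    "MillionCelebs (clean)": (22_823_433, 719_613),
}


def photos_per_subject(num_photos: int, num_subjects: int) -> float:
    """Return the average number of photos per subject, rounded to 0.1."""
    if num_subjects <= 0:
        raise ValueError("num_subjects must be positive")
    return round(num_photos / num_subjects, 1)


if __name__ == "__main__":
    for name, (photos, subjects) in EXACT_ROWS.items():
        print(name, photos_per_subject(photos, subjects))
```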
A face recognition model was trained on MillionCelebs with the state-of-the-art SFace loss function, using the ResNet-124 network architecture with a batch size of 600 on 4 Titan X GPUs. The table below shows that the model trained on MillionCelebs (clean) reaches state-of-the-art performance on the IJB-C benchmark. The model will be available soon.
IJB-C | 1e-6 | 1e-5 | 1e-4 | 1e-3 | 1e-2 | 1e-1 |
---|---|---|---|---|---|---|
CASIA-WebFace | 48.16% | 68.04% | 80.71% | 89.34% | 95.18% | 98.32% |
MS-RetinaFace-t11 | 88.62% | 94.37% | 96.36% | 97.57% | 98.49% | 99.04% |
MillionCelebs (clean) | 92.57% | 96.50% | 97.49% | 98.27% | 98.73% | 99.12% |
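
The column headers of the table above read naturally as false accept rates (FAR) and the entries as true accept rates (TAR), which is the usual way IJB-C 1:1 verification results are reported. The page does not state this explicitly, so treat it as an assumption. Under that reading, a minimal sketch of how a TAR at a fixed FAR can be computed from similarity scores looks like the following. The function name and the toy scores are hypothetical, and this is not the official IJB-C evaluation code.

```python
from typing import Sequence


def tar_at_far(genuine: Sequence[float], impostor: Sequence[float], far: float) -> float:
    """Return the fraction of genuine pairs accepted at a given FAR.

    genuine:  similarity scores of same-identity pairs
    impostor: similarity scores of different-identity pairs
    far:      target false accept rate in [0, 1]

    A pair is accepted when its score is strictly above the threshold, and
    the threshold is chosen so that at most far * len(impostor) impostor
    pairs are accepted.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= far <= 1.0:
        raise ValueError("far must lie in [0, 1]")

    ranked = sorted(impostor, reverse=True)
    allowed = int(far * len(ranked))  # impostor pairs we may accept
    # At most `allowed` impostor scores are strictly greater than ranked[allowed].
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine if score > threshold)
    return accepted / len(genuine)


if __name__ == "__main__":
    # Toy scores, only to show the call pattern.
    genuine_scores = [0.9, 0.8, 0.4, 0.95]
    impostor_scores = [0.1, 0.2, 0.3, 0.5, 0.6, 0.05, 0.15, 0.25, 0.35, 0.45]
    print(tar_at_far(genuine_scores, impostor_scores, far=0.1))
```

Very small FARs such as 1e-6 only become meaningful when there are millions of impostor pairs, which is why the leftmost columns are the hardest operating points.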
@inproceedings{zhang2020global,
  title={Global-Local GCN: Large-Scale Label Noise Cleansing for Face Recognition},
  author={Zhang, Yaobin and Deng, Weihong and Wang, Mei and Hu, Jiani and Li, Xian and Zhao, Dongyue and Wen, Dongchao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7731--7740},
  year={2020}
}
The full dataset is too large to host on the online disk, so we release the cleaned version on OneDrive.
There are 14 tar.gz files and one lmk.tsv file. After downloading all files, extract the 14 archives sequentially. This creates folders under "/images", and every folder represents one class. The lmk.tsv file records the face bounding box and landmark information. For example, one line in lmk.tsv
"0/0.jpg 6 13 99 129 32 64 74 52 60 80 46 104 81 94"
gives the bounding box (upper-left and lower-right corners) and the five facial landmarks (left eye, right eye, nose, left corner of the mouth, and right corner of the mouth) of the face image "images/0/0.jpg".
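The download and annotation format described above can be handled with a few lines of Python. The sketch below rests on several assumptions that the page does not confirm: each tar.gz file is an independent archive (not one part of a split archive), the fields in lmk.tsv are separated by whitespace or tabs, the bounding box is stored as x1 y1 x2 y2, and the landmarks are stored as interleaved x y pairs in the order listed above. The names `extract_archives`, `parse_lmk_line`, and `FaceAnnotation` are hypothetical helpers, not part of any released tooling.

```python
import tarfile
from pathlib import Path
from typing import List, NamedTuple, Tuple

# Landmark order as described in the text above.
LANDMARK_NAMES = ("left_eye", "right_eye", "nose", "mouth_left", "mouth_right")


class FaceAnnotation(NamedTuple):
    image_path: str                          # e.g. "images/0/0.jpg"
    bbox: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)
    landmarks: List[Tuple[float, float]]     # assumed five (x, y) points


def extract_archives(download_dir, target_dir) -> List[Path]:
    """Extract every *.tar.gz file in download_dir, one after another.

    Archives are processed in sorted file-name order. Returns the list of
    archives that were extracted.
    """
    archives = sorted(Path(download_dir).glob("*.tar.gz"))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=target_dir)
    return archives


def parse_lmk_line(line: str, image_root: str = "images") -> FaceAnnotation:
    """Parse one lmk.tsv line into a FaceAnnotation.

    Expected layout: relative path, 4 bounding-box values, 10 landmark values.
    """
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    values = [float(v) for v in fields[1:]]
    bbox = (values[0], values[1], values[2], values[3])
    coords = values[4:]
    landmarks = [(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)]
    return FaceAnnotation(f"{image_root}/{fields[0]}", bbox, landmarks)


if __name__ == "__main__":
    example = "0/0.jpg 6 13 99 129 32 64 74 52 60 80 46 104 81 94"
    annotation = parse_lmk_line(example)
    print(annotation.image_path, annotation.bbox)
    for name, point in zip(LANDMARK_NAMES, annotation.landmarks):
        print(name, point)
```

Parsing the values as floats keeps the sketch tolerant of fractional coordinates, although the example line contains only integers.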
The release of the complete noisy dataset is still being planned.