What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?

I know gemv is designed for matrix-vector multiplication but why can't I use gemm for vector-matrix multiplication if it takes less time, especially for sparse matrices

A short representative code is given below. It asks to enter a value, and then populates a vector with that value. It then replaces every 32nd value with its index. So, if we enter '0' then we get a sparse vector but for other values we get a dense vector.

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
#include <cublas_v2.h>
using namespace std;
int main()
{
const int m = 5000;
timespec blas_start, blas_end;
long totalnsec; double totalsec, totaltime;
int i, j;
float *A = new float[m]; float *B = new float[m*m]; float *C = new float[m];
float input;
cout << "Enter a value to populate the vector (0 for sparse) ";
cin >> input;
for(i = 0; i < m; i++)
{
A[i] = input;
if( i % 32 == 0) A[i] = i;
}
for(i = 0; i < m; i++)
for(j = 0; j < m; j++)
B[i*m + j] = (i==j);
clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
clock_gettime(CLOCK_REALTIME, &blas_end);
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"Duration = "<< totaltime << "\n";
return 0;
}

When I run this code in Ubuntu 14.04, I get the following results

erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) **5**
Duration = **0.0291558**
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) **0**
Duration = **0.000959521**

showing that cblas_sgemm call for sparse matrices is much much more efficient than the same call for dense matrices. What could be the reason?

**What I have tried:**
I have already tested the output and it is correct