Simple matching coefficient
The Simple Matching Coefficient (SMC) is a statistic used for comparing the similarity and diversity of sample sets.[1]
|       | A = 0    | A = 1    |
|-------|----------|----------|
| B = 0 | $M_{00}$ | $M_{10}$ |
| B = 1 | $M_{01}$ | $M_{11}$ |
Given two objects, A and B, each with n binary attributes, the SMC is defined as:

$$\mathrm{SMC} = \frac{\text{number of matching attributes}}{\text{number of attributes}} = \frac{M_{00} + M_{11}}{M_{00} + M_{01} + M_{10} + M_{11}}$$
Where:
- $M_{11}$ represents the total number of attributes where A and B both have a value of 1.
- $M_{01}$ represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
- $M_{10}$ represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
- $M_{00}$ represents the total number of attributes where A and B both have a value of 0.
The Simple Matching Distance (SMD), which measures dissimilarity between sample sets, is given by $\mathrm{SMD} = 1 - \mathrm{SMC}$.[2]
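The definitions above translate directly into code. The following is a minimal Python sketch (the function names `smc` and `smd` are illustrative, not from any particular library) that counts matching attributes of two equal-length binary vectors:

```python
def smc(a, b):
    """Simple Matching Coefficient of two equal-length binary sequences."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of attributes")
    # Count both kinds of matches: 1/1 and 0/0.
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return (m11 + m00) / len(a)

def smd(a, b):
    """Simple Matching Distance: the complement of the SMC."""
    return 1 - smc(a, b)

# Example: the two vectors agree on 3 of their 5 attributes.
a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]
print(smc(a, b))  # 0.6
print(smd(a, b))  # 0.4
```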
Difference with the Jaccard index
The SMC is very similar to the more popular Jaccard index. The main difference is that the SMC has the term $M_{00}$ in its numerator and denominator, whereas the Jaccard index does not. Thus, the SMC compares the number of matches against the entire set of possible attributes, whereas the Jaccard index compares them only against the attributes that are present in at least one of the two objects.
In market basket analysis, for example, the baskets of the two customers being compared might each contain only a small fraction of all the products available in the store, so $M_{00}$ is very large and the SMC reports high similarity even for baskets that share few purchases. The similarity measured by the SMC is then driven mostly by the many products that neither customer bought rather than by their common purchases, making the Jaccard index a better measure of similarity in that context.
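As a hypothetical illustration of this effect (the store size and basket counts below are made up for the example), consider 1,000 products and two customers who each buy 10 items, only 2 of which are shared:

```python
# Hypothetical store with 1,000 products; each customer buys 10 items, sharing only 2.
n = 1000
m11, m10, m01 = 2, 8, 8           # shared purchases, purchases unique to each customer
m00 = n - (m11 + m10 + m01)       # products bought by neither customer (982)

smc = (m11 + m00) / n                  # 0.984 -- dominated by shared non-purchases
jaccard = m11 / (m11 + m10 + m01)      # ~0.111 -- reflects overlap of actual purchases
print(smc, jaccard)
```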
In other contexts, where 0 and 1 carry equivalent information (symmetry), the SMC is a better measure of similarity. For example, vectors of demographic variables stored as dummy variables, such as gender, are better compared with the SMC than with the Jaccard index, since the impact of gender on similarity should be the same regardless of whether male is coded as 0 and female as 1 or the other way around. However, with symmetric dummy variables one can replicate the behaviour of the SMC by splitting each dummy into two binary attributes (in this case, male and female), thus transforming them into asymmetric attributes and allowing the Jaccard index to be used without introducing a bias. With this trick, the Jaccard index can be seen as subsuming the SMC, making the latter formally redundant. The SMC remains, however, more computationally efficient in the case of symmetric dummy variables, since it does not require adding extra dimensions.
In general, the Jaccard index can be seen as an indicator of local "similarity", while the SMC evaluates "similarity" relative to the whole "universe" of attributes. Similarity and dissimilarity must be understood in a relative sense. For example, if there are only 2 attributes (x, y), then A = (1,0) is intuitively very different from B = (0,1). However, if there are 10 attributes in the "universe", A = (1,0,0,0,0,0,0,0,0,0) and B = (0,1,0,0,0,0,0,0,0,0) no longer seem so different, since the remaining 8 attributes on which they agree are often regarded as redundant when the focus is only on A and B. As a result, A and B are very different in a "local" sense (which the Jaccard index captures) but much less different in a "global" sense (which the SMC captures). From this point of view, the choice between the SMC and the Jaccard index involves more than the symmetry or asymmetry of the information in the attributes; the distribution of sets in the defined "universe" and the nature of the problem being modeled should also be considered.
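A quick numerical check of this example, using straightforward Python implementations of the two coefficients (the helper names are illustrative):

```python
def smc(a, b):
    # Fraction of attributes on which the two vectors agree (1/1 or 0/0).
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    # Shared 1s divided by attributes present in at least one vector.
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return m11 / union if union else 0.0

# Two-attribute universe: A and B disagree everywhere.
print(smc([1, 0], [0, 1]), jaccard([1, 0], [0, 1]))   # 0.0 0.0

# Ten-attribute universe: the same disagreement, diluted by eight shared zeros.
a = [1, 0] + [0] * 8
b = [0, 1] + [0] * 8
print(smc(a, b), jaccard(a, b))                        # 0.8 0.0
```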
The Jaccard index is also more general than the SMC and can be used to compare data types other than vectors of binary attributes, such as probability measures.