Thursday, December 19, 2024

How to Calculate Jaccard Similarity in R

The Jaccard similarity index measures the similarity between two sets of data. It ranges from 0 to 1: a value of 0 means the sets share no elements, a value of 1 means they are identical, and higher values indicate greater similarity.

The Jaccard similarity index is calculated as:

Jaccard Similarity = (number of observations in both sets) / (number in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|
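To make the notation concrete, here is a minimal sketch in R that evaluates the formula step by step on two small made-up sets. The values and the names A, B, n_both, n_either, and J are illustrative assumptions, not data from this tutorial.

```r
# Two small illustrative sets (assumed values)
A <- c(1, 2, 3)
B <- c(2, 3, 4)

# Size of the intersection: elements present in both sets
n_both <- length(intersect(A, B))    # 2 (the values 2 and 3)

# Size of the union: distinct elements present in either set
n_either <- length(union(A, B))      # 4 (the values 1, 2, 3, 4)

# J(A, B) = size of intersection / size of union
J <- n_both / n_either
J                                    # 2 / 4 = 0.5
```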

This tutorial explains how to calculate Jaccard Similarity for two sets of data in R.

Example: Jaccard Similarity in R

Suppose we have the following two sets of data:

a 
b 
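As a concrete stand-in for a and b, the sketch below uses two illustrative numeric vectors. The values are an assumption, chosen only so that the overlap (4 shared values out of 10 distinct values) matches the 0.4 reported later in this example.

```r
# Illustrative vectors (assumed values, not necessarily the original data)
a <- c(0, 1, 2, 5, 6, 8, 9)
b <- c(0, 2, 3, 4, 5, 7, 9)

# Values that appear in both vectors
shared <- intersect(a, b)
shared

# Distinct values that appear in either vector
combined <- union(a, b)
length(shared) / length(combined)
```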

We can define the following function to calculate the Jaccard Similarity between the two sets:

#define Jaccard Similarity function (assumes a and b contain no repeated values)
jaccard <- function(a, b) {
    intersection <- length(intersect(a, b))
    union <- length(a) + length(b) - intersection
    return(intersection / union)
}

#find Jaccard Similarity between the two sets 
jaccard(a, b)

[1] 0.4

The Jaccard Similarity between the two sets is 0.4.
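One caveat: the function above computes the size of the union as length(a) + length(b) - intersection, which is only correct when neither vector contains repeated values. A possible duplicate-safe variant is sketched below. The name jaccard_set is hypothetical, and returning NA for two empty inputs is a design assumption rather than something this tutorial specifies.

```r
# Duplicate-safe Jaccard similarity (hypothetical helper name)
jaccard_set <- function(a, b) {
    # Treat each input as a set by dropping repeated values
    a <- unique(a)
    b <- unique(b)
    n_union <- length(union(a, b))
    # Two empty sets give 0/0, so return NA by assumption
    if (n_union == 0) return(NA_real_)
    length(intersect(a, b)) / n_union
}

# Repeated values no longer inflate the denominator:
# 1 shared value (2) out of 3 distinct values (1, 2, 3)
jaccard_set(c(1, 1, 2), c(2, 3))
```

For inputs without repeated values, this variant gives the same result as the original function.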

Note that the function will return 0 if the two sets don’t share any values:

c 

And the function will return 1 if the two sets are identical:

e 
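Both edge cases can be checked with a short sketch. The vectors below are illustrative assumptions (named set_c through set_f here to avoid shadowing R's built-in c function), and the jaccard definition repeats the one from the example above so the block runs on its own.

```r
# Same definition as in the example above
jaccard <- function(a, b) {
    intersection <- length(intersect(a, b))
    union <- length(a) + length(b) - intersection
    return(intersection / union)
}

# Disjoint sets (assumed values): no shared elements
set_c <- c(0, 1, 2, 3)
set_d <- c(4, 5, 6)
jaccard(set_c, set_d)   # 0 / 7 = 0

# Identical sets (assumed values): every element is shared
set_e <- c(0, 1, 2, 3)
set_f <- c(0, 1, 2, 3)
jaccard(set_e, set_f)   # 4 / 4 = 1
```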

The function also works for sets that contain strings:

g <- c('cat', 'dog', 'hippo', 'monkey')
h <- c('monkey', 'rhino', 'ostrich', 'salmon')

jaccard(g, h)

[1] 0.1428571

You can also use this function to find the Jaccard distance, which measures the dissimilarity between two sets and is calculated as 1 – Jaccard Similarity.

#find Jaccard distance between sets a and b
1 - jaccard(a, b)

[1] 0.6
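If the distance is needed often, the subtraction can be wrapped in its own function. The sketch below is one way to do that. The name jaccard_distance is hypothetical, and the jaccard definition is repeated from the example above so the block is self-contained.

```r
# Same definition as in the example above
jaccard <- function(a, b) {
    intersection <- length(intersect(a, b))
    union <- length(a) + length(b) - intersection
    return(intersection / union)
}

# Jaccard distance = 1 - Jaccard similarity (hypothetical wrapper name)
jaccard_distance <- function(a, b) {
    1 - jaccard(a, b)
}

# The string sets from the earlier example share 1 of 7 distinct values,
# so their distance is 1 - 1/7
g <- c('cat', 'dog', 'hippo', 'monkey')
h <- c('monkey', 'rhino', 'ostrich', 'salmon')
jaccard_distance(g, h)
```

By construction, the similarity and the distance of any pair of sets sum to 1.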

Refer to the Wikipedia article on the Jaccard index to learn more about the Jaccard Similarity Index.
