How to Combine Sets in Python

A practical example of using sets to reconcile period-over-period movement for an insurance policy

In this article, I will go through various methods to join sets as well as an example of how these techniques may be used to explain year over year policyholder movement for an insurance policy.

Consider an insurance policy which has policyholders changing from year to year. What if we need to find out how many new policies came in this year? Or how many policies were terminated from the prior year? Typically, these questions can be answered using formulas in Microsoft Excel. Perhaps there is a better way…

In this example, we have the following policyholder ID’s from 2021-2023, as well as a list of policies which are included on the client’s terminated policy list. For each year, we will be using sets to find:

  • has_duplicates: the number of duplicate policy ID’s

  • new_policies: policy ID’s for new policyholders

  • retained_policies: policy ID’s for policyholders who remained active from prior year

  • dropped_policies: policy ID’s for policyholders who terminated their policies during the year

  • terminated_variance: policy ID’s that do not match between the calculated dropped_policies and list of expected terminated_policies

start_year = 2021

policy_ids = {
    2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
    2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
    2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}

terminated_policies = {
    2022: ['PL-06', 'PL-07', 'PL-08', 'PL-09'],
    2023: ['PL-06', 'PL-01', 'PL-04']
}

has_duplicates = {2021: set(), 2022: set(), 2023: set()}
new_policies = {2022: set(), 2023: set()}
retained_policies = {2022: set(), 2023: set()}
dropped_policies = {2022: set(), 2023: set()}
terminated_variance = {2021: set(), 2022: set(), 2023: set()}

Find all items in set 1 and set 2

Mathematical Notation:`

$${set_1\cup set_2}$$

In Python, there are 2 ways to write this:

set_1 | set_2

or, equivalently:

set_1.union(set_2)

In our insurance example, we can use this notation to find all policyholder ID’s from 2021-2023.

all_policies = set(policy_ids[2021]) | set(policy_ids[2022]) | set(policy_ids[2023])
print(f"policies for all years: {all_policies}")
policies for all years: {'PL-04', 'PL-07', 'PL-17', 'PL-11', 'PL-01', 'PL-08', 
'PL-16', 'PL-18', 'PL-10', 'PL-14', 'PL-15', 'PL-06', 'PL-09'}

Find all items in set 1, that are not in set 2

Mathematical Notation:

$$set_1 /\ set_2$$

In Python, there are 2 ways to write this:

set_1 - set_2

or, equivalently:

set_1.difference(set_2)

In our example, we can use this notation to find:

  • Number of duplicate policies in a single year

    • len(policy_ids) → count of policy ID list

    • len(set(policy_ids)) → count of policy ID list after removing duplicates

  • New policies for a given year = policies included in current year that were not included in prior year

  • Dropped policies for a given year = policies included in prior year that are not included in current year

policy_ids = {
    2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
    2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
    2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}
for year in policy_ids:
    duplicate_count[year] = len(policy_ids[year]) - len(set(policy_ids[year]))
    print(f"duplicate count: {duplicate_count[year]}")

    if year != start_year:
        new_policies[year] = set(policy_ids[year]) - set(policy_ids[year - 1])
        print(f"new policies: {new_policies[year]}")

        dropped_policies[year] = set(policy_ids[year - 1]) - set(policy_ids[year])
        print(f"dropped policies: {dropped_policies[year]}")
2022 policy checks
------------------------------------------------------
duplicate policies: 1
new policies: {'PL-10', 'PL-09'}
retained policies: {'PL-04', 'PL-01', 'PL-06'}
dropped policies: {'PL-08', 'PL-07'}
terminated variance: {'PL-09', 'PL-06'}

Find all items that are in both set 1 and set 2

Mathematical Notation:`

$$set_1 \cap set_2$$

In Python, there are 2 ways to write this:

set_1 & set_2

or, equivalently:

set_1.intersection(set_2)

In our example, we can use this notation to find the policy ID’s for policyholders who have persisted from the prior to current year:

policy_ids = {
    2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
    2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
    2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}
for year in policy_ids:
    if year != start_year:
        retained_policies[year] = set(policy_ids[year - 1]) & set(policy_ids[year])
        print(f"retained policies: {retained_policies[year]}")
2023 policy checks
------------------------------------------------------
duplicate policies: 0
new policies: {'PL-11', 'PL-14', 'PL-17', 'PL-15', 'PL-18', 'PL-16'}
retained policies: {'PL-10', 'PL-04', 'PL-09'}
dropped policies: {'PL-01', 'PL-06'}
terminated variance: {'PL-04'}

Find all items in set 1 or set 2, but not both

Mathematical Notation:

$$(set_1 \cup set_2) - (set_1 \cap set_2)$$

In Python, there are 2 ways to write this:

set_1 ^ set_2

or, equivalently:

set_1.symmetric_difference(set_2)

From our example, we can use this notation to determine if there are any differences between our expected list of terminated policies and our calculated list:

policy_ids = {
    2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
    2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
    2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}

terminated_policies = {
    2022: ['PL-06', 'PL-07', 'PL-08', 'PL-09'],
    2023: ['PL-06', 'PL-01', 'PL-04']
}
for year in policy_ids:
    if year != start_year:
        dropped_policies[year] = set(policy_ids[year - 1]) - set(policy_ids[year])
        print(f"dropped policies: {dropped_policies[year]}")

        terminated_variance = set(terminated_policies[year]) ^ set(dropped_policies[year])
        print(f"terminated variance: {terminated_variance}")

2023 policy checks
------------------------------------------------------
terminated variance: {'PL-04'}

Conclusion

Why Use Python Sets?

While this example only included a limited number of policies, a typical insurance policy can have thousands, or even millions of members. Data reconciliation is commonly handled in Microsoft Excel using formulas such as VLOOKUP or INDEX-MATCH. With large datasets, using these methods leads to poor performance in Excel and can cause the program to freeze or even crash.

Benefits of using sets in Python:

  • Memory efficient

  • Intuitive and straight-forward syntax

    • Would take more code to achieve the same results using lists or list comprehension
  • Very fast performance compared to Microsoft Excel

    • Excel does not have the resources to handle large datasets and has a limit to the number of rows that can be held in a single workbook
  • Low time complexity

Appendix

Full code for insurance example:

start_year = 2021

policy_ids = {
    2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
    2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
    2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}

terminated_policies = {
    2022: ['PL-06', 'PL-07', 'PL-08', 'PL-09'],
    2023: ['PL-06', 'PL-01', 'PL-04']
}

has_duplicates = {2021: set(), 2022: set(), 2023: set()}
new_policies = {2022: set(), 2023: set()}
retained_policies = {2022: set(), 2023: set()}
dropped_policies = {2022: set(), 2023: set()}
terminated_variance = {2021: set(), 2022: set(), 2023: set()}

for year in policy_ids:
    print("******************************************************")
    print(f"{year} policy checks")
    print("------------------------------------------------------")
    has_duplicates[year] = len(policy_ids[year]) - len(set(policy_ids[year]))
    print(f"duplicate policies: {has_duplicates[year]}")

    if year != start_year:
        new_policies[year] = set(policy_ids[year]) - set(policy_ids[year - 1])
        print(f"new policies: {new_policies[year]}")

        retained_policies[year] = set(policy_ids[year - 1]) & set(policy_ids[year])
        print(f"retained policies: {retained_policies[year]}")

        dropped_policies[year] = set(policy_ids[year - 1]) - set(policy_ids[year])
        print(f"dropped policies: {dropped_policies[year]}")

        terminated_variance = set(terminated_policies[year]) ^ set(dropped_policies[year])
        print(f"terminated variance: {terminated_variance}")

    print("")

print("******************************************************")
all_policies = set(policy_ids[2021]) | set(policy_ids[2022]) | set(policy_ids[2023])
print(f"policies for all years: {all_policies}")
******************************************************
2021 policy checks
------------------------------------------------------
duplicate policies: 0

******************************************************
2022 policy checks
------------------------------------------------------
duplicate policies: 1
new policies: {'PL-10', 'PL-09'}
retained policies: {'PL-04', 'PL-01', 'PL-06'}
dropped policies: {'PL-08', 'PL-07'}
terminated variance: {'PL-09', 'PL-06'}

******************************************************
2023 policy checks
------------------------------------------------------
duplicate policies: 0
new policies: {'PL-11', 'PL-14', 'PL-17', 'PL-15', 'PL-18', 'PL-16'}
retained policies: {'PL-10', 'PL-04', 'PL-09'}
dropped policies: {'PL-01', 'PL-06'}
terminated variance: {'PL-04'}

******************************************************
policies for all years: {'PL-04', 'PL-09', 'PL-11', 'PL-14', 'PL-17', 'PL-08', 'PL-16', 'PL-07', 'PL-06', 'PL-10', 'PL-15', 'PL-01', 'PL-18'}

References:

https://www.w3schools.com/python/python_sets_join.asp

https://wiki.python.org/moin/TimeComplexity