How to Combine Sets in Python
A practical example of using sets to reconcile period-over-period movement for an insurance policy
In this article, I will go through various methods to join sets as well as an example of how these techniques may be used to explain year over year policyholder movement for an insurance policy.
Consider an insurance policy which has policyholders changing from year to year. What if we need to find out how many new policies came in this year? Or how many policies were terminated from the prior year? Typically, these questions can be answered using formulas in Microsoft Excel. Perhaps there is a better way…
In this example, we have the following policyholder ID’s from 2021-2023, as well as a list of policies which are included on the client’s terminated policy list. For each year, we will be using sets to find:
has_duplicates: the number of duplicate policy ID’s
new_policies: policy ID’s for new policyholders
retained_policies: policy ID’s for policyholders who remained active from prior year
dropped_policies: policy ID’s for policyholders who terminated their policies during the year
terminated_variance: policy ID’s that do not match between the calculated dropped_policies and list of expected terminated_policies
start_year = 2021
policy_ids = {
2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}
terminated_policies = {
2022: ['PL-06', 'PL-07', 'PL-08', 'PL-09'],
2023: ['PL-06', 'PL-01', 'PL-04']
}
has_duplicates = {2021: set(), 2022: set(), 2023: set()}
new_policies = {2022: set(), 2023: set()}
retained_policies = {2022: set(), 2023: set()}
dropped_policies = {2022: set(), 2023: set()}
terminated_variance = {2021: set(), 2022: set(), 2023: set()}
Find all items in set 1 and set 2
Mathematical Notation:`
$${set_1\cup set_2}$$
In Python, there are 2 ways to write this:
set_1 | set_2
or, equivalently:
set_1.union(set_2)
In our insurance example, we can use this notation to find all policyholder ID’s from 2021-2023.
all_policies = set(policy_ids[2021]) | set(policy_ids[2022]) | set(policy_ids[2023])
print(f"policies for all years: {all_policies}")
policies for all years: {'PL-04', 'PL-07', 'PL-17', 'PL-11', 'PL-01', 'PL-08',
'PL-16', 'PL-18', 'PL-10', 'PL-14', 'PL-15', 'PL-06', 'PL-09'}
Find all items in set 1, that are not in set 2
Mathematical Notation:
$$set_1 /\ set_2$$
In Python, there are 2 ways to write this:
set_1 - set_2
or, equivalently:
set_1.difference(set_2)
In our example, we can use this notation to find:
Number of duplicate policies in a single year
len(policy_ids) → count of policy ID list
len(set(policy_ids)) → count of policy ID list after removing duplicates
New policies for a given year = policies included in current year that were not included in prior year
Dropped policies for a given year = policies included in prior year that are not included in current year
policy_ids = {
2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}
for year in policy_ids:
duplicate_count[year] = len(policy_ids[year]) - len(set(policy_ids[year]))
print(f"duplicate count: {duplicate_count[year]}")
if year != start_year:
new_policies[year] = set(policy_ids[year]) - set(policy_ids[year - 1])
print(f"new policies: {new_policies[year]}")
dropped_policies[year] = set(policy_ids[year - 1]) - set(policy_ids[year])
print(f"dropped policies: {dropped_policies[year]}")
2022 policy checks
------------------------------------------------------
duplicate policies: 1
new policies: {'PL-10', 'PL-09'}
retained policies: {'PL-04', 'PL-01', 'PL-06'}
dropped policies: {'PL-08', 'PL-07'}
terminated variance: {'PL-09', 'PL-06'}
Find all items that are in both set 1 and set 2
Mathematical Notation:`
$$set_1 \cap set_2$$
In Python, there are 2 ways to write this:
set_1 & set_2
or, equivalently:
set_1.intersection(set_2)
In our example, we can use this notation to find the policy ID’s for policyholders who have persisted from the prior to current year:
policy_ids = {
2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}
for year in policy_ids:
if year != start_year:
retained_policies[year] = set(policy_ids[year - 1]) & set(policy_ids[year])
print(f"retained policies: {retained_policies[year]}")
2023 policy checks
------------------------------------------------------
duplicate policies: 0
new policies: {'PL-11', 'PL-14', 'PL-17', 'PL-15', 'PL-18', 'PL-16'}
retained policies: {'PL-10', 'PL-04', 'PL-09'}
dropped policies: {'PL-01', 'PL-06'}
terminated variance: {'PL-04'}
Find all items in set 1 or set 2, but not both
Mathematical Notation:
$$(set_1 \cup set_2) - (set_1 \cap set_2)$$
In Python, there are 2 ways to write this:
set_1 ^ set_2
or, equivalently:
set_1.symmetric_difference(set_2)
From our example, we can use this notation to determine if there are any differences between our expected list of terminated policies and our calculated list:
policy_ids = {
2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}
terminated_policies = {
2022: ['PL-06', 'PL-07', 'PL-08', 'PL-09'],
2023: ['PL-06', 'PL-01', 'PL-04']
}
for year in policy_ids:
if year != start_year:
dropped_policies[year] = set(policy_ids[year - 1]) - set(policy_ids[year])
print(f"dropped policies: {dropped_policies[year]}")
terminated_variance = set(terminated_policies[year]) ^ set(dropped_policies[year])
print(f"terminated variance: {terminated_variance}")
2023 policy checks
------------------------------------------------------
terminated variance: {'PL-04'}
Conclusion
Why Use Python Sets?
While this example only included a limited number of policies, a typical insurance policy can have thousands, or even millions of members. Data reconciliation is commonly handled in Microsoft Excel using formulas such as VLOOKUP or INDEX-MATCH. With large datasets, using these methods leads to poor performance in Excel and can cause the program to freeze or even crash.
Benefits of using sets in Python:
Memory efficient
Intuitive and straight-forward syntax
- Would take more code to achieve the same results using lists or list comprehension
Very fast performance compared to Microsoft Excel
- Excel does not have the resources to handle large datasets and has a limit to the number of rows that can be held in a single workbook
Low time complexity
Appendix
Full code for insurance example:
start_year = 2021
policy_ids = {
2021: ['PL-01', 'PL-04', 'PL-06', 'PL-07', 'PL-08'],
2022: ['PL-01', 'PL-04', 'PL-06', 'PL-09', 'PL-09', 'PL-10'],
2023: ['PL-04', 'PL-09', 'PL-10', 'PL-11', 'PL-14', 'PL-15', 'PL-16', 'PL-17', 'PL-18']
}
terminated_policies = {
2022: ['PL-06', 'PL-07', 'PL-08', 'PL-09'],
2023: ['PL-06', 'PL-01', 'PL-04']
}
has_duplicates = {2021: set(), 2022: set(), 2023: set()}
new_policies = {2022: set(), 2023: set()}
retained_policies = {2022: set(), 2023: set()}
dropped_policies = {2022: set(), 2023: set()}
terminated_variance = {2021: set(), 2022: set(), 2023: set()}
for year in policy_ids:
print("******************************************************")
print(f"{year} policy checks")
print("------------------------------------------------------")
has_duplicates[year] = len(policy_ids[year]) - len(set(policy_ids[year]))
print(f"duplicate policies: {has_duplicates[year]}")
if year != start_year:
new_policies[year] = set(policy_ids[year]) - set(policy_ids[year - 1])
print(f"new policies: {new_policies[year]}")
retained_policies[year] = set(policy_ids[year - 1]) & set(policy_ids[year])
print(f"retained policies: {retained_policies[year]}")
dropped_policies[year] = set(policy_ids[year - 1]) - set(policy_ids[year])
print(f"dropped policies: {dropped_policies[year]}")
terminated_variance = set(terminated_policies[year]) ^ set(dropped_policies[year])
print(f"terminated variance: {terminated_variance}")
print("")
print("******************************************************")
all_policies = set(policy_ids[2021]) | set(policy_ids[2022]) | set(policy_ids[2023])
print(f"policies for all years: {all_policies}")
******************************************************
2021 policy checks
------------------------------------------------------
duplicate policies: 0
******************************************************
2022 policy checks
------------------------------------------------------
duplicate policies: 1
new policies: {'PL-10', 'PL-09'}
retained policies: {'PL-04', 'PL-01', 'PL-06'}
dropped policies: {'PL-08', 'PL-07'}
terminated variance: {'PL-09', 'PL-06'}
******************************************************
2023 policy checks
------------------------------------------------------
duplicate policies: 0
new policies: {'PL-11', 'PL-14', 'PL-17', 'PL-15', 'PL-18', 'PL-16'}
retained policies: {'PL-10', 'PL-04', 'PL-09'}
dropped policies: {'PL-01', 'PL-06'}
terminated variance: {'PL-04'}
******************************************************
policies for all years: {'PL-04', 'PL-09', 'PL-11', 'PL-14', 'PL-17', 'PL-08', 'PL-16', 'PL-07', 'PL-06', 'PL-10', 'PL-15', 'PL-01', 'PL-18'}
References: