Contents

Learning Python By Example #3: The Birthday Problem

I'm getting older, but it's better than the alternative.

My Intro

I’m improving my Python skills 1981 style by coding through The Big Book of Small Python Projects by Al Sweigart.

Here’s a direct link to read All Sweigart’s “Birthday Paradox” code, though I suggest you buy the book if you’re finding the code useful.

You got a problem with birthdays?

The problem with birthdays is that many people have the same one. I, for example, share a birthday with John Cleese. This program by Al generates a bunch of random birthdays for a group of people and then calculates the odds of there being a matching birthday.

Let’s Get Primed to Python!!

As always, I copied and pasted his opening intro comments because I get no benefit from that.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
"""Birthday Paradox Simulation, by Al Sweigart al@inventwithpython.com
Explore the surprising probabilities of the "Birthday Paradox". 
More info at https://en.wikipedia.org/wiki/Birthday_problem
View this code at https://nostarch.com/big-book-small-python-projects
Tags: short, math, simulation"""

import datetime, random


def getBirthdays(numberOfBirthdays):
  """Returns a list of [numberOfBirthdays] random date objects."""
  birthdays = []
  for i in range(numberOfBirthdays):
    # The year is unimportant so long
    # as all the birthdays have the same year.
    startOfYear = datetime.date(2001, 1, 1)

    # Get a random day into the year
    randomNumberofDays = datetime.timedelta(random.randint(0,364))
    birthday = startOfYear + randomNumberofDays
    birthdays.append(birthday)
  return birthdays

Opening Notes

No main()

The first thing I noticed was that he doesn’t define a main function in this like he did in the first script. It actually makes this simpler, but it’s not importable? The last one had a comment that the main function (and the code at the end to invoke it) was in case the code was being run from the command-line instead of being imported.

What is datetime.time()?

One of the things I’ll do as I go through this is look up a library in the Python documentation to better understand why I’m doing something and what exactly is going on.

Looking up datetime.date() shows me that it returns a date object based on the arguments.

And while there, I looked up datetime.timedelta() on the page to understand what it did.

Overall the function generates a date object for January first, then a “delta,” which is an object for an amount of time, adds them together to get our random date, then appends the random date to the list of random dates.

Playing the Match Game with Gene Rayburn

The next function iterates through the list of birthdays and checks for matches.

25
26
27
28
29
30
31
32
33
34
35
def getMatch(birthdays):
  """ Returns the date object of a birthday 
  that occurs more than once in the list. """
  if len(birthdays) == len(set(birthdays)):
    return None # All birthdays are unique
  
  # Compare each birthday to every other birthday
  for a, birthdayA in enumerate(birthdays):
    for b, birthdayB in enumerate(birthdays[ a + 1 :]):
      if birthdayA == birthdayB:
        return birthdayA

Number generating and matching notes

TIL about the colon operator

Although this code is not testable in this state yet, I will be able to test it after the next block. And at that point, I find out I missed a colon while typing in line 33.

wrong: for b, birthdayB in enumerate(birthdays[ a + 1 ]):

right: for b, birthdayB in enumerate(birthdays[ a + 1 :]):

That missing colon is very important. Without it birthdays[ a + 1 ] returns a datetime object, which is not what I want. I want to iterate through the rest of the list. That colon after a + 1 is called “the colon operator” and it’s used to slice the array from the value before it to the value after it. If the value is not provided, it just interprets that as beginning or end. So birthdays[ a + 1 :] gives you the birthdays list from item a + 1 to the end.

There’s no iterable list returned by birthdays[ a + 1 ] (without the colon), so the script broke there until I found and fixed my error.

Is this the best use of our resources?

The birthday matching algorithm is pretty simple. It runs a simple sort that returns a list of just the unique values in the list (set(birthdays)). If the list of unique birthdays is the same size as the list of birthdays, it quits and acknowledges there’s no match. If there is one, it starts going through the list, testing position X in the list against every birthday after it, but quits when it finds a match.

If I sorted the list in date order first, then it would take just one run through the list, comparing each value to the one after it. Would that be faster? That’s a variation to try out after I complete the program.

How many birthdays do you want?!

The next function gets a number of birthdays from you and then uses the functions above to check for a match.

36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
# Display the intro:
print('''Birthday Paradox, by Al Sweigart al@inventwithpython.com
The Birthday Paradox shows us that in a group of N people, the odds
that two of them have matching birthdays is surprisingly large.
This program does a Monte Carlo simulation (that is, repeated random 
simulations) to explore this concept.
(It's not actually a paradox, it's just a surprising result.)
''')

# Set up a tuple of month names in order
MONTHS = ('Jan', 'Feb','Mar','Apr','May','Jun','Jul','Aug','Sept','Oct','Nov','Dec')

while True: #Keep asking until the user enters a valid amount.
  print('How many birthdays shall I generate? (Max 100)')
  response = input('> ')
  if response.isdecimal() and (0 < int(response) <= 100):
    numBDays = int(response)
    break # user has entered a valid amount
print();

#Generate and display the birthdays
print('Here are', numBDays, "birthdays:")
birthdays = getBirthdays(numBDays);
for i, birthday in enumerate(birthdays):
    if i != 0:
      # Display a comma for each birthday after the first birthday.
      print(', ', end='')
    monthName = MONTHS[birthday.month - 1]
    dateText = '{} {}'.format(monthName, birthday.day)
    print(dateText, end='')
print()
print()  

# Determine if there are two birthdays that match
match = getMatch(birthdays)

#Display the results
print('In this simulation, ', end='')
if match != None: 
  monthName = MONTHS[match.month - 1]
  dateText = '{} {}'.format(monthName, match.day)
  print('multiple people have a birthday on', dateText)
else:
  print('there are no matching birthdays')
print()

First batch notes

Ready? Break! It’s been a while since I’ve seen this, but it’s been a while since I’ve programmed getting input from STDIN.

In lines 48-53, he uses while True to start an endless loop that will run forever unless explicitly broken. The code explicitly breaks it only after you’ve given it an integer between 1 and 100.

Early polling suggests… It runs a demo of the generation of a birthday list and searching it for matches, printing the results to show you what it did.

This is just a demo, though. In the next lines of code it will run this simulation a BUNCH more times to give you a statistic on how often matches happen.

The long slog…

Now I have the code that runs the 100,000 tests.

 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
# Run through 100,000 simulations
# Greg's note - this is where I'll want to add timing code
print('Generating', numBDays, '100,000 times...')
input('Hit enter to begin: ')

print('Running...')
simMatch = 0 #counter for matching birthdays
for i in range(100_000):
  #Report on progress every 10,000 simulations
  if i % 10_000 == 0:
    print(i, "simulations run...")
  birthdays = getBirthdays(numBDays)
  if getMatch(birthdays) != None:
    simMatch = simMatch + 1
print('100,000 simulations complete.')

# Display simulation results:
probability = round(simMatch / 100_000 * 100, 2)
print('Out of 100,000 simulations of', numBDays, 'people, there was a')
print('matching birthday in that group', simMatch, 'times. This means')
print('that', numBDays, 'people have a', probability, '% chance of')
print('having a matching birthday in their group.')
print('That\'s probably more than you would think!')

Comparing our results

Proper indenting saves lives

When I ran it the first time, I thought “wow, that’s amazingly quick.” But I got a much lower result than Al led me to expect. I was getting less than 10 matches. In fact, no matter how big I made the set, I never got more than 10 matches out of 100,000.

As I inspected the code, I noticed my error. Lines 93-95 were indented too far. They were under the if statement that was printing the progress, thus were only getting run once every 10,000 times.

When I reduced their indenting by one level, then they ran on every iteration and the program wasn’t nearly as fast, but it did give me a result that tracked with what I was led to expect.

But can I save time?

Remember I asked a few paragraphs ago about whether the birthday matching algorithm was the most efficient method. Well, here’s where I find out.

First, put in some timing code. Since I’re already importing datetime, I’m grabbing a timestamp from it, though that syntax is a little more complex than if I used the time library.

I’ll insert the line: start_time = datetime.datetime.now().timestamp() at line 88 so I start the stopwatch right before I start running.

Then, at the end of the script, I’ll add:

106
107
108
109
end_time = datetime.datetime.now().timestamp()
the_time = (end_time - start_time)
print();
print('This run took', str(the_time), 'seconds.')

And I get…

 1How many birthdays shall I generate? (Max 100)
 2> 40
 3
 4Here are 40 birthdays:
 5Oct 11, Dec 28, Jun 28, Sept 30, Feb 19, Sept 1, Feb 21, May 13, Aug 12, 
 6Jan 18, Aug 2, Mar 25, Jun 21, Dec 2, May 12, Aug 24, Mar 13, Apr 15, 
 7Mar 18, Aug 30, Jan 16, May 24, Mar 14, Jul 26, Feb 6, Oct 2, Jun 17, 
 8Apr 18, Aug 31, Aug 16, Nov 29, May 10, Mar 18, Jun 28, Apr 30, May 7, 
 9Apr 26, Nov 18, May 19, Dec 5
10
11In this simulation, multiple people have a birthday on Jun 
1228
13
14Generating 40 100,000 times...
15Hit enter to begin:
16Running...
170 simulations run...
1810000 simulations run...
1920000 simulations run...
2030000 simulations run...
2140000 simulations run...
2250000 simulations run...
2360000 simulations run...
2470000 simulations run...
2580000 simulations run...
2690000 simulations run...
27100,000 simulations complete.
28Out of 100,000 simulations of 40 people, there was a       
29matching birthday in that group 89125 times. This means    
30that 40 people have a 89.12 % chance of
31having a matching birthday in their group.
32That's probably more than you would think!
33
34This run took 3.980314016342163 seconds.

I ran it 10 times to get an average speed of 4.003537011 seconds on my home desktop.

Now, the question: would a list sort that allowed a much simpler array walk be faster?

The first part was to add the sort at the end of the birthday generating routine so it sorted the list before returning it. So, just before the end of the date generator (at line 22), I inserted:

birthdays.sort()

I ran it a few times with the original list-walking algorithm to see if the sort itself added some time, and it did; about a half-second over 100,000 iterations.

But with the original algorithm, if there was a matching date, it had to walk large chunks of the list over and over again until it reached a date that had a twin. This was the same way it worked with the list unsorted. It just might mean a few or a few dozen less comparisons overall.

But now, with the sorted list, and a new algorithm, I could make sure there were no more than 39 comparisons for each list of 40, instead of potentially hundreds.

I commented out the part that compared the length of the birthday set and the unique birthdays list too. At what’s now line 32, I put in the new comparison routine.

32
33
34
35
  for a in range( len( birthdays ) - 2 ):
      if birthdays[ a ] == birthdays[ a + 1 ]:
        return birthdays[ a ]
  return None

Instead of a nested loop, where for each element in the list I compared it to all the subsequent elements in the list, I have a single loop that runs from 0 to the length of the list minus 2. But why minus two?

The list numbers from 0, so the numbers of the items go from 0 to 39, but the length of the list is 40 elements. The final comparison is of 38 to 39, so I set the loop to iterate from 0 to 38.

Ten runs of the old method took an average of 4.003537011 seconds. Ten runs of the new method took an average of 3.48092289 seconds. The old method was a little more than 15% slower.

BUT as the number of birthdays went up, the gap closed.

One reason for this might be the high likelihood of matching birthdays. If you’ve got 88% odds of a match in a set of 40, that unsorted algorithm will exit when or before it gets to the 39th item in the list 88% of the time, regardless of list size. Meanwhile the process of sorting by date order doesn’t benefit from those efficiencies of scale, so it just keeps getting more expensive, but the higher the likelhood of a match in the set, the more the unsorted method benefits from the scale. Still, up to a list of 200, the sorted version eked out wins.

The full final code of the sorted version is below. Or you can check out the full code of Al’s version.

Want to read the next one?

I’ll link it in this last section when it drops. But you can also follow me on Twitter or LinkedIn. I also livecode these on Twitch/YouTube/LinkedIn (that’s right, baby… simulcasting) now and again.

Read Python Project #3: Message in a Bitmap.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
"""Birthday Paradox Simulation, by Al Sweigart al@inventwithpython.com
Explore the surprising probabilities of the "Birthday Paradox". 
More info at https://en.wikipedia.org/wiki/Birthday_problem
View this code at https://nostarch.com/big-book-small-python-projects
Tags: short, math, simulation"""

import datetime, random


def getBirthdays(numberOfBirthdays):
  """Returns a list of random date objects."""
  birthdays = []
  for i in range(numberOfBirthdays):
    # The year is unimportant so long
    # as all the birthdays have the same year.
    startOfYear = datetime.date(2001, 1, 1)

    # Get a random day into the year
    randomNumberofDays = datetime.timedelta(random.randint(0,364))
    birthday = startOfYear + randomNumberofDays
    birthdays.append(birthday)
  birthdays.sort()
  return birthdays

def getMatch(birthdays):
  """ Returns the date object of a birthday 
  that occurs more than onbce in the list. """
  #if len(birthdays) == len(set(birthdays)):
  #  return None # All birthdays are unique
  
  # Compare each birthday to every other birthday
  for a in range( len( birthdays ) -2 ):
      if birthdays[ a ] == birthdays[ a + 1 ]:
        return birthdays[ a ]
  return None #if nothing matched, return None
  
# Display the intro:
print('''Birthday Paradox, by Al Sweigart al@inventwithpython.com
The Birthday Paradox shows us that in a group of N people, the odds
that two of them have matching birthdays is surprisingly large.
This program does a Monte Carlo simulation (that is, repeated random 
simulations) to explore this concept.
(It's not actually a paradox, it's just a surprising result.)
''')

# Set up a tuple of month names in order
MONTHS = ('Jan', 'Feb','Mar','Apr','May','Jun','Jul','Aug','Sept','Oct','Nov','Dec')

while True: #Keep asking until the user enters a valid amount.
  print('How many birthdays shall I generate? (Max 100)')
  response = input('> ')
  if response.isdecimal() and (0 < int(response) <= 250):
    numBDays = int(response)
    break # user has entered a valid amount
print();

#Generate and display the birthdays
print('Here are', numBDays, "birthdays:")
birthdays = getBirthdays(numBDays);
for i, birthday in enumerate(birthdays):
    if i != 0:
      # Display a comma for each birthday after the first birthday.
      print(', ', end='')
    monthName = MONTHS[birthday.month - 1]
    dateText = '{} {}'.format(monthName, birthday.day)
    print(dateText, end='')
print()
print()  

# Determine if there are two birthdays that match
match = getMatch(birthdays)

#Display the results
print('In this simulation, ', end='')
if match != None: 
  monthName = MONTHS[match.month - 1]
  dateText = '{} {}'.format(monthName, match.day)
  print('multiple people have a birthday on', dateText)
else:
  print('there are no matching birthdays')
print()

# Run through 100,000 simulations
# Greg's note - this is where we'll want to add timing code
print('Generating', numBDays, '100,000 times...')
input('Hit enter to begin: ')

print('Running...')
start_time = datetime.datetime.now().timestamp()
simMatch = 0 #counter for matching birthdays
for i in range(100_000):
  #Report on progress every 10,000 simulations
  if i % 10_000 == 0:
    print(i, "simulations run...")
  birthdays = getBirthdays(numBDays)
  if getMatch(birthdays) != None:
    simMatch = simMatch + 1
print('100,000 simulations complete.')

# Display simulation results:
probability = round(simMatch / 100_000 * 100, 2)
print('Out of 100,000 simulations of', numBDays, 'people, there was a')
print('matching birthday in that group', simMatch, 'times. This means')
print('that', numBDays, 'people have a', probability, '% chance of')
print('having a matching birthday in their group.')
print('That\'s probably more than you would think!')
end_time = datetime.datetime.now().timestamp()
the_time = (end_time - start_time)
print();
print('This run took', str(the_time), 'seconds.')