Learning Python By Example #3: The Birthday Problem
I'm getting older, but it's better than the alternative.
You got a problem with birthdays?
The problem with birthdays is that many people have the same one. I, for example, share a birthday with John Cleese. This program by Al generates a bunch of random birthdays for a group of people and then calculates the odds of there being a matching birthday.
Let’s Get Primed to Python!!
As always, I copied and pasted his opening intro comments because I get no benefit from that.
The first thing I noticed was that he doesn’t define a main function in this like he did in the first script. It actually makes this simpler, but it’s not importable? The last one had a comment that the main function (and the code at the end to invoke it) was in case the code was being run from the command-line instead of being imported.
What is datetime.time()?
One of the things I’ll do as I go through this is look up a library in the Python documentation to better understand why I’m doing something and what exactly is going on.
datetime.date() shows me that it returns a date object based on the arguments.
And while there, I looked up
datetime.timedelta() on the page to understand what it did.
Overall the function generates a date object for January first, then a “delta,” which is an object for an amount of time, adds them together to get our random date, then appends the random date to the list of random dates.
Playing the Match Game with Gene Rayburn
The next function iterates through the list of birthdays and checks for matches.
Number generating and matching notes
TIL about the colon operator
Although this code is not testable in this state yet, I will be able to test it after the next block. And at that point, I find out I missed a colon while typing in line 33.
for b, birthdayB in enumerate(birthdays[ a + 1 ]):
for b, birthdayB in enumerate(birthdays[ a + 1 :]):
That missing colon is very important. Without it
birthdays[ a + 1 ] returns a
datetime object, which is not what I want. I want to iterate through the rest of the list. That colon after
a + 1 is called “the colon operator” and it’s used to slice the array from the value before it to the value after it. If the value is not provided, it just interprets that as beginning or end. So
birthdays[ a + 1 :] gives you the birthdays list from item
a + 1 to the end.
There’s no iterable list returned by
birthdays[ a + 1 ] (without the colon), so the script broke there until I found and fixed my error.
Is this the best use of our resources?
The birthday matching algorithm is pretty simple. It runs a simple sort that returns a list of just the unique values in the list (
set(birthdays)). If the list of unique birthdays is the same size as the list of birthdays, it quits and acknowledges there’s no match. If there is one, it starts going through the list, testing position X in the list against every birthday after it, but quits when it finds a match.
If I sorted the list in date order first, then it would take just one run through the list, comparing each value to the one after it. Would that be faster? That’s a variation to try out after I complete the program.
How many birthdays do you want?!
The next function gets a number of birthdays from you and then uses the functions above to check for a match.
First batch notes
Ready? Break! It’s been a while since I’ve seen this, but it’s been a while since I’ve programmed getting input from STDIN.
In lines 48-53, he uses
while True to start an endless loop that will run forever unless explicitly broken. The code explicitly breaks it only after you’ve given it an integer between 1 and 100.
Early polling suggests… It runs a demo of the generation of a birthday list and searching it for matches, printing the results to show you what it did.
This is just a demo, though. In the next lines of code it will run this simulation a BUNCH more times to give you a statistic on how often matches happen.
The long slog…
Now I have the code that runs the 100,000 tests.
Comparing our results
Proper indenting saves lives
When I ran it the first time, I thought “wow, that’s amazingly quick.” But I got a much lower result than Al led me to expect. I was getting less than 10 matches. In fact, no matter how big I made the set, I never got more than 10 matches out of 100,000.
As I inspected the code, I noticed my error. Lines 93-95 were indented too far. They were under the
if statement that was printing the progress, thus were only getting run once every 10,000 times.
When I reduced their indenting by one level, then they ran on every iteration and the program wasn’t nearly as fast, but it did give me a result that tracked with what I was led to expect.
But can I save time?
Remember I asked a few paragraphs ago about whether the birthday matching algorithm was the most efficient method. Well, here’s where I find out.
First, put in some timing code. Since I’re already importing
datetime, I’m grabbing a timestamp from it, though that syntax is a little more complex than if I used the
I’ll insert the line:
start_time = datetime.datetime.now().timestamp()
at line 88 so I start the stopwatch right before I start running.
Then, at the end of the script, I’ll add:
And I get…
1How many birthdays shall I generate? (Max 100) 2> 40 3 4Here are 40 birthdays: 5Oct 11, Dec 28, Jun 28, Sept 30, Feb 19, Sept 1, Feb 21, May 13, Aug 12, 6Jan 18, Aug 2, Mar 25, Jun 21, Dec 2, May 12, Aug 24, Mar 13, Apr 15, 7Mar 18, Aug 30, Jan 16, May 24, Mar 14, Jul 26, Feb 6, Oct 2, Jun 17, 8Apr 18, Aug 31, Aug 16, Nov 29, May 10, Mar 18, Jun 28, Apr 30, May 7, 9Apr 26, Nov 18, May 19, Dec 5 10 11In this simulation, multiple people have a birthday on Jun 1228 13 14Generating 40 100,000 times... 15Hit enter to begin: 16Running... 170 simulations run... 1810000 simulations run... 1920000 simulations run... 2030000 simulations run... 2140000 simulations run... 2250000 simulations run... 2360000 simulations run... 2470000 simulations run... 2580000 simulations run... 2690000 simulations run... 27100,000 simulations complete. 28Out of 100,000 simulations of 40 people, there was a 29matching birthday in that group 89125 times. This means 30that 40 people have a 89.12 % chance of 31having a matching birthday in their group. 32That's probably more than you would think! 33 34This run took 3.980314016342163 seconds.
I ran it 10 times to get an average speed of 4.003537011 seconds on my home desktop.
Now, the question: would a list sort that allowed a much simpler array walk be faster?
The first part was to add the sort at the end of the birthday generating routine so it sorted the list before returning it. So, just before the end of the date generator (at line 22), I inserted:
I ran it a few times with the original list-walking algorithm to see if the sort itself added some time, and it did; about a half-second over 100,000 iterations.
But with the original algorithm, if there was a matching date, it had to walk large chunks of the list over and over again until it reached a date that had a twin. This was the same way it worked with the list unsorted. It just might mean a few or a few dozen less comparisons overall.
But now, with the sorted list, and a new algorithm, I could make sure there were no more than 39 comparisons for each list of 40, instead of potentially hundreds.
I commented out the part that compared the length of the birthday set and the unique birthdays list too. At what’s now line 32, I put in the new comparison routine.
Instead of a nested loop, where for each element in the list I compared it to all the subsequent elements in the list, I have a single loop that runs from 0 to the length of the list minus 2. But why minus two?
The list numbers from 0, so the numbers of the items go from 0 to 39, but the length of the list is 40 elements. The final comparison is of 38 to 39, so I set the loop to iterate from 0 to 38.
Ten runs of the old method took an average of 4.003537011 seconds. Ten runs of the new method took an average of 3.48092289 seconds. The old method was a little more than 15% slower.
BUT as the number of birthdays went up, the gap closed.
One reason for this might be the high likelihood of matching birthdays. If you’ve got 88% odds of a match in a set of 40, that unsorted algorithm will exit when or before it gets to the 39th item in the list 88% of the time, regardless of list size. Meanwhile the process of sorting by date order doesn’t benefit from those efficiencies of scale, so it just keeps getting more expensive, but the higher the likelhood of a match in the set, the more the unsorted method benefits from the scale. Still, up to a list of 200, the sorted version eked out wins.
The full final code of the sorted version is below. Or you can check out the full code of Al’s version.
Want to read the next one?