Problem 62 is yet another wonderful illustration of the limits of brute force approaches. Here is the problem description.
The cube, 41063625 (3453), can be permuted to produce two other cubes: 56623104 (3843) and 66430125 (4053). In fact, 41063625 is the smallest cube which has exactly three permutations of its digits which are also cube.
Find the smallest cube for which exactly five permutations of its digits are cube.
On the face of it, the approach seems evident: list out all cubes up to a certain upper bound. For each cube, check if the list of cubes has at least 4 other numbers formed by transposing its digits. The lowest number in that set is the result.
I will admit to taking this approach initially. And my program ran and ran and ran… and ran some more before I terminated it.
Why did this seemingly straightforward approach fail?
It failed because of the number of executions. To give you an idea, 1233 is 1860867. The number of permutations of this number is 5040, the factorial of the number of digits. So, just to loop through all permutations of cubes from 1 to 500, we would have over 23.3 million permutations to search through. If our upper bound was 1000 cubes, then the number shoots up to over 204 million permutations!
How can we solve this? Well, it turns out that if we changes the order of steps in our original approach, we can solve this problem in O(n) time. As we generate cubes, we can cache the digits of the cube. Any new cube that we add to our cache will be checked to see if it can be formed by an existing set of digits. Once the number of cubes for a particular set of digits reaches five, we have our answer.
Solving Project Euler: Problem 062
Python
1
2
3
4
5
6
7
8
9
10
11
12
cubes=[str(i**3)foriinrange(10000)]
d={}
forcube incubes:
digits=tuple(sorted(cube))# using tuples since lists are non-hashable
Triangle, square, pentagonal, hexagonal, heptagonal, and octagonal numbers are all figurate (polygonal) numbers and are generated by the following formulae:
Triangle
P3,n=n(n+1)/2
1, 3, 6, 10, 15, …
Square
P4,n=n2
1, 4, 9, 16, 25, …
Pentagonal
P5,n=n(3n−1)/2
1, 5, 12, 22, 35, …
Hexagonal
P6,n=n(2n−1)
1, 6, 15, 28, 45, …
Heptagonal
P7,n=n(5n−3)/2
1, 7, 18, 34, 55, …
Octagonal
P8,n=n(3n−2)
1, 8, 21, 40, 65, …
The ordered set of three 4-digit numbers: 8128, 2882, 8281, has three interesting properties.
The set is cyclic, in that the last two digits of each number is the first two digits of the next number (including the last number with the first).
Each polygonal type: triangle (P3,127=8128), square (P4,91=8281), and pentagonal (P5,44=2882), is represented by a different number in the set.
This is the only set of 4-digit numbers with this property.
Find the sum of the only ordered set of six cyclic 4-digit numbers for which each polygonal type: triangle, square, pentagonal, hexagonal, heptagonal, and octagonal, is represented by a different number in the set.
The statement at the end — “only ordered set“— threw me off a bit, because I took it to mean that the six numbers would be triangular, square, pentagonal and so on in that order. There is no such set, and so I had to go back to the drawing board again.
It was then that point 2 in the problem statement dawned on me. 8128, 2882, and 8281 are cyclical, but 2882 is pentagonal. This makes the problem more interesting. The set of six numbers could be any permutation of the six types of figurates we are dealing with. Thankfully, since we know that there exists only one such set of six numbers, we can stop executing when we reach that set.
[I recently moved the Jupyter notebooks for this problem to GitHub. If you want to skip the commentary, you may download the notebooks here: Part 1, Part 2]
The creators of this data set (Tiago, Tulio) collected 1,956 comments on five YouTube videos during a certain time period and classified each as spam or ham. We will attempt to build a machine learning model to learn the data set and test our model’s performance.
The data is spread across five similar, but distinct data sets. We will take two passes through this. In our first pass, we will consider only the first data set, based on a video by the artist Psy. We will build a simple model based on a Naive Bayes classifier.
In our second pass, we will merge all five data sets into one unified data set, and will build a model that learns from and predicts comments as spam. For this pass, we will attempt multiple classifiers and pick the one that has the best accuracy score, and will further tune this model to improve performance.
As we have done in the past, we will follow our established workflow for building and testing machine learning models, namely, read the data, perform data cleanup where necessary, split the data set, transform the data set as necessary, select a model, train the model, test the model, and determine next steps.
The data set has five columns: comment_id, author, date, content, and class. Though not labeled explicitly, class 1 seems to represent spam and class 0 represents ham.
Of these, the only relevant feature is content, with class being the target.
The columns of interest do not have any missing data, so no data cleanup is necessary.
The spam/ham distribution is split nearly equally. This is good because if one class dominated the data set, the model may be skewed towards that class.
We will learn from 80% of the data set, and use the remaining 20% as our test set.
Since the feature column is text, we will need to convert it to a numeric format. To do this, we will use the CountVectorizer class. The CountVectorizer will build a document-term matrix, which represents each row in the feature column as an array of ones and zeros, with the former representing the presence of a word. The result of this operation is a sparse matrix containing 1564 rows (80%) and 3810 columns (number of unique words in the training set).
Next, we will build the model:
For our first pass, we will build a model using the MultinomialNB class
For our second pass, we will take a more elaborate approach, choosing from 8 different models. We will use 10-fold cross validation and select the model with the best accuracy score. In this case, the DecisionTreeClassifier performed the best. We will further tune this model using GridSearch to settle on the best set of parameters that produce the highest accuracy
Once the model is built, we will test it against the testing data set.
For the Psy data set, our accuracy score was 97%
For the combined data set, our accuracy score was 93%
As our final step, we can review the confusion matrix, and take a look at the false positives and false negatives.
If our model allows it, we can also list the words that were the most and least spammy, i.e. words that had the highest and lowest spam-to-ham ratio.
It is interesting to note that the odd squares lie along the bottom right diagonal, but what is more interesting is that 8 out of the 13 numbers lying along both diagonals are prime; that is, a ratio of 8/13 ≈ 62%.
If one complete new layer is wrapped around the spiral above, a square spiral with side length 9 will be formed. If this process is continued, what is the side length of the square spiral for which the ratio of primes along both diagonals first falls below 10%?
We will take the same approach here as we did for solving problem 28. For each new square layer that is added, the numbers at the four corners will be of the form n2, n2 – (n-1), n2 – (n-1)*2, n2 – (n-1)*3. We will therefore check if each of these values (except, of course, n2) is prime and increment a counter accordingly.
My first attempt at solving this problem was using a prime sieve, but even with an upper limit of 100,000,000, we will be unable to reach 10%. So my revised attempt makes use of our trusty old is_prime() function.
The next three expansions are 99/70, 239/169, and 577/408, but the eighth expansion, 1393/985, is the first example where the number of digits in the numerator exceeds the number of digits in the denominator.
In the first one-thousand expansions, how many fractions contain a numerator with more digits than denominator?
I found putting pen to paper very helpful in solving this problem. Obviously, trying construct a fraction going to 1000 levels should be enough to scare anyone from trying to solve this using a brute force approach. Let us write down the first few numbers in this series.
3/2
7/5
17/12
41/29
99/70
239/169
My first thought was that the numerators are prime, but seeing 99 in the list blew a hole in that notion. As we continue to look for patterns, we find that the next denominator in the series is merely the sum of the numerator and the denominator of the current number.
5 = 3 + 2; 12 = 7 + 5; 29 = 17 + 12, and so on
The numerators also follow a pattern. If you look at 3, 7, 17, 41, you will notice that the next numerator in the series is twice the current numerator plus the previous numerator.
17 = 7*2 + 3; 41 = 17*2 + 7; 99 = 41*2 + 17, and so on
Now that we have reduced the nth number in the series into its constituent parts, we can set up two lists, one each for the numerator and the denominator, and count for those instances where the numerator has more digits than the denominator.
Problem 56 is titled Powerful digit sum. The problem statement is as follows.
A googol (10100) is a massive number: one followed by one-hundred zeros; 100100 is almost unimaginably large: one followed by two-hundred zeros. Despite their size, the sum of the digits in each number is only 1.
Considering natural numbers of the form, ab, where a, b < 100, what is the maximum digital sum?
Once you break this problem down to its constituent parts, it becomes very easy to solve.
That is, 349 took three iterations to arrive at a palindrome.
Although no one has proved it yet, it is thought that some numbers, like 196, never produce a palindrome. A number that never forms a palindrome through the reverse and add process is called a Lychrel number. Due to the theoretical nature of these numbers, and for the purpose of this problem, we shall assume that a number is Lychrel until proven otherwise. In addition you are given that for every number below ten-thousand, it will either (i) become a palindrome in less than fifty iterations, or, (ii) no one, with all the computing power that exists, has managed so far to map it to a palindrome. In fact, 10677 is the first number to be shown to require over fifty iterations before producing a palindrome: 4668731596684224866951378664 (53 iterations, 28-digits).
Surprisingly, there are palindromic numbers that are themselves Lychrel numbers; the first example is 4994.
How many Lychrel numbers are there below ten-thousand?
NOTE: Wording was modified slightly on 24 April 2007 to emphasise the theoretical nature of Lychrel numbers.
It is very helpful to know that we could stop checking after 50 attempts, for if we did not have this information, this would be a never-ending quest.
In the code listing below, I get a count of all the non-Lychrel numbers under 10000, and subtract that from 10000 to get the right answer. (While the logic is right, it took me a few tries to get to the right answer, because I did not know what number to use as my lower bound. I started with 10, moved down to 1, and finally to 0 in order to get to the right answer.)
It is not until n = 23, that a value exceeds one-million: 23C10 = 1144066.
How many, not necessarily distinct, values of nCr, for 1 ≤ n ≤ 100, are greater than one-million?
Interestingly, this problem only looks for the number of combinations, and the formula is listed in the problem statement itself. We will make use of this and check for values of n ≥ 23. For the given upper bound of n, it is sufficient to check for values of r between 4 and n-3. This is because 100C3 = 100C97 < 1000000.