Crawling the top 15,000 Drupal websites
The source was the top 1,000,000 websites from Alexa, dumped on 13th of November 2014. So the data should be reasonably up-to-date.
Out of those websites, I was able to recognize 171,010 as running one of the three most popular CMSs. For this task, I created a PHP library which detects CMSs in multiple ways (it is on GitHub by the way, so you are free to contribute!).
Why did I do this? I was curious about the market shares of the top CMSs. But my main reason was to look into Drupal versions, to see how many sites are kept updated and how many are still vulnerable to the Drupalgeddon bug.
CMS Market shares
Let's get this out of the way: people REALLY love WordPress! I wasn't expecting such a result, but it looks like WordPress truly dominates the other CMSs.
Most popular Drupal versions
Let's dive into Drupal versions. So I managed to recognize 14,526 different Drupal websites running 52 different versions. Here are the 5 most popular versions.
So from this data we can see that the most popular versions are fairly recent versions (7.34 was the latest at the time of this crawl). I was expecting far more older versions. And also the top 3 most popular versions are safe from the Drupalgeddon bug, which is great!
Latest Drupal versions
The 7.32 version seems to have quite a bump, with good reason though: that was the first version to fix the Drupalgeddon bug. But surprisingly many sites update to the very latest version of 7.x.
Vulnerable to Drupalgeddon
And finally, the number one question I wanted an answer to: just how many websites are vulnerable to the Drupalgeddon bug? (Judging by the version number alone.)
So around 1/3 of top Drupal websites are not protected against the Drupalgeddon bug. This issue should be solved immediately; thousands of websites are in danger.
Please note: the Drupal version is not the best way of determining the vulnerability. You can patch your Drupal against the Drupalgeddon bug without updating, and that does not change the version number. So some older Drupal websites could still be protected against the bug.
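To make the version-based check concrete, here is a minimal Python sketch of the classification described above. It relies only on the fact stated in this post that 7.32 was the first release with the fix. The function name and the labels it returns are made up for illustration, and this is not how the PHP library does it. As noted above, a patched site that still reports an old version number will be misclassified as vulnerable.

```python
import re

# First 7.x release that contains the Drupalgeddon fix, as stated in the post.
FIRST_FIXED_7X_MINOR = 32


def classify_by_version(version):
    """Classify a site from its reported version string alone.

    Returns "vulnerable", "not vulnerable", "other branch" or "unknown".
    Sites patched without a version bump will wrongly show as "vulnerable".
    """
    match = re.fullmatch(r"(\d+)\.(\d+)", version.strip())
    if not match:
        # Anything that is not a plain "major.minor" string is left undecided.
        return "unknown"
    major, minor = int(match.group(1)), int(match.group(2))
    if major != 7:
        # This sketch only reasons about the 7.x branch.
        return "other branch"
    if minor < FIRST_FIXED_7X_MINOR:
        return "vulnerable"
    return "not vulnerable"


if __name__ == "__main__":
    for reported in ["7.31", "7.32", "7.34", "6.33", "n/a"]:
        print(reported, "->", classify_by_version(reported))
```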
Fun facts
On a lighter note, let's get to some fun facts!
I found 2 websites running 5.x, and 7 websites running 4.x! No, seriously, 4.x.
One brave website was even running on a custom 8.129 version.
I couldn't recognize the versions of 129 websites. This is mostly because they did not have their CHANGELOG.txt accessible. I could still recognize them as Drupal though, usually from headers or meta tags.
What's the most popular Drupal website? According to Alexa it's taboola.com, on rank 358.
It took my VPS a week to crawl through all the websites and another 2 days to determine the correct Drupal version.
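For the curious, the detection order mentioned in the facts above (CHANGELOG.txt first, headers as a fallback) can be sketched roughly like this in Python. This is an illustration under assumptions, not the PHP library's actual code: it assumes the newest release entry in CHANGELOG.txt is a line starting with "Drupal" followed by the version number, and that a Drupal site may announce itself in an X-Generator response header. The function names are hypothetical, and fetching the files is left out.

```python
import re

# Assumed format of a release entry in CHANGELOG.txt: "Drupal <version>, <date>".
VERSION_LINE = re.compile(r"^Drupal (\d+\.\d+(?:\.\d+)?)\b", re.MULTILINE)


def version_from_changelog(text):
    """Return the first (newest) version listed in the changelog, or None."""
    match = VERSION_LINE.search(text)
    return match.group(1) if match else None


def detect(changelog_text, headers):
    """Return (is_drupal, version).

    changelog_text: body of /CHANGELOG.txt, or None if it was not accessible.
    headers: dict of HTTP response headers from the front page.
    """
    if changelog_text:
        version = version_from_changelog(changelog_text)
        if version:
            return True, version
    # Fallback: the site says it is Drupal, but gives no exact version.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "drupal" in lowered.get("x-generator", "").lower():
        return True, None
    return False, None
```

A site that blocks CHANGELOG.txt but still sends the generator header comes back as (True, None), which corresponds to the "recognized as Drupal, version unknown" group above.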
Comments
Hi Kristian,
nice article, very useful.
I tested your PHP library and it works ;-) Of course it is quite time consuming to add many URLs manually. So how did you do it automatically? I have a CSV with URLs I want to check and would like to know how to do it. Sorry, but I am not a coder at all ;-)
Chris
Thanks!
http://pastebin.com/jjCzAssX
Here is a simplified version of the script I ran. First you need to insert the CSV to a MySQL table, and this script will read from there one random line at a time.
To speed things up, you can have multiple instances of this script running.
Thank you very much!
I think CHANGELOG.txt is a very inaccurate way of checking the version of Drupal. Some update methods won't update that file at all, so CHANGELOG.txt really points to the version which was available when that installation was initially done. I know using drush to update core behaves this way and it's not uncommon.
Yes, CHANGELOG.txt by all means is not fool-proof. Although I am pretty sure "drush up drupal" updates every file in core, even CHANGELOG.txt? Have to check up on that.
Yes, drush up will change CHANGELOG.txt
The only reason I can think to leave an older CHANGELOG.txt visible and publicly accessible is if you were running some kind of honey pot to attract attackers, not something these popular websites would likely be doing.
It is relatively common to block access to these .txt files at the server level though, might be another way the results get skewed, but anyway thanks for doing this research, hugely informative.
I wouldn't be too trusting of CHANGELOG.txt - while many sites remove it to obfuscate version, others may have an old one sitting around. A more accurate approach is demonstrated by tools like WhatWeb (not very active project @ https://github.com/urbanadventurer/WhatWeb) - retrieve common files which are available like misc/drupal.js, misc/tabledrag.js, misc/ajax.js, then compare them against known hashes for particular versions. Those files change often but are rarely modified by sites, so they are a more robust means of fingerprinting the actual target you're scanning.
Would you consider sharing your list of the top 15K Drupal sites?
Thanks, btw! Interesting article, good on you for doing the research and finding your own way to implement it.
Great idea about sharing the list! I have uploaded it as a CSV: http://polso.info/sites/polso.info/files/alexa-drupal-2014-11-13.csv
That WhatWeb project seems really interesting, I had not seen it before. And you are correct, comparing system files is a much better way of determining the version. One problem I noticed though is that between some versions, there has not been any significant change in site resources. For example, see here: https://github.com/drupal/drupal/compare/7.33...7.34.patch . There have not been any changes visible to the website visitor from which we could determine the correct version.
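A small Python sketch of the hash comparison idea, including the limitation just mentioned. This is not WhatWeb's code and it contains no real Drupal hashes: the index is built from whatever reference copies of the files you supply, and the function names are made up. Because two releases can ship identical public files, the lookup returns a set of candidate versions rather than a single answer.

```python
import hashlib
from collections import defaultdict


def build_index(known_files):
    """Map (path, sha256) to the set of versions that ship exactly that file.

    known_files: {version: {path: file_bytes}}, taken from reference
    copies of each release that you provide yourself.
    """
    index = defaultdict(set)
    for version, files in known_files.items():
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            index[(path, digest)].add(version)
    return index


def candidate_versions(index, fetched_files):
    """Return the versions consistent with every fetched file.

    fetched_files: {path: file_bytes} downloaded from the target site.
    An empty set means no known release matches (or a file was modified).
    """
    candidates = None
    for path, content in fetched_files.items():
        digest = hashlib.sha256(content).hexdigest()
        versions = index.get((path, digest), set())
        candidates = set(versions) if candidates is None else candidates & versions
    return candidates or set()
```

If the files fetched from a site are byte-for-byte the same in 7.33 and 7.34, the result is a set containing both versions, so this method narrows the version down but cannot always pin it to one release.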
I see a much higher number of WordPress sites compared to Drupal sites. Probably one of the things in Drupal that non-technical people find difficult is updating Drupal core and its modules.
I'm glad that you made such a nice tool open source.
Thank you.
Interesting data, thanks.
Special thanks for this PHP library.
The more foolproof way to detect a Drupal site is to get /misc/drupal.js - that file exists on Drupal 5, 6, and 7, and will always be there unless you have done major hacks to core. It doesn't tell you exactly which version the site is running, though.
The "32.4%" figure is misleading. It is also unnecessarily alarming. For starters:
1. Many Drupal websites were patched for Drupageddon, instead of upgraded. Your method appears to count these websites as though they are vulnerable when they are not. Testing if they are exploitable would be more accurate, but would incorrectly count websites as "not vulnerable" if they were compromised and then patched by the attacker. This is a common scenario.
2. The top 15 thousand Drupal websites are more likely to be better maintained than the other hundreds of thousands. It is likely that the "long tail" of low-trafficked and/or poorly maintained Drupal websites is where the damage of Drupageddon exploits has and will continue to be felt most.
Estimating how many Drupal websites are (or were) vulnerable to Drupageddon is extremely difficult. Any estimate that claims a precision of less than 50% is probably ill-advised in my opinion.
Nevertheless, this was very interesting to read and useful data. Thank you for sharing it. There are several crawlers that detect Drupal websites. BuiltWith seems to be the most comprehensive (http://trends.builtwith.com/cms/Drupal). But I don't think any of them attempt to detect the minor version number.
Drupal's usage statistics provide some insight into that, but it is hard to know how many Drupal websites have disabled the reporting feature those statistics depend on.
Both data sets show peaks for 7.32 and 7.34. However, Drupal.org shows almost twice as many websites on 7.34 as on 7.32, while your survey shows the opposite.
In our Drupal development agency this script will be very useful :) Thanks for sharing!