Update 2013.01.07: asteriskpound on reddit has pointed out a flaw in how I determine the root certificate and how I calculate the length of the certificate chain. The flaw is that I assumed the last certificate in the "certificate chain" from openssl's output would always be the root of the chain, but in reality this "chain" can be very broken (as in the case of me thinking www.olivenoel.com had 21 certificates in its chain). I don't expect my final results to be very different, but I will need to re-evaluate before these results can be trusted.
Update 2013.01.09: See my follow-up post Root Certificate Authority research – post 2 for corrections to the data.
Highlights
- Only 50 of the top 1M sites (from Alexa) are signed by TURKTRUST. For the conspiracy theorists out there, 22 of those are Iranian sites, with the rest being Turkish sites. See the list of those sites at http://pastebin.com/PdKb5BkF (site name followed by certificate info).
- 22% of the top 1M sites have an SSL site with a cert that is not self-signed (though possibly not all of these are valid). 86% of all SSL sites are signed by only 20 root certificates, and because many companies (such as Verisign) have multiple root certificates, 98% of all SSL sites are signed by 20 companies. For counts of the top 1000 root CAs for the top 1M sites, see http://pastebin.com/kgd1g2m3
- To look at this data yourself, you can download the file certPaths.zip (22MB). The format (further shown in the Appendix, and queried in the sketch below) is: site|# of certs in chain to root|cert info|issuer info|root info
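As a rough example of how to query the data yourself, the following one-liner tallies the most common roots (a sketch that assumes the pipe-delimited format above, where the second field is the chain length and the fifth is the root's info; chain lengths of 0 and 1 mean no SSL or self-signed, so they are skipped):

awk -F'|' '$2 >= 2 {print $5}' certPaths | sort | uniq -c | sort -rn | head -20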
Motivation and research
Before I get into how I gathered this data, let me explain what provoked me to do this. Due to the recent TURKTRUST news, and the similar Diginotar news from 2011, both involving root Certificate Authority (CA) mistakes, I decided to do some research into certificate authorities this weekend. The basic problem is that all browsers trust a bunch of weird root certificate authorities, and these CAs can create certs for any site, which your browser will trust without any warnings or indications, even when one of these CAs does something weird (like TURKTRUST signing google.com certs). So although the cert for www.google.com is signed by Verisign, nothing technically prevents Turktrust or Diginotar from signing a different cert that lets a server pretend to be the real www.google.com. Verisign and Turktrust are each trusted equally by your browser (and Diginotar used to be). So if someone can get one of these fraudulent certs and MITM your browser, then when you connect to Gmail they can read all your traffic as if you weren't using SSL at all.
There has been various research into trying to improve this situation. The EFF SSL Observatory project has tried to identify how many CAs your browser really trusts, and has identified 1,482 CAs trusted by Microsoft and Mozilla, from 651 organizations (many organizations have multiple certs). Even though your browser may "only" trust 194 CAs (as is the case with my Firefox 17.0.1 browser as of Jan 6, 2013), those CAs have signed intermediate CAs and given them the ability to sign more certs. Furthermore, although IE may only show 30 CAs that it trusts (as is the case with my Internet Explorer 9.0.9112.16421 install), it actually trusts many more, which Microsoft downloads and trusts on an as-needed basis when it runs into them, as explained in the TechNet article Microsoft Root Certificate Program (pointed out in the "Certified Lies" paper discussed later in this post). Chrome relies on the OS to provide its trust store, so on my Windows 7 system it has the same 30 CAs that IE does, and on my Ubuntu system it trusts 145 certificates.
It is important to note that not just any cert can be used to sign another cert. That issue was discussed by Moxie Marlinspike in his 2009 Blackhat DC presentation on sslstrip, and should not happen with current browsers.
However, a default browser install today still trusts many more organizations than many believe it should, and those CAs can have intermediate CAs which are capable of signing certificates but do not themselves show up as root CAs in the browser. There is a great paper called "Certified Lies: Detecting and Defeating Government Interception Attacks Against SSL" by Christopher Soghoian and Sid Stamm, which explains how SSL can be exploited due to these trust issues and introduces their Firefox addon CertLock, which implements their Trust-On-First-Use (TOFU) philosophy: you trust the Google cert the first time you see it, and panic if it ever changes. There is a post on the TOR project site about someone who removed all certs and adds them back only as needed, and another post shows the 10 CAs that one user ultimately ended up accepting after 30 days of using a browser from which he had originally removed all CAs.
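To make the TOFU idea concrete, here is a minimal sketch of the concept (this is not how CertLock is actually implemented; the host and the ./pins directory are arbitrary placeholders): fetch a host's certificate, fingerprint it, remember the fingerprint the first time, and complain if it ever changes.

#!/bin/sh
# Minimal TOFU sketch: pin a host's cert fingerprint on first sight,
# and warn if it ever changes. Not CertLock; ./pins and the host are placeholders.
host=www.google.com
mkdir -p pins
fp=$(echo | timeout 3 openssl s_client -connect $host:443 2>/dev/null \
     | openssl x509 -noout -fingerprint -sha1 | cut -d= -f2)
if [ ! -f "pins/$host" ]; then
    echo "$fp" > "pins/$host"                  # first use: trust and remember
    echo "pinned $host as $fp"
elif [ "$fp" = "$(cat "pins/$host")" ]; then
    echo "$host still matches its pinned cert"
else
    echo "WARNING: the certificate for $host has changed!"
fi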
How to protect yourself
So what should you do? Well, first, SSL is still a good thing, and you should still use it as much as possible. To ensure you do, install EFF's HTTPS Everywhere browser add-on, which will try to force sites that offer SSL to actually use it. There are a few additional extensions (all Firefox-only) which have sought to fix this particular problem (I have not tested these):
- CertLock: the project discussed previously, which uses the Trust-On-First-Use philosophy.
- Perspectives: as explained on their site perspectives-project.org, it checks certs against 3rd-party notaries to decide whether a cert should be trusted, similar to the "web of trust" concept used by PGP.
- Certificate Patrol: reveals when certificates are updated, so you can verify the change was legitimate.
My research
How many CAs do my browsers trust?
My question in all this was "Why do I need to trust so many CAs?" So I decided to first see how many I actually trust. I manually exported all the CAs in Firefox and IE (Chrome on Windows has the same list as IE, as they apparently both get it from the OS). This was a pain, so I've made these files available at:
All certs are in the PEM format and can be viewed using:
openssl x509 -in certum.pem -text -noout
This allows you to see information such as:
Signature Algorithm: sha1WithRSAEncryption
Issuer: C=PL, O=Unizeto Sp. z o.o., CN=Certum CA
Validity
Not Before: Jun 11 10:46:39 2002 GMT
Not After : Jun 11 10:46:39 2027 GMT
If you would like to export your own files from Chrome or IE, you will need to convert them from the DER format to PEM using:
openssl x509 -inform der -in certificate.cer -out certificate.pem
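If you export a whole directory of certs, a quick loop takes care of the conversion (a sketch assuming the exported files end in .cer, as in the example above):

for f in *.cer; do
    openssl x509 -inform der -in "$f" -out "${f%.cer}.pem"
done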
How many CAs do I need to trust?
I wanted to see how many CAs are really in use, so I downloaded the Alexa Top 1 Million sites list, which is just a CSV like this:
1,facebook.com
2,google.com
3,youtube.com
4,yahoo.com
5,baidu.com
I then wrote some scripts to collect certificate info using openssl (see Appendix for scripts).
Other interesting results
Not really relevant, but I found it interesting to see how many certs were in the chain to the root.
| # of certs in chain | Site count |
|---------------------|------------|
| 0 (site did not have SSL) | 640734 |
| 1 (self-signed) | 123733 |
| 2 | 100136 |
| 3 | 95165 |
| 4 | 26648 |
| 5 | 1561 |
| 6 | 97 |
| 7 | 49 |
| 8 | 18 |
| 9 | 25 |
| 10 | 1 |
| 11 | 6 |
| 12 | 3 |
| 13 | 55 |
| 14 | 1 |
| 15 | 2 |
| 16 | 1 |
| 17 | 12 |
| 18 | 2 |
| 19 | 5 |
| 21 | 1 (www.olivenoel.com) |
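If you download certPaths, the counts in this table can be reproduced with a one-liner (assuming the pipe-delimited format from the Highlights, where the second field is the number of certs in the chain):

cut -d'|' -f2 certPaths | sort -n | uniq -c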
Appendix
To grab cert info from the Alexa Top 1 Million sites list, I created a bash script called output_cert_info.sh with the following:
#!/bin/sh
# Usage: ./output_cert_info.sh www.example.com
echo "$1"

# Save the full s_client output (including the "Certificate chain" listing) to ./data
timeout 3 openssl s_client -connect "$1":443 < /dev/null > "data/$1" 2>/dev/null

# Connect again, pull out the PEM cert, and save the decoded x509 info to ./x509
echo "HEAD / HTTP/1.0
Host: $1:443

" \
| timeout 3 openssl s_client -connect "$1":443 2>/dev/null \
| sed -n '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p' \
| openssl x509 -noout -text -certopt no_signame > "x509/$1" 2>/dev/null
and ran it with the following:
cat top-1m.csv | sed 's/.*,/www./g' | xargs -P50 -I {} ./output_cert_info.sh {}
# Takes about 16 hours
That turns all the sites into www. names and uses some Taco Bell Programming (xargs -P50) to parallelize the work. The bash script itself records data about the certificate chain to the ./data dir and the decoded x509 cert to ./x509.
To extract the certificate path info from the data directory, I wrote a Python script called gather_data.py:
import sys
import os
import re

# Reads filenames (one per line) on stdin. Each file is the saved openssl
# s_client output for one site; this pulls out the "Certificate chain" section
# and prints: site|# of certs in chain|cert info|issuer info|root info
for filename in sys.stdin:
    filename = filename.strip()
    site = os.path.basename(filename)
    try:
        with open(filename) as f:
            content = f.readlines()

        certData = False   # True while inside the "Certificate chain" block
        certNum = 0        # number of certs seen in the chain
        isIssuer = False   # alternates between subject ("s:") and issuer ("i:") lines
        subject = ""
        issuer = ""
        root = ""

        for line in content:
            line = line.strip()
            if line == "Certificate chain":
                certData = True
                continue
            if certData:
                if line == "---":
                    break
                if not isIssuer:
                    # Subject line of the Nth cert, e.g. "0 s:/C=US/..."
                    if subject == "":
                        subject = re.sub("%d s:" % certNum, "", line)
                    certNum += 1
                    isIssuer = True
                else:
                    # Issuer line, e.g. "i:/C=US/..."; the last one seen is
                    # treated as the root of the chain
                    root = re.sub("i:", "", line)
                    if issuer == "":
                        issuer = root
                    isIssuer = False

        print "%s|%d|%s|%s|%s" % (site, certNum, subject, issuer, root)
        sys.stdout.flush()
    except:
        print "%s|-1|||" % (site)
I ran this with:
cd data
ls -f | python ../gather_data.py > ../certPaths
# Takes about an hour
The generated file (certPaths) is only 988257 lines long, instead of 1 million, because the original list contains a lot of entries that are not plain site names, such as maharojgar.gov.in/~selfemp.
The data directory is 2.1GB uncompressed, so I have not posted it online, but I can make it and the x509 directory available on request.