As of this writing, the 29 Leaks dump of Formations House contains a series of three tar'd files, each containing a MySQL dump file of a database of emails. This repository contains a few scripts to make the analysis of these possible.
In total this processed ~50 million emails, which will be available soon in a fully searchable database. The process, when run end to end, took about two weeks in total on a fairly bulky multi-core AWS instance.
There are a number of issues that keep us from just importing the files directly:
- The files only contain `ALTER TABLE` statements, so the database and tables have to be set up manually.
- The files after decompressing are HUGE (552.1MB, 92.15GB, and 220.25GB).
- Each import statement is not on a separate line. Technically these files are only 15 or 20 lines long, mostly comments.
- Good luck finding a machine that can even handle a 220GB import file (since it'll be put into memory).
- The 552MB file has one `INSERT` statement that makes up most of the file.
- Examining these files in a regular text editor (or even something like `bat`) freezes basically any machine.
- This means examining any formatting issues is a struggle.
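Since a regular editor chokes on these files, one way to get a feel for their layout is to stream them in fixed-size chunks and only keep a short preview of each line. What follows is a minimal sketch, not part of this repository; the `preview_lines` name, the chunk size, and the preview width are all hypothetical choices.

```ruby
# Hypothetical helper (not part of this repository): summarize each line of a
# huge file without ever holding a whole line in memory.
def preview_lines(path, preview_bytes: 80, chunk_bytes: 1 << 20)
  summaries = []
  preview = String.new # empty binary string
  length = 0

  File.open(path, 'rb') do |file|
    # Read fixed-size chunks so a multi-gigabyte line never sits in RAM.
    while (chunk = file.read(chunk_bytes))
      # A limit of -1 keeps trailing empty pieces, so the last piece is always
      # the (possibly empty) start of a line that is still open.
      pieces = chunk.split("\n", -1)
      pieces.each_with_index do |piece, index|
        length += piece.bytesize
        room = preview_bytes - preview.bytesize
        preview << piece.byteslice(0, room) if room.positive?
        next if index == pieces.length - 1 # line continues in the next chunk

        summaries << { bytes: length, preview: preview }
        preview = String.new
        length = 0
      end
    end
  end

  # Report the final line when the file does not end with a newline.
  summaries << { bytes: length, preview: preview } if length.positive?
  summaries
end
```

Calling `preview_lines('some_dump.sql')` (the path is a placeholder) returns one hash per line with its size in bytes and its first few bytes, which is usually enough to tell the comment lines from the giant statement lines.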
There are additional issues with the data itself:
- All the email headers are stored in a single column that is just a text string. @pudo has pointed out this is a PHP serialized string... because, sure.
- It looks like there was a length limit in whatever system this came out of, because some of the headers seem to be cut off.
- Email messages can be made with a bunch of different encodings, but they were exported as, it seems, UTF-8 or extended ASCII. This means that anything that was in a non-Latin language is totally scrambled beyond repair.
- The PHP serialize spec requires a "length of object"; however, if the text was scrambled this length is invalid, and there's no way to guess how long it should be.
- Instead I had to create, essentially, a pure-Ruby, fault-tolerant PHP serialize parser that handles malformed data properly. It may break and is pretty fragile, but it works better than anyone should be able to hope for.
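To illustrate the length problem, here is a minimal sketch of the general idea; it is not the parser in this repository. The `TolerantPhpReader` class and its method names are hypothetical, and it only covers a handful of PHP serialize types (`N`, `b`, `i`, `d`, `s`, `a`). It trusts a declared string length only when that length lands exactly on the closing `";`, and otherwise falls back to scanning for the next `";`.

```ruby
# Hypothetical sketch (not the parser in this repository): a tiny reader for
# PHP `serialize` output that tolerates wrong string lengths.
class TolerantPhpReader
  def initialize(input)
    @data = input.b # PHP string lengths count bytes, so work on raw bytes
    @pos = 0
  end

  def read_value
    tag = @data[@pos]
    @pos += 2 # skip the type tag and the ':' (or the ';' after N)
    case tag
    when 'N' then nil
    when 'b' then read_until(';') == '1'
    when 'i' then read_until(';').to_i
    when 'd' then read_until(';').to_f
    when 's' then read_string
    when 'a' then read_array
    else raise ArgumentError, "unsupported type tag #{tag.inspect}"
    end
  end

  private

  # Return the text up to the delimiter and move past the delimiter.
  def read_until(delimiter)
    stop = @data.index(delimiter, @pos)
    raise ArgumentError, "missing #{delimiter.inspect}" unless stop

    text = @data[@pos...stop]
    @pos = stop + delimiter.bytesize
    text
  end

  def read_string
    declared = [read_until(':').to_i, 0].max
    start = @pos + 1 # skip the opening quote
    stop = start + declared
    # Trust the declared length only if it lands exactly on the closing `";`.
    # Otherwise guess: take everything up to the next `";` (this cuts a value
    # short if the value itself contains that sequence).
    stop = @data.index('";', start) unless @data[stop, 2] == '";'
    raise ArgumentError, 'unterminated string' unless stop

    @pos = stop + 2
    # Replace bytes that are not valid UTF-8 instead of raising later.
    @data[start...stop].force_encoding(Encoding::UTF_8).scrub
  end

  def read_array
    read_until('{') # ignore the declared element count; read until '}'
    result = {}
    until @data[@pos] == '}'
      raise ArgumentError, 'unterminated array' if @pos >= @data.bytesize

      key = read_value
      result[key] = read_value
    end
    @pos += 1
    result
  end
end
```

For example, `TolerantPhpReader.new('a:1:{s:4:"from";s:99:"a@b.example";}').read_value` still recovers the `from` value even though the declared length of 99 is wrong. The fallback is a guess, so anything read that way should be treated as suspect.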
Each file contains a header setting up a bunch of database settings, followed by a ``LOCK TABLES `user_emails_archive` WRITE`` command.
The odd thing is that `user_emails_archive` is unused otherwise, but since it's here, I've left it. (Note that in the 552MB file this first `LOCK TABLES` is not required.)
Following the first one there's a `40000 ALTER TABLE` line that is apparently required, though, again, that table is not used.
Immediately afterwards there is a ``LOCK TABLES `user_emails` WRITE;`` command.
Below this are the `INSERT` statements.
Each insert statement contains five columns:
- id (integer)
- plain-text content (string)
- html content (string)
- headers (serialized PHP as a string)
- text-encoding (string)
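Because the content columns are free text, it is safe to assume they can contain semicolons and quotes, so statements can't be separated by naively splitting on `;`. Below is a minimal sketch of a quote-aware splitter; it is not the logic used in `sql_split.rb`, and the `split_statements` name is hypothetical. It works on an in-memory string, so for the large files the same state (`quote`, `escaped`) would have to be carried across chunks.

```ruby
# Hypothetical sketch (not the logic in sql_split.rb): split a blob of SQL into
# statements on ';', ignoring semicolons inside quoted strings.
def split_statements(sql)
  statements = []
  current = +''
  quote = nil     # the quote character we are currently inside of, if any
  escaped = false # true when the previous character was a backslash

  sql.each_char do |char|
    current << char
    if escaped
      escaped = false # this character was escaped, so it has no meaning
    elsif quote
      if char == '\\'
        escaped = true
      elsif char == quote
        quote = nil
      end
    elsif ["'", '"', '`'].include?(char)
      quote = char
    elsif char == ';'
      statements << current.strip
      current = +''
    end
  end

  # Keep any trailing text that was not terminated by a semicolon.
  rest = current.strip
  statements << rest unless rest.empty?
  statements
end
```

This sketch does not understand SQL comments, so an apostrophe inside a comment would confuse it.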
There are a few different files. All of them have a `--help` flag to explain their options.
The main ones that are used are as follows:
`sql_split.rb`
- A program that runs through an SQL dump, splitting it into files either of a specified size or with a specified number of lines.

`mysql_management.rb`
- A program to take a folder of split SQL files and import them into a MySQL database. (Note: The database and tables have to be set up manually ahead of time.)

`email_cleaner.rb`
- A program to parse the headers of records in a MySQL database after they were imported and export the emails as a `.eml` file.
  - This does a few things:
    - Extracts the "to" and "from" and puts them into the table.
    - Saves a more easily parsed JSON version of the headers to the database as well.
    - Filters for spam using the spam score assigned in the database.
    - Exports as a properly formatted `.eml` file.
  - NOTE: Make sure to manually add the `to` and `from` columns to the database after importing, but before running this.
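As a rough illustration of the export step, a `.eml` file is just a text file with header lines, one empty line, and then the body, with lines ending in CRLF. The sketch below is not the exporter in `email_cleaner.rb`; `build_eml` is a hypothetical name, and it assumes ASCII header values and a plain-text body (non-ASCII header values and HTML parts need extra encoding that is left out here).

```ruby
# Hypothetical sketch (not the exporter in email_cleaner.rb): render a headers
# hash plus a plain-text body as a minimal email message.
def build_eml(headers, body)
  header_lines = headers.map do |name, value|
    # Collapse line breaks so a value can't smuggle in extra header lines.
    "#{name}: #{value.to_s.gsub(/\s*[\r\n]+\s*/, ' ')}"
  end
  # Normalize every kind of line ending in the body to CRLF.
  normalized_body = body.gsub(/\r\n|\r|\n/, "\r\n")
  # One empty line separates the headers from the body.
  (header_lines + ['', normalized_body]).join("\r\n")
end
```

The result can be written out with something like `File.binwrite("#{id}.eml", build_eml(headers, body))`, where `id`, `headers`, and `body` stand in for values read from the database.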
- Ruby 2.6.3
- mysql2 working (this can get tricky, I recommend installing this gem separately before `bundle install`)
- A running MySQL instance on a machine big enough to handle the import sizes. At least a TB.
- Probably a big machine in general, otherwise this can take a very long time.
- First, make sure you have the most recent version of Ruby installed. I tend to use https://rvm.io/.
- Clone this repository to somewhere.
- Install the package manager (we don't have many packages, but it's nice): `gem install bundler`
- Install the required packages: `bundle install`
- Make sure you have the unarchived dump files somewhere nearby you can access.
- `ruby sql_split.rb -s 50 ../sql_dumps/user_emails.sql` will split the file into 50MB chunks.
  - Alternatively, `ruby sql_split.rb -l 40 ../sql_dumps/user_emails.sql` will split it into 40 imports per file.
  - This saves everything to an `output` folder in the directory containing the SQL file. NOTE: This will overwrite anything already in the folder.
  - If you're splitting one of the `archive` files you have to add an `-a` flag to manage the headers properly: `ruby sql_split.rb -s 50 -a .....`
- First make sure you split the files (duh).
- `ruby mysql_management.rb -u <database-user> -p<database-password> <output folder> <database-name>` will import them into the database indicated.
- Make sure the import worked (duh).
- `ruby email_cleaner.rb -u <database-user> -p<database-password> <database-name>` will go through, parse the headers, and save everything.
  - I've put some multi-threading in here, hopefully it will help speed up the process of parsing. It's all in the `-h` but here are some basics:
    - `-s` The number of threads (default 5)
    - `--mysql-timeout` Probably only useful if you're doing this with a remote database; if it's local this shouldn't be touched
    - `-v` Prints out stuff
    - `--debug` Sets debug mode, turning off concurrency and making verbose true. Good if you need to debug the parser since debuggers and concurrency don't play well together.
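For readers curious what a fixed-size pool of worker threads looks like in Ruby, here is a minimal sketch. It is not the code in `email_cleaner.rb`; `process_in_threads` is a hypothetical name, and the default of 5 simply mirrors the default thread count mentioned above.

```ruby
# Hypothetical sketch (not the code in email_cleaner.rb): fan work out to a
# fixed number of threads using a shared queue.
def process_in_threads(items, thread_count: 5, &block)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # once drained, pop returns nil instead of blocking

  results = Queue.new
  workers = Array.new(thread_count) do
    Thread.new do
      # Each worker pulls items until the queue is empty. Note that a nil or
      # false item would also end the loop, so this assumes truthy items.
      while (item = queue.pop)
        results << block.call(item)
      end
    end
  end
  workers.each(&:join)

  # Results arrive in completion order, not input order.
  Array.new(results.size) { results.pop }
end
```

Passing `thread_count: 1` processes everything on a single worker, which is roughly the situation a debug mode without concurrency gives you. On standard Ruby, threads mostly help while waiting on I/O such as database queries, since only one thread runs Ruby code at a time.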
- Also make sure everything is imported.
- `ruby email_cleaner.rb -u <mysql-database-user> -p <mysql-database-password> -h <mysql-host-name> -e <eml-output-directory>`
  - There's some more here as well, `-h` should explain them but:
    - `-i` The id to start from when getting entries from the db, for partial output
    - `-v` verbose mode, etc.
    - `--debug` puts everything into single thread mode and lets you debug
- Wait... a very long time, like, days, weeks. The end result will be a directory of `.eml` files. From here you can zip them (another tedious process) or whatever.
This was created by Christopher Guess @cguess. If you want more info reach out to him at cguess@gmail.com. PGP Key: https://keybase.io/cguess