Data-Driven Security: Analysis, Visualization and Dashboards Paperback – 8 Apr 2014
From the Back Cover
A practical guide to securing your data and IT infrastructure
From safeguarding corporate data to keeping e-commerce transactions secure, today's IT professionals are tasked with enormous and complex data security responsibilities. In Data-Driven Security, Jay Jacobs and Bob Rudis draw together three of the most important topics in IT (security, data analysis, and visualization) to present a real-world security strategy to defend your networks. Turning their backs on insufficient security based on hunches and best practices, the authors help you access the world of security data analysis and visualization, where real data drives security decisions, and they teach you to apply the principles of that security with real-world cases.
- Develop an understanding of how to acquire, prepare, and visualize security data
- Learn how to use the analytical and visualization tools in R and Python
- Dissect IP addresses to find malicious activity
- Map security data and learn statistical techniques to look for significant connections
- Understand how visual communication works and how it can help you see and present your data clearly
- Develop effective, informative security dashboards
- Design analytical models to help you detect malicious behavior
- Gain practical how-to knowledge from specific, real-world use cases detailing an array of data and network security scenarios
Visit the companion website at www.wiley.com/go/datadrivensecurity for additional information and resources
About the Author
Jay Jacobs is the coauthor of Verizon Data Breach Investigation Reports and the cofounder of the Society of Information Risk Analysts, where he currently sits on the board of directors.
Bob Rudis is the Director of Enterprise Information Security & IT Risk Management at Liberty Mutual Insurance and was named one of the Top 25 Influencers in Information Security by Tripwire.
Most Helpful Customer Reviews on Amazon.com (beta)
In this extremely valuable book, authors and noted experts Jay Jacobs and Bob Rudis bring their decades of experience to the reader and show you how to find security patterns in your data logs and extract enough information from them to create effective information security countermeasures. By using data correctly and truly understanding what that data means, the authors show how you can achieve much greater levels of security.
The book is meant for a serious reader who is willing to put in the time and effort to learn the programming necessary (mainly in Python and R) to truly understand what information exists deep in the recesses of their logs. As to R, it is a GNU project and a free software programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. For analysis the level of which Jacobs and Rudis prescribe, R is a godsend.
The following are the 12 densely packed chapters in the book:
1 : The Journey to Data-Driven Security
2 : Building Your Analytics Toolbox: A Primer on Using R and Python for Security Analysis
3 : Learning the "Hello World" of Security Data Analysis
4 : Performing Exploratory Security Data Analysis
5 : From Maps to Regression
6 : Visualizing Security Data
7 : Learning from Security Breaches
8 : Breaking Up with Your Relational Database
9 : Demystifying Machine Learning
10 : Designing Effective Security Dashboards
11 : Building Interactive Security Visualizations
12 : Moving Toward Data-Driven Security
After completing the book, the reader will know which questions to ask to gain security insights, and how to use that data to ensure the overall security of their data and networks. Getting to that level is not at all a trivial task, even if there are vendors who promise to do that.
For many people performing data analysis, the dependable Excel spreadsheet is their basic choice for data manipulation. The book calls the spreadsheet a gateway tool between a text editor and programming. The book notes that spreadsheets work as long as the data is not too large or complex. The book quotes a 2013 report to shareholders from J.P. Morgan in which part of their 2012 losses of $6 billion was attributed to problems with their Excel spreadsheets.
The authors suggest using Excel as a temporary solution for quick one-shot tasks. For those who have recurring analytical tasks or models that are used repeatedly, it's best to move to some type of structured programming language, specifically those that the book suggests and for which it provides a significant number of code examples.
The goal of all data extraction is to use data analysis to answer real questions. A large part of the book focuses on how to ask the right question. In chapter 1, the authors write that every good data analysis project begins with setting a goal and creating one or more research questions. Without a well-formed question guiding the analysis, you may be wasting time and energy seeking convenient answers in the data, or worse, you may end up answering a question that nobody was asking in the first place.
The value of the book is that it shows the reader how to focus on context and purpose of the data analysis by setting the research question appropriately; rather than simply parsing large amounts of data. It's ultimately irrelevant if you can use Hadoop to process petabytes of data if you don't know what you are looking for.
Visualization is a large part of what this book is about, and in chapter 6, Visualizing Security Data, the book notes that the most efficient path to human understanding is via the visual sense. It goes on to detail the many advantages data visualization has, and the key to making it work.
As important as visualization is, describing the data is equally important. In chapter 7, the book introduces the VERIS (Vocabulary for Event Recording and Incident Sharing) framework. VERIS is a set of metrics designed to provide a common language for describing security incidents in a structured and repeatable manner. VERIS helps organizations collect useful incident-related information and share that information, anonymously and responsibly, with others.
The book shows how you can use dashboards for effective data visualization. But the authors warn that a dashboard is not an art show. They caution that given the graphical nature of dashboards, it's easy to fall into the trap of making them look like pieces of modern or fringe art; when they are far more akin to architectural and industrial diagrams that require more controlled, deliberate and constrained design.
The book uses the definition of dashboard according to Stephen Few, in that it's a "visual display of the most important information needed to achieve one or more objectives that has been consolidated in a single computer screen so it can be monitored at a glance". The book enables the reader to create dashboards like that.
Data-Driven Security: Analysis, Visualization and Dashboards is a superb book written by two experts who provide significant amounts of valuable information in every chapter. For those that are willing to put the time and effort into the serious amount of work that the book requires, they will find it a vital resource that will certainly help them achieve much higher levels of security.
...saying that had this book been out when I started, I could have just read it instead. The authors do a fantastic job of parsing the most important messages and examples from each book and seamlessly integrating them into an engaging read. There is a little bit of a (dare I say) sales pitch about specific toolsets the authors are vested in, but that makes sense. That is why they are subject matter experts in the field. I cannot recommend this book enough to anyone who wants to start making sense of statistics and visualizations in the cyber security field. It is even a good book for people who are experts as it provides the "one stop shop" for many of the facts and lessons contained in the aforementioned bibliography.
The book starts off by introducing the reader to what data analysis is, covering historical concepts and how to create a good question to answer with analysis, rather than simply analyzing data for the sake of it.
It then moves on to provide an introduction to R, a free statistical programming language, and to how the authors use Python in conjunction with R to analyze data.
The book is very practically oriented, encouraging the reader to start playing around with both Python and R by providing full coded examples of all the analysis performed in each chapter. To make life easier, all the code examples can be downloaded from the book's website and any data sets used for analysis are either publicly available already or can be downloaded with the source code.
Once you get your head around the basics of using the tools for analysis, the book then walks through examples of the different types of analysis that information security data sets may require, covering things like exploring data sets of malware infections, performing regression analysis on malware data and applying machine learning to breach data. Throughout the examples, the book puts a strong emphasis on visualization of data including both the common mistakes in presenting data analysis and also looks both at static and interactive visualization.
Overall I thoroughly enjoyed reading this book and while I haven't had the time to start looking at applying the ideas in the book to my own data sets, it's opened up a whole world of analysis tools and techniques and has effectively shortcutted my learning in the area dramatically.
The biggest benefit I see from this book is the highly practical oriented approach, which allows anyone with an interest in information security data analysis to quickly get up to speed in the basics, allowing for them to at least have the tools and knowledge to start trying to ask interesting questions and get results, without having to re-invent the wheel.
If you've ever been sitting in front of a huge set of firewall or webserver logs during an incident, trying to figure things out by grepping, cutting and counting results, you're going to get a lot from this book!
I work with the two authors of this book. In fact one of them is my manager. But a) I don’t like to suck up to my colleagues and b) I’m sure they don’t like being sucked up to either. Despite this if you think my review will be biased then stop reading now. Go watch some cat videos.
Data Driven Security is a first-of-its-kind book that aims to achieve the impossible: to be a book that integrates all 3 dimensions of ‘Data Science’: a) Math and Statistical Knowledge, b) Coding/Hacking skills, and c) Domain Knowledge, the domain in this case being Information Security. If these 3 dimensions are unknown to you, look at the figure on the right.
Traditionally, books available for data science have tackled only one dimension at a time, or at best two. This book is unique in that regard as it tackles all 3 dimensions. This is worth mentioning especially when you consider that concepts like statistical and machine learning are not part of traditional InfoSec tools. Traditional InfoSec tools are based around the concept of signature matching, i.e. determining whether a threat matches an entry in a set of already known badness such as a virus, malware, network activity, IP address, or domain name. This approach is always playing catch-up, and the good guys are always one step (in fact several steps) behind the bad guys. This is where data driven security comes in. The idea is to use data analysis techniques for security research and build the next generation of InfoSec tools that can spot badness before it is known. A fascinating field, trust me.
The challenge of writing on such a subject cannot be overstated. The book needs to be approachable by readers coming in from any of the three dimensions. Also, each of the three dimensions is so vast and wide that you can find hundreds of books dedicated to just one single dimension. So have the authors been successful in this endeavor? Read on to find out…
At a glance
The book (ISBN: 978-1-118-79372-5) is published by Wiley Publications in Feb, 2014. Wiley has been publishing some really interesting titles over the past few years in the Data Analysis and Statistics domain. The book is absolutely gorgeous from cover to cover. The page quality is very high and a lot of effort has gone into making the code and figures look stunning. It is one of the most visually pleasing books I have in my collection (in addition to anything by Stephen Few and Edward R. Tufte). All code presented is properly commented, something I seldom find in technical books. There is a ton of code in this book, which is expected as coding is one of the skills in data science. The authors have done a great job presenting code in Python and R. Both Python and R offer libraries and interactive environments for data analysis, and by presenting the code in 2 languages the authors have made the book accessible to a wide audience.
In addition to the book the authors have a website: datadrivensecurity.info for the book, a blog and a podcast where they discuss all things InfoSec and data science.
Chapter 1: The Journey to Data Driven Security
Chapter 1 starts with a brief history of data analysis, from the classical statistical analysis techniques of the Nineteenth and Twentieth centuries to the modern algorithmic approaches of the Twenty-First century. You have enough anecdotes to convince you of the importance of data analysis in case you are still wondering why analyze data in the first place. The chapter then explores the skill sets required for data analysis: Domain expertise, Data Management, Programming, Statistics and finally Visualization. Each topic is given its due credit and you’ll learn how each of these pieces fits into the mosaic. The chapter rounds off with a very important section, ‘Centering on a Question’. It is very much possible that your data analysis can lead you in many directions if you don’t have a proper research question framed (Yours truly is guilty of this many times over). In short you learn the history of data analysis, the skill sets required for it and the importance of framing the right question(s).
Chapter 2: Building Your Analytics Toolbox: A Primer on Using R and Python for Security Analysis
Chapter 2 is all about setting up Python and R development environments for your data analysis. The authors start by explaining their reasons for using Python and R, and more importantly why both (avoiding the situation of having the hammer as your only tool). Next you learn how to set up Python using the Canopy distribution and R using RStudio. You also get some sample code to test your respective setups. Next the chapter introduces you to the concept of a data frame, a tabular data structure often used in data analysis. You get a taste of both R’s native data.frame as well as Python’s DataFrame from the pandas package. Lastly you’ll see how to organize your code for a typical analysis project. I recommend you don’t skip this chapter even if you’re familiar with either Python or R or even both.
Chapter 3: Learning the “Hello World” of Security Data Analysis
Chapter 3 is where things get real. You start by importing AlienVault’s IP Reputation database into your Python and R environments. You then get a feel for the data by performing some basic introspection of the various fields and their data types, along with appropriate statistical summaries of them. Next you perform some basic charting using R’s ggplot2 and Python’s Matplotlib. Even if you are familiar with the basics of exploratory data analysis (EDA), I suggest you don’t skip the ‘Homing In on a Question’ section; this is where you’ll learn how to use EDA for answering specific questions about your data. There are quite a few examples here, both in Python and R, that use bar charts and heatmaps to explore relationships between various fields and derive answers from these relationships.
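To give a flavor of that first look at a data set, here is a minimal Python sketch. It is not the book's code: the sample rows are made up, and the '#'-delimited layout and column names are assumptions chosen purely for illustration. It uses only the standard library rather than pandas.

```python
import csv
import io
from collections import Counter
from statistics import mean, median

# Hypothetical sample rows; the delimiter and column names are assumptions
# for illustration, not a documented schema of any real reputation feed.
SAMPLE = """\
198.51.100.7#4#2#Scanning Host#CN
203.0.113.9#6#3#Malware Domain#US
192.0.2.44#2#2#Scanning Host#CN
198.51.100.23#9#5#C&C#RU
"""
COLUMNS = ["ip", "reliability", "risk", "type", "country"]


def load_rows(text):
    """Parse delimited text into dicts, casting the numeric fields to int."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS, delimiter="#")
    rows = []
    for row in reader:
        row["reliability"] = int(row["reliability"])
        row["risk"] = int(row["risk"])
        rows.append(row)
    return rows


def summarize(rows):
    """Return simple numeric summaries and category counts for the rows."""
    return {
        "risk_mean": mean(r["risk"] for r in rows),
        "reliability_median": median(r["reliability"] for r in rows),
        "by_country": Counter(r["country"] for r in rows),
        "by_type": Counter(r["type"] for r in rows),
    }


if __name__ == "__main__":
    summary = summarize(load_rows(SAMPLE))
    for key, value in summary.items():
        print(key, value)
```

The category counts are the kind of table you would then turn into a bar chart before homing in on a specific question.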
Chapter 4: Performing Exploratory Security Data Analysis
Chapter 4 dives into Exploratory Data Analysis (EDA) for InfoSec research. You get to know the details of IPv4 addresses, how they can be grouped together using Autonomous System Numbers (ASNs), and why that is useful. You also get to learn how to use a GeoIP service like MaxMind’s free GeoIP API for tying an IP address to coordinates on a map. Once you have the geo coordinates you can use charting APIs to plot the IPs on a world map. After that you get to see how to augment IP addresses with other useful attributes such as the IANA block information for that IP.
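As a rough illustration of what dissecting an IPv4 address involves, the sketch below uses Python's standard `ipaddress` module to convert between dotted-quad and integer form and to test membership in a CIDR block. The helper names are my own, the addresses come from documentation ranges, and the ASN and GeoIP lookups are left out because they need external data.

```python
import ipaddress


def ip_to_int(addr):
    """Return the 32-bit integer behind a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(addr))


def int_to_ip(value):
    """Return the dotted-quad string for a 32-bit integer."""
    return str(ipaddress.IPv4Address(value))


def in_block(addr, cidr):
    """Return True if the address falls inside the given CIDR block."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    print(ip_to_int("192.0.2.1"))                    # 3221225985
    print(int_to_ip(3221225985))                     # 192.0.2.1
    print(in_block("192.0.2.1", "192.0.2.0/24"))     # True
    print(in_block("198.51.100.1", "192.0.2.0/24"))  # False
```

Treating an address as an integer is what makes range checks and grouping by block cheap.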
Next the chapter dives into the basics of correlation analysis and lays down some of its core concepts. Lastly you get to build some graph data structures and visualize the graph nature of the relationships between IP addresses in the ZeuS botnet. All these techniques are foundations of initial exploratory analysis, and although this chapter covers quite a few diverse but related concepts from both statistics and information security, it does a good job of tying it all together. This will be the foundation for the analysis in later chapters.
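For readers who have not met correlation before, here is a small sketch of one common measure, the Pearson correlation coefficient, written in plain Python. It is my own illustration rather than the book's code, and the input numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator) and root sums of squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical values: the second series rises in step with the first.
    print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
    # And one where the second series falls as the first rises.
    print(pearson([1, 2, 3], [3, 2, 1]))
```

A value near 1 or -1 signals a strong linear relationship; it says nothing about which variable, if either, drives the other.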
Chapter 5: From Maps to Regression
Chapter 5 starts with the basic concepts of plotting geographical maps. It walks you through plotting with latitude and longitude data, plotting per-country stats using choropleth plots, and zooming in on a specific country (the USA in this example). Plotting some numbers on a geographical map is pointless unless it enables you to derive some information or insight from that plot. The chapter looks at a potentially interesting data point and then uses box plots to see if the data point is indeed an outlier. Finally the mapping part concludes by showing you how to aggregate the data at county level.
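The box plot check mentioned above can be expressed numerically. The sketch below applies the usual 1.5 × IQR rule (Tukey's fences) that box plot whiskers are commonly based on; the counts are invented, and this is an illustration of the general technique, not the book's analysis.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical per-region infection counts with one suspicious value.
    counts = [10, 12, 11, 13, 12, 11, 40]
    print(tukey_outliers(counts))
```

Note that different tools compute quartiles slightly differently, so borderline points may be flagged by one and not another.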
The last part is a quick introduction to regression analysis (there are multitudes of books written on just this one subject). You get to learn how to build regression models and perform analysis based on model parameters. You also see some caveats you need to keep in mind when interpreting regression models. Finally you get to see how to apply regression analysis to see if reported alien sightings have any impact on the infection rate of the ZeroAccess rootkit. Yes, you read that right, and the authors are not fools; they chose these 2 variables to prove a point about multicollinearity, a common problem in regression analysis.
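To make "building a regression model" a little more concrete, here is a minimal ordinary least squares fit of a straight line in plain Python, together with the R² statistic. It is a sketch of the textbook formulas with made-up numbers, not the model the book builds.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Return the share of variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all y values are identical")
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical data lying exactly on y = 1 + 2x.
    xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
    intercept, slope = fit_line(xs, ys)
    print(intercept, slope, r_squared(xs, ys, intercept, slope))
```

A good fit only says the two series move together; as the alien sightings example suggests, it does not mean one explains the other.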
Chapter 6: Visualizing Security Data
Chapter 6 is all about a picture speaking louder than a thousand words. Effective visualization is the key foundation of data analysis. The chapter starts with explaining the need for visualization and semi deep-dives into understanding visual perception and why it is important in building effective visualizations. These are topics that deserve their own books let alone chapters, but yet the authors manage to convey the gist of it all in the first few sections of the chapter.
The chapter then moves on to specific examples of visualization like bar charts, bubble charts, treemaps, distribution visualization using histogram and density plots. You also get a taste of visualizing time series data. Lastly you get to build a movie from your data. (I kid you not.)
Chapter 7: Learning from Security Breaches
Chapter 7 is devoted to the art of examining and analyzing security breaches. The authors introduce you to the VERIS framework, developed by one of the authors for capturing information related to data breaches to be used in Verizon’s annual Data Breach Investigations Report (DBIR). Before examining the details of the VERIS framework, the authors explain why it is necessary to analyze data breaches, what sort of research questions can be answered, and what some of the considerations are when designing a data collection framework for the same.
Next the authors introduce the VERIS framework, its various sections, and the enumerations used in them. You get to learn how VERIS tracks assets, actors, threats, and actions, and how they affect the Confidentiality, Integrity, & Availability (CIA triad) of the breached data. You also learn how to code up discovery/response and the subsequent impact of the data breach on the victim organization.
Next you get to play with some real-life data captured in the VERIS Community Database (VCDB). VCDB is a project used to capture publicly disclosed data breaches and encode them in the VERIS format. The VERIS format is a JSON specification, and you see code examples of doing basic univariate and bivariate analysis with bar charts and heatmaps.
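Here is a rough idea of what univariate and bivariate counting over such JSON records looks like. The records below are heavily simplified, hypothetical stand-ins for VERIS incidents (real records carry far more fields), and the function names are my own.

```python
import json
from collections import Counter

# Hypothetical, heavily simplified incident records for illustration only.
RECORDS_JSON = """
[
  {"actor": {"external": {}}, "action": {"hacking": {}}},
  {"actor": {"external": {}}, "action": {"malware": {}, "hacking": {}}},
  {"actor": {"internal": {}}, "action": {"error": {}}},
  {"actor": {"internal": {}}, "action": {"misuse": {}}}
]
"""


def count_actions(records):
    """Univariate view: how often each action category appears."""
    return Counter(action for rec in records for action in rec["action"])


def count_actor_action(records):
    """Bivariate view: how often each (actor, action) pair appears."""
    return Counter(
        (actor, action)
        for rec in records
        for actor in rec["actor"]
        for action in rec["action"]
    )


if __name__ == "__main__":
    records = json.loads(RECORDS_JSON)
    print(count_actions(records))
    print(count_actor_action(records))
```

The first counter is what feeds a bar chart; the second is the table behind a heatmap.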
Chapter 8: Breaking Up with Your Relational Database
RDBMS, NoSQL and everything in between: that’s what Chapter 8 is all about. With a quick primer on SQL/RDBMS you get your feet wet with MariaDB (a MySQL fork); you learn how to create a small schema for storing InfoSec entities, as well as the difference in speed between a disk-backed and a memory-backed storage engine. From RDBMS we move to NoSQL (Not Only SQL, and not No SQL). The authors first explore BerkeleyDB, a very popular key-value datastore. You have sample code in both R and Python for interacting with BerkeleyDB. Next the chapter deals with Redis, a very popular data-structure datastore. You learn about the various data structures supported by Redis and a couple of its advanced features. The authors also tackle Hadoop & MapReduce for processing security data at scale, touch on MongoDB, and make passing reference to Elasticsearch and Neo4j. Overall the chapter deals with some very popular RDBMSs and NoSQL databases, and provides you with code samples to interact with them in Python and R.
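As a small taste of the relational side, the sketch below creates a tiny schema and runs an aggregate query. It deliberately swaps in Python's built-in `sqlite3` with an in-memory database instead of MariaDB, so it runs without a server; the table and column names are hypothetical and not taken from the book.

```python
import sqlite3


def build_db(path=":memory:"):
    """Create a tiny two-table schema and load a few hypothetical rows."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE hosts (ip TEXT PRIMARY KEY, country TEXT)")
    conn.execute(
        "CREATE TABLE alerts ("
        "id INTEGER PRIMARY KEY, "
        "ip TEXT NOT NULL REFERENCES hosts(ip), "
        "category TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?)",
        [("198.51.100.7", "CN"), ("203.0.113.9", "US")],
    )
    conn.executemany(
        "INSERT INTO alerts (ip, category) VALUES (?, ?)",
        [
            ("198.51.100.7", "scan"),
            ("198.51.100.7", "malware"),
            ("203.0.113.9", "scan"),
        ],
    )
    conn.commit()
    return conn


def alerts_per_country(conn):
    """Join alerts to hosts and count alerts for each country."""
    query = (
        "SELECT h.country, COUNT(*) FROM alerts a "
        "JOIN hosts h ON h.ip = a.ip "
        "GROUP BY h.country ORDER BY h.country"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(alerts_per_country(build_db()))
```

Passing a file path instead of `:memory:` gives a disk-backed database, which is a loose analogue of the disk versus memory comparison the chapter makes.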
Chapter 9: Demystifying Machine Learning
Chapter 9 is all about Machine Learning in the InfoSec domain. Now let’s get this straight: ML is a very vast and widely spread topic. There are entire books devoted just to certain aspects of it. But even then the authors have managed to cover enough ground, and the chapter should definitely pique your interest in ML if you haven’t been exposed to it yet. The chapter starts with defining ML, not an easy thing to do. The chapter shows you how to build a model to distinguish malware from non-malware using classification techniques. Then the chapter deals with model validation techniques and issues, the risks of overfitting, and feature selection, which are some of the common things you deal with when building an ML model. Next the chapter looks at various supervised and unsupervised learning techniques. Finally you get 3 examples: clustering breach data, multidimensional scaling of victim industries, and hierarchical clustering of victim industries.
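The core idea behind model validation is to score a classifier on data it did not see during training. The sketch below shows that with a deliberately simple nearest-centroid classifier and a fixed holdout set. The two numeric features and all the values are invented, and this is my own minimal illustration rather than the algorithm used in the book.

```python
from math import dist


def train_centroids(samples):
    """Return the mean feature vector (centroid) for each label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


def accuracy(centroids, samples):
    """Return the fraction of samples that are classified correctly."""
    hits = sum(predict(centroids, f) == label for f, label in samples)
    return hits / len(samples)


# Hypothetical two-feature observations, split into training and holdout sets.
TRAIN = [
    ((7.5, 3), "malware"), ((7.8, 2), "malware"), ((7.1, 4), "malware"),
    ((4.2, 40), "benign"), ((5.0, 35), "benign"), ((4.6, 50), "benign"),
]
HOLDOUT = [((7.3, 5), "malware"), ((4.9, 45), "benign")]

if __name__ == "__main__":
    model = train_centroids(TRAIN)
    print("holdout accuracy:", accuracy(model, HOLDOUT))
```

A model that scores well on its training data but poorly on the holdout set is the overfitting symptom the chapter warns about.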
It is impossible to do full justice to ML even in a whole book let alone a single chapter, but you still get enough to get you started.
Chapter 10: Designing Effective Security Dashboards
A ‘Dashboard is not an Automobile’. Chapter 10 is about creating effective InfoSec dashboards. The chapter introduces you to bullet graphs (a creation of Stephen Few) as a much saner and more efficient alternative to gauges and dials. You also see examples of other interesting dashboard visualizations like sparklines. The authors have some good advice about dos and don’ts when designing dashboards.
Next the authors deal with a concrete example of conveying and managing security via dashboards. The authors stress simple yet extremely effective bar charts and bullet graphs, as opposed to fancier but confusing UI elements like 3D charts, pie charts etc. To illustrate this point the authors have provided a couple of dashboard makeover examples.
Finally the authors talk about designing dashboards for InfoSec. Stressing two simple questions, a) What is going on? and b) So what?, the authors explain what should and what should not be presented on an InfoSec dashboard and how to present it most effectively.
Chapter 11: Building Interactive Security Visualizations
Chapter 12: Moving Towards Data-Driven Security
The authors provide their own advice for InfoSec research based on their experience and acumen in Chapter 12. They recommend ‘panning for gold’ rather than ‘drilling for oil’; that is to say, not getting bogged down in a specific focus but exploring the data and then focusing on the questions you want to ask. They offer practical advice on various roles one can play in the InfoSec domain, ranging from the Hacker, Coder, Data Munger, Visualizer, Thinker, and Statistician to the Security Domain Expert. For each role they provide a list of resources to sharpen your skill sets. Lastly they offer tips on moving your entire organization towards data-driven security and building security data teams.
Appendix A provides a vast list of web links. From Data Cleansing, Analytics and Visualization tools, to aggregation sites and blogs to follow. There is a ton of material worth checking out and bookmarking here.
Conclusion and Other thoughts
So how do I rate this book ? This is a rather difficult question considering that nothing like this has ever been attempted before. Sure there are plenty of books about traditional InfoSec research and tools, and there are even more books on Statistics, and Machine Learning, and Visualization, not to mention gazillions of books on Programming/Coding. But a book that touches all 3 aspects of Data Science is indeed very rare.
Having said that, I like this book very much; it covers every aspect of Data Science with a focus on InfoSec in just enough detail to do it justice. The code samples are great, but more important is the very serious advice the authors have to offer (albeit in a lighter tone). This book is by no means a small achievement, not only among InfoSec books but among Data Science books as well. I don’t see any reason why this book should not be in your collection if you deal with InfoSec and/or Data Science. Even if your domain is not InfoSec, if you are interested in Data Science I would still highly recommend this book, as it will show you how to make Data Science work for your domain using InfoSec as an example.