CVSS Exploratory Data Analysis

Preamble

In [136]:
# used to create block diagrams
%reload_ext xdiag_magic
%xdiag_output_format svg
    
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation
import plotly.io as pio              # to set shahin plot layout
from wordcloud import WordCloud      # visualising word clouds
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 10]
pio.templates['shahin'] = pio.to_templated(go.Figure().update_layout(legend=dict(orientation="h",y=1.1, x=.5, xanchor='center'),margin=dict(t=0,r=0,b=40,l=40))).layout.template
pio.templates.default = 'shahin'

Dataset

In [82]:
data = pd.read_csv('../data/nvd_bufferoverflow.csv')
data['published_date'] = pd.to_datetime(data['published_date']).dt.date # date only, remove time
data.tail()
Out[82]:
vuln_id cvss_score3 cvss_score2 summary published_date
275 CVE-2018-4003 9.8 7.5 An exploitable heap overflow vulnerability ex... 2019-03-21
276 CVE-2019-6778 7.8 4.6 In QEMU 3.0.0, tcp_emu in slirp/tcp_subr.c has... 2019-03-21
277 CVE-2019-9895 9.8 7.5 In PuTTY versions before 0.71 on Unix, a remo... 2019-03-21
278 CVE-2019-9903 6.5 4.3 PDFDoc::markObject in PDFDoc.cc in Poppler 0.... 2019-03-21
279 CVE-2019-9904 6.5 4.3 An issue was discovered in lib\\cdt\\dttree.c ... 2019-03-21

Introduction

In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 5 columns):
vuln_id           280 non-null object
cvss_score3       248 non-null float64
cvss_score2       248 non-null float64
summary           280 non-null object
published_date    280 non-null object
dtypes: float64(2), object(3)
memory usage: 11.0+ KB
In [118]:
print(f"Earliest date {data.published_date.min()}")
print(f"Latest date {data.published_date.max()}")
print(f"Over {(data.published_date.max() - data.published_date.min()).days} days")
Earliest date 2019-03-15
Latest date 2019-07-29
Over 136 days

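The dataset has two numerical features, cvss_score3 and cvss_score2, which hold the CVSS v3 and CVSS v2 base scores. Both use a base score range of 0 to 10, but the severity classification differs between the two metrics. E.g. a score of $\geq 9.0$ is classified as "Critical" in CVSS3, but only "High" in CVSS2, which has no "Critical" band and rates every score from 7.0 upward as "High".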
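The difference in severity bands can be made concrete with a small lookup. The following is a minimal sketch, assuming the standard qualitative rating scale for CVSS v3 and the Low/Medium/High ranges NVD applies to CVSS v2; the function names severity_v3 and severity_v2 are hypothetical and not part of the notebook.

```python
import math

def severity_v3(score):
    """Qualitative rating for a CVSS v3 base score."""
    if score is None or math.isnan(score):
        return None            # unscored entry
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"           # 0.1 - 3.9
    if score < 7.0:
        return "Medium"        # 4.0 - 6.9
    if score < 9.0:
        return "High"          # 7.0 - 8.9
    return "Critical"          # 9.0 - 10.0

def severity_v2(score):
    """Qualitative rating for a CVSS v2 base score (no Critical band)."""
    if score is None or math.isnan(score):
        return None            # unscored entry
    if score < 4.0:
        return "Low"           # 0.0 - 3.9
    if score < 7.0:
        return "Medium"        # 4.0 - 6.9
    return "High"              # 7.0 - 10.0

# The same number lands in different bands under the two versions
print(severity_v3(9.8), severity_v2(9.8))
```

With the DataFrame loaded, the functions could be applied column-wise, e.g. data.cvss_score3.map(severity_v3), to compare band counts between the two versions.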

In [11]:
data.describe()
Out[11]:
cvss_score3 cvss_score2
count 248.000000 248.000000
mean 8.393952 6.742339
std 1.321331 1.815771
min 3.700000 2.100000
25% 7.500000 5.000000
50% 8.800000 6.800000
75% 9.800000 7.500000
max 9.800000 10.000000

How many missing vulnerability scores?

In [135]:
data.isna().sum()
Out[135]:
vuln_id            0
cvss_score3       32
cvss_score2       32
summary            0
published_date     0
dtype: int64
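
Both score columns report 32 missing values, which suggests, but does not prove, that the same rows lack both scores. A minimal sketch of how that could be checked, using a made-up frame (toy) in place of the real data:

```python
import pandas as pd

# Made-up rows shaped like the dataset; these are not real NVD records.
toy = pd.DataFrame({
    "cvss_score3": [9.8, 7.8, None, None],
    "cvss_score2": [7.5, 4.6, None, None],
    "published_date": pd.to_datetime(
        ["2019-03-21", "2019-03-21", "2019-07-28", "2019-07-29"]
    ),
})

# Rows where both scores are absent together
both_missing = toy.cvss_score3.isna() & toy.cvss_score2.isna()
# Rows where exactly one score is absent (would indicate partial scoring)
one_missing = toy.cvss_score3.isna() ^ toy.cvss_score2.isna()

print(both_missing.sum(), one_missing.sum())
# Earliest publication date among the unscored rows
print(toy.loc[both_missing, "published_date"].min())
```

If one_missing sums to zero on the real data, the two columns are missing for exactly the same vulnerabilities, and the dates of those rows show whether the gap is concentrated in recent publications.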
In [77]:
fig = go.Figure()

fig.add_trace(go.Box(y=data.cvss_score3, name='CVSS3'))
fig.add_trace(go.Box(y=data.cvss_score2, name='CVSS2'))

fig.show()

CVSS distributions

In [58]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=data.cvss_score3, name='CVSS3'))
fig.add_trace(go.Histogram(x=data.cvss_score2, name='CVSS2'))

fig.update_traces(opacity=0.75)
fig.show()

The summary field holds a free-text description of each vulnerability. This text could be quantified, for example through term frequencies, and used for auxiliary analysis.

In [46]:
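# join the summaries into one string; str() on the array would also pass brackets, quotes and escape characters to the word cloud
text = ' '.join(data.summary)
wordcloud = WordCloud(width=600, height=600, background_color="white").generate(text)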
plt.figure(figsize=(10,10), dpi=80)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
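
One simple way to quantify the summary text is to count term frequencies. The sketch below is only an illustration: the two summaries are made up, and the stop-word list and the term_counts helper are assumptions rather than part of the notebook.

```python
import re
from collections import Counter

# Made-up summaries for illustration; not taken from the dataset.
summaries = [
    "A heap overflow in the parser allows remote code execution.",
    "A stack-based buffer overflow in the daemon allows a crash.",
]

# Small hand-picked stop-word list (an assumption; a real list would be longer)
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "allows"}

def term_counts(texts):
    """Count lower-cased word tokens across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        # Tokens are runs of letters and hyphens, at least two characters long
        tokens = re.findall(r"[a-z][a-z\-]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

counts = term_counts(summaries)
print(counts.most_common(3))
```

On the real data, term_counts(data.summary) would give numeric counts that can be compared across score bands, which the word cloud alone does not allow.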

Timeseries

In [122]:
data = data.set_index('published_date', drop=False)
data.sort_index(inplace=True)
data.tail()
Out[122]:
vuln_id cvss_score3 cvss_score2 summary published_date
published_date
2019-07-28 CVE-2019-14323 NaN NaN SSDP Responder 1.x through 1.5 mishandles inco... 2019-07-28
2019-07-28 CVE-2019-14363 NaN NaN A stack-based buffer overflow in the upnpd bin... 2019-07-28
2019-07-29 CVE-2019-14267 NaN NaN PDFResurrect 0.15 has a buffer overflow via a ... 2019-07-29
2019-07-29 CVE-2019-13126 NaN NaN An integer overflow in NATS Server 2.0.0 allow... 2019-07-29
2019-07-29 CVE-2019-14378 NaN NaN ip_reass in ip_input.c in libslirp 4.0.0 has a... 2019-07-29

Vulnerabilities published daily

In [128]:
daily_frequency = data.published_date.value_counts()
daily_frequency.sort_index(inplace=True)

fig = go.Figure()

fig.add_trace(go.Scatter(x=daily_frequency.index.values, y=daily_frequency.values))

fig.show()
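
value_counts only returns dates on which at least one vulnerability was published, so the line plot connects across days with no publications. A minimal sketch of filling those days with zero, using made-up dates and assuming the dates are kept as datetime64 values (the notebook converts them to plain date objects, which would need pd.to_datetime first):

```python
import pandas as pd

# Made-up publication dates; 2019-03-16 and 2019-03-17 have no entries.
dates = pd.to_datetime(pd.Series(["2019-03-15", "2019-03-15", "2019-03-18"]))

# Count per day, as in the cell above
daily = dates.value_counts().sort_index()

# Build every calendar day in the observed range and fill the gaps with 0
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily_filled = daily.reindex(full_range, fill_value=0)

print(daily_filled)
```

Plotting daily_filled instead of the raw counts makes quiet days visible as zeros rather than as interpolated segments.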

Cumulative mean

In [134]:
# running (expanding) mean of each score in publication order; missing scores are skipped

fig = go.Figure()

fig.add_trace(go.Scatter(x=data.published_date, y=data.cvss_score3.expanding().mean(), name='CVSS3'))
fig.add_trace(go.Scatter(x=data.published_date, y=data.cvss_score2.expanding().mean(), name='CVSS2'))

fig.show()
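
To make clear what expanding().mean() computes, the sketch below compares it with a manual running sum divided by a running count on a made-up series. It illustrates the general behaviour of the pandas call, including how missing scores are skipped; the values are not from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up scores with one missing value
scores = pd.Series([9.8, 7.8, np.nan, 6.5])

# Expanding mean: the mean of all non-missing values seen so far
running = scores.expanding().mean()

# The same quantity by hand: running sum over running count of valid scores
manual = scores.fillna(0).cumsum() / scores.notna().cumsum()

print(pd.DataFrame({"score": scores, "running": running, "manual": manual}))
```

At a missing score the running mean simply carries forward the mean of the earlier valid scores, so the unscored rows at the end of the dataset leave the curve flat rather than pulling it down.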