langu_xyz

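An Interpretation of the 2020 IoT Threat Report
Today I came across this report of under 20 pages published by Unit 42, the threat intelligence team at Palo Alto Networks. It is short, but the content is valuable. Drawing on the work I did last year, here is my modest reading of it; anyone interested can read the original.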

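0x01 Overview
The report has three parts: the current IoT security landscape, the top IoT threats, and conclusions and recommendations.

Part one covers the current IoT security landscape: enterprises lack mature IoT asset management, lack security products that can protect IoT devices, and are short of staff. Overall risk is high, and it is especially high in healthcare.

Part two focuses on the top threats facing IoT today (network attacks, password attacks, worms). It also describes how unpatched devices and legacy protocols enable lateral movement, and how more and more threats are evolving to target IoT specifically.

Part three explains how to address these threats through four steps and two best practices, covered in detail below.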

0x02 IoT Security Landscape
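IoT is growing fast and carries serious security problems. I will not list all the supporting data here; examples include a 21.5% increase in the number of IoT devices from 2018 to 2019, and 98% of IoT traffic being unencrypted.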

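1. Enterprises lack the tools to identify assets and protect IoT devices
- IT cannot accurately identify IoT assets
The report argues that the traditional IT approach of managing assets by IP address and OS is inadequate for IoT. Only by accurately identifying the type of each IoT device can you plan the network and deploy security policy correctly, and track a device's behavior continuously instead of relying on a dynamic IP alone.

Author's note: From my experience over the past year this is spot on. IoT asset management and admission control have always been a major pain point. A machine located only by IP is easy to lose: you find a trojan on a device at some IP today, but by the time you investigate, DHCP has reassigned the address and the affected device can no longer be located. This example is a little extreme, of course. In the broad-IoT setting the MAC address is also a core piece of asset data, but it has shortcomings too. Because devices are so diverse, accurate IoT asset management needs several methods combined, such as traffic analysis, network operations equipment, and manual inspection.
- Most existing security products do not support IoT devices
EDR and similar security products do not support IoT devices. Security products built for PCs classify IoT devices as an unknown type and cannot accurately identify or handle their risks. Network-based security products can detect some risks but cannot accurately identify and track IoT devices.

Author's note: The IoT devices meant here differ somewhat from the "broad IoT" devices found in most deployments in China; they are closer to products like surveillance cameras. Given the diversity of devices in IoT environments, traffic-based threat detection has become the common choice of most vendors, including Palo Alto Networks, the publisher of this report, and it is what I am experimenting with as well. To identify and trace a threat accurately once it is found, IoT assets first have to be accurately identified and effectively managed.
- Between IT and OT, enterprises lack sufficient staff
IT focuses on IT assets such as computers, networks, and printers. OT (operational technology) focuses on non-IT devices, which are the broad-IoT devices I mentioned above. IT and OT are usually separate teams, and because IT assets such as computers evolve quickly, IT gets more resources. For IoT devices, stability comes first (this differs slightly from the report and reflects my own experience), so vulnerabilities are often not proactively fixed, which leaves significant risk.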
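
The DHCP example above (an alert tied to an IP that has since been reassigned) can be made concrete with a small sketch. It assumes you can collect (MAC, IP, timestamp) sightings from somewhere such as DHCP logs or switch ARP tables; the `AssetInventory` class, its method names, and the sample addresses are all hypothetical and not taken from the report.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Asset:
    mac: str
    device_type: str
    # (ip, first_seen) pairs in the order they were observed
    ip_history: list = field(default_factory=list)


class AssetInventory:
    """Tracks devices by MAC address and remembers which IP each one held, and from when."""

    def __init__(self):
        self._by_mac = {}

    def observe(self, mac, ip, device_type, seen_at):
        """Record a (MAC, IP) sighting; sightings per device must arrive in time order."""
        mac = mac.lower()
        asset = self._by_mac.setdefault(mac, Asset(mac, device_type))
        if not asset.ip_history or asset.ip_history[-1][0] != ip:
            asset.ip_history.append((ip, seen_at))

    def who_had(self, ip, at):
        """Return the asset most recently seen holding `ip` at time `at`, or None."""
        best = None
        for asset in self._by_mac.values():
            history = asset.ip_history
            for idx, (held_ip, start) in enumerate(history):
                end = history[idx + 1][1] if idx + 1 < len(history) else None
                holds = held_ip == ip and start <= at and (end is None or at < end)
                if holds and (best is None or start > best[1]):
                    best = (asset, start)
        return best[0] if best else None


inv = AssetInventory()
inv.observe("AA:BB:CC:00:00:01", "10.0.0.5", "camera", datetime(2020, 3, 1, 9, 0))
inv.observe("AA:BB:CC:00:00:01", "10.0.0.9", "camera", datetime(2020, 3, 2, 8, 0))
inv.observe("AA:BB:CC:00:00:02", "10.0.0.5", "printer", datetime(2020, 3, 2, 9, 0))

# The alert fired for 10.0.0.5 on March 1; by March 2 that IP belongs to another device.
suspect = inv.who_had("10.0.0.5", datetime(2020, 3, 1, 12, 0))
print(suspect.mac, suspect.device_type)
```

Keying on the MAC and keeping the IP history lets an alert be resolved against the time it fired, not against whoever holds the address now. As noted above, MAC data has its own gaps, so this would be one input among several.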

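2. Enterprises now face enormous IoT security risk

IoT devices inside the company (cameras, printers, and so on) carry enormous risk because IT does not maintain them.

3. The situation in healthcare is critical
```
  Medical devices run outdated operating systems
  Organizations lack security defense capabilities
  The operating systems of medical devices are very fragile
```
4. Even basic network segmentation best practices are not followed

The simplest remediation for IoT risk is network segmentation, which effectively prevents lateral movement. In practice, however, networks are often not segmented strictly; in healthcare, for example, medical devices and printers are placed in the same segment. The report also notes that the ideal is microsegmentation (which some high-risk scenarios really do call for).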
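
As a rough illustration of checking segmentation, the sketch below flags any segment that mixes device types, such as medical devices sharing a subnet with printers. The inventory format, the subnets, and the `mixed_segments` helper are assumptions made for the example; a real check would read them from your own asset and network data.

```python
import ipaddress
from collections import defaultdict


def mixed_segments(devices, segments):
    """Return {segment: set of device types} for segments that hold more than one type."""
    networks = [ipaddress.ip_network(s) for s in segments]
    types_by_net = defaultdict(set)
    for ip, device_type in devices:
        addr = ipaddress.ip_address(ip)
        for net in networks:
            if addr in net:
                types_by_net[str(net)].add(device_type)
                break
    return {net: types for net, types in types_by_net.items() if len(types) > 1}


# Hypothetical inventory: (IP address, device type)
devices = [
    ("10.1.0.10", "infusion_pump"),
    ("10.1.0.11", "printer"),
    ("10.2.0.10", "ip_camera"),
    ("10.2.0.11", "ip_camera"),
]
segments = ["10.1.0.0/24", "10.2.0.0/24"]

report = mixed_segments(devices, segments)
print(report)
```

Here only `10.1.0.0/24` is reported, because it holds both an infusion pump and a printer.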

0x03 Top IoT Threats
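Threats against IoT devices keep evolving along with new attack techniques such as botnets and worms.

1. Network attacks, password attacks, and IoT worms top the list
- Exploiting vulnerabilities in the target device
The nature of IoT devices makes them easy targets, and they often become a stepping stone for attackers into other systems.
- Password attacks
Author's note: Default and weak passwords really hurt, in applications and on IoT devices alike.
- IoT worms are becoming more common than IoT botnets
Author's note: With the rise of ransomware in recent years, attacks on broad-IoT devices have mostly shifted to it, and cryptomining trojans are also very common. I do not know DDoS via IoT botnets well enough to comment.

2. Unpatched devices and legacy protocols: the entry point for lateral movement
- Low patch coverage
Author's note: As product versions iterate, vendors gradually drop update support for older versions of IoT devices. Security patches are also often skipped because of the device's operating environment and stability requirements.
- Legacy OT protocols
This occurs mostly in industrial control environments. As the network perimeter disappears, the risks of these old protocols are being exposed.
- Lateral movement
57% of IoT devices are vulnerable to medium- or high-severity attacks, which makes IoT devices an entry point for attackers.

3. Many threats are evolving to target IoT environments specifically
- Peer-to-peer (P2P) communication
This lets an attack control a cluster of IoT devices on an internal network with minimal communication with the outside.
- Fighting for the host
Malware strains kill each other off to compete for resources.
- Malware variants
For example, the Mirai family.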

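0x04 Summary and recommendations
1. Four steps to reduce IoT risk (not comprehensive, but they lower IoT risk considerably)
```
  1. Discover IoT device assets;
  2. Patch;
  3. Segment finely with VLANs;
  4. Monitor in real time.
```
Author's gripe: There is no arguing with these steps, so you might as well go buy their appliance. Joking aside, these steps are very effective for most broad-IoT environments today. How to actually carry them out is the hard part, and it is what I worked toward last year and will keep working toward.

Best practice 1: Think holistically about the IoT lifecycle
- 1. Identify: device admission
- 2. Perimeter: combine NAC and firewall (as far as I know, some teams are already doing this)
- 3. Secure: traffic-based threat detection (what I am working on)
- 4. Optimize: raise the utilization of IoT devices
- 5. Manage: real-time monitoring and alerting
- 6. Retire: an audit process for decommissioning IoT devices
Best practice 2: Extend security to all IoT devices through product integrations

Security product integrations include: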
- Asset management and computerized maintenance management systems (CMMS)
- Security information and event management (SIEM)
- Security orchestration, automation, and response (SOAR)
- Next-generation firewalls (NGFW)
- Network access control (NAC)
- Wireless/Network management solutions
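~~Author's summary~~

~~Short as this report is, I have to admit that Palo Alto Networks' appliances are expensive for a reason. Most of the report hits on the security threats facing today's IoT environments, broad-IoT environments in particular, and the overall approach is roughly the same as what I am doing. The report does not say how to put any of it into practice, though, and like most security vendors' material it feels somewhat like a castle in the air. My own validation last year showed that the two best practices are feasible; actually landing them is a long road, to be taken one step at a time.~~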

Report link: https://start.paloaltonetworks.com/unit-42-iot-threat-report?utm_source=marketo&utm_medium=email&utm_campaign=AMERICAS-DA-EN-20-03-10-7010g000001JJOZAA4-P3-Strata-Unit%2042%20IoT%20Report.Americas-DA-EN-20-03-10-XX-P3-Strata_IoT%20Report%20A/B
2020-03-12
- IoT security
- IoT

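Three Key Checkpoints in Application Security Review, and a Lever for Each

How do you learn quickly that a project has been kicked off? How do you know when an application goes live? How do you track its iterations? What can be done about recurring broken-access-control vulnerabilities?

Preface

How should application security be done? It is a well-worn question, so why raise it again? From my two-plus years of building application security, landing the SDL in full is very hard. Some parts, such as training, code scanning, and incident response, have mature approaches and need no discussion. Application security is in essence an operations effort, and the hardest piece to land is security review.

Methodologically, security review is not hard: take the STRIDE threat modeling model and the DREAD risk rating model, fold in the company's actual circumstances, and a customized review checklist is more or less ready. Then the real questions arrive: at which stage do you step in, by what means, and how do you keep it running?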

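The three checkpoints where security review needs to step in

Project kickoff, go-live, and iteration.

At kickoff: a weak foundation shakes everything built on it, so this stage calls for an architecture security review. It is needed because it resolves the biggest risks at the lowest cost. If architectural security problems go unnoticed here, the cost of fixing them and the risk both climb sharply with each iteration.

At go-live: today's white-box scanning mostly finds code-level vulnerabilities and can do little about architectural or business-related risks. Hence the need for a go-live security review, which verifies whether the risks raised at kickoff are still present and checks for business-logic risks (broken access control, sensitive data exposure, and so on).

During iteration: most development today follows quick kickoff, quick launch, and continuous iteration, so most functionality ships in later iterations. By my rough count, once an application is more than 90% complete, over 80% of its interfaces were added during iteration. In other words, the vast majority of web interfaces reach the public internet without any security review and become attack surface, which in turn leads to frequent authorization-related security issues.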

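At kickoff: fast awareness

How do you become aware quickly? These are the approaches I know of.

One: get on good terms with the PDs so that new projects are shared with you promptly. This approach is limited and suits companies with few applications, or a single business line.

Two: use existing platforms. Shared platform support teams, engineering productivity teams, and the like often have steps where a kickoff becomes visible; the finance department is another very good one. This takes a keen nose: find such a point and wire into it. These platforms will most likely agree to cooperate, since it raises both the platform's value and its influence.

Three: set up your own entry point, so that projects are kicked off on the security platform. Pulling this off takes luck. Why? It requires a large staffing investment and mandatory process changes, which usually need CXO backing, and winning that backing takes a security risk with a big enough impact. That said, as far as I know several large companies do exactly this.

The first approach is too loose and its results are uneven. I used it for a long time and the outcome was poor: as the business grew rapidly, the project review rate fell to a very low level.

The second approach is highly feasible: a low-cost, effective lever. One problem is what to do when there are very many projects. My current thinking is to review by project tier; tiers can be defined in many ways, such as person-days or business line.

I will not dwell on the third approach: stay ready, and seize the chance when it comes.

Once you have an effective lever, remember to bring data security, business security, and similar teams along, because the main risks at this stage tend to concentrate in risk control, compliance, and the like.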

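At go-live: the release gate

A release gate at go-live is easy to build: embed it in the application build platform. One question is why not gate every release.

A look at the build platform's release records answers it: releases are too frequent to operate on, so the fallback is to gate only the first go-live release.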
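
The first-release gate described above amounts to a small decision rule. Here is a minimal sketch, assuming the build platform can tell you an application's past releases and the security platform can tell you which applications have an approved review; the function name and both data structures are hypothetical.

```python
def release_allowed(app, release_history, approved_reviews):
    """Gate only an application's first release; later releases pass through."""
    past_releases = release_history.get(app, [])
    if past_releases:
        # Iterations are too frequent to gate one by one; they are tracked separately.
        return True
    # First go-live: require an approved security review.
    return app in approved_reviews


# Hypothetical platform data
release_history = {"billing": ["2020-01-03"], "coupon": []}
approved_reviews = {"search"}

for app in ("billing", "coupon", "search"):
    print(app, release_allowed(app, release_history, approved_reviews))
```

With this data, `billing` passes because it has shipped before, `coupon` is blocked because it has neither a release nor a review, and `search` passes on the strength of its approved review.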

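During iteration: continuous tracking

As noted above, "over 80% of interfaces are added during iteration". That is a very large risk: the security of those interfaces depends entirely on developers' security awareness and on incident response. Black-box scanning continues after go-live, but I am not aware of any tool yet that finds authorization-related vulnerabilities with low false positives and low risk.

As mentioned under go-live review, the build platform has a great many release records, and reviewing iterations from them would consume enormous effort. It is barely manageable with a two-digit number of applications. With hundreds or thousands of applications and up to several hundred thousand lines of code changed per day, it cannot be done even working around the clock.

Here is my approach, commit monitoring:

1. Pull commit records automatically at regular intervals;
2. Fetch and parse the application source (most white-box code scanners can parse source; I implemented my own);
3. Parse the diff records to find newly added web interfaces;
4. Use taint tracking combined with key methods (permission checks and the like) to estimate a rough risk score (my implementation of this does not work very well: almost every application has its own authorization logic, so it generalizes poorly).

These few steps are enough to track an application's newly added web interfaces. Other general interface platforms may exist, and the tracking idea is similar.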
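
Step 3 can be sketched with nothing more than a regular expression over a unified diff. This is not my production parser, which works on the parsed source; it is a minimal illustration that assumes Java services using Spring-style `@GetMapping`/`@PostMapping` annotations with a literal path, and the `new_endpoints` helper and the sample diff are made up for the example.

```python
import re

# Assumed pattern: Spring-style mapping annotations whose path is a string literal.
MAPPING_RE = re.compile(
    r'@(Get|Post|Put|Delete|Patch|Request)Mapping\s*\(\s*(?:value\s*=\s*|path\s*=\s*)?"([^"]+)"'
)


def new_endpoints(diff_text):
    """Return (file, annotation, path) for mapping annotations on lines added by a unified diff."""
    found = []
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/path/File.java" names the file after the change
            current_file = line[4:].strip()
            if current_file.startswith("b/"):
                current_file = current_file[2:]
            continue
        if line.startswith("+"):
            match = MAPPING_RE.search(line[1:])
            if match:
                found.append((current_file, match.group(1) + "Mapping", match.group(2)))
    return found


sample_diff = '''diff --git a/src/UserController.java b/src/UserController.java
--- a/src/UserController.java
+++ b/src/UserController.java
@@ -10,6 +10,10 @@
     @GetMapping("/user/profile")
     public Profile profile() { return load(); }
+
+    @PostMapping("/user/delete")
+    public void delete(@RequestParam long id) { remove(id); }
-    @GetMapping("/user/old")
'''

for item in new_endpoints(sample_diff):
    print(item)
```

Only the added `@PostMapping("/user/delete")` is reported; the unchanged and removed mappings are ignored. A regex like this misses paths built from constants or class-level prefixes, which is one reason to parse the source properly as in step 2.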

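In operation I found there can still be too many new interfaces, so I now prioritize (key applications, applications that have had high-severity issues, and so on). Once a new interface is found, a quick manual code review is usually enough to spot the risk. Odd authorization logic still turns up, and that is when to talk to the developers.

My impression from this period is that as more applications are covered, the amount of code to read each day rises quickly, but the results are clear. The need for the automated analysis in step 4 is increasingly pressing.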

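Summary

To sum up: there are three checkpoints, and the key is to find a lever for each, making full use of existing resources and creating a lever where none exists. Approach operations with a technologist's mindset, and build the product with a founder's mindset.

This article is purely my own experience. If anything in it is off the mark, please point it out; I would be grateful.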

2020-02-07

Application security

THINK

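Account security lifecycle
Registration
Risks
- Spam registration
- Account enumeration
- Weak passwords
- Bots
Protections
- Slider CAPTCHA
- SMS verification
- Forced password change
Authorization
Login
Risks
- Credential stuffing
- Account theft
- Weak passwords
- Bot behavior
Protections
- Forced password change
- Slider CAPTCHA
- SMS verification
- Forced verification or login ban for high-risk accounts
- Forced verification for important accounts
- Allowlist
- Forced verification for dormant accounts
Logout
- Session not invalidated
Password recovery
- Account theft
- Message replay
- Account enumeration
- Logic flaws
High-risk operations
- Changing the password
- Changing the phone number
- Changing the email address
Protections
- Mandatory second verification prompt
- Offline identity verification
Account deletion
- Users with outstanding debt may not delete their accounts (to do)
Overall
- Trust framework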
2019-10-22
- Security techniques
- Web

Domain generation algorithms (DGA) are algorithms seen in various families of malware that are used to periodically generate a large number of domain names that can be used as rendezvous points with their command and control servers.

Example

0x02 Random Forest

random forest = bagging + decision trees

0x03 code

Random Forest
MultinomialNB

import os, sys
import traceback
import json
import optparse
import pickle
import collections
import sklearn
import sklearn.feature_extraction
import sklearn.ensemble
import sklearn.metrics
import pandas as pd
import numpy as np
import tldextract
import math
import operator
from sklearn.model_selection import train_test_split
from matplotlib import pylab
from pylab import *

Collecting the data

alexa_dataframe = pd.read_csv('data/alexa_100k.csv', names=['rank','uri'], header=None, encoding='utf-8')
alexa_dataframe.info()
alexa_dataframe.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
rank    100000 non-null int64
uri     100000 non-null object
dtypes: int64(1), object(1)
memory usage: 1.5+ MB

	rank	uri
0	1	facebook.com
1	2	google.com
2	3	youtube.com
3	4	yahoo.com
4	5	baidu.com

dga_dataframe = pd.read_csv('data/dga_domains.txt', names=['raw_domain'], header=None, encoding='utf-8')
dga_dataframe.info()
dga_dataframe.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2669 entries, 0 to 2668
Data columns (total 1 columns):
raw_domain    2669 non-null object
dtypes: object(1)
memory usage: 20.9+ KB

	raw_domain
0	04055051be412eea5a61b7da8438be3d.info
1	1cb8a5f36f.info
2	30acd347397c34fc273e996b22951002.org
3	336c986a284e2b3bc0f69f949cb437cb.info
4	336c986a284e2b3bc0f69f949cb437cb.org

word_dataframe = pd.read_csv('data/words.txt', names=['word'], header=None, dtype={'word': str}, encoding='utf-8')
word_dataframe.info()
word_dataframe.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479623 entries, 0 to 479622
Data columns (total 1 columns):
word    479619 non-null object
dtypes: object(1)
memory usage: 3.7+ MB

	word
0	1080
1	10-point
2	10th
3	11-point
4	12-point
5	16-point
6	18-point
7	1st
8	2
9	20-point

Preparing the data

def domain_extract(uri):
    ext = tldextract.extract(uri)
    if (not ext.suffix):
        return None
    else:
        return ext.domain
    
alexa_dataframe['domain'] = [ domain_extract(uri) for uri in alexa_dataframe['uri']]
del alexa_dataframe['rank']
del alexa_dataframe['uri']
alexa_dataframe = alexa_dataframe.dropna()
alexa_dataframe = alexa_dataframe.drop_duplicates()
alexa_dataframe.info()
alexa_dataframe.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91377 entries, 0 to 99999
Data columns (total 1 columns):
domain    91377 non-null object
dtypes: object(1)
memory usage: 1.4+ MB

	domain
0	facebook
1	google
2	youtube
3	yahoo
4	baidu

alexa_dataframe['class'] = 'legit'
# Label the legitimate data as 'legit'
alexa_dataframe.head()

	domain	class
0	facebook	legit
1	google	legit
2	youtube	legit
3	yahoo	legit
4	baidu	legit

# Shuffle the data (important for training/testing)
alexa_dataframe = alexa_dataframe.reindex(np.random.permutation(alexa_dataframe.index))
# Shuffle the row order and reindex
#Randomly permute a sequence, or return a permuted range
alexa_total = alexa_dataframe.shape[0]
print('Total Alexa domains %d' % alexa_total)

Total Alexa domains 91377

dga_dataframe['domain'] = dga_dataframe['raw_domain'].map(lambda x: x.split('.')[0].strip().lower())
# Series.map applies the function to each raw domain, keeping only the first label in lower case.
del dga_dataframe['raw_domain']

dga_dataframe = dga_dataframe.dropna()
dga_dataframe = dga_dataframe.drop_duplicates()
dga_total = dga_dataframe.shape[0]
print('Total DGA domains %d' % dga_total)

Total DGA domains 2664

dga_dataframe['class'] = 'dga'
dga_dataframe.head()

	domain	class
0	04055051be412eea5a61b7da8438be3d	dga
1	1cb8a5f36f	dga
2	30acd347397c34fc273e996b22951002	dga
3	336c986a284e2b3bc0f69f949cb437cb	dga
5	40a43e61e56a5c218cf6c22aca27f7ee	dga

def entropy(s):
    '''
    Shannon entropy of the string s, in bits per character
    '''
    p, lns = collections.Counter(s), float(len(s))
    return -sum( count/lns * math.log(count/lns, 2) for count in p.values())

all_domains = pd.concat([alexa_dataframe, dga_dataframe], ignore_index=True)
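# Stack the two frames row-wise.
# Neither index carries any meaning here, so ignore_index=True renumbers the rows.
all_domains['length'] = [len(x) for x in all_domains['domain']]
all_domains = all_domains[all_domains['length'] > 6]
# Drop short domains (6 characters or fewer), which would add noise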
all_domains['entropy'] = [entropy(x) for x in all_domains['domain']]
all_domains.head(10)

	domain	class	length	entropy
0	facebook	legit	8	2.750000
2	youtube	legit	7	2.521641
5	wikipedia	legit	9	2.641604
10	blogspot	legit	8	2.750000
11	twitter	legit	7	2.128085
12	linkedin	legit	8	2.500000
19	wordpress	legit	9	2.725481
23	microsoft	legit	9	2.947703
27	xvideos	legit	7	2.807355
28	googleusercontent	legit	17	3.175123
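
The entropy column can be checked by hand. The feature is the Shannon entropy of the character distribution, H = -Σ p(c)·log2 p(c), where p(c) is the share of the domain's characters equal to c. The snippet below is a standalone restatement of the `entropy` function above using only the standard library; the name `shannon_entropy` is just for this example.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of the characters in s, in bits per character."""
    n = len(s)
    return -sum((count / n) * math.log2(count / n) for count in Counter(s).values())


# 'facebook' has 8 characters: 'o' appears twice, the other six appear once.
# H = 6 * (1/8) * log2(8) + (2/8) * log2(4) = 2.25 + 0.5 = 2.75
print(shannon_entropy("facebook"))
print(round(shannon_entropy("twitter"), 6))
```

Both values agree with the table above (2.750000 for facebook, 2.128085 for twitter). Repeated characters lower the entropy, which is why `twitter`, with three `t`s, scores lower than `youtube` at the same length.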

Analyzing the data

# Box plots
all_domains.boxplot('length','class')
pylab.ylabel('Domain Length')
all_domains.boxplot('entropy','class')
pylab.ylabel('Domain Entropy')

Text(0,0.5,'Domain Entropy')

cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
alexa = all_domains[~cond]
plt.scatter(alexa['length'], alexa['entropy'], s=140, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['length'], dga['entropy'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
# Show the legend
pylab.xlabel('Domain Length')
pylab.ylabel('Domain Entropy')

Text(0,0.5,'Domain Entropy')

all_domains.tail(10)

	domain	class	length	entropy
94031	xcfwwghb	dga	8	2.750000
94032	xcgqdfyrkgihlrmfmfib	dga	20	3.684184
94033	xclqwzcfcx	dga	10	2.646439
94034	xcpfxzuf	dga	8	2.500000
94035	xcvxhxze	dga	8	2.405639
94036	xdbrbsbm	dga	8	2.405639
94037	xdfjryydcfwvkvui	dga	16	3.500000
94038	xdjlvcgw	dga	8	3.000000
94039	xdrmjeu	dga	7	2.807355
94040	xflrjyyjswoatsoq	dga	16	3.500000

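# Note: this cell uses the 'alexa_grams' and 'word_grams' columns, which are computed further down.
legit = all_domains[(all_domains['class']=='legit')]
max_grams = np.maximum(legit['alexa_grams'],legit['word_grams'])
ax = max_grams.hist(bins=80)
ax.figure.suptitle('Histogram of the Max NGram Score for Domains')
# A histogram puts the score on the x-axis and the number of domains on the y-axis
pylab.xlabel('Maximum NGram Score')
pylab.ylabel('Number of Domains')

Text(0,0.5,'Number of Domains')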

word_dataframe = word_dataframe[word_dataframe['word'].map(lambda x: str(x).isalpha())]
word_dataframe = word_dataframe.applymap(lambda x: str(x).strip().lower())
word_dataframe = word_dataframe.dropna()
word_dataframe = word_dataframe.drop_duplicates()
word_dataframe.head(10)

	word
37	a
48	aa
51	aaa
53	aaaa
54	aaaaaa
55	aaal
56	aaas
57	aaberg
58	aachen
59	aae

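alexa_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=1e-4, max_df=1.0)
# Bag-of-n-grams model that counts character n-gram frequencies
# ngram_range: range of n-gram lengths to extract (3 to 5 characters here)
# n-grams whose document frequency is below min_df or above max_df are dropped from the vocabulary
counts_matrix = alexa_vc.fit_transform(alexa_dataframe['domain'])
# fit_transform learns the vocabulary and returns the per-domain n-gram count matrix
alexa_counts = np.log10(counts_matrix.sum(axis=0).getA1())
# Total count of each n-gram across all domains, on a log10 scale
print(alexa_counts[:10])
ngrams_list = alexa_vc.get_feature_names()
# The n-gram strings, in the same order as the columns of the count matrix
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
print(ngrams_list[:10])

_sorted_ngrams = sorted(zip(ngrams_list, alexa_counts), key=operator.itemgetter(1), reverse=True)
# zip() pairs each n-gram with its log count
# sorted() returns a new sorted list and leaves the inputs unchanged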
print('Alexa NGrams: %d' % len(_sorted_ngrams))
for ngram, count in _sorted_ngrams[:10]:
    print(ngram, count)

[1.         1.         1.17609126 1.64345268 1.11394335 1.14612804
 1.         1.17609126 1.07918125 1.54406804]
['-20', '-a-', '-ac', '-ad', '-ads', '-af', '-ag', '-ai', '-air', '-al']
Alexa NGrams: 23613
ing 3.443888546777372
lin 3.4271614029259654
ine 3.399673721481038
tor 3.26528962586083
ter 3.2631624649622166
ion 3.2467447097238415
ent 3.228913405994688
por 3.2013971243204513
the 3.2005769267548483
ree 3.16345955176999

# Build the same n-gram count features from the dictionary words
dict_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=1e-5, max_df=1.0)
counts_matrix = dict_vc.fit_transform(word_dataframe['word'])
dict_counts = np.log10(counts_matrix.sum(axis=0).getA1())
ngrams_list = dict_vc.get_feature_names()
print(ngrams_list[:10])

['aaa', 'aab', 'aac', 'aad', 'aaf', 'aag', 'aah', 'aai', 'aak', 'aal']

_sorted_ngrams = sorted(zip(ngrams_list, dict_counts), key=operator.itemgetter(1), reverse=True)
print('Word NGrams: %d' % len(_sorted_ngrams))
for ngram, count in _sorted_ngrams[:10]:
    print(ngram, count)

Word NGrams: 123061
ing 4.387300822448285
ess 4.204879333760662
ati 4.1933472563864616
ion 4.165036479994566
ter 4.162415036106447
nes 4.112504458767161
tio 4.076822423342773
ate 4.0723602039634885
ent 4.069631102620343
tion 4.0496056125949735

def ngram_count(domain):
    '''
    Print the n-gram match scores of a domain against the Alexa and dictionary n-grams
    '''
    alexa_match = alexa_counts * alexa_vc.transform([domain]).T  
    dict_match = dict_counts * dict_vc.transform([domain]).T
    print('%s Alexa match:%d Dict match: %d' % (domain, alexa_match, dict_match))

ngram_count('google')
ngram_count('facebook')
ngram_count('1cb8a5f36f')
ngram_count('pterodactylfarts')

google Alexa match:17 Dict match: 14
facebook Alexa match:31 Dict match: 27
1cb8a5f36f Alexa match:0 Dict match: 0
pterodactylfarts Alexa match:35 Dict match: 76
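
What `ngram_count` computes is easier to see without the vectorizer: each 3- to 5-character substring of the domain contributes the log10 of how often that substring occurs in the reference corpus, and substrings the corpus has never seen contribute nothing. Below is a standard-library sketch of that idea. The three-word corpus is made up, and the sketch skips the `min_df` filtering that `CountVectorizer` applies, so its numbers are not comparable with the output above.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """All substrings of text with length n_min..n_max, repeats included."""
    return [text[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(text) - n + 1)]


def build_weights(corpus):
    """Map each n-gram to log10 of its total count across the corpus."""
    totals = Counter()
    for word in corpus:
        totals.update(char_ngrams(word))
    return {gram: math.log10(count) for gram, count in totals.items()}


def ngram_score(domain, weights):
    """Sum the corpus weight of every n-gram occurrence in the domain."""
    return sum(weights.get(gram, 0.0) for gram in char_ngrams(domain))


# Toy corpus for illustration only
weights = build_weights(["testing", "resting", "nesting"])

print(ngram_score("sting", weights))   # six shared n-grams, each weighted log10(3)
print(ngram_score("xq7zk", weights))   # no n-gram in common with the corpus
```

A random-looking string shares almost no n-grams with real words or popular domains, so its score stays near zero. That is what makes `alexa_grams` and `word_grams` useful features.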

#Compute NGram matches for all the domains and add to our dataframe
all_domains['alexa_grams']= alexa_counts * alexa_vc.transform(all_domains['domain']).T
all_domains['word_grams']= dict_counts * dict_vc.transform(all_domains['domain']).T
all_domains.head(10)

	domain	class	length	entropy	alexa_grams	word_grams
0	facebook	legit	8	2.750000	31.302278	27.872426
2	youtube	legit	7	2.521641	25.855170	18.287142
5	wikipedia	legit	9	2.641604	24.571024	29.175635
10	blogspot	legit	8	2.750000	24.435141	19.274501
11	twitter	legit	7	2.128085	23.244500	31.130820
12	linkedin	legit	8	2.500000	24.774916	32.904408
19	wordpress	legit	9	2.725481	38.369509	33.806635
23	microsoft	legit	9	2.947703	32.133033	39.530125
27	xvideos	legit	7	2.807355	28.906360	18.846834
28	googleusercontent	legit	17	3.175123	67.315750	86.104683

#Use the vectorized operations of the dataframe to investigate differences
all_domains['diff'] = all_domains['alexa_grams'] - all_domains['word_grams']
all_domains.sort_values(['diff'], ascending=True).head(10)

	domain	class	length	entropy	alexa_grams	word_grams	diff
79366	bipolardisorderdepressionanxiety	legit	32	3.616729	117.312465	190.833856	-73.521391
72512	channel4embarrassingillnesses	legit	29	3.440070	95.786979	169.119440	-73.332460
10961	stirringtroubleinternationally	legit	30	3.481728	134.049367	207.204729	-73.155362
85031	americansforresponsiblesolutions	legit	32	3.667838	148.143049	218.363956	-70.220908
20459	pragmatismopolitico	legit	19	3.326360	61.244630	121.536223	-60.291593
13702	egaliteetreconciliation	legit	23	3.186393	91.938518	152.125325	-60.186808
4706	interoperabilitybridges	legit	23	3.588354	95.037285	153.626312	-58.589028
85161	foreclosurephilippines	legit	22	3.447402	74.506548	132.514638	-58.008090
45636	annamalicesissyselfhypnosis	legit	27	3.429908	68.680068	126.667692	-57.987623
70351	corazonindomablecapitulos	legit	25	3.813661	75.535473	133.160690	-57.625217

all_domains.sort_values(['diff'], ascending=False).head(10)

	domain	class	length	entropy	alexa_grams	word_grams	diff
54228	gay-sex-pics-porn-pictures-gay-sex-porn-gay-se...	legit	56	3.661056	159.642301	85.124184	74.518116
85091	article-directory-free-submission-free-content	legit	46	3.786816	235.233896	188.230453	47.003443
16893	stream-free-movies-online	legit	25	3.509275	120.250616	74.496915	45.753701
63380	watch-free-movie-online	legit	23	3.708132	103.029245	58.943451	44.085794
44253	best-online-shopping-site	legit	25	3.452879	123.377240	79.596640	43.780601
22524	social-bookmarking-sites-list	legit	29	3.702472	145.755266	102.261826	43.493440
66335	free-online-directory	legit	21	3.403989	123.379738	80.735030	42.644708
46553	free-links-articles-directory	legit	29	3.702472	153.239055	110.955361	42.283694
59873	online-web-directory	legit	20	3.584184	116.310717	74.082948	42.227769
58016	web-directory-online	legit	20	3.584184	114.402671	74.082948	40.319723

# Domains with low n-gram scores
weird_cond = (all_domains['class']=='legit') & (all_domains['word_grams']<3) & (all_domains['alexa_grams']<2)
weird = all_domains[weird_cond]
print(weird.shape[0])
weird.head(10)

	domain	class	length	entropy	alexa_grams	diff
1246	twcczhu	legit	7	2.521641	1.748188	1.748188
2009	ggmm777	legit	7	1.556657	1.518514	1.518514
2760	qq66699	legit	7	1.556657	1.342423	1.342423
17347	crx7601	legit	7	2.807355	0.000000	0.000000
18682	hzsxzhyy	legit	8	2.250000	0.000000	0.000000
19418	02022222222	legit	11	0.684038	1.041393	1.041393
19887	3181302	legit	7	2.235926	0.000000	0.000000
21172	hljdns4	legit	7	2.807355	1.755875	1.755875
26441	05tz2e9	legit	7	2.807355	0.000000	0.000000
26557	fzysqmy	legit	7	2.521641	1.176091	1.176091

# Relabel these legitimate domains with low n-gram scores as 'weird'
all_domains.loc[weird_cond, 'class'] = 'weird'
all_domains['class'].value_counts()

legit    67221
dga       2664
weird       91
Name: class, dtype: int64

all_domains[all_domains['class'] == 'weird'].head()

	domain	class	length	entropy	alexa_grams	diff
1246	twcczhu	weird	7	2.521641	1.748188	1.748188
2009	ggmm777	weird	7	1.556657	1.518514	1.518514
2760	qq66699	weird	7	1.556657	1.342423	1.342423
17347	crx7601	weird	7	2.807355	0.000000	0.000000
18682	hzsxzhyy	weird	8	2.250000	0.000000	0.000000

cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
alexa = all_domains[~cond]
plt.scatter(alexa['word_grams'], alexa['entropy'], s=140, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['word_grams'], dga['entropy'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
# Show the legend
pylab.xlabel('Domain word_grams')
pylab.ylabel('Domain Entropy')

Text(0,0.5,'Domain Entropy')

Training the model

not_weird = all_domains[all_domains['class'] != 'weird']
X = not_weird[['length', 'entropy', 'alexa_grams', 'word_grams']].values
# Feature matrix as a NumPy array (DataFrame.as_matrix was removed in pandas 1.0; .values replaces it)
y = np.array(not_weird['class'].tolist())
# Class labels as a NumPy array
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=20)
# A random forest classifier
# n_estimators: the number of trees in the forest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Random train/test split, with 20% of the samples held out for testing
clf.fit(X_train, y_train)
# Fit the classifier on the training data
y_pred = clf.predict(X_test)
# Predict the test data with the trained classifier

def show_cm(cm, labels):
    # Row-wise percentages: each cell divided by the total of its true-class row
    percent = (cm*100.0)/np.array(np.matrix(cm.sum(axis=1)).T)  
    print('Confusion Matrix Stats')
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            print("%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum()))

labels = ['legit', 'dga']
cm = sklearn.metrics.confusion_matrix(y_test, y_pred, labels=labels)
# The confusion matrix tabulates true classes (rows) against predicted classes (columns)
show_cm(cm, labels)

Confusion Matrix Stats
legit/legit: 99.57% (13369/13427)
legit/dga: 0.43% (58/13427)
dga/legit: 15.45% (85/550)
dga/dga: 84.55% (465/550)
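
The percentages printed above are per-row rates, that is, recall for each class. Precision comes from the same four counts by reading down the columns. A small standard-library sketch using the counts from the output above; the helper name `per_class_metrics` is only for this example.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] = samples whose true class is labels[i] and predicted class is labels[j]."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = cm[k][k]
        predicted_as_label = sum(row[k] for row in cm)   # column total
        actually_label = sum(cm[k])                      # row total
        metrics[label] = {
            "precision": true_positive / predicted_as_label if predicted_as_label else 0.0,
            "recall": true_positive / actually_label if actually_label else 0.0,
        }
    return metrics


labels = ["legit", "dga"]
cm = [[13369, 58],   # true legit: predicted legit, predicted dga
      [85, 465]]     # true dga:   predicted legit, predicted dga

for label, m in per_class_metrics(cm, labels).items():
    print(label, round(m["precision"], 4), round(m["recall"], 4))
```

For the `dga` class, recall is 465/550 (the 84.55% shown above) and precision is 465/523, about 0.889. With roughly 24 legitimate test domains for every DGA domain, the 58 false positives already make up about 11% of all DGA predictions, which is why the per-class numbers are more informative here than overall accuracy.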

importances = zip(['length', 'entropy', 'alexa_grams', 'word_grams'], clf.feature_importances_)
# Inspect the importance of each feature
list(importances)

[('length', 0.16033779891739047),
 ('entropy', 0.12175502861193326),
 ('alexa_grams', 0.5087685303664589),
 ('word_grams', 0.20913864210421748)]

clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Testing the model

def test_it(domain):
    _alexa_match = alexa_counts * alexa_vc.transform([domain]).T  
    _dict_match = dict_counts * dict_vc.transform([domain]).T
    _X = [[len(domain), entropy(domain), _alexa_match, _dict_match]]
    print('%s : %s' % (domain, clf.predict(_X)[0]))

test_it('google')
test_it('google8sdflkajssjgjksdh')
test_it('faceboosadfadfafdk')
test_it('1cb8a5f36f')
test_it('pterodactyladfasdfasdffarts')
test_it('ptes9dro-dwacty2lfa5rrts')
test_it('beyonce')
test_it('bey666on4ce')
test_it('supersexy')
test_it('yourmomissohotinthesummertime')

google : legit
google8sdflkajssjgjksdh : dga
faceboosadfadfafdk : legit
1cb8a5f36f : dga
pterodactyladfasdfasdffarts : legit
ptes9dro-dwacty2lfa5rrts : dga
beyonce : legit
bey666on4ce : dga
supersexy : legit
yourmomissohotinthesummertime : legit

Using the model

def save_model_to_disk(name, model, model_dir='models'):
    serialized_model = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
    model_path = os.path.join(model_dir, name+'.model')
    print('Storing Serialized Model to Disk (%s:%.2fMeg)' % (name, len(serialized_model)/1024.0/1024.0))
    open(model_path,'wb').write(serialized_model)

save_model_to_disk('dga_model_random_forest', clf)
save_model_to_disk('dga_model_alexa_vectorizor', alexa_vc)
save_model_to_disk('dga_model_alexa_counts', alexa_counts)
save_model_to_disk('dga_model_dict_vectorizor', dict_vc)
save_model_to_disk('dga_model_dict_counts', dict_counts)

Storing Serialized Model to Disk (dga_model_random_forest:1.80Meg)
Storing Serialized Model to Disk (dga_model_alexa_vectorizor:2.93Meg)
Storing Serialized Model to Disk (dga_model_alexa_counts:0.18Meg)
Storing Serialized Model to Disk (dga_model_dict_vectorizor:5.39Meg)
Storing Serialized Model to Disk (dga_model_dict_counts:0.94Meg)

def load_model_from_disk(name, model_dir='models'):
    model_path = os.path.join(model_dir, name+'.model')
    try:
        model = pickle.loads(open(model_path,'rb').read())
        print('success')
    except:
        print('Could not load model: %s from directory %s!' % (name, model_path))
        return None
    return model

clf = load_model_from_disk('dga_model_random_forest')
alexa_vc = load_model_from_disk('dga_model_alexa_vectorizor')
alexa_counts = load_model_from_disk('dga_model_alexa_counts')
dict_vc = load_model_from_disk('dga_model_dict_vectorizor')
dict_counts = load_model_from_disk('dga_model_dict_counts')
model = {'clf':clf, 'alexa_vc':alexa_vc, 'alexa_counts':alexa_counts,
                 'dict_vc':dict_vc, 'dict_counts':dict_counts}

success
success
success
success
success

def evaluate_url(model, url):
    domain = domain_extract(url)
    # Score the extracted domain (not the full URL) so the features match those used in training
    alexa_match = model['alexa_counts'] * model['alexa_vc'].transform([domain]).T
    dict_match = model['dict_counts'] * model['dict_vc'].transform([domain]).T
    
    X = [[len(domain), entropy(domain), alexa_match, dict_match]]
    y_pred = model['clf'].predict(X)[0]
    
    print('%s : %s' % (domain, y_pred))

evaluate_url(model, 'adfhalksfhjashfk.com')

adfhalksfhjashfk : dga

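from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

mtnb = MultinomialNB()
mtnb.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

nb_y_pred=mtnb.predict(X_test)
print(classification_report(y_test, nb_y_pred))
# Pass labels explicitly: without it, confusion_matrix orders the classes alphabetically
# ('dga', 'legit'), which does not match the ['legit', 'dga'] order that show_cm prints
cm = sklearn.metrics.confusion_matrix(y_test, nb_y_pred, labels=labels)
show_cm(cm, labels)

             precision    recall  f1-score   support

        dga       0.71      0.87      0.78       550
      legit       0.99      0.99      0.99     13427

avg / total       0.98      0.98      0.98     13977

Confusion Matrix Stats
legit/legit: 98.56% (13233/13427)
legit/dga: 1.44% (194/13427)
dga/legit: 13.27% (73/550)
dga/dga: 86.73% (477/550)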

import os
import random
import tldextract
import sklearn
import pandas as pd
import numpy as np

from keras.models import Sequential, load_model
from keras.preprocessing import sequence
from keras.layers import Dense, Dropout, Activation, Embedding, LSTM
from sklearn import feature_extraction
from sklearn.model_selection import train_test_split
from datetime import datetime
from zipfile import ZipFile

alexa_dataframe = pd.read_csv('data/top-1m.csv', names=['rank','uri'], header=None, encoding='utf-8')
alexa_dataframe.info()
alexa_dataframe.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
rank    1000000 non-null int64
uri     1000000 non-null object
dtypes: int64(1), object(1)
memory usage: 15.3+ MB

	rank	uri
0	1	google.com
1	2	youtube.com
2	3	facebook.com
3	4	baidu.com
4	5	wikipedia.org

def load_data_set(filename):
    fw = open('data/dga_domain.txt', 'w+')
    with open(filename, "r") as f:
        for line in f.readlines():
            lineArr = line.strip().split('\t')
            fw.write(lineArr[1] + '\n')
    fw.close()
load_data_set('data/dga.txt')

dga_dataframe = pd.read_csv('data/dga_domain.txt', names=['raw_domain'], header=None, encoding='utf-8')
dga_dataframe.info()
dga_dataframe.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1158695 entries, 0 to 1158694
Data columns (total 1 columns):
raw_domain    1158695 non-null object
dtypes: object(1)
memory usage: 8.8+ MB

	raw_domain
0	ogxbnjopz.biz
1	zyejwiist.net
2	buuqogz.com
3	vpjmomduqll.org
4	uakwifutnpn.biz

def domain_extract(uri):
    ext = tldextract.extract(uri)
    if (not ext.suffix):
        return None
    else:
        return ext.domain
    
alexa_dataframe['domain'] = [ domain_extract(uri) for uri in alexa_dataframe['uri']]
del alexa_dataframe['rank']
del alexa_dataframe['uri']
alexa_dataframe = alexa_dataframe.dropna()
alexa_dataframe = alexa_dataframe.drop_duplicates()
alexa_dataframe['length'] = [len(x) for x in alexa_dataframe['domain']]
alexa_dataframe = alexa_dataframe[alexa_dataframe['length'] > 6]
alexa_dataframe.info()
alexa_dataframe.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 718018 entries, 1 to 999999
Data columns (total 2 columns):
domain    718018 non-null object
length    718018 non-null int64
dtypes: int64(1), object(1)
memory usage: 16.4+ MB

	domain	length
1	youtube	7
2	facebook	8
4	wikipedia	9
11	instagram	9
13	twitter	7

alexa_dataframe['class'] = 'legit'
# Label the legitimate data as 'legit'
alexa_dataframe.head()

	domain	length	class
1	youtube	7	legit
2	facebook	8	legit
4	wikipedia	9	legit
11	instagram	9	legit
13	twitter	7	legit

# Shuffle the data (important for training/testing)
alexa_dataframe = alexa_dataframe.reindex(np.random.permutation(alexa_dataframe.index))
# Shuffle the row order and reindex
#Randomly permute a sequence, or return a permuted range
alexa_total = alexa_dataframe.shape[0]
print('Total Alexa domains %d' % alexa_total)

Total Alexa domains 718018

dga_dataframe['domain'] = dga_dataframe['raw_domain'].map(lambda x: x.split('.')[0].strip().lower())
# Series.map applies the function to each raw domain, keeping only the first label in lower case.
del dga_dataframe['raw_domain']

dga_dataframe = dga_dataframe.dropna()
dga_dataframe = dga_dataframe.drop_duplicates()
dga_dataframe['length'] = [len(x) for x in dga_dataframe['domain']]
dga_dataframe = dga_dataframe[dga_dataframe['length'] > 6]
dga_total = dga_dataframe.shape[0]
print('Total DGA domains %d' % dga_total)

Total DGA domains 1082010

dga_dataframe['class'] = 'dga'
dga_dataframe.head()

	domain	length	class
0	ogxbnjopz	9	dga
1	zyejwiist	9	dga
2	buuqogz	7	dga
3	vpjmomduqll	11	dga
4	uakwifutnpn	11	dga

all_domains = pd.concat([alexa_dataframe[:5000], dga_dataframe[:5000]], ignore_index=True)
#
all_domains.head(10)

	domain	length	class
0	youtube	7	legit
1	facebook	8	legit
2	wikipedia	9	legit
3	instagram	9	legit
4	twitter	7	legit
5	blogspot	8	legit
6	netflix	7	legit
7	pornhub	7	legit
8	xvideos	7	legit
9	livejasmin	10	legit

all_domains.tail(10)

	domain	length	class
9990	mxepwpxki	9	dga
9991	xnvqgaddhivrqowtbs	18	dga
9992	btgjyoydcwoeigdldngr	20	dga
9993	mnnridfyhxkyk	13	dga
9994	jmcctiodbdemfejo	16	dga
9995	mepoiwtmeffy	12	dga
9996	iwpikrmppfqeere	15	dga
9997	gcibdmrs	8	dga
9998	tusdspujigdyntbxusuah	21	dga
9999	wvsiuqhblxfijnoefjnao	21	dga

X = all_domains['domain']
labels = all_domains['class']

ngram_vectorizer = feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
count_vec = ngram_vectorizer.fit_transform(X)
max_features = count_vec.shape[1]

y = [0 if x == 'legit' else 1 for x in labels]

final_data = []

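Multilayer perceptron (MLP)

def build_model(max_features):
    model = Sequential()
    model.add(Dense(1, input_dim=max_features, activation='sigmoid'))
    # One fully connected layer with a sigmoid activation: input dimension max_features, output dimension 1
    model.compile(loss='binary_crossentropy',optimizer='adam')
    # Compile with binary cross-entropy (log loss) and the Adam optimizer
    return model

max_epoch = 50
nfolds = 10
# 10 folds, each with its own random train/test split
batch_size = 128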

for fold in range(nfolds):
    print("fold %u/%u" % (fold+1, nfolds))
    X_train, X_test, y_train, y_test, _, label_test = train_test_split(count_vec, y, labels, test_size=0.2)

    print('Build model...')
    model = build_model(max_features)

    print("Train...")
    X_train, X_holdout, y_train, y_holdout = train_test_split(X_train, y_train, test_size=0.05)
    best_iter = -1
    best_auc = 0.0
    out_data = {}

    for ep in range(max_epoch):
        # `epochs` replaces the old `nb_epoch` argument; train one epoch per call
        model.fit(X_train.toarray(), np.array(y_train), batch_size=batch_size, epochs=1)
        # With a sigmoid output, predict() returns the class-1 probability
        # (Sequential.predict_proba is no longer available in recent Keras releases)
        t_probs = model.predict(X_holdout.toarray())
        t_auc = sklearn.metrics.roc_auc_score(y_holdout, t_probs)
        # AUC on the holdout set, used for early stopping

        print('Epoch %d: auc = %f (best=%f)' % (ep, t_auc, best_auc))
        if t_auc > best_auc:
            best_auc = t_auc
            best_iter = ep

            probs = model.predict(X_test.toarray())
            out_data = {'y':y_test, 'labels': label_test, 'probs':probs, 'epochs': ep,
                            'confusion_matrix': sklearn.metrics.confusion_matrix(y_test, probs > .5)}
            print(sklearn.metrics.confusion_matrix(y_test, probs > .5))
        else:
            # Stop once the holdout AUC has not improved for more than 5 epochs
            if (ep-best_iter) > 5:
                break

    final_data.append(out_data)
    model.save('model.h5')

fold 1/10
Build model...
Train...

Epoch 1/1
7600/7600 [==============================] - 1s 86us/step - loss: 0.6297
Epoch 0: auc = 0.950239 (best=0.000000)
[[915  86]
 [108 891]]
Epoch 1/1
7600/7600 [==============================] - 0s 26us/step - loss: 0.5243
Epoch 1: auc = 0.980196 (best=0.950239)
[[952  49]
 [ 83 916]]
Epoch 1/1
7600/7600 [==============================] - 0s 31us/step - loss: 0.4502
Epoch 2: auc = 0.984872 (best=0.980196)
[[965  36]
 [ 78 921]]
Epoch 1/1
7600/7600 

Epoch 32: auc = 0.994192 (best=0.994192)

model = load_model('model.h5')

print(final_data)

[{'y': [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 
1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 
0, 0, 1, 1], 'labels': 2403    legit
2789    legit
450     legit
4521    legit
2841    legit
8645      dga
6999      dga
7831      dga
6291      dga
3746    legit
6226      dga
4111    legit
8487      dga
678     legit
90      legit
6151      dga
8300      dga
4004    legit
2489    legit
4836    legit
8291      dga
8198      dga
8911      dga
7585      dga
260     legit
5905      dga
5646      dga
970     legit
8718      dga
275     legit
        ...  
8589      dga
6620      dga
7470      dga
5230      dga
4827    legit
5677      dga
3417    legit
8539      dga
7147      dga
3699    legit
4751    legit
3043    legit
5475      dga
3736    legit
3887    legit
6349      dga
4996    legit
7379      dga
3530    legit
1942    legit
7914      dga
9752      dga
6717      dga
5363      dga
7622      dga
961     legit
1641    legit
4607    legit
8649      dga
6087      dga
Name: class, Length: 2000, dtype: object, 'probs': array([[0.14488636],
       [0.00496732],
       [0.00896166],
       ...,
       [0.00593334],
       [0.95598286],
       [0.9867235 ]], dtype=float32), 'epochs': 43, 'confusion_matrix': array([[972,  29],
       [ 62, 937]])}

z_test = np.array([[0, 0, 0, 0, 0,  0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 
0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 
1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]])
model.predict(z_test)

array([[1.]], dtype=float32)

print(sklearn.metrics.classification_report(final_data[0]['y'], final_data[0]['probs'] > .5))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96       970
           1       0.97      0.95      0.96      1030

   micro avg       0.96      0.96      0.96      2000
   macro avg       0.96      0.96      0.96      2000
weighted avg       0.96      0.96      0.96      2000

LSTM

def build_model_lstm(max_features, maxlen):
    """Build LSTM model"""
    model = Sequential()
    model.add(Embedding(max_features, 128, input_length=maxlen))
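    #Embedding layer: maps each positive integer index to a fixed-size (128-dimensional) vector
    model.add(LSTM(128))
    #LSTM layer: the core layer, which learns features from the character sequence
    model.add(Dropout(0.5))
    #Dropout layer to reduce overfitting
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='rmsprop')
    #Compile the model with binary cross-entropy (log loss) and the rmsprop optimizer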

    return model

X = all_domains['domain']
labels = all_domains['class']

valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X)))}
max_features = len(valid_chars) + 1
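#Vocabulary size: number of distinct characters plus one (index 0 is left for padding)
maxlen = np.max([len(x) for x in X])
#Length of the longest domain name
X = [[valid_chars[y] for y in x] for x in X]
#Convert each domain to a list of character indices
X = sequence.pad_sequences(X, maxlen=maxlen)
#Pad every sequence to the same length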
y = [0 if x == 'legit' else 1 for x in labels]
final_data = []
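
The encoding step above can be tried without Keras. The following is a minimal sketch, not the post's own code: `encode_domains` is a hypothetical helper, the two sample domains are made up, and the left-padding with zeros is written by hand on the assumption that it matches the default behaviour of `sequence.pad_sequences`. The characters are sorted so that the mapping is deterministic.

```python
def encode_domains(domains):
    """Map each character to a positive index and left-pad with zeros."""
    chars = sorted(set("".join(domains)))  # sorted for a deterministic mapping
    valid_chars = {ch: idx + 1 for idx, ch in enumerate(chars)}
    maxlen = max(len(d) for d in domains)
    encoded = []
    for d in domains:
        ids = [valid_chars[ch] for ch in d]
        encoded.append([0] * (maxlen - len(ids)) + ids)  # index 0 is padding
    return valid_chars, encoded


valid_chars, encoded = encode_domains(["ab.cn", "b.cn"])
max_features = len(valid_chars) + 1  # +1 for the padding index
print(valid_chars)  # {'.': 1, 'a': 2, 'b': 3, 'c': 4, 'n': 5}
print(encoded)      # [[2, 3, 1, 4, 5], [0, 3, 1, 4, 5]]
```

Every row has the same length, which is what the Embedding layer expects as input.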

for fold in range(nfolds):
    print("fold %u/%u" % (fold+1, nfolds))
    X_train, X_test, y_train, y_test, _, label_test = train_test_split(X, y, labels, 
                                                                           test_size=0.2)

    print('Build model...')
    model = build_model_lstm(max_features, maxlen)

    print("Train...")
    X_train, X_holdout, y_train, y_holdout = train_test_split(X_train, y_train, test_size=0.05)
    best_iter = -1
    best_auc = 0.0
    out_data = {}

    for ep in range(max_epoch):
        model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=1)

        t_probs = model.predict_proba(X_holdout)
        t_auc = sklearn.metrics.roc_auc_score(y_holdout, t_probs)

        print('Epoch %d: auc = %f (best=%f)' % (ep, t_auc, best_auc))

        if t_auc > best_auc:
            best_auc = t_auc
            best_iter = ep

            probs = model.predict_proba(X_test)

            out_data = {'y':y_test, 'labels': label_test, 'probs':probs, 'epochs': ep, 'confusion_matrix': sklearn.metrics.confusion_matrix(y_test, probs > .5)}

            print(sklearn.metrics.confusion_matrix(y_test, probs > .5))
        else:
            if (ep-best_iter) > 2:
                break

    final_data.append(out_data)

fold 1/10
Build model...
Train...

Epoch 1/1
7600/7600 [==============================] - 24s 3ms/step - loss: 0.3562
Epoch 0: auc = 0.979725 (best=0.000000)
[[893 113]
 [ 42 952]]
Epoch 1/1
7600/7600 [==============================] - 23s 3ms/step - loss: 0.1643
Epoch 7: auc = 0.980221 (best=0.981659)
Epoch 1/1
7600/7600 [==============================] - 21s 3ms/step - loss: 0.1603
Epoch 8: auc = 0.979843 (best=0.981659)

print(sklearn.metrics.classification_report(final_data[0]['y'], final_data[0]['probs'] > .5))

              precision    recall  f1-score   support

           0       0.95      0.96      0.96      1006
           1       0.96      0.95      0.95       994

   micro avg       0.96      0.96      0.96      2000
   macro avg       0.96      0.96      0.96      2000
weighted avg       0.96      0.96      0.96      2000

2019-08-24

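Security Data Analysis

Finding a Way Forward for Security Operations in SOAR (Security Orchestration, Automation and Response)

Every discipline has three levels: technique (术), method (法), and the Way (道). Method is what technique becomes once it has been mastered and distilled into principle, and that principle in turn guides the refinement of technique; it is the stage at which a discipline matures. To attain method is to reach the middle level.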

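0x00 My understanding of enterprise application security
I have been involved in enterprise application security for a little over two years. I joined while my company's application security program was still at an early stage, and over the past year I also had the chance to take a deep part in building application security at several small and medium-sized companies. That has given me some thoughts on enterprise security programs, whether they are built on a cloud security platform or on in-house platforms, and I have gradually formed my own view of security:

"Enterprise security is a dynamic contest that needs sustained investment. Security is an important attribute of the business and one of its core competitive strengths, and application security is, in essence, operations. What matters most in a security program is the approach, the angle, and the level from which security problems are viewed. Offense and defense complement each other."

What is operations? Any manual intervention around a website or product counts as operations. And what is heavy operations? Put plainly, it is operations that needs a great deal of manual work and a great deal of communication with other roles. From my point of view, enterprise security is heavy-operations work, especially in the later stages, once platforms, tools, and policies are reasonably complete. Heavy operations is painful, all the more so for technical security operations, so I spent a long time thinking about a way out of this bind. I now have a few ideas, and this post shares my thinking on application security within enterprise security.

When a company starts building application security, it usually goes through these stages on the way to efficient operations: procurement, in-house development, and closing the loop between products.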

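0x01 Procurement stage
Anyone who has been a "one-person security department" will recognize this stage. After a quick survey you will most likely find that you are starting from nothing. In better cases, ops or IT colleagues have already covered part of the ground, such as system vulnerabilities and high-risk ports, but that is clearly far from enough. This is when you start buying security products: "never mind how well it works, stop the bleeding first." You survey vendors' security solutions and search the open-source communities for tools and platforms. After spending a great deal of energy, and after the despair of applying for a security budget, you finally deploy a firewall, an IDS, and similar products, and pick up an open-source SOC platform, scanner, risk-control platform, and so on. Now you have a gun, but whether you hit any game still depends on the person holding it. In the same way, how to use security products so that they deliver the most value deserves careful thought. It takes a lot of internal research, attempts to fit in with existing processes, mechanisms, and products, and support from senior management, and here everyone has to find their own way.

When integrating these products, you find that without secondary development they rarely work together effectively; mostly each fights its own battle. At the same time the products have to be operated: handling alert logs, vulnerability scanning, driving vulnerability fixes, reviewing application releases, controlling risk for campaigns, all kinds of incident response, and so on. Trying to take on all of it at this stage is hard. In my view, security work on the in-house side ultimately has to answer for results, and two metrics matter most for security outcomes: the number of vulnerabilities/incidents, and the coverage of security products.

So in the early stage, when full coverage is out of reach, the most effective approach is to find what hurts the business most and what it cares about most, protect that first, and earn recognition. Get through this stage by iterating on a combination of short-term rapid bleeding control and long-term security mechanisms. Why look for the business's biggest pain point? Enterprise security exists entirely to serve and protect the business. The business always comes first; without a business there is no security. (Of course, if the company's business runs entirely on cloud products, this stage can be skipped: the suite of security products that cloud platforms provide is quite effective for baseline protection.)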

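0x02 In-house development stage
After the procurement stage there is a certain security baseline, and the security team has grown accordingly. But operating on top of open-source or purchased products, the team is worn down by the coordination between so many platforms and urgently needs to start building parts of the security platform in-house.

A word on building the security team. In my view a security team consists of three parts: offense and defense, operations, and development. Development further divides into security tool development and security platform development. The difference is that tool development needs professional offensive and defensive skills, while platform development leans more on software engineering itself. Let professionals do what they are professional at: asking a security engineer to build a security operations platform and integrate it with the existing code build platforms is very hard, so that work needs professional developers.

No single security solution fits every company. The security team has to build tools or platforms tailored to the current business model, system architecture, release process, and so on, so that the solution fits the company's technology stack.

Take the application release process in SDL. Its most important part is the release gate, the gate depends on code security scanning, and each company tends to use a different development framework; some companies even modify their framework heavily. In that situation generic code scanning is unreliable, and the scanner has to be adapted or written in-house. The vulnerabilities it finds then have to be managed through a vulnerability operations platform. How to fix them is also a thorny problem for developers, and fast fixes need a complete code-level solution; many companies provide a security library for developers to use. This example is only one point among many. This stage tends to be long: platforms and tools go through iteration after iteration until they finally coexist peacefully with the business.

One important reason professional developers are needed is engineering capability. To quote 《赵彦的CISO闪电战：两年甲方安全修炼之路》: "Engineering capability shows in being able to land in-house security products and systems in massive distributed environments and in the IT systems of very large organizations, without degrading performance, without affecting availability, and without causing failures or operational incidents."

Major outages caused by security products have happened many times. A security product is sometimes a double-edged sword: a WAF can block malicious attacks, but it can also shut out legitimate users. If security itself brings the business down, what is the point of security? In many cases stability ranks above security, which underlines the importance of engineering and systematic practice.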

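0x03 Closing the product loop
No two companies take exactly the same approach to security, and after the in-house development stage each will have its own security solution. Once that long stage is over, some may feel that "defense in depth" is complete (my understanding of defense in depth: controls at every layer, including system, middleware, network, application, and business, with successive lines of defense from the outside in forming the overall defensive system). From a product point of view, perhaps it is. But from a security operations point of view, each product is still an isolated point, and coordination between products still relies mostly on manual operations.

Take vulnerability operations as an example. A vulnerability's lifecycle is usually: discovery (SRC, internal findings, tool findings, threat intelligence), confirmation (false positive or not, severity rating), assignment (to the developer responsible for the fix), fix review (whether it is fixed and how robust the fix is), and closure. This requires the SRC, tools, TIP, SOC, the development platform, SIEM, and other platforms to work together to close the loop on the vulnerability lifecycle. There are many similar loops, but closing the loop across tool-type products is often not easy; in that case, look for a breakthrough point and close a small loop first.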
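
One way to make such a lifecycle explicit is a small state machine that every integrated platform has to respect. The sketch below is purely illustrative: the state names follow the stages listed above, but the `VulnState` enum, the `advance` helper, and the two extra edges (closing a false positive directly, sending a rejected fix back to its assignee) are assumptions, not something the post or any particular product defines.

```python
from enum import Enum


class VulnState(Enum):
    DISCOVERED = "discovered"
    CONFIRMED = "confirmed"
    ASSIGNED = "assigned"
    FIX_REVIEW = "fix_review"
    CLOSED = "closed"


# Allowed transitions. Two edges are assumptions added for illustration:
# a false positive is closed directly, and a failed fix review sends the
# vulnerability back to its assignee.
TRANSITIONS = {
    VulnState.DISCOVERED: {VulnState.CONFIRMED, VulnState.CLOSED},
    VulnState.CONFIRMED: {VulnState.ASSIGNED},
    VulnState.ASSIGNED: {VulnState.FIX_REVIEW},
    VulnState.FIX_REVIEW: {VulnState.CLOSED, VulnState.ASSIGNED},
    VulnState.CLOSED: set(),
}


def advance(current, target):
    """Return the new state, or raise if the lifecycle does not allow the move."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


state = VulnState.DISCOVERED
for step in (VulnState.CONFIRMED, VulnState.ASSIGNED,
             VulnState.FIX_REVIEW, VulnState.CLOSED):
    state = advance(state, step)
print(state.name)  # CLOSED
```

Keeping the allowed transitions in one table means that a tool which tries to close an unreviewed fix fails loudly instead of silently breaking the loop.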

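0x04 Pain points in operations
A shortage of security staff, a large volume of alerts, no guarantee on handling speed, little handling experience effectively captured, an increasingly dangerous and complex threat landscape, and so on.

After the product-loop stage, the size of the technical security team tends to stabilize or even shrink, and application security operations staff become fewer and fewer; I see this as normal evolution. But business expansion does not stop, and applications are released without pause. As the company grows, the exposed attack surface widens and alerts and vulnerabilities multiply. With limited staff, handling speed and the capture of experience are hard to guarantee, to say nothing of the current security climate.

The desire for change grows ever stronger.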

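0x05 Is SOAR the cure?
I have thought about this question for a long time: how can application security break out of the heavy-operations bind? After going from the early hype about AI back to reality, I finally see some hope in SOAR (Security Orchestration, Automation and Response).

A brief introduction: SOAR is a leading-edge technology category that Gartner defined in the security field in 2018. Unlike UEBA, EDR, and other technologies that focus on identifying and detecting threats, SOAR concentrates on handling threats after they are identified, and emphasizes that users can implement arbitrary threat-handling logic in code through an event orchestration center.

SOAR is a collection of technologies. It helps enterprises and organizations collect the information monitored by the security operations team (including alerts from various security systems) and perform event analysis and alert triage on it. Then, guided by standard workflows, it combines people and machines to help security operators define, prioritize, and drive standardized incident response activities. SOAR tools let enterprises and organizations describe their event analysis and response processes formally.

Security vendors both abroad and in China have already entered the SOAR space. This post does not discuss specific implementations or products; it treats SOAR as a methodology, tries to understand its thinking, and tries to work it into in-house development and product loops.

Look at what SOAR consists of: orchestration, automation, and response. The name already gives the answer. In my view the core ideas are Orchestration and Automation: first close the loop on the incident handling process (or any other process) through orchestration, then automate the parts of it that involve a lot of repetitive work. Response follows the same logic; I will go into detail another time.

One thing needs to be clear: it is currently very rare for SOAR to achieve fully automated blocking and mitigation. There is no silver bullet in security, and most mitigation and blocking still needs security staff. But the thinking behind SOAR is well worth borrowing, especially after going through procurement, in-house development, and product loops: consider whether the SOAR methodology can reduce the workload.

Take the vulnerability scanning process: discovery, reporting, assignment and confirmation, and fix verification. Discovery, reporting, and fix verification can all be automated. The same methodology can be applied to handling all kinds of security alerts.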
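
The split between automated and manual steps in that scanning process can be sketched as a tiny playbook runner. This is an illustration under stated assumptions rather than how any SOAR product works: `Step`, `run_playbook`, the step names, and the placeholder actions (which only write made-up fields into a dict) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    automated: bool
    action: Optional[Callable[[dict], None]] = None  # None for manual steps


def run_playbook(steps, finding, start=0):
    """Run automated steps from `start` and pause at the first manual step.

    Returns the index of the manual step that needs a person, or None
    once the playbook has finished.
    """
    for i in range(start, len(steps)):
        step = steps[i]
        if not step.automated:
            return i
        step.action(finding)
        finding["log"].append(step.name)
    return None


# Placeholder actions: real ones would call a scanner, a ticket system, etc.
steps = [
    Step("discover", True, lambda f: f.update(source="scanner")),
    Step("report", True, lambda f: f.update(ticket="hypothetical-ticket-id")),
    Step("assign_and_confirm", False),  # needs an analyst
    Step("verify_fix", True, lambda f: f.update(rescan="clean")),
]

finding = {"log": []}
paused_at = run_playbook(steps, finding)
# ... an analyst confirms and assigns the finding here ...
finished = run_playbook(steps, finding, start=paused_at + 1)
print(paused_at, finished, finding["log"])
```

Only the middle step waits for a person; everything on either side of it runs unattended, which is the workload reduction the paragraph above is after.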

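Let professionals handle professional matters and let automation handle repetitive work; that may be one way out of the heavy-operations bind in application security. SOAR is still in its growth phase; I remain hopeful and will keep exploring.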

Modern Security Operations Center = SOAR + SIEM + UEBA + OTHER
2019-08-08
- Application security
- THINK

K Nearest Neighbor

0x01 KNN

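Classifies a sample by measuring the distance between its feature values and those of labeled samples

Pros:

High accuracy
Insensitive to outliers
No assumptions about the input data

Cons:

High computational complexity
High space complexity

Applicable data types:

Numeric
Nominal

0x02 Implementation

Algorithm

1. Compute the distance between the test point and every training point
2. Sort by increasing distance
3. Take the K points with the smallest distance
4. Count how often each class occurs among those K points
5. Return the most frequent class among the K points as the predicted class of the test point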

import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    '''

    :param inX: input vector to classify
    :param dataSet: training dataset
    :param labels: labels of the training samples
    :param k: number of nearest neighbors to use
    :return: the majority label among the k nearest neighbors
    '''
    #compute Euclidean distances
    dataSetSize = dataSet.shape[0] #length of the first dimension (number of training samples)
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet #tile repeats inX once per training sample
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    #pick the k points with the smallest distance
    sortedDistIndicies = distances.argsort()
    classCount={}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    #sort the classes by vote count, highest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
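
The five steps can be checked on a toy dataset. The following is a compact, self-contained variant rather than the post's `classify0`: `knn_classify` and the four sample points are made up for illustration, NumPy broadcasting replaces `tile`, and `collections.Counter` replaces the manual vote dictionary.

```python
import numpy as np
from collections import Counter


def knn_classify(x, data, labels, k):
    """Majority vote among the k training points closest to x (Euclidean)."""
    distances = np.sqrt(((data - x) ** 2).sum(axis=1))  # broadcasting, no tile()
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]
print(knn_classify(np.array([0.0, 0.0]), group, labels, 3))  # B
print(knn_classify(np.array([1.0, 1.0]), group, labels, 3))  # A
```

For the point (0, 0) the three nearest neighbors are two "B" samples and one "A" sample, so the vote goes to "B".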

0x03 Example 1

Collect data

Prepare data

def file2matrix(filename):
    with open(filename, "r") as fr:
        frreadlines = fr.readlines()
        numberOfLines = len(frreadlines)
        returnMat = zeros((numberOfLines, 3))
        classLabelVector = []
        index = 0
        for line in frreadlines:
            line = line.strip()
            listFromLine = line.split('\t')
            returnMat[index, :] = listFromLine[0:3]
            labels = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}
            classLabelVector.append(labels[listFromLine[-1]])
            index += 1
        return returnMat, classLabelVector

Analyze data

def DataMat(data, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(data[:, 0], data[:, 1], 15.0*array(labels), 15.0*array(labels))
    plt.show()

Process data

Normalize the values so that they fall between 0 and 1

newV = (oldV-min)/(max-min)

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet/tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
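
A worked example of the formula newV = (oldV-min)/(max-min) may help. This is a minimal sketch with made-up numbers; `min_max_normalize` is a hypothetical name, and it relies on NumPy broadcasting instead of `tile`. A column whose values are all equal would give a zero range and a division by zero, which the sketch does not handle.

```python
import numpy as np


def min_max_normalize(data):
    """Scale each column to [0, 1] using newV = (oldV - min) / (max - min)."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    return (data - min_vals) / ranges, ranges, min_vals


data = np.array([[40000.0, 8.0],
                 [20000.0, 2.0],
                 [30000.0, 5.0]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm)
# [[1.  1. ]
#  [0.  0. ]
#  [0.5 0.5]]
```

Without this step the first column, with values in the tens of thousands, would dominate the Euclidean distance and the second column would barely matter.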

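Test the algorithm

Tune hoRatio and k to find the best values

def testData(data_mat, data_label):
    hoRatio = 0.80 #fraction of samples held out for testing (hard-coded inside the function)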
    normDataSet, ranges, minVals = autoNorm(data_mat)
    m = normDataSet.shape[0]
    numTestVecs = int(m * hoRatio)
    trueCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normDataSet[i, :], normDataSet[numTestVecs:m, :], data_label[numTestVecs:m], 5)
        if (classifierResult == data_label[i]):
            trueCount += 1.0
    print("the total true rate is: %f" % (trueCount/float(numTestVecs)*100) + "%")
    print(trueCount)

Use the algorithm

def usemode(a, b, c):
    file_path = "datingTestSet.txt"
    data_mat, data_label = file2matrix(file_path)
    normDataSet, ranges, minVals = autoNorm(data_mat)
    inarr = array([a, b, c])
    classifierResult = classify0((inarr-minVals)/ranges, normDataSet, data_label, 5)
    return classifierResult

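result = usemode(40920, 8.326976, 0.953952)

(inarr-minVals)/ranges is the normalized input. It is passed to classify0, which finds the nearest neighbors in the historical data and returns their majority class as the result

0x04 Example 2

Handwritten digit recognition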

def img2vector(filename):
    returnVect = zeros((1,1024))
    with open(filename) as fr: #close the file once the 32x32 image has been read
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        #parse the true digit from the file name and store it as the label
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('digits/testDigits')
    trueCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        if (classifierResult == classNumStr): trueCount += 1.0
    print("\nthe total true rate is: %f" % (trueCount/float(mTest)))

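0x05 Security applications

Mathematically, anomaly detection is also a matter of classifying the unknown behavior under inspection: if the unknown behavior resembles known normal behavior it is normal, otherwise it is an intrusion [1]

Other security applications include malware detection

0x06 Other applications

Text recognition
Face recognition
Medical image processing

Reference: [1] 基于 kNN 算法的异常行为检测方法研究 (a study of anomaly detection methods based on the kNN algorithm)

2019-08-03

Security Data Analysis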

0x01 DT

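Classification algorithm

Pros

Low computational complexity
Output is easy to interpret
Insensitive to missing intermediate values
Can handle irrelevant features

Cons

May overfit the training data

Applicable data types:

Numeric
Nominal

0x02 Preparing data

Algorithm

1. Start at the root node and test the corresponding feature of the item to be classified

2. Follow the branch that matches its value, repeating until a leaf node is reached

3. Take the class stored in the leaf node as the decision

Splitting the dataset

Make unordered data more ordered

Information gain: the change in information after the dataset is split
Entropy: the expected value of information

Entropy formula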
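
The formula itself did not survive in this copy of the post. The standard Shannon entropy, which is what the calcShannonEnt function below computes, is reproduced here for reference; the symbols $D$, $n$, and $p(x_i)$ are notation chosen for this note, with $p(x_i)$ the fraction of samples in $D$ that carry class label $x_i$.

```latex
H(D) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
```

A dataset with a single class has $H(D) = 0$, and a dataset split evenly between two classes has $H(D) = 1$ bit.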

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet) #total number of instances in the dataset
    labelCounts = {}
    #count the occurrences of each class label in a dict
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    #compute the Shannon entropy
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

Splitting the dataset

Split the dataset on a given feature

def splitDataSet(dataSet, axis, value):
    '''
    
    :param dataSet: dataset to split
    :param axis: index of the feature to split on
    :param value: feature value to keep
    :return: list of matching rows with the feature column removed
    '''
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     
            reducedFeatVec.extend(featVec[axis+1:]) #drop the feature column
            retDataSet.append(reducedFeatVec)
    return retDataSet

Choosing the best way to split the dataset

The higher the entropy, the more mixed the data

def chooseBestFeatureToSplit(dataSet):
    '''
    :param dataSet: dataset
    :return: index of the best feature to split on
    '''
    numFeatures = len(dataSet[0]) - 1      #number of feature columns; the last column is the label
    baseEntropy = calcShannonEnt(dataSet)  #Shannon entropy of the whole dataset
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet] #all values of the i-th feature
        uniqueVals = set(featList)       #distinct values of that feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value) #split the dataset once for each distinct value of the feature
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)  #weighted entropy after splitting on this feature
        infoGain = baseEntropy - newEntropy     #entropy difference, i.e. the information gain
        if (infoGain > bestInfoGain): #keep the largest information gain
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature                      #return the best feature
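
To see the numbers behind this choice, entropy and information gain can be computed by hand on a five-row toy dataset. This is a compact sketch, not the post's code: `entropy` and `info_gain` are hypothetical helpers that mirror calcShannonEnt and the inner loop of chooseBestFeatureToSplit, and the dataset (two 0/1 features plus a yes/no label) is made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy of the class label stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after it."""
    after = 0.0
    for value in {row[axis] for row in rows}:
        subset = [row for row in rows if row[axis] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after


data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(round(entropy(data), 3))       # 0.971
print(round(info_gain(data, 0), 3))  # 0.42
print(round(info_gain(data, 1), 3))  # 0.171
```

Feature 0 has the larger gain, so it would be chosen for the first split.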

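Building the decision tree recursively

1. Get the dataset
2. Split it on the best feature
3. Recurse on each subset

When every feature has been used and the class labels are still not unique, a majority vote decides the class of the leaf node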

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

Build the tree with recursion

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet] #all class labels in the dataset
    if classList.count(classList[0]) == len(classList): 
        return classList[0] #all labels identical: return that label
    if len(dataSet[0]) == 1: #all features used but labels still mixed: take a majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat] #name of the best feature for this dataset
    myTree = {bestFeatLabel: {}}
    subLabels = labels[:bestFeat] + labels[bestFeat+1:] #drop the used feature without mutating the caller's list
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals: 
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels) #recurse on each subset
    return myTree

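Plotting the tree

The tree is drawn with Matplotlib annotations

The result is shown in the figure below

0x03 Testing and storing the classifier

Convert the label strings to indices

def classify(inputTree,featLabels,testVec):
    '''

    :param inputTree: tree dict
    :param featLabels: feature labels
    :param testVec: feature values of the sample to classify, e.g. [1, 0]
    :return: predicted class label
    '''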
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict): 
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel
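
How the lookup walks the nested dict is easiest to see on a small tree. The sketch below restates the same traversal in a self-contained form; the tree, the feature names 'no surfacing' and 'flippers', and the test vectors are illustrative assumptions, not data from the post. A feature value that never appeared during training would raise a KeyError here, just as in the function above.

```python
def classify(tree, feat_labels, test_vec):
    """Walk a nested-dict decision tree until a leaf (a non-dict value) is reached."""
    feature = next(iter(tree))                 # feature tested at this node
    value = test_vec[feat_labels.index(feature)]
    branch = tree[feature][value]
    if isinstance(branch, dict):
        return classify(branch, feat_labels, test_vec)
    return branch


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(tree, labels, [1, 0]))  # no
print(classify(tree, labels, [1, 1]))  # yes
```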

Storing the decision tree

Persist the object with pickle

pickle.dump(obj, file[, protocol])

def storeTree(inputTree, filename):
    import pickle
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)
    
def grabTree(filename):
    import pickle
    with open(filename, 'rb') as fr: #close the file after loading
        return pickle.load(fr)

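0x04 Predicting contact lens type with a decision tree

Collect data

lenses

Prepare data

Parse the data, which is separated by '\t'

Analyze data & train the model

labels = ['age', 'prescript', 'astigmatic', 'tearRate']
lenses_tree = createTree(lenses, labels)

Test the model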

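0x05 Other models

ID3 (classification tree)

At each step, picks the best feature by maximum information gain and splits the data on every value of that feature

C4.5 (classification tree)

An upgrade of ID3 that uses the information gain ratio: it introduces a term called split information to penalize features with many distinct values
It also addresses ID3's inability to handle continuous feature values

CART (classification and regression tree)

CART is a binary tree that uses binary splits: each split divides the data in two, sending one part to the left subtree and the other to the right. Every non-leaf node has exactly two children, so a CART tree has one more leaf than it has non-leaf nodes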
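
The effect of C4.5's split information can be shown with a few lines of code. This is a rough sketch of the gain-ratio idea only (no continuous features, no pruning); `entropy_of`, `gain_ratio`, and the four-row example are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy_of(values):
    """Shannon entropy of a list of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())


def gain_ratio(feature, labels):
    """C4.5 criterion: information gain divided by split information."""
    n = len(labels)
    after = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        after += len(subset) / n * entropy_of(subset)
    gain = entropy_of(labels) - after
    split_info = entropy_of(feature)  # entropy of the feature's own values
    return gain / split_info if split_info > 0 else 0.0


labels = ['yes', 'yes', 'no', 'no']
useful = ['a', 'a', 'b', 'b']   # two values that separate the classes
row_id = ['1', '2', '3', '4']   # unique per row, so it also "separates" them
print(gain_ratio(useful, labels))  # 1.0
print(gain_ratio(row_id, labels))  # 0.5
```

Both features have the same information gain (1 bit), so ID3 could not tell them apart; the gain ratio halves the score of the ID-like feature because its split information is 2 bits.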

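0x06 Security applications

Analyzing malicious network attacks and intrusions
Password brute-force detection
Bot traffic detection

2019-08-01

Security Data Analysis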

0x01 NB

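Naive: the formalization makes only the most basic, simplest assumptions

Pros

Still effective when little data is available
Can handle multi-class problems

Cons

Sensitive to how the input data is prepared

Applicable data types

Nominal

0x02 Bayesian decision theory

Compute the probability that a data point belongs to each class, compare them, and choose the class with the highest probability

Conditional probability

Derivation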
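
The formulas for this part are missing from this copy of the post. The standard derivation of Bayes' rule from the definition of conditional probability is given here for reference; $c$ denotes a class and $x$ a data point, notation chosen for this note.

```latex
p(x, c) = p(x \mid c)\, p(c) = p(c \mid x)\, p(x)
\quad\Longrightarrow\quad
p(c \mid x) = \frac{p(x \mid c)\, p(c)}{p(x)}
```

The decision rule then picks the class $c$ with the largest $p(c \mid x)$.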

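0x03 Building a document classifier

Two assumptions

Features are independent of one another (in the statistical sense)
Every feature is equally important

Converting words to vectors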

def loadDataSet():
    '''
    Test data
    :return: 
    '''
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  #1 = the document contains abusive words, 0 = it does not
    return postingList, classVec
                 
def createVocabList(dataSet):
    '''
    Build a list of the unique words in dataSet
    :param dataSet: 
    :return: 
    '''
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    '''
    :param vocabList: list of unique words
    :param inputSet: a document
    :return: document vector
    '''
    returnVec = [0]*len(vocabList) #vector of zeros with the same length as vocabList
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec #[0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

Training the algorithm

Computing probabilities from word vectors

for postdoc in postingList:
    trainmat.append(setOfWords2Vec(vocablist, postdoc))

setOfWords2Vec converts each document into a document vector

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix) #6, the number of rows (documents) in the matrix
    numWords = len(trainMatrix[0]) #32, the number of columns (vocabulary size)
    pAbusive = sum(trainCategory)/float(numTrainDocs) #3/6, the probability that a document is abusive
    p0Num = ones(numWords) #ones creates an array of the given shape filled with 1s
    p1Num = ones(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i] #abusive document: add its word vector to the per-word counts
            p1Denom += sum(trainMatrix[i]) #add its word count to the abusive total
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    #p1Num: [2. 2. 1. 1. 1. 1. 2. 1. 1. 2. 2. 1. 1. 1. 4. 2. 3. 2. 1. 1. 1. 1. 2. 2.
    #        2. 2. 1. 1. 1. 2. 1. 3.]
    #p1Denom: 19.0
    p1Vect = log(p1Num/p1Denom)          #per-word count divided by the class total gives the conditional probability (stored as a log)
    p0Vect = log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive

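Probability vector: the probability of each vocabulary word given the document class
p0Vect: probability vector for normal documents
p1Vect: probability vector for abusive documents
pAbusive: probability that a document is abusive

The zero-probability problem

When a Bayes classifier classifies a document, it multiplies many probabilities to get the probability that the document belongs to a class, i.e. it computes p(w0|1)p(w1|1)p(w2|1). If any one of these probabilities is 0, the whole product is 0. To reduce this effect, initialize every word count to 1 and the denominators to 2

p0Denom = 2.0
p1Denom = 2.0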
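
The effect of this initialization can be seen on a tiny example. The sketch below is an illustration, not the post's trainNB0: `word_probs` is a hypothetical helper that estimates p(word | class) for a single class from made-up 0/1 word vectors, with and without the smoothing described above.

```python
import numpy as np


def word_probs(doc_vectors, smooth=True):
    """Estimate p(word | class) from the word vectors of one class."""
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    num = doc_vectors.sum(axis=0)   # per-word counts
    denom = doc_vectors.sum()       # total word count for the class
    if smooth:
        num = num + 1.0             # every word starts with a count of 1
        denom = denom + 2.0         # the denominator starts at 2
    return num / denom


docs = [[1, 0, 1], [1, 0, 0]]       # the second word never occurs in this class
print(word_probs(docs, smooth=False))  # second entry is exactly 0
print(word_probs(docs, smooth=True))   # no entry is 0
```

Without smoothing, any document containing the second word would get probability 0 for this class no matter what else it contains; with smoothing, the estimate becomes 1/5 instead.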

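The underflow problem

Multiplying many very small numbers eventually rounds to 0, so the logarithm of each probability is stored instead

p1Vect = log(p1Num/p1Denom)
p0Vect = log(p0Num/p0Denom)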
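
The underflow is easy to reproduce with plain floats. This is a minimal demonstration with made-up numbers (100 conditional probabilities of 1e-5 each), not data from the post.

```python
import math

probs = [1e-5] * 100   # 100 small conditional probabilities

product = 1.0
for p in probs:
    product *= p

log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0: the true value, 1e-500, is smaller than any positive float
print(log_sum)   # about -1151.29, still perfectly usable for comparing classes
```

Because the logarithm is monotonic, comparing the sums of logs for two classes gives the same decision as comparing the products would, without ever leaving the range of a float.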

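Testing the algorithm

p(ci|w) is the probability that the document comes from class ci, given the word vector w

By Bayes' rule, p(ci|w) = p(w|ci)p(ci)/p(w). The denominator p(w) is the same for every class, so it can be ignored when the classes are compared

Because log(p(w|c)p(c)) = log(p(w|c)) + log(p(c)), the classifyNB method sums the logs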

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    '''
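    Element-wise multiplication of the word vector with each log-probability vector
    :param vec2Classify: vector to classify
    :param p0Vec: log-probability vector for normal documents
    :param p1Vec: log-probability vector for abusive documents
    :param pClass1: probability that a document is abusive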
    :return:1 or 0
    '''
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

Convenience function

def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

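Bag-of-words model

In a bag of words each word can appear multiple times, while in a set of words each word is recorded at most once

Each time a word is encountered, the matching entry in the word vector is incremented by 1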

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
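
To make the difference concrete, here is a self-contained sketch that runs both models on the same document. The function names `set_of_words` and `bag_of_words` and the tiny vocabulary are illustrative stand-ins, not the original helpers.

```python
def set_of_words(vocab, words):
    """Set-of-words: 1 if the word occurs at all, else 0."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] = 1
    return vec

def bag_of_words(vocab, words):
    """Bag-of-words: number of times each vocabulary word occurs."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] += 1
    return vec

vocab = ['free', 'money', 'hello']          # hypothetical vocabulary
doc = ['free', 'free', 'money', 'now']      # 'now' is not in the vocabulary and is skipped

print(set_of_words(vocab, doc))
print(bag_of_words(vocab, doc))
```

The repeated word only shows up in the bag-of-words vector, which is why that model is used when word frequency carries signal.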

0x04 Action 1

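Spam email classification

def textParse(bigString):
    '''
    Simple tokenization
    :param bigString: raw text
    :return: lowercase tokens longer than 2 characters
    '''
    import re
    listOfTokens = re.split(r'\W+', bigString)  # \W+ instead of \W*: on Python 3.7+ an empty match splits between every character
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]  # keep tokens longer than 2 characters, lowercased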
    
def spamTest():
    '''
    Load the data
    Parse
    Split
    Train
    Test
    :return: 
    '''
    docList=[]
    classList = []
    fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i, 'rb').read().decode('GBK', 'ignore'))
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i, 'rb').read().decode('GBK', 'ignore'))
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList) # build the vocabulary of unique words
    trainingSet = list(range(50)) # [0, 1, 2, 3, 4, 5, 6, 7, 8...44, 45, 46, 47, 48, 49]
    testSet=[]
    for i in range(10): # randomly pick 10 documents as the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet: # training set
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex])) # bag-of-words model: build the word vector
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet: # test set
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount)/len(testSet))

2019-07-28

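Security Data Analysis

0x01 LR

Fit a regression formula to the classification boundary from the existing data, and use it to classify

The goal of LR is to find the best-fit parameters for the nonlinear Sigmoid function; the solving is done by an optimization algorithm
Among optimization algorithms, gradient ascent is the most commonly used, and it can be simplified to stochastic gradient ascent

Pros

Low computational cost
Easy to understand and implement

Cons

Prone to underfitting
Classification accuracy may be low

Suitable data types

Numeric
Nominal

Step-like function: the Sigmoid function

LR classifier: multiply each feature by a regression coefficient, add up all the results, and feed the sum into the Sigmoid function to get a value between 0 and 1. Anything above 0.5 is assigned to class 1, and anything below 0.5 to class 0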
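
The classifier described above can be sketched in a few lines of plain Python. The function `lr_classify` and the sample coefficients are assumptions made for illustration; they are not taken from the original code.

```python
import math

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_classify(features, coefficients):
    """Weighted sum of the features, squashed by the sigmoid, thresholded at 0.5."""
    z = sum(x * w for x, w in zip(features, coefficients))
    return 1 if sigmoid(z) > 0.5 else 0

print(lr_classify([1.0, 2.0], [0.5, 0.5]))    # weighted sum is 1.5, sigmoid above 0.5
print(lr_classify([1.0, 2.0], [-1.0, 0.2]))   # weighted sum is -0.6, sigmoid below 0.5
```

Since the sigmoid equals 0.5 exactly when its input is 0, the decision boundary is the set of points where the weighted sum is zero.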

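0x02 Training the algorithm

Determining the regression coefficients

Gradient ascent

To find the maximum of a function, the best way is to move along the direction of its gradient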
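
As a toy illustration of that idea, the sketch below climbs a one-dimensional function whose maximum is known in advance. The function, step size and iteration count are arbitrary choices for the example, not values from the original post.

```python
def gradient_ascent(grad, x0, alpha=0.1, steps=200):
    """Repeatedly step in the gradient direction: x := x + alpha * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x + alpha * grad(x)
    return x

# f(x) = -(x - 3)^2 + 5 has its maximum at x = 3; its gradient is -2 * (x - 3)
x_max = gradient_ascent(lambda x: -2.0 * (x - 3.0), x0=0.0)
print(x_max)
```

The same update rule, applied to the log-likelihood of the LR model instead of this toy function, is what the training code uses.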

Training data

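def sigmoid(inX):
    '''
    Sigmoid function
    :param inX: scalar, array or matrix
    :return: element-wise sigmoid of inX
    '''
    return 1.0/(1+exp(-inX))

def gradAscent(dataMatIn, classLabels):
    '''
    Gradient ascent optimization
    :param dataMatIn: features, one row per sample
    :param classLabels: class labels
    :return: regression coefficients
    '''
    dataMatrix = mat(dataMatIn) # convert to a NumPy matrix
    labelMat = mat(classLabels).transpose()
    m, n = shape(dataMatrix)
    alpha = 0.001 # step size
    maxCycles = 500 # number of iterations
    weights = ones((n, 1))
    for k in range(maxCycles):
        # difference between the true class and the predicted class
        h = sigmoid(dataMatrix*weights)     # matrix multiplication
        error = (labelMat - h)              # vector subtraction
        weights = weights + alpha * dataMatrix.transpose() * error # matrix multiplication
    return weights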

Analyzing the data

def plotBestFit(weights):
    '''
    Plot the decision boundary
    :param weights: regression coefficients
    :return:
    '''
    import matplotlib.pyplot as plt
    dataMat, labelMat=loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0] 
    xcord1 = []
    ycord1 = []
    xcord2 = []
    ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2] # boundary where the sigmoid input is 0: 0 = w0 + w1*x1 + w2*x2
    ax.plot(x, y)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

plotBestFit(weights.getA()) # getA() is the inverse of mat(): it converts a NumPy matrix into an array

Stochastic gradient ascent

An online learning algorithm
Updates the regression coefficients using only one sample point at a time

def stocGradAscent0(dataMatrix, classLabels):
    '''
    Stochastic gradient ascent
    :param dataMatrix: feature array, one row per sample
    :param classLabels: class labels
    :return: regression coefficients
    '''
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)   # initialize
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i]*weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

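Improved stochastic gradient ascent

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    '''
    Improved stochastic gradient ascent
    :param dataMatrix: feature array, one row per sample
    :param classLabels: class labels
    :param numIter: number of passes over the data (default 150)
    :return: regression coefficients
    '''
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4/(1.0+j+i)+0.0001    # step size shrinks with every update but never reaches 0
            randIndex = int(random.uniform(0, len(dataIndex))) # random position among the samples not yet used in this pass
            sampleIndex = dataIndex[randIndex] # map the position back to a row, so each sample is used once per pass
            h = sigmoid(sum(dataMatrix[sampleIndex]*weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del(dataIndex[randIndex])
    return weights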

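0x03 Example 1

Preparing the data

How should missing values in the data be handled?

1. Fill the missing value with the mean of the available values for that feature
2. Fill the missing value with a special value
3. Ignore samples that have missing values
4. Fill the missing value with the mean of similar samples
5. Predict the missing value with another machine learning algorithm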
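
The first option is the easiest to sketch. The example below assumes missing entries are represented as `None`; the helper `fill_with_column_mean` and the sample rows are hypothetical and only illustrate the idea.

```python
def fill_with_column_mean(rows):
    """Replace None entries with the mean of the known values in the same column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [r[c] for r in rows if r[c] is not None]
        # fall back to 0.0 if a column has no known values at all
        means.append(sum(present) / len(present) if present else 0.0)
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)] for r in rows]

rows = [[1.0, None],
        [3.0, 4.0],
        [None, 8.0]]
filled = fill_with_column_mean(rows)
print(filled)
```

For LR specifically, filling with 0 (option 2) is also a common choice, because a zero feature leaves its coefficient untouched during the weight update.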

Testing the algorithm

def classifyVector(inX, weights):
    '''
    Classify one sample
    :param inX: feature vector
    :param weights: regression coefficients
    :return: 0 or 1
    '''
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt')
    frTest = open('horseColicTest.txt')
    trainingSet = []
    trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000) # compute the regression coefficient vector, 1000 iterations
    errorCount = 0
    numTestVec = 0.0
    for line in frTest.readlines(): # load the test set and compute the classification error rate
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = (float(errorCount)/numTestVec)
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

def multiTest():
    numTests = 10
    errorSum=0.0
    for k in range(numTests): # run 10 times and take the average
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)))

2019-07-27

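Security Data Analysis

0x01 SVM

Algorithm implemented: Sequential Minimal Optimization (SMO)

Support vectors: the points closest to the separating hyperplane

Pros:

Low generalization error
Low computational cost
Results are easy to interpret

Cons:

Sensitive to parameter tuning and the choice of kernel function
The basic classifier must be modified to handle multi-class problems

Suitable data types:

Numeric
Nominal

0x02 SMO

Breaks one large optimization problem into many small ones that are solved in turn

The goal is to find a set of alphas and b; once the alphas are known, the weight vector w is easy to compute, which gives the separating hyperplane

How it works: each loop selects two alphas to optimize. Once a suitable pair is found, one of them is increased while the other is decreased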
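
A minimal sketch of one such pair update is shown below. It only covers the clipping and the coupled change of the two alphas; choosing the pair and computing the unclipped value are left out. The helper names `pair_bounds` and `update_pair` and the numbers are assumptions for illustration, while the bound formulas follow the ones used in the code in this section.

```python
def pair_bounds(alpha_i, alpha_j, y_i, y_j, C):
    """Box limits L and H for alpha_j so that both alphas stay inside [0, C]."""
    if y_i != y_j:
        return max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
    return max(0.0, alpha_j + alpha_i - C), min(C, alpha_j + alpha_i)

def update_pair(alpha_i, alpha_j, y_i, y_j, proposed_alpha_j, C):
    """Clip the proposed alpha_j, then move alpha_i so sum(alpha * y) is unchanged."""
    low, high = pair_bounds(alpha_i, alpha_j, y_i, y_j, C)
    new_j = min(max(proposed_alpha_j, low), high)
    new_i = alpha_i + y_i * y_j * (alpha_j - new_j)
    return new_i, new_j

# Same-label pair: alpha_j wants to grow to 0.9 but is clipped at H = 0.7
new_i, new_j = update_pair(0.2, 0.5, 1, 1, 0.9, C=1.0)
print(new_i, new_j)
```

In this same-label case alpha_j grows by exactly the amount alpha_i shrinks, so the constraint that the label-weighted sum of the alphas stays constant is preserved.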

Simplified implementation

def selectJrand(i, m):
    '''
    Randomly pick an integer within a range, different from i
    :param i: index of the first alpha
    :param m: total number of alphas
    :return: index of the second alpha
    '''
    j=i
    while (j==i):
        j = int(random.uniform(0, m))
    return j

def clipAlpha(aj, H, L):
    '''
    Clip an alpha value that is greater than H or smaller than L
    :param aj: alpha value
    :param H: upper bound
    :param L: lower bound
    :return: clipped alpha value
    '''
    if aj > H: 
        aj = H
    if L > aj:
        aj = L
    return aj

def smoSimple(dataMatIn, classLabels, C, toler, maxIter):
    '''
    Simplified SMO
    :param dataMatIn: data set
    :param classLabels: class labels
    :param C: constant C
    :param toler: tolerance
    :param maxIter: maximum number of iterations before exiting
    :return: b, alphas
    '''
    dataMatrix = mat(dataMatIn)
    labelMat = mat(classLabels).transpose()
    b = 0
    m, n = shape(dataMatrix)
    alphas = mat(zeros((m, 1)))
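    iter = 0 # counts consecutive passes in which no alpha changed
    while (iter < maxIter):
        alphaPairsChanged = 0 # number of alpha pairs optimized in this pass
        for i in range(m):
            fXi = float(multiply(alphas, labelMat).T*(dataMatrix*dataMatrix[i, :].T)) + b # predicted class
            Ei = fXi - float(labelMat[i]) # error between the prediction and the actual label
            # optimize if the error is too large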
            if ((labelMat[i]*Ei < -toler) and (alphas[i] < C)) or ((labelMat[i]*Ei > toler) and (alphas[i] > 0)):
                j = selectJrand(i, m) # randomly pick the second alpha
                fXj = float(multiply(alphas, labelMat).T*(dataMatrix*dataMatrix[j, :].T)) + b # predicted class
                Ej = fXj - float(labelMat[j]) # error
                alphaIold = alphas[i].copy()
                alphaJold = alphas[j].copy()
                # compute L and H
                if (labelMat[i] != labelMat[j]):
                    L = max(0, alphas[j] - alphas[i])
                    H = min(C, C + alphas[j] - alphas[i])
                else:
                    L = max(0, alphas[j] + alphas[i] - C)
                    H = min(C, alphas[j] + alphas[i])
                if L==H:
                    print("L==H")
                    continue
                eta = 2.0 * dataMatrix[i, :]*dataMatrix[j, :].T - dataMatrix[i, :]*dataMatrix[i, :].T - dataMatrix[j, :]*dataMatrix[j, :].T # optimal amount to change alphas[j]
                if eta >= 0:
                    print("eta>=0")
                    continue
                alphas[j] -= labelMat[j]*(Ei - Ej)/eta
                alphas[j] = clipAlpha(alphas[j], H, L)
                if (abs(alphas[j] - alphaJold) < 0.00001):
                    print("j not moving enough")
                    continue
                alphas[i] += labelMat[j]*labelMat[i]*(alphaJold - alphas[j]) # update i by the same amount as j, in the opposite direction
                # set the constant term b
                b1 = b - Ei - labelMat[i]*(alphas[i]-alphaIold)*dataMatrix[i, :]*dataMatrix[i, :].T - labelMat[j]*(alphas[j]-alphaJold)*dataMatrix[i, :]*dataMatrix[j, :].T
                b2 = b - Ej - labelMat[i]*(alphas[i]-alphaIold)*dataMatrix[i, :]*dataMatrix[j, :].T - labelMat[j]*(alphas[j]-alphaJold)*dataMatrix[j, :]*dataMatrix[j, :].T
                if (0 < alphas[i]) and (C > alphas[i]):
                    b = b1
                elif (0 < alphas[j]) and (C > alphas[j]):
                    b = b2
                else:
                    b = (b1 + b2)/2.0
                alphaPairsChanged += 1
                print("iter: %d i:%d, pairs changed %d" % (iter, i, alphaPairsChanged))
        if (alphaPairsChanged == 0):
            iter += 1
        else:
            iter = 0
        print("iteration number: %d" % iter)
    return b, alphas

Finding which points are the support vectors

for i in range(100):
    if alphas[i] > 0.0:
        print(dateArr[i], labelArr[i])

Full implementation

class optStruct:
    def __init__(self, dataMatIn, classLabels, C, toler, kTup):  # Initialize the structure with the parameters
        self.X = dataMatIn
        self.labelMat = classLabels
        self.C = C
        self.tol = toler
        self.m = shape(dataMatIn)[0]
        self.alphas = mat(zeros((self.m, 1)))
        self.b = 0
        self.eCache = mat(zeros((self.m, 2))) #first column is valid flag
        self.K = mat(zeros((self.m, self.m)))
        for i in range(self.m):
            self.K[:, i] = kernelTrans(self.X, self.X[i, :], kTup)
        
def calcEk(oS, k):
    '''
    Compute and return the error E for sample k
    :param oS: optStruct instance
    :param k: sample index
    :return: Ek
    '''
    fXk = float(multiply(oS.alphas, oS.labelMat).T*oS.K[:, k] + oS.b)
    Ek = fXk - float(oS.labelMat[k])
    return Ek
        
def selectJ(i, oS, Ei):
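    '''
    Select the second alpha, preferring the one with the largest |Ei - Ek|
    :param i: index of the first alpha
    :param oS: optStruct instance
    :param Ei: error of the first alpha
    :return: index of the second alpha and its error
    '''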
    maxK = -1
    maxDeltaE = 0
    Ej = 0
    oS.eCache[i] = [1, Ei]
    validEcacheList = nonzero(oS.eCache[:, 0].A)[0]
    if (len(validEcacheList)) > 1:
        for k in validEcacheList:
            if k == i:
                continue
            Ek = calcEk(oS, k)
            deltaE = abs(Ei - Ek)
            if (deltaE > maxDeltaE):
                maxK = k
                maxDeltaE = deltaE
                Ej = Ek
        return maxK, Ej
    else:
        j = selectJrand(i, oS.m)
        Ej = calcEk(oS, j)
    return j, Ej

def updateEk(oS, k):
    '''
    Compute the error and store it in the cache
    :param oS: optStruct instance
    :param k: sample index
    :return:
    '''
    Ek = calcEk(oS, k)
    oS.eCache[k] = [1, Ek]
        
def innerL(i, oS):
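    '''
    Inner optimization step
    :param i: index of the first alpha
    :param oS: optStruct instance
    :return: 1 if a pair of alphas changed, otherwise 0
    '''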
    Ei = calcEk(oS, i)
    if ((oS.labelMat[i]*Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or ((oS.labelMat[i]*Ei > oS.tol) and (oS.alphas[i] > 0)):
        j, Ej = selectJ(i, oS, Ei)
        alphaIold = oS.alphas[i].copy(); alphaJold = oS.alphas[j].copy()
        if (oS.labelMat[i] != oS.labelMat[j]):
            L = max(0, oS.alphas[j] - oS.alphas[i])
            H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
        else:
            L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
            H = min(oS.C, oS.alphas[j] + oS.alphas[i])
        if L==H:
            print("L==H")
            return 0
        eta = 2.0 * oS.K[i, j] - oS.K[i, i] - oS.K[j, j]
        if eta >= 0:
            print("eta>=0")
            return 0
        oS.alphas[j] -= oS.labelMat[j]*(Ei - Ej)/eta
        oS.alphas[j] = clipAlpha(oS.alphas[j], H, L)
        updateEk(oS, j)
        if (abs(oS.alphas[j] - alphaJold) < 0.00001):
            print("j not moving enough")
            return 0
        oS.alphas[i] += oS.labelMat[j]*oS.labelMat[i]*(alphaJold - oS.alphas[j])
        updateEk(oS, i)
        b1 = oS.b - Ei - oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i, i] - oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[i, j]
        b2 = oS.b - Ej - oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i, j] - oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[j, j]
        if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]):
            oS.b = b1
        elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]):
            oS.b = b2
        else:
            oS.b = (b1 + b2)/2.0
        return 1
    else:
        return 0

def smoP(dataMatIn, classLabels, C, toler, maxIter,kTup=('lin', 0)):
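    '''
    Full SMO outer loop
    :param dataMatIn: data set
    :param classLabels: class labels
    :param C: constant C
    :param toler: tolerance
    :param maxIter: maximum number of iterations
    :param kTup: tuple describing the kernel
    :return: b, alphas
    '''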
    oS = optStruct(mat(dataMatIn), mat(classLabels).transpose(), C, toler, kTup)
    iter = 0
    entireSet = True
    alphaPairsChanged = 0
    while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):
        alphaPairsChanged = 0
        if entireSet:
            for i in range(oS.m):        
                alphaPairsChanged += innerL(i,oS)
                print("fullSet, iter: %d i:%d, pairs changed %d" % (iter, i, alphaPairsChanged))
            iter += 1
        else:
            nonBoundIs = nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]
            for i in nonBoundIs:
                alphaPairsChanged += innerL(i, oS)
                print("non-bound, iter: %d i:%d, pairs changed %d" % (iter, i, alphaPairsChanged))
            iter += 1
        if entireSet:
            entireSet = False
        elif (alphaPairsChanged == 0):
            entireSet = True
        print("iteration number: %d" % iter)
    return oS.b, oS.alphas

def calcWs(alphas, dataArr, classLabels):
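    '''
    Compute the weight vector w from the trained alphas, for use in classification
    :param alphas: trained alphas
    :param dataArr: data set
    :param classLabels: class labels
    :return: weight vector w
    '''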
    X = mat(dataArr)
    labelMat = mat(classLabels).transpose()
    m, n = shape(X)
    w = zeros((n, 1))
    for i in range(m):
        w += multiply(alphas[i]*labelMat[i], X[i, :].T)
    return w

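for i in range(100):
    if alphas[i] > 0.0:
        print(dateArr[i], labelArr[i])
ws = calcWs(alphas, dateArr, labelArr)
print(ws)
datmat = mat(dateArr)
print(datmat[0]*mat(ws) + b)

The last line is the test result: a value below 0 belongs to class -1, a value above 0 belongs to class 1, and a value of exactly 0 is assigned to class -1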
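
That decision rule can be written as a small function. This is a plain-Python sketch with made-up weights; `predict_label` is a hypothetical helper, not part of the original code.

```python
def predict_label(w, x, b):
    """Return +1 or -1 from the sign of w.x + b; a score of exactly 0 maps to -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

w = [0.8, -0.3]   # hypothetical weight vector
b = -0.5          # hypothetical bias

print(predict_label(w, [2.0, 1.0], b))   # score = 1.6 - 0.3 - 0.5 = 0.8
print(predict_label(w, [0.0, 1.0], b))   # score = -0.3 - 0.5 = -0.8
```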

0x03 kernel

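Mapping the data into a higher-dimensional space

The data is transformed from one feature space into another
The mapping takes a low-dimensional feature space to a higher-dimensional one

Radial basis function kernel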

def kernelTrans(X, A, kTup):
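    '''
    Kernel function
    :param X: data matrix, one row per sample
    :param A: a single row vector
    :param kTup: tuple holding the kernel information
    :return: column vector of kernel values between each row of X and A
    '''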
    m, n = shape(X)
    K = mat(zeros((m, 1)))
    if kTup[0]=='lin':
        K = X * A.T
    elif kTup[0]=='rbf':
        for j in range(m):
            deltaRow = X[j, :] - A
            K[j] = deltaRow*deltaRow.T
        K = exp(K/(-1*kTup[1]**2))
    else:
        raise NameError('Houston We Have a Problem -- That Kernel is not recognized')
    return K
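
For a single pair of points, the 'rbf' branch above amounts to the following plain-Python sketch. It mirrors the scaling used in `kernelTrans`, which divides the squared distance by sigma squared (many references use 2 * sigma squared instead); the function name `rbf_kernel` is an assumption for the example.

```python
import math

def rbf_kernel(x, y, sigma):
    """exp(-||x - y||^2 / sigma^2), matching the scaling used in kernelTrans."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0], sigma=1.3))   # identical points give 1.0
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0))   # squared distance 2
print(rbf_kernel([0.0, 0.0], [3.0, 3.0], sigma=1.0))   # farther apart, closer to 0
```

The value decays from 1 toward 0 as the points move apart, and sigma controls how quickly it decays.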

Testing

def testRbf(k1=1.3):
    dataArr,labelArr = loadDataSet('testSetRBF.txt')
    b,alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, ('rbf', k1))
    datMat=mat(dataArr)
    labelMat = mat(labelArr).transpose()
    svInd=nonzero(alphas.A>0)[0]
    sVs=datMat[svInd]
    labelSV = labelMat[svInd]
    print("there are %d Support Vectors" % shape(sVs)[0])
    m, n = shape(datMat)
    errorCount = 0
    for i in range(m):
        kernelEval = kernelTrans(sVs, datMat[i, :], ('rbf', k1))
        predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
        if sign(predict) != sign(labelArr[i]):
            errorCount += 1
    print("the training error rate is: %f" % (float(errorCount)/m))
    dataArr, labelArr = loadDataSet('testSetRBF2.txt')
    errorCount = 0
    datMat=mat(dataArr)
    labelMat = mat(labelArr).transpose()
    m,n = shape(datMat)
    for i in range(m):
        kernelEval = kernelTrans(sVs, datMat[i, :], ('rbf', k1))
        predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
        if sign(predict) != sign(labelArr[i]):
            errorCount += 1
    print("the test error rate is: %f" % (float(errorCount)/m))

0x04 Example 1

Handwritten digit recognition with SVM

def testDigits(kTup=('rbf', 10)):
    dataArr,labelArr = loadImages('../Ch02/digits/trainingDigits')
    b,alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, kTup)
    datMat=mat(dataArr)
    labelMat = mat(labelArr).transpose()
    svInd=nonzero(alphas.A > 0)[0]
    sVs=datMat[svInd] 
    labelSV = labelMat[svInd]
    print("there are %d Support Vectors" % shape(sVs)[0])
    m, n = shape(datMat)
    errorCount = 0
    for i in range(m):
        kernelEval = kernelTrans(sVs, datMat[i, :], kTup)
        predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
        if sign(predict) != sign(labelArr[i]):
            errorCount += 1
    print("the training error rate is: %f" % (float(errorCount)/m))
    dataArr, labelArr = loadImages('testDigits')
    errorCount = 0
    datMat=mat(dataArr)
    labelMat = mat(labelArr).transpose()
    m, n = shape(datMat)
    for i in range(m):
        kernelEval = kernelTrans(sVs, datMat[i, :], kTup)
        predict=kernelEval.T * multiply(labelSV, alphas[svInd]) + b
        if sign(predict)!=sign(labelArr[i]):
            errorCount += 1
    print("the test error rate is: %f" % (float(errorCount)/m))

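0x05 Example 2

XSS Detection

0x01 Data

I came across the project https://github.com/SparkSharly/DL_for_xss on GitHub and it looked good, so I worked through it. The dataset ships with the project, so I used it directly

eg. normal_examples.csv (200,000+ rows, partial sample)

eg. xssed.csv (40,000+ rows, partial sample)

0x02 Tokenization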

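def GeneSeg(payload):
    # lowercase, then URL-decode twice
    payload=payload.lower()
    payload=unquote(unquote(payload))
    # generalize every number to "0"
    payload,num=re.subn(r'\d+',"0",payload)
    # replace URLs with "http://u"
    payload,num=re.subn(r'(http|https)://[a-zA-Z0-9\.@&/#!#\?]+', "http://u", payload)
    # tokenize; (?x) must sit at the very start of the pattern on Python 3.11+
    r = r'''(?x)
        [\w\.]+?\(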
        |\)
        |"\w+?"
        |'\w+?'
        |http://\w
        |</\w+>
        |<\w+>
        |<\w+
        |\w+=
        |>
        |[\w\.]+
    '''
    return nltk.regexp_tokenize(payload, r)
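
The normalization steps that run before tokenization can be tried on their own with only the standard library. The sketch below repeats those steps (lowercase, double URL-decode, number and URL generalization); `normalize_payload` and the sample payload are made up for illustration.

```python
import re
from urllib.parse import unquote

def normalize_payload(payload):
    """Lowercase, URL-decode twice, then generalize numbers and URLs."""
    payload = unquote(unquote(payload.lower()))
    payload = re.sub(r'\d+', '0', payload)
    payload = re.sub(r'(http|https)://[a-zA-Z0-9\.@&/#!#\?]+', 'http://u', payload)
    return payload

# hypothetical double-encoded payload
sample = 'id=123&next=%253Cscript%253Ealert(42)%253C/script%253E&u=https://evil.example/x'
print(normalize_payload(sample))
```

Decoding twice matters because attackers often double-encode payloads, and generalizing numbers and URLs keeps the vocabulary small.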

0x03 Features

Build an XSS semantic model and construct the vocabulary

The vocabulary is built from the 300 most frequent words
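
The idea of keeping only the most frequent tokens and mapping everything else to "UNK" can be sketched with `collections.Counter`. The helper names and the tiny token list are illustrative; the size is 3 here rather than 300 to keep the example readable.

```python
from collections import Counter

def build_vocabulary(tokens, vocabulary_size):
    """'UNK' plus the (vocabulary_size - 1) most frequent tokens."""
    most_common = Counter(tokens).most_common(vocabulary_size - 1)
    return ['UNK'] + [word for word, _ in most_common]

def replace_rare(doc, vocabulary):
    """Map every token outside the vocabulary to 'UNK'."""
    known = set(vocabulary)
    return [w if w in known else 'UNK' for w in doc]

tokens = ['<script>', 'alert(', ')', '</script>', '<script>', 'alert(', '<img', '<script>']
vocab = build_vocabulary(tokens, 3)
print(vocab)
print(replace_rare(['<img', 'alert(', ')'], vocab))
```

Using a set for the membership test keeps the replacement step fast even when the vocabulary is larger.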

words=[]
datas=[]
with open("data/xssed.csv","r",encoding="utf-8") as f:
    reader=csv.DictReader(f,fieldnames=["payload"])
    for row in reader:
        payload=row["payload"]
        word=GeneSeg(payload)
        datas.append(word)
        words+=word

# build the dataset
def build_dataset(datas,words):
    count=[["UNK",-1]]
    counter=Counter(words)
    count.extend(counter.most_common(vocabulary_size-1))
    #print(count)
    vocabulary=[c[0] for c in count]
    #print(vocabulary)
    data_set=[]
    for data in datas:
        d_set=[]
        for word in data:
            if word in vocabulary:
                d_set.append(word)
            else:
                d_set.append("UNK")
                count[0][1]+=1
        data_set.append(d_set)
    print(data_set)

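word2vec modeling

# argument names for gensim < 4.0; gensim 4.x renamed size to vector_size and iter to epochs
model=Word2Vec(data_set,size=embedding_size,window=skip_window,negative=num_sampled,iter=num_iter)

The embedding dimension is set to 32

Inspect the model: the words semantically closest to </script>

Data processing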

def pre_process():
    with open(vec_dir,"rb") as f :
        word2vec=pickle.load(f)
        # vocabulary ('UNK': 0, '0': 1)
        dictionary=word2vec["dictionary"]
        # embedding vectors
        embeddings=word2vec["embeddings"]
        # reverse vocabulary (index to word, 0: 'UNK', 1: '0')
        reverse_dictionary = word2vec["reverse_dictionary"]
    xssed_data=[]
    normal_data=[]
    with open("data/xssed.csv","r",encoding="utf-8") as f:
        reader = csv.DictReader(f, fieldnames=["payload"])
        for row in reader:
            payload=row["payload"]
            # tokenize, e.g. ['search=', '</script>', '<img', 'src=', 'worksinchrome', 'colon', 'prompt', 'x0', '0', 'x0', 'onerror=', 'eval(', 'src', ')', '>']
            word=GeneSeg(payload)
            xssed_data.append(word)
    with open("data/normal_examples.csv","r",encoding="utf-8") as f:
        reader = csv.DictReader(f, fieldnames=["payload"])
        for row in reader:
            payload=row["payload"]
            word=GeneSeg(payload)
            normal_data.append(word)
    xssed_num=len(xssed_data)
    normal_num=len(normal_data)
    # build the labels, e.g. [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    xssed_labels=[1]*xssed_num
    normal_labels=[0]*normal_num
    datas=xssed_data+normal_data
    labels=xssed_labels+normal_labels
    def to_index(data):
        d_index=[]
        for word in data:
            if word in dictionary.keys():
                d_index.append(dictionary[word])
            else:
                d_index.append(dictionary["UNK"])
        return d_index
    # convert tokens to indices, e.g. [23, 5, 34, 14, 0, 0, 0, 0, 1, 0, 81, 0, 0, 3, 2]
    datas_index=[to_index(data) for data in datas]
    # sequences shorter than maxlen are padded with -1 at the front
    '''
    [[ -1  -1  -1 ...   0   3   2]
    [ -1  -1  -1 ...  10  17   1]
    [ -1  -1  -1 ... 150   0  71]
    ...
    [ -1  -1  -1 ...  11   2  55]
    [ -1  -1  -1 ...   5  24   1]
    [ -1  -1  -1 ...   1   3   5]]
    '''
    datas_index=pad_sequences(datas_index,value=-1,maxlen=maxlen)
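    # random.sample over the full range returns a random permutation of the indices, e.g. [7, 6, 3, 2, 5, 8, 0, 1, 10, 4, 9]
    rand=random.sample(range(len(datas_index)),len(datas_index))
    # shuffle the data and the labels with the same permutation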
    datas=[datas_index[index] for index in rand]
    labels=[labels[index] for index in rand]

    datas_embed=[]
    # embedding dimension, read from the UNK vector; 32 in this example
    dims=len(embeddings["UNK"])
    n=0
    for data in datas:
        data_embed = []
        for d in data:
            if d != -1:
                # not padding: append the word's actual embedding values
                data_embed.extend(embeddings[reverse_dictionary[d]])
            else:
                data_embed.extend([0.0] * dims)
        datas_embed.append(data_embed)
        '''
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,··· -0.5644003, 0.41219762, -1.2313833, -1.3566964, 
        -0.74316794, -1.2668883, 1.0586963, 1.5969143, 0.21956278, 1.1538218, -0.35007623, 0.21183407, 
        -0.53830135, 1.7361579, -0.08175806, -1.1915175, -1.7790002, -1.1044971, 0.40857738]
        '''
        n+=1
        if n%10000 ==0:
            print(n)
    # 70% for training, 30% for testing
    train_datas,test_datas,train_labels,test_labels=train_test_split(datas_embed,labels,test_size=0.3)
    return train_datas,test_datas,train_labels,test_labels

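0x04 SVM training

Train the model with the SVM algorithm

import time
import pickle
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

train_datas, test_datas, train_labels, test_labels = pre_process()  # pre_process returns four values
print("Start Train Job! ")
start = time.time()
model = SVC(C=1.0, kernel="linear")  # the earlier LinearSVC() assignment was overwritten immediately, so it is dropped
model.fit(train_datas,train_labels)
end = time.time()
print("Over train job in %f s" % (end - start))
print("Start Test Job!")
start=time.time()
pre=model.predict(test_datas)
end=time.time()
print("Over test job in %s s"%(end-start))
precision = precision_score(test_labels, pre)
recall = recall_score(test_labels, pre)
print("Precision score is :", precision)
print("Recall score is :", recall)
# scikit-learn estimators have no save() method; the model is persisted with pickle
with open(model_dir,"wb") as f:
    pickle.dump(model,f,protocol=2)
print("write to ",model_dir)

Precision and recall: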
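
For reference, the two metrics can be computed by hand as follows. This is a plain-Python sketch on made-up labels, not the results of the model above; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred):
    """precision = TP / (TP + FP), recall = TP / (TP + FN), for the positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0]   # hypothetical ground truth (1 = XSS)
y_pred = [1, 1, 0, 0, 1, 0]   # hypothetical predictions
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

For XSS detection, precision reflects how many flagged payloads were really attacks, and recall reflects how many of the real attacks were caught.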

2019-07-22

Security Data Analysis
