The previous post covered optimization algorithms for regression. In this one we use Logistic regression to predict whether a horse suffering from colic will survive.
Prepare the data

Training set
Test the algorithm: classify with Logistic regression

Test set
import numpy as np
# sigmoid() and randgradAscent1() are the functions from the previous post

# This function takes a feature vector and the regression weights, computes the
# corresponding sigmoid value, and classifies by thresholding at 0.5
def classifyVector(x, weights):
    prob = sigmoid(np.sum(x * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

# This function opens the training and test sets and formats the data
def colicTest():
    frTrain = open(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch05\horseColicTraining.txt')
    frTest = open(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch05\horseColicTest.txt')
    # Two empty lists: one for the feature vectors, one for the labels
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')  # split on tab characters
        lineArr = []
        for i in range(21):  # 21 features per sample
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))  # the last column is the label
    # Train the best weights with the improved stochastic gradient ascent, 500 passes
    trainWeights = randgradAscent1(np.array(trainingSet), np.array(trainingLabels), 500)
    # Load the test set and compute the classification error rate
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(np.array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1  # count misclassified test samples
    errorRate = float(errorCount) / numTestVec  # error rate
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

# Call colicTest() 10 times and average the results
def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum / float(numTests)))
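The code above relies on sigmoid() and randgradAscent1() from the previous post. For completeness, here is a minimal sketch of what those two functions look like, following the improved stochastic gradient ascent described there (decaying step size, samples drawn in random order each pass); the exact variable names and the clipping in sigmoid() are my additions, not taken verbatim from the original:

```python
import numpy as np

def sigmoid(z):
    # clip the input so np.exp does not overflow for large negative z
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def randgradAscent1(dataMatrix, classLabels, numIter=150):
    # improved stochastic gradient ascent:
    # - step size alpha decays with the iteration count but never reaches 0
    # - each pass visits the samples in a random order without replacement
    m, n = dataMatrix.shape
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01  # decaying step size
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            idx = dataIndex[randIndex]
            h = sigmoid(np.sum(dataMatrix[idx] * weights))
            error = classLabels[idx] - h
            weights = weights + alpha * error * dataMatrix[idx]
            del dataIndex[randIndex]  # each sample used once per pass
    return weights
```

Because the samples are picked at random, two runs on the same data generally produce slightly different weights, which is why the error rates printed below vary from run to run.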
Test
multiTest()
Output
the error rate of this test is: 0.358209
the error rate of this test is: 0.402985
the error rate of this test is: 0.358209
the error rate of this test is: 0.402985
the error rate of this test is: 0.313433
the error rate of this test is: 0.373134
the error rate of this test is: 0.313433
the error rate of this test is: 0.328358
the error rate of this test is: 0.402985
the error rate of this test is: 0.358209
after 10 iterations the average error rate is: 0.361194
From the results, the average error rate over 10 runs is about 36%. If that is not satisfactory, it can be improved by tuning the number of iterations in colicTest() and the step size in randgradAscent1().
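Since randgradAscent1() is stochastic, any tuning should compare averages over several runs rather than single results. A minimal sketch of such a tuning loop, assuming colicTest() were modified to take the iteration count as a parameter (the helper name tuneIterations and the parameterized colicTest(numIter) are hypothetical):

```python
def tuneIterations(testFn, iterChoices, numRepeats=10):
    # For each candidate iteration count, run the (stochastic) test
    # several times and record the mean error rate
    results = {}
    for numIter in iterChoices:
        rates = [testFn(numIter) for _ in range(numRepeats)]
        results[numIter] = sum(rates) / len(rates)
    return results
```

It would be called as, e.g., tuneIterations(colicTest, (100, 500, 1000)), and the iteration count with the lowest mean error kept.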
Source code
