The previous post covered optimization algorithms for regression. In this one we use Logistic regression to predict whether a horse suffering from colic will survive.
Prepare the data

Training set
Test the algorithm: classify with Logistic regression

Test set
import numpy as np
# sigmoid() and randgradAscent1() are the functions from the previous post

# This function takes a feature vector and the regression weights, computes the
# corresponding sigmoid value, and classifies by thresholding at 0.5
def classifyVector(x, weights):
    prob = sigmoid(np.sum(x * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

# This function opens the training and test sets and formats the data
def colicTest():
    frTrain = open(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch05\horseColicTraining.txt')
    frTest = open(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch05\horseColicTest.txt')
    # Two empty lists: one for the feature vectors, one for the labels
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')  # split on tab characters
        lineArr = []
        for i in range(21):  # 21 features per sample
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))  # the last column is the label
    # Train the best weights with the improved stochastic gradient ascent, 500 passes
    trainWeights = randgradAscent1(np.array(trainingSet), np.array(trainingLabels), 500)
    # Load the test set and compute the classification error rate
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(np.array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1  # count misclassified test samples
    errorRate = float(errorCount) / numTestVec  # error rate
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

# Call colicTest() 10 times and average the results
def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum / float(numTests)))
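The code above relies on sigmoid() and randgradAscent1() from the previous post. For completeness, here is a minimal sketch of what those two functions look like, following the improved stochastic gradient ascent described there (decaying step size, samples drawn in random order each pass); the exact variable names and the clipping in sigmoid() are my additions, not taken verbatim from the original:

```python
import numpy as np

def sigmoid(z):
    # clip the input so np.exp does not overflow for large negative z
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def randgradAscent1(dataMatrix, classLabels, numIter=150):
    # improved stochastic gradient ascent:
    # - step size alpha decays with the iteration count but never reaches 0
    # - each pass visits the samples in a random order without replacement
    m, n = dataMatrix.shape
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01  # decaying step size
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            idx = dataIndex[randIndex]
            h = sigmoid(np.sum(dataMatrix[idx] * weights))
            error = classLabels[idx] - h
            weights = weights + alpha * error * dataMatrix[idx]
            del dataIndex[randIndex]  # each sample used once per pass
    return weights
```

Because the samples are picked at random, two runs on the same data generally produce slightly different weights, which is why the error rates printed below vary from run to run.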
Test
multiTest()
Output
the error rate of this test is: 0.358209
the error rate of this test is: 0.402985
the error rate of this test is: 0.358209
the error rate of this test is: 0.402985
the error rate of this test is: 0.313433
the error rate of this test is: 0.373134
the error rate of this test is: 0.313433
the error rate of this test is: 0.328358
the error rate of this test is: 0.402985
the error rate of this test is: 0.358209
after 10 iterations the average error rate is: 0.361194
From the results, the average error rate over 10 runs is about 36%. If that is not satisfactory, it can be improved by tuning the number of iterations in colicTest() and the step size in randgradAscent1().
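Since randgradAscent1() is stochastic, any tuning should compare averages over several runs rather than single results. A minimal sketch of such a tuning loop, assuming colicTest() were modified to take the iteration count as a parameter (the helper name tuneIterations and the parameterized colicTest(numIter) are hypothetical):

```python
def tuneIterations(testFn, iterChoices, numRepeats=10):
    # For each candidate iteration count, run the (stochastic) test
    # several times and record the mean error rate
    results = {}
    for numIter in iterChoices:
        rates = [testFn(numIter) for _ in range(numRepeats)]
        results[numIter] = sum(rates) / len(rates)
    return results
```

It would be called as, e.g., tuneIterations(colicTest, (100, 500, 1000)), and the iteration count with the lowest mean error kept.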
Source code
