A recent project of mine involves some distributed computing, so I wanted to try writing a Hadoop MapReduce job in Python.
Quite a few pitfalls today.
1. Xshell wouldn't connect. Uh... you have to do it like this. Pure voodoo, I have no idea why it works.

2. Writing the Python map-reduce programs
The mapper code is as follows:
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
The blogger I copied this from left quite a few traps: the \t was mistyped, a colon was missing, and the import at the top wasn't even used.
Copy the scripts to /usr/local/hadoop, and remember to run
chmod +x mapper.py
to make the script executable.
One more thing! Open the file with vim mapper.py and check it.
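The usual gotcha at this step (this is my assumption about what the check is for, not something spelled out above): if the script was written on Windows it carries CRLF line endings, and then the #!/usr/bin/env python line fails with a confusing "bad interpreter: No such file or directory" error. Inside vim you can check and fix the line endings with:
:set fileformat?
:set fileformat=unix
:wq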

Finally, run
echo "foo foo quux labs foo bar zoo zoo hying" | ./mapper.py
to test it.
The result:
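(The original screenshot isn't reproduced here, but given the mapper above the output is deterministic: each word followed by a tab and a 1, in input order, so the test should print something like this.)
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
zoo	1
zoo	1
hying	1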

3. The reducer
The code is as follows:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN (the sorted output of mapper.py)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the mapper output: word<TAB>count
    word, count = line.split('\t', 1)
    # convert count (currently a string) to an int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue
    # this grouping works because Hadoop (and sort) order the
    # mapper output by key before it reaches the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write the result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# don't forget to output the last word
if word == current_word:
    print('%s\t%s' % (current_word, current_count))
Run
echo "pib foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
The result:
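(Again, the screenshot isn't shown, but the pipeline is deterministic: sort -k1,1 groups identical words together and the reducer sums up the 1s, so the output should be roughly this.)
bar	1
foo	3
labs	1
pib	1
quux	2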


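Once both scripts behave locally, the real run goes through Hadoop Streaming. The jar location and the HDFS paths below are assumptions on my part (they depend on the Hadoop version and on where the input data is uploaded), but the command has roughly this shape:
# upload some text to HDFS first, e.g.
# hdfs dfs -put some_text.txt /user/hadoop/input
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hadoop/input \
    -output /user/hadoop/output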