Tuesday, June 23, 2015

Scientific Writing

Thursday, February 6, 2014

Running Own Written Python Code in Hadoop

This post lists the steps required to run your own Python code on a Hadoop v1.0.3 cluster.

1. Create the mapper Python script.

  • su - hduser
  • nano mapper.py
Write the following code in mapper.py and save it.

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
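The mapper above splits on whitespace only, so a token such as "God." (with the trailing period) is counted separately from "God". If that is not what you want, one possible variation is to normalize the words before emitting them. The sketch below is my own illustration, not part of the original mapper.py: the names WORD_RE and map_line are hypothetical, and it assumes that only lowercase ASCII letters and digits matter.

```python
import re

# Hypothetical helper, not part of the original mapper.py.
# Assumption: a "word" is a run of ASCII letters or digits.
WORD_RE = re.compile(r"[a-z0-9]+")


def map_line(line):
    """Yield (word, 1) pairs for one line, lowercased and without punctuation."""
    for word in WORD_RE.findall(line.lower()):
        yield word, 1


# Small in-memory demo; a streaming mapper would loop over sys.stdin instead.
for word, count in map_line("God is God. I am I"):
    print('%s\t%s' % (word, count))
```

To use this in a streaming job you would call map_line on every line of sys.stdin and print the pairs in the same tab-delimited format.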

2. Create the reducer Python script.

  • nano reducer.py
Write the following code in reducer.py and save it.
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (checking current_word also avoids printing "None 0" on empty input)
if current_word:
    print('%s\t%s' % (current_word, current_count))
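The reducer above keeps track of the current word by hand. As an alternative, the same "sum the counts of consecutive equal keys" idea can be sketched with itertools.groupby. This is only an illustration under the same assumption as before (the input is already sorted by word); the function names parse_pairs and reduce_counts are hypothetical and not part of the original reducer.py.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines, separator='\t'):
    """Yield (word, count) from 'word<TAB>count' lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.rstrip('\n').partition(separator)
        if not sep:
            continue  # no separator found, ignore this line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number, ignore this line


def reduce_counts(lines):
    """Sum the counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Small in-memory demo; a streaming reducer would pass sys.stdin instead.
sample = ["God\t1\n", "God\t1\n", "I\t1\n", "I\t1\n", "am\t1\n", "is\t1\n"]
for word, total in reduce_counts(sample):
    print('%s\t%s' % (word, total))
```

groupby only merges neighbouring lines with the same key, so this version depends on sorted input just as the hand-written loop does.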

3. Test your code (cat data | map | sort | reduce)

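I recommend testing your mapper.py and reducer.py scripts locally before using them in a MapReduce job. Otherwise your job might complete successfully but produce no result data at all, or not the results you expected. Make both scripts executable first (chmod +x /home/hduser/mapper.py /home/hduser/reducer.py) so that the shell, and later Hadoop, can run them directly.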

# very basic test
hduser@ubuntu:~$ echo "God is God I am I" | /home/hduser/mapper.py
God     1
is      1
God     1
I       1
am      1
I       1

hduser@ubuntu:~$ echo "God is God I am I" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
am      1
God     2
I       2
is      1

hduser@ubuntu:~$ cat /tmp/sandhu/pg20417.txt | /home/hduser/mapper.py
The     1
Project 1
Gutenberg       1
EBook   1
of      1
[...]
(you get the idea)

The order of the reducer's output lines follows the output of sort, which depends on the locale in use.
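To check that the local pipeline produces the right numbers, you can compare its output with counts computed independently. The following is a minimal sketch: expected_counts and parse_counts are hypothetical helper names, and the captured string stands in for output you would copy from your own terminal.

```python
from collections import Counter


def expected_counts(text):
    """Reference word counts, using the same whitespace split as mapper.py."""
    return dict(Counter(text.split()))


def parse_counts(output):
    """Turn 'word<TAB>count' lines (the format reducer.py prints) into a dict."""
    counts = {}
    for line in output.splitlines():
        word, sep, count = line.partition('\t')
        if sep:
            counts[word] = int(count)
    return counts


text = "God is God I am I"
# Hypothetical captured output of:
#   echo "<text>" | mapper.py | sort -k1,1 | reducer.py
captured = "God\t2\nI\t2\nam\t1\nis\t1\n"
print(parse_counts(captured) == expected_counts(text))
```

Because both sides are compared as dictionaries, the check does not depend on the order in which sort delivers the lines.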

4. Running the Python Code on Hadoop
  • bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

Running "WordCount" Map Reduce Job in Hadoop 1.0.3

This post explains the steps required to run the WordCount MapReduce job in Hadoop v1.0.3.

  1. Create a folder to store the input files. The words in these files will be counted. The current setup uses three books in plain text format.
  • su - hduser
  • mkdir /tmp/sandhu
2. Copy the three files to the /tmp/sandhu folder and check them with the following commands.
  • cd /tmp/sandhu
  • ls -l
output will look like:

3. Start the Hadoop Cluster:
  • /home/hadoop/bin/start-all.sh
4. Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
  • cd /home/hadoop
  • bin/hadoop dfs -copyFromLocal /tmp/sandhu /home/hduser/sandhu
Check that the files were copied to HDFS correctly with the following command.
  • bin/hadoop dfs -ls /home/hduser/sandhu
output will look like:
5. Now, we actually run the WordCount example job.
  • bin/hadoop jar hadoop*examples*.jar wordcount /home/hduser/sandhu /home/hduser/sandhu-output
Output will be like:

6. Retrieve the job result from HDFS
  • bin/hadoop dfs -cat /home/hduser/sandhu-output/part-r-00000
7. Hadoop API's
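The result file retrieved in step 6 contains one "word<TAB>count" line per word. As a small illustration of what you can do with it, the sketch below picks the most frequent words from such lines. The function name top_words and the sample data are hypothetical and are not output from a real job.

```python
import heapq


def top_words(lines, n=10):
    """Return the n most frequent (word, count) pairs from 'word<TAB>count' lines."""
    pairs = []
    for line in lines:
        word, sep, count = line.rstrip('\n').rpartition('\t')
        if sep and count.isdigit():
            pairs.append((word, int(count)))
    # nlargest avoids sorting the whole list when only a few entries are needed
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


# Hypothetical sample lines in the WordCount output format.
sample = ["Gutenberg\t3\n", "Project\t5\n", "the\t42\n", "of\t17\n"]
print(top_words(sample, n=2))
```

With real data you could first redirect the output of the -cat command to a local file (for example result.txt, a name chosen here only for illustration) and then pass the open file object to top_words.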

Tuesday, August 6, 2013

Basic Meaning of Big Data

Over the last few days, some students and juniors have been asking me about the basics of Big Data and how it relates to Cloud Computing, so I decided to write an article explaining what Big Data means.

As the name suggests, Big Data refers to amounts of data so large that they cannot be processed with simple methods and tools. For example, modern high-energy physics experiments such as DZero typically generate more than one terabyte of data per day. Facebook, the well-known social networking website, serves 570 billion page views per month, stores 3 billion new photos every month, and manages 25 billion pieces of content. The main question is how to process this much data in little time. Data collected from such sources is only loosely linked, so making decisions from it is complex and time-consuming. Today's conventional databases cannot process data unless they know the exact relations between its terms.

Nowadays organizations collect data from many different sources and by many different methods. For example, a laptop company can collect data about a product from social networking sites such as Facebook and Twitter, from laptop-related blogs, and even from online stores that sell laptops. Data collected from so many different sources cannot be processed by conventional databases well enough to support sound decisions. This data is too big, moves too fast, or doesn’t fit the structure of conventional databases. To gain value from it, we must follow a new approach. This new approach is known as Big Data.

Recognizing the need for Big Data, on March 29, 2012 the American government announced the “Big Data Research and Development Initiative”, and Big Data became national policy for the first time. More recently, Gartner defined big data as follows: “Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” (I will explain the three V’s of big data in detail in upcoming posts).

Basically, the technique used to extract useful information from very large, unstructured data sets is known as Big Data. This data can be of any type. I hope you now understand the basic meaning of Big Data. Stay tuned: I will cover Big Data architecture, the three V’s, and more in subsequent posts.

Rajinder Sandhu