Tuesday, October 31, 2017

How To Install Apache Hadoop 0.23 Pseudo Distributed Mode on a Single Node



Apache Hadoop's pseudo-distributed mode lets you simulate a multi-node installation on a single node: instead of installing Hadoop on several servers, you run all the daemons on one server.
Before you continue, make sure you understand the Hadoop fundamentals and have tested a standalone Hadoop installation.
If you have already completed the first three steps below as part of the standalone Hadoop installation, jump to step 4.

1. Create a Hadoop User

You can download and install Hadoop as root, but it is recommended to run it under a separate user. So, log in as root and create a user called hadoop.
# adduser hadoop
# passwd hadoop

Add the hadoop user to the sudo group (run this from an account that already has sudo rights; in the example below that account is 'k'):
hduser@laptop:~/hadoop-2.6.0$ su k
Password: 

k@laptop:/home/hduser$ sudo adduser hadoop sudo
[sudo] password for k: 
Adding user `hadoop' to group `sudo' ...
Adding user hadoop to group sudo
Done.
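To double-check that the new account picked up the group (an optional sanity check), list its groups; sudo should appear in the output:
$ groups hadoop
hadoop : hadoop sudo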

2. Download Hadoop Common

Download Apache Hadoop Common and move it to the server where you want to install it.
Alternatively, use wget to download it directly onto the server:
# su - hadoop
$ wget http://mirror.nyi.net/apache//hadoop/common/stable/hadoop-0.20.203.0rc1.tar.gz
Make sure Java 1.6 is installed on your system.
$ java -version
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.7) (rhel-1.39.1.9.7.el6-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

3. Unpack under hadoop User

As the hadoop user, unpack the package.
$ tar xvfz hadoop-0.20.203.0rc1.tar.gz
This will create the "hadoop-0.20.204.0" directory.
$ ls -l hadoop-0.20.204.0
total 6780
drwxr-xr-x.  2 hadoop hadoop    4096 Oct 12 08:50 bin
-rw-rw-r--.  1 hadoop hadoop  110797 Aug 25 16:28 build.xml
drwxr-xr-x.  4 hadoop hadoop    4096 Aug 25 16:38 c++
-rw-rw-r--.  1 hadoop hadoop  419532 Aug 25 16:28 CHANGES.txt
drwxr-xr-x.  2 hadoop hadoop    4096 Nov  2 05:29 conf
drwxr-xr-x. 14 hadoop hadoop    4096 Aug 25 16:28 contrib
drwxr-xr-x.  7 hadoop hadoop    4096 Oct 12 08:49 docs
drwxr-xr-x.  3 hadoop hadoop    4096 Aug 25 16:29 etc
Modify the hadoop-0.20.204.0/conf/hadoop-env.sh file and make sure the JAVA_HOME environment variable points to the Java installation on your system.
$ grep JAVA ~/hadoop-0.20.204.0/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_27
After this step, Hadoop is installed under the /home/hadoop/hadoop-0.20.204.0 directory.
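If you are not sure where your JDK lives, one quick way to find it (assuming the java binary is on your PATH and GNU readlink is available) is to resolve the symlink behind the java command; on a system matching this guide it would print something like the path below, and JAVA_HOME is that path with the trailing /jre/bin/java (or /bin/java) removed:
$ readlink -f $(which java)
/usr/java/jdk1.6.0_27/jre/bin/java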

4. Modify Hadoop Configuration Files

Add the <configuration> section shown below to the core-site.xml file. This sets the default HDFS URI and port.
$ cat ~/hadoop-0.20.204.0/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Add the <configuration> section shown below to the hdfs-site.xml file.
$ cat ~/hadoop-0.20.204.0/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
Add the <configuration> section shown below to the mapred-site.xml file. This sets the JobTracker address to localhost, port 9001.
$ cat ~/hadoop-0.20.204.0/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
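Before moving on, it is worth making sure none of the edited files contains a stray or unclosed tag. If xmllint is available on your system (it comes with libxml2 on most distros), a quick well-formedness check looks like this; no output means all three files parse cleanly:
$ xmllint --noout ~/hadoop-0.20.204.0/conf/core-site.xml ~/hadoop-0.20.204.0/conf/hdfs-site.xml ~/hadoop-0.20.204.0/conf/mapred-site.xml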

5. Setup passwordless ssh to localhost

In a typical Hadoop production environment you would set up passwordless ssh access between the different servers. Since we are simulating a distributed environment on a single server, we need to set up passwordless ssh access to localhost itself.
Use ssh-keygen to generate the private and public key value pair.
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
02:5a:19:ab:1e:g2:1a:11:bb:22:30:6d:12:38:a9:b1 hadoop@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|oo               |
|o + .    .       |
| + +  o o        |
|o   .o = .       |
| .   += S        |
|.   o.o+.        |
|.    ..o.        |
| . E  ..         |
|  .   ..         |
+-----------------+
Add the public key to the authorized_keys file. The ssh-copy-id command takes care of this step automatically and assigns the appropriate permissions to the files.
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub localhost
hadoop@localhost's password:
Now try logging into the machine, with "ssh 'localhost'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
Test the passwordless login to the localhost as shown below.
$ ssh localhost
Last login: Sat Jan 14 23:01:59 2012 from localhost
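To be sure no password or passphrase prompt is left over, you can force ssh into non-interactive mode; if the passwordless setup is correct this prints ok and nothing else:
$ ssh -o BatchMode=yes localhost 'echo ok'
ok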

6. Format Hadoop NameNode

Format the namenode using the hadoop command as shown below. You’ll see the message “Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted” if this command works properly.
$ cd ~/hadoop-0.20.204.0

$ bin/hadoop namenode -format
12/01/14 23:02:27 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.204.0
STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011
************************************************************/
12/01/14 23:02:27 INFO util.GSet: VM type       = 64-bit
12/01/14 23:02:27 INFO util.GSet: 2% max memory = 17.77875 MB
12/01/14 23:02:27 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/01/14 23:02:27 INFO util.GSet: recommended=2097152, actual=2097152
12/01/14 23:02:27 INFO namenode.FSNamesystem: fsOwner=hadoop
12/01/14 23:02:27 INFO namenode.FSNamesystem: supergroup=supergroup
12/01/14 23:02:27 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/01/14 23:02:27 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/01/14 23:02:27 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/01/14 23:02:27 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/01/14 23:02:27 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/01/14 23:02:27 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
12/01/14 23:02:27 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/127.0.0.1
************************************************************/

7. Start All Hadoop Related Services

Use the ~/hadoop-0.20.204.0/bin/start-all.sh script to start all Hadoop-related services. This starts the namenode, datanode, secondary namenode, jobtracker, and tasktracker.
$ bin/start-all.sh
starting namenode, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-namenode-hadoop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-secondarynamenode-hadoop.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-jobtracker-hadoop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-tasktracker-hadoop.out
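A quick way to confirm that all five daemons actually came up is the jps tool that ships with the JDK (on some distros it is in the JDK devel package). You should see entries similar to the following; the process IDs will differ:
$ jps
12305 NameNode
12427 DataNode
12552 SecondaryNameNode
12631 JobTracker
12755 TaskTracker
12890 Jps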

8. Browse NameNode and JobTracker Web GUI

Once all the Hadoop processes are started, you can view the health and status of HDFS from a web interface at http://{your-hadoop-server-ip}:50070/dfshealth.jsp
For example, if you've installed Hadoop on a server with IP address 192.168.1.10, use http://192.168.1.10:50070/dfshealth.jsp to view the NameNode GUI.
This page displays the following information:
Basic NameNode information:
  • When the NameNode was started, the Hadoop version number, and whether any upgrades are currently in progress.
  • A "Browse the filesystem" link that lets you browse the contents of the HDFS filesystem from the browser.
  • A "Namenode Logs" link to view the logs.
Cluster Summary displays the following information:
  • Total number of files and directories managed by HDFS
  • Any warning messages (for example, missing blocks)
  • Total HDFS filesystem size
  • HDFS usage, both as a percentage and as an absolute size
  • Total number of nodes in the cluster
NameNode storage information: the storage directory of the HDFS filesystem, the filesystem type, and the state (active or not).


To access the JobTracker web interface, use http://{your-hadoop-server-ip}:50030/ (port 50090 is the SecondaryNameNode web UI).
For example, if you've installed Hadoop on a server with IP address 192.168.1.10, use http://192.168.1.10:50030/ to view the JobTracker GUI.
As shown by the netstat command below, the NameNode (50070) and SecondaryNameNode (50090) web ports are in use:
$ netstat -a | grep 500
tcp        0      0 *:50090                     *:*                         LISTEN
tcp        0      0 *:50070                     *:*                         LISTEN
tcp        0      0 hadoop.thegeekstuff.com:50090    ::ffff:192.168.1.98:55923 ESTABLISHED

9. Test Sample Hadoop Program

This example program ships with Hadoop and is shown in the Hadoop documentation as a simple way to verify that the setup works.
For testing purposes, add some sample data files to the input directory. Here we simply copy all the XML files from the conf directory into the input directory, so those XML files serve as the data files for the example program. In the standalone version, you used the standard cp command to copy them into the input directory.
In a distributed Hadoop setup, however, you use the -put option of the hadoop command to add files to HDFS. Keep in mind that you are not adding the files to a Linux filesystem; you are adding them to the Hadoop Distributed File System, so you must use the hadoop command.
$ cd ~/hadoop-0.20.204.0

$ bin/hadoop fs -put conf input
Execute the sample Hadoop test program. It is a simple Hadoop job that behaves like grep: it searches for the regex pattern "dfs[a-z.]+" in all the input/*.xml files stored in HDFS and writes the results to an output directory, also in HDFS.
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
When everything is set up properly, the sample program prints messages like the following while it runs.
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
12/01/14 23:45:02 INFO mapred.FileInputFormat: Total input paths to process : 18
12/01/14 23:45:02 INFO mapred.JobClient: Running job: job_201111020543_0001
12/01/14 23:45:03 INFO mapred.JobClient:  map 0% reduce 0%
12/01/14 23:45:18 INFO mapred.JobClient:  map 11% reduce 0%
12/01/14 23:45:24 INFO mapred.JobClient:  map 22% reduce 0%
12/01/14 23:45:27 INFO mapred.JobClient:  map 22% reduce 3%
12/01/14 23:45:30 INFO mapred.JobClient:  map 33% reduce 3%
12/01/14 23:45:36 INFO mapred.JobClient:  map 44% reduce 7%
12/01/14 23:45:42 INFO mapred.JobClient:  map 55% reduce 14%
12/01/14 23:45:48 INFO mapred.JobClient:  map 66% reduce 14%
12/01/14 23:45:51 INFO mapred.JobClient:  map 66% reduce 18%
12/01/14 23:45:54 INFO mapred.JobClient:  map 77% reduce 18%
12/01/14 23:45:57 INFO mapred.JobClient:  map 77% reduce 22%
12/01/14 23:46:00 INFO mapred.JobClient:  map 88% reduce 22%
12/01/14 23:46:06 INFO mapred.JobClient:  map 100% reduce 25%
12/01/14 23:46:15 INFO mapred.JobClient:  map 100% reduce 100%
12/01/14 23:46:20 INFO mapred.JobClient: Job complete: job_201111020543_0001
...
The above command creates the output directory (in HDFS) with the results shown below. To copy this output directory to the local filesystem, use the "-get" option of the hadoop command.
$  bin/hadoop fs -get output output

$ ls -l output
total 4
-rwxrwxrwx. 1 root root 11 Aug 23 08:39 part-00000
-rwxrwxrwx. 1 root root  0 Aug 23 08:39 _SUCCESS

$ cat output/*
1       dfsadmin
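If you only want to inspect the result without copying it to the local filesystem, you can also read it straight out of HDFS with the -cat option:
$ bin/hadoop fs -cat output/part-00000
1       dfsadmin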

10. Troubleshooting Hadoop Issues

Issue 1: “Temporary failure in name resolution”
While executing the sample hadoop program, you might get the following error message.
12/01/14 07:34:57 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-root/mapred/staging/root-1040516815/.staging/job_local_0001
java.net.UnknownHostException: hadoop: hadoop: Temporary failure in name resolution
        at java.net.InetAddress.getLocalHost(InetAddress.java:1438)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:815)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
        at java.security.AccessController.doPrivileged(Native Method)
Solution 1: Add an entry like the following to the /etc/hosts file, containing the IP address, the FQDN (fully qualified domain name), and the short hostname.
192.168.1.10 hadoop.thegeekstuff.com hadoop
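After editing /etc/hosts you can confirm that the name now resolves. The hostname and IP below come from the example entry above; substitute your own:
$ getent hosts hadoop
192.168.1.10    hadoop.thegeekstuff.com hadoop
$ hostname -f
hadoop.thegeekstuff.com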
Issue 2: “localhost: Error: JAVA_HOME is not set”
While executing start-all.sh, you might get the errors shown below.
$ bin/start-all.sh
starting namenode, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-namenode-hadoop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-datanode-hadoop.out
localhost: Error: JAVA_HOME is not set.
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-secondarynamenode-hadoop.out
localhost: Error: JAVA_HOME is not set.
starting jobtracker, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-jobtracker-hadoop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20.204.0/libexec/../logs/hadoop-hadoop-tasktracker-hadoop.out
localhost: Error: JAVA_HOME is not set.
Solution 2: Make sure JAVA_HOME is set properly in conf/hadoop-env.sh, as shown below.
$ grep JAVA_HOME conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_27
Issue 3: Error while executing “bin/hadoop fs -put conf input”
You might get one of the following error messages (including "put: org.apache.hadoop.security.AccessControlException: Permission denied") while executing the hadoop fs -put command, as shown below.
$ bin/hadoop fs -put conf input
12/01/14 23:21:53 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 7 time(s).
12/01/14 23:21:54 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 8 time(s).
12/01/14 23:21:55 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 9 time(s).
Bad connection to FS. command aborted. exception: Call to localhost/127.0.0.1:9000 failed on connection exception: java.net.ConnectException: Connection refused

$  bin/hadoop fs -put conf input
put: org.apache.hadoop.security.AccessControlException: Permission denied: user=hadoop, access=WRITE, inode="":root:supergroup:rwxr-xr-x
Solution 3: Make sure the /etc/hosts file is set up properly. Also, if the HDFS filesystem was not created properly, "hadoop fs -put" can fail; format HDFS using "bin/hadoop namenode -format" and confirm that it prints the "successfully formatted" message.
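The "Connection refused" variant of this error usually just means the NameNode is not running. After starting the daemons (and formatting HDFS if needed), you can check that something is listening on the NameNode port configured in core-site.xml (9000 in this guide) before retrying the -put; the output should look roughly like this:
$ netstat -tlnp 2>/dev/null | grep 9000
tcp        0      0 127.0.0.1:9000          0.0.0.0:*               LISTEN      12305/java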
Issue 4: While executing start-all.sh (or start-dfs.sh), you might get this error message: "localhost: Unrecognized option: -jvm localhost: Could not create the Java virtual machine."
Solution 4: This can happen if you installed Hadoop as root and try to start the processes as root. It is a known bug that has since been fixed, but if you hit it, install Hadoop under a non-root account (as explained in this article), which should resolve the issue.
To stop all Hadoop daemons, run bin/stop-all.sh.
If the DataNode process does not start, see: https://stackoverflow.com/questions/11889261/datanode-process-not-running-in-hadoop

Friday, October 27, 2017

Apache Sqoop 1.4.x Installation

Apache Sqoop is a command-line application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run repeatedly to import updates made to a database since the last import. Imports can also populate tables in Hive or HBase, and exports can push data from Hadoop into a relational database. The name Sqoop comes from "SQL + Hadoop". Sqoop became a top-level Apache project in March 2012.
Pre Requirements
1) A machine with Ubuntu 14.04 LTS operating system.
2) Apache Hadoop pre installed (How to install Hadoop on Ubuntu 14.04)
3) MySQL Database pre installed (How to install MySQL Database on Ubuntu 14.04)
4) Apache Sqoop 1.4.6 Software (Download Here)
NOTE
Each Sqoop release is built against a particular Hadoop version. Check your Hadoop version and download the matching Sqoop build.
Sqoop 1.4.6 Installation on Ubuntu
Installation Steps
Step 1 - Update. Open a terminal (CTRL + ALT + T) and run the following sudo command. It is advisable to run this before installing any package, and it is necessary for picking up the latest updates even if you have not added or removed any software sources.
$ sudo apt-get update
Step 2 - Installing Java 7.
$ sudo apt-get install openjdk-7-jdk
Step 3 - Creating sqoop directory.
$ sudo mkdir /usr/local/sqoop
Step 4 - Change the ownership and permissions of the directory /usr/local/sqoop. Here 'hduser' is an Ubuntu username.
$ sudo chown -R hduser /usr/local/sqoop
$ sudo chmod -R 755 /usr/local/sqoop
Step 5 - Change to the directory containing the download. In this example the downloaded sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz file is in /home/hduser/Desktop; yours might be in the Downloads folder, so check.
$ cd /home/hduser/Desktop/
Step 6 - Untar the sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz file.
$ tar xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
Step 7 - Move the contents of the sqoop-1.4.6.bin__hadoop-2.0.4-alpha folder to /usr/local/sqoop
$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/* /usr/local/sqoop
Step 8 - Edit the $HOME/.bashrc file to add the Sqoop paths.
$ sudo gedit $HOME/.bashrc
Add the following lines to $HOME/.bashrc:
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Step 9 - Reload your changed $HOME/.bashrc settings
$ source $HOME/.bashrc
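To confirm the shell picked up the new variables, a quick check (the paths shown are the ones set above):
$ echo $SQOOP_HOME
/usr/local/sqoop
$ which sqoop
/usr/local/sqoop/bin/sqoop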
Step 10 - Change the directory to /usr/local/sqoop/conf
$ cd $SQOOP_HOME/conf
Step 11 - Copy the default sqoop-env-template.sh to sqoop-env.sh
$ cp sqoop-env-template.sh sqoop-env.sh
Step 12 - Edit sqoop-env.sh file.
$ gedit sqoop-env.sh
Step 13 - Add the below lines to sqoop-env.sh file. Save and Close.
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Step 14 - Copy mysql-connector-java-5.1.28.jar to the /usr/local/sqoop/lib/ folder.
$ cp /usr/share/java/mysql-connector-java-5.1.28.jar /usr/local/sqoop/lib
Step 15 - Change the directory to /usr/local/sqoop/bin
$ cd $SQOOP_HOME/bin
Step 16 - Verify Installation
$ sqoop-version
OR
$ sqoop version
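Beyond printing the version, you can also confirm that the MySQL connector JAR from step 14 is picked up by listing the databases on your MySQL server. This is just a sketch; adjust the JDBC URL and username to your own MySQL setup (the -P flag prompts for the password):
$ sqoop list-databases --connect jdbc:mysql://localhost/ --username root -P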

Thursday, October 12, 2017

SSH connect to a new instance launched from an AMI on AWS





When you launch an instance from the Hadoop AMI, you have to set up new ssh keys, because the old ones no longer work.

run :
root@piboonsak-26474:~# ssh -i /etc/ssh/hadoop.pem ubuntu@ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com
result :
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:KwKVW2KK4vyw4EwU6y6VlEyPAfNdF10fo6nBQCtG66A.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /root/.ssh/known_hosts:58
  remove with:
  ssh-keygen -f "/root/.ssh/known_hosts" -R ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com
ECDSA host key for ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com has changed and you have requested strict checking.
Host key verification failed.


run:
root@piboonsak-26474:~# ssh-keygen -f "/root/.ssh/known_hosts" -R ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com


result:
# Host ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com found: line 58
/root/.ssh/known_hosts updated.
Original contents retained as /root/.ssh/known_hosts.old


run:
root@piboonsak-26474:~# ssh -i /etc/ssh/hadoop.pem ubuntu@ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com
The authenticity of host 'ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com (13.228.186.233)' can't be established.
ECDSA key fingerprint is SHA256:KwKVW2KK4vyw4EwU6y6VlEyPAfNdF10fo6nBQCtG66A.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com' (ECDSA) to the list of known hosts.
Warning: the ECDSA host key for 'ec2-13-228-186-233.ap-southeast-1.compute.amazonaws.com' differs from the key for the IP address '13.228.186.233'
Offending key for IP in /root/.ssh/known_hosts:58
Are you sure you want to continue connecting (yes/no)? yes


result:
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-125-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Wed Oct 11 15:54:14 UTC 2017


run:
ubuntu@ip-172-31-17-133:~$ sudo su hduser


result:
hduser@ip-172-31-17-133:/home/ubuntu$



run:
hduser@ip-172-31-17-133:~$ ssh localhost


result:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
55:20:8b:cb:63:43:f6:74:5a:4a:44:f0:37:1e:c3:98.
Please contact your system administrator.
Add correct host key in /home/hduser/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/hduser/.ssh/known_hosts:1
  remove with: ssh-keygen -f "/home/hduser/.ssh/known_hosts" -R localhost
ECDSA host key for localhost has changed and you have requested strict checking.
Host key verification failed.



run:
hduser@ip-172-31-17-133:~$ ssh-keygen -f "/home/hduser/.ssh/known_hosts" -R localhost


result:
# Host localhost found: line 1 type ECDSA
/home/hduser/.ssh/known_hosts updated.
Original contents retained as /home/hduser/.ssh/known_hosts.old



run:
hduser@ip-172-31-17-133:~$ ssh-keygen -f "/home/hduser/.ssh/known_hosts" -R 0.0.0.0


result:
# Host 0.0.0.0 found: line 1 type ECDSA
/home/hduser/.ssh/known_hosts updated.
Original contents retained as /home/hduser/.ssh/known_hosts.old


run:
hduser@ip-172-31-17-133:~$ ssh-keygen -t rsa -P ""


result:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): /home/hduser/.ssh/id_rsa
/home/hduser/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
92:2d:1f:08:52:84:cd:f3:65:91:f0:a1:62:97:20:2b hduser@ip-172-31-17-133
The key's randomart image is:
+--[ RSA 2048]----+
|  .=+ ..oo       |
|  .++. ++.       |
|E o +o+o.        |
| . o +.+         |
|      = S        |
|       + .       |
|        .        |
|                 |
|                 |
+-----------------+


result: nothing much to see here, but the new public key also has to be appended to the authorized_keys file (the append itself is shown in the note below).
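That append step is not captured in the log above; the usual way to do it, and to keep the permissions sshd expects, is:
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
$ chmod 600 /home/hduser/.ssh/authorized_keys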

check:
hduser@ip-172-31-17-133:~$ ls -l /home/hduser/.ssh/
total 24
-rw-r--r-- 1 hduser hadoop  405 Oct 11 17:13 authorized_key     ==> wrong file: an incomplete cat copy (unused, safe to delete)
-rw-r--r-- 1 hduser hadoop  808 Sep 21 03:49 authorized_keys
-rw------- 1 hduser hadoop 1679 Oct 11 17:12 id_rsa
-rw-r--r-- 1 hduser hadoop  405 Oct 11 17:12 id_rsa.pub
-rw------- 1 hduser hadoop  222 Oct 11 17:13 known_hosts
-rw------- 1 hduser hadoop  222 Oct 11 17:02 known_hosts.old




run check:
hduser@ip-172-31-17-133:~$ ssh localhost

result:
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-125-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Wed Oct 11 17:23:58 UTC 2017

  System load:  0.0                Processes:           123
  Usage of /:   19.2% of 39.23GB   Users logged in:     1
  Memory usage: 5%                 IP address for eth0: 172.31.17.133
  Swap usage:   0%

  Graph this data and manage this system at:
    https://landscape.canonical.com/

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

30 packages can be updated.
19 updates are security updates.

New release '16.04.3 LTS' available.
Run 'do-release-upgrade' to upgrade to it.


Last login: Thu Sep 21 04:01:43 2017 from localhost
hduser@ip-172-31-17-133:~$ 


END.

Warning: if you are still inside the ssh session to localhost, exit it first.


Thursday, September 29, 2016

Pentaho 5.4 PDI Connect Google Sheets API


Google Sheets has become a top choice as an Excel replacement: it can do what Excel does and also lets several people work on the same document in the cloud. Pentaho PDI 5.4 ships with Google Sheet Input/Output steps that can pull that data into an ETL flow.
The following describes how to connect Google Sheets to Pentaho PDI.
First, you need a Google account set up for developers.
Sign up for a Google developer account (see the linked YouTube/blog walkthrough for the sign-up steps).
Once signed up, you will land on https://console.cloud.google.com


From this console you can use many of Google's APIs. The steps we will follow:
Step 1. Create a P12 key file so we can connect to our Google Sheet.
Step 2. Fill in the Google Sheet Input step in Pentaho PDI.
Step 3. Share the target Google Sheet file with the service account.
Step 4. Connected.
Step 1. Create the P12 key file that lets us connect to the Google Sheet.
Select the Sheets API.
Select Credentials and click Create Credentials.
Choose Service account key, because we are connecting server-to-server: the Pentaho server talks to Google Cloud.
Under Service account, choose New service account, because we need an Account ID.
Choose P12 as the key type, because the Google Sheet Input step in Pentaho PDI uses a P12 file for the connection.
Click Create and fill in the form:
Service account name: any name you like.
Under Role, choose Project >> Owner or >> Editor.
Click Close. You now have an ID and a service account with the name you chose.
Choose Manage Service Accounts if you ever need to create the P12 file again, and pick the account you want.

Create a Sheet file in Google Drive and add some data for testing.
Then open Pentaho PDI.

Step 2. Fill in the Google Sheet Input step in Pentaho PDI.
Create a new transformation for the test.
Pick the Google Spreadsheet Input step and drag it into the transformation.
Fill in the Email address. Note: this is not your own email, it is the email of the Service Account you created in Google APIs in step 1. Go back to Google APIs and copy the Service Account's email into this field.
Then browse to the P12 file you created earlier (find the folder where you saved it; in this example it sits on the Desktop). Careful: if your machine has both a root user and a personal user, each has its own Desktop folder.
Once the file is selected, a Client ID appears.
Click Test connection; if the connection works you will see a Success message. (Remember that the machine needs internet access.)
Next, connect to the sheet whose data we want. Click Browse (do not just paste in the Spreadsheet ID).
If nothing shows up, the Service Account does not yet know about the spreadsheet we want to read. Go back to the spreadsheet, click Share, copy the email of the Service Account we created (it must match the one entered in PDI), and add it to the share list the same way you would share with any other user.
Now browse for the spreadsheet again. When its name appears, you get the Spreadsheet ID (again, it is selected through Browse, not copy-pasted).
Next, browse for the sheet that holds the data; the names of the sheets inside the spreadsheet are listed for you to choose from.
Finally, select the fields you want and click Get Fields. If the fields come back, you are done.
The connected data can now feed further ETL work, such as reporting or analysis. Because it is pulled straight from a spreadsheet the users already maintain, there is no extra workflow to set up, and the spreadsheet itself is convenient to use.
Thanks.
Source: none; this post is original.