- Set up an HDFS cluster and understand the differences between writing a file from the NameNode and from a DataNode.
- Learn how to perform random access in HDFS.
- Understand how HDFS reads a split of a file.
- Follow the instructions in Lab #1 to set up the development environment.
- Download the files `nasa_19950801.tsv` and `nasa_19950630.22-19950728.12.tsv.gz` to be used later.
In this lab, you will configure a single-node HDFS on your cs167 machine. You will learn how to use the web interface to write a file to HDFS from the NameNode and from a DataNode, respectively, and you need to be aware of the differences between these two cases.
You will write a Java program locally on your own laptop that uses the HDFS API to perform a random read on a given split of a file, and then upload it to your cs167 machine.
Follow the instructions below. Answer any questions marked by (Q) and submit the deliverables marked by (S).
Goal: ensure each student can run Java code locally before using the remote cs167 machines.
- Install JDK and IntelliJ (local laptop).
  - Install **JDK 11** (or the course-specified version) from https://learn.microsoft.com/en-us/java/openjdk/download.
  - Install Maven: `apache-maven-3.9.11-bin.tar.gz` for macOS or `apache-maven-3.9.11-bin.zip` for Windows, from https://maven.apache.org/download.cgi.
  - Set the environment variables for `java` and `mvn`.
  - Install IntelliJ IDEA Community.
- Verify locally.
- Run in a terminal:

  java -version
  mvn -version

- Create and run a “Hello World” in IntelliJ.
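For the “Hello World” check, a minimal program could look like the sketch below (the class name is just an example, not a course requirement):

```java
// HelloWorld.java -- minimal program to verify the local JDK/IntelliJ setup.
public class HelloWorld {
    // Separated into a method so the behavior is easy to check.
    static String greeting() {
        return "Hello, World!";
    }

    public static void main(String[] args) {
        System.out.println(greeting()); // prints "Hello, World!"
    }
}
```

If `java -version` and `mvn -version` work, this class should compile and run both from IntelliJ and with `javac HelloWorld.java && java HelloWorld`.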
- Deliverables (screenshots in README):
  - Screenshot of `java -version` and `mvn -version`.
  - Screenshot of IntelliJ running your Java program.
You will still build and run on the remote `cs167` servers later, but local testing is now required.
Local-first note: Do this on your local machine first (JDK + IntelliJ). You do not need a local Hadoop cluster; IntelliJ can import a Maven project and download the Hadoop client libraries automatically. You will develop the Java program on your own laptop. After it runs locally, upload the project or JAR to the cs167 remote machine.
This part implements a Java program that simulates HDFS split reading.
- Create a new Maven project (keep the naming from the course materials):
  - Option A (local, recommended): In IntelliJ → New Project → Maven
    - GroupId: `edu.ucr.cs.cs167.[UCRNetID]`
    - ArtifactId: `[UCRNetID]_lab2`
  - Option B (remote CLI): On the cs167 machine under `~/cs167/workspace/`:

    # Replace [UCRNetID] with your UCR Net ID, not your student ID.
    mvn archetype:generate "-DgroupId=edu.ucr.cs.cs167.[UCRNetID]" "-DartifactId=[UCRNetID]_lab2" "-DarchetypeArtifactId=maven-archetype-quickstart" "-DinteractiveMode=false"
- Add Hadoop dependencies to `pom.xml` inside `<dependencies>` (IntelliJ will fetch them locally; on the remote machine, `mvn` will fetch them there):
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>3.3.6</version>
</dependency>
</dependencies>

- Client Configuration
`Configuration` tells the client which filesystem to use and how to talk to it; follow the Lab 1 document to see how it is coded. Here is an example:
Configuration conf = new Configuration();
Path inputPath = new Path(inputPathStr);
FileSystem fs = inputPath.getFileSystem(conf);

- Write a main function that takes exactly three command-line arguments: the first is `input` (a string indicating the file), the second is `offset` (a long integer indicating the position at which reading starts), and the third is `length` (a long integer indicating the number of bytes to read).
  - Note: You need to parse `offset` and `length` into long integers; try `Long.parseLong`.
  - Hint: You can use ChatGPT to create the main function above. We only need to parse the arguments here; we will explain later how to use them.
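As a sketch of the argument handling (the class and variable names below are illustrative, not required by the lab):

```java
// ArgsSketch.java -- illustrative parsing of <input> <offset> <length>.
public class ArgsSketch {

    // Parses offset and length into long values; throws NumberFormatException
    // if either argument is not a valid long.
    static long[] parseOffsetLength(String offsetArg, String lengthArg) {
        return new long[] { Long.parseLong(offsetArg), Long.parseLong(lengthArg) };
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("Incorrect number of arguments! Expected three arguments.");
            System.exit(-1);
        }
        String input = args[0]; // path of the file to read
        long[] ol = parseOffsetLength(args[1], args[2]);
        System.out.println(input + " offset=" + ol[0] + " length=" + ol[1]);
    }
}
```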
- If the number of command-line arguments is incorrect, use the following code to print the error message and exit.
System.err.println("Incorrect number of arguments! Expected three arguments.");
System.exit(-1);

- Retrieve a `FileSystem` instance as done in the previous lab and then open the `input` file as an `org.apache.hadoop.fs.FSDataInputStream`. Use `FSDataInputStream.seek` to move to the start position. If `offset` is not 0, you need to skip content until a newline character is reached. Feel free to use the following helper function to read a line from the input stream.
public static String readLine(FSDataInputStream input) throws IOException {
    StringBuffer line = new StringBuffer();
    int value;
    do {
        value = input.read();
        if (value != -1)
            line.append((char) value); // the terminating CR/LF is kept in the line
    } while (value != -1 && value != 13 && value != 10); // stop at EOF, '\r', or '\n'
    return line.toString();
}

- Use the function `FSDataInputStream#getPos` to track the current position in the file. End the reading process when the position exceeds `offset + length`, as described in class. Pay attention to the special case that was discussed in class, i.e., when a line perfectly aligns with the end of the split. At the end, print the following statements to indicate the split length and the actual number of bytes read.
System.out.println("Split length: "+ length);
System.out.println("Actual bytes read: " + (inputStream.getPos() - offset));

- To test your program, create a file called `test.txt` in your directory. Copy the following message into the file:
Hello!
This is
an example of simulating
HDFS read procedure.
It will ignore contents right after offset (if it is not 0),
until meet a newline character.
After that,
it will keep reading lines until
the number of read bytes exceeds
expected length.
- Now, test your program with the following arguments in IntelliJ IDEA (you can refer to the Lab 1 notes on how to set run arguments in IDEA):
test.txt 0 4
test.txt 3 4
test.txt 3 9
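Before running against HDFS, the split logic can be sanity-checked with plain JDK streams. The sketch below is a simulation, not the required solution: `ByteArrayInputStream` stands in for `FSDataInputStream`, a `pos` counter stands in for `getPos()`, and all names are illustrative. It also counts lines containing a pattern, which you will need later for the `200` count.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// SplitReadSketch.java -- HDFS-free simulation of the split-reading logic.
public class SplitReadSketch {

    // Same behavior as the lab's readLine helper: reads one line and keeps
    // the terminating '\r' or '\n' in the returned string.
    static String readLine(ByteArrayInputStream input) {
        StringBuilder line = new StringBuilder();
        int value;
        do {
            value = input.read();
            if (value != -1)
                line.append((char) value);
        } while (value != -1 && value != 13 && value != 10);
        return line.toString();
    }

    // Reads the split [offset, offset + length) and returns
    // {actual bytes read, number of lines containing pattern}.
    static long[] readSplit(byte[] data, long offset, long length, String pattern) {
        ByteArrayInputStream input = new ByteArrayInputStream(data);
        input.skip(offset);          // stands in for FSDataInputStream.seek(offset)
        long pos = offset;           // stands in for FSDataInputStream.getPos()
        if (offset != 0)
            pos += readLine(input).length(); // skip the partial first line
        long matches = 0;
        // Read while the next line STARTS at or before the split end; when a
        // line ends exactly at offset + length, one more line is still read
        // (the next split's reader will skip that line).
        while (pos <= offset + length) {
            String line = readLine(input);
            if (line.isEmpty())      // end of data
                break;
            if (line.contains(pattern))
                matches++;
            pos += line.length();    // one byte per char for ASCII input
        }
        return new long[] { pos - offset, matches };
    }

    public static void main(String[] args) {
        byte[] data = "Hello!\nThis is\nan example of simulating\n"
                .getBytes(StandardCharsets.US_ASCII);
        for (long[] test : new long[][] { {0, 4}, {3, 4}, {3, 9} }) {
            long[] result = readSplit(data, test[0], test[1], "200");
            System.out.println("Split length: " + test[1]);
            System.out.println("Actual bytes read: " + result[0]);
        }
    }
}
```

Running it on the first lines of `test.txt` shows that the actual bytes read can differ from `length`, because reading always continues to the end of the last line that starts inside the split.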
- (Q1) Compare `bytesRead` and `length`: are they equal? Use one sentence to explain why.
- Modify your code so that it outputs the number of lines containing the string `200`. Use the function `String#contains` for that. Your output should follow the format below, where `numMatchingLines` is the number of lines containing the string `200`:

System.out.println("Number of matching lines: " + numMatchingLines);

- Pack your project into a `jar` file. Don't forget to edit `pom.xml` in your lab folder:
<!-- Replace [UCRNetID] with your UCRNetID -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>$YOURPACKAGE.App</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
The jar file will be used and tested on your remote cs167 server.
This part is done on your cs167 server. Configure a single-node HDFS.
- Edit `$HADOOP_HOME/etc/hadoop/core-site.xml`:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/cs167/hadoop</value>
</property>
- Edit `$HADOOP_HOME/etc/hadoop/hdfs-site.xml` and set replication to 1 for single-node use:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
- On the namenode, initialize the files for HDFS by running the command (Note: make sure that you run this command only on the namenode):

hdfs namenode -format

- On the namenode, start the namenode by running:

hdfs namenode

Wait a few seconds to make sure that the namenode has started.

- On each of the datanodes, start the datanode by running:

hdfs datanode

- Note: if you see the following error, you can check the solution at the bottom of these instructions:
java.io.IOException: Incompatible clusterIDs in /home/cs167/hadoop/dfs/data: namenode clusterID = CID-ca13b215-c651-468c-9188-bcdee4ad2d41; datanode clusterID = CID-d9c134b6-c875-4019-bce0-2e6f8fbe30d9
- Check status:
hdfs dfsadmin -report
- (Q2) Copy the output of this command.
- (Q3) How many live datanodes are in this cluster?
This part will be done on the cs167 server (continuing from Part II).
In this part, you need to write to HDFS from namenode and datanode.
- Make sure your namenode and datanodes are running on the cs167 server.
  - Note: Nothing needs to be done if you followed the previous part.
- On the namenode machine, run the following command to copy `test.txt` from your local file system (the cs167 remote machine; don't be confused) to HDFS:

# Use a sample text file like test.txt from Part I
hdfs dfs -put test.txt

- Note: If you get an error message saying 'no such file or directory', create a directory in HDFS by using:
hdfs dfs -mkdir -p .
- On the namenode machine, run the following command to find the block locations of the file you just uploaded:

# Use a sample text file like test.txt from Part I
hdfs fsck test.txt -files -blocks -locations

- (Q4) How many replicas are stored on the namenode? How many replicas are stored on the datanodes?
- Note: You can find the datanode IP/information by using the command:
hdfs dfsadmin -report
4. (Optional, if you run multiple datanodes.) Now, on each datanode machine, repeat steps 2 and 3 to upload the file to HDFS and observe the block replica distribution.
  * Note: you need to change [UCRNetID] to the NetID of the student owning this datanode.
-
(Q5) In a single-node HDFS cluster, if replication is set to 1, how many replicas of each block exist? Explain why no replicas can reside on other datanodes.
-
(Q6) Compare your results from Q4 and Q5; give one sentence to explain the results you obtained.
This part will be done on the cs167 server.
You will use your packed jar file to read from HDFS.
- Make sure the namenode and datanodes are running.
- Download `nasa_19950801.tsv` from Lab 2, upload it to your `cs167` folder, and put the file into HDFS by using:

hdfs dfs -put nasa_19950801.tsv

- Now, use `hadoop jar` to run your jar file. Below is an example of running it:

hadoop jar [UCRNetID]_lab2-1.0-SNAPSHOT.jar nasa_19950801.tsv [offset] [length]

Your code should output the number of lines containing the string 200.
- Test with the following input parameters on the input file `nasa_19950801.tsv`:
| offset | length |
|---|---|
| 500 | 1000 |
| 12000 | 1000 |
| 100095 | 1000 |

(Q7) Include the output of the three cases above in your README file.
- Test on a larger dataset. Download `nasa_19950630.22-19950728.12.tsv.gz` to your local computer. Upload the downloaded file to your cs167 server home directory. Use the following command to decompress it:

gunzip nasa_19950630.22-19950728.12.tsv.gz

- Put the file `nasa_19950630.22-19950728.12.tsv` into HDFS, similar to step 3.

- Test your program with the following input parameters on `nasa_19950630.22-19950728.12.tsv`:
| offset | length |
|---|---|
| 1500 | 2000 |
| 13245 | 3500 |
| 112233 | 4000 |
Write a shell script named `run.sh` that contains the commands you use to run the three cases above on `nasa_19950630.22-19950728.12.tsv`.
- Add `README.md` including answers to Q1–Q7.
- Add `run.sh` that compiles and runs the three large split tests.
- (S) Submit your compressed file.
Submission format:
<UCRNetID>_lab2.{tar.gz | zip}
- src/
- pom.xml
- README.md
- run.sh
- [UCRNetID]_lab2-1.0-SNAPSHOT.jar

Requirements:

- Archive must be `.tar.gz` or `.zip`, lowercase, using underscores `_`.
- `src`, `pom.xml`, `README.md`, and `run.sh` must be at the root of the archive.
- Do not include `target/` or large input files.
- Q1. +1 point
- Q2. +1 point
- Q3. +1 point
- Q4. +1 point
- Q5. +1 point
- Q6. +1 point
- Q7. +3 points
- Code: +5 points
- +1 validates input correctly
- +1 opens and reads via HDFS API correctly
- +3 implements split semantics precisely and counts `"200"` lines
- Following submission instructions: +1 point
- Make sure to follow the naming conventions that are mentioned in Lab #1.
- Do not include the target directory or the test input files in your submission.
- Failure to follow these instructions and conventions might result in losing some points. This includes, for example, adding unnecessary files to your compressed file, using different package names, using a different name for the compressed file, not including a runnable script, and not including a `README.md` file.
- Error: When I run any HDFS command, I get an error related to safemode
Cannot create file/user/cs167/nasa_19950630.22-19950728.12.tsv._COPYING_. Name node is in safe mode.
- Fix: Run the following command:

hdfs dfsadmin -safemode leave

- Error: When I run the datanode, I get the following error:
java.io.IOException: Incompatible clusterIDs in /home/cs167/hadoop/dfs/data: namenode clusterID = CID-ca13b215-c651-468c-9188-bcdee4ad2d41; datanode clusterID = CID-d9c134b6-c875-4019-bce0-2e6f8fbe30d9
- Fix: Do the following steps to ensure a fresh start of HDFS:
- Stop the namenode and all data nodes.
- Delete the directory `~/hadoop` on the namenode and all datanodes: `rm -rf ~/hadoop`.
- Reformat HDFS using the command `hdfs namenode -format` only on the namenode.
- Start the namenode using the command `hdfs namenode`.
- Start the datanode using the command `hdfs datanode`.