diff --git a/external/warc/README.md b/external/warc/README.md
index d4806f563..c01c22971 100644
--- a/external/warc/README.md
+++ b/external/warc/README.md
@@ -171,14 +171,30 @@ The WARCSpout is configured similar as FileSpout:
 - a pattern matching valid file names
 - every line in the input files specifies one input WARC file as file path or URL.
 
-The non-URL input paths are loaded via HDFS, which can be configured using the key `hdfs` e.g.
+The non-HTTP input paths are loaded via HDFS, which can be configured using the key `hdfs`, e.g.
 
-```java
-    Map hdfsConf = new HashMap<>();
-    hdfsConf.put("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
-    conf = new HashMap();
-    conf.put("hdfs", hdfsConf);
 ```
+  hdfs:
+    fs.s3a.access.key: ${awsAccessKeyId}
+    fs.s3a.secret.key: ${awsSecretAccessKey}
+```
+
+Please note that in order to access WARC files on AWS S3, you will need to add the following dependency to your project:
+
+```
+  <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-aws</artifactId>
+      <version>2.10.1</version>
+  </dependency>
+```
+
+where the version should match the one used by Apache Storm. If in doubt, you can check with
+
+```
+mvn dependency:tree | grep "org.apache.hadoop:hadoop-hdfs:jar"
+```
+
 To use the WARCSpout reading `*.paths` or `*.txt` files from the folder `input/`, you simply start to build your topology as
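
As the diff notes, every line of an input file names one WARC file as a file path or URL. A sketch of what a hypothetical `input/warc.paths` file could look like when reading WARC files from S3 through the `s3a://` scheme configured above (bucket and object keys are invented for illustration):

```
s3a://my-warc-bucket/crawl/2021/part-00000.warc.gz
s3a://my-warc-bucket/crawl/2021/part-00001.warc.gz
```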
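
If the topology is submitted programmatically rather than configured via a YAML file, the same settings can be passed as a map under the `hdfs` key, in the style of the Java snippet the diff removes. A minimal sketch, assuming the s3a property names shown above; the class name and the environment-variable lookup are illustrative, not part of the original:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;

public class WarcS3Example {
    public static void main(String[] args) {
        // Forward s3a credentials to the Hadoop filesystem used for
        // non-HTTP input paths, under the same "hdfs" key as in the YAML above.
        Map<String, Object> hdfsConf = new HashMap<>();
        hdfsConf.put("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));     // illustrative credential source
        hdfsConf.put("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY")); // illustrative credential source

        Config conf = new Config();
        conf.put("hdfs", hdfsConf);
        // ... build the topology and submit it with `conf`
    }
}
```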