Added config info to README
Signed-off-by: Julien Nioche <[email protected]>
jnioche committed Nov 16, 2023
1 parent 3b56b92 commit 8b486ce
Showing 1 changed file with 22 additions and 6 deletions: external/warc/README.md
The WARCSpout is configured similarly to FileSpout:
- a pattern matching valid file names
- every line in the input files specifies one input WARC file, as a file path or URL.
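For illustration, an input file such as `input/warc.paths` might look like the following (both entries are made-up examples, one WARC file per line):

```
/data/warc/example-00000.warc.gz
https://example.com/warc/example-00001.warc.gz
```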

The non-HTTP input paths are loaded via HDFS, which can be configured using the key `hdfs`, e.g.

```
hdfs:
  fs.s3a.access.key: ${awsAccessKeyId}
  fs.s3a.secret.key: ${awsSecretAccessKey}
```
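With S3 credentials configured this way, lines in the input files can point directly at WARC files on S3 via the `s3a://` scheme (the bucket and key below are placeholders):

```
s3a://my-bucket/warc/segment-00000.warc.gz
```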

Please note that in order to access WARC files on AWS S3, you will need to add the following dependency to your project:

```
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>2.10.1</version>
</dependency>
```

where the version should match the one used by Apache Storm. If in doubt, you can check with:

```
mvn dependency:tree | grep "org.apache.hadoop:hadoop-hdfs:jar"
```


To use the WARCSpout to read `*.paths` or `*.txt` files from the folder `input/`, you simply start building your topology as described.
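As a minimal sketch of such a topology setup (the `WARCSpout` constructor arguments, component name, and class name here are assumptions based on FileSpout conventions, not taken from this README):

```java
import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.warc.WARCSpout;

public class WARCTopology {
    public static void main(String[] args) {
        // Read *.paths / *.txt files from the folder input/; each line of
        // those files points to one WARC file (file path or URL).
        // Constructor signature assumed, mirroring FileSpout.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("warc", new WARCSpout("input", "*.{paths,txt}"), 1);
        // Attach parsing / indexing bolts here, then submit the topology
        // as in any other StormCrawler topology.
    }
}
```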

