Skip to content

Latest commit

 

History

History
102 lines (80 loc) · 12.4 KB

data_provider.md

File metadata and controls

102 lines (80 loc) · 12.4 KB

Data providers

Description

Grabs articles (news or scientific articles) from the web and store them as jsonlines file. These collected data can then be used as input of the BERTrend demonstrators.

Several data providers are supported:

  • Arxiv
  • Google News
  • Bing News
  • NewsCatcher

API keys

Some data providers require the creation of an API key to work properly.

This is the case with the Arxiv and NewsCatcher data providers. The Arxiv data provider uses the Semantic Scholar API to enrich data. The API can be created for free on their website (https://www.newscatcherapi.com/, https://www.semanticscholar.org/product/api).

You have then to set the following environment variables:

export NEWSCATCHER_API_KEY=<your_api_key>
export SEMANTIC_SCHOLAR_API_KEY=<your_api_key>

Usage

python -m bertrend_apps.data_provider --help

Usage: python -m data_provider [OPTIONS] COMMAND [ARGS]...                                                                                           
                                                                                                                                                      
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion        [bash|zsh|fish|powershell|pwsh]  Install completion for the specified shell. [default: None]                           │
│ --show-completion           [bash|zsh|fish|powershell|pwsh]  Show completion for the specified shell, to copy it or customize the installation.    │
│                                                              [default: None]                                                                       │
│ --help                                                       Show this message and exit.                                                           │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ auto-scrape          Scrape data from Google or Bing news (multiple requests from a configuration file: each line of the file shall be compliant   │
│                      with the following format: <keyword list>;<after_date, format YYYY-MM-DD>;<before_date, format YYYY-MM-DD>)                   │
│ generate-query-file  Generates a query file to be used with the auto-scrape command. This is useful for queries generating many results. This will │
│                      split the broad query into many ones, each one covering an 'interval' (range) in days covered by each atomic request. If you  │
│                      want to cover several keywords, run the command several times with the same output file.                                      │
│ scrape               Scrape data from Google or Bing news (single request).                                                                        │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
python -m bertrend_apps.data_provider --help auto-scrape --help
                                                                                                                                                      
Usage: python -m data_provider auto-scrape [OPTIONS] [REQUESTS_FILE]                                                                                 
                                                                                                                                                      
 Scrape data from Google, Bing news or NewsCatcher (multiple requests from a configuration file: each line of the file shall be compliant with the    
 following format: <keyword list>;<after_date, format YYYY-MM-DD>;<before_date, format YYYY-MM-DD>)                                                   
 Parameters ---------- requests_file: str     Text file containing the list of requests to be processed provider: str     News data provider. Current 
 authorized values [google, bing, newscatcher] save_path: str     Path to the output file (jsonl format)                                              
 Returns -------                                                                                                                                      
                                                                                                                                                      
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   requests_file      [REQUESTS_FILE]  path of jsonlines input file containing the expected queries. [default: None]                                │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --max-results        INTEGER  maximum number of results per request [default: 50]                                                                  │
│ --provider           TEXT     source for news [google, bing, newscatcher] [default: google]                                                        │
│ --save-path          TEXT     Path for writing results. [default: None]                                                                            │
│ --help                        Show this message and exit.                                                                                          │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

python -m bertrend_apps.data_provider generate-query-file --help

Usage: python -m data_provider generate-query-file [OPTIONS] [KEYWORDS]                                                                              
                                                                                                                                                      
 Generates a query file to be used with the auto-scrape command. This is useful for queries generating many results. This will split the broad query  
 into many ones, each one covering an 'interval' (range) in days covered by each atomic request. If you want to cover several keywords, run the       
 command several times with the same output file.                                                                                                     
 Parameters ---------- keywords: str     query described as keywords after: str     "from" date, formatted as YYYY-MM-DD before: str     "to" date,   
 formatted as YYYY-MM-DD save_path: str     Path to the output file (jsonl format)                                                                    
 Returns -------                                                                                                                                      
                                                                                                                                                      
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   keywords      [KEYWORDS]  keywords for news search engine. [default: None]                                                                       │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --after            TEXT     date after which to consider news [format YYYY-MM-DD] [default: None]                                                  │
│ --before           TEXT     date before which to consider news [format YYYY-MM-DD] [default: None]                                                 │
│ --save-path        TEXT     Path for writing results. File is in jsonl format. [default: None]                                                     │
│ --interval         INTEGER  Range of days of atomic requests [default: 30]                                                                         │
│ --help                      Show this message and exit.                                                                                            │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Important note

You may expect a rate of 10-20% of articles not correctly processed because of:

  • problem of cookies management
  • errors 404, 403