Add Crawler.Store.all_urls/0
fredwu committed Sep 28, 2023
1 parent 66969f8 commit 0ebc113
Showing 4 changed files with 20 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## master

- [Added] `Crawler.Store.all_urls/0` to find all scraped URLs
- [Improved] Memory usage optimisations

## v1.1.2 [2021-10-14]
8 changes: 7 additions & 1 deletion README.md
@@ -126,7 +126,7 @@ end
Crawler provides `pause/1`, `resume/1` and `stop/1`; see below.

```elixir
-{:ok, opts} = Crawler.crawl("http://elixir-lang.org")
+{:ok, opts} = Crawler.crawl("https://elixir-lang.org")

Crawler.pause(opts)

@@ -137,6 +137,12 @@ Crawler.stop(opts)

Please note that when pausing Crawler, you need to set a large enough `:timeout` (or even set it to `:infinity`), otherwise the parser will time out due to unprocessed links.
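
A minimal sketch of the above, assuming `:timeout` is passed to `Crawler.crawl/2` as an option like the others:

```elixir
# Sketch only: use a generous :timeout (or :infinity) when you intend to
# pause the crawl, so queued links do not time out while paused.
{:ok, opts} = Crawler.crawl("https://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)

# ...later, pick the crawl back up...
Crawler.resume(opts)
```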

## Find All Scraped URLs

```elixir
Crawler.Store.all_urls() # => ["https://elixir-lang.org", "https://google.com", ...]
```
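
The list contains every URL Crawler has registered in its store, whether or not the page has finished processing (see the `Registry.select/2` call in `lib/crawler/store.ex` below). Since it is a plain list of strings, the usual `Enum` functions apply; the host prefix filtered on here is purely illustrative:

```elixir
# Illustrative only: count the scraped URLs under a given host.
Crawler.Store.all_urls()
|> Enum.filter(&String.starts_with?(&1, "https://elixir-lang.org"))
|> Enum.count()
```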

## API Reference

Please see https://hexdocs.pm/crawler.
4 changes: 4 additions & 0 deletions lib/crawler/store.ex
@@ -60,4 +60,8 @@ defmodule Crawler.Store do
def processed(url) do
{_new, _old} = Registry.update_value(DB, url, &%{&1 | processed: true})
end

def all_urls do
Registry.select(DB, [{{:"$1", :_, :_}, [], [:"$1"]}])
end
end
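
For context on the match spec used by `all_urls/0`: `Registry.select/2` takes an ETS-style match specification, and each Registry entry has the shape `{key, pid, value}`. The spec binds the key (the URL) to `:"$1"`, applies no guards, and returns just that key. A self-contained sketch of the same pattern against a scratch Registry (the `Demo.DB` name and the registered URL are hypothetical):

```elixir
# Scratch Registry demonstrating the match spec; names and data are illustrative.
{:ok, _} = Registry.start_link(keys: :unique, name: Demo.DB)
{:ok, _} = Registry.register(Demo.DB, "https://elixir-lang.org", %{processed: false})

# Each entry is {key, pid, value}; bind the key to :"$1" and return it.
Registry.select(Demo.DB, [{{:"$1", :_, :_}, [], [:"$1"]}])
# => ["https://elixir-lang.org"]
```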
8 changes: 8 additions & 0 deletions test/lib/crawler_test.exs
@@ -60,6 +60,14 @@ defmodule CrawlerTest do
assert Store.find_processed(linked_url2)
assert Store.find_processed(linked_url3)
refute Store.find(linked_url4)

urls = Crawler.Store.all_urls()

assert Enum.member?(urls, url)
assert Enum.member?(urls, linked_url1)
assert Enum.member?(urls, linked_url2)
assert Enum.member?(urls, linked_url3)
refute Enum.member?(urls, linked_url4)
end)

wait(fn ->
