Datasets ▶ Other metadata scrapes - Anna’s Archive

If you are interested in mirroring this dataset for archival or LLM training purposes, please contact us.

Overview from datasets page.

Source	Metadata	Last updated
Other metadata scrapes	👩‍💻 Anna’s Archive manages scrapes of metadata from other sources.	Varies

Various smaller or one-off metadata scrapes.

Collection	Page example	AAC example	Code	Notes
airitibooks			AAC generation code	Scrape of “iRead eBooks” (= phonetically “ai rit i-books”; airitibooks.com), by volunteer “j”. Corresponds to “airitibooks” subcollection in the “upload” dataset.
bloomsbury			AAC generation code	Metadata directly from the Bloomsbury Collections website transformed into AAC by volunteer “n”, who explains: “It gives a full set of ISBNs for each book. Many of these ISBNs are not easy to find via other sources.”
cerlalc	Page example	AAC example	AAC generation code	Data leak from CERLALC, a consortium of Latin American publishers, which included lots of book metadata. The original data (scrubbed of personal info) can be found in isbn-cerlalc-2022-11-scrubbed-annas-archive.sql.zst.torrent. Special thanks to the anonymous group that worked hard on this.
chinese_architecture			AAC generation code	Scrape of books about Chinese architecture, by volunteer “cm”: “I got it by exploiting a network vulnerability at the publishing house, but that loophole has since been closed”. Corresponds to “chinese_architecture” subcollection in the “upload” dataset.
czech_oo42hcks	Page example	AAC example	AAC generation code	Metadata extracted from CSV and Excel files, corresponding to “upload/misc/oo42hcksBxZYAOjqwGWu” in the “upload” dataset. Original files can be found through the Codes Explorer.
edsebk	Page example	AAC example	Scraper code	Scrape of EBSCOhost’s eBook Index (edsebk; “eds” = “EBSCOhost Discovery Service”, “ebk” = “eBook”). Scraper code written by our volunteer “tc”. This is a fairly small ebook metadata index, but it still contains some unique files. If you have access to the other EBSCOhost databases, please let us know, since we’d like to index more of them. The filename of the latest release (annas_archive_meta__aacid__ebscohost_records__20240823T161729Z--Wk44RExtNXgJ3346eBgRk9.jsonl) is incorrect (the timestamp should be a range, and there should not be a uid). We’ll correct this in the next release.
goodreads	Page example	AAC example	AAC generation code	Goodreads scrape by volunteer “tc”.
hentai			AAC generation code	Scrape of erotic books, by volunteer “do no harm”. Corresponds to “hentai” subcollection in the “upload” dataset.
isbndb	Page example	AAC example		ISBNdb is a company that scrapes various online bookstores to find ISBN metadata. We made an initial scrape in 2022, with more information in our blog post “ISBNdb dump, or How Many Books Are Preserved Forever?”. Future releases will be made in the AAC format. Release 1 (2022-10-31): This is a dump of a large number of calls to isbndb.com made during September 2022. We tried to cover all ISBN ranges. The dump contains about 30.9 million records. Their website claims that they actually have 32.6 million records, so we might somehow have missed some, or they could be doing something wrong. The JSON responses are pretty much raw from their server. One data quality issue we noticed is that for ISBN-13 numbers that start with a prefix other than “978-”, they still include an “isbn” field that is simply the ISBN-13 number with the first 3 digits chopped off (and the check digit recalculated). This is obviously wrong, but this is how they seem to do it, so we didn’t alter it. Another potential issue is that the “isbn13” field has duplicates, so you cannot use it as a primary key in a database. The “isbn13” and “isbn” fields combined do seem to be unique.
isbngrp	Page example	AAC example	AAC generation code	ISBN Global Register of Publishers scrape. Thanks to volunteer “g” for doing this: “using the URL `https://grp.isbn-international.org/piid_rest_api/piid_search?q="{}"&wt=json&rows=150` and recursively filling in the q parameter with all possible digits until the result is less than 150 rows.” It’s also possible to extract this information from certain books.
kulturpass			AAC generation code	Metadata scrape of Kulturpass, by volunteer “a”, who explains: “It seems that we have scraped the whole VLB! The VLB contains the metadata of every book you can order today in Germany from every shop. So that is the official source behind the Kulturpass app.”
libby	Page example	AAC example	AAC generation code	Libby (OverDrive) scrape by volunteer “tc”.
newsarch_magz			AAC generation code	Archive of newspapers and magazines. Corresponds to “newsarch_magz” subcollection in the “upload” dataset.
rgb	Page example	AAC example	AAC generation code	Scrape of the Russian State Library (Российская государственная библиотека; RGB) catalog, the third largest (regular) library in the world. Thanks to volunteer “w”.
trantor	Page example	AAC example	AAC generation code	Metadata dump from the “Imperial Library of Trantor” (named after the fictional library), corresponding to the “trantor” subcollection in the “upload” dataset. Converted from MongoDB dump.
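
The isbngrp note describes refining the q parameter digit by digit until a response holds fewer than 150 rows. The following is a minimal sketch of that idea, not the volunteer's actual code. It assumes the search behaves like a prefix match and that each record has an "id" field; the names crawl_by_prefix, make_fake_fetch and max_depth are hypothetical, and an in-memory stand-in replaces the real endpoint so the sketch runs offline.

```python
from typing import Callable, Dict, Iterable, List

ROW_LIMIT = 150  # page size taken from the quoted URL (rows=150)
DIGITS = "0123456789"


def crawl_by_prefix(
    fetch: Callable[[str], List[dict]],
    prefix: str = "",
    max_depth: int = 13,
) -> Dict[str, dict]:
    """Collect records by extending the query prefix one digit at a time.

    `fetch(prefix)` is a hypothetical callable returning at most ROW_LIMIT
    records whose "id" starts with `prefix` (an assumption about the API).
    """
    rows = fetch(prefix)
    found = {row["id"]: row for row in rows}
    # Fewer rows than the limit: the response was not truncated, so stop.
    # max_depth is a safety guard against unbounded recursion.
    if len(rows) < ROW_LIMIT or len(prefix) >= max_depth:
        return found
    # Possibly truncated: query every one-digit extension of the prefix.
    for digit in DIGITS:
        found.update(crawl_by_prefix(fetch, prefix + digit, max_depth))
    return found


def make_fake_fetch(ids: Iterable[str]) -> Callable[[str], List[dict]]:
    """In-memory stand-in for the remote search endpoint (assumption)."""
    ordered = sorted(ids)

    def fetch(prefix: str) -> List[dict]:
        matches = [{"id": i} for i in ordered if i.startswith(prefix)]
        return matches[:ROW_LIMIT]  # emulate the server-side row cap

    return fetch


if __name__ == "__main__":
    fake_ids = [f"{n:05d}" for n in range(2000)]
    records = crawl_by_prefix(make_fake_fetch(fake_ids))
    print(len(records))
```

Rows from a truncated response are kept as well as those from the deeper queries, so a record whose identifier equals the current prefix is not lost. Keying the result by "id" removes the overlap this creates.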
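
The two ISBNdb data quality notes can be illustrated with a short sketch. It reproduces the “isbn” derivation as the isbndb row describes it (drop the first 3 digits, recompute the ISBN-10 check digit), shows that a real ISBN-10 exists only for the 978 prefix, and indexes records by the combined “isbn13” and “isbn” key. The function names and the sample records are hypothetical; only the two field names come from the notes.

```python
from typing import Dict, Iterable, Optional, Tuple


def isbn10_check_digit(first_nine: str) -> str:
    """Standard ISBN-10 check digit: weights 10 down to 2, modulo 11."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbndb_style_isbn(isbn13: str) -> str:
    """Mimic the described behaviour: chop 3 digits, recompute the check."""
    digits = isbn13.replace("-", "")
    core = digits[3:12]  # the 9 digits between the prefix and check digit
    return core + isbn10_check_digit(core)


def to_isbn10(isbn13: str) -> Optional[str]:
    """Return an ISBN-10 only for the 978 prefix, where one is defined."""
    digits = isbn13.replace("-", "")
    if not digits.startswith("978"):
        return None  # 979 and other prefixes have no ISBN-10 equivalent
    return isbndb_style_isbn(digits)


def index_records(records: Iterable[dict]) -> Dict[Tuple[str, str], dict]:
    """Index records by (isbn13, isbn), since isbn13 alone has duplicates."""
    index: Dict[Tuple[str, str], dict] = {}
    for record in records:
        key = (record["isbn13"], record["isbn"])
        index.setdefault(key, record)  # keep the first record per key
    return index


if __name__ == "__main__":
    print(isbndb_style_isbn("9780306406157"))  # a 978-prefixed ISBN-13
    print(isbndb_style_isbn("9791234567896"))  # a 979-prefixed ISBN-13
    print(to_isbn10("9791234567896"))
```

When loading the dump into a database, a composite primary key over both fields follows the same idea as index_records.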

Resources