Technical Specifications
This page documents how we package, zip, and deliver files for both self-service and large/full-text datasets available through Data for Research. Specifications and examples for examining the structure of the metadata, full-text, and n-gram file types are also included.
File Delivery
If you requested a dataset through the self-service Data for Research site (www.jstor.org/dfr/), an email message with URL(s) for downloading the datasets will be sent to the email address registered with your account. Large/full-text datasets are handled by the JSTOR Support team and will be sent to you after the necessary agreement is completed and the dataset has been processed by our staff.
Each 1.5 GB (max) zip file resulting from the dataset parameters will have its own access URL. The URL does not require a login to access the zip files and will look like: http://data.jstor.org/deliver/#unique-string
Please note that you will have 60 days to download the zip files once they are created. After 60 days, the links will expire and a new request must be submitted.
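For example, a delivered zip can be fetched with nothing more than Python's standard library. In the sketch below, the URL and output file name are placeholders; substitute the link from your delivery email.

```python
# Minimal sketch: stream one delivered zip to disk with the standard library.
# The URL is a placeholder; substitute the link from your delivery email.
import shutil
import urllib.request

delivery_url = "http://data.jstor.org/deliver/your-unique-string"  # placeholder
with urllib.request.urlopen(delivery_url) as response, \
        open("dataset.zip", "wb") as out:
    shutil.copyfileobj(response, out)  # copied in chunks; files can be up to 1.5 GB
```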
Zip File Creation
A zip file can contain a mix of all content types (journals, books, research reports, pamphlets), depending on the dataset you have requested. Multiple zip files may be necessary for larger datasets.
Dataset zips are named using a “receipt” identifier generated by our system. Within the zip, files are grouped by type into separate folders, using the following naming conventions (a sketch for inspecting a delivered zip appears after the note below):
| output type | directory | file structure |
| --- | --- | --- |
| Metadata | metadata/ | <content-type>-<item-doi>.xml |
| N-Grams | ngrams[1-3]/ | <content-type>-<item-doi>.NGRAMS[1-3].txt |
| OCR/Full-text | ocr/ | <content-type>-<item-doi>.txt |

The <content-type> portion of the file name corresponds to one of the following document types:
- Journal documents (articles, book reviews, miscellaneous)
- Book documents (chapters)*
- Research report documents
- Pamphlets
*Note that we do not provide XML metadata at the chapter level, so if your dataset includes chapters you will get the full book XML each time a chapter record is processed.
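As a rough illustration, the following Python sketch lists the contents of a delivered zip grouped by top-level folder. The archive name is a placeholder for the receipt-named file you actually receive.

```python
# Minimal sketch: count the files in a delivered zip by top-level folder
# (metadata/, ngrams1/ ... ngrams3/, ocr/). "receipt-id.zip" stands in for
# the receipt-named archive you receive.
import zipfile
from collections import Counter
from pathlib import PurePosixPath

with zipfile.ZipFile("receipt-id.zip") as zf:
    folders = Counter(
        PurePosixPath(name).parts[0]          # top-level directory name
        for name in zf.namelist()
        if not name.endswith("/")             # skip directory entries
    )

for folder, count in sorted(folders.items()):
    print(f"{folder}: {count} files")
```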
Sample Files
- Journal article metadata (.xml)
- Book metadata (.xml)
- Research report metadata (.xml)
- Pamphlet metadata (.xml)
- Unigrams [NGRAM1] (.txt)
- Bigrams [NGRAM2] (.txt)
- Trigrams [NGRAM3] (.txt)
- OCR/Full-Text (.txt)
Metadata Scheme
The XML files are presented:
- In UTF-8, with only amp, gt, lt, and quot used as character entity references when necessary
- Without a doctype declaration
- Well-formed and schema-valid with respect to the appropriate XML Schema
Journal and Pamphlet Metadata
Journal articles follow the JATS 1.0 Archiving tag set (as retrieved in its XSD form in May of 2013). Each file contains an /article/front/journal-meta element with brief details about the journal and an /article/front/article-meta element describing the article.
The journal-meta element contains elements that describe the journal, most significantly:
- journal-id, with journal-id-type="jstor" or "publisher-id", a unique system identifier for that journal.
- issn elements, for the ISSN values that apply to the journal.
- journal-title, as it was included in the source, not necessarily what the currently-publishing title is.
- publisher-name, as it was included in the source, not necessarily the publisher of the currently-publishing title.
The article-meta element can contain a variety of data. Elements that are reliably present (a parsing sketch follows this list):
- article-id
  - When pub-id-type="doi", this is a registered DOI.
  - When pub-id-type="jstor", this is NOT a registered DOI.
  - IDs that are not labeled as DOI have not been registered with CrossRef by JSTOR and should not be used with DOI resolution services.
- article-categories may be present, describing or grouping this article with others.
- article-title, if there was a title given in the source.
- contrib-group, for authors and possibly other creator roles identified in the source, as contrib elements within.
- At least one pub-date element, with as much date information as the print source or electronic source contained. Some journals at times have multiple dates on a given issue, and these are all listed.
- An fpage ("first page") where the source provided it.
- Where the article is a review article, there may be one or more product elements describing works reviewed.
- A permissions element containing whatever copyright-statement or license statement the source provided.
- A self-uri element with an xlink:href containing a URL to the landing page for that publication at JSTOR.
- A kwd-group is included if the source supplied one.
- A custom-meta-group containing a custom-meta representing the languages present in the article. JSTOR's digitized backfile metadata tracks the important languages present, allowing for there to be more than one, and they are listed here in roughly the order of quantity.
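As a rough illustration of working with these files, the Python sketch below pulls a few of the reliably present fields out of a journal article metadata file using the standard-library XML parser. The file name and exact element paths are assumptions for the purpose of the example.

```python
# Minimal sketch, assuming a well-formed JATS file as described above.
# The file name is hypothetical; element paths may vary slightly by article.
import xml.etree.ElementTree as ET

root = ET.parse("metadata/journal-article-10.2307_123456.xml").getroot()
journal_meta = root.find("front/journal-meta")
article_meta = root.find("front/article-meta")

journal_title = journal_meta.findtext(".//journal-title")
article_title = article_meta.findtext(".//article-title")

# article-id values: pub-id-type="doi" is a registered DOI,
# pub-id-type="jstor" is NOT a registered DOI.
ids = {el.get("pub-id-type"): el.text
       for el in article_meta.findall("article-id")}

# every pub-date the source carried (there may be more than one)
years = [pd.findtext("year") for pd in article_meta.findall("pub-date")]

print(journal_title, article_title, ids, years)
```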
Book and Research Report Metadata
Note that we do not provide metadata files for individual book chapters, so if your dataset includes book chapters you will get the full book XML each time a chapter record is processed. However, we do provide full-text and n-grams at the chapter level.
The book files follow the draft BITS 0.2 tag set (as retrieved in the DTD form available in May of 2013). The book files were made by transforming NLM Books 3 source into BITS 0.2, and sometimes include a generated table of contents for the book when the source did not include an explicit one.
Each book file has a /book/collection-meta, a /book/book-meta, and a /book/book-body.
The book-meta element reliably includes:
- A book-id with book-id-type="jstor", containing a string that is NOT a registered DOI.
- At least one subject-group element, the subject sub-elements of which will have a content-type attribute. The value of the content-type attribute describes the data in the subject:
  - call-number gives the LOC Call Number for the book.
  - lcsh gives LOC subject headings captured from the verso of the book itself.
  - discipline gives a string corresponding to the browse-level subjects or discipline names used on JSTOR.org.
- A book-title-group/book-title.
- A contrib-group, containing contrib elements. The BITS content model for contrib allowed us to use name elements on our source data, so name is usually present.
- A pub-date element.
- One or more isbn elements, with the ISBN values JSTOR has captured for the book.
- A publisher element with the publisher-name and publisher-loc.
- A permissions element with copyright or license information.
- A self-uri element with an xlink:href containing a URL to the landing page on JSTOR.org for the book.
- A counts element, listing the number of pages in the actual book.
- A custom-meta-group for languages present, similar to that in journals above.
The book-body will contain the supplied or generated TOC, including the following (a parsing sketch follows this list):
- A tree of book-part elements, the leaf book-parts corresponding to some portion of the whole book as defined by JSTOR or the publisher, in the order they appear in the book.
- The leaf book-part elements may contain:
  - A title-group/title
  - A contrib-group, if the portion of the book has authors different from the book as a whole
  - An fpage
  - An abstract (The abstract-type attribute, when present, designates what kind of text is in the abstract. "Extract" means that the text is just the first 100 words.)
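The Python sketch below is a rough illustration of reading a book file: it pulls a few book-meta fields and walks the leaf book-part elements of the TOC. The file name is hypothetical, and element paths may vary from book to book.

```python
# Minimal sketch, assuming a BITS 0.2 book file as described above.
# The file name is hypothetical; adjust paths to the files in your dataset.
import xml.etree.ElementTree as ET

root = ET.parse("metadata/book-10.2307_j.ctt123456.xml").getroot()
book_meta = root.find("book-meta")

book_id = book_meta.findtext("book-id")        # a JSTOR id, NOT a registered DOI
title = book_meta.findtext(".//book-title")
isbns = [el.text for el in book_meta.findall(".//isbn")]

# subject sub-elements are distinguished by their content-type attribute
subjects = {el.get("content-type"): el.text
            for el in book_meta.findall(".//subject-group/subject")}

# walk the TOC: leaf book-parts are those containing no nested book-part
for part in root.find("book-body").iter("book-part"):
    if part.find(".//book-part") is None:
        print(part.findtext(".//title"), part.findtext(".//fpage"))
```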
N-grams
We use the Apache Lucene StandardTokenizer to generate n-grams in Data for Research datasets with these parameters:
- Apache Lucene StandardTokenizer
- LowerCaseFilter: normalizes tokens to lower case so that matching is case-insensitive
- StopFilter with ENGLISH_STOP_WORD_SET: {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"}
N-grams are delivered as tab-delimited .txt files, one gram and its count per row, sorted in descending order by count.
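A file in this format can be loaded with a few lines of Python; the sketch below builds a dictionary of counts. The file name is hypothetical.

```python
# Minimal sketch: read one delivered n-gram file into a dict of counts.
# The file name is hypothetical; rows are "<gram><TAB><count>", already
# sorted by descending count.
counts = {}
with open("ngrams1/journal-article-10.2307_123456.NGRAMS1.txt",
          encoding="utf-8") as fh:
    for line in fh:
        gram, _, count = line.rstrip("\n").rpartition("\t")
        if gram and count.isdigit():
            counts[gram] = int(count)

print(list(counts.items())[:10])   # the ten most frequent grams
```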
OCR/Full-text
The document full text is delivered in a separate .txt file. The markup indicates each page with an element such as <page sequence="1"></page>, and the full contents are wrapped in a <plain_text> tag.
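The sketch below is one rough way to work with such a file in Python: it counts the page markers and strips the wrapper tags to recover the plain text. The file name is hypothetical, and a regular expression is used because raw OCR text may not be strictly well-formed XML.

```python
# Minimal sketch: count the page markers and strip the wrapper markup to get
# plain text. A regular expression is used rather than an XML parser because
# raw OCR text is not guaranteed to be strictly well-formed XML.
# The file name is hypothetical.
import re

with open("ocr/journal-article-10.2307_123456.txt", encoding="utf-8") as fh:
    raw = fh.read()

page_count = len(re.findall(r'<page sequence="\d+"', raw))
text = re.sub(r"</?(?:plain_text|page)[^>]*>", "", raw)

print(page_count, "pages,", len(text.split()), "words")
```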
It’s important to note that JSTOR incorporates metadata and OCR text from many different sources. We work to mitigate any inconsistencies, but researchers may encounter issues with datasets, such as duplicate OCR, missing OCR, or incomplete metadata. We are unable to assist with the clean-up of datasets affected by OCR errors or missing data.