# kremlin_en - A textual dataset based on the contents published on the English-language version of the Kremlin’s website

This document outlines how this dataset was created and provides summary statistics.

2021-05-09

# The Kremlin’s website: kremlin.ru

The website kremlin.ru is the official website of the president of the Russian Federation. The most prominent part of the website is composed of news items reporting on the president’s activities, speeches, and interviews.

## Summary statistics

• Date of earliest content included: 1999-12-31
• Date of most recent content included: 2020-12-31
• Number of documents: 24 338
• Number of words (tokens): 9 361 892
• Columns included in the main dataset: doc_id, text, date, title, location, link, id, term

## Structure of the dataset

The dataset is constructed in line with the Text Interchange Format (TIF) for increased compatibility with different software, and is made available in two tabular formats, as corpus and as tokens.

### Corpus

file: president_ru-en_corpus.csv

This is the standard definition of a corpus provided by the TIF initiative, which accurately characterises the present dataset:

corpus (data frame) - A valid corpus data frame object is a data frame with at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding. Document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Addition document-level metadata columns and corpus level attributes are allowed but not required. (Arnold et al. 2019)
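The requirements quoted above can be checked mechanically. The following is a minimal sketch in base R; `is_tif_corpus()` and `toy_corpus` are hypothetical names introduced here for illustration and are not part of the dataset or of any package.

```r
# Hypothetical helper (an assumption, not part of the dataset):
# checks the minimal requirements of the TIF corpus definition quoted above.
is_tif_corpus <- function(x) {
  is.data.frame(x) &&
    ncol(x) >= 2 &&
    identical(names(x)[1:2], c("doc_id", "text")) &&  # first two columns, in order
    is.character(x$doc_id) &&
    is.character(x$text) &&
    anyDuplicated(x$doc_id) == 0 &&                   # document ids must be unique
    all(validUTF8(x$doc_id)) &&
    all(validUTF8(x$text))
}

# Toy data with invented text, shaped like the corpus described here
toy_corpus <- data.frame(
  doc_id = c("president_ru-en-000064", "president_ru-en-064750"),
  text = c("First toy document.", "Second toy document."),
  stringsAsFactors = FALSE
)

is_tif_corpus(toy_corpus)
```

Extra metadata columns such as date or title do not affect the check, since the definition only constrains the first two columns.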

This is a detailed description of all columns in the corpus dataset:

• doc_id: a composite string, designed to keep the identifier unique even when this dataset is used together with other similarly structured datasets. Its elements are separated by a hyphen-minus, as in this example: president_ru-en-012345. Its components are:
    • “president_ru”: president of Russia (“ru” based on the ISO 3166-1 alpha-2 standard for two-letter country codes)
    • “en” or “ru”: language of the source website as a two-letter code, following the ISO 639-1 standard
    • a numeric id of 6 digits: a unique numeric id. For ease of reference, it is based on the series of digits found at the end of each relevant URL, e.g. 064750 for the document found at: http://en.special.kremlin.ru/events/president/news/64750. To enable consistent ordering, the id is always composed of 6 digits; for example, if the final digits of the link are 64, the numeric id would be “000064”.
• text: this includes the full text of the document, including the title and the textual string with date and location (when present).
• date: date of publication in the year-month-day format (YYYY-MM-DD), in line with the ISO 8601 standard.
• title: the title of the document
• location: the location from where the document was issued as reported at the beginning of each post, e.g. “Novo-Ogaryovo, Moscow Region”; if not given, an empty string.
• link: a URL, source of the document
• id: numeric id; includes only the numeric part of doc_id, which may be useful if only a numeric identifier is needed.
• term: a character string referring to the presidential term. The period after Yeltsin’s resignation but before Putin’s first inauguration in May 2000 is indicated as “Putin 0”; the following terms are indicated as “Putin 1”, “Putin 2”, “Medvedev 1”, “Putin 3”, and “Putin 4”.
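The doc_id convention described in the list above can be illustrated with a short base R sketch. `make_doc_id()` is a hypothetical helper written for this example; the scripts actually used to build the dataset may construct the identifier differently.

```r
# Hypothetical helper (an assumption, not the author's code):
# builds a doc_id from the URL of a document.
make_doc_id <- function(link, language = "en") {
  # Keep only the run of digits at the very end of the URL
  numeric_id <- as.integer(sub("^.*/([0-9]+)/?$", "\\1", link))
  # Left-pad with zeros to six digits so that ids sort consistently
  sprintf("president_ru-%s-%06d", language, numeric_id)
}

doc_id <- make_doc_id("http://en.special.kremlin.ru/events/president/news/64750")
doc_id

# Going the other way: recover the numeric part, as stored in the id column
numeric_part <- as.integer(sub("^.*-", "", doc_id))
numeric_part
```

Zero-padding matters because doc_id is a character vector: without it, alphabetical sorting would place "64750" before "7".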

### Tokens

file: president_ru-en_token.csv

This is the standard definition of tokens provided by the TIF initiative:

tokens (data frame) - A valid data frame tokens object is a data frame with at least two columns. There must be a column called doc_id that is a character vector with UTF-8 encoding. Document ids must be unique. There must also be a column called token that must also be a character vector in UTF-8 encoding. Each individual token is represented by a single row in the data frame. Addition token-level metadata columns are allowed but not required. (Arnold et al. 2019)

The column definition is exactly the same as for corpus, with a token column instead of text.
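To make the relation between the two formats concrete, here is a minimal base R sketch that turns a toy corpus into a one-row-per-token data frame. It splits on whitespace only, which is an assumption made for brevity: the tokenizer actually used for the released file is not described in this document, and the toy text is invented.

```r
# Toy corpus with invented text; column names follow the corpus format
corpus <- data.frame(
  doc_id = c("president_ru-en-000001", "president_ru-en-000002"),
  text = c("Good afternoon colleagues", "Thank you"),
  stringsAsFactors = FALSE
)

# Naive whitespace tokenization (an assumption, not the dataset's method)
token_list <- strsplit(corpus$text, "\\s+")

# One row per token, repeating the doc_id of the source document
tokens <- data.frame(
  doc_id = rep(corpus$doc_id, lengths(token_list)),
  token = unlist(token_list),
  stringsAsFactors = FALSE
)

tokens
```

Document-level metadata columns (date, title, and so on) could be carried over in the same way, by repeating each value once per token.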

### Corpus as a txt file

file: president_ru-en_corpus.txt

For ease of use, the full corpus is also provided as a single txt file. This may be useful for an early exploration of the corpus, or for less technically inclined users who may simply want to check how given terms are used by means of the “find” (CTRL+F) functionality found in browsers, text editors, or word processors.

### Aggregated data

To facilitate processing, the tokens are also provided in aggregated and pre-processed form. It should be possible to open the following files without using specialised software.

file: president_ru-word_count_total.csv

This file provides easy access to the words most frequently used across the full corpus.

It is composed of two columns:

• word: individual words, reduced to their stems (e.g. “reading”, “reads”, and “read” are all counted as “read”) with the Porter algorithm, as implemented in the SnowballC package. Common stopwords have been removed (see full list).
• n: number of times a given word has been found in the corpus.
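As an illustration of how a two-column table of this kind can be derived from tokens, the base R sketch below lowercases a toy token vector, drops stopwords, and counts what remains. It is a simplification under stated assumptions: the stopword list is a tiny invented one, and the stemming step (which could be done with `SnowballC::wordStem()` before counting) is skipped to keep the example free of dependencies.

```r
# Toy tokens (invented for illustration)
tokens <- c("the", "president", "met", "the", "minister",
            "and", "the", "president", "spoke")

# Tiny illustrative stopword list, NOT the list used for the dataset
stopwords <- c("the", "and")

kept <- tolower(tokens)
kept <- kept[!kept %in% stopwords]

# Count occurrences and shape the result as columns `word` and `n`
counts <- as.data.frame(table(word = kept),
                        responseName = "n",
                        stringsAsFactors = FALSE)

# Most frequent words first, ties broken alphabetically
counts <- counts[order(-counts$n, counts$word), ]
counts
```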

file: president_ru-word_count_by_year.ods

This file is provided in the .ods format, typically used by LibreOffice, to further facilitate access for users more accustomed to office packages.

The file has one sheet per year, and each sheet has two columns similar to the ones described above. This should make it possible, for example, to create graphs on the use of a set of words in each year covered by the corpus.
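The sketch below shows one way such per-year tables could be combined to follow a single word over time. It assumes the sheets have already been loaded into a named list of data frames, one per year; the list `by_year`, the helper `count_for_word()`, and all figures are invented for illustration and do not come from the dataset.

```r
# Toy stand-in for the per-year sheets (all numbers are invented)
by_year <- list(
  "2018" = data.frame(word = c("economi", "secur"), n = c(120L, 80L),
                      stringsAsFactors = FALSE),
  "2019" = data.frame(word = c("economi", "secur"), n = c(95L, 110L),
                      stringsAsFactors = FALSE),
  "2020" = data.frame(word = "economi", n = 130L,
                      stringsAsFactors = FALSE)
)

# Hypothetical helper: count of one stemmed word in one sheet, 0 if absent
count_for_word <- function(sheet, target) {
  hit <- sheet$n[sheet$word == target]
  if (length(hit) == 0) 0L else hit[1]
}

# One row per year for the chosen word
trend <- data.frame(
  year = as.integer(names(by_year)),
  n = vapply(by_year, count_for_word, integer(1), target = "secur")
)
trend
```

Because words are stored as stems, the lookup has to use the stemmed form ("secur" rather than "security"). A data frame like `trend` can then be passed to `plot(trend$year, trend$n, type = "b")`.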

# Data collection

Contents have been downloaded and extracted using the R programming language and a dedicated package created by this author (Comai 2016, 2017).

Relevant scripts enabling replication are included in the R folder.

This dataset is expected to be updated on a yearly basis.

# Availability as an R package

To facilitate use with the R programming language, the corpus is available as an R data package.

It can be installed with the following command:

```r
remotes::install_github("giocomai/tifkremlinen")
```


The corpus is available as a data frame which, after installation, can be accessed as follows:

```r
tifkremlinen::kremlin_en
```
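Once loaded, the data frame can be filtered with ordinary base R subsetting. The sketch below uses a small toy data frame with the same column names so that it runs without the package installed; the rows are invented, and the assumption that the date column can be converted with `as.Date()` has not been verified against the package.

```r
# Toy stand-in with invented rows; with the package installed,
# `toy` could be replaced by tifkremlinen::kremlin_en
toy <- data.frame(
  doc_id = c("president_ru-en-000001", "president_ru-en-000002",
             "president_ru-en-000003"),
  date = c("2000-01-15", "2009-06-01", "2019-03-20"),
  term = c("Putin 0", "Medvedev 1", "Putin 4"),
  stringsAsFactors = FALSE
)

# Documents published during a given presidential term
medvedev_docs <- toy[toy$term == "Medvedev 1", ]

# Number of documents per calendar year
docs_per_year <- table(format(as.Date(toy$date), "%Y"))

medvedev_docs
docs_per_year
```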


# Interactive data exploration

It is possible to explore this dataset and conduct some basic word frequency analysis using an interactive web interface maintained by the author at the following link:

https://castarter.giorgiocomai.eu/kremlin_en/

# Files included in this release

| path | size |
|---|---|
| president_ru-en_corpus.csv | 62.39M |
| president_ru-en_corpus.txt | 58.16M |
| president_ru-word_count_by_year.ods | 1.89M |
| president_ru-word_count_total.csv | 408.19K |

# Licensing

As of this writing in March 2021, the footer of the Kremlin’s website includes the following notice: