
create_docs

Prepare documents for topic extraction.

DocStudy

Bases: TypedDict

Data container for a study that will be used to generate a doc.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| title | str | Title of the study. |
| abstract | str | Abstract of the study. |
| keywords | str | Keywords of the study. |

Examples:

>>> study: DocStudy = {
...     "title": "machine learning",
...     "abstract": "machine learning is often used in the industry with the goal of...",
...     "keywords": "machine learning, code smells, defect detection"
... }
>>> study
{'title': 'machine learning', 'abstract': 'machine learning is often used in the industry with the goal of...', 'keywords': 'machine learning, code smells, defect detection'}
Source code in src/sesg/topic_extraction/create_docs.py
class DocStudy(TypedDict):
    """Data container for a study that will be used to generate a doc.

    Attributes:
        title (str): Title of the study.
        abstract (str): Abstract of the study.
        keywords (str): Keywords of the study.

    Examples:
        >>> study: DocStudy = {
        ...     "title": "machine learning",
        ...     "abstract": "machine learning is often used in the industry with the goal of...",
        ...     "keywords": "machine learning, code smells, defect detection"
        ... }
        >>> study
        {'title': 'machine learning', 'abstract': 'machine learning is often used in the industry with the goal of...', 'keywords': 'machine learning, code smells, defect detection'}
    """  # noqa: E501

    title: str
    abstract: str
    keywords: str
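
Because DocStudy is a TypedDict, a value of this type is an ordinary dict at runtime; the declared keys and types are only checked by a static type checker. The following is a minimal sketch of that behavior, with the class re-declared so the snippet runs on its own. The variable names `is_plain_dict` and `required_keys` are illustrative only.

```python
from typing import TypedDict


class DocStudy(TypedDict):
    # Re-declared from the source above so this snippet is self-contained.
    title: str
    abstract: str
    keywords: str


study: DocStudy = {
    "title": "machine learning",
    "abstract": "machine learning is often used in the industry with the goal of...",
    "keywords": "machine learning, code smells, defect detection",
}

# At runtime a TypedDict value is a plain dict; nothing is validated.
is_plain_dict = type(study) is dict

# The declared keys are available for introspection.
required_keys = sorted(DocStudy.__required_keys__)
```

Since no validation happens at runtime, any checks on incoming data (for example, that all three fields are present) have to be done by the caller.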

concat_study_info(study)

Concatenates the information of the study into a string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| study | DocStudy | Study with title, abstract and keywords. | required |

Returns:

| Type | Description |
| --- | --- |
| str | A string with the following format: "{title}\n{abstract}\n{keywords}". |

Examples:

>>> study: DocStudy = {
...     "title": "machine learning",
...     "abstract": "machine learning is often used in the industry with the goal of...",
...     "keywords": "machine learning, code smells, defect detection"
... }
>>> concat_study_info(study)
'machine learning\nmachine learning is often used in the industry with the goal of...\nmachine learning, code smells, defect detection'
Source code in src/sesg/topic_extraction/create_docs.py
def concat_study_info(
    study: DocStudy,
) -> str:
    r"""Concatenates the information of the study into a string.

    Args:
        study (DocStudy): Study with title, abstract and keywords.

    Returns:
        A string with the following format: "{title}\n{abstract}\n{keywords}".

    Examples:
        >>> study: DocStudy = {
        ...     "title": "machine learning",
        ...     "abstract": "machine learning is often used in the industry with the goal of...",
        ...     "keywords": "machine learning, code smells, defect detection"
        ... }
        >>> concat_study_info(study)
        'machine learning\nmachine learning is often used in the industry with the goal of...\nmachine learning, code smells, defect detection'
    """  # noqa: E501
    title = study["title"]
    abstract = study["abstract"]
    keywords = study["keywords"]

    return f"{title}\n{abstract}\n{keywords}"
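
Note that concat_study_info indexes all three keys directly, so a record that lacks one of them raises KeyError. The sketch below shows one possible way to normalize incomplete records first. `to_doc_study` is a hypothetical helper that is not part of sesg, and filling missing fields with empty strings is an assumption made for illustration.

```python
from typing import Mapping, TypedDict


class DocStudy(TypedDict):
    title: str
    abstract: str
    keywords: str


def concat_study_info(study: DocStudy) -> str:
    # Same logic as the source above, condensed.
    return f"{study['title']}\n{study['abstract']}\n{study['keywords']}"


def to_doc_study(record: Mapping[str, str]) -> DocStudy:
    # Hypothetical helper: fill any missing field with an empty string.
    return {
        "title": record.get("title", ""),
        "abstract": record.get("abstract", ""),
        "keywords": record.get("keywords", ""),
    }


# A raw record without a "keywords" field.
raw = {"title": "machine learning", "abstract": "machine learning is often used..."}

doc = concat_study_info(to_doc_study(raw))

# The missing field becomes an empty last line, so the doc still has three parts.
title, abstract, keywords = doc.split("\n")
```

Whether an empty string is an acceptable placeholder depends on the downstream topic extraction step; skipping incomplete records is an equally reasonable choice.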

create_docs(studies_list)

Creates a list of documents where each document is a string with the title, abstract and keywords of the study.

The resulting documents can be used with extract_topics_with_lda or extract_topics_with_bertopic.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| studies_list | list[DocStudy] | List of studies with title, abstract and keywords. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of documents. |

Examples:

>>> s1: DocStudy = {
...     "title": "machine learning",
...     "abstract": "machine learning is often used in the industry with the goal of...",
...     "keywords": "machine learning, code smells, defect detection"
... }
>>> s2: DocStudy = {
...     "title": "artificial intelligence",
...     "abstract": "artificial intelligence is often used in the industry with the goal of...",
...     "keywords": "artificial intelligence, code smells, defect detection"
... }
>>> create_docs([s1, s2])
['machine learning\nmachine learning is often used in the industry with the goal of...\nmachine learning, code smells, defect detection', 'artificial intelligence\nartificial intelligence is often used in the industry with the goal of...\nartificial intelligence, code smells, defect detection']
Source code in src/sesg/topic_extraction/create_docs.py
def create_docs(
    studies_list: list[DocStudy],
) -> list[str]:
    r"""Creates a list of documents where each document is a string with the title, abstract and keywords of the study.

    Can be used with [extract_topics_with_lda][sesg.topic_extraction.extract_topics_with_lda] or [extract_topics_with_bertopic][sesg.topic_extraction.extract_topics_with_bertopic].

    Args:
        studies_list (list[DocStudy]): List of studies with title, abstract and keywords.

    Returns:
        List of documents.

    Examples:
        >>> s1: DocStudy = {
        ...     "title": "machine learning",
        ...     "abstract": "machine learning is often used in the industry with the goal of...",
        ...     "keywords": "machine learning, code smells, defect detection"
        ... }
        >>> s2: DocStudy = {
        ...     "title": "artificial intelligence",
        ...     "abstract": "artificial intelligence is often used in the industry with the goal of...",
        ...     "keywords": "artificial intelligence, code smells, defect detection"
        ... }
        >>> create_docs([s1, s2])
        ['machine learning\nmachine learning is often used in the industry with the goal of...\nmachine learning, code smells, defect detection', 'artificial intelligence\nartificial intelligence is often used in the industry with the goal of...\nartificial intelligence, code smells, defect detection']
    """  # noqa: E501
    return [concat_study_info(s) for s in studies_list]
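
In practice, the studies usually come from a file rather than from literals. The sketch below builds `studies_list` from CSV text and passes it to create_docs. The CSV layout (columns named title, abstract and keywords) and the sample rows are assumptions for illustration, and the two functions are re-declared in condensed form so the snippet is self-contained.

```python
import csv
import io
from typing import TypedDict


class DocStudy(TypedDict):
    title: str
    abstract: str
    keywords: str


def concat_study_info(study: DocStudy) -> str:
    return f"{study['title']}\n{study['abstract']}\n{study['keywords']}"


def create_docs(studies_list: list[DocStudy]) -> list[str]:
    return [concat_study_info(s) for s in studies_list]


# Hypothetical input: a CSV export with title, abstract and keywords columns.
# The keywords field is quoted because it contains commas.
csv_text = (
    "title,abstract,keywords\n"
    'machine learning,machine learning is often used...,"machine learning, code smells"\n'
    'artificial intelligence,artificial intelligence is often used...,"artificial intelligence, code smells"\n'
)

# Pick out only the three fields DocStudy declares from each row.
studies: list[DocStudy] = [
    {"title": row["title"], "abstract": row["abstract"], "keywords": row["keywords"]}
    for row in csv.DictReader(io.StringIO(csv_text))
]

docs = create_docs(studies)
```

Here `docs` holds one string per CSV row, in the same order as the input, which is the shape described in the Returns section above.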