# Custom Pylint checks

## Table of contents

* [Introduction](#introduction)
  * [Limits of linters](#limits-of-linters)
* [Write checkers](#write-checkers)
  * [Types of checkers](#types-of-checkers)
  * [Hello world](#hello-world)
    * [Write the AST checker](#write-the-ast-checker)
      * [Examine the AST](#examine-the-ast)
      * [Design the AST checker](#design-the-ast-checker)
      * [Write the AST checker code](#write-the-ast-checker-code)
    * [Write the token checker](#write-the-token-checker)
      * [Examine the tokens](#examine-the-tokens)
      * [Design the token checker](#design-the-token-checker)
      * [Write the token checker code](#write-the-token-checker-code)
    * [Write the raw checker](#write-the-raw-checker)
      * [Design the raw checker](#design-the-raw-checker)
      * [Write the raw checker code](#write-the-raw-checker-code)
    * [Register the checker](#register-the-checker)
  * [Beyond hello world](#beyond-hello-world)
* [Test custom Pylint checks](#test-custom-pylint-checks)
  * [Design tests](#design-tests)
  * [Pylint testing utilities](#pylint-testing-utilities)
  * [Use testing utilities for hello world checkers](#use-testing-utilities-for-hello-world-checkers)
    * [Tests for hello world AST checker](#tests-for-hello-world-ast-checker)
    * [Tests for hello world token checker](#tests-for-hello-world-token-checker)
    * [Tests for hello world raw checker](#tests-for-hello-world-raw-checker)
  * [Run tests](#run-tests)

## Introduction

We use [Pylint](https://pylint.pycqa.org) to lint our Python code. Before writing a custom rule, you should check whether you can achieve your goals with [Pylint's built-in rules and configurations](https://pylint.pycqa.org/en/latest/technical_reference/features.html). If those are insufficient, you can follow this guide to write a custom Pylint checker.

### Limits of linters

Before we get too far into custom lint rules, let's make sure we understand what linters like ESLint are actually capable of doing.

Consider two types things you might want to prohibit with a lint rule:

* A property of the code's text. For example, you might want to ensure that all lines are at most 80 characters long.
* A property of the code's behavior. For example, you might want to ensure that the code, when executed, does not print a particular string.

Linters operate on the code as text, so you can write a lint rule that bans a textual property with perfect accuracy. However, any lint rule that tries to prohibit a behavior must be imperfect. To see this intuitively, consider the problem we'll approach below: writing a linter to prohibit programs that print `Hello, world!`. If our linter searched for the string `Hello, world!`, it would miss a program that printed `'Hello, ' + 'world!'`.

If you are interested in computability theory, the technical description for this problem is that predicting a program's behavior is an undecidable problem. Look up the halting problem if you want to learn more.

Whenever you are creating a new lint rule, consider whether you are trying to prohibit a textual or a behavioral property. If you want to prohibit a textual property, then your task is easy--just write a lint rule that perfectly blocks the property. On the other hand, if you want to prohibit a behavior, then you'll need to first decide on textual properties that you can prohibit to approximately prohibit the behavior. For example, if you want to stop programs from printing `Hello, world!`, it's approximately equivalent to prohibit literal `Hello, world!` strings. Then, write your lint rule to perfectly prohibit those textual properties. **Note that since your textual properties will be an imperfect translation of the behavioral properties you want to prohibit, your lint rule will miss some code that performs the banned behavior, will flag some code that doesn't actually perform the banned behavior, or both.** That's expected given the limits of linters.

## Write checkers

### Types of checkers

Pylint supports three kinds of checkers:

* **AST checkers** operate on an [abstract syntax tree (AST)](https://en.wikipedia.org/wiki/Abstract_syntax_tree) of the code that Pylint generates. The checker provides functions to be called as Pylint traverses the AST, and these functions can raise lint errors.

* **Token checkers** operate on a list of the tokens that represents the code being linted. The tokens are of the same form as the tokens generated by [`tokenize.tokenize()`](https://docs.python.org/3/library/tokenize.html#tokenize.tokenize).

* **Raw checkers** operate on the raw code without any pre-processing by Pylint.

You should avoid writing custom raw checkers since they are less efficient than token or AST checkers. You should use an AST checker whenever the syntax information of an AST is useful. For example, if you want to count the number of variables in each function, the AST can tell you where the variables are and where the functions start and end. If you don't need syntax information, you can just use a token checker. Since most lint rules are easier to write given syntax information, you'll probably write mostly AST checkers. However, token checkers can be useful for simple rules, particularly those where you only need to consider one token at a time. For example, a token checker would be a great way to search for any string that uses double-quotes (`""`).

As we work through the example below, we'll cover each of the three kinds of checkers.

### Hello world

To get a better idea for how lint checks work, let's try writing a simple Pylint rule that forbids the string `Hello, world!`. (Maybe you don't want people accidentally opening PRs that include code that prints `Hello, world!` from when they were learning Python.) First consider what kinds of code should violate the rule. Here are some examples:

```python
s = "Hello, world!";

print("Hello, world!");
```

Below, we'll walk through how to implement this rule using each of the three kinds of checkers in Pylint: AST checkers, token checkers, and raw checkers.

#### Write the AST checker

##### Examine the AST

To develop an intuition for how ASTs work, let's see how astroid represents some example `Hello, world!` code. Open a Python shell in your Oppia directory and import a file from Oppia like the pre-commit linter:

```python
>>> from scripts.linters import pre_commit_linter
```

This import will modify your PYTHONPATH to include Oppia's Python dependencies, so you can now import [astroid](https://pylint.pycqa.org/projects/astroid/), which Pylint uses to generate the AST:

```python
>>> import astroid
```

Next, use astroid to parse the code:

```python
>>> code = """
... s = "Hello, world!"
... print("Hello, world!")
... """
>>> print(astroid.parse(code).repr_tree())
Module(
   name='',
   doc=None,
   file='<?>',
   path=['<?>'],
   package=False,
   pure_python=True,
   future_imports=set(),
   body=[Assign(
         targets=[AssignName(name='s')],
         value=Const(value='Hello, world!')),
      Expr(value=Call(
            func=Name(name='print'),
            args=[Const(value='Hello, world!')],
            keywords=[]))])
```

Notice that every time a `"Hello, world!"` literal appears in the code, the AST has a [`Const`](https://pylint.pycqa.org/projects/astroid/en/latest/api/astroid.nodes.html#id163) node. You can look up all the available node types in the [astroid documentation](https://pylint.pycqa.org/projects/astroid/en/latest/api/astroid.nodes.html).

##### Design the AST checker

Suppose you decide to write a checker that raises an error whenever it finds a node that:

* is of type "Const"
* has value "Hello, world!"

After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). For our rule:

* False positives

  * Code that includes the string `Hello, world!` but doesn't print it will still raise errors. There probably aren't any good reasons to do this though.

* False negatives

  * Code that constructs the string `Hello, world!` instead of including it as a literal will pass the lint check. This is an acceptable imperfection though since developers are almost always going to use the string literal when writing a hello world program.

##### Write the AST checker code

Start by creating a checker class with some metadata in [`scripts/linters/pylint_extensions.py`](https://github.com/oppia/oppia/blob/develop/scripts/linters/pylint_extensions.py):

```python
class HelloWorldASTChecker(BaseChecker):
    __implements__ = IAstroidChecker

    name = 'hello-world-ast'
    priority = -1
    msgs = {
        'C9001': (
            'Uses a "Hello, world!" string.',
            'hello-world-ast',
            'No code should use "Hello, world!" statements.',
        ),
    }
```

Here's what we've defined so far:

* `__implements__`: Specifies what kind of checker we are creating. In this case, we are creating an AST checker.
* `name`: Short name of the checker that will be used in the configuration file.
* `priority`: A negative integer indicating when the check should run. Checkers run from the most negative priority to most positive.
* `msgs`: A dictionary specifying what messages this checker can raise. The dictionary has the following form:

  ```python
  {
      'message id': ('displayed message', 'message symbol', 'message help')
  }
  ```

  * **Message ID**: A unique identifier for the message consisting of a letter followed by 4 digits. The letter is the first letter of the message category, and the available categories are Convention, Warning, Error, Fatal, and Refactoring. The first two digits must be the same for all of the checker's messages.

    Be sure to check that your message ID isn't already used by one of [Pylint's existing rules](https://pylint.pycqa.org/en/latest/technical_reference/features.html) or one of [our custom rules](https://github.com/oppia/oppia/blob/develop/scripts/linters/pylint_extensions.py).

  * **Displayed message**: The error message displayed to the user.

  * **Message symbol**: An alias of the message ID that can be used wherever the ID can be used. For example, to disable a message with a code comment, you can specify either the ID or the symbol.

  * **Message help**: Displayed when the user runs `pylint --help-msg` to learn more about your message.

You can also set an `options` variable to specify options that can be set by the configuration file and read by your rule. See the [Pylint documentation](https://pylint.pycqa.org/en/latest/how_tos/custom_checkers.html#writing-an-ast-checker) for details.

Next, you need to write the checker code. Pylint lets you define methods that get called as it traverses the AST. Methods of the form `visit_nodename()` are called upon entering the subtree rooted at a node of type `nodename`, and methods of the form `leave_nodename()` are called upon leaving the subtree. Each method is called with a single argument--the node of type `nodename`.

For this case, you can define a method `visit_const()` that will be called whenever Pylint reaches a `Const` node. Then you can check the node's value to see if it equals `"Hello, world!"`. The method looks like this:

```python
def visit_const(self, node):
    if node.value == "Hello, world!":
        self.add_message(
            'hello-world', node=node)
```

That's it! After working through the two other kinds of linters, we'll discuss how to run and test your checker.

If you want an alternative tutorial for writing an AST checker, take a look at [the one Pylint provides](https://pylint.pycqa.org/en/latest/how_tos/custom_checkers.html#writing-an-ast-checker).

#### Write the token checker

##### Examine the tokens

Fill a file `test.py` with the following code:

```python
s = "Hello, world!"
print("Hello, world!")
```

Then convert the file code into tokens:

```console
$ python -m tokenize test.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,1:            NAME           's'
1,2-1,3:            OP             '='
1,4-1,19:           STRING         '"Hello, world!"'
1,19-1,20:          NEWLINE        '\n'
2,0-2,5:            NAME           'print'
2,5-2,6:            OP             '('
2,6-2,21:           STRING         '"Hello, world!"'
2,21-2,22:          OP             ')'
2,22-2,23:          NEWLINE        '\n'
3,0-3,0:            ENDMARKER      ''
```

Notice that our `Hello, world!` strings are each their own tokens of type `STRING`.

##### Design the token checker

We could write a checker that raises an error whenever it finds a token that:

* is of type "STRING"
* has value "Hello, world!", ignoring quotation marks.

After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). Our rule suffers from the [same problems as the AST checker](#design-the-ast-checker).

##### Write the token checker code

Start by creating a checker class with some metadata in [`scripts/linters/pylint_extensions.py`](https://github.com/oppia/oppia/blob/develop/scripts/linters/pylint_extensions.py):

```python
class HelloWorldTokenChecker(checkers.BaseChecker):
    __implements__ = interfaces.ITokenChecker

    name = 'hello-world-token'
    priority = -1
    msgs = {
        'C9002': (
            'Uses a "Hello, world!" string.',
            'hello-world-token',
            'No code should use "Hello, world!" statements.',
        ),
    }
```

To learn about these metadata options, see the [section above on the AST checker code](#write-the-ast-checker-code). Notice that unlike the AST checker, we set `__implements__` equal to `interfaces.ITokenChecker`, and we used a new message ID, name, and symbol.

For the token checker, you only need to implement a `process_tokens()` method that accepts a list of tokens. Each token is described by a named 5-tuple as [documented in the tokenize library](https://docs.python.org/3/library/tokenize.html#tokenize.tokenize). The tuple takes the form `(type, string, start, end, line)`:

* `type`: The token type, for example `tokenize.STRING`.
* `string`: A string with the code that the token represents.
* `start`: A 2-tuple (row, col) of the row and column where the token starts in the code.
* `end`: A 2-tuple (row, col) of the row and column where the token ends in the code.
* `line`: A string with the line of code where the token was found.

You can use `tokenize.tokenize()` to see what these tuples look like for our `test.py` code:

```python
>>> import pprint, tokenize
>>> f = open('test.py', 'rb')  # Note that tokenize expects the file to be opened in binary mode
>>> tokens = list(tokenize.tokenize(f.readline))
>>> pprint.pprint(tokens)
[TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=1 (NAME), string='s', start=(1, 0), end=(1, 1), line='s = "Hello, world!"\n'),
 TokenInfo(type=53 (OP), string='=', start=(1, 2), end=(1, 3), line='s = "Hello, world!"\n'),
 TokenInfo(type=3 (STRING), string='"Hello, world!"', start=(1, 4), end=(1, 19), line='s = "Hello, world!"\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 19), end=(1, 20), line='s = "Hello, world!"\n'),
 TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print("Hello, world!")\n'),
 TokenInfo(type=53 (OP), string='(', start=(2, 5), end=(2, 6), line='print("Hello, world!")\n'),
 TokenInfo(type=3 (STRING), string='"Hello, world!"', start=(2, 6), end=(2, 21), line='print("Hello, world!")\n'),
 TokenInfo(type=53 (OP), string=')', start=(2, 21), end=(2, 22), line='print("Hello, world!")\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 22), end=(2, 23), line='print("Hello, world!")\n'),
 TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')]
```

Based on this output, we can see that we need to check that the token type is `tokenize.STRING` and that the token string is `Hello, world!` after we strip any quotes. Here's a `process_tokens()` method that implements this check:

```python
def process_tokens(self, tokens):
    for token in tokens:
        if token.type == tokenize.STRING:
            quotes_stripped = token.string.strip('"').strip('\'')
            if quotes_stripped == 'Hello, world!':
                self.add_message('hello-world-token', line=token.start[0])
```

#### Write the raw checker

When writing a raw checker, we don't have to worry about any pre-processing by Pylint. However, that also makes these checkers slower than AST or token checkers. Therefore, **you should avoid using raw checkers wherever possible**. If you do need to use a raw checker, you should get approval from the [linter team](https://github.com/oppia/oppia/wiki/Lint-Checks#contact) first.

Pylint provides the code file as a node, which you can read using our `read_from_node()` function. This function returns an iterable of lines from the file.

##### Design the raw checker

We could write a checker that raises an error whenever it finds a line that contains one of the following substrings:

* `"Hello, world!"`
* `'Hello, world!'`

After designing any rule, it's important to consider what benign code it will raise errors on (false positives) and what bad code it will miss (false negatives). Our rule suffers from the [same problems as the AST checker](#design-the-ast-checker).

##### Write the raw checker code

Start by creating a checker class with some metadata in [`scripts/linters/pylint_extensions.py`](https://github.com/oppia/oppia/blob/develop/scripts/linters/pylint_extensions.py):

```python
class HelloWorldRawChecker(checkers.BaseChecker):
    __implements__ = interfaces.IRawChecker

    name = 'hello-world-raw'
    priority = -1
    msgs = {
        'C9003': (
            'Uses a "Hello, world!" string.',
            'hello-world-raw',
            'No code should use "Hello, world!" statements.',
        ),
    }
```

To learn about these metadata options, see the [section above on the AST checker code](#write-the-ast-checker-code). Notice that unlike the AST checker, we set `__implements__` equal to `interfaces.IRawChecker`, and we used a new message ID, name, and symbol.

Now we need to implement the `process_module()` method, which accepts the node with the file contents as an argument. Then we iterate over the lines returned by `read_from_node()` and look for the substrings we identified in our design:

```python
def process_module(self, node):
    file_content = read_from_node(node)
    for (line_num, line) in enumerate(file_content):
        if '"Hello, world!"' in line or '\'Hello, world!\'' in line:
            self.add_message('hello-world-raw', line=line_num)
```

#### Register the checker

At the bottom of `pylint_extensions.py`, you need to register your new checker by calling `linter.register_checker()`. For example, here's how to register the three checkers we wrote for hello world strings:

```python
def register(linter):
    ...
    linter.register_checker(HelloWorldASTChecker(linter))
    linter.register_checker(HelloWorldTokenChecker(linter))
    linter.register_checker(HelloWorldRawChecker(linter))
```

### Beyond hello world

The example of looking for `Hello, world!` strings was fairly simple because we only needed information from one small part of the code at a time to identify lint violations. For more complicated lint rules, you'll need to think about how to keep track of information from many different places in the code. For example, the [Pylint documentation](https://pylint.pycqa.org/en/latest/how_tos/custom_checkers.html#writing-an-ast-checker) describes how to write an AST checker to ensure that all the return statements in a function that return a constant return different constants. Here is some pseudocode for how you can do this:

```text
class UniqueReturnChecker(BaseChecker):
    ...
    def __init__(self, linter=None):
        create a stack to store function nodes

    def visit_functiondef(self, node):
        push node onto the stack

    def leave_functiondef(self, _):
        pop a node off the stack

    def visit_return(self, node):
        if node is not of type Const: return

        for other_return in function_stack[-1]:
            if node.value == other_return's node's value:
                report a lint violation
```

Note that we have to use a stack to track function nodes to handle nested functions! For more examples of trickier lint checkers, take a look at our existing linters in `pylint_extensions.py`.

## Test custom Pylint checks

We test our checkers with [backend tests](Backend-tests.md) in `pylint_extensions_test.py`.

### Design tests

You should write test cases to cover the following situations:

* True positives: Bad code that you expect to result in lint errors.
* True negatives: Good code that you don't expect to result in lint errors.
* False positives: Good code that you expect to result in lint errors because you can't write a perfect linter.
* False negatives: Bad code that you don't expect to result in lint errors because the linter is imperfect.

### Pylint testing utilities

While tests for the checkers are just normal backend tests, Pylint provides some helpful testing utilities that you can use, particularly [`pylint.testutils.CheckerTestCase`](https://pylint.pycqa.org/en/latest/how_tos/custom_checkers.html#testing-a-checker). Pylint's documentation shows how to create a test class that inherits from the `CheckerTestCase` class, but at Oppia we prefer to inherit from `unittest.TestCase`. We store the `CheckerTestCase` object as an instance variable instead, so your `setUp()` methods will look like this for your checker `MyChecker`:

```python
def setUp(self):
    super().setUp()
    self.checker_test_object = testutils.CheckerTestCase()
    self.checker_test_object.CHECKER_CLASS = pylint_extensions.MyChecker
    self.checker_test_object.setup_method()
```

### Use testing utilities for hello world checkers

We'll walk through how to use the `checker_test_object` for each kind of checker below. Note that these examples don't show comprehensive test cases--see [above](#design-tests) for suggestions for how you can develop test cases.

#### Tests for hello world AST checker

To test our AST checker, we begin by using [`astroid.extract_node()`](https://pylint.pycqa.org/projects/astroid/en/latest/inference.html#crash-course-into-astroid-s-inference) to get some nodes that we can pass to our checker. We pass a string with code to `astroid.extract_node()`, annotating our code with `#@` to indicate which lines we want to get nodes for:

```python
def test_finds_hello_world(self):
    assignment_expr, func_call_expr = astroid.extract_node(
        """
    s = "Hello, world!" #@
    print("Hello, world!") #@
    """)

    assignment_node = assignment_expr.value
    func_call_node = func_call_expr.args[0]

    with self.checker_test_object.assertAddsMessages(
        testutils.Message(
            msg_id='hello-world-ast',
            node=assignment_node,
        )
    ):
        self.checker_test_object.checker.visit_const(
            assignment_node)

    with self.checker_test_object.assertAddsMessages(
        testutils.Message(
            msg_id='hello-world-ast',
            node=func_call_node,
        )
    ):
        self.checker_test_object.checker.visit_const(
            func_call_node)
```

Notice that because astroid finds the node representing the _line_ we annotated with `#@`, we have to retrieve the nodes that represent the `Hello, world!` literals.

#### Tests for hello world token checker

To test our token checker, we write the code we want to lint to a temporary file. Then we can convert that file into tokens that our checker can understand:

```python
def test_finds_hello_world_assignment(self):
    node = astroid.scoped_nodes.Module(
        name='test',
        doc='Custom test')
    temp_file = tempfile.NamedTemporaryFile()
    filename = temp_file.name
    with python_utils.open_file(filename, 'w') as tmp:
        tmp.write('s = "Hello, world!"')
    node.file = filename
    node.path = filename

    self.checker_test_object.checker.process_tokens(
        utils.tokenize_module(node))

    message = testutils.Message(
        msg_id='hello-world-token', line=1)

    with self.checker_test_object.assertAddsMessages(message):
        temp_file.close()

def test_finds_hello_world_func_call(self):
    node = astroid.scoped_nodes.Module(
        name='test',
        doc='Custom test')
    temp_file = tempfile.NamedTemporaryFile()
    filename = temp_file.name
    with python_utils.open_file(filename, 'w') as tmp:
        tmp.write('print("Hello, world!")')
    node.file = filename
    node.path = filename

    self.checker_test_object.checker.process_tokens(
        utils.tokenize_module(node))

    message = testutils.Message(
        msg_id='hello-world-token', line=1)

    with self.checker_test_object.assertAddsMessages(message):
        temp_file.close()
```

#### Tests for hello world raw checker

Testing raw checkers is a lot like testing token checkers. The main difference is that we don't need to convert our temporary file into tokens. In the example below, we also show how you can test multiple lines at once:

```python
def test_finds_hello_world_assignment(self):
    node = astroid.scoped_nodes.Module(
        name='test',
        doc='Custom test')
    temp_file = tempfile.NamedTemporaryFile()
    filename = temp_file.name

    with python_utils.open_file(filename, 'w') as tmp:
        tmp.write("""
        s = "Hello, world!"
        print("Hello, world!")
        """)

    node.file = filename
    node.path = filename

    self.checker_test_object.checker.process_module(node)

    with self.checker_test_object.assertAddsMessages(
        testutils.Message(
            msg_id='hello-world-raw',
            line=1
        ),
        testutils.Message(
            msg_id='hello-world-raw',
            line=2
        ),
    ):
        temp_file.close()
```

### Run tests

Once you've added test cases, you can run them like this:

```console
python -m scripts.run_backend_tests --test_target=scripts.linters.pylint_extensions_test --verbose
```

The `--verbose` option will help with debugging by printing out more logging information.