Skip to content

Week 2

hackmd-github-sync-badge

In the previous lesson, we learned about the fundamentals of deep learning and data-driven systems. Now that we have a high-level overview, we will dive into examples of how to model, query, and process data using different paradigms.

Objectives

After completing this week, you should be able to:

  • Query and process data using multiple paradigms including graph processing, map-reduce, and SQL
  • Compare and contrast different data models including identifying prime use cases for different data models
  • Demonstrate how to represent data as tensors and apply tensor mathematical operations

Readings

  • Read chapters 2 and 3 in Designing Data-Intensive Applications
  • Read chapter 2 in Deep Learning with Python

Weekly Resources

Assignment 2

The dsc650/assignments/assignment02 folder contains skeleton code for this assignment. Provide the code to implement the functions in the run_assignment.py file. For this assignment, we will be working with the CSV data found in the data/external/tidynomicon folder. Specifically, we will be using with the measurements.csv, person.csv, site.csv, and visited.csv files.

Assignment 2.1

Complete the code in kvdb.py to implement a basic key-value database that saves its state to a pickle file. Use that code to create databases that store each of CSV files by key. The pickle files should be stored in the dsc650/assignments/assignment02/results/kvdb/ folder.

Input File Output File Key
measurements.csv measurements.pickle Composite key
person.csv people.pickle person_id
site.csv sites.pickle site_id
visited.csv visits.pickle Composite key

The measurements.csv and visited.csv have composite keys that use multiple columns. For measurements.csv those fields are visit_id, person_id, and quantity. For visited.csv those fields are visit_id and site_id. The following is an example of code that sets and gets the value using a composite key.

kvdb_path = 'visits.pickle'
kvdb = KVDB(kvdb_path)
key = (619, 'DR-1')
value = dict(
    visit_id=619,
    site_id='DR-1',
    visit_date='1927-02-08'
)
kvdb.set_value(key, value)
retrieved_value = kvdb.get_value(key)
# Retrieved should be the same as value

Assignment 2.2

Now we will create a simple document database using the tinydb library. TinyDB stores its data as a JSON file. For this assignment, you will store the TinyDB database in dsc650/assignments/assignment02/results/patient-info.json. You will store a document for each person in the database which should look like this.

{
      "person_id": "dyer",
      "personal_name": "William",
      "family_name": "Dyer",
      "visits": [
        {
          "visit_id": 619,
          "site_id": "DR-1",
          "visit_date": "1927-02-08",
          "site": {
            "site_id": "DR-1",
            "latitude": -49.85,
            "longitude": -128.57
          },
          "measurements": [
            {
              "visit_id": 619,
              "person_id": "dyer",
              "quantity": "rad",
              "reading": 9.82
            },
            {
              "visit_id": 619,
              "person_id": "dyer",
              "quantity": "sal",
              "reading": 0.13
            }
          ]
        },
        {
          "visit_id": 622,
          "site_id": "DR-1",
          "visit_date": "1927-02-10",
          "site": {
            "site_id": "DR-1",
            "latitude": -49.85,
            "longitude": -128.57
          },
          "measurements": [
            {
              "visit_id": 622,
              "person_id": "dyer",
              "quantity": "rad",
              "reading": 7.8
            },
            {
              "visit_id": 622,
              "person_id": "dyer",
              "quantity": "sal",
              "reading": 0.09
            }
          ]
        }
      ]
    }

The dsc650/assignments/assignment02/documentdb.py file contains code that should assist you in this task.

Assignment 2.3

Complete the code in dsc650/assignments/assignment02/objectdb to implement an object database using ZODB. You will store the database in dsc650/assignments/assignment02/results/patient-info.fs

Assignment 2.4

In this part, you will create a SQLite database which you will store in dsc650/assignments/assignment02/results/patient-info.db. The dsc650/assignments/assignment02/rdbms.py file should contain code to assist you in the creation of this database.

Assignment 2.5

Go to the Wikidata Query Service website and perform the following SPARQL query.

#Recent Events
SELECT ?event ?eventLabel ?date
WHERE
{
    # find events
    ?event wdt:P31/wdt:P279* wd:Q1190554.
    # with a point in time or start date
    OPTIONAL { ?event wdt:P585 ?date. }
    OPTIONAL { ?event wdt:P580 ?date. }
    # but at least one of those
    FILTER(BOUND(?date) && DATATYPE(?date) = xsd:dateTime).
    # not in the future, and not more than 31 days ago
    BIND(NOW() - ?date AS ?distance).
    FILTER(0 <= ?distance && ?distance < 31).
    # and get a label as well
    OPTIONAL {
        ?event rdfs:label ?eventLabel.
        FILTER(LANG(?eventLabel) = "en").
    }
}
# limit to 10 results so we don't timeout
LIMIT 10

Modify the query so that the column order is date, event, and eventLabel instead of event, eventLabel, and date. Download the results as a JSON file and copy the results todsc650/assignments/assignment02/results/wikidata-query.json`.

Submission Instructions

For this assignment, you will submit a zip archive containing the contents of the dsc650/assignments/assignment02/ directory. Use the naming convention of assignment02_LastnameFirstname.zip for the zip archive. You can create this archive in Bash (or a similar Unix shell) using the following commands.

cd dsc650/assignments
zip -r assignment02_DoeJane.zip assignment02

Likewise, you can create a zip archive using Windows PowerShell with the following command.

Compress-Archive -Path assignment02 -DestinationPath 'assignment02_DoeJane.zip

When decompressed, the output should have the following directory structure.

├── assignment02
│   ├── __init__.py
│   ├── documentdb.py
│   ├── kvdb.py
│   ├── objectdb.py
│   ├── rdbms.py
│   ├── results
│   │   ├── kvdb
│   │   │   ├── measurements.pickle
│   │   │   ├── people.pickle
│   │   │   ├── sites.pickle
│   │   │   └── visits.pickle
│   │   ├── patient-info.db
│   │   ├── patient-info.fs
│   │   ├── patient-info.json
│   │   └── wikidata-query.json
│   ├── run_assignment.py
│   └── util.py

Discussion

For this discussion, write a 250 to 750-word discussion board post about use cases from different data models. As an example, how could you use a graph database in one of your professional or personal projects? Try to focus on a use case relevant to your professional or personal interests. Use the DSC 650 Slack channel for discussion and replies. For grading purposes, copy and paste your initial post and at least two replies to the Blackboard discussion board.


Last update: March 12, 2023