Skip to content

Engineering Computation II: Project 5


Hash Table and Array-Based Database


Prologue:

Design and Optimized Based on Goals

I traded space/memory for searching speed.
Used three arrays to store data.
One for the hash table, one, for the actors’ information (including their roles), and one for the movies’ information.

Because of limited time, I’m not able to make it run over 1000 lines.
Else the program will kill it.

To achieve my ideal solution:
Instead of storing those variables in heap memory, write all the data of the data structure into tab-separated values files corresponding to the structure (hash table, actor, and movie) as cache.
When a search function is being called, read those files.
The program needs to read it line by line, which might take a while, but it is better than storing all of them in heap memory, which uses a lot of RAM resources.


Requirements:

  • Search by Actor:
    • Structure: “actor”
      • Name
      • Year of birth
    • Structure: “role” (sub-structure of “actor”)
      • Roles in acted movies (character)
  • Search by Movie:
    • Structure: “movie”
      • Name
      • Year

Diagram for Visualization:


Code Structures:

  • Class “database”
    • Class “hash_table”
    • An array for “actor” pointers
    • An array for “movie” pointers
  • Structure “actor”
    • Structure “role”
  • Structure “movie”

Code File (Top-Level Explanation with Concepts Behind It):

Database (database):

The protected part is to isolate the internal program and the user, to avoid the user messing with the information inside the structure.

The movie_buffer and actor_buffer are arrays that will store the information corresponding to their structure type inside heap memory.

The search_actor_acted_movies function will display all the information the actor has, including their name, year of birth, and the roles in movies they played.

database.h
...
class database {
protected:
  hash_table hash_table_database;
  movie **movie_buffer;
  actor **actor_buffer;

  void search_actor_acted_movies(int actor_id);
...

The buffer_size is defaulted to 8 digits with 2 at the beginning because the maximum ID number digits in both types are 8, and slightly higher than 1E7. It can have 9 digits but I will leave it to default to save some memory. By the way, all the space inside the array will eventually be filled (leave room for flexibility).

Else is just for insert and search operations.

database.h
...
public:
  database(int buffer_size = 20000000);
  void insert_data(vector<string> &input_data, int data_type);
  void search_name(string name, int type);
};
...

Hash Table (hash_table):

Just a normal hash table, but instead of storing pointers of different structures with different types, it will store pointers of a structure called type_info, which only contains the name/title, and the ID of the actor or movie. Which is much easier to achieve than making it a template class and one hash table is enough for the whole process.

To handle collisions, as usual, use vector, if hashed index matches and there is a pointer already stored inside, the new coming one will be pushed back.

hash_table.h
...
class hash_table {
protected:
  struct type_info {
    string name;
    int id;

    type_info(string name, int id);
    void display_info();
  };

  int array_len;
  vector<type_info *> *buffer;

  int hash_function(string name);

public:
  hash_table(int N = 114514);
  void insert_buffer_ht(string new_name, int id);
  int search_info(string name, bool display);
};
...

The unsigned char is a way to resolve the problem encountered when transforming Unicode to ASCII in int (decimal). The multiplication factor at the back is just the weight which works very well in the last project.

hash_table.cpp
...
int hash_table::hash_function(string name) {
  int index = 0;
  int name_len = name.length();
  int s = 0;
  for (int i = 0; i < name_len; i++) {
    s += (unsigned char)(name[i]) * (i + 19);
  }
  index = s % array_len;
  return index;
}
...

Actor (actor & role):

When considering whether the role structure should be stored inside either the actor or movie structure, the search requirement determine if we store it inside the actor structure, it will be less path and more efficient than store it inside the movie structure.

actor.h
struct actor : public hash_table {
  // === inputs from starring_roles.tsv file ===
  struct role {
    int tconst;
    string ordering;
    string category;
    string job;
    string characters;

    // constructor
    role(int movie_id);
    void display_roles();
  };
  // ====================================

  // === inputs from names.tsv file ===
  // name id
  int nconst;

  string primaryName;
  string birthYear;
  string deathYear;
  string primaryProfession;

  // vector knownForTitles contains movie ids
  vector<int> knownForTitles;
  // ====================================

  vector<role *> actor_roles;

  // constructor
  actor(vector<string> &actor_info);
  // methods
  void display_actor();
  void display_actor_roles();
  void insert_role(vector<string> role_info);
};

To be able to store the role, we need to create a placeholder when the actor structure is being created (when reading data from the names.tsv file), and the amount is determined by the number of movie IDs.

actor.cpp
actor::actor(vector<string> &actor_info) {
  nconst = extract_id(actor_info.at(0));

  primaryName = actor_info.at(1);
  birthYear = actor_info.at(2);
  deathYear = actor_info.at(3);
  if_not_applicable(deathYear, 1);
  primaryProfession = actor_info.at(4);

  get_id_int(actor_info.at(5), knownForTitles);

  // constructor for "role" structure
  for (int i = 0; i < knownForTitles.size(); i++) {
    role *new_role = new role(knownForTitles.at(i));
    actor_roles.push_back(new_role);
  }
}

// placeholders
actor::role::role(int movie_id) {
  tconst = movie_id;
  ordering = "EMPTY";
  category = "EMPTY";
  job = "EMPTY";
  characters = "EMPTY";
}

This block of code will be called when reading data from the starring_roles.tsv file. It stores the information which has a matching movie ID in which the actor has played before.

actor.cpp
// insert role information
void actor::insert_role(vector<string> role_info) {
  int movie_id = extract_id(role_info.at(0));
  // if the movie id is found, insert role information;
  // else ignore (leave it alone)
  role *curr_role;
  for (int i = 0; i < actor_roles.size(); i++) {
    curr_role = actor_roles.at(i);
    if (curr_role->tconst == movie_id) {
      curr_role->ordering = role_info.at(1);
      curr_role->category = role_info.at(3);
      curr_role->job = role_info.at(4);
      curr_role->characters = role_info.at(5);
    }
  }
}

Movie (movie):

Nothing special, just for storing data.

If there’s an extra requirement where we need to find actors by movie’s name, we can create a vector of integer which stores the actor’s ID and when the role structure is being created, it will automatically send it to the movie structure with the corresponding movie ID. And create a function inside the database class for searching (from movie structure to the actor and role structure).

movie.h
struct movie : public hash_table {
  // === inputs from moives.tsv file ===
  int tconst;

  string titleType;
  string primaryTitle;
  string originalTitle;
  string isAdult;
  string startYear;
  string endYear;

  string runtimeMinutes;
  string genres;
  // ====================================

  // variable for display purpose
  string year;

  // constructor
  movie(vector<string> &movie_info);
  // methods
  void insert_movie_buffer(string new_name, int id);
  void display_movie();
};

Other Files (general_functions):

It is for splitting strings, transforming a type to another, and replacing characters for the inputs.


Epilogue:

The type_info is the key of this design; it contains the information which makes the database able to access both actor and movie buffer without running into a complex.

The choice of using only array instead of other data structure is because it is unnecessary to make it too complicated. Assume the density of the data will insert is high (spaces between IDs are small), the space complexity of using array will be about the same as the binary search tree or linked list. But it is not flexible if there is more data (2 times or more) needs to be insert, which is not a concern in this project.


Complete Code:

proj_5.cpp
#include "database.h"
#include "general_functions.h"

#include <fstream>
#include <iomanip>
#include <iostream>

#include <string>
#include <vector>

#define file_1 "movies.tsv"
#define file_2 "names.tsv"
#define file_3 "starring_roles.tsv"

// limit lines will be read to protect memory
#define load_line 1000

using namespace std;

void load_file();
void load_function(string filename);
void search_actor_acted_movies(int id);
void search_name(string name, int type);

database movie_actor_db;

int main() {
  bool search;
  string input;
  vector<string> disp_elem = {"Movie", "Actor"};

  load_file();

  // input to terminal
  for (int i = 0; i < disp_elem.size(); i++) {
    search = true;
    while (search) {
      cout << "Enter " + disp_elem.at(i) + "'s Name (EMPTY = EXIT): ";
      getline(cin, input);
      if (input.length() == 0) {
        search = false;
        cout << "EXITED " << endl << endl;
      } else {
        // first input:
        //   for inputting the name as string
        // second input:
        //   to determine whether search by
        //   movie's name or actor's name
        //   -> 0: movie
        //   -> 1: actor
        movie_actor_db.search_name(input, i);
      }
    }
  }
}

// nothing special
void load_file() {
  vector<string> filename_buffer = {file_1, file_2, file_3};
  int file_cnt = filename_buffer.size();
  for (int i = 0; i < file_cnt; i++) {
    load_function(filename_buffer.at(i));
  }
}

// nothing special
void load_function(string filename) {
  ifstream infile(filename);
  string line;
  bool is_first_line = true;
  string name;
  int id;
  int count = load_line + 1;

  while (getline(infile, line)) {
    vector<string> token_vector;

    split(line, '\t', token_vector);

    if (is_first_line) {
      is_first_line = false;
    } else {
      if (filename == file_1) {
        movie_actor_db.insert_data(token_vector, 0);
      } else if (filename == file_2) {
        movie_actor_db.insert_data(token_vector, 1);
      } else if (filename == file_3) {
        // "starring_roles" needs to be loaded after
        // the "names" file since its data will be stored
        // inside a structure called "role",
        // and it is a sub-structure of the "actor" structure
        // data from "names" file
        movie_actor_db.insert_data(token_vector, 2);
      }
    }
    count--;
    if (count == 0) {
      // limit lines will be read to protect memory
      break;
    }
  }
}
actor.cpp
#include "actor.h"
#include "general_functions.h"

#include <iostream>

using namespace std;

actor::actor(vector<string> &actor_info) {
  nconst = extract_id(actor_info.at(0));

  primaryName = actor_info.at(1);
  birthYear = actor_info.at(2);
  deathYear = actor_info.at(3);
  if_not_applicable(deathYear, 1);
  primaryProfession = actor_info.at(4);

  get_id_int(actor_info.at(5), knownForTitles);

  // constructor for "role" structure
  for (int i = 0; i < knownForTitles.size(); i++) {
    role *new_role = new role(knownForTitles.at(i));
    actor_roles.push_back(new_role);
  }
}

// placeholders
actor::role::role(int movie_id) {
  tconst = movie_id;
  ordering = "EMPTY";
  category = "EMPTY";
  job = "EMPTY";
  characters = "EMPTY";
}

// insert role information
void actor::insert_role(vector<string> role_info) {
  int movie_id = extract_id(role_info.at(0));
  // if the movie id is found, insert role information;
  // else ignore (leave it alone)
  role *curr_role;
  for (int i = 0; i < actor_roles.size(); i++) {
    curr_role = actor_roles.at(i);
    if (curr_role->tconst == movie_id) {
      curr_role->ordering = role_info.at(1);
      curr_role->category = role_info.at(3);
      curr_role->job = role_info.at(4);
      curr_role->characters = role_info.at(5);
    }
  }
}

// for display purpose
void actor::display_actor() {
  vector<string> elem = {"Name", "Birth Year", "Death Year"};
  vector<string> info = {primaryName, birthYear, deathYear};
  print_info(elem, info);
}

// for display purpose
void actor::role::display_roles() {
  vector<string> elem = {"Movie ID", "Category", "Job", "Characters"};
  vector<string> info = {to_string(tconst), category, job, characters};
  print_info(elem, info);
}
actor.h
#ifndef _actor_h
#define _actor_h

#include "hash_table.h"

#include <string>
#include <vector>

using namespace std;

struct actor : public hash_table {
  // === inputs from starring_roles.tsv file ===
  struct role {
    int tconst;
    string ordering;
    string category;
    string job;
    string characters;

    // constructor
    role(int movie_id);
    void display_roles();
  };
  // ====================================

  // === inputs from names.tsv file ===
  // name id
  int nconst;

  string primaryName;
  string birthYear;
  string deathYear;
  string primaryProfession;

  // vector knownForTitles contains movie ids
  vector<int> knownForTitles;
  // ====================================

  vector<role *> actor_roles;

  // constructor
  actor(vector<string> &actor_info);
  // methods
  void display_actor();
  void display_actor_roles();
  void insert_role(vector<string> role_info);
};

#endif
database.cpp
#include "database.h"
#include "actor.h"
#include "general_functions.h"
#include "hash_table.h"
#include "movie.h"

#include <fstream>
#include <iomanip>
#include <iostream>

#include <string>
#include <vector>

database::database(int name_buffer_size) {
  hash_table hash_table_database;
  movie_buffer = new movie *[name_buffer_size];
  actor_buffer = new actor *[name_buffer_size];
}

void database::insert_data(vector<string> &input_data, int data_type) {
  int id;
  string name;

  switch (data_type) {
  case 0: {
    movie *movie_struct = new movie(input_data);
    id = movie_struct->tconst;
    movie_buffer[id] = movie_struct;

    // insert to the hash table for movies' informations
    name = movie_struct->primaryTitle;
    hash_table_database.insert_buffer_ht(name, id);
    break;
  }
  case 1: {
    actor *actor_struct = new actor(input_data);
    id = actor_struct->nconst;
    actor_buffer[id] = actor_struct;

    // insert to the hash table for actors' informations
    name = actor_struct->primaryName;
    hash_table_database.insert_buffer_ht(name, id);
    break;
  }
  case 2: {
    int actor_id = extract_id(input_data.at(2));
    actor *to_actor = actor_buffer[actor_id];
    if (to_actor != nullptr) {
      to_actor->insert_role(input_data);
      break;
    }
    break;
  }
  }
}

// for display purpose
void database::search_actor_acted_movies(int actor_id) {
  movie *movie_name;
  int movie_id;

  vector<actor::role *> to_roles = actor_buffer[actor_id]->actor_roles;

  // display actor's informations
  actor_buffer[actor_id]->display_actor();

  for (int i = 0; i < to_roles.size(); i++) {
    movie_id = to_roles.at(i)->tconst;
    movie_name = movie_buffer[movie_id];

    // to determine if the movie is there or not
    if (movie_name != nullptr) {
      cout << movie_name->primaryTitle;

      // for display purpose
      if (movie_name->year != "\t") {
        cout << " (" << movie_name->year << ")";
      } else {
        cout << " (Unknown Year)";
      }
    } else {
      cout << "MOVIE NOT FOUND";
    }
    cout << endl;

    // display all roles that the actor has played
    to_roles.at(i)->display_roles();
  }
}

// main search engine and available for user
void database::search_name(string name, int type) {
  string type_elem = "";
  int id;

  bool is_found = false;
  bool is_valid_hash_id = false;
  bool is_exist = false;
  bool is_match = false;

  id = hash_table_database.search_info(name, false);
  cout << "Searching For: " << name;

  // if the hash array at hashed index is empty
  // or the name does not match
  // it will return to a integer with -1
  is_valid_hash_id = id != -1;

  switch (type) {
  case 0: {
    type_elem = "MOVIE ";
    if (is_valid_hash_id) {
      movie *to_movie = movie_buffer[id];
      cout << " (Movie)" << endl;

      // readability
      // to make sure the ids in the input files
      // mean the same thing
      is_exist = to_movie != nullptr;
      is_match = to_movie->primaryTitle == name;

      if (is_exist && is_match) {
        to_movie->display_movie();
        is_found = true;
      }
    }
    break;
  }
  case 1: {
    type_elem = "ACTOR ";
    if (is_valid_hash_id) {
      actor *to_actor = actor_buffer[id];
      cout << " (Actor)" << endl;

      // readability
      // to make sure the ids in the input files
      // mean the same thing
      is_exist = to_actor != nullptr;
      is_match = to_actor->primaryName == name;

      if (is_exist && is_match) {
        search_actor_acted_movies(id);
        is_found = true;
      }
      break;
    }
  default:
    break;
  }
  }

  if (!is_found) {
    cout << endl << type_elem + "NOT FOUND" << endl << endl;
  }
}
database.h
#ifndef _database_h
#define _database_h

#include "actor.h"
#include "general_functions.h"
#include "hash_table.h"
#include "movie.h"

#include <fstream>
#include <iomanip>
#include <iostream>

#include <string>
#include <vector>

using namespace std;

// for security reasons
class database {
protected:
  hash_table hash_table_database;
  movie **movie_buffer;
  actor **actor_buffer;

  void search_actor_acted_movies(int actor_id);

public:
  database(int buffer_size = 20000000);
  void insert_data(vector<string> &input_data, int data_type);
  void search_name(string name, int type);
};

#endif
general_functions.cpp
#include "general_functions.h"

#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>

using namespace std;

// exclude the first two characters from the string
// either "nm" or "tt"
int extract_id(const string &id_str) {
  int id_int;
  id_int = stoi(id_str.substr(2, id_str.length()));
  return id_int;
}

// main purpose: for movies that will be stored inside
// the "actor" structure, and will be used for searching
void get_id_int(const string &id_str, vector<int> &id_vInt) {
  string tmp_str;
  int tmp_int;
  vector<string> tmp_vStr;

  split(id_str, ',', tmp_vStr);

  for (int i = 0; i < tmp_vStr.size(); i++) {
    // extract out the string from the vector
    tmp_str = tmp_vStr.at(i);
    tmp_int = extract_id(tmp_str);
    // return id in terms of a vector of integer
    id_vInt.push_back(tmp_int);
  }
}

// separate the string by the delimiter: tab; '\t'
void split(const string &line_str, char delimiter,
           vector<string> &token_vector) {
  stringstream line_ss;
  string token;

  // assigning string to the stringstream buffer
  if (line_str[0] == '[') {
    line_ss.str(line_str.substr(1, line_str.size() - 2));
  } else {
    line_ss.str(line_str);
  }

  while (getline(line_ss, token, delimiter)) {
    token_vector.push_back(token);
  }
}

// for display purpose
void print_info(vector<string> &element, vector<string> &info) {
  string longest_elem = "Searching For";
  int width = longest_elem.length();
  for (int i = 0; i < element.size(); i++) {
    cout << setfill(' ') << setw(width) << left;
    cout << element.at(i) << ": ";
    if (info.at(i)[0] == '[') {
      vector<string> role = split_character_name(info.at(i));
      for (int j = 0; j < role.size(); j++) {
        cout << role.at(j);
        if (j < role.size() - 1) {
          cout << ", ";
        }
      }
    } else {
      cout << info.at(i) << endl;
    }
  }
  cout << endl;
}

// for readability
void if_not_applicable(string &input, int state) {
  if (input == "\\N") {
    switch (state) {
    case 0:
      input = "\t";
      break;
    case 1:
      input = "Not Dead Yet";
      break;
    default:
      input = "Not Applicable";
      break;
    }
  }
}

// to remove the square bracket around the character string
// and separate the string by the delimiter: comma; ','
vector<string> split_character_name(const string chara_str) {
  string tmp_str;
  vector<string> tmp_vStr;

  split(chara_str, ',', tmp_vStr);
  for (int i = 0; i < tmp_vStr.size(); i++) {
    tmp_str = tmp_vStr.at(i);
    tmp_vStr.at(i) = tmp_str.substr(1, tmp_str.size() - 2);
  }
  return tmp_vStr;
}
general_functions.h
#ifndef _general_functions_h
#define _general_functions_h

#include <string>
#include <vector>

using namespace std;

int extract_id(const string &id_str);
void get_id_int(const string &id_str, vector<int> &id_vInt);
void split(const string &line_str, char delimiter,
           vector<string> &token_vector);
void print_info(vector<string> &element, vector<string> &info);
void if_not_applicable(string &input, int state);
vector<string> split_character_name(const string chara_str);

#endif
hash_table.cpp
#include "hash_table.h"

#include <iostream>
#include <vector>

hash_table::hash_table(int N) {
  array_len = N;
  buffer = new vector<type_info *>[array_len];
}

hash_table::type_info::type_info(string n, int i) {
  name = n;
  id = i;
}

int hash_table::hash_function(string name) {
  int index = 0;
  int name_len = name.length();
  int s = 0;
  for (int i = 0; i < name_len; i++) {
    // unsigned char to resolve the problem encountered
    // when Unicode to ASCII in int (decimal)
    s += (unsigned char)(name[i]) * (i + 19);
  }
  index = s % array_len;
  return index;
}

// instead of storing pointers,
// store a structure with information
// that can be redirected to the place
// in the array created outside
// and to match if the name is correct

// buffer contains vectors with
// the same hashed id but with
// different information,
// which name string is needed
void hash_table::insert_buffer_ht(string new_name, int id) {
  type_info *info = new type_info(new_name, id);
  int index = hash_function(new_name);
  buffer[index].push_back(info);
}

// return positive id if found and matched
// else return id with -1
int hash_table::search_info(string name, bool display) {
  int index = hash_function(name);
  int return_val = -1;
  bool is_found = false;
  vector<type_info *> vector_found;

  if (display) {
    cout << "Searching For: \"" << name << "\"" << endl;
  }

  if (!buffer[index].empty()) {
    vector_found = buffer[index];
    for (int i = 0; i < vector_found.size(); i++) {
      // matching name string
      if (vector_found.at(i)->name == name) {
        return_val = vector_found.at(i)->id;
        is_found = true;
        // for debug purpose
        if (display) {
          vector_found.at(i)->display_info();
        }
        break;
      }
    }
  }

  // for debug purpose
  if (display) {
    if (!is_found) {
      cout << "Name: \"" << name << "\" NOT FOUND " << endl;
    }
  }

  return return_val;
}

// for debug purpose
void hash_table::type_info::display_info() {
  cout << name << endl;
  cout << id << endl;
  cout << endl;
}
hash_table.h
#ifndef _hash_table_h
#define _hash_table_h

#include <string>
#include <vector>

using namespace std;

class hash_table {
protected:
  struct type_info {
    string name;
    int id;

    type_info(string name, int id);
    void display_info();
  };

  int array_len;
  vector<type_info *> *buffer;

  int hash_function(string name);

public:
  hash_table(int N = 114514);
  void insert_buffer_ht(string new_name, int id);
  int search_info(string name, bool display);
};

#endif
movie.cpp
#include "movie.h"
#include "general_functions.h"

#include <iostream>

using namespace std;

movie::movie(vector<string> &movie_info) {
  // movie id
  tconst = extract_id(movie_info.at(0));

  titleType = movie_info.at(1);
  primaryTitle = movie_info.at(2);
  originalTitle = movie_info.at(3);
  isAdult = movie_info.at(4);
  startYear = movie_info.at(5);
  endYear = movie_info.at(6);
  runtimeMinutes = movie_info.at(7);
  genres = movie_info.at(8);

  // for readability
  if_not_applicable(startYear, 0);
  if_not_applicable(endYear, 0);

  // for display purpose
  if (endYear == "\t") {
    year = startYear;
  } else if (startYear == "\t") {
    year = endYear;
  } else {
    year = startYear + '-' + endYear;
  }
}

// for display purpose
void movie::display_movie() {
  vector<string> elem = {"Title", "Year"};
  vector<string> info = {primaryTitle, year};
  print_info(elem, info);
}
movie.h
#ifndef _movie_h
#define _movie_h

#include "hash_table.h"

#include <string>
#include <vector>

using namespace std;

struct movie : public hash_table {
  // === inputs from moives.tsv file ===
  int tconst;

  string titleType;
  string primaryTitle;
  string originalTitle;
  string isAdult;
  string startYear;
  string endYear;

  string runtimeMinutes;
  string genres;
  // ====================================

  // variable for display purpose
  string year;

  // constructor
  movie(vector<string> &movie_info);
  // methods
  void insert_movie_buffer(string new_name, int id);
  void display_movie();
};

#endif
Makefile
all: proj_5

CXX = g++
CXXFLAGS += -std=c++11

SRCS = $(shell find . -name '.ccls-cache' -type d -prune -o -type f -name '*.cpp' -print | sed -e 's/ /\\ /g')
HEADERS = $(shell find . -name '.ccls-cache' -type d -prune -o -type f -name '*.h' -print)

proj_5: $(SRCS) $(HEADERS)
	$(CXX) $(CXXFLAGS) $(SRCS) -o "$@"

clean:
	rm -f proj_5
Published inCS Experiment

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *