Remove duplicated tests

Tightened the expected score values in the test_baseline tests, so the embeddings are now tested indirectly through the scores. Hence, we keep pretty much the same coverage.
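
To illustrate why a tight score check also covers the embeddings, here is a minimal sketch. It is not the actual test code: `embed`, `score`, and `check_score` are hypothetical stand-ins, and the letter-frequency embedding is an assumption made only to keep the example self-contained.

```python
import math


def embed(text):
    # Hypothetical stand-in for the real embedding model:
    # a unit-length letter-frequency vector over 'a'..'z'.
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def score(a, b):
    # Hypothetical score: cosine similarity of the two embeddings.
    return sum(x * y for x, y in zip(embed(a), embed(b)))


def check_score(actual, expected, abs_tol=1e-6):
    # Tight absolute tolerance: a small drift in the embeddings
    # shifts the score enough to fail this check.
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=abs_tol)


# The score computed from the embeddings matches the pinned value.
assert check_score(score("ab", "ac"), 0.5)
# A drifted score slips through a loose tolerance but not a tight one.
assert check_score(0.51, 0.5, abs_tol=0.05)
assert not check_score(0.51, 0.5)
```

With a loose tolerance, a change in the embeddings could go unnoticed; with the tight one, the score assertion fails, which is why separate embedding tests become redundant.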

Coverage decreased by 1% because some resources are not activated, but that's fine.

Now ~26 minutes and 17 GB RAM usage, much improved compared to 56 minutes and 28 GB RAM usage before.

