Remove duplicated tests
The test_baseline tests check the score values tightly, so they indirectly test the embeddings.
Hence, coverage stays essentially the same.
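
A minimal sketch of why a tight score check also covers the embeddings. All names here (`embed`, `score`, `check_baseline`, `drifted_embed`) are hypothetical stand-ins, not the project's actual code, and the expected value is computed from the reference embedding rather than hardcoded as a real baseline test would do:

```python
import math

def embed(text):
    # Hypothetical stand-in for the real embedding model.
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(vowels), float(text.count(" "))]

def score(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def check_baseline(embed_fn, expected, tol=1e-6):
    # A tight tolerance means any change in the embeddings shifts the score
    # enough to fail this check.
    got = score(embed_fn("hello world"), embed_fn("hello there"))
    return math.isclose(got, expected, abs_tol=tol)

# Assumption: stands in for the stored baseline score.
EXPECTED = score(embed("hello world"), embed("hello there"))

def drifted_embed(text):
    # Simulates a regression in the embedding step.
    v = embed(text)
    v[1] += 0.5
    return v
```

With a loose tolerance the drifted embedding could slip through, which is why the tightness of the score check is what makes the separate embedding tests redundant.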
Coverage decreased by 1% because some resources are not activated, which is acceptable.
Runtime is now ~26 minutes with 17 GB of RAM, down from 56 minutes and 28 GB.